1 Introduction
The human brain is a complex network of functionally and structurally interconnected regions. Although each region has its own task and function, these regions continuously share information with one another, forming a complex integrative network: the brain network. To understand the organization of the human brain, one can study the underlying connectivity of different functional brain regions (functional connectivity), as well as the physical or structural connectivity in the brain.
Functional connectivity is primarily explored through resting-state functional magnetic resonance imaging (rfMRI or R-fMRI) and is typically analyzed in terms of correlation or spatial grouping based on temporal similarities [1]. These approaches are supported by the fact that during rest, in the absence of any explicit task, the spontaneous neuronal activity patterns of multiple brain regions, observed through changes in the blood-oxygen-level dependent (BOLD) signal (the rfMRI time-series), are not random and unstructured but, on the contrary, highly correlated. In other words, functional connectivity can be explored by measuring the level of synchronization of rfMRI time-series between anatomically separated brain regions. These approaches assume that similar patterns of activation reflect functional and neuronal communication between brain regions, regardless of the apparent physical connectedness of the regions. Functional networks generated using these approaches are also termed resting-state networks [2].
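In practice, this synchronization measure is usually the Pearson correlation between regional time-series. The following minimal sketch uses toy random data; the region count and threshold are illustrative choices, not values from the cited studies:

```python
# Minimal sketch of correlation-based functional connectivity.
# `ts` stands in for preprocessed rfMRI time-series (timepoints x regions);
# a real analysis would load these from imaging files instead.
import numpy as np

rng = np.random.default_rng(0)
ts = rng.standard_normal((200, 90))      # e.g., 200 volumes, 90 regions

fc = np.corrcoef(ts, rowvar=False)       # regions x regions Pearson correlations
np.fill_diagonal(fc, 0.0)                # ignore trivial self-correlations

threshold = 0.2                          # illustrative cutoff
pairs = np.argwhere(np.triu(fc > threshold, k=1))
print(f"{len(pairs)} region pairs exceed r = {threshold}")
```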
Since rfMRI relies on the assumption that spontaneous low-frequency BOLD fluctuations (0.01-0.1 Hz) are a measure of intrinsic activity in the brain, some researchers have questioned whether the fluctuations observed during the resting state could be artifacts of other bodily functions [3]. Although the true neuronal basis of these fluctuations is not yet fully understood, several lines of evidence support it. For instance, most resting-state connected activity tends to occur along structural networks in the brain [4], and information derived from rfMRI is associated with other measures of neuronal activity [5].
The first and most fundamental resting-state network is the so-called default mode network (DMN), first presented in a seminal rfMRI study by Biswal and colleagues [4] and later confirmed by a series of studies (e.g., [6], [7]). Unlike other brain networks that can be observed and identified by their activation during tasks, the DMN is a group of brain regions that is active during rest, in a baseline or default mode of the brain, and deactivates during a variety of cognitive tasks. These studies also suggest that brain networks which activate or deactivate together during tasks maintain their signature connectivity at rest. This means that neuroscientists can study the known functional brain networks of both healthy and abnormal brains without specially designed tasks, which young children, or patients unable to perform complex cognitive tasks or long experiments, may not be able to complete.
Other advantages of employing rfMRI [8] include the simplicity of the procedure, which may offer a better signal-to-noise ratio (SNR), and its relatively short acquisition time, which allows for increased sample sizes. Unlike task-based imaging, which typically extracts only one brain network of interest, rfMRI allows us to observe many brain networks at once (multi-purpose data sets [8]). With rfMRI, functional connectivity can also be used to examine hypothesized functional dysconnectivity effects in brain disorders and diseases such as Alzheimer's disease, amyotrophic lateral sclerosis, attention deficit hyperactivity disorder (ADHD), autism, epilepsy, Parkinson's disease, schizophrenia, multiple sclerosis, and obsessive-compulsive disorder (for a review, see [8], [9]). This information will be useful to clinicians for prognosis, diagnosis, and treatment. Clinical applications of rfMRI are, however, still at an early stage of development.
Although functional connectivity based on rfMRI can reveal interesting new findings about the functional connections of brain regions and networks, huge amounts of data are necessary to explore these complex networks. Recent advances in neuroimaging technologies, combined with the unique methodological approach of rfMRI, have ushered us into an era of “Biomedical Big Data”. The 1000 Functional Connectomes Project [10] and the Human Connectome Project [11], both neuroimaging databases, have each publicly released over 1,000 rfMRI data sets. Here we present the recent progress of existing shared rfMRI big data sets (Section 2). The increasing number of shared neuroimaging datasets has greatly increased the importance of developing data preprocessing pipelines and advanced analytic techniques better suited to large-scale rfMRI data.
Before applying any analytic technique to rfMRI data, several preprocessing steps are required in order to reduce various artifacts, align the data acquired at different points in time for an individual subject, establish correspondence between the brains of different subjects, and so on. While it is acknowledged in the literature that different methods and their order in the preprocessing pipeline can affect the results obtained from statistical group-difference tests and classification models (e.g., [12], [13], [14], [15]), most studies use their own specific pipeline, and no consensus has emerged regarding the optimal one. Here we present state-of-the-art rfMRI preprocessing pipelines, with a focus on software packages designed for large-scale rfMRI data analysis (Section 3).
After the rfMRI data have been preprocessed, several commonly used methods can examine functional connectivity, such as seed-based correlation analysis (SCA), cluster analysis, principal component analysis (PCA), independent component analysis (ICA), and graph theory (for a review, see [1], [16]). However, these traditional methods encounter limits in their descriptive power when faced with complex, high-dimensional datasets describing interactions between large numbers of elements, as is often the case in big data analysis. New tools to complement the exploration and analysis of such datasets are necessary. We therefore propose applying a set of novel methods rooted in algebraic topology, collectively referred to as Topological Data Analysis (TDA), to rfMRI functional connectivity, and we discuss their properties for big data analysis (Section 4).
2 Big rfMRI Data
Large shared rfMRI data sets are necessary to obtain new insights and findings about the large-scale organization of complex cognitive operations in the human brain. Some clinical and research questions cannot be answered using a single small data set, since each sub-population may exhibit features not shared by others; moreover, larger samples are generally preferable to compensate for the large inter-subject and intra-subject variability typical of rfMRI recordings. Big data sharing has many further advantages (Big Value), such as improving the reliability and reproducibility of research (i.e., increasing statistical power and reducing false-positive rates), improving research practices, maximizing the contribution of research subjects, backing up valuable data, and reducing the cost of research within the neuroimaging community [17].
Thanks to the unique methodological approach of rfMRI, the long-standing goal of acquiring large-scale functional neuroimaging data sets has increasingly been fulfilled over the last decade. Recent advances in neuroimaging technologies, as well as in data storage, management, and sharing systems, have also enabled the unrestricted sharing and open access of big neuroimaging data, including two projects with a special emphasis on rfMRI data: the 1000 Functional Connectomes Project [10] and the Human Connectome Project [11]. The recent progress of these two big data sharing projects is presented in this section.
2.1 The 1000 Functional Connectomes Project
The 1000 Functional Connectomes Project (FCP) was launched in 2009, gathering rfMRI data from over 1,300 subjects collected independently at 33 international institutes and centers [10]. All datasets are fully accessible upon successful registration at http://fcon_1000.projects.nitrc.org. All datasets are anonymous, and the demographic information provided is limited to age, gender, and handedness. No extensive preprocessing has been performed on any of the data sets. However, scripts for further preprocessing are provided as part of the project, covering motion correction, spatial filtering with 6 mm FWHM (full width at half maximum), and a 12 DOF (degrees of freedom) affine transformation to the MNI152 (Montreal Neurological Institute) stereotaxic space [10].
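These steps map onto standard tools. The following is a hedged sketch using Nipype's FSL wrappers, not the FCP scripts themselves (which are distributed with the project); it assumes FSL and Nipype are installed, and all file names are hypothetical:

```python
# Sketch of FCP-style preprocessing: motion correction, 6 mm FWHM smoothing,
# and a 12 DOF affine registration to MNI152 space.
from nipype.interfaces import fsl

# Rigid-body realignment of the time-series (motion correction).
fsl.MCFLIRT(in_file="sub01_rest.nii.gz", out_file="sub01_mc.nii.gz").run()

# Spatial smoothing with a 6 mm FWHM Gaussian kernel.
fsl.Smooth(in_file="sub01_mc.nii.gz", fwhm=6.0,
           smoothed_file="sub01_smooth.nii.gz").run()

# 12 DOF affine transformation to the MNI152 template.
fsl.FLIRT(in_file="sub01_smooth.nii.gz",
          reference=fsl.Info.standard_image("MNI152_T1_2mm_brain.nii.gz"),
          dof=12, out_file="sub01_mni.nii.gz").run()
```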
To demonstrate the feasibility of pooling rfMRI data from multiple sites, Biswal et al. [10] performed several functional connectivity analyses on 1,093 subjects from 24 sites using two common methods, SCA and ICA. The results show evidence of a universal functional architecture (i.e., consistent patterns of functional connectivity across data collection sites) as well as age- and sex-related differences in frequency-domain rfMRI measures. These findings confirm the usefulness of high-throughput rfMRI data. Consequently, data from this project have been used as a common test bed to evaluate new methods proposed in this field of research (e.g., [18], [19]).
This project served as the parent project for many large-scale datasets under the International Neuroimaging Data-Sharing Initiative (INDI): for instance, the Autism Brain Imaging Data Exchange (ABIDE), with 1,026 individuals with autism spectrum disorder (ASD) and 1,130 typical controls from 17 different sites [20]; ADHD-200, with 383 children and adolescents with ADHD and 491 controls from 8 sites [21]; and the Consortium for Reliability and Reproducibility (CoRR), with 1,652 subjects [22]. The ongoing phase of this project is to regularly release (e.g., weekly, monthly, or quarterly) prospective rfMRI data sets, such as the enhanced Nathan Kline Institute-Rockland Sample (NKI-RS), with a current total of 973 subjects [23].
All the FCP datasets are distributed using XNAT, the most widely used imaging informatics platform, developed by the Neuroinformatics Research Group [24]. To support cloud computing, the FCP data has recently been made available for download from an Amazon Simple Storage Service (S3) bucket. Further, data from both FCP and INDI have been preprocessed using different preprocessing pipelines and are openly shared under a new project, the Preprocessed Connectomes Project. Several limitations of the FCP have been acknowledged; for instance, the rfMRI data are pooled from previously collected data, so there was no prior coordination of data acquisition methods [10].
2.2 The Human Connectome Project
The Human Connectome Project (HCP) was launched in 2010, led by the WU-Minn HCP consortium [11], [25]. In the first phase of this project, methods for data acquisition and analysis were developed. The standardized imaging protocols and preprocessing pipelines [26] were then applied in the second phase, when data were acquired from a target number of 1,200 subjects at three different institutes. The subjects studied are healthy twins and their non-twin siblings, ages 22-35, from varying ethnic groups. All neuroimaging data and most of the behavioral data are accessible upon successful registration at www.humanconnectome.org. The neuroimaging data include not only rfMRI but also diffusion MRI (dMRI) with tractography analysis, task-evoked fMRI (tfMRI), and magnetoencephalography (MEG). Access to restricted data elements (family structure, i.e., twin or non-twin status, age, and handedness) requires acceptance of the HCP Restricted Data Use Terms.
The first subset of the target sample was released in March 2013. To date, the HCP has released the entire data set of 1,206 subjects, totaling more than 64 terabytes, via ConnectomeDB [27], a data management system based on XNAT. Similar to the FCP, the HCP data are also made available on Amazon S3, allowing users to process and analyze the data directly through Amazon Web Services (AWS), a cloud-based data processing service. Instead of downloading all the datasets, one can also order the data on eight 8-terabyte hard drives (the so-called Connectome in a Box). In addition, a set of software packages is provided as part of the project, including the HCP minimal preprocessing pipeline scripts [26].
This project currently serves as a baseline for many new large-scale data sharing projects, which build upon the HCP by using the same data acquisition and analysis protocols. For example, the Developing Human Connectome Project (dHCP) is a study of human brain connectivity from 20 to 44 weeks post-conceptional age; the Baby Connectome Project (BCP) covers children from birth through five years of age; and the Lifespan Human Connectome Project (L-HCP) covers different age groups across the lifespan (4-6, 8-9, 14-15, 25-35, 45-55, and 66-75 years) [28]. In addition to healthy subjects, more than ten projects are funded to study connectomes related to human disease.
2.3 Challenges of Big rfMRI Data
Data gathered for the FCP and the HCP exhibit several of the defining qualities of big data (the V's [29]). Although rfMRI data are not as big as some other forms of data (such as genome sequencing data), these shared large-scale datasets are big enough that a single computer cannot process them. In other words, rfMRI data exhibit Big Volume. Many methods for preprocessing rfMRI data and computing functional connectivity were designed when data sizes were modest; these approaches thus have difficulty handling large-scale data (e.g., PCA [30], [31]). Considering that the FCP and HCP data have only recently been released, and that only a few recent methods can handle large-scale rfMRI data, research based on these data is still young. Novel methods capable of analyzing such data should be developed, either by adapting traditional methods to parallel computing environments or by proposing new methods that work naturally in parallel or cloud computing environments.
Big Variety refers to the diversity of information within a single big rfMRI dataset (intra-dataset variety) or across multiple rfMRI datasets (inter-dataset variety). Big Variety also arises when rfMRI data are analyzed together with other neuroimaging and behavioral data. This is a critical aspect of Big Data research, since it is widely acknowledged that no single big data set should be treated as ground truth, and thus cross-validation across several imaging modalities is necessary. The HCP involves multiple imaging modalities (rfMRI, tfMRI, dMRI, MEG), allowing investigators to apply multimodal data integration techniques to improve the reliability and robustness of the results [32]. Big data sharing projects focused primarily on other MRI data types are the OpenfMRI project (focused primarily on tfMRI) and the Open Access Series of Imaging Studies (OASIS) project (which has shared structural MRI data from more than 500 subjects). The OpenfMRI project currently provides 2,158 subjects across 63 datasets. Furthermore, the HCP also provides different types of preprocessed fMRI data, ranging from unprocessed NIfTI images and minimally preprocessed NIfTI images to ICA-denoised rfMRI data and functional connectivity data. This increases the utility and flexibility of re-analyzing the data, as compared to coordinate-based data and statistical maps (which are typically included in most neuroimaging papers or available through data sharing projects such as BrainMap, Neurosynth, SumsDB, and NeuroVault).
Big Veracity refers to noisy, incomplete, inconsistent, or erroneous data. Although big data is very useful for detecting correlations, especially subtle ones that might be missed in smaller datasets, scientists are likely to find many statistically significant correlations whenever they examine a larger dataset, and should therefore be very careful about which correlations are meaningful. This is because, in large-scale datasets, large deviations are more often attributable to variance (noise) than to real information (signal). Specifically, non-neuronal fluctuations in rfMRI data can increase the apparent functional connectivity between brain regions (i.e., increase the chance of finding spurious correlations) by introducing spurious common variance across rfMRI time-series. Data preprocessing is thus necessary and is a crucial stage in Big Data research. Several preprocessing steps are progressively becoming accepted as standard in the analysis of rfMRI data, although the advanced techniques used in preprocessing pipelines often dramatically increase the computational burden; new software suites capable of preprocessing big data with such techniques should be developed. Data reduction is another crucial stage, especially when dealing with large-scale data sets with Big Veracity: discriminating relevant and meaningful features, by selection or extraction, from a whole set of features that potentially contains irrelevant, redundant, and noisy information. These tasks can also be done using Topological Data Analysis, which not only reduces the effect of such negative elements but also reduces the amount of storage space required.
Big Velocity can come from prospective rfMRI data in a research setting. It also occurs when data arrive and must be processed at high speed, as in real-time monitoring of a patient's condition in a clinical setting [33].
3 Big Data Preprocessing Pipelines
Before applying any rfMRI technique for investigating functional connectivity, several data preprocessing steps need to be performed to remove unwanted effects in the rfMRI data and to increase the possibility of observing neural effects. This large number of interconnected preprocessing steps is collectively referred to as a pipeline (or workflow). So far there is no agreement on what constitutes the optimal preprocessing pipeline, nor on how to select the best pipeline for a specific application. Most studies use their own specific pipeline, often defined by the experimenters' personal preference or by the defaults of the software package used; no consensus has thus been found across studies [12]. Further, it is widely acknowledged that different versions of preprocessing pipelines can affect the results obtained from statistical group-difference tests and classification models [13]. Three important characteristics vary from one rfMRI study to another: (1) which preprocessing steps are applied; (2) in what order; and (3) the parameter values involved in certain steps. Due to the large number of possible combinations (illustrated below), it is difficult to evaluate all of them on big rfMRI datasets, and there have been few systematic approaches assessing the effect of different preprocessing pipelines on rfMRI methods for studying functional connectivity, particularly in large-scale datasets [14], [15]. In this paper, we present three alternative ways to access big preprocessed rfMRI data: (1) the minimal preprocessing pipelines; (2) the Preprocessed Connectomes Project; and (3) software packages for big rfMRI data. Each has its own advantages and disadvantages depending on the type of analysis.
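To make the combinatorial difficulty concrete, even a handful of binary choices over steps, orders, and parameter values multiplies into a large space of candidate pipelines. The step names and values below are generic illustrations, not recommendations:

```python
# Counting candidate pipelines from a few independent choices.
from itertools import product

optional_steps = ["slice_timing", "global_signal_regression", "scrubbing"]
orders = ["motion_correction_first", "slice_timing_first"]
smoothing_fwhm_mm = [4, 6, 8]
bandpass_hz = [(0.01, 0.08), (0.01, 0.1)]

step_subsets = list(product([True, False], repeat=len(optional_steps)))
n_pipelines = (len(step_subsets) * len(orders)
               * len(smoothing_fwhm_mm) * len(bandpass_hz))
print(n_pipelines, "candidate pipelines from a handful of choices")  # 96
```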
3.1 Minimal Preprocessing Pipelines
Although unprocessed NIfTI (Neuroimaging Informatics Technology Initiative [34]) data are available through data sharing projects, these projects anticipate that investigators will prefer the preprocessed data produced by the minimal preprocessing pipelines developed by their team members. The principal goal of the minimal preprocessing pipelines is to provide rfMRI data with a minimum standard of data quality while minimizing the amount of information actually removed from the data. This minimally preprocessed data can be used as the starting point for any analysis, which is particularly advantageous for investigators who lack sufficient computational resources to preprocess large-scale datasets.
To obtain optimal results, however, it is important to apply further preprocessing steps, which depend on the rfMRI methods used and/or the characteristics of the data acquisition (when applying these pipelines to one's own data). Notable minimal preprocessing pipelines are those implemented in data sharing projects such as the HCP. Since the HCP minimal preprocessing pipelines [26] are designed specifically for the HCP data acquisition protocols, any study that wishes to use them must meet the corresponding minimum acquisition requirements. An interesting characteristic of the HCP acquisition system is the use of fast repetition time (TR) sampling based on multiband pulse sequences. With this approach, all slices in a volume are acquired very close together in time (compared to typical fMRI acquisition), and thus slice timing correction is not necessary (but still optional) in the HCP pipelines.
Specifically, the HCP minimal functional preprocessing pipeline consists of correction of gradient-nonlinearity-induced distortion, realignment of the time-series to correct for subject head motion, registration of the fMRI data to the structural data, reduction of the bias field, normalization of the 4D image to a global mean, masking of the data with the final brain mask, and spatial smoothing using a novel geodesic Gaussian surface smoothing algorithm with 2 mm FWHM [26]. Preprocessing steps that may remove significant amounts of information (e.g., temporal filtering, heavy spatial filtering, nuisance signal regression, and movement scrubbing) are not included in these pipelines. For instance, although high frequencies are commonly treated as nuisance signals [35], some studies suggest that important information is contained in high frequencies (0.1 to 0.5 Hz) [36]. Preprocessing steps that remain a topic of debate are therefore generally excluded from the minimal preprocessing pipelines. Interestingly, the HCP minimal preprocessing pipelines do include the field map distortion correction step, which in practice is often neglected.
For the FCP data, only three simple preprocessing steps have been performed: NIfTI format conversion, uniform orientation placement, and removal of the first 5 time points. These few steps may not be sufficient to meet the minimum data quality standard, and further preprocessing may be necessary. Besides the minimal preprocessing pipelines implemented in the data sharing projects, a few software packages provide minimal preprocessing pipelines as an option, such as SPM and C-PAC. Note that the full names of the software tools and packages can be found in Tables 1 and 2. The contributions of these tools and packages are presented throughout this section; more details about their principles, as well as their pros and cons, can be found in [37] and [38].
3.2 Preprocessed Connectomes Project
The principal goal of the Preprocessed Connectomes Project (PCP) is to provide systematically preprocessed rfMRI data from the FCP and INDI databases using different preprocessing pipelines, given that there is no consensus on the best pipeline in this research field. Offering different preprocessing choices allows investigators to compare results and may eventually reveal the best preprocessing strategies. Another motivation of this project is to broaden the range of investigators who can access large-scale rfMRI data. Each pipeline was implemented using chosen parameters and the default settings of commonly used preprocessing software. All the preprocessed data are available on the Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC) and on the Amazon S3 bucket.
Interestingly, the preprocessing steps implemented by the different common software suites are quite similar, although the specific algorithms and parameters used in each step may vary, as can be observed in Table 3. This is because most of them are developed by integrating several common brain imaging tools for functional and structural preprocessing. A list of general-purpose neuroimaging tools used by the preprocessing pipeline and functional connectivity software packages related to rfMRI analysis is presented in Table 1. For instance, CCS [49] builds upon three main available tools (AFNI, FSL, and FreeSurfer) together with in-house functions, while C-PAC [50] integrates functions from AFNI, FSL, and ANTS. Likewise, the general-purpose software tool SPM has been used as a basis for building many more specialized software suites, such as BrainVISA, CONN, cPPI, gPPI, SEM, SnPM, and TDT (for more details, see Table 2).
The first preprocessed data in this project came from the ADHD-200 dataset, preprocessed with three different pipelines: the Athena pipeline (using AFNI and FSL), the NIAK pipeline (using NIAK on CBRAIN), and the Burner pipeline (using SPM). The forthcoming release is preprocessed data from the CIVET pipeline [71]. It should be noted that CBRAIN is a web-based collaborative research platform that allows investigators to integrate large neuroimaging data resources, preprocessing and analysis software tools, and high-performance distributed computing facilities within a controlled, secure environment [72]. Other datasets from the FCP and INDI databases were added later, including the Beijing Enhanced Diffusion Tensor Imaging dataset, the Neurofeedback Skull-stripped repository, and ABIDE. The preprocessed ABIDE data were produced with four different software packages: CCS, C-PAC, DPARSF, and NIAK. Beyond the default settings of each software suite (Table 3), two preprocessing steps that remain a topic of debate, temporal filtering (0.01-0.1 Hz) and global signal regression, were each included and excluded, yielding four different preprocessing strategies for each pipeline. Further, statistical derivatives (e.g., regional homogeneity (ReHo) [73], the amplitude of low frequency fluctuations (ALFF) [74], and fractional ALFF (fALFF) [75]) were also calculated from each of the preprocessed data sets using the C-PAC software.
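As a worked example of two of these derivatives, ALFF and fALFF can be computed from a single voxel's time-series with a discrete Fourier transform: ALFF sums the amplitude spectrum over the low-frequency band, and fALFF normalizes that sum by the full-band amplitude sum [74], [75]. The sketch below uses toy data; the TR and band edges are typical values, not taken from the PCP settings:

```python
# Minimal ALFF / fALFF sketch for one voxel time-series.
import numpy as np

tr = 2.0                                    # sampling interval in seconds
ts = np.random.default_rng(1).standard_normal(240)   # toy series, 240 volumes

freqs = np.fft.rfftfreq(ts.size, d=tr)
amp = np.abs(np.fft.rfft(ts - ts.mean()))   # amplitude spectrum

band = (freqs >= 0.01) & (freqs <= 0.08)    # conventional low-frequency band
alff = amp[band].sum()
falff = alff / amp[1:].sum()                # exclude the DC term
print(f"ALFF = {alff:.2f}, fALFF = {falff:.3f}")
```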
3.3 Software Packages for Big rfMRI Data
Pipeline software designed to handle large-scale rfMRI data must offer a number of features: it should be configurable, robust, reliable, and extensible, and should support provenance tracking (e.g., [49], [66]). Currently, there is some progress toward parallelization of the three major neuroimaging software tools, SPM, FSL, and AFNI. By running them with an additional package (such as Condor) or platform (such as OpenMP), some functions can be executed in parallel on several central processing unit (CPU) cores or on several computers. However, in common neuroimaging tools (Table 1), parameters may need to be set manually, step by step and subject by subject, which is time-consuming and not suitable for big data analysis. Many preprocessing pipeline suites have therefore been developed to provide a user-friendly environment (Table 2). Unfortunately, only a few of them have been designed primarily to preprocess and analyze big data.
Parallel computing capacity may be the most important feature developers have paid attention to. For example, preprocessing a total of 418 subjects from the NKI-RS datasets with the CCS pipeline took approximately 15,000 CPU hours on a Dell blade cluster system [49]. Pipelines that can execute jobs in parallel on a multi-core machine or a supercomputer are thus needed, as they reduce the total time necessary to complete an analysis. C-PAC and PSOM are two common big data processing software packages. They link together functions from common neuroimaging tools into pipelines that, once properly configured, can execute in a single run on high-performance computing architectures. Bellec et al. [66] tested the performance of the PSOM framework on the ADHD-200 datasets and showed that the processing time for 198 subjects (a total data size of 7.7 gigabytes and 5,153 jobs in the NIAK pipeline) could be reduced from over a week to less than 3 hours with 200 computing cores.
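The pattern these packages exploit is that subject-level jobs are independent and can be farmed out to workers. Below is a toy sketch of that pattern, with `preprocess_subject` as a hypothetical stand-in for a real per-subject pipeline:

```python
# Subject-level parallelism: independent jobs distributed over worker processes.
from concurrent.futures import ProcessPoolExecutor

def preprocess_subject(subject_id: str) -> str:
    # ... motion correction, registration, filtering, etc. would run here ...
    return f"{subject_id}: done"

subjects = [f"sub-{i:03d}" for i in range(1, 199)]   # e.g., 198 subjects

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=8) as pool:
        for result in pool.map(preprocess_subject, subjects):
            print(result)
```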
PSOM also offers two other important features for handling big data: fault tolerance and smart updates. Specifically, PSOM runs each job multiple times before considering it failed, and all failed jobs can be automatically restarted after pipeline termination by the investigator. Furthermore, if an analysis must be restarted, only the parts of the pipeline that need to be reprocessed, or that are impacted by the changes, are executed; the toolbox detects these parts automatically. These two features are especially useful in the development phase (e.g., when selecting the optimal algorithms and parameters of the pipeline), since pipelines may need to be restarted multiple times at several stages. However, this framework does not focus on pipeline mapping; that key feature is obtained by interfacing PSOM pipelines to another software tool with powerful pipeline mapping capabilities, such as CBRAIN.
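Both mechanisms are simple to emulate. The toy sketch below, with entirely illustrative names, retries each job a fixed number of times (fault tolerance) and skips jobs whose inputs and parameters are unchanged, detected here by a content hash (smart updates):

```python
# Toy PSOM-style fault tolerance and smart updates.
import hashlib, json, os

def job_hash(params: dict) -> str:
    return hashlib.sha1(json.dumps(params, sort_keys=True).encode()).hexdigest()

def run_with_retries(job, params, marker_dir=".done", max_attempts=3):
    os.makedirs(marker_dir, exist_ok=True)
    marker = os.path.join(marker_dir, job_hash(params))
    if os.path.exists(marker):            # smart update: inputs unchanged, skip
        return "skipped"
    for attempt in range(max_attempts):   # fault tolerance: retry before failing
        try:
            job(**params)
            open(marker, "w").close()     # record successful completion
            return "ok"
        except Exception:
            if attempt == max_attempts - 1:
                raise
```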
Another interesting group of big data processing software exploits the advantages of parallel computing on inexpensive and powerful graphics processing units (GPUs). BROCCOLI [48] is one such package; it is written in OpenCL (Open Computing Language), which enables it to run analyses in parallel. To test the parallelization efficiency of BROCCOLI, Eklund et al. [48] ran several benchmark experiments on a number of open-access fMRI datasets with three different hardware configurations (an Intel CPU, an Nvidia GPU, and an AMD GPU). Comparing non-linear spatial normalization, for example, against the three major neuroimaging tools, BROCCOLI on an Nvidia GPU ran 525 times faster than FSL and AFNI and 195 times faster than AFNI with OpenMP [48]. These results clearly show that parallel processing of rfMRI data can lead to significantly faster analysis pipelines, which is very important for big data analysis. However, several limitations of this software suite are acknowledged: BROCCOLI does not provide a graphical user interface, and although it is implemented in OpenCL, it currently performs best on Nvidia GPUs, so code optimization for other hardware platforms (e.g., Intel and AMD) is necessary. Biananes [45] is another package in this group; it uses GPUs to compute the voxel-wise correlation/connectivity matrix over all in-brain voxels at the highest HCP resolution, and it provides a distributed file reader for 4D NIfTI fMRI data for use in an Apache Spark environment. By using a scalable platform [45], [76], [77], data analysis and computational tasks can be moved to cloud service providers, for example the AWS cloud, which can run the Spark framework with GPU-accelerated computation.
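The core of such voxel-wise computations is a single large matrix product over standardized time-series, which is what makes them GPU-friendly. Below is a minimal CPU sketch, processed in row blocks so the full voxel-by-voxel matrix never has to fit in memory at once; the sizes are toy values, whereas real HCP data has hundreds of thousands of in-brain voxels:

```python
# Blocked voxel-wise correlation: z-score each voxel time-series, then obtain
# all pairwise Pearson correlations as matrix products over row blocks.
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5000, 300))                  # voxels x timepoints
X = (X - X.mean(1, keepdims=True)) / X.std(1, keepdims=True)

block = 1000
for start in range(0, X.shape[0], block):
    rows = X[start:start + block] @ X.T / X.shape[1]  # one block of correlations
    # ... threshold, summarize, or stream `rows` to disk here ...
print("computed", X.shape[0] ** 2, "pairwise correlations in blocks")
```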
3.4 Challenges of Data Preprocessing Pipelines
The first two approaches can be used, in turn, as starting points for investigators who would like to perform functional connectivity analyses on big rfMRI data but lack sufficient computational resources to acquire or preprocess large-scale data, or who prefer to focus on data analysis rather than on data acquisition and preprocessing. As previously mentioned, the minimally preprocessed data provide a minimum standard of data quality while retaining most of the information in the data. If further preprocessing steps are necessary, the PCP data, prepared using the chosen parameters and default settings of several common preprocessing suites, is a safe bet for further analyses, as it represents peer-reviewed, accepted preprocessing implementations. Investigators can choose the pipeline most appropriate to their application, or even compare results across the different pipelines.
On the other hand, investigators with sufficient resources to preprocess large-scale rfMRI data can use one of the software packages designed for this purpose, as discussed in the third approach. They could also preprocess their own data using the minimal preprocessing pipelines and/or the default pipelines of common software packages as starting points. However, several modifications and additional steps may be required to adapt the pipeline to the unique characteristics of specific rfMRI data and of the functional connectivity analysis methods used. For example, the dHCP minimal preprocessing pipelines, developed from the HCP pipelines, modify several preprocessing steps in order to handle the low and variable contrast and high levels of head motion in neonatal acquisitions [78]. To obtain valid and optimal results for a specific application, a comprehensive investigation of the optimal preprocessing steps and parameter values is necessary.
There are several specific areas in which preprocessing pipelines need to be improved, and novel methods will continue to be developed. Since there is currently no way to identify the best preprocessing pipeline, one solution could be to adopt, via systematic review and meta-analysis, the preprocessing steps on which common software pipelines and/or high-quality peer-reviewed studies agree. Most recently, for instance, Caballero and Reynolds [79] suggested guidelines for choosing the preprocessing steps and their order. Specifically, the pipeline could start by despiking the fMRI data and then applying a block of operations comprising physiological noise correction, slice timing correction, volume registration, and correction of magnetic field distortions; the order within this block is still controversial, and they recommend integrating these four operations into a unified framework. Next, the subject's anatomical image could be aligned to the functional data. The final steps consist of spatial smoothing and a combination of nuisance regression, temporal filtering, and censoring (this ordering is summarized in the sketch below). Nuisance regressors can be defined either from anatomical masks or by data decomposition techniques such as PCA, kernel PCA, and ICA. An additional advantage of such data-driven approaches is that they can reduce multiple noise fluctuations simultaneously. However, it has been suggested that spatial ICA, for example, cannot completely separate physiological noise components, so denoising physiological noise based on external recordings is necessary prior to ICA decomposition. In future studies, more comprehensive investigations are needed to establish evidence-based recommendations and best practices for minimal and/or optimal preprocessing pipelines.
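Written out as a plain configuration, the suggested ordering looks as follows; the step labels are descriptive only and are not tied to any particular software package:

```python
# The pipeline order suggested by Caballero and Reynolds [79].
PIPELINE_ORDER = [
    "despike",
    # One block whose internal order is still debated; [79] recommend
    # integrating these four operations into a unified framework.
    ("physiological_noise_correction", "slice_timing_correction",
     "volume_registration", "distortion_correction"),
    "anatomical_to_functional_alignment",
    "spatial_smoothing",
    ("nuisance_regression", "temporal_filtering", "censoring"),
]
```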
Further, since it is acknowledged that data preprocessing pipelines can affect the final results obtained from statistical group-difference tests and classification models, and there have been very few systematic studies of these effects, a better understanding of whether, and which, preprocessing steps and parameters affect the results of any given analysis method is warranted. This is also essential for determining the optimal preprocessing pipeline. For example, Vergara et al. [14] evaluated the effect of several preprocessing pipelines on the detection of abnormal functional network connectivity and on patient-versus-control classification using group ICA methods. Four different pipelines were tested, with special emphasis on the effects of (1) the order of head motion correction (before or after group ICA) and (2) temporal filtering to remove relatively high-frequency content. Both experimental and simulated data were used; the real data comprised two cohorts, one of mild traumatic brain injury patients with controls and the other of smokers and non-smokers. The results show that the preprocessing pipeline can change the final results: if motion correction is applied before group ICA, patient-control group differences are larger and correlations with behavioral assessments are stronger.
Andronache et al. [15] evaluated the effect of several preprocessing pipelines on the detection of the DMN using the SCA and ICA methods. Five different pipelines were tested by adding preprocessing steps (e.g., removal of covariance with movement parameters, band-pass filtering) to a minimal pipeline (realignment, slice timing correction, normalization to MNI space, and spatial smoothing). Only real data were used, including patients with disorders of consciousness and their control counterparts. The results support those of Vergara et al. [14]: the preprocessing pipeline can change the final results. They also show that different functional connectivity methods (SCA and ICA) are affected differently by preprocessing pipelines. Although the effect is reduced when extensive preprocessing is applied, this may be because some meaningful variability in the data is removed, in which case the results may not be valid. The effect of preprocessing pipelines on other commonly used or novel analysis methods should be investigated in future studies.
4 rfMRI Techniques
Functional magnetic resonance imaging provides complex signals with which to study the highly variable and entangled activity of the brain. Being able to parse them and extract meaningful information is one of the great challenges of neuroimaging research. We can broadly identify two main types of analysis: the first focuses on identifying functionally independent brain regions, or functional subnetworks, usually associated with specific functions; the second focuses instead on the relations among the activities of sets of regions. Classic examples of the first approach are decomposition techniques, like the ICA and PCA already mentioned in previous sections. Here we focus on the second type. The most relevant examples are techniques that produce simplified topological representations (e.g., Mapper [80], [81]), graph-theoretic and network tools amenable to statistical mechanical treatment [82], and fully fledged topological data analysis tools, in particular persistent homology [83]. In the following, we briefly illustrate the merits of each and their relevance for big data analysis.
4.1 Mapper Algorithms and Data-Driven Methods
Mapper, first introduced by Singh et al. [81], is one of the most used topological tools (Table 4) for direct data exploration. Its fundamentally new character, shared with persistent homology, comes from its algebraic foundation: it recovers the shape of topological spaces at the mesoscopic scale by going beyond standard measures defined on pairs of data points. Given a point cloud dataset, typically in high dimension, one begins by dividing the space into a set of overlapping slices. Within each slice, a local clustering algorithm partitions the points into separate clusters. Since the slices overlap, there will be common points between adjacent ones; one can then build a topologically simplified skeleton of the original dataset by joining together clusters that belong to adjacent slices and have non-empty intersection (i.e., that contain some of the same points) [80]. This approach is guaranteed to preserve the overall topology via the gluing of local clusterings.
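The construction just described fits in a few lines of code. The sketch below is a from-scratch, one-dimensional Mapper (filter along the first coordinate, DBSCAN as the local clusterer); all data and parameter values are illustrative:

```python
# Minimal Mapper: overlapping slices of a filter function, local clustering
# within each slice, then edges between clusters that share points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
points = rng.standard_normal((500, 2))        # toy point cloud
filt = points[:, 0]                           # filter (lens) function

n_slices, overlap = 8, 0.3
lo, width = filt.min(), (filt.max() - filt.min()) / n_slices

clusters = []
for i in range(n_slices):                     # overlapping slices
    a, b = lo + (i - overlap) * width, lo + (i + 1 + overlap) * width
    idx = np.where((filt >= a) & (filt <= b))[0]
    if idx.size:
        labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(points[idx])
        clusters += [set(idx[labels == k]) for k in set(labels) - {-1}]

edges = {(u, v) for u in range(len(clusters))   # glue step: clusters sharing
         for v in range(u + 1, len(clusters))   # points become edges
         if clusters[u] & clusters[v]}
print(len(clusters), "nodes and", len(edges), "edges in the Mapper graph")
```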
Mapper lends itself to the analysis of very large datasets, because the complete problem (e.g., the overall clustering structure) is subdivided into any number of smaller local problems (the clusterings within slices), which can be run in parallel and are merged only at the final step. Moreover, the local clusterings depend only on the distances between the points in each slice, so even high-dimensional data are effectively reduced to a (typically small) distance matrix. These properties make Mapper a very good tool for the analysis of large-scale data, as it maps naturally onto big data frameworks such as Google's MapReduce paradigm [98].
Despite the useful properties of Mapper, to our knowledge only one recent study has leveraged it for the study of rfMRI data. Kyeong et al. [99] used the Mapper algorithm to investigate the relationship between brain functional connectivity and characteristics of ADHD (from the ADHD-200 datasets). Because ADHD is defined as a single disorder without subtypes [100], the topological network obtained from the Mapper algorithm presents as one long, gradual progression. Although this study does not show the clustering potential of the Mapper algorithm for identifying meaningful subtypes, the resulting topological network can significantly distinguish patients with ADHD from normal control subjects.
To discuss this in more detail: standard clustering approaches for rfMRI work by constructing a series of spatially (or ICA-) coherent coarse-grained regions [1], which are then treated as nodes of a similarity or correlation network. However, Zuo and Xing [101] strongly recommend voxel-wise analysis, because analyzing signals averaged over multiple voxels on the basis of anatomical structure can compromise the reliability and interpretability of the derived results. Clustering the activity time-series obtained during rfMRI is the direct and natural application of the Mapper algorithm. Thanks to their scalability, Mapper approaches could directly address high-resolution voxel-level datasets, without any preliminary coarse-graining of regions or resampling of the data to a lower isotropic resolution, and would yield a fully functional representation. We could thus use clustering-based Mapper algorithms in place of the slower methods currently used in rfMRI studies: hierarchical clustering [102], spectral clustering, k-means clustering, or fuzzy clustering [103].
Further, clustering is an exploratory, data-driven approach used to overcome the limitations of model-based analyses (e.g., SCA, ReHo, ALFF, and fALFF). Although it serves similar purposes as other common data-driven methods such as ICA and PCA, a systematic fMRI comparison of several clustering and ICA methods [104] showed that clustering outperforms ICA (the most frequently used method in rfMRI studies [105]) for classification purposes. While the efficacy of PCA depends strongly on assumptions of linearity, normality, and high SNR of the rfMRI data, clustering-based Mapper algorithms are free from these assumptions and have managed to extract non-trivial qualitative information from large-scale datasets (e.g., identifying a previously unknown subtype of breast cancer with a unique mutational profile and excellent survival [106]).
Note also that the output of Mapper depends critically on the chosen slicing of the original dataset; in other words, the choice of slicing determines the interpretation of the resulting network. This opens the door to combining the full set of existing data-reduction and data-analysis techniques with Mapper: for example, the slices can be defined along the main directions obtained by (group) PCA, ICA, or similar decomposition techniques [30], [31], [105], that is, using information fully contained within the dataset itself. It is also possible to augment this information by including in the slicing function meta-information about the subjects under study, making this tool extremely versatile for both data exploration and feature extraction in large, complex datasets.
4.2 Graph Theory and Networks
Graph theory is the mathematics of networks, which describe pairwise relationships [107] as sets of nodes and links, usually equipped with weights. Networks, thanks to their expressive power and simplicity, have become over the last decade one of the most popular tools for describing both the brain's physical structure and its patterns of activity [108]. Indeed, network representations have uncovered a large set of properties of brain function that could previously hardly be described: among others, we now know that specific functional subnetworks correspond to known cognitive and sensory modalities [109], that the observed robustness of the brain to lesions and perturbations is rooted in the combination of small-worldness and the strong local clustering coefficient displayed by real-world networks [110], and that information in the brain is processed in tightly integrated modules and then shared across longer distances via long-range links [111]. Until recently, most research on functional networks focused on small parcellations, because they provide anatomically interpretable descriptions and facilitate the computation of graph metrics, which can be computationally cumbersome. This trend is changing, however, due to the combined effect of increased computational power, optimised network analysis libraries [112], [113], and more accurate measurements. For example, the first tools to analyze large-scale neural network data over Spark architectures [114], as well as scalable techniques able to process, analyze, and correlate fMRI data at the full-voxel level [115], are being developed, de facto scaling network techniques to big data. Despite their success, however, networks can only describe many-body interactions as sums of pairwise interactions, an assumption that is not always verified and that, in some applications, can yield a biased representation of the system under study.
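The standard network pipeline sketched above, thresholding a correlation matrix and computing graph measures, takes only a few lines with NumPy and NetworkX; the random time-series and the 0.15 threshold below are purely illustrative:

```python
# Build a functional network from correlations and compute two classic metrics.
import numpy as np
import networkx as nx

rng = np.random.default_rng(4)
ts = rng.standard_normal((200, 90))        # timepoints x regions (toy data)
fc = np.corrcoef(ts, rowvar=False)

adj = (np.abs(fc) > 0.15).astype(int)      # binarize at a chosen threshold
np.fill_diagonal(adj, 0)
G = nx.from_numpy_array(adj)

print("mean clustering coefficient:", nx.average_clustering(G))
giant = G.subgraph(max(nx.connected_components(G), key=len))
if giant.number_of_nodes() > 1:
    print("characteristic path length:",
          nx.average_shortest_path_length(giant))
```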
4.3 Persistent Homology
A progressively more popular answer to the need to describe higher-order interactions is given by another TDA technique, persistent homology. It yields deeper, quantitative information about the shape of a dataset than Mapper, and allows richer descriptions than networks, at the cost of increased interpretative complexity. Persistent homology works by building a multi-scale summary of a whole dataset via a series of progressively finer approximations, called a filtration, of the relations between neighbourhoods of points. The filtration is the key device that allows all possible thresholds to be considered at once, avoiding one of the main drawbacks of graph-theoretic analyses. In addition, persistent homology is phrased in the language of simplicial complexes which, by construction, describe many-body interaction patterns and thus go beyond the network description based on two-point interactions (while edges are defined on two points, simplices are generic sets of points) [1]. For this reason, it has found wide application in neuroscience, with direct applications to the study of rfMRI correlation networks for healthy [116], [117], [118] and altered [119] or pathological [120] brain states, models of spatial learning [121], [122], and dynamical functional connectivity [123].
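As a hedged sketch of how such a filtration is computed in practice, the following assumes the `ripser` Python package and converts correlations to distances as 1 - r (a common, though not unique, choice); the time-series are toy data:

```python
# Persistence diagrams of a functional correlation network.
import numpy as np
from ripser import ripser

rng = np.random.default_rng(5)
ts = rng.standard_normal((200, 90))            # timepoints x regions
fc = np.corrcoef(ts, rowvar=False)

dist = 1.0 - fc                                # correlation distance
np.fill_diagonal(dist, 0.0)

# The filtration sweeps all distance thresholds at once; H0 tracks connected
# components, H1 tracks loops.
dgms = ripser(dist, maxdim=1, distance_matrix=True)["dgms"]
print(len(dgms[0]), "H0 features and", len(dgms[1]), "H1 features")
```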
Indeed, even when starting directly from network data, persistent homology can provide information that is not easily, or sometimes not at all, available from the standard combinatoric or statistical mechanical point of view: for example, topological distances defined via persistence diagrams, which are useful for discriminating between brain networks [124], and multi-scale descriptions of the functional network, i.e., descriptions that do not require choosing a threshold, yielding discriminative power that was absent from a purely graph-theoretic perspective [119], [125].
Interestingly, once topological features are detected, statistical mechanical methods can contribute substantially to their interpretation, e.g., via projections to simpler representations (such as scaffolds [116]) and via the modeling of what should be considered significant structure versus noise, e.g., by constructing minimal topological random null models [126], [127], [128].
One of the main limits for the application to large datasets, however, is that persistent homology can be computationally cumbersome if computed naively. Recent algorithmic advances have significantly reduced its complexity, and parallel algorithms have become available, such as a spectral sequence algorithm [129], a chunk algorithm [93], [94], and a number of others (e.g., [130], [131], [132], [133]). As a result, persistent homology can now be applied to very large, high-dimensional data sets, for example fMRI data.
Furthermore, there have been recent advances in methods to compare the information obtained from persistent homology across subjects and groups: the persistence landscape, introduced by Bubenik et al. [134], allows the direct comparison of the persistence profiles of different subjects, while kernelization techniques [135], [136] make it possible to apply machine-learning techniques to persistent homology.
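For illustration, a persistence landscape can be computed from scratch: each diagram point (b, d) contributes a tent function peaking at height (d - b)/2, and the level-k landscape is the k-th largest tent value at each grid point, giving a functional summary that can be averaged or compared across subjects. The diagram below is a toy example, not taken from any real dataset:

```python
# From-scratch persistence landscape of a toy persistence diagram.
import numpy as np

dgm = np.array([[0.1, 0.9], [0.2, 0.5], [0.4, 0.8]])  # (birth, death) pairs
grid = np.linspace(0.0, 1.0, 101)

# Tent functions: min(t - birth, death - t), clipped at zero.
tents = np.clip(np.minimum(grid[None, :] - dgm[:, :1],
                           dgm[:, 1:] - grid[None, :]), 0.0, None)
landscapes = -np.sort(-tents, axis=0)      # row k = level-k landscape
print("lambda_1 peak:", landscapes[0].max())   # (d - b)/2 of the deepest feature
```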
Persistent homology, while very promising, is still in its infancy as a branch of data science. It provides a radically new perspective on how we approach data and brings with it a new language grounded in algebraic topology. However, open challenges remain before its potential for the study of large rfMRI datasets can be fully leveraged. The first and most obvious is the need to keep improving its computational scalability: while topological simplification via Mapper is cheap and scalable, it does not directly yield the quantitative output that persistent homology provides. It is thus paramount to improve further on existing implementations, in particular in the direction of effective simplicial-complex reduction schemes that preserve not only the global topological information but also the actual localization of homology classes [130]. A second challenge is lowering the entry cost for practitioners from outside the TDA community who seek to apply these techniques to their specific case studies. Although the required mathematical background is significant, user-friendly and well-documented software packages dedicated to fMRI analysis would already go a long way in this direction.
5 Conclusion
The era of “Biomedical Big Data” has arrived for rfMRI research, thanks to the unrestricted sharing and open access of big neuroimaging data through the 1000 Functional Connectomes Project and the Human Connectome Project. These large-scale rfMRI data exhibit the 5 V's of Big Data: Volume, Veracity, Variety, Velocity, and Value. There is thus an urgent need to develop data preprocessing pipelines and analysis methods for big rfMRI data.
For data preprocessing pipelines, three alternative approaches to accessing big preprocessed rfMRI data were presented. Investigators who would like to analyze big rfMRI data but lack sufficient resources to acquire or preprocess them, or who prefer to focus on data analysis rather than on acquisition and preprocessing, can start from the first two approaches: the minimal preprocessing pipelines and the Preprocessed Connectomes Project. Investigators with enough resources to preprocess large-scale data can choose one of the software suites designed for preprocessing big data. In either case, a comprehensive investigation of the effects of preprocessing steps on the results of functional connectivity analyses, as well as further development of new preprocessing software packages for large-scale data, is highly necessary in future studies.
After rfMRI data have been preprocessed, several methods are commonly used in rfMRI studies to examine functional connectivity, such as SCA, PCA, ICA, and clustering. Recently, more sophisticated studies have extended these approaches to identify large-scale brain networks. However, the existing common methods have limitations, and novel methods are essential for big rfMRI data analysis. We have proposed applying Topological Data Analysis to rfMRI functional connectivity; many properties of TDA clearly show the potential of its methods for big rfMRI data analysis. Clinical applications of TDA on rfMRI should be explored in future studies.
ACKNOWLEDGMENTS
The authors acknowledge the support of the ADnD project by Compagnia San Paolo.