Data Coordination and Analysis

The Data Coordination Centre will be responsible for data storage and distribution, as well as for quality assessment, following parallel developments in other large-scale genome-based consortia, particularly the ICGC, ENCODE and 1000 Genomes projects. BLUEPRINT's policies are in full agreement with those of IHEC.

Raw Data Storage

The three basic data collection sites will use their own LIMS and local databases to store the raw data, and will apply their internal data cleaning pipelines in line with the agreed standard quality assessment. The National Genome Analysis Centre (CNAG – PCB) in Barcelona will be responsible for the project's general quality standards. Because the considerable amount of data produced by the project cannot be stored and transmitted easily, BLUEPRINT will rely on the public EBI European Nucleotide Archive (ENA) to store the primary data (reads) and on EBI resources to perform the initial curation and the subsequent alignment of those reads. The ENA repository ensures public access, allows data to be kept on hold during the analysis phase, and has implemented the proper mechanisms for restricting access to clinical/medical data according to IHEC policies.

BLUEPRINT Database and Data Access Mode

The project database will contain pointers to the primary data (reads) and alignments, and will use the Ensembl annotation standards and genomic coordinate system. It will additionally contain all the metadata resulting from sample annotation (clinical/medical data) as well as the results of the computational analysis. The database will follow the decentralized, federated and coordinated model based on the BioMart data technology, which is also implemented by the ICGC consortium, and will allow integrated analysis of the data with information extracted from key databases such as Reactome, UniProt, HapMap, Ensembl and others. Data integration will be in line with IHEC policy. The key site for this federated database will be hosted at the Barcelona Supercomputing Centre in both test and operational versions.
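As a purely illustrative sketch of what a record holding pointers, metadata and analysis results might look like, the Python below models one experiment entry. It is not the BLUEPRINT schema: the class names, field names and placeholder identifiers are all hypothetical, and the only convention taken from outside the text is that Ensembl coordinates are 1-based with inclusive start and end.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class GenomicRegion:
    """Region in Ensembl-style coordinates: 1-based, start and end inclusive."""
    chromosome: str
    start: int
    end: int

    def __post_init__(self) -> None:
        if self.start < 1 or self.end < self.start:
            raise ValueError("expected 1 <= start <= end")

    def overlaps(self, other: "GenomicRegion") -> bool:
        # Closed intervals overlap when each one starts before the other ends.
        return (self.chromosome == other.chromosome
                and self.start <= other.end
                and other.start <= self.end)


@dataclass
class ExperimentRecord:
    """Hypothetical record: stores pointers to archived data, not the data."""
    sample_id: str
    reads_accession: str        # pointer to the primary reads in the archive
    alignment_accession: str    # pointer to the alignment of those reads
    restricted: bool            # True if access to the sample data is controlled
    metadata: Dict[str, str] = field(default_factory=dict)
    results: List[GenomicRegion] = field(default_factory=list)

    def results_in(self, region: GenomicRegion) -> List[GenomicRegion]:
        """Return the analysis results that overlap the queried region."""
        return [r for r in self.results if r.overlaps(region)]


# Placeholder identifiers only; none of these are real accessions.
record = ExperimentRecord(
    sample_id="SAMPLE-0001",
    reads_accession="READS-PLACEHOLDER-1",
    alignment_accession="ALIGN-PLACEHOLDER-1",
    restricted=True,
    metadata={"cell_type": "example"},
    results=[
        GenomicRegion("1", 100, 200),
        GenomicRegion("1", 500, 600),
        GenomicRegion("2", 100, 200),
    ],
)
hits = record.results_in(GenomicRegion("1", 150, 550))
```

Keeping only accessions in the record reflects the pointer-based design described above: the bulky reads and alignments stay in the archive, while the lightweight record carries the metadata and results needed for integrated queries.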

Data Analysis

One of the main goals of the project is to comprehensively analyze the diverse epigenomic maps and make them available to the scientific community as an integrated BLUEPRINT-IHEC resource. Integration is envisaged with related projects within species (e.g. 1000 Genomes Project) and between species (e.g. modENCODE) to better understand functional aspects (e.g. shared pathways) and the evolution of cell lineage development. The interface to the BLUEPRINT data will be based on the Ensembl system, which will be customized to allow the integration of epigenomic and genomic information. Additionally, the project will provide a realistic, user-friendly and intuitive interface to the data for non-expert users. The analysis of the BLUEPRINT data should also catalyze a better understanding of the relationship between epigenetic and genomic information, and in particular will set the basis for the construction of a new generation of methods for predicting epigenetic maps from characteristics of the genome sequence. Such prediction methods will facilitate a move towards more quantitative knowledge and modeling of epigenetic mechanisms. As a result of BLUEPRINT, such models could in the future lead to “reverse engineering” of regulatory networks to repair or restore epigenetic codes that have been perturbed by disease, and thus create new opportunities for diagnosis, prevention and treatment.

Research Area Leader: Alfonso Valencia

Data Coordination Centre (DCC) Leader: Paul Flicek
Data Analysis Centre (DAC) Leader: Alfonso Valencia