Data Integration in Genetics and Genomics: Methods and Challenges
- Jemila S. Hamid (1), jemila{at}utstat.toronto.edu
- Pingzhao Hu (2), phu{at}sickkids.ca
- Nicole M. Roslin (2), nroslin{at}sickkids.ca
- Vicki Ling (1, 3), vicki.ling{at}utoronto.ca
- Celia M. T. Greenwood (4), celia.greenwood{at}utoronto.ca
- Joseph Beyene (1, 2, 4), joseph{at}utstat.toronto.edu
- (1) Biostatistics Methodology Unit, The Hospital for Sick Children Research Institute, 555 University Avenue, Toronto, ON, Canada, M5G 1X8
- (2) The Center for Applied Genomics, The Hospital for Sick Children Research Institute, 555 University Avenue, Toronto, ON, Canada, M5G 1X8
- (3) Program in Developmental and Stem Cell Biology, The Hospital for Sick Children Research Institute, 555 University Avenue, Toronto, ON, Canada, M5G 1X8
- (4) Dalla Lana School of Public Health, University of Toronto, 555 University Avenue, Toronto, ON, Canada, M5G 1X8
Abstract
Due to rapid technological advances, various types of genomic and proteomic data with different sizes, formats, and structures have become available. Among them are gene expression, single nucleotide polymorphism, copy number variation, and protein-protein/gene-gene interaction data. Each of these distinct data types provides a different, partly independent and complementary, view of the whole genome. However, understanding the functions of genes, proteins, and other aspects of the genome requires more information than any single dataset provides. Integrating data from different sources is, therefore, an important part of current research in genomics and proteomics. Data integration also plays an important role in combining clinical, environmental, and demographic data with high-throughput genomic data. Nevertheless, the concept of data integration is not well defined in the literature, and it may mean different things to different researchers. In this paper, we first propose a conceptual framework for integrating genetic, genomic, and proteomic data. The framework captures fundamental aspects of data integration and is built around the key steps in genetic, genomic, and proteomic data fusion. Second, we review some of the most commonly used current methods and approaches for combining genomic data, with a focus on the statistical aspects.
- Received September 25, 2008.
- Accepted December 1, 2008.
- © 2009 Jemila S. Hamid et al.