TFE 2011-2012 (final year project)

Using biological networks to search for interacting loci in genome-wide association studies

Because costs have become affordable, we can now routinely perform whole-genome sequencing of individual human genomes. However, the more we learn about the genetic etiology of a complex disease, the more we realize how much remains unknown. Genomes are composed of both protein-coding and non-protein-coding DNA, and we are only beginning to understand the mechanisms of gene expression: the initial product of genome expression is the transcriptome, and the final product is the proteome. Focusing on one particular platform or data type may miss an obvious signal. Combining different viewpoints on genome sequences, whether derived from the general population or from diseased individuals, will therefore be crucial. Such an “integrated” genome-wide analysis involves DNA (genomics), RNA expression (transcriptomics), protein expression (proteomics), and DNA methylation (epigenomics), and accounts for existing interactions within and between these omics data sets; it should increase insight into disease pathways while creating new opportunities for understanding cellular functional architecture. An “integrated” approach faces several challenges, including 1) data pre-processing and quality control, 2) high dimensionality requiring complex computational analyses, 3) an elevated multiple-testing burden, 4) finding the best way to integrate data, balancing sufficient detail against parsimony, and 5) interpretation (validation) of the final model(s). This project focuses on the interlinked challenges 2) to 4).
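
To give a feeling for how challenges 2) and 3) interact, the sketch below counts the unordered marker pairs that an exhaustive pairwise interaction screen would test and derives a per-test significance threshold. The Bonferroni correction, the 0.05 family-wise level, and the figure of 500,000 markers are illustrative assumptions, not choices made in this project; `pairwise_test_burden` is a hypothetical helper name.

```python
from math import comb


def pairwise_test_burden(n_markers: int, alpha: float = 0.05) -> tuple[int, float]:
    """Return (number of unordered marker pairs, Bonferroni per-test threshold).

    Assumption: one test per unordered pair and a plain Bonferroni correction.
    """
    if n_markers < 2:
        raise ValueError("need at least two markers to form a pair")
    n_pairs = comb(n_markers, 2)  # m * (m - 1) / 2
    return n_pairs, alpha / n_pairs


# Illustrative array size only (assumed, not taken from a specific platform).
n_pairs, threshold = pairwise_test_burden(500_000)
print(f"{n_pairs:,} pairwise tests, per-test threshold {threshold:.1e}")
```

The quadratic growth in the number of pairs is one reason why prior knowledge, such as a biological network, is attractive for restricting the search space.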

Two data sources that are routinely integrated are transcriptome data and genome data. In this thesis, we focus on the high-throughput, genome-wide measurement of gene expression in a natural population of unrelated humans, and on the subsequent association of variation in expression with “expression quantitative trait loci” (eQTLs) on DNA, genotyped using oligonucleotide arrays with hundreds of thousands of single-nucleotide polymorphism (SNP) markers that capture most human genetic variation well (Franke and Jansen 2009). This strategy has been successfully applied to several diseases, such as celiac disease (Hunt et al. 2008, Nat Genet 40, 395-402) and asthma (Moffatt et al. 2007, Nature 448, 470-473): associated genetic variants have been identified that affect levels of gene expression in cis or in trans, providing insight into the biological pathways affected by these diseases. A less common approach is to associate variation in expression with multiple genetic loci or with clusters of genetic loci.
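
As a minimal sketch of the classical single-locus eQTL idea described above, the following regresses the expression level of one transcript on the allele dosage (0, 1 or 2 copies of an allele) at one SNP. The additive coding, the simulated data, the effect size of 0.5, and the helper name `eqtl_slope` are all assumptions made for illustration; a real analysis would also need a significance test and covariate adjustment.

```python
import random


def eqtl_slope(genotypes, expression):
    """Least-squares slope of expression on allele dosage (additive coding assumed)."""
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxy = sum((g - mean_g) * (e - mean_e) for g, e in zip(genotypes, expression))
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("monomorphic SNP: no genotype variation")
    return sxy / sxx


# Simulated toy data (hypothetical): 500 unrelated individuals, one SNP, one transcript.
rng = random.Random(0)
genotypes = [rng.choice([0, 1, 2]) for _ in range(500)]
expression = [0.5 * g + rng.gauss(0.0, 1.0) for g in genotypes]

slope = eqtl_slope(genotypes, expression)
print(f"estimated effect per allele copy: {slope:.2f}")
```

Extending this from one SNP to pairs or clusters of loci is what turns a classical eQTL scan into the epistasis screening considered in this project.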

In practice, the thesis will consist of three main parts:

  • Building a co-expression network on available transcriptome data (e.g., Zhang and Horvath 2005, Fuller et al. 2007), investigating network properties, and linking important gene modules to the disease trait of interest. Remark: the choice between an immune-related phenotype and a cancer-related phenotype has not yet been fixed and is under negotiation with external international collaborators.
  • Building a statistical epistasis network using fast gene-gene interaction detection tools on genomic variants (e.g., Mahachie John et al. 2011) and investigating network properties.
  • Comparing the results obtained from both networks and performing an epistasis screening on genomic data with transcripts as quantitative traits, thereby linking transcriptome to genome data and extending classical eQTL analyses.
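
The first part can be sketched as follows, in the spirit of the weighted co-expression framework of Zhang and Horvath (2005): the adjacency between two genes is the absolute Pearson correlation of their expression profiles raised to a soft-thresholding power, and the connectivity of a gene is the sum of its adjacencies. The power `beta=6`, the three toy genes, and the function names are assumptions for illustration only; module detection and linking modules to a trait are not shown.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long profiles (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def soft_threshold_adjacency(expr, beta=6):
    """Unsigned weighted adjacency a_ij = |cor(x_i, x_j)| ** beta.

    expr maps a gene name to its expression values across samples.
    """
    genes = list(expr)
    adj = {g: {} for g in genes}
    for i, gi in enumerate(genes):
        for gj in genes[i + 1:]:
            a = abs(pearson(expr[gi], expr[gj])) ** beta
            adj[gi][gj] = a
            adj[gj][gi] = a  # the network is undirected
    return adj


def connectivity(adj):
    """Connectivity k_i = sum of adjacencies between gene i and all other genes."""
    return {g: sum(neighbours.values()) for g, neighbours in adj.items()}


# Hypothetical toy data: genes A and B are tightly co-expressed, C is not.
expr = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
adj = soft_threshold_adjacency(expr)
k = connectivity(adj)
print(k)
```

Raising correlations to a power keeps the network weighted while pushing weak correlations towards zero, which avoids choosing a hard cut-off. An epistasis network for the second part could reuse the same data structure, with genetic loci as nodes and interaction evidence as edge weights.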

Depending on the progress made in this project, the work may lead to a genuine scientific publication. This project will allow you to work together with other academic institutions throughout Europe.

  • Emily M, et al. (2009) Using biological networks to search for interacting loci in genome-wide association studies. European Journal of Human Genetics 17: 1231-1240.
  • Franke L, Jansen RC (2009) eQTL analysis in humans. Methods Mol Biol 573: 311-328.
  • Fuller TF, et al. (2007) Weighted gene coexpression network analysis strategies applied to mouse weight. Mammalian Genome 18: 463-472.
  • Mahachie John JM, Cattaert T, Van Lishout F, Van Steen K (2011) Model-Based Multifactor Dimensionality Reduction to detect epistasis for quantitative traits in the presence of error-free and noisy data. European Journal of Human Genetics 19: 696-703.
  • Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology 4: -.