TFE 2010-2011 (final year project)

Feature selection as pre-screening tool for multifactor dimensionality reduction methods

Understanding the effects of genes and environmental factors on the development of complex diseases, such as cancer, is a major aim of genetic epidemiology. Such diseases are controlled by complex molecular mechanisms characterized by the joint action of several genes, each having only a small effect. In this context, traditional single-marker methods are of limited use, and more advanced and efficient methods are needed to identify gene-gene interactions, or epistatic patterns.

The Multifactor Dimensionality Reduction method, MDR (Ritchie et al. 2001), has achieved great popularity. The MDR strategy tackles the dimensionality problem related to interaction detection by reducing the multiple dimensions to one: multi-locus genotypes are pooled into two risk groups, high and low. It is an attractive technique for detecting gene-gene interactions in case-control studies for several reasons: it allows for the detection of multiple genetic loci jointly associated with a discrete clinical endpoint in the absence of a main effect; it is non-parametric in nature; it requires no assumptions about the underlying genetic inheritance model; and it generates low false positive rates. The method has since been refined, leading to a similar strategy developed in-house, Model-Based Multifactor Dimensionality Reduction (MB-MDR), with improved characteristics. The basic steps of MB-MDR are displayed in Figure 1.
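As a rough illustration of the pooling idea described above, the following minimal sketch labels each two-locus genotype cell as high or low risk by comparing its case/control ratio to a threshold. It is a simplified version of the classic MDR pooling step only, not MB-MDR (see Figure 1 for its steps), and it omits cross-validation and model selection. The function name `pool_genotypes`, the 0/1/2 genotype coding, the default threshold (the overall case/control ratio) and the toy data are all assumptions made for this example.

```python
from collections import Counter


def pool_genotypes(geno_a, geno_b, status, threshold=None):
    """Label each observed two-locus genotype cell 'H' (high) or 'L' (low risk).

    geno_a, geno_b: genotypes at two markers, coded 0/1/2 (assumed coding).
    status: 1 for a case, 0 for a control.
    threshold: case/control ratio at or above which a cell is high risk;
               defaults to the overall case/control ratio (an assumption).
    """
    n_cases = sum(status)
    n_controls = len(status) - n_cases
    if threshold is None:
        if n_controls == 0:
            raise ValueError("at least one control is required")
        threshold = n_cases / n_controls

    # Count cases and controls in every observed genotype combination.
    cases, controls = Counter(), Counter()
    for a, b, s in zip(geno_a, geno_b, status):
        if s == 1:
            cases[(a, b)] += 1
        else:
            controls[(a, b)] += 1

    labels = {}
    for cell in set(cases) | set(controls):
        n_ctrl = controls[cell]
        # A cell with cases but no controls has an unbounded ratio.
        ratio = cases[cell] / n_ctrl if n_ctrl else float("inf")
        labels[cell] = "H" if ratio >= threshold else "L"
    return labels


# Toy data: 8 individuals, 2 markers (purely illustrative).
geno_a = [0, 0, 1, 1, 2, 2, 0, 1]
geno_b = [0, 0, 1, 1, 0, 0, 1, 1]
status = [1, 0, 1, 1, 0, 0, 0, 1]

labels = pool_genotypes(geno_a, geno_b, status)
# The two markers are now summarized by a single high/low risk variable.
pooled = [labels[(a, b)] for a, b in zip(geno_a, geno_b)]
print(labels)
print(pooled)
```

The list `pooled` is the one-dimensional variable that replaces the two-marker genotype table; in a full analysis it would then be tested for association with disease status.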

Without parallel computing, performing MB-MDR at a genome-wide scale (involving perhaps 1 million markers) is prohibitive. Even with parallel computing, and with the emergence of genome-wide next-generation sequencing results, a pre-selection of markers is mandatory to keep computation time and memory usage within limits. It makes a huge difference whether all possible pairs or triples of markers are investigated in a set of N = 250 or N = 1,000,000 markers. Nevertheless, success in detecting higher-order genetic interactions using only N markers may depend heavily on the choice of the subset.
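To make the difference in scale concrete, the number of marker pairs and triples to be examined is given by the binomial coefficient C(N, k). The short sketch below evaluates it for the two values of N mentioned above; the function name `n_marker_sets` is introduced only for this illustration.

```python
import math


def n_marker_sets(n_markers, order):
    """Number of distinct marker combinations of the given interaction order."""
    return math.comb(n_markers, order)


for n in (250, 1_000_000):
    pairs = n_marker_sets(n, 2)
    triples = n_marker_sets(n, 3)
    print(f"N = {n:>9,}: {pairs:,} pairs, {triples:,} triples")

# N = 250:       31,125 pairs and 2,573,000 triples
# N = 1,000,000: 499,999,500,000 pairs and 166,666,166,667,000,000 triples
```

Going from 250 to 1,000,000 markers multiplies the number of pairs by roughly 1.6 x 10^7 and the number of triples by roughly 6.5 x 10^10, which is why exhaustive screening is out of reach without pre-selection.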

The topic of this project is to investigate several existing strategies for selecting favorable combinations of markers, so as to increase the power to identify important genetic interactions. First, the study will involve a literature search on the feature selection methods that are currently available (e.g., wrapper models, filter models such as the established TuRF method of Moore and White 2007, or Bayesian methods such as those developed by Sebastiani et al. 2008), followed by a structured overview of these methods. A good starting point is the paper of Saeys et al. (2007), but additional strategies have been developed since then. Although several reviews are currently available, they mostly address the topic from a single angle or discuss the validity of the methods in fields other than epistasis. The literature study should give you a good understanding of the pros and cons of the methods in the context of epistasis screening and the search for networks of interacting markers. Second, the performance of the most promising techniques needs to be assessed via simulations and compared with the computational properties of MB-MDR without pre-screening. Third, time permitting, a real-life data application may supplement the study. Depending on the approaches taken in this project, the thesis may lead to a publication in, for instance, BMC Genomics.

See van_steen_1_2011_prescreening_before_mdr.doc for references and figures.

Information, supervisor:

Kristel Van Steen (Kristel.VanSteen@ulg.ac.be)