Understanding the effects of genes and environmental factors on the development of complex diseases, such as cancer, is a major aim of genetic epidemiology. Such diseases are controlled by complex molecular mechanisms characterised by the joint action of several genes, each having only a small effect. In this context, traditional single-marker methods are of limited use, and more advanced and efficient methods are needed to identify gene interactions and epistatic patterns of susceptibility.

Standard methods to analyse case-control data in this context broadly fall into two classes: parametric multilocus methods, including regression (e.g., Park and Hastie 2007) and (bagged) logic regression (Ruczinski et al., 2004), and non-parametric multilocus techniques, such as most machine learning and data mining approaches. Several data mining methods have been used for interaction detection: tree-based methods (e.g., Recursive Partitioning and Random Forests), pattern recognition methods (e.g., Symbolic Discriminant Analysis, Mining association rules, Neural networks and Support vector machines), and data reduction methods (e.g., Detection of Informative Combined Effects, Multifactor Dimensionality Reduction and Logic regression). An overview is given by Onkamo and Toivonen (2006).

The aforementioned non-parametric approaches are appealing because they impose no distributional assumptions on the genotype-phenotype effect, whereas parametric approaches are severely limited when there are too many independent variables relative to the number of observed outcome events. However, when analysing gene interactions in case-control studies, adjustment for confounding variables and for main effects is usually required, and here parametric methods may be more flexible.

In this project you will focus on the Multifactor Dimensionality Reduction (MDR) method (Ritchie et al. 2001; Figure 2), which has recently become very popular. The MDR strategy tackles the dimensionality problem of interaction detection by reducing the multiple dimensions to one: multilocus genotypes are pooled into two risk groups, high and low. Although MDR has proven useful and has some attractive properties, it also has major drawbacks. In particular, important interactions can be missed because too many cells are pooled together when the high- and low-risk cells are defined. Hence, the main goal is to propose alternative definitions of high- and low-risk cells and to compare them, via simulations in R, with the existing definition. R code for particular steps of the classical MDR approach is already available.
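
To make the pooling step concrete, the following is a minimal sketch of how multilocus genotype cells might be labelled high or low risk. It is written in Python purely for illustration (the project itself uses R) and is not the available R code. The function names, the toy data, the threshold of 1.0 on the case/control ratio, and the choice to label ties as high risk are all assumptions made for this example; an alternative definition of high and low risk would replace the labelling rule inside the loop.

```python
from collections import Counter


def mdr_high_low(genotypes, status, threshold=1.0):
    """Pool multilocus genotype cells into 'high' and 'low' risk groups.

    genotypes: list of tuples, one multilocus genotype per individual
    status:    list of 0/1 labels (1 = case, 0 = control)
    threshold: case/control ratio at or above which a cell is 'high' risk
               (1.0 is an assumption suited to balanced data)
    """
    cases = Counter(g for g, s in zip(genotypes, status) if s == 1)
    controls = Counter(g for g, s in zip(genotypes, status) if s == 0)
    labels = {}
    for cell in set(cases) | set(controls):
        n_case, n_ctrl = cases[cell], controls[cell]
        # A cell with cases but no controls is treated as an infinite ratio.
        ratio = n_case / n_ctrl if n_ctrl > 0 else float("inf")
        labels[cell] = "high" if ratio >= threshold else "low"
    return labels


def classification_error(genotypes, status, labels):
    """Fraction of individuals whose status disagrees with their cell label."""
    wrong = sum(
        (labels[g] == "high") != (s == 1) for g, s in zip(genotypes, status)
    )
    return wrong / len(status)


# Toy data (hypothetical): two SNPs, each coded 0/1/2.
genotypes = [(0, 0), (0, 0), (0, 0), (1, 1), (1, 1), (1, 1), (2, 0), (2, 0)]
status = [0, 0, 1, 1, 1, 0, 1, 0]

labels = mdr_high_low(genotypes, status)
error = classification_error(genotypes, status, labels)
```

In this toy example the cell (2, 0) contains one case and one control, so its label depends entirely on how ties are handled. Borderline cells like this are where alternative high/low definitions can be expected to differ from the classical one.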

See vansteen-highlow.pdf for references and figures.