Proteomic and genomic data classification using decision tree based ensemble methods - Application to the diagnosis of inflammatory diseases
Modern medical instrumentation and acquisition technologies (mass spectrometry, microarrays, sequencing tools...) generate large datasets describing, for example, patients, animals, tissues, or cells. Analyzing such amounts of data is impossible without efficient computer-based tools. Data mining refers to the application of machine learning and visualization techniques to help a human expert extract potentially interesting and synthetic knowledge from these large volumes of raw data. Potential medical applications are the automatic design of diagnostic or prognostic tools for a given disease and the identification of potential biomarkers for this disease.
Recent developments in machine learning allow one to exploit datasets characterized by small numbers of very high-dimensional samples, without prior feature selection or extraction. We propose a systematic approach for mining proteomic and genomic data based on decision tree ensemble methods. The overall objective of the proposed software is twofold: to use experimental datasets to identify one or several biomarkers that are specific to a given disease, able to discriminate among a certain class of diseases, or indicative of treatment response; and to construct predictive models that exploit the biomarker intensities to help physicians in medical diagnosis or prognosis. It includes clearly defined pre- and post-processing steps as well as the invocation of a toolbox of generic decision tree based methods (Bagging, Random Forests, Extra-Trees, Boosting). We chose decision tree methods because they are computationally very efficient and can easily be exploited to help physicians identify, among a large number of candidate biomarkers, those best suited to a particular discrimination task.
We provide results of our approach on datasets of two experimental studies based on surface-enhanced laser desorption/ionization time of flight mass spectrometry (SELDI-TOF-MS). They concern the diagnosis of patients suffering from inflammatory diseases, namely rheumatoid arthritis and inflammatory bowel diseases. The motivation of the first study is to complement existing diagnostic tests for rheumatoid arthritis in order to allow an earlier diagnosis of this disease. The goal of the second study is to provide a robust and easy-to-use diagnostic tool much less invasive than current diagnostic tests including endoscopy and histology.
Each database collects mass spectra obtained from sera of healthy and diseased patients using different chip arrays, including strong anion-exchange (Q10), weak cation-exchange (CM10), and hydrophobic (H4) surfaces. The same framework was applied to the two problems and gave very promising results both for the induction of predictive models and for the identification of biomarkers. In both cases, we obtained predictive models with sensitivity and specificity above 80%[1]. Sensitivity and specificity were further increased to more than 90% by combining several MS replicates per patient. These results are superior to those obtained with the standard pre- and post-processing techniques used in these applications (peak detection and p-value-based biomarker selection) and are also competitive with existing practice for the diagnosis of these diseases.
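To illustrate the idea of combining several replicate spectra per patient, here is a minimal sketch in Python. It is an assumption, not the procedure used in the study: it averages hypothetical per-replicate model scores for each patient, thresholds the mean, and compares sensitivity and specificity at the replicate and patient levels. All identifiers, scores, and the 0.5 threshold are invented for illustration.

```python
from collections import defaultdict


def aggregate_by_patient(patient_ids, replicate_scores, threshold=0.5):
    """Average the replicate scores of each patient, then threshold the mean.

    Averaging is one possible combination rule (an assumption here);
    majority voting over replicate-level decisions would be another.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, score in zip(patient_ids, replicate_scores):
        sums[pid] += score
        counts[pid] += 1
    return {pid: int(sums[pid] / counts[pid] >= threshold) for pid in sums}


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = diseased)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data: four patients, two replicate spectra each.
patient_ids = ["p1", "p1", "p2", "p2", "p3", "p3", "p4", "p4"]
scores = [0.9, 0.4, 0.2, 0.6, 0.7, 0.8, 0.1, 0.3]  # model score per replicate
truth = {"p1": 1, "p2": 0, "p3": 1, "p4": 0}        # 1 = diseased, 0 = healthy

# Replicate-level evaluation: each spectrum is classified on its own.
rep_true = [truth[pid] for pid in patient_ids]
rep_pred = [int(s >= 0.5) for s in scores]
rep_sens, rep_spec = sensitivity_specificity(rep_true, rep_pred)

# Patient-level evaluation: replicate scores are averaged first.
patient_pred = aggregate_by_patient(patient_ids, scores)
pat_ids = sorted(truth)
pat_sens, pat_spec = sensitivity_specificity(
    [truth[p] for p in pat_ids], [patient_pred[p] for p in pat_ids]
)

print("replicate level:", rep_sens, rep_spec)
print("patient level:  ", pat_sens, pat_spec)
```

In this toy example, two replicates are misclassified individually, but their patients are classified correctly once the scores are averaged, which is the kind of gain the aggregation step aims for.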
Moreover, the methodology revealed a small number of variables that the tree based models deem sufficient to discriminate among classes of patients and that thus constitute potential biomarkers specific to the studied diseases. The interest of these biomarkers is twofold. First, the downstream purification and identification of the proteins associated with these biomarkers could help physicians better understand such diseases and highlight new therapeutic targets. Second, applying machine learning algorithms to these reduced sets of biomarkers could provide more reliable models than those using the whole set of variables. Because the approach is generic, it could be applied to other problems and to other proteomic and/or genomic data acquisition schemes. Current work concerns its application to microarray data for the classification of brain tumors.
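The way a tree ensemble can single out a few discriminating variables can be sketched as follows. This is a deliberately simplified stand-in, not the toolbox described above: it bags one-split trees (decision stumps) over random feature subsets and ranks variables by their accumulated Gini impurity reduction. The synthetic data, the function names, and all parameter values are assumptions made for illustration only.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary labels (0 or 1)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_stump(X, y, features):
    """Return (feature, impurity_reduction) of the best threshold split."""
    n = len(y)
    parent = gini(y)
    best = (None, 0.0)
    for j in features:
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [y[i] for i in range(n) if X[i][j] <= thr]
            right = [y[i] for i in range(n) if X[i][j] > thr]
            children = (len(left) * gini(left) + len(right) * gini(right)) / n
            gain = parent - children
            if gain > best[1]:
                best = (j, gain)
    return best


def rank_variables(X, y, n_trees=200, seed=0):
    """Rank variables by total impurity reduction over bagged stumps."""
    rng = random.Random(seed)
    n, d = len(y), len(X[0])
    k = max(1, int(d ** 0.5))  # features tried per stump (assumed setting)
    importance = [0.0] * d
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        j, gain = best_stump(Xb, yb, rng.sample(range(d), k))
        if j is not None:
            importance[j] += gain
    total = sum(importance)
    importance = [v / total for v in importance]  # normalize to sum to 1
    order = sorted(range(d), key=lambda j: -importance[j])
    return order, importance


# Synthetic data (hypothetical): 40 samples, 9 variables, only variable 3
# carries class information; the others are pure noise.
data_rng = random.Random(42)
y = [i % 2 for i in range(40)]
X = [[data_rng.gauss(0.0, 1.0) for _ in range(9)] for _ in y]
for row, label in zip(X, y):
    row[3] = 3.0 * label + data_rng.gauss(0.0, 0.5)

ranking, importance = rank_variables(X, y)
print("most important variable:", ranking[0])
```

On real spectra, each variable would correspond to an m/z intensity, and the top-ranked variables would be the candidate biomarkers to inspect further. Full-depth trees, as used by Random Forests or Extra-Trees, would replace the stumps in practice.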
BibTeX reference
@InProceedings{GFDMMW05,
author = "Geurts, Pierre and Fillet, Marianne and deSeny, Dominique and Meuwis, Marie-Alice and Merville, Marie-Paule and Wehenkel, Louis",
title = "Proteomic and genomic data classification using decision tree based ensemble methods - Application to the diagnosis of inflammatory diseases",
booktitle = "Proc. Benelux Bioinformatics Conference, Ghent",
month = "apr",
year = "2005",
keywords = "bioinformatics, machine learning",
url = "http://www.montefiore.ulg.ac.be/services/stochastic/pubs/2005/GFDMMW05"
}