Institut Montefiore - Service de Méthodes Stochastiques

Bioinformatics

Biological data classification

Contact: Pierre Geurts, Raphaël Marée, Louis Wehenkel

Because of the rapid progress of computer and information technology, large amounts of data are now available in almost every field of application. For example, modern medical instrumentation and acquisition technologies (mass spectrometry, microarrays, sequencing tools, ...) generate large datasets describing patients, animals, tissues, or cells. Analysing such volumes of data is impossible without efficient computer-based tools. Data mining refers to the application of automatic learning and visualization techniques to help a human expert extract potentially interesting, synthetic knowledge from these large volumes of raw data. Potential medical applications include the automatic design of diagnostic or prognostic tools for a given disease and the identification of potential biomarkers for that disease. Among many other applications, machine learning techniques can also be used to identify coding and non-coding regions in the genome of a given species, or for genetic linkage analysis.

Here at the Systems and Modeling research unit, we analyse existing tools and develop new machine learning methods and methodologies. This fundamental research is often driven by application needs. On the application side, we help users collect, clean, and design their databases. Our approach to a given task is then to apply and compare several modern machine learning techniques (e.g. decision tree based methods, neural networks, support vector machines, Bayesian networks). These algorithms are developed and adapted internally so as to provide users with a toolbox of machine learning techniques.
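The "apply and compare" step can be illustrated with a minimal sketch. It assumes nothing about the unit's internal toolbox: the three classifiers below (a majority-class baseline, a nearest-centroid rule and a 1-nearest-neighbour rule) are simple stand-ins for the methods listed above, the data are synthetic, and all function names are hypothetical. Each method is scored by k-fold cross-validation on identical folds so that the accuracies are comparable.

```python
import random
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fit_majority(X, y):
    # Baseline: always predict the most frequent training label.
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label

def fit_nearest_centroid(X, y):
    # One mean vector per class; predict the class of the closest mean.
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return lambda x: min(centroids, key=lambda c: sq_dist(x, centroids[c]))

def fit_one_nn(X, y):
    # Predict the label of the single closest training sample.
    train = list(zip(X, y))
    return lambda x: min(train, key=lambda p: sq_dist(x, p[0]))[1]

def cross_val_accuracy(fit, X, y, k=5, seed=0):
    """Fraction of samples classified correctly over k held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)  # same seed, hence same folds for every method
    correct = 0
    for fold in range(k):
        test_idx = set(idx[fold::k])
        train_idx = [i for i in idx if i not in test_idx]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct += sum(predict(X[i]) == y[i] for i in test_idx)
    return correct / len(X)

def make_toy_data(n_per_class=20, n_features=5, seed=1):
    # Synthetic data (an assumption): class 0 in [0, 1), class 1 in [3, 4).
    rng = random.Random(seed)
    X, y = [], []
    for label, low in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            X.append([low + rng.random() for _ in range(n_features)])
            y.append(label)
    return X, y

X, y = make_toy_data()
methods = {
    "majority": fit_majority,
    "nearest_centroid": fit_nearest_centroid,
    "one_nn": fit_one_nn,
}
results = {name: cross_val_accuracy(fit, X, y) for name, fit in methods.items()}
for name, acc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.2f}")
```

On real data the ranking of methods is of course not known in advance; the point of the sketch is only that every candidate is evaluated on the same held-out folds before one is chosen.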

Case study

In an ongoing research project in collaboration with the laboratory of clinical chemistry and rheumatology (see also this project), we apply machine learning techniques to the diagnosis of inflammatory diseases from proteomic mass spectra. A database containing data from healthy and diseased patients has been gathered using Surface Enhanced Laser Desorption/Ionisation Time-of-Flight Mass Spectrometry (SELDI-TOF-MS). Several machine learning tools were applied to these data. The results, in terms of both the accuracy of the diagnostic rule and the biomarkers identified, are very promising. The methodology is moreover generic and could be applied to data obtained from other medical instrumentation, such as microarrays.
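As a rough illustration of how a tree-based ensemble can point to candidate biomarkers, the sketch below ranks the peaks of synthetic "spectra" by how often each one is chosen as the best split across bootstrap resamples. This is a deliberately simplified stand-in (bagged one-level decision stumps), not the method used in the project or in the publications listed below; the data, the marker position and all function names are assumptions made for the example.

```python
import random
from collections import Counter

def best_stump_feature(X, y):
    """Index of the feature whose best single threshold misclassifies the fewest samples."""
    best_feature, best_errors = None, len(y) + 1
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            # Errors of the rule "predict class 1 when intensity > t".
            wrong = sum((row[j] > t) != (label == 1) for row, label in zip(X, y))
            # The opposite rule makes len(y) - wrong errors; keep the better one.
            errors = min(wrong, len(y) - wrong)
            if errors < best_errors:
                best_feature, best_errors = j, errors
    return best_feature

def rank_features(X, y, n_rounds=25, seed=0):
    """Count how often each feature wins the split over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(X)
    votes = Counter()
    for _ in range(n_rounds):
        sample = [rng.randrange(n) for _ in range(n)]  # draw n samples with replacement
        Xb = [X[i] for i in sample]
        yb = [y[i] for i in sample]
        votes[best_stump_feature(Xb, yb)] += 1
    return votes.most_common()

def make_toy_spectra(n_per_class=30, n_peaks=10, marker=3, seed=1):
    # Synthetic data (an assumption): every peak is uniform noise in [0, 1),
    # except the marker peak, which is shifted by +2 for class 1 ("diseased").
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.random() for _ in range(n_peaks)]
            row[marker] += 2.0 * label
            X.append(row)
            y.append(label)
    return X, y

X, y = make_toy_spectra()
ranking = rank_features(X, y)
print(ranking[:3])  # (peak index, number of rounds won) pairs, most frequent first
```

A peak that is selected in most rounds is a candidate biomarker worth closer inspection; on real spectra such a ranking would still need to be validated on independent samples.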

Publications

Proteomic mass spectra classification using decision tree based ensemble methods
P. Geurts, M. Fillet, D. de Seny, M.-A. Meuwis, M.-P. Merville, L. Wehenkel
Bioinformatics 2005; 21: 3138-3145.
Discovery of new rheumatoid arthritis biomarkers using SELDI-TOF-MS ProteinChip approach
Dominique de Seny, Marianne Fillet, Marie-Alice Meuwis, Pierre Geurts, Laurence Lutteri, Clio Ribbens, Vincent Bours, Louis Wehenkel, Jacques Piette, Michel Malaise, Marie-Paule Merville
Accepted for publication in Arthritis and Rheumatism - 2005
Segment and combine approach for Biological Sequence Classification
Pierre Geurts, Antia Blanco Cuesta, Louis Wehenkel
To appear in Proc. IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2005) - 2005

Final thesis

2004-2005

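Développement d'un outil de recherche dans les bases de données génétiques (Development of a search tool for genetic databases), Benjamin Renwart.
Analyse de séquences biologiques par arbres de décision (Analysis of biological sequences with decision trees), Antia Blanco Cuesta.
Apprentissage automatique sur données biomédicales à l'aide de méthodes à base de noyaux (Machine learning on biomedical data using kernel-based methods), Christophe Grosfils.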

2002-2003

Application de l'apprentissage automatique à l'extraction et la gestion de connaissance des DNA Arrays (Application of machine learning to knowledge extraction and management for DNA arrays), François Van Lishout.

2000-2001

Application de l'apprentissage automatique à la localisation de gènes à effets quantitatifs (Application of machine learning to the localization of genes with quantitative effects), Estelle Graas.

Life Science Image Classification

Contact: Raphaël Marée, Louis Wehenkel

With improvements in sensor and image acquisition technology, there are many opportunities to gather image data about the natural world. Scientists have investigated the possibility of using these image data together with computer vision technology to solve real-world problems. This has led to successful applications such as the classification of blood cells and human radiographs, the identification of animal species or individuals (molluscs, salamanders, ...), the recognition of seeds, etc. Using databases of images labeled by human experts, scientists have been able to design specific computer vision programs that automatically classify previously unseen images of such natural "objects" or estimate medical parameters.

Here at the Montefiore Institute / GIGA Bioinformatics Unit, we developed and adapted a new generic Data Mining approach for image classification. It was successfully applied to several types of image classification problems: recognition of digits, faces, 3D objects, textures, buildings, general purpose photographs, ...
We now plan to apply such techniques to image classification problems related to the life sciences. As our approach has already been tested successfully on a large number of problems, we expect good performance on natural world images. Because the approach is automatic and generic, we believe that interesting results could be obtained rapidly if human experts simply provide a set of labeled images.
The first step of such a project consists in providing images (in usual computer formats) together with their labels. The task of our method is then to automatically construct a model able to classify new images. First results are obtained rapidly in the form of error rates on a set of images. If the evaluation is successful, an autonomous program (classifier) can then be developed and integrated into the real environment.
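The workflow described above (labeled images in, a model out, then an error rate on a set of images) can be sketched as follows. The sketch borrows the random-subwindow idea named in the publication listed below, but swaps the decision trees for a 1-nearest-neighbour rule to stay short, so it should not be read as the unit's actual method. The images are tiny synthetic grids, and every name and parameter value is an assumption made for the example.

```python
import random
from collections import Counter

def random_subwindows(image, n_windows, size, rng):
    """Extract square patches at random positions, each flattened to a pixel vector."""
    h, w = len(image), len(image[0])
    windows = []
    for _ in range(n_windows):
        top = rng.randrange(h - size + 1)
        left = rng.randrange(w - size + 1)
        windows.append([image[top + r][left + c]
                        for r in range(size) for c in range(size)])
    return windows

def sq_dist(a, b):
    """Squared Euclidean distance between two pixel vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(images, labels, n_windows=5, size=4, seed=0):
    # The "model" is the list of training subwindows, each tagged with
    # the label of the image it was extracted from.
    rng = random.Random(seed)
    return [(win, label)
            for img, label in zip(images, labels)
            for win in random_subwindows(img, n_windows, size, rng)]

def classify(model, image, n_windows=5, size=4, seed=1):
    # Classify each subwindow of the new image by its nearest training
    # subwindow, then take a majority vote over the subwindows.
    rng = random.Random(seed)
    votes = Counter(
        min(model, key=lambda p: sq_dist(win, p[0]))[1]
        for win in random_subwindows(image, n_windows, size, rng)
    )
    return votes.most_common(1)[0][0]

def make_toy_image(label, rng, side=8):
    # Synthetic images (an assumption): "dark" pixels lie in [0, 0.3),
    # "bright" pixels lie in [0.7, 1.0).
    low = 0.0 if label == "dark" else 0.7
    return [[low + 0.3 * rng.random() for _ in range(side)] for _ in range(side)]

rng = random.Random(42)
labels = ["dark", "bright"] * 10
images = [make_toy_image(label, rng) for label in labels]

# Step 1: labeled images. Step 2: build a model. Step 3: error rate on held-out images.
model = train(images[:12], labels[:12])
test_images, test_labels = images[12:], labels[12:]
errors = sum(classify(model, img) != lab for img, lab in zip(test_images, test_labels))
error_rate = errors / len(test_images)
print(f"error rate: {error_rate:.2f}")
```

The two toy classes are trivially separable, so the held-out error rate here says nothing about real images; the sketch only shows the order of the steps and where the expert-provided labels enter.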

Publications

Biomedical Image Classification with Random Subwindows and Decision Trees
Raphaël Marée, Pierre Geurts, Justus Piater, Louis Wehenkel
To appear in Proc. ICCV workshop on Computer Vision for Biomedical Image Applications - 2005

Final thesis

2004-2005

Classement automatique des poudres, Claudio Rudi.
This thesis is about automatic powder classification with machine learning methods. One possible application is in pharmaceutical powders.
