Stochastic Methods


Bioinformatics

(In collaboration with cbig GIGA)

Biological data classification

Contact: Pierre Geurts, Raphaël Marée, Louis Wehenkel

Because of the rapid progress of computer and information technology, large amounts of data are now available in almost every field of application. For example, modern medical instrumentation and acquisition technologies (mass spectrometry, microarrays, sequencing tools, ...) generate large datasets describing patients, animals, tissues, or cells. Analysing such amounts of data is impossible without efficient computer-based tools. Data Mining refers to the application of machine learning and visualization techniques to help a human expert extract potentially interesting and synthetic knowledge from these large volumes of raw data. Potential medical applications include the automatic design of diagnostic or prognostic tools for a given disease and the identification of potential biomarkers for this disease. Among many other applications, machine learning techniques can also be applied to identify coding and non-coding regions in the genome of a given species, or to genetic linkage analysis.

Here at the Systems and Modeling research unit, we analyse existing tools and develop new machine learning methods and methodologies. This fundamental research is often driven by application needs. On the application side, we help users collect, clean, and design their databases. Then, our approach for a given task is to apply and compare several modern machine learning techniques (e.g. decision tree based methods, neural networks, support vector machines, Bayesian networks). These algorithms are developed and adapted internally so as to provide a toolbox of machine learning techniques available to the user.
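To make the "apply and compare" step concrete, here is a minimal sketch in Python. It is not the unit's toolbox: the two classifiers (nearest centroid and 1-nearest-neighbour) are simple stand-ins chosen so the example needs only the standard library, the data are synthetic, and all function names are hypothetical. The point is the comparison protocol, in which every method is scored on the same cross-validation folds.

```python
import random
from statistics import mean


def make_toy_data(n_per_class=40, n_features=5, seed=0):
    """Synthetic two-class data: class 1 is shifted by +2.0 on every feature."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(2.0 * label, 1.0) for _ in range(n_features)])
            y.append(label)
    return X, y


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_centroid(X_train, y_train, X_test):
    """Predict the class whose mean feature vector is closest."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c]))
            for x in X_test]


def one_nearest_neighbour(X_train, y_train, X_test):
    """Predict the label of the single closest training sample."""
    preds = []
    for x in X_test:
        best = min(range(len(X_train)), key=lambda i: sq_dist(x, X_train[i]))
        preds.append(y_train[best])
    return preds


def cross_val_accuracy(method, X, y, k=5, seed=0):
    """Mean accuracy of `method` over k folds of a shuffled dataset."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        preds = method([X[i] for i in train], [y[i] for i in train],
                       [X[i] for i in fold])
        correct = sum(1 for p, i in zip(preds, fold) if p == y[i])
        scores.append(correct / len(fold))
    return mean(scores)


X, y = make_toy_data()
# Same seed for every method, so all methods are scored on identical folds.
results = {m.__name__: cross_val_accuracy(m, X, y)
           for m in (nearest_centroid, one_nearest_neighbour)}
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

Because every method shares the same folds, differences in the reported accuracies reflect the methods rather than the data split. Any other learner with the same `(X_train, y_train, X_test)` signature could be added to the comparison.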

Case study

In an ongoing research project in collaboration with the laboratory of clinical chemistry and rheumatology (see also this project), we apply machine learning techniques to the diagnosis of inflammatory diseases from proteomic mass spectra. A database containing data from healthy and diseased patients has been gathered using Surface-Enhanced Laser Desorption/Ionisation Time-of-Flight Mass Spectrometry (SELDI-TOF-MS). Several machine learning tools were applied to these data. The results, in terms of both the accuracy of the diagnostic rule and the biomarkers identified, are very promising. The methodology is furthermore generic and could be applied to data obtained from other medical instrumentation, such as microarrays.
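The text does not say how candidate biomarkers are identified in this project. As a purely illustrative sketch, one simple way to rank spectral peaks is a univariate separation score: the absolute difference between the class means, divided by the overall spread of the peak. The peak names, the intensities, and the scoring rule below are all hypothetical assumptions, not results or methods from the study.

```python
from statistics import mean, pstdev


def separation_score(healthy_values, diseased_values):
    """Absolute difference of class means divided by the overall spread."""
    spread = pstdev(healthy_values + diseased_values)
    if spread == 0:
        return 0.0
    return abs(mean(healthy_values) - mean(diseased_values)) / spread


def rank_peaks(spectra, labels, peak_names):
    """Return (peak name, score) pairs sorted from most to least separating."""
    scored = []
    for j, name in enumerate(peak_names):
        healthy = [s[j] for s, lab in zip(spectra, labels) if lab == "healthy"]
        diseased = [s[j] for s, lab in zip(spectra, labels) if lab == "diseased"]
        scored.append((name, separation_score(healthy, diseased)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy intensities (invented): one row per patient, one column per peak.
peak_names = ["peak_A", "peak_B", "peak_C"]
spectra = [
    [10.0, 5.1, 3.0],
    [11.0, 4.9, 3.2],
    [10.5, 5.0, 2.9],
    [10.2, 9.0, 3.1],
    [10.8, 9.4, 3.0],
    [10.4, 8.8, 3.3],
]
labels = ["healthy"] * 3 + ["diseased"] * 3

ranking = rank_peaks(spectra, labels, peak_names)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

A univariate score like this ignores interactions between peaks. Multivariate methods, such as the tree-based methods mentioned earlier, can also provide variable rankings and may be better suited to real spectra.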

Publications

Final thesis

2004-2005
2002-2003
2000-2001

Life Science Image Classification

Contact: Raphaël Marée, Louis Wehenkel

With improvements in sensor and image acquisition technology, there are now many ways to gather image data about the natural world. Scientists have investigated the possibility of using these image data together with computer vision technology to solve real-world problems. This has led to some successful applications such as the classification of blood cells and human radiographs, the identification of animal species or individuals (molluscs, salamanders, ...), the recognition of seeds, etc. Using databases of images labeled by human experts, scientists have been able to design specific computer vision programs that automatically classify previously unseen images of such natural "objects" or estimate medical parameters.

Here at the Montefiore Institute / GIGA Bioinformatics Unit, we developed and adapted a new generic Data Mining approach for image classification. It was successfully applied to several types of image classification problems: recognition of digits, faces, 3D objects, textures, buildings, general purpose photographs, ...
We now plan to apply such techniques to image classification problems related to the life sciences. As our approach has already been tested successfully on a large number of problems, we expect good performance on natural world images. Because the approach is automatic and generic, we believe interesting results could be obtained rapidly if human experts simply provide a set of labeled images.
The first step of such a project consists in providing images (in usual computer formats) with labels. The task of our method is then to automatically construct a model able to classify new images. First results are rapidly obtained in the form of error rates on a set of images. If the evaluation is successful, an autonomous program (classifier) can then be developed and integrated into the real environment.
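The workflow just described (labeled images in, model out, error rate on a set of images) can be sketched as follows. This is an assumption-laden illustration and not the approach developed at the unit: the images are synthetic 8x8 grayscale grids, the "model" is a plain 1-nearest-neighbour lookup on raw pixels, and every function name is hypothetical.

```python
import random


def make_toy_image(label, rng, size=8):
    """Synthetic grayscale image: 'bright' images have higher pixel values."""
    base = 0.7 if label == "bright" else 0.3
    return [[min(1.0, max(0.0, rng.gauss(base, 0.15))) for _ in range(size)]
            for _ in range(size)]


def flatten(image):
    """Turn a 2D grid of pixels into a flat feature vector."""
    return [pixel for row in image for pixel in row]


def train(images, labels):
    """Build the 'model': here simply the stored, flattened labeled images."""
    return [(flatten(img), lab) for img, lab in zip(images, labels)]


def classify(model, image):
    """Return the label of the closest stored training image."""
    x = flatten(image)
    _, label = min(
        model,
        key=lambda entry: sum((a - b) ** 2 for a, b in zip(x, entry[0])),
    )
    return label


def error_rate(model, images, labels):
    """Fraction of images whose predicted label differs from the true one."""
    wrong = sum(1 for img, lab in zip(images, labels)
                if classify(model, img) != lab)
    return wrong / len(images)


rng = random.Random(0)
labels = ["bright", "dark"] * 30
images = [make_toy_image(lab, rng) for lab in labels]

# First 40 labeled images build the model; the last 20 are kept unseen.
model = train(images[:40], labels[:40])
err = error_rate(model, images[40:], labels[40:])
print(f"held-out error rate: {err:.2%}")
```

The error rate is measured only on images the model never saw during construction, which is what makes it a fair first estimate of how a deployed classifier would behave on new images.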

Publications

Final thesis

2004-2005