Research
I carry out research in machine learning. Although my interests are manyfold, I have approached two main research topics until now (may 2002):
- The improvement of decision tree based learning algorithms.
- The treatment of temporal data in machine learning.
Improvement of decision tree based algorithms
Decision tree induction (see here for a rudimentary java implementation) is a popular supervised learning algorithm, which focuses on the modeling of input/output relationships in the form of if-then rules. It has been claimed that among the various classes of learning methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining. Indeed, they are quite fast, produce relatively interpretable models, are immune to outliers, resistant to irrelevant variables, insensitive to variable rescaling, and can be easily extended to cope with a large variety of data types (numerical, symbolic, time series...). However, in terms of accuracy they seldom are competitive with other automatic learning algorithms, such as neural networks and nearest neighbors. It is commonly admitted that this suboptimality is mostly due to the excessive variance of this method. The variance of an automatic learning method measures the dependence of the models it produces on the random nature of the data set used (sampling and noise). In the context of decision trees, this variance affects all the parameters of the extracted rules and is detrimental in terms of both accuracy and interpretability. Furthermore, this variance tends to increase very strongly in applications such as time series classification where the input variables correspond to a large number of low level and, sometimes, very noisy features.
In this context, we started our investigations by an empirical study which shows quantitatively how important decision tree variance is, i.e. how strongly they depend on the random nature of the database used to infer them. These experiments confirm that the rules induced for a given problem are indeed highly variable from one learning sample to another, and thus that variance is detrimental not only for accuracy but also from the point of view of interpretability (see [1] for some preliminary results and [3] for a broader study).
With the goal of improving both interpretability and accuracy, we have then investigated three complementary variance reduction techniques:
- First, we have proposed several methods to stabilize the parameters chosen during tree induction (see [1,3]). While these methods succeed in reducing the variability of the parameters, they produce only a slight improvement of accuracy.
- Next, we have considered perturb and combine algorithms, which aggregate predictions of several models obtained by randomizing in some way the learning process. Among the existing such methods, the most popular ones, namely Bagging and Boosting, strongly improve tree accuracy but jeopardize interpretability and computational efficiency. Thus, inspired by the high variance of tree parameters, we have proposed a new algorithm based on extremely randomized trees (extra-trees [3,4]) . This method aggregates the predictions of many large trees, each one obtained by choosing most of its parameters fully at random. These extra-trees compare very favorably with both bagging and boosting, in terms of variance reduction, accuracy and computational efficiency. They are thus largely more accurate than standard trees, while remaining competitive with them in terms of computational efficiency.
- In order to also recover interpretability, we have then proposed a "dual" perturb and combine algorithm [2,3] which randomizes input attributes at the prediction stage while using one single model. In combination with decision trees, this method actually yields soft decisions and eventually bridges the gap between single trees and the ensembles of trees produced by the perturb and combine methods. Of the first, it saves most of the interpretability, and with perturb and combine algorithms, it shares part of the accuracy improvement.
- P. Geurts and L. Wehenkel. Investigation and reduction of discretization variance in decision tree induction. In Proceedings of the 11th European Conference on Machine Learning (ECML2000), pages 162-170, Barcelona, Spain, May 2000.
- P. Geurts. Dual perturb and combine algorithm. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics, pages 196-201, Key-west, Florida, January 2001.
- P. Geurts. Contributions to decision tree induction: bias/variance tradeoff and time series classification. PhD thesis, Department of Electrical Engineering and Computer Science, University of Liège, Belgium. May 2002.
- P.Geurts. Extremely randomized trees. Technical report, University of Liège, June 2003.
- International Workshop on Multiple Classifier Systems
- www.boosting.org
- International school on neural nets: Ensemble Methods for Learning Machines
- ... (to be expanded)
Treatment of temporal data
Classical machine learning algorithms like decision tree induction assume that objects of the learning sample are described by attributes which take simple values, essentially numerical or symbolic. However, because of the success of automatic learning in a lot of domains and because of the few hypotheses underlying such methods (they only need to gather data), automatic learning is applied to a lot of domains which need to handle more complex data types, like image, text, or biological sequences. These data are characterized by large numbers of input variables organized in a one-, two-, or three-dimensional space. They correspond to many problems, for the time tackled in an unsatisfactory way by automatic learning, such as text classification, time-series analysis, image and biological data analysis. Generally these fields correspond to very large databases and are treated in an ad hoc way using prior knowledge about the application. However, generic techniques such as decision tree induction generally fail when applied to these problems unless the large number of rather low level basic variables are replaced by hand, by a much smaller number of features extracted by domain experts. Among all these types of data, we have focused in our research on temporal data.
Our earlier work in this domain aimed at handling the early detection problem. In general, the early detection problem corresponds to applications where one seeks to induce a monitoring system that can trigger emergency procedures of a system. Examples of applications are intrusion detection, patient monitoring in a medical application, or anticipation of power systems black-out. One way to handle this problem by automatic learning is to use a constant (over time) desired output, a causal temporal model, and to incorporate into the search procedure used by the automatic learning method a preference with respect to models which carry out the detection as early as possible. Note that in these problems, accuracy and earliness of detection are usually competing objectives, because in many detection problems the ``undesired'' behavior tends to become sooner or later obvious to detect. For example, in the work reported in [1,2], we have designed a method based on temporal decision trees for the early on-line detection of instability in electric power systems; in this application the objective was to detect (or predict) instability when it is still time to prevent it, i.e. before its undesired consequences start to show up. In this problem, instability sooner or later translates into a monotone and strong decrease of voltage levels in the power system, and, without bias towards early detection, the automatic learning method would have produced useless models detecting instability when it is already far too late to react.
More recently, we have relaxed the early detection constraint and restricted ourselves to supervised classification problems with numerical time series. In this case, objects of the learning sample correspond to trajectories of a dynamic system which are described by several temporal attributes and the goal of learning is to discriminate among the trajectories according to a given classification. Many practical problems satisfy this definition. For example, in speech recognition, each object corresponds to the utterance of a word and is described by the evolution of the acoustic speech signal over time. The goal of learning is to recognize the word which has been pronounced. Other application domains concern for example computer system intrusion detection, gestures recognition, and electrocardiogram (ECG) signal classification.
We have evaluated three approaches to solve the time series classification
problem. First, the most direct way to solve this problem is to apply existing
learning algorithms on low-level variables which correspond to the values of a
time series at several time points. Experiments with the tree-based algorithms
studied in the first research topic show that this approach has its
limitations (see [3,4]). We have thus designed a variance reduction technique
specifically for this kind of data. This algorithm (see [4]) consists in aggregating the
predictions given by a classification model based on small subsequences of the
original time series. This method is very successful in terms of accuracy but
does not provide fully interpretable models. We have therefore proposed a
second method which extends decision tree tests by allowing them to detect
local shift invariant properties, or patterns, in time series. This method,
although more limited in scope, offers the possibility to extract interpretable
and accurate rules from time series (see [3,4]).
References:
- P. Geurts and L. Wehenkel. Early prediction of electric power system blackouts by temporal machine learning. In Proceedings of the ICML98-AAAI98 Workshop on "Predicting the Future: AI Approaches to Time-Series Problems", Madison (Wisconsin), 1998.
- P. Geurts and L. Wehenkel. Temporal machine learning for switching control. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD2000), Lyon, France, September 2000.
- P. Geurts. Pattern extraction for time series classification. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD2001), pages 115-127, Freiburg, Germany, September 2001.
- P. Geurts. Contributions to decision tree induction: bias/variance tradeoff and time series classification. PhD thesis, Department of Electrical Engineering and Computer Science, University of Liège, Belgium. May 2002.