Pierre Geurts


Research


I carry out research in machine learning. Although my interests are manyfold, I have approached two main research topics until now (may 2002): The following text gives a summary of my research results in each of these domains. For a list of my publications, see here.


Improvement of decision tree based algorithms

Decision tree induction (see here for a rudimentary java implementation) is a popular supervised learning algorithm, which focuses on the modeling of input/output relationships in the form of if-then rules. It has been claimed that among the various classes of learning methods, decision trees come closest to meeting the requirements for serving as an off-the-shelf procedure for data mining. Indeed, they are quite fast, produce relatively interpretable models, are immune to outliers, resistant to irrelevant variables, insensitive to variable rescaling, and can be easily extended to cope with a large variety of data types (numerical, symbolic, time series...). However, in terms of accuracy they seldom are competitive with other automatic learning algorithms, such as neural networks and nearest neighbors. It is commonly admitted that this suboptimality is mostly due to the excessive variance of this method. The variance of an automatic learning method measures the dependence of the models it produces on the random nature of the data set used (sampling and noise). In the context of decision trees, this variance affects all the parameters of the extracted rules and is detrimental in terms of both accuracy and interpretability. Furthermore, this variance tends to increase very strongly in applications such as time series classification where the input variables correspond to a large number of low level and, sometimes, very noisy features.

In this context, we started our investigations by an empirical study which shows quantitatively how important decision tree variance is, i.e. how strongly they depend on the random nature of the database used to infer them. These experiments confirm that the rules induced for a given problem are indeed highly variable from one learning sample to another, and thus that variance is detrimental not only for accuracy but also from the point of view of interpretability (see [1] for some preliminary results and [3] for a broader study).

With the goal of improving both interpretability and accuracy, we have then investigated three complementary variance reduction techniques:

References:
  1. P. Geurts and L. Wehenkel. Investigation and reduction of discretization variance in decision tree induction. In Proceedings of the 11th European Conference on Machine Learning (ECML2000), pages 162-170, Barcelona, Spain, May 2000.
  2. P. Geurts. Dual perturb and combine algorithm. In Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics, pages 196-201, Key-west, Florida, January 2001.
  3. P. Geurts. Contributions to decision tree induction: bias/variance tradeoff and time series classification. PhD thesis, Department of Electrical Engineering and Computer Science, University of Liège, Belgium. May 2002.
  4. P.Geurts. Extremely randomized trees. Technical report, University of Liège, June 2003.
Related links:

Treatment of temporal data

Classical machine learning algorithms like decision tree induction assume that objects of the learning sample are described by attributes which take simple values, essentially numerical or symbolic. However, because of the success of automatic learning in a lot of domains and because of the few hypotheses underlying such methods (they only need to gather data), automatic learning is applied to a lot of domains which need to handle more complex data types, like image, text, or biological sequences. These data are characterized by large numbers of input variables organized in a one-, two-, or three-dimensional space. They correspond to many problems, for the time tackled in an unsatisfactory way by automatic learning, such as text classification, time-series analysis, image and biological data analysis. Generally these fields correspond to very large databases and are treated in an ad hoc way using prior knowledge about the application. However, generic techniques such as decision tree induction generally fail when applied to these problems unless the large number of rather low level basic variables are replaced by hand, by a much smaller number of features extracted by domain experts. Among all these types of data, we have focused in our research on temporal data.

Our earlier work in this domain aimed at handling the early detection problem. In general, the early detection problem corresponds to applications where one seeks to induce a monitoring system that can trigger emergency procedures of a system. Examples of applications are intrusion detection, patient monitoring in a medical application, or anticipation of power systems black-out. One way to handle this problem by automatic learning is to use a constant (over time) desired output, a causal temporal model, and to incorporate into the search procedure used by the automatic learning method a preference with respect to models which carry out the detection as early as possible. Note that in these problems, accuracy and earliness of detection are usually competing objectives, because in many detection problems the ``undesired'' behavior tends to become sooner or later obvious to detect. For example, in the work reported in [1,2], we have designed a method based on temporal decision trees for the early on-line detection of instability in electric power systems; in this application the objective was to detect (or predict) instability when it is still time to prevent it, i.e. before its undesired consequences start to show up. In this problem, instability sooner or later translates into a monotone and strong decrease of voltage levels in the power system, and, without bias towards early detection, the automatic learning method would have produced useless models detecting instability when it is already far too late to react.

More recently, we have relaxed the early detection constraint and restricted ourselves to supervised classification problems with numerical time series. In this case, objects of the learning sample correspond to trajectories of a dynamic system which are described by several temporal attributes and the goal of learning is to discriminate among the trajectories according to a given classification. Many practical problems satisfy this definition. For example, in speech recognition, each object corresponds to the utterance of a word and is described by the evolution of the acoustic speech signal over time. The goal of learning is to recognize the word which has been pronounced. Other application domains concern for example computer system intrusion detection, gestures recognition, and electrocardiogram (ECG) signal classification.

We have evaluated three approaches to solve the time series classification problem. First, the most direct way to solve this problem is to apply existing learning algorithms on low-level variables which correspond to the values of a time series at several time points. Experiments with the tree-based algorithms studied in the first research topic show that this approach has its limitations (see [3,4]). We have thus designed a variance reduction technique specifically for this kind of data. This algorithm (see [4]) consists in aggregating the predictions given by a classification model based on small subsequences of the original time series. This method is very successful in terms of accuracy but does not provide fully interpretable models. We have therefore proposed a second method which extends decision tree tests by allowing them to detect local shift invariant properties, or patterns, in time series. This method, although more limited in scope, offers the possibility to extract interpretable and accurate rules from time series (see [3,4]).

References:

  1. P. Geurts and L. Wehenkel. Early prediction of electric power system blackouts by temporal machine learning. In Proceedings of the ICML98-AAAI98 Workshop on "Predicting the Future: AI Approaches to Time-Series Problems", Madison (Wisconsin), 1998.
  2. P. Geurts and L. Wehenkel. Temporal machine learning for switching control. In Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD2000), Lyon, France, September 2000.
  3. P. Geurts. Pattern extraction for time series classification. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD2001), pages 115-127, Freiburg, Germany, September 2001.
  4. P. Geurts. Contributions to decision tree induction: bias/variance tradeoff and time series classification. PhD thesis, Department of Electrical Engineering and Computer Science, University of Liège, Belgium. May 2002.
Related links: