# Tobias Jung

Last update: 11/3/2013, 11:33 a.m.

E-mail: toby dot jung at gmail dot com

Previous positions: Post-doc at the University of Liège and the University of Texas at Austin.

Research interests:

• Reinforcement learning, approximate dynamic programming and stochastic optimal control.
• Regression and function approximation, Bayesian methods.
• Numerical implementation.
• Cool applications.

## Publications (ordered by topic)

#### Approximate dynamic programming: Value function approximation

Thesis: Reinforcement Learning with Regularization Networks

Abstract: This thesis aims at autonomously learning optimal behavior for high-dimensional control tasks using reinforcement learning with a kernel-based approach. Harnessing the representational power of kernel-based methods, we hope to escape the so-called 'curse of dimensionality', which otherwise implies an exponential growth in the number of basis functions. Specifically, we apply regularization networks as the underlying function approximator in least-squares based policy evaluation. The samples used to build this approximation are generated online, from an agent interacting with the environment. This poses an enormous computational challenge, since kernel methods inherently scale with O(n³), where n is the number of training samples. Our first contribution is hence an efficient recursive implementation of regularization networks particularly tailored to online learning. This is made possible by the subset of regressors approximation, which approximates the kernel using a vastly reduced number of basis functions. To select this subset, we employ sparse greedy online selection, which automatically constructs a dictionary of basis functions directly from the data stream. Since parsimony is of great importance for efficiency in an online setting, we extend the original procedure to additionally incorporate a notion of relevance, i.e., the reduction of error in the regression task, thereby eliminating redundancy on the fly and further reducing the number of basis functions necessary. The resulting new online algorithm is evaluated on a number of benchmark regression problems and shown to perform well (an order of magnitude faster with only a slight decrease of performance) in comparison with state-of-the-art offline methods. We then apply this algorithm to carry out approximate policy evaluation and embed it into an approximate policy iteration framework suitable for optimal control.
The result is a reinforcement learning method that works online and in a model-free fashion, allows non-deterministic transitions, is very data-efficient and effortlessly scales to high-dimensional state-spaces. We demonstrate this using some high-dimensional benchmarks, among them RoboCup-Keepaway and the recently proposed control of an octopus-arm.

Tobias Jung. Reinforcement Learning with Regularization Networks (written in German). In: Dissertation, Johannes Gutenberg Universitaet Mainz, 2007 (download thesis .pdf) (download slides .pdf)
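The sparse greedy online selection described above can be illustrated with a much-simplified sketch: instead of the full approximate linear dependence test, a sample is admitted to the dictionary only if its kernel similarity to every current dictionary element stays below a threshold. Kernel width and threshold here are illustrative choices, not values from the thesis.

```python
import math

def rbf(x, y, sigma=0.5):
    """Gaussian (RBF) kernel on scalars."""
    return math.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def build_dictionary(stream, nu=0.9, sigma=0.5):
    """Greedily keep only samples that are not already well represented:
    a sample is redundant if its kernel similarity to some dictionary
    element reaches the threshold nu (a simplified novelty test)."""
    dictionary = []
    for x in stream:
        if all(rbf(x, d, sigma) < nu for d in dictionary):
            dictionary.append(x)
    return dictionary
```

Near-duplicate samples are filtered out on the fly, so the dictionary stays much smaller than the data stream while still covering it.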

Learning RoboCup-Keepaway with Kernels

Abstract: We apply kernel-based methods to solve the difficult reinforcement learning problem of 3vs2 keepaway in RoboCup simulated soccer. Key challenges in keepaway are the high dimensionality of the state space (rendering conventional discretization-based function approximation like tile coding infeasible), the stochasticity due to noise and multiple learning agents needing to cooperate (meaning that the exact dynamics of the environment are unknown), and real-time learning (meaning that an efficient online implementation is required). We employ the general framework of approximate policy iteration with least-squares-based policy evaluation (LSPE and LSTD). As the underlying function approximator we consider the family of regularization networks with subset of regressors approximation. The core of our proposed solution is an efficient recursive implementation with automatic supervised selection of relevant basis functions. Simulation results indicate that the behavior learned through our approach clearly outperforms the best results obtained with tile coding by Stone et al. (2005).

Tobias Jung and Daniel Polani. Learning RoboCup-Keepaway with Kernels. In: JMLR Workshop and Conference Proceedings (Gaussian Processes in Practice), Vol. 1:33-57, 2007 (download article .pdf)

Kernelizing LSPE(lambda)

Abstract: We propose the use of kernel-based methods as the underlying function approximator in the least-squares based policy evaluation framework of LSPE(λ) and LSTD(λ). In particular we present the 'kernelization' of model-free LSPE(λ). The 'kernelization' is computationally made possible by the subset of regressors approximation, which approximates the kernel using a vastly reduced number of basis functions. The core of our proposed solution is an efficient recursive implementation with automatic supervised selection of the relevant basis functions. The LSPE method is well suited for optimistic policy iteration and can thus be used in the context of online reinforcement learning. We use the high-dimensional Octopus benchmark to demonstrate this.

Tobias Jung and Daniel Polani. Kernelizing LSPE(lambda). In: Proc. IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL), p. 338-345, 2007 (download paper .pdf) (download slides .pdf)
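For intuition, here is a minimal (non-kernelized) LSTD(0) sketch on a two-state chain with one-hot features; the kernelized LSPE/LSTD of the paper replaces these fixed features with basis functions selected from the data. The toy chain and its rewards are illustrative.

```python
def phi(s):
    """One-hot features for a two-state chain."""
    return [1.0, 0.0] if s == 0 else [0.0, 1.0]

def lstd(transitions, gamma=0.9):
    """LSTD(0): accumulate A = sum phi(s)(phi(s) - gamma*phi(s'))^T and
    b = sum phi(s)*r over sampled transitions, then solve A w = b.
    The 2x2 system is solved in closed form."""
    A = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for s, r, s2 in transitions:
        f, f2 = phi(s), phi(s2)
        for i in range(2):
            b[i] += f[i] * r
            for j in range(2):
                A[i][j] += f[i] * (f[j] - gamma * f2[j])
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (b[1] * A[0][0] - b[0] * A[1][0]) / det]
```

With transitions 0→1 (reward 0) and 1→1 (reward 1) and γ = 0.9, the fixed point is V(1) = 1/(1-γ) = 10 and V(0) = γ·V(1) = 9, which the solver recovers exactly.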

Least-Squares SVM for Least-Squares TD Learning

Abstract: We formulate the problem of least squares temporal difference learning (LSTD) in the framework of least squares SVM (LS-SVM). To cope with the large amount (and possibly sequential nature) of training data arising in reinforcement learning, we employ a subspace-based variant of LS-SVM that sequentially processes the data and is hence especially suited for online learning. This approach is adapted from the context of Gaussian process regression and turns the unwieldy original optimization problem (with computational complexity cubic in the number of processed data) into a reduced problem (with computational complexity linear in the number of processed data). We introduce a QR decomposition based approach to solve the resulting generalized normal equations incrementally that is numerically more stable than existing recursive least squares based update algorithms. We also allow a forgetting factor in the updates to track non-stationary target functions (i.e. for use with optimistic policy iteration). Experimental comparison with standard CMAC function approximation indicates that LS-SVMs are well suited for online RL.

Tobias Jung and Daniel Polani. Least-Squares SVM for Least-Squares TD Learning. In: Proc. of European Conference on Artificial Intelligence (ECAI), p. 499-503, 2006 (download paper .pdf) (download slides .pdf)
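The role of the forgetting factor can be sketched with ordinary scalar recursive least squares, a much-simplified stand-in for the paper's QR-based incremental solver (data and parameters below are illustrative): older samples are discounted geometrically, so the estimate tracks a drifting target.

```python
def rls_forgetting(samples, lam=0.95, p0=1000.0):
    """Scalar recursive least squares with forgetting factor lam,
    fitting y ≈ w * x. lam < 1 geometrically down-weights old data,
    letting w follow a non-stationary target."""
    w, P = 0.0, p0          # estimate and (scalar) inverse covariance
    for x, y in samples:
        k = P * x / (lam + x * P * x)   # gain
        w += k * (y - w * x)            # innovation update
        P = (P - k * x * P) / lam       # covariance update with forgetting
    return w
```

Fed 50 samples of y = 2x followed by 100 samples of y = 3x, the estimate first settles near 2 and then tracks the new slope 3; with lam = 1 (no forgetting) it would instead converge to a stale average.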

Experiments in Value Function Approximation with Sparse Support Vector Regression

Abstract: We present first experiments using Support Vector Regression as function approximator for an online, *sarsa*-like reinforcement learner. To overcome the batch nature of SVR, two ideas are employed. The first is sparse greedy approximation: the data is projected onto the subspace spanned by only a small subset of the original data (in feature space). This subset can be built up in an online fashion. Second, we use the sparsified data to solve a reduced quadratic problem, where the number of variables is independent of the total number of training samples seen. The feasibility of this approach is demonstrated on two common toy problems.

Tobias Jung and Thomas Uthmann. Experiments in Value Function Approximation with Sparse Support Vector Regression. In: Proc. of European Conference on Machine Learning (ECML), p. 180-191, 2004 (download paper .pdf) (download slides .pdf)

#### Approximate dynamic programming and reinforcement learning

Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-Like Exploration

Abstract: We present an implementation of model-based online reinforcement learning (RL) for continuous domains with deterministic transitions that is specifically designed to achieve low sample complexity. To achieve low sample complexity, since the environment is unknown, an agent must intelligently balance exploration and exploitation, and must be able to rapidly generalize from observations. While in the past a number of sample-efficient RL algorithms have been proposed with provable PAC-style performance guarantees for both discrete (e.g., RMAX) and continuous (e.g., ARL) domains, to allow theoretical analysis, only methods with weak generalization capabilities were considered. We present an implementation of RMAX for continuous domains that separates function approximation in the model learner (which does require samples) from the interpolation in the planner (which does not require samples). For model learning we apply Gaussian process regression (GP), which is able to automatically adjust itself to the complexity of the problem (via Bayesian hyperparameter selection) and, in practice, able to learn a highly accurate model from very little data. In addition, a GP provides a natural way to determine the uncertainty of its predictions, which allows us to implement the "optimism in the face of uncertainty" exploration of RMAX in a principled way. Our method is evaluated on four common benchmark domains.

Tobias Jung and Peter Stone. Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-Like Exploration. In: Proc. of European Conference on Machine Learning (ECML), 2010 (download paper .pdf) (download slides .pdf)
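A minimal 1-D sketch of the "optimism in the face of uncertainty" mechanism: compute the GP posterior at a query point and fall back to an optimistic value wherever the predictive variance is large. The kernel length-scale, noise level, threshold, and the function name `gp_rmax_value` are illustrative, not the paper's settings.

```python
import math

def rbf(a, b, ell=0.3):
    """Squared-exponential kernel on scalars."""
    return math.exp(-(a - b) ** 2 / (2 * ell ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting (small systems)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_rmax_value(x, xs, ys, v_max, noise=1e-6, var_threshold=0.5):
    """GP posterior mean at x; if the predictive variance is high
    (the state is 'unknown' in RMAX terms), return optimistic v_max."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    kvec = [rbf(x, xi) for xi in xs]
    alpha = solve(K, ys)                      # K^-1 y
    mean = sum(kvec[i] * alpha[i] for i in range(n))
    beta = solve(K, kvec)                     # K^-1 k(x)
    var = rbf(x, x) - sum(kvec[i] * beta[i] for i in range(n))
    return v_max if var > var_threshold else mean
```

Near observed data the variance collapses and the learned mean is trusted; far from all data the variance approaches the prior and the optimistic value drives the agent to explore.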

Feature Selection for Value Function Approximation Using Bayesian Model Selection

Abstract: Feature selection in reinforcement learning (RL), i.e., choosing basis functions such that useful approximations of the unknown value function can be obtained, is one of the main challenges in scaling RL to real-world applications. Here we consider the Gaussian process based framework GPTD for approximate policy evaluation, and propose feature selection through marginal likelihood optimization of the associated hyperparameters. Our approach has two appealing benefits: (1) given just sample transitions, we can solve the policy evaluation problem fully automatically (without looking at the learning task, and, in theory, independent of the dimensionality of the state space), and (2) model selection allows us to consider more sophisticated kernels, which in turn enable us to identify relevant subspaces and eliminate irrelevant state variables such that we can achieve substantial computational savings and improved prediction performance.

Tobias Jung and Peter Stone. Feature Selection for Value Function Approximation Using Bayesian Model Selection. In: Proc. of European Conference on Machine Learning (ECML), 2009 (download paper .pdf) (download slides .pdf)

#### Information theory and control

Empowerment for Continuous Agent-Environment Systems

Abstract: This paper develops generalizations of empowerment to continuous states. Empowerment is a recently introduced information-theoretic quantity motivated by hypotheses about the efficiency of the sensorimotor loop in biological organisms, but also by considerations stemming from curiosity-driven learning. Empowerment measures, for agent-environment systems with stochastic transitions, how much influence an agent has on its environment, but only the influence that can be sensed by the agent's sensors. It is an information-theoretic generalization of joint controllability (influence on environment) and observability (measurement by sensors) of the environment by the agent, both controllability and observability being usually defined in control theory as the dimensionality of the control/observation spaces. Earlier work has shown that empowerment has various interesting and relevant properties, e.g., it allows us to identify salient states using only the dynamics, and it can act as intrinsic reward without requiring an external reward. However, in this previous work empowerment was limited to the case of small-scale and discrete domains, and furthermore state transition probabilities were assumed to be known. The goal of this paper is to extend empowerment to the significantly more important and relevant case of continuous vector-valued state spaces and initially unknown state transition probabilities. The continuous state space is addressed by Monte-Carlo approximation; the unknown transitions are addressed by model learning and prediction, for which we apply Gaussian process regression with iterated forecasting. In a number of well-known continuous control tasks we examine the dynamics induced by empowerment and include an application to exploration and online model learning.

Tobias Jung, Daniel Polani and Peter Stone. Empowerment for Continuous Agent-Environment Systems. In: Adaptive Behavior, Vol. 19(1), pp. 16-39, 2011 (download article .pdf) (download slides from GSO Workshop '11 .pdf)
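Empowerment of a state is the Shannon capacity of the channel from actions to resulting successor states. For a known finite model it can be computed with the Blahut-Arimoto algorithm, as in this sketch; the continuous, model-free case treated in the paper additionally requires Monte-Carlo approximation and learned GP models.

```python
import math

def empowerment(p, iters=200):
    """Channel capacity (in nats) of p[a][s] = P(successor s | action a),
    computed by Blahut-Arimoto iteration over the action distribution q."""
    n_a, n_s = len(p), len(p[0])
    q = [1.0 / n_a] * n_a
    for _ in range(iters):
        m = [sum(q[a] * p[a][s] for a in range(n_a)) for s in range(n_s)]
        # multiplicative update: q[a] proportional to q[a]*exp(KL(p(.|a)||m))
        w = []
        for a in range(n_a):
            kl = sum(p[a][s] * math.log(p[a][s] / m[s])
                     for s in range(n_s) if p[a][s] > 0)
            w.append(q[a] * math.exp(kl))
        z = sum(w)
        q = [x / z for x in w]
    m = [sum(q[a] * p[a][s] for a in range(n_a)) for s in range(n_s)]
    return sum(q[a] * p[a][s] * math.log(p[a][s] / m[s])
               for a in range(n_a) for s in range(n_s) if p[a][s] > 0)
```

Two actions with perfectly distinguishable outcomes give capacity log 2; if both actions induce the same successor distribution, the agent has no sensed influence and empowerment is zero.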

Evolution and Learning: Evolution of Sensors in a Simple MDP Environment

Abstract: Natural intelligence and autonomous agents face difficulties when acting in information-dense environments. Assailed by a multitude of stimuli, they have to make sense of the inflow of information, filtering and processing what is necessary, but discarding that which is unimportant. This article aims at investigating the interactions between *evolution* of the sensorial channel extracting the information from the environment and the simultaneous *individual adaptation* of agent control. Our particular goal is to study the influence of learning on the evolution of sensors, with learning duration being the tunable parameter. A genetic algorithm governs the evolution of sensors appropriate for the agent solving a simple grid world task. The performance of the agent is taken as fitness; 'sensors' are conceived as a map from environmental states to agent observations, and individual adaptation is modeled by Q-learning. Our experimental results show that, due to the principles of cognitive economy, learning and varying the degree thereof actually transforms the fitness landscape. In particular we identify a trade-off between learning speed (load) and sensor accuracy (error). These results are further reinforced by theoretical analysis: we derive an analytical measure for the quality of sensors based on the mutual entropy between the system of states and the selection of an optimal action, a concept recently proposed by Polani, Martinetz and Kim.

Tobias Jung, Peter Dauscher and Thomas Uthmann. Evolution and Learning: Evolution of Sensors in a Simple MDP Environment. In: Adaptive Behavior, Vol. 11(3):159-177, 2003 (download article .pdf)
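The sensor-as-map idea can be sketched as follows: a tabular Q-learner on a small chain conditions only on `sensor(state)`, so the choice of sensor determines what can be learned. The chain task, optimistic initialization, and all parameters are illustrative, not those of the article.

```python
import random

def q_learning(sensor, episodes=500, alpha=0.2, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning on the chain 0-1-2-3-4 with reward 1 for
    reaching state 4; the agent sees only sensor(state). Optimistic
    initialization (q defaults to 1.0) drives systematic exploration."""
    rng = random.Random(seed)
    actions = (-1, +1)
    Q = {}
    def q(o, a):
        return Q.get((o, a), 1.0)       # optimistic default value
    for _ in range(episodes):
        s = 0
        for _ in range(50):
            o = sensor(s)
            if rng.random() < eps:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda b: q(o, b))
            s2 = min(4, max(0, s + a))
            r = 1.0 if s2 == 4 else 0.0
            # terminal transition bootstraps with the reward only
            target = r if s2 == 4 else r + gamma * max(q(sensor(s2), b)
                                                       for b in actions)
            Q[(o, a)] = q(o, a) + alpha * (target - q(o, a))
            s = s2
            if s2 == 4:
                break
    return Q
```

With the identity sensor the learner distinguishes all states and prefers moving right near the goal; an aliasing sensor such as `lambda s: s // 2` merges states and caps the achievable policy quality.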

Optimized Look-Ahead Tree Policies: A Bridge Between Look-Ahead Tree Policies and Direct Policy Search

Abstract: Direct policy search (DPS) and look-ahead tree (LT) policies are two popular techniques for solving difficult sequential decision-making problems. They both are simple to implement, widely applicable without making strong assumptions on the structure of the problem, and capable of producing high performance control policies. However, computationally both of them are, each in their own way, very expensive. DPS can require huge offline resources (effort required to obtain the policy) to first select an appropriate space of parameterized policies that works well for the targeted problem, and then to determine the best values of the parameters via global optimization. LT policies do not require any offline resources; however, they typically require huge online resources (effort required to calculate the best decision at each step) in order to grow trees of sufficient depth. In this paper, we propose optimized look-ahead trees (OLT), a model-based policy learning scheme that lies at the intersection of DPS and LT. In OLT, the control policy is represented indirectly through an algorithm that at each decision step develops, as in LT using a model of the dynamics, a small look-ahead tree until a prespecified online budget is exhausted. Unlike LT, the development of the tree is not driven by a generic heuristic; rather, the heuristic is optimized for the target problem and implemented as a parameterized node scoring function learned offline via DPS. We experimentally compare OLT with pure DPS and pure LT variants on optimal control benchmark domains. 
The results show that the LT-based representation is a versatile way of compactly representing policies in a DPS scheme (which results in OLT being easier to tune and having lower offline complexity than pure DPS); while at the same time, DPS helps to significantly reduce the size of the look-ahead trees that are required to take high-quality decisions (which results in OLT having lower online complexity than pure LT). Moreover, OLT produces overall better performing policies than pure DPS and pure LT and also results in policies that are robust with respect to perturbations of the initial conditions.

Tobias Jung, Louis Wehenkel, Damien Ernst and Francis Maes. Optimized Look-Ahead Tree Policies: A Bridge Between Look-Ahead Tree Policies and Direct Policy Search. In: International Journal of Adaptive Control and Signal Processing, 2013 (to appear) (download paper .pdf) (download slides from RL Workshop Barbados'13 .pdf)
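A toy sketch of the OLT decision step: develop a look-ahead tree best-first under a fixed budget of calls to the generative model, always expanding the leaf ranked highest by a parameterized scoring function, then execute the first action on the path to the best discounted return found. The linear scoring function and its weights `theta` are illustrative stand-ins for the heuristic learned offline via direct policy search.

```python
import heapq
import itertools

def olt_decision(s0, model, theta, budget=50, gamma=0.95):
    """One OLT decision: best-first tree development under a budget of
    model calls. model(s, a) -> (next_state, reward). Each frontier
    node stores (neg_score, tiebreak, state, depth, return, first_action)."""
    tick = itertools.count()
    def score(ret, depth):
        # hypothetical node-scoring heuristic (would be optimized offline)
        return theta[0] * ret + theta[1] * gamma ** depth
    frontier, best, calls = [], (float("-inf"), None), 0
    for a in (-1, +1):                       # expand the root
        s, r = model(s0, a)
        calls += 1
        heapq.heappush(frontier, (-score(r, 1), next(tick), s, 1, r, a))
    while frontier and calls < budget:
        _, _, s, d, ret, a0 = heapq.heappop(frontier)
        if ret > best[0]:
            best = (ret, a0)
        for a in (-1, +1):
            s2, r = model(s, a)
            calls += 1
            ret2 = ret + gamma ** d * r      # discounted return along path
            heapq.heappush(frontier,
                           (-score(ret2, d + 1), next(tick), s2, d + 1, ret2, a0))
    return best[1]
```

On a toy integer-line problem where reward is highest at the origin, the decision from state 3 is to move toward 0; tuning `theta` changes how the budget is spent between deep and broad exploration of the tree.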

Optimized Look-Ahead Tree Policies: Extension to Large and Continuous Action Spaces

Abstract: This paper studies look-ahead tree based control policies from the viewpoint of online decision making with constraints on the computational budget allowed per decision (expressed as the number of calls to the generative model). We consider optimized look-ahead tree (OLT) policies, a recently introduced family of hybrid techniques, which combine the advantages of look-ahead trees (high precision) with the advantages of direct policy search (low online cost) and which are specifically designed for limited online budgets. We present two extensions of the basic OLT algorithm that on the one hand allow tackling deterministic optimal control problems with large and continuous action spaces and on the other hand can also help to further reduce the online complexity.

Tobias Jung, Damien Ernst and Francis Maes. Optimized Look-Ahead Tree Policies: Extension to Large and Continuous Action Spaces. In: Proc. of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), 2013 (download paper .pdf) (download slides .pdf)

#### Reinforcement learning and computer networks

Outbound SPIT Filter with Optimal Performance Guarantees

Abstract: This paper presents a formal framework for identifying and filtering SPIT calls (SPam in Internet Telephony) in an outbound scenario with provable optimal performance. In this, our work differs markedly from previous work: our goal is to rigorously formalize the problem in terms of mathematical decision theory, find the optimal solution to the problem, and derive concrete bounds for its expected loss (the number of mistakes the SPIT filter will make in the worst case). This goal is achieved by considering an abstracted scenario amenable to theoretical analysis, namely SPIT detection in an outbound scenario with pure sources. Our methodology is to first define the cost of making an error (false positive and false negative), apply Wald's sequential probability ratio test to the individual sources, and then determine analytically the error probabilities such that the resulting expected loss is minimized. The benefits of our approach are: (1) the method is optimal (in a sense defined in the paper); (2) the method does not rely on manual tuning and tweaking of parameters but is completely self-contained and mathematically justified; (3) the method is computationally simple and scalable. These are desirable features that would make our method a component of choice in larger, autonomic frameworks.

Tobias Jung, Sylvain Martin, Mohamed Nassar, Damien Ernst and Guy Leduc. Outbound SPIT Filter with Optimal Performance Guarantees. In: Computer Networks, 2013 (to appear) (download article .pdf)

Tobias Jung, Sylvain Martin, Damien Ernst and Guy Leduc. SPRT for SPIT: Using the Sequential Probability Ratio Test for Spam in VoIP Prevention. In: Proc. of Int. Conf. on Autonomous Infrastructure, Management and Security (AIMS), 2012. (download paper .pdf)
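The core of both papers is Wald's sequential probability ratio test. For a Bernoulli feature it reduces to a running log-likelihood ratio compared against two thresholds derived from the target error probabilities; the source models below (p0, p1, alpha, beta) are illustrative, not values from the papers.

```python
import math

def sprt(observations, p0=0.2, p1=0.8, alpha=0.01, beta=0.01):
    """Wald's SPRT for a Bernoulli source: H0 (legitimate caller,
    success prob p0) vs H1 (SPIT source, success prob p1). alpha/beta
    are the target false-positive/false-negative probabilities.
    Returns 'H0', 'H1', or 'continue' if the evidence is inconclusive."""
    upper = math.log((1 - beta) / alpha)    # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))    # accept H0 at or below this
    llr = 0.0
    for x in observations:
        llr += math.log((p1 if x else 1 - p1) / (p0 if x else 1 - p0))
        if llr >= upper:
            return "H1"
        if llr <= lower:
            return "H0"
    return "continue"
```

The test stops as soon as the accumulated evidence crosses either threshold, so clear-cut sources are classified after only a handful of observations while ambiguous ones keep being monitored.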

Contextual Multi-armed Bandits for Web Server Defense

Abstract: In this paper we argue that contextual multi-armed bandit algorithms could open avenues for designing self-learning security modules for computer networks and related tasks. The paper has two contributions: a conceptual and an algorithmic one. The conceptual contribution is to formulate the real-world problem of preventing HTTP-based attacks on web servers as a one-shot sequential learning problem, namely as a contextual multi-armed bandit. Our second contribution is to present CMABFAS, a new and computationally very cheap algorithm for general contextual multi-armed bandit learning that specifically targets domains with finite actions. We illustrate how CMABFAS could be used to design a fully self-learning meta filter for web servers that does not rely on feedback from the end-user (i.e., does not require labeled data) and report first convincing simulation results.

Tobias Jung, Sylvain Martin, Damien Ernst and Guy Leduc. Contextual Multi-armed Bandits for Web Server Defense. In: Proc. of International Joint Conference on Neural Networks (IJCNN), 2012 (download paper .pdf) (download slides .pdf)

Tobias Jung, Sylvain Martin, Damien Ernst and Guy Leduc. Contextual Multi-armed Bandits for the Prevention of Spam in VoIP Networks. In: ULg Technical Report 12-07 (arXiv:1201.6181v2), 2012. (download paper .pdf)
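A minimal contextual bandit baseline illustrates the setting (this is a generic per-context epsilon-greedy value learner, not the CMABFAS algorithm itself): observe a context, choose an action, and learn from the reward of the chosen action only.

```python
import random

class ContextualBandit:
    """Per-context epsilon-greedy action-value learner: a minimal
    contextual multi-armed bandit baseline with finite actions."""
    def __init__(self, actions, eps=0.1, seed=0):
        self.actions = actions
        self.eps = eps
        self.rng = random.Random(seed)
        self.value = {}    # (context, action) -> running mean reward
        self.count = {}

    def select(self, context):
        """Explore with probability eps, otherwise act greedily."""
        if self.rng.random() < self.eps:
            return self.rng.choice(self.actions)
        return max(self.actions,
                   key=lambda a: self.value.get((context, a), 0.0))

    def update(self, context, action, reward):
        """Incremental mean update for the chosen (context, action)."""
        key = (context, action)
        n = self.count.get(key, 0) + 1
        self.count[key] = n
        v = self.value.get(key, 0.0)
        self.value[key] = v + (reward - v) / n
```

In a toy security loop with contexts like "attack"/"benign" and actions "block"/"allow", the learner discovers the right response per context from bandit feedback alone, without labeled data.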

#### Other

Connectivity-based Localization in Robot Networks

Abstract: Consider a small team of autonomous robots, each equipped with a radio, that are deployed in an ad-hoc fashion and whose goal it is to act as signal relay nodes to form a temporary, adaptive, and highly robust communication network. To perform this type of self-optimization and self-healing, relative localization (i.e. knowing the direction and distance to every other robot in the network) is necessary. In a sense, the problem is similar to the one studied in ad-hoc sensor networks. The key differences are that (1) anchor nodes with known locations are not available; (2) the connectivity graph is very sparse, because of the comparatively small number of nodes involved; and (3) the communication nodes are actually mobile robots, such that apart from location we also have to estimate the directions to other nodes (which cannot be obtained from a single time slice). To solve this problem, we propose a global approach that exploits the mobility of the robots to obtain multiple connectivity measurements over a small time window. Together with the odometry of individual robots, we then try to estimate underlying locations that best explain the observed connectivity data by minimizing a suitable stress function. Through simulation of a concrete real-world scenario we show that our approach performs reasonably well with as few as ten robots. We examine its performance under both outdoor and indoor conditions (i.e. uniform and non-uniform signal propagation). In addition, we also consider the case where we are able to observe the distance between connected robots, which further improves accuracy substantially.

Tobias Jung, Mazda Ahmadi and Peter Stone. Connectivity-based Localization in Robot Networks. In: Proc. of IEEE Workshop on Robotic Sensor Networks (DCOSS), 2009 (download paper .pdf) (download slides .pdf)
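The stress-minimization step can be sketched as plain gradient descent on the squared mismatch between estimated and measured pairwise distances; recovered positions are determined only up to rotation, translation, and reflection. The learning rate and iteration count are illustrative, and the paper's actual stress function and optimizer may differ.

```python
import math
import random

def localize(dists, n, iters=2000, lr=0.01, seed=0):
    """Recover 2-D positions from pairwise distance estimates by
    gradient descent on sum_{(i,j)} (||x_i - x_j|| - d_ij)^2.
    dists maps node pairs (i, j) to measured distances."""
    rng = random.Random(seed)
    pos = [[rng.random(), rng.random()] for _ in range(n)]
    for _ in range(iters):
        grad = [[0.0, 0.0] for _ in range(n)]
        for (i, j), d in dists.items():
            dx = pos[i][0] - pos[j][0]
            dy = pos[i][1] - pos[j][1]
            dist = math.hypot(dx, dy) + 1e-12
            c = 2.0 * (dist - d) / dist    # d(stress)/d(dist) / dist
            grad[i][0] += c * dx; grad[i][1] += c * dy
            grad[j][0] -= c * dx; grad[j][1] -= c * dy
        for k in range(n):
            pos[k][0] -= lr * grad[k][0]
            pos[k][1] -= lr * grad[k][1]
    return pos
```

For three nodes with all pairwise distances equal to 1, the optimizer recovers an (arbitrarily placed) equilateral triangle whose pairwise distances match the measurements.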

Sequential Learning with LS-SVM for Large-Scale Data Sets

Abstract: We present a subspace-based variant of LS-SVMs (i.e. regularization networks) that sequentially processes the data and is hence especially suited for online learning tasks. The algorithm works by selecting from the data set a small subset of basis functions that is subsequently used to approximate the full kernel on arbitrary points. This subset is identified online from the data stream. We improve upon existing approaches (especially the kernel recursive least squares algorithm) by proposing a new, supervised criterion for the selection of the relevant basis functions that takes into account the approximation error incurred from approximating the kernel as well as the reduction of the cost in the original learning task. We use the large-scale data set 'forest' to compare the performance and efficiency of our algorithm with greedy batch selection of the basis functions via orthogonal least squares. Using the same number of basis functions we achieve comparable error rates at much lower cost (in both CPU time and memory).

Tobias Jung and Daniel Polani. Sequential Learning with LS-SVM for Large-Scale Data Sets. In: Proc. of 16th ICANN, p. 381-390, 2006 (download paper .pdf) (download slides .pdf)

Long Term Prediction of Product Quality in a Glass Manufacturing Process Using a Kernel Based Approach

Abstract: In this paper we report the results obtained using a kernel-based approach to predict the temporal development of four response signals in the process control of a glass melting tank with 16 input parameters. The data set is a revised version of the one from the modelling challenge in EUNITE-2003. The central difficulties are: large time delays between changes in the inputs and the outputs, the large number of data points, and a general lack of knowledge about the relevant variables that intervene in the process. The methodology proposed here comprises Support Vector Machines (SVM) and Regularization Networks (RN). We use the idea of sparse approximation both as a means of regularization and as a means of reducing the computational complexity. Furthermore, we use an incremental approach to add new training examples to the kernel-based method and efficiently update the current solution. This allows us to use a sophisticated learning scheme, where we iterate between prediction and training, with good computational efficiency and satisfactory results.

Tobias Jung, Luis Herrera and Bernhard Schoelkopf. Long Term Prediction of Product Quality in a Glass Manufacturing Process Using a Kernel Based Approach. In: Proc. of 8th IWANN, p. 960-967, 2005 (download paper .pdf) (download slides .pdf)