Last update: 11/3/2013, 11:33 a.m.

**E-mail:** toby dot jung at gmail dot com

**Previous positions:** Post-doc at the University of Liège and the University of Texas at Austin.

**Research interests:**

- Reinforcement learning, approximate dynamic programming and stochastic optimal control.
- Regression and function approximation, Bayesian methods.
- Numerical implementation.
- Cool applications.

NIPS'12 Workshop: Information in Perception and Action

Thesis: Reinforcement Learning with Regularization Networks

**Abstract:**
This thesis aims at autonomously learning optimal behavior for high-dimensional control tasks using
reinforcement learning with a kernel-based approach. Harnessing the representational power of kernel-based
methods, we hope to escape the so-called 'curse of dimensionality', which otherwise implies an exponential
growth in the number of basis functions. Specifically, we apply regularization networks as the
underlying function approximator in least-squares based policy evaluation. The samples used to build
this approximation are generated online, from an agent interacting with the environment. This poses
an enormous computational challenge, since kernel methods inherently scale as O(n³), where n is the
number of training samples.
Our first contribution is therefore an efficient recursive implementation of regularization networks
particularly tailored to online learning. This is made possible by the subset of regressors approximation,
which approximates the kernel using a vastly reduced number of basis functions. To select this subset, we
employ sparse greedy online selection, which automatically constructs a dictionary of basis functions directly
from the data stream. Since parsimony is of great importance for efficiency in an online setting,
we extend the original procedure to additionally incorporate a notion of relevance, i.e., the reduction of
error in the regression task, thereby eliminating redundancy on the fly and further reducing the number
of basis functions needed. The resulting online algorithm is evaluated on a number of benchmark
regression problems and shown to perform well (an order of magnitude faster, with only a slight decrease
in performance) in comparison with state-of-the-art offline methods.
We then apply this algorithm to carry out approximate policy evaluation and embed it in an approximate
policy iteration framework suitable for optimal control. The result is a reinforcement learning method
that works online and in a model-free fashion, allows non-deterministic transitions, is very data-efficient,
and scales effortlessly to high-dimensional state spaces. We demonstrate this on several high-dimensional
benchmarks, among them RoboCup-Keepaway and the recently proposed octopus-arm control task.

**Tobias Jung.**
Reinforcement Learning with Regularization Networks (written in German).
In: *Dissertation, Johannes Gutenberg Universitaet Mainz*, 2007
(download thesis .pdf) (download slides .pdf)
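
The sparse greedy online selection described above can be sketched in a few lines. The following is an illustrative approximate-linear-dependence style test, not the thesis implementation; the RBF kernel, the threshold `nu`, and all function names are assumptions made for the example:

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    """Gaussian (RBF) kernel."""
    return float(np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

def build_dictionary(stream, nu=0.1, gamma=1.0):
    """Sparse greedy online selection: a sample from the stream is added to
    the dictionary only if it cannot be approximated well (residual > nu)
    in feature space by the basis functions selected so far."""
    dictionary = []
    for x in stream:
        if not dictionary:
            dictionary.append(x)
            continue
        K = np.array([[rbf(a, b, gamma) for b in dictionary] for a in dictionary])
        k = np.array([rbf(a, x, gamma) for a in dictionary])
        # Residual of projecting phi(x) onto the span of the dictionary.
        alpha = np.linalg.solve(K + 1e-10 * np.eye(len(K)), k)
        delta = rbf(x, x, gamma) - k @ alpha
        if delta > nu:
            dictionary.append(x)
    return dictionary
```

On a stream of clustered samples the dictionary typically ends up far smaller than the stream itself, which is what makes the cubic kernel cost manageable online.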

Learning RoboCup-Keepaway with Kernels

**Abstract:**
We apply kernel-based methods to solve the difficult reinforcement learning
problem of 3-vs-2 keepaway in RoboCup simulated soccer. Key challenges in keepaway
are the high dimensionality of the state space (rendering conventional
discretization-based function approximation such as tile coding infeasible),
the stochasticity due to noise and multiple learning agents needing
to cooperate (meaning that the exact dynamics of the environment are
unknown) and real-time learning (meaning that an efficient online implementation
is required). We employ the general framework of approximate policy iteration
with least-squares-based policy evaluation (LSPE and LSTD).
As the underlying function approximator we consider the family of regularization
networks with the subset of regressors approximation. The core of our proposed
solution is an efficient recursive implementation with automatic supervised
selection of relevant basis functions.
Simulation results indicate that the behavior learned through our approach
clearly outperforms the best results obtained with tile coding
by Stone et al. (2005).

**Tobias Jung and Daniel Polani.**
Learning RoboCup-Keepaway with Kernels.
In: *JMLR Workshop and Conference Proceedings (Gaussian Processes in Practice)*, Vol. 1:33-57, 2007
(download article .pdf)

Kernelizing LSPE(λ)

**Abstract:**
We propose the use of kernel-based methods as the underlying function approximator
in the least-squares based policy evaluation framework of LSPE(λ) and
LSTD(λ). In particular, we present the 'kernelization' of model-free
LSPE(λ). The 'kernelization' is computationally made possible by using the
subset of regressors approximation, which approximates the kernel using a vastly reduced
number of basis functions. The core of our proposed solution is an efficient
recursive implementation with automatic supervised selection of the relevant
basis functions. The LSPE method is well-suited for optimistic policy iteration
and can thus be used in the context of online reinforcement learning. We use the
high-dimensional Octopus benchmark to demonstrate this.

**Tobias Jung and Daniel Polani.**
Kernelizing LSPE(λ).
In: *Proc. IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL)*, p. 338-345, 2007
(download paper .pdf) (download slides .pdf)
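
For reference, the quantity that least-squares policy evaluation computes can be sketched with explicit linear features; the kernelized variants above maintain the same statistics implicitly via the subset of regressors. This is a generic LSTD(0) sketch, and the feature map and tiny chain task in the test are illustrative assumptions:

```python
import numpy as np

def lstd(transitions, phi, n_features, gamma=0.95, reg=1e-6):
    """Linear LSTD(0): accumulate A = sum phi(s)(phi(s) - gamma*phi(s'))^T
    and b = sum r*phi(s) over sampled transitions (s, r, s'), then solve
    A w = b. The value estimate is V(s) ~ phi(s) . w."""
    A = reg * np.eye(n_features)   # small ridge term for invertibility
    b = np.zeros(n_features)
    for s, r, s_next in transitions:
        f, f_next = phi(s), phi(s_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)
```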

Least-Squares SVM for Least-Squares TD Learning

**Abstract:**
We formulate the problem of least squares temporal difference learning (LSTD) in the
framework of least squares SVM (LS-SVM). To cope with the large amount (and possible
sequential nature) of training data arising in reinforcement learning, we employ a
subspace based variant of LS-SVM that sequentially processes the data and is hence
especially suited for online learning. This approach is adapted from the context
of Gaussian process regression and turns the unwieldy original optimization problem
(with computational complexity being cubic in the number of processed data) into a
reduced problem (with computational complexity being linear in the number of processed data).
We introduce a QR decomposition based approach to solve the resulting generalized normal
equations incrementally that is numerically more stable than existing recursive
least squares based update algorithms. We also allow a forgetting factor in the updates
to track non-stationary target functions (i.e., for use with optimistic policy iteration).
Experimental comparison with standard CMAC function approximation indicates that LS-SVMs are
well-suited for online RL.

**Tobias Jung and Daniel Polani.**
Least-Squares SVM for Least-Squares TD Learning.
In: *Proc. of European Conference on Artificial Intelligence (ECAI)*, p. 499-503, 2006
(download paper .pdf) (download slides .pdf)
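
The flavor of the recursive normal-equation updates, including the forgetting factor, can be illustrated with textbook recursive least squares. The paper's QR-decomposition-based solver is the numerically stabler alternative; the standard form below is only a sketch, with illustrative names and parameters:

```python
import numpy as np

class RecursiveLeastSquares:
    """Recursive least squares with forgetting factor lam in (0, 1]:
    w minimizes sum_t lam^(T-t) (y_t - phi_t . w)^2, updated one
    sample at a time without re-solving the full system."""
    def __init__(self, dim, lam=1.0, delta=100.0):
        self.w = np.zeros(dim)
        self.P = delta * np.eye(dim)   # estimate of the inverse covariance
        self.lam = lam

    def update(self, phi, y):
        phi = np.asarray(phi, float)
        Pphi = self.P @ phi
        k = Pphi / (self.lam + phi @ Pphi)          # gain vector
        self.w = self.w + k * (y - phi @ self.w)    # prediction-error correction
        self.P = (self.P - np.outer(k, Pphi)) / self.lam
        return self.w
```

With `lam < 1` older samples are exponentially down-weighted, which is what lets the estimator track the non-stationary targets arising in optimistic policy iteration.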

Experiments in Value Function Approximation with Sparse Support Vector Regression

**Abstract:**
We present first experiments using Support Vector Regression as
function approximator for an online, *Sarsa*-like reinforcement
learner. To overcome the batch nature of SVR, two ideas are employed.
The first is sparse greedy approximation: the data is projected onto
the subspace spanned by only a small subset of the original data (in
feature space). This subset can be built up in an on-line fashion.
Second, we use the sparsified data to solve a reduced quadratic
problem, where the number of variables is independent of the total
number of training samples seen. The feasibility of this approach
is demonstrated on two common toy problems.

**Tobias Jung and Thomas Uthmann.**
Experiments in Value Function Approximation with Sparse Support Vector Regression.
In: *Proc. of European Conference on Machine Learning (ECML)*, p. 180-191, 2004
(download paper .pdf) (download slides .pdf)

Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-Like Exploration

**Abstract:**
We present an implementation of model-based online reinforcement learning (RL) for continuous
domains with deterministic transitions that is specifically designed to achieve low sample complexity.
To achieve low sample complexity, since the environment is unknown, an agent must intelligently
balance exploration and exploitation, and must be able to rapidly generalize from observations.
While in the past a number of sample-efficient RL algorithms have been proposed with provable
PAC-style performance guarantees for both discrete (e.g., RMAX) and continuous (e.g., ARL)
domains, to allow theoretical analysis, only methods with weak generalization capabilities were
considered.
We present an implementation of RMAX for continuous domains that separates function approximation
in the model learner (which does require samples) from the interpolation in the planner (which
does not require samples). For model learning we apply Gaussian process regression (GP), which is
able to automatically adjust itself to the complexity of the problem (via Bayesian hyperparameter
selection) and is, in practice, able to learn a highly accurate model from very little data.
In addition, a GP provides
a natural way to determine the uncertainty of its predictions, which allows us to implement the
"optimism in the face of uncertainty" exploration of RMAX in a principled way. Our method is
evaluated on four common benchmark domains.

**Tobias Jung and Peter Stone.**
Gaussian Processes for Sample Efficient Reinforcement Learning with RMAX-Like Exploration.
In: *Proc. of European Conference on Machine Learning (ECML)*, 2010
(download paper .pdf) (download slides .pdf)
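
The role the GP plays above can be illustrated generically: a GP regression prediction returns both a mean and a variance, and the variance is what supports "optimism in the face of uncertainty". The RBF kernel and parameter values below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def gp_predict(X, y, x_star, gamma=1.0, noise=1e-6):
    """GP regression at a query point x_star with an RBF kernel.
    Returns the predictive mean and variance; a large variance marks a
    region the model knows little about (a candidate for exploration)."""
    X = np.asarray(X, float)
    x_star = np.asarray(x_star, float)
    def k(a, b):
        return np.exp(-gamma * np.sum((a - b) ** 2))
    K = np.array([[k(a, b) for b in X] for a in X]) + noise * np.eye(len(X))
    ks = np.array([k(a, x_star) for a in X])
    mean = ks @ np.linalg.solve(K, np.asarray(y, float))
    var = k(x_star, x_star) - ks @ np.linalg.solve(K, ks)
    return mean, var
```

Near the training data the variance collapses toward zero; far from it, the variance approaches the prior variance, flagging the state as unexplored.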

Feature Selection for Value Function Approximation Using Bayesian Model Selection

**Abstract:**
Feature selection in reinforcement learning (RL), i.e., choosing basis functions
such that useful approximations of the unknown value function can be obtained,
is one of the main challenges in scaling RL to real-world applications. Here we
consider the Gaussian process based framework GPTD for approximate policy evaluation, and
propose feature selection through marginal likelihood optimization of the associated
hyperparameters. Our approach has two appealing benefits: (1) given just sample transitions,
we can solve the policy evaluation problem fully automatically (without looking at the
learning task, and, in theory, independent of the dimensionality of the state space), and
(2) model selection allows us to consider more sophisticated kernels, which in turn enable
us to identify relevant subspaces and eliminate irrelevant state variables such that
we can achieve substantial computational savings and improved prediction performance.

**Tobias Jung and Peter Stone.**
Feature Selection for Value Function Approximation Using Bayesian Model Selection.
In: *Proceedings of European Conference on Machine Learning (ECML)*, 2009
(download paper .pdf) (download slides .pdf)
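
The marginal-likelihood criterion mentioned above can be written down compactly. The following is a generic GP log marginal likelihood with a plain RBF kernel (the paper uses richer kernels with per-dimension hyperparameters for subspace identification; this is only the quantity being maximized, under illustrative assumptions):

```python
import numpy as np

def log_marginal_likelihood(X, y, gamma, noise):
    """GP log marginal likelihood log p(y | X, gamma, noise) for an RBF
    kernel. Model selection maximizes this over the hyperparameters."""
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    # Kernel matrix plus noise on the diagonal.
    K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    K += noise * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                  # data-fit term
            - np.log(np.diag(L)).sum()        # complexity penalty (log det)
            - 0.5 * len(X) * np.log(2 * np.pi))
```

Maximizing this trades data fit against model complexity automatically, which is why no manual tuning of basis functions is needed.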

Empowerment for Continuous Agent-Environment Systems

**Abstract:**
This paper develops generalizations of empowerment to continuous states. Empowerment is a recently introduced
information-theoretic quantity motivated by hypotheses about the efficiency of the sensorimotor loop in biological
organisms, but also from considerations stemming from curiosity-driven learning.
Empowerment measures, for agent-environment systems with stochastic transitions, how much influence an agent has on
its environment, but only that influence that can be sensed by the agent's sensors. It is an information-theoretic
generalization of joint controllability (influence on environment) and observability (measurement by sensors) of
the environment by the agent, both controllability and observability being usually defined in control theory as
the dimensionality of the control/observation spaces.
Earlier work has shown that empowerment has various interesting and relevant properties, e.g., it allows us to identify
salient states using only the dynamics, and it can act as intrinsic reward without requiring an external reward.
However, in this previous work empowerment was limited to the case of small-scale and discrete domains and furthermore
state transition probabilities were assumed to be known.
The goal of this paper is to extend empowerment to the significantly more important and relevant case of continuous
vector-valued state spaces and initially unknown state transition probabilities. The continuous state space is addressed
by Monte-Carlo approximation; the unknown transitions are addressed by model learning and prediction, for which we apply
Gaussian process regression with iterated forecasting. In a number of well-known continuous control tasks we examine
the dynamics induced by empowerment and include an application to exploration and online model learning.

**Tobias Jung, Daniel Polani and Peter Stone.**
Empowerment for Continuous Agent-Environment Systems.
In: *Adaptive Behavior*, Vol. 19(1), pp. 16-39, 2011
(download article .pdf) (download slides from GSO Workshop '11 .pdf)
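
Since empowerment is the Shannon channel capacity of the channel from actions to subsequent sensor readings, the discrete, known-model case can be sketched with the standard Blahut-Arimoto iteration. The paper's contribution is the continuous, model-learned case; this discrete sketch is only for intuition:

```python
import numpy as np

def channel_capacity(P, iters=300):
    """Blahut-Arimoto iteration for the capacity (in bits) of a discrete
    memoryless channel P[a, s] = p(s | a). Empowerment is this capacity
    for the channel from action sequences to future sensor states."""
    P = np.asarray(P, float)
    q = np.full(P.shape[0], 1.0 / P.shape[0])    # distribution over actions

    def kl_rows(q):
        marg = q @ P                              # output marginal p(s)
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(P > 0, P * np.log(P / marg), 0.0)
        return terms.sum(axis=1)                  # KL(p(.|a) || p(.)) per action

    for _ in range(iters):
        q = q * np.exp(kl_rows(q))                # reweight informative actions
        q /= q.sum()
    return float(q @ kl_rows(q)) / np.log(2.0)    # mutual information in bits
```

A noiseless channel over two actions gives 1 bit of empowerment; a channel whose outputs ignore the action gives 0, matching the intuition of "influence that can be sensed".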

Evolution and Learning: Evolution of Sensors in a Simple MDP Environment

**Abstract:**
Natural intelligence and autonomous agents face difficulties when
acting in information-dense environments. Assailed by a multitude of
stimuli, they have to make sense of the inflow of information,
filtering and processing what is necessary, but discarding that
which is unimportant. This article aims at investigating the
interactions between \emph{evolution} of the sensorial channel
extracting the information from the environment and the simultaneous
\emph{individual adaptation} of agent-control. Our particular goal
is to study the influence of learning on the evolution of sensors,
with learning duration being the tunable parameter. A genetic
algorithm governs the evolution of sensors appropriate for the agent
solving a simple grid world task. The performance of the agent
is taken as fitness; 'sensors' are conceived as a map from
environmental states to agent observations, and individual adaptation
is modeled by Q-learning. Our experimental results show that, owing to
the principles of cognitive economy, learning (and varying its duration)
actually transforms the fitness landscape. In particular, we
identify a trade-off between learning speed (load) and sensor
accuracy (error). These results are further reinforced by
theoretical analysis: we derive an analytical measure for the
quality of sensors based on the mutual entropy between the system
of states and the selection of an optimal action, a concept recently
proposed by Polani, Martinetz and Kim.

**Tobias Jung, Peter Dauscher and Thomas Uthmann.**
Evolution and Learning: Evolution of Sensors in a Simple MDP Environment.
In: *Adaptive Behavior*, Vol. 11(3):159-177, 2003
(download article .pdf)

Optimized Look-Ahead Tree Policies: A Bridge Between Look-Ahead Tree Policies and Direct Policy Search

**Abstract:**
Direct policy search (DPS) and look-ahead tree (LT) policies are two popular techniques for solving
difficult sequential decision-making problems. Both are simple to implement, widely applicable
without making strong assumptions on the structure of the problem, and capable of producing high
performance control policies. However, computationally both of them are, each in their own way, very
expensive. DPS can require huge offline resources (effort required to obtain the policy) to first select
an appropriate space of parameterized policies that works well for the targeted problem, and then to
determine the best values of the parameters via global optimization. LT policies do not require any
offline resources; however, they typically require huge online resources (effort required to calculate the best
decision at each step) in order to grow trees of sufficient depth.
In this paper, we propose optimized look-ahead trees (OLT), a model-based policy learning scheme that
lies at the intersection of DPS and LT. In OLT, the control policy is represented indirectly through an
algorithm that at each decision step develops, as in LT using a model of the dynamics, a small look-ahead
tree until a prespecified online budget is exhausted. Unlike LT, the development of the tree is not driven
by a generic heuristic; rather, the heuristic is optimized for the target problem and implemented as a parameterized
node scoring function learned offline via DPS.
We experimentally compare OLT with pure DPS and pure LT variants on optimal control benchmark domains. The results show
that the LT-based representation is a versatile way of compactly representing policies in a DPS scheme (which
results in OLT being easier to tune and having lower offline complexity than pure DPS); while at
the same time, DPS helps to significantly reduce the size of the look-ahead trees that are required to
take high-quality decisions (which results in OLT having lower online complexity than pure LT). Moreover,
OLT produces overall better performing policies than pure DPS and pure LT and also results in policies
that are robust with respect to perturbations of the initial conditions.

**Tobias Jung, Louis Wehenkel, Damien Ernst and Francis Maes.**
Optimized Look-Ahead Tree Policies: A Bridge Between Look-Ahead Tree Policies and Direct Policy Search.
In: *International Journal of Adaptive Control and Signal Processing*, 2013 (to appear)
(download paper .pdf) (download slides from RL Workshop Barbados'13 .pdf)

Optimized Look-Ahead Tree Policies: Extension to Large and Continuous Action Spaces

**Abstract:**
This paper studies look-ahead tree based control policies from the viewpoint of
online decision making with constraints on the computational budget allowed per decision
(expressed as number of calls to the generative model). We consider optimized
look-ahead tree (OLT) policies, a recently introduced family of hybrid techniques,
which combine the advantages of look-ahead trees (high precision) with the advantages
of direct policy search (low online cost) and which are specifically designed for
limited online budgets. We present two extensions of the basic OLT algorithm that,
on the one hand, allow tackling deterministic optimal control problems with large
and continuous action spaces and, on the other hand, can also help to further
reduce the online complexity.

**Tobias Jung, Damien Ernst and Francis Maes.**
Optimized Look-Ahead Tree Policies: Extension to Large and Continuous Action Spaces.
In: *Proc. of IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL)*, 2013
(download paper .pdf) (download slides .pdf)

Outbound SPIT Filter with Optimal Performance Guarantees

**Abstract:**
This paper presents a formal framework for identifying and filtering SPIT calls (SPam in Internet
Telephony) in an outbound scenario with provable optimal performance. In this respect, our work differs markedly
from previous work: our goal is to rigorously formalize the problem in terms of
mathematical decision theory, find the optimal solution to the problem, and derive concrete bounds for
its expected loss (number of mistakes the SPIT filter will make in the worst case).
This goal is achieved by considering an abstracted scenario amenable to theoretical analysis, namely
SPIT detection in an outbound scenario with pure sources. Our methodology is to first define the cost
of making an error (false positive and false negative), apply Wald's sequential probability ratio test
to the individual sources, and then determine analytically error probabilities such that the resulting
expected loss is minimized.
The benefits of our approach are: (1) the method is optimal (in a sense defined in the paper);
(2) the method does not rely on manual tuning and tweaking of parameters but is completely self-contained
and mathematically justified; (3) the method is computationally simple and scalable. These are desirable
features that would make our method a component of choice in larger, autonomic frameworks.

**Tobias Jung, Sylvain Martin, Mohamed Nassar, Damien Ernst and Guy Leduc.**
Outbound SPIT Filter with Optimal Performance Guarantees.
In: *Computer Networks*, 2013 (to appear)
(download article .pdf)

**Tobias Jung, Sylvain Martin, Damien Ernst and Guy Leduc.**
SPRT for SPIT: Using the Sequential Probability Ratio Test for Spam in VoIP Prevention.
In: *Proc. of Int. Conf. on Autonomous Infrastructure, Management and Security (AIMS)*, 2012.
(download paper .pdf)
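
Wald's sequential probability ratio test, the core tool in both papers above, is compact enough to sketch. The observation model below (i.i.d. binary suspicion flags with hypothesized rates `p_spam` and `p_legit`) is a simplification for illustration, not the papers' exact model:

```python
import math

def sprt(observations, p_spam, p_legit, alpha=0.01, beta=0.01):
    """Wald's SPRT on a stream of binary observations (1 = call flagged
    suspicious). p_spam / p_legit are the flag probabilities under the
    'spammer' / 'legitimate' hypotheses; alpha and beta are the target
    false-positive and false-negative rates. Stops as soon as the
    cumulative log-likelihood ratio crosses a threshold."""
    upper = math.log((1 - beta) / alpha)    # cross up: accept 'spammer'
    lower = math.log(beta / (1 - alpha))    # cross down: accept 'legitimate'
    llr = 0.0
    for n, x in enumerate(observations, 1):
        if x:
            llr += math.log(p_spam / p_legit)
        else:
            llr += math.log((1 - p_spam) / (1 - p_legit))
        if llr >= upper:
            return "spammer", n
        if llr <= lower:
            return "legitimate", n
    return "undecided", len(observations)
```

The appeal noted in the abstract is visible here: the thresholds follow directly from the chosen error rates, with no parameters left to tune by hand.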

Contextual Multi-armed Bandits for Web Server Defense

**Abstract:**
In this paper we argue that contextual multi-armed bandit algorithms
could open avenues for designing self-learning security modules for
computer networks and related tasks. The paper makes two contributions,
one conceptual and one algorithmic. The conceptual contribution
is to formulate the real-world problem of preventing HTTP-based attacks on web servers
as a one-shot sequential learning problem, namely as a contextual
multi-armed bandit. Our second contribution is to present CMABFAS,
a new and computationally very cheap algorithm for general contextual
multi-armed bandit learning that specifically targets domains with finite actions.
We illustrate how CMABFAS could be used to design a fully self-learning meta
filter for web servers that does not rely on feedback from the end-user (i.e.,
does not require labeled data) and report first convincing simulation results.

**Tobias Jung, Sylvain Martin, Damien Ernst and Guy Leduc.**
Contextual Multi-armed Bandits for Web Server Defense.
In: *Proc. of International Joint Conference on Neural Networks (IJCNN)*, 2012
(download paper .pdf) (download slides .pdf)

**Tobias Jung, Sylvain Martin, Damien Ernst and Guy Leduc.**
Contextual Multi-armed Bandits for the Prevention of Spam in VoIP Networks.
In: *ULg Technical Report 12-07* (arXiv:1201.6181v2), 2012.
(download paper .pdf)
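
The contextual-bandit loop itself can be sketched with a much simpler stand-in than CMABFAS: an epsilon-greedy learner over discrete (context, action) pairs. This is illustrative only and not the papers' algorithm:

```python
import random

class EpsilonGreedyContextualBandit:
    """Toy contextual bandit with finite actions: keeps a running mean
    reward per (context, action) pair; explores with probability epsilon,
    otherwise picks the action with the highest estimated reward."""
    def __init__(self, actions, epsilon=0.1):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.counts = {}   # (context, action) -> number of pulls
        self.values = {}   # (context, action) -> running mean reward

    def select(self, context):
        if random.random() < self.epsilon:
            return random.choice(self.actions)      # explore
        return max(self.actions,
                   key=lambda a: self.values.get((context, a), 0.0))

    def update(self, context, action, reward):
        key = (context, action)
        n = self.counts.get(key, 0) + 1
        self.counts[key] = n
        mean = self.values.get(key, 0.0)
        self.values[key] = mean + (reward - mean) / n   # incremental mean
```

The security framing maps naturally onto this loop: the context is the observed request, the action is the filtering response, and the reward signal comes from the system itself rather than from end-user labels.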

Connectivity-based Localization in Robot Networks

**Abstract:**
Consider a small team of autonomous robots, each equipped with a radio, that are deployed
in an ad-hoc fashion and whose goal it is to act as signal relay nodes to form a temporary,
adaptive, and highly robust communication network. To perform this type of self-optimization
and self-healing, relative localization (i.e., knowing the direction and distance to every other
robot in the network) is necessary.
In a sense, the problem is similar to the one studied in ad-hoc sensor networks. The key differences
are that (1) anchor nodes with known locations are not available; that (2) the connectivity graph
is very sparse, because of a comparatively small number of nodes involved; and that (3) the
communication nodes are actually mobile robots such that apart from location we also have
to estimate the directions to other nodes (which cannot be obtained from a single time slice).
To solve this problem, we propose a global approach that exploits the mobility of the robots
to obtain multiple connectivity measurements over a small time window. Together with the odometry
of the individual robots, we then try to estimate the underlying locations that best explain the observed
connectivity data by minimizing a suitable stress function.
Through simulation of a concrete real-world scenario we show that our approach performs reasonably well
with as few as ten robots. We examine its performance under both outdoor and indoor conditions
(i.e., uniform and non-uniform signal propagation). In addition, we also consider the case
where we are able to observe the distance between connected robots, which further improves accuracy
substantially.

**Tobias Jung, Mazda Ahmadi and Peter Stone.**
Connectivity-based Localization in Robot Networks.
In: *Proc. of IEEE Workshop on Robotic Sensor Networks (DCOSS)*, 2009
(download paper .pdf) (download slides .pdf)
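
The stress-minimization step can be sketched with plain gradient descent on a squared-stress objective; the paper's actual stress function and optimizer may differ, so treat this as a minimal illustration with assumed names and settings:

```python
import numpy as np

def localize(D, dim=2, iters=3000, lr=0.01, seed=0):
    """Estimate relative positions from a matrix D of pairwise distance
    estimates by gradient descent on the stress function
    sum_{i,j} (||x_i - x_j|| - D_ij)^2. The layout is recovered only up
    to translation, rotation, and reflection."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    X = rng.normal(scale=0.1, size=(n, dim))       # random initial layout
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]            # x_i - x_j
        dist = np.linalg.norm(diff, axis=-1) + 1e-9     # current distances
        # Gradient of the stress w.r.t. each position (up to a constant factor).
        grad = (((dist - D) / dist)[:, :, None] * diff).sum(axis=1)
        X -= lr * grad
    return X
```

Since only relative geometry is identifiable, any evaluation should compare the reconstructed pairwise distances to D rather than the raw coordinates.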

Sequential Learning with LS-SVM for Large-Scale Data Sets

**Abstract:**
We present a subspace-based variant of LS-SVMs (i.e., regularization networks) that
sequentially processes the data and is hence especially suited for online learning
tasks. The algorithm works by selecting from the data set a small subset of basis
functions that is subsequently used to approximate the full kernel on arbitrary points.
This subset is identified online from the data stream.
We improve upon existing approaches (especially the kernel recursive least squares algorithm)
by proposing a new, supervised criterion for the selection of the relevant basis functions
that takes into account the approximation error incurred from approximating the kernel
as well as the reduction of the cost in the original learning task.
We use the large-scale 'forest' data set to compare the performance and efficiency of our algorithm
with greedy batch selection of the basis functions via orthogonal least squares. Using the same number
of basis functions, we achieve comparable error rates at much lower cost (both in CPU time and memory).

**Tobias Jung and Daniel Polani.**
Sequential Learning with LS-SVM for Large-Scale Data Sets.
In: *Proc. of 16th ICANN*, p. 381-390, 2006
(download paper .pdf) (download slides .pdf)

Long Term Prediction of Product Quality in a Glass Manufacturing Process Using a Kernel Based Approach

**Abstract:**
In this paper we report the results obtained using a kernel-based
approach to predict the temporal development of four response
signals in the process control of a glass melting tank with 16
input parameters. The data set is a revised version from the modelling
challenge in EUNITE-2003. The central difficulties are: large time delays
between changes in the inputs and the outputs, the large amount of
data, and a general lack of knowledge about the relevant variables
that intervene in the process. The methodology proposed here
comprises Support Vector Machines (SVM) and Regularization
Networks (RN). We use the idea of sparse approximation both as a
means of regularization and as a means of reducing the
computational complexity. Furthermore, we use an incremental
approach to add new training examples to the kernel-based method
and efficiently update the current solution. This allows us to use
a sophisticated learning scheme, where we iterate between
prediction and training, with good computational efficiency and
satisfactory results.

**Tobias Jung, Luis Herrera and Bernhard Schoelkopf.**
Long Term Prediction of Product Quality in a Glass Manufacturing Process Using a Kernel Based Approach.
In: *Proc. of 8th IWANN*, p. 960-967, 2005
(download paper .pdf) (download slides .pdf)