Last update: 12/19/2012, 8:40 a.m.
Since its inception in the 1940s as a description of the laws of communication, information theory has been applied in fields well beyond its original domain; in particular, there have been long-standing attempts to use it to describe intelligent agents. Attneave (1954) and Barlow (1961) already suspected that neural information processing might follow principles of information theory, and Laughlin (1998) demonstrated that information processing comes at a high metabolic cost; this implies evolutionary pressure pushing organismic information processing towards the optimal levels of data throughput predicted by information theory. This becomes particularly interesting when one considers the whole perception-action cycle, including feedback. In the last decade, significant progress has been made in this direction, linking information theory and control. The ensuing insights make it possible to address a large range of fundamental questions pertaining not only to the perception-action cycle but also to general issues of intelligence, and to solve classical problems of AI and machine learning in a novel way.
The workshop will present recent work in AI, machine learning, control, and biologically plausible cognitive modeling that is based on information theory.
Daniel Y. Little (University of California, Berkeley, USA)
Evangelos A. Theodorou (University of Washington, USA)
Naftali Tishby (Hebrew University, Israel)
Daniel Polani (University of Hertfordshire, UK)
Tobias Jung (University of Liège, Belgium)
We gratefully acknowledge our sponsors.
7:30 - 7:45  Welcome & Opening Remarks
7:45 - 8:30  Talk:
8:30 - 9:00  Spotlight Presentations
9:00 - 9:30  Coffee Break + Poster Session
9:30 - 10:30  Talks
15:30 - 16:15  Invited Talk
16:15 - 17:00  Invited Talk
17:00 - 17:30  Coffee Break
17:30 - 18:10  Talks
18:10 - 19:00  Talk:
19:00 - 24:00  Discussion, Closing Remarks, Dinner
"An information-theoretic model of learning-driven exploration"
Daniel Y. Little, University of California, Berkeley
Abstract:
Psychologists have long held that curiosity constitutes the
primary drive of explorative behavior. Such intrinsic desire to learn
does not require reinforcement signals from extrinsic motivators.
Previous computational modeling of exploration, however, has largely
focused on its role in the acquisition of external rewards, often
presented in terms of an exploration-exploitation dichotomy. While
these studies have increased our understanding of control for reward
optimization, the investigation of learning as a primary objective of
behavior may provide fresh insights into the principles underlying
human and animal exploration. To this end, we will describe an
information-theoretic approach to studying learning for learning's
sake in embodied agents. In simple worlds without external reward
signals, we demonstrate how an agent can estimate the expected
information gains of an action and use this estimate, called
predictive information gain (PIG), to optimize behavior towards
learning. We discuss the similarities and differences between our approach
and other information-theoretic models of behavior. Finally, we
present recent results suggesting how a combination of two
information-theoretic models may explain the interaction between
investigation and play, the two major components of human exploration
identified by the psychology literature.
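As a rough illustration of what estimating the expected information gain of an action can look like, the Python sketch below considers a Bayesian agent with a finite set of hypotheses about its world and computes the expected Kullback-Leibler divergence between posterior and prior beliefs for each candidate action. This is a generic textbook quantity and not necessarily the exact definition of PIG used in the talk; the hypothesis set, the function names, and the toy likelihood tables are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_information_gain(prior, likelihoods):
    """Expected KL(posterior || prior) over the outcomes of one action.

    prior[h]          -- current belief in hypothesis h
    likelihoods[h][o] -- P(outcome o | hypothesis h) for the action considered
    """
    gain = 0.0
    for o in range(len(likelihoods[0])):
        # Predictive probability of outcome o under the current belief.
        p_o = sum(ph * lik[o] for ph, lik in zip(prior, likelihoods))
        if p_o == 0.0:
            continue  # impossible outcomes contribute nothing
        posterior = [ph * lik[o] / p_o for ph, lik in zip(prior, likelihoods)]
        gain += p_o * kl_divergence(posterior, prior)
    return gain

# Toy example (hypothetical numbers): two hypotheses, two possible outcomes.
prior = [0.5, 0.5]
informative_action = [[1.0, 0.0], [0.0, 1.0]]    # outcome reveals the hypothesis
uninformative_action = [[0.5, 0.5], [0.5, 0.5]]  # outcome independent of hypothesis

gains = {
    "informative": expected_information_gain(prior, informative_action),
    "uninformative": expected_information_gain(prior, uninformative_action),
}
# A learning-driven agent would pick the action with the largest expected gain.
best_action = max(gains, key=gains.get)
```

In this toy setting the informative action yields an expected gain of log 2 nats (one full bit about a binary hypothesis), while the uninformative action yields zero.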
"Information Theoretic Views of Path Integral Control and Applications To Robotics"
Evangelos A. Theodorou, University of Washington
Abstract:
Recently, reinforcement learning has moved towards combining
classical techniques from stochastic optimal control and dynamic
programming with learning techniques from statistical estimation theory and
the connection between stochastic differential and partial differential
equations via the Feynman-Kac lemma. The resulting framework transforms
nonlinear stochastic optimal control into an approximation of a path
integral. In this talk, I will present connections of Path Integral (PI)
and Kullback-Leibler (KL) control, as presented in the machine
learning and robotics communities, with work in control theory on
logarithmic transformations of diffusion processes. The analysis provides
an information-theoretic view of PI stochastic optimal control based on
the duality between Free Energy and Relative Entropy. Comparisons
between the information-theoretic and Dynamic Programming points of view
in terms of generalizations and extensions will be discussed. Finally, I
will present algorithmic developments on iterative path integral control
and show applications to robotics as well as connections to
free-energy-based policy gradients.
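The duality between free energy and relative entropy mentioned in this abstract can be checked numerically in a small discrete setting. The sketch below assumes a finite set of trajectories with prior probabilities `p` and costs `cost` (made-up numbers, unit temperature) and verifies that -log E_p[exp(-J)] equals the minimum over q of E_q[J] + KL(q || p), attained at q* proportional to p * exp(-J). It illustrates only this identity, not the path integral control algorithms of the talk; all names are hypothetical.

```python
import math

def free_energy(p, cost):
    """-log E_p[exp(-J)] for a discrete prior p and per-trajectory costs J."""
    return -math.log(sum(pi * math.exp(-c) for pi, c in zip(p, cost)))

def kl_divergence(q, p):
    """KL(q || p) in nats."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def objective(q, p, cost):
    """Expected cost under q plus the relative entropy of q with respect to p."""
    expected_cost = sum(qi * c for qi, c in zip(q, cost))
    return expected_cost + kl_divergence(q, p)

def optimal_q(p, cost):
    """Minimizer of the objective: q* proportional to p * exp(-J)."""
    weights = [pi * math.exp(-c) for pi, c in zip(p, cost)]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical prior over three trajectories and their costs.
p = [0.2, 0.5, 0.3]
cost = [1.0, 2.0, 0.5]

q_star = optimal_q(p, cost)
gap_at_optimum = objective(q_star, p, cost) - free_energy(p, cost)  # ~0
gap_at_prior = objective(p, p, cost) - free_energy(p, cost)         # >= 0
```

Any distribution other than `q_star` gives an objective value above the free energy, which is what makes the free energy a lower bound on the controlled cost in this toy setting.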
Download extended abstract (PDF)
"Information flow in perception-action cycles and the emergence of hierarchies and reverse-hierarchies"
Naftali Tishby, Hebrew University
Abstract:
Starting from Large Deviation Theory (Sanov's theorem) we can obtain the connection between
the reward rate and the control and sensing information capacities, for systems in "metabolic
information equilibrium" with stationary stochastic environments (Tishby & Polani, 2010). This
result can be considered as an equilibrium characterisation for systems that achieved a certain
value through interactions with the environment, but have no new learning (e.g. "stupid" cleaning
robots). The effect of learning can be considered by revisiting the sub-extensivity of predictive
information in stationary environments (Bialek, Nemenman & Tishby 2002) and combining it with the
requirement of computational tractability of planning. We argue that planning is possible if the
information flow terms remain proportional to the reward terms on the one hand, but still bounded
by the sub-extensive predictive information on the other hand.
I will discuss the possible implications of this new computational principle for the emergence of
hierarchical representations and discounting of rewards in our generalised Bellman equation.
"Information Constraints as Drivers for the Emergence of Cognitive Architectures and Concepts"
Daniel Polani, University of Hertfordshire
Abstract:
Ashby's Law of Requisite Variety (1956) and its extensions by
Touchette and Lloyd (2000, 2004) indicate that Shannon information
constraints govern the potential organisation and administration of
any cognitive task. In addition, there is increasing evidence that
the tradeoffs implied by these constraints are indeed exploited by
biological organisms close to the limit in adaptive
(quasi)equilibrium.
The talk will briefly discuss the above principles and then present
several scenarios which illustrate some consequences of these
hypotheses. Depending on time and interest, this may include one or
more of the following scenarios:
"Higher-Order Predictive Information for Learning an Infinite Stream of Episodes"
Byoung-Tak Zhang, Seoul National University
Abstract:
We consider the problem of lifelong learning from an indefinite stream of
temporal episodes, i.e. a time series consisting of episodes, where the
number of the episodes is potentially infinite and the length of each
episode varies. What kind of objective function should the lifelong
learner use to balance short-term and long-term performance? How should
the learner optimize its model complexity when the statistics of the
episodes change over time? Maximization of the expected future reward, such
as a value function used in reinforcement learning, might be useful if we
could define rewards for a prespecified goal. For learning an indefinite
stream of episodes, we find the mutual-information-based measures of
information theory, such as predictive information and empowerment,
suitable. The predictive information is, however, typically approximated by
restricting the time horizons to a single time step. Though this is exact
under the Markov assumption, i.e. the probability of a state depends only
on the previous state, and can still generate
explorative behavior, the predictive power can be improved by increasing
the order of temporal dependency. Here we extend the first-order predictive
information to the *k*th-order predictive information for lifelong learning
from a continuous stream of time-series episodes. This higher-order
predictive information can be efficiently approximated by an
importance-sampling-based Monte Carlo method.
"Regulating the information in spikes: a useful bias"
David Balduzzi, ETH Zurich
Abstract:
The bias/variance tradeoff is fundamental to learning: increasing a model's complexity can
improve its fit on training data, but potentially worsens performance on future samples. Remarkably, however,
the human brain effortlessly handles a wide range of complex pattern recognition tasks. On the basis of these
conflicting observations, it has been argued that useful biases in the form of "generic mechanisms for representation"
must be hardwired into cortex (Geman et al.).
I describe a useful bias, taking the form of a constraint on the information embedded in spiking outputs,
that encourages cooperative learning. The constraint is both biologically plausible and rigorously justified.
"Information-Maximizing Local Spatial Scale Selection in Early Visual Processing"
Sander M. Bohte, CWI, Amsterdam
H. Steven Scholte, University of Amsterdam
Sennay Ghebreab, University of Amsterdam
Abstract:
From an information theoretic point of view, the optimal amount of spatial
pooling in optical sensors is determined by local contrast and mean object
intensity. In other words, the optimal scale of a filter is determined by
environmental conditions. As neural filters in early visual processing span
many different spatial scales, the question is whether the brain uses
optimal filter-scale selection. In a model of early visual processing, we
derive the local filter scale that maximizes information, based on early work
by Snyder et al. (1977). We show that such information-maximizing scale
selection produces a neural response distribution that is as predictive for
EEG responses as the best current heuristics for local scale control. We
furthermore show that a simple neural network can quickly learn such scale
selection. Taking predictability of EEG responses as model evidence, this
finding suggests that the brain may hierarchically pool simple features so
as to maximize information transfer given uncertainty due to local contrast
statistics.
"Relating information theoretic principles in learning to structure in human cultural behavior"
Tessa Verhoef, University of Amsterdam
Abstract:
Many human behaviors are rooted in culture. Cultural traditions such
as language, music, dance and art are built on systems that are often
acquired by social learning and that have been transmitted from
generation to generation. Cultural evolution has been studied at
length with the use of both artificial and human learners and a key
finding from this work is that structure in transmitted systems is
shaped by the cognitive biases of its users. In studies investigating
structures that emerge from cultural evolution in experiments with
humans, compressible and predictable systems appear to be a prevalent
result. Findings from cultural evolution research may therefore
provide additional sources of evidence about information-theoretic
biases in cognition. As an example, data will be shown from an
experiment in which artificial whistled languages (produced with slide
whistles) are transmitted. These languages evolved in such a way that
the set of basic sound primitives was reduced and these primitives
were more extensively reused and combined in a predictable way,
yielding more compressible systems. Such an efficient combinatorial
structure is one of the basic features of linguistic systems, but is
also present in artistic systems such as music and dance. Presumably
these systems exhibit this type of efficient structure because they
are the result of cultural evolution and reflect human compression
biases.
"Continuous-time recursive Bayesian updating in networks of stochastic spiking neurons"
Ben Moran
Nicolas Della Penna, Australian National University
Abstract:
Neural systems operate under uncertainty, and to behave adaptively must
update their beliefs with information from their surroundings. This
entails maintaining a probability distribution over possible states, and
updating this distribution as sensory data arrive. Optimal updates are
given by Bayes' theorem, but it is useful to consider what kinds of
network could support this computation. One such formulation arises
from exploring the formal connection between Bayes' rule and the
replicator equation, a model of biological evolution. We can identify
the composition of species within a population with the prior
distribution, and "evolutionary fitness" with log likelihood. This
analogy is mathematically precise, and holds also in the case of
continuous time [Harper 2010]. The continuous replicator dynamic is a
Lotka-Volterra system, so it is possible to construct a stochastic
spiking network of linear-exponential-Poisson neurons with mean rates
following the same dynamic [Cardanobile & Rotter 2011]. The replicator
dynamic describes only iterated Bayesian inference with no transition
model, but we can implement dynamic state filtering by adapting a
generalization, the replicator-mutator equation. This suggests expanding
the model to incorporate additional linear connections which act as the
generator matrix of a state transition Markov process.
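The correspondence between Bayes' rule and the continuous-time replicator equation can be illustrated with a few lines of Python. The sketch below takes a discrete prior as the population composition, uses the log likelihood as fitness, and integrates dx_i/dt = x_i (f_i - mean fitness) for one unit of time with a plain Euler scheme; the result should approximate the Bayesian posterior. The numbers, function names, and the choice of Euler integration are assumptions made for illustration, and the spiking-network construction from the abstract is not modeled here.

```python
import math

def bayes_update(prior, likelihood):
    """Posterior over hypotheses after one observation, via Bayes' theorem."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]

def replicator_flow(x0, fitness, t_end=1.0, steps=10000):
    """Euler integration of dx_i/dt = x_i * (f_i - sum_j x_j f_j)."""
    x = list(x0)
    dt = t_end / steps
    for _ in range(steps):
        mean_fitness = sum(xi * fi for xi, fi in zip(x, fitness))
        x = [xi + dt * xi * (fi - mean_fitness) for xi, fi in zip(x, fitness)]
    return x

# Hypothetical prior over three hypotheses and likelihood of one observation.
prior = [0.5, 0.3, 0.2]
likelihood = [0.1, 0.6, 0.3]

posterior = bayes_update(prior, likelihood)
# Fitness is the log likelihood; evolving for unit time plays the role of one update.
evolved = replicator_flow(prior, [math.log(l) for l in likelihood])
max_gap = max(abs(a - b) for a, b in zip(posterior, evolved))
```

The exact solution of the flow is x_i(t) proportional to x_i(0) exp(f_i t), so at t = 1 it coincides with the posterior; `max_gap` only reflects the discretization error of the Euler scheme.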
"Kernel Information Bottleneck"
Nori Jacoby, Hebrew University
Tali Tishby, Hebrew University
Abstract:
The Information Bottleneck (IB) method was introduced as a principled
approach to extracting efficient representations of one set of variables
with respect to another from empirical data, thus extending the classical
notion of minimal sufficient statistics. The method was proposed as a
general computational principle for information processing in the brain and
has been used in many machine learning and neuroscience applications. The
original algorithm for solving the problem was based on the Arimoto-Blahut
alternating projection algorithm, but was not guaranteed to converge to a
global optimum, which jeopardized the practicality of the approach. One
exception was the multivariate Gaussian case, for which the IB was shown to
have an efficient globally converging algorithm (GIB) that extended
Canonical Correlation Analysis (CCA). The main advantage over CCA was that
it provided a continuous optimal tradeoff between the minimality and
sufficiency of the representation (described by the information curve),
hence allowing for optimal multiscale analysis of the data using simple
spectral methods.
Here we extend the Gaussian solution of the Information Bottleneck (GIB) to
a much wider family of distributions using the kernel trick, and make it
practical for essentially any empirical data. Our main theoretical result
is in proving that we can obtain a bound linking the true information curve
and the one obtained by our approach, using information geometry. We
illustrate the algorithm on real data, and discuss some of its potential
new applications.
"Adaptive Coding of Actions and Observations"
Pedro A. Ortega, MPI Tuebingen
Daniel A. Braun, MPI Tuebingen
Abstract:
The application of expected utility theory to construct adaptive
agents is both computationally intractable and statistically
questionable. To overcome these difficulties, agents need the
ability to delay the choice of the optimal policy to a later stage
when they have learned more about the environment. How
should agents do this optimally? An information-theoretic
answer to this question is given by the Bayesian control rule, the
solution to the adaptive coding problem when there are not only
observations but also actions. This paper reviews the central
ideas behind the Bayesian control rule.
Download extended abstract (PDF)
"Active inference with embodied cognitive limitations"
Nisheeth Srivastava, University of Minnesota
Paul R. Schrater, University of Minnesota
Abstract:
Scientists trying to find control-theoretic descriptions of the behavior of
biological organisms in choice tasks have gradually begun to turn away from
basic optimal control formulations of the problem to one of active inference,
where the agent decides both which actions to choose and, crucially, which
pieces of information to process while interacting with the environment. In this
talk, we will describe a new decision theory that uses a rational optimality
criterion grounded in embodied limitations of biological agents: trying to
minimize the metabolic costs of decision-making. This theory builds upon a model
for learning relative preferences without utility computations we have proposed
in a companion paper in the NIPS main conference. Agents
in our proposed framework do not experience numerical rewards as outcomes;
they themselves decide how to represent information about the world as they
navigate it. Simulated agents utilizing our model endogenously replicate a
number of violations of behaviors expected by simple probability matching,
but systematically observed in human subjects.
"The value of information in online learning: A study of partial monitoring problems"
Gabor Bartok, ETH Zurich
David Pal, University of Alberta
Csaba Szepesvari, University of Alberta
Abstract:
In online learning, a learner makes decisions on a turn-by-turn basis.
After making her decision, the learner suffers some loss depending on
her action and some (possibly random) unknown process running in the
background. Then, the learner receives some feedback and the next turn
begins. The goal of the learner is to minimize her cumulative loss. The
performance is measured by the so-called regret: the excess cumulative
loss of the learner compared to that of the best fixed action in hindsight.
How quickly an agent can learn depends on the quality of feedback
information. While online learning is well understood under the special
cases of "full-information" and "bandit" feedback, other feedback
structures have not been thoroughly investigated. In our work, we study
the problem of partial monitoring, a general framework for online
learning with arbitrary feedback structure. We examine the natural
question of how the feedback structure determines the "hardness" of a
game. What regret rate is achievable for different problems? What
learner strategies are able to achieve the best possible regret? Is
there an algorithm that, given a loss and a feedback function as input,
approaches the worst-case regret of the best strategy? In our work, we
answer these and other related questions. We give a full
characterization of finite partial monitoring problems based on the
growth rate of their minimax regret. Furthermore, we present algorithms
that achieve near-optimal regret rate for every partial monitoring problem.
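The regret defined in this abstract, the learner's cumulative loss minus that of the best fixed action in hindsight, can be computed directly once the full loss table is known. The Python sketch below does this for a made-up two-action game; the function name, the `losses[t][a]` layout, and the numbers are assumptions for illustration, and the feedback structure of partial monitoring itself is not modeled.

```python
def regret(losses, actions):
    """Cumulative loss of the played actions minus that of the best fixed action.

    losses[t][a] -- loss of action a at turn t (known only in hindsight)
    actions[t]   -- index of the action the learner actually played at turn t
    """
    learner_loss = sum(losses[t][a] for t, a in enumerate(actions))
    n_actions = len(losses[0])
    best_fixed_loss = min(
        sum(turn[a] for turn in losses) for a in range(n_actions)
    )
    return learner_loss - best_fixed_loss

# Hypothetical three-turn game with two actions.
losses = [
    [0, 1],
    [1, 0],
    [0, 1],
]

unlucky = regret(losses, [1, 0, 1])  # always picks the losing action
lucky = regret(losses, [0, 1, 0])    # always picks the winning action
```

Note that regret can be negative, as for the `lucky` sequence here, because the comparator is restricted to a single fixed action while the learner may switch between turns.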
"Active Sensing as BayesOptimal Sequential DecisionMaking"
Sheeraz Ahmad, UC San Diego
Angela J. Yu, UC San Diego
Abstract:
Active sensing, or the way interactive agents use self-motion to focus
limited sensing resources on task-relevant environmental features, is an
important problem for both machine learning and cognitive science. Here, we
present a Bayes-optimal inference and control framework for active sensing,
which minimizes a cost function that explicitly takes into account
behavioral costs such as response delay, error, and effort. Unlike
previously proposed algorithms that optimize abstract statistical
objectives such as expected entropy reduction [Butko & Movellan, 2010] or
one-step look-ahead accuracy [Najemnik & Geisler, 2005], this model is
goal-directed and context-sensitive, and capable of yielding fine temporal
dynamics such as fixation duration and switch times. We use simulations to
illustrate some scenarios in which context-sensitivity is particularly
useful. To address the computational complexity of the optimal
algorithm, we also present two value iteration algorithms that learn
approximations to the value function using either fixed radial basis
functions or a nonparametric Gaussian process, both of which achieve a great
reduction in computational complexity while retaining performance
comparable to the optimal algorithm.