Abstracts


Title: "What is the information content of an algorithm?"
Author: Joachim M. Buhmann
Abstract: Algorithms are exposed to randomness in the input or noise during the computation. How well can they preserve the information in the data w.r.t. the output space? Algorithms, especially in Machine Learning, are required to show robustness to input fluctuations or randomization during execution. This talk elaborates a new framework to measure the "informativeness" of algorithmic procedures and their "stability" against noise. An algorithm is considered to be a noisy channel which is characterized by a generalization capacity (GC). The generalization capacity objectively ranks different algorithms for the same data processing task based on the bit rate of their respective capacities. The problem of grouping data is used to demonstrate this validation principle for clustering algorithms, e.g. k-means, pairwise clustering, normalized cut, adaptive ratio cut and dominant set clustering. Our new validation approach selects the most informative clustering algorithm, which filters out the maximal number of stable, task-related bits relative to the underlying hypothesis class. The concept also enables us to measure how many bits are extracted by sorting algorithms when the input, and thereby the pairwise comparisons, are subject to fluctuations.

Title: "The flow of information in networks of probability vectors"
Author: Wray Buntine
Abstract: Recent non-parametric techniques for learning use the Pitman-Yor distribution as a way of handling hierarchical and networked nodes of probability vectors. This is done for sequential topic models, n-gram models, part-of-speech models and many others. The Pitman-Yors are an example of the general class of species sampling models that are, in a particular sense, conjugate to the multinomial and themselves. Conjugacy comes at the expense of introducing auxiliary latent variables that correspond to the portion of a sample that passes up the hierarchy/network. Information among nodes (probability vectors) in such a network therefore passes upwards in the form of partial samples and downwards in the form of prior parameters. One can therefore measure, from both these influences, the entropy change in vectors with this flow of information. The talk will present some results on the information flow and examples taken from a number of networks.

Title: "Approximate Iterative Bayes Optimal Estimates for High-Rate Sparse Superposition Codes"
Author: Sanghee Cho & Andrew Barron
Abstract: This paper is concerned with sparse superposition codes with iterative term selection for the additive white Gaussian noise channel with power control. In particular, we consider a soft-decision decoder with Bayes optimal estimates at each step, presuming a uniform prior on the choice of the terms that are sent. The Bayes optimal estimates are formulated and shown to have a martingale property that provides alternative representations of the posterior probability of error. Since the Bayes optimal estimates are infeasible, an approximation method is suggested. We analyze the performance of the approximation method in comparison with the infeasible estimates.

Title: "MODEL SELECTION IN A SETTING WITH LATENT VARIABLES"
Author: Ralf Eggeling, Teemu Roos, Petri Myllymäki & Ivo Grosse
Abstract: [PDF]

Title: "Estimating the structure of interacting coordinates for a multivariate stochastic process"
Author: Jesus Enrique Garcia & Verónica Andrea González-López
Abstract: In this paper we address the problem of extracting the dependence structure between the coordinates of a multivariate variable-length Markov chain.
In our setting we have a set of k different sources. At each time t each source produces a letter in the alphabet A={0,1}. The sources interact with each other depending on the past states of the set of k sources.
We want to estimate a partition of the past such that two possible pasts are in the same part of the partition if and only if, given either of these two pasts, the set of interacting sources is the same. We also want to obtain, for each possible past, the set of sources which interact with each other. As an example, consider a very simplified model of interacting neurons. Take a set of k neurons and discretize time into intervals of a fixed size. For each time interval we say that the value of a particular neuron is 1 if there was at least one spike from that neuron in that interval.
In this setting, two neurons interact if the firing of one of them changes the probability that the other one fires. This interaction can be of either kind: it can increase or decrease the firing probability of the other neuron. Throughout this work we use the family of partition Markov models (PMM).
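
A minimal sketch of the discretization described above: spike times are binarized per neuron and per time bin into the alphabet A={0,1}. The function and the example data are illustrative, not taken from the paper.

    import numpy as np

    def binarize_spikes(spike_times, n_neurons, t_max, bin_size):
        """Return an (n_bins x n_neurons) binary matrix over the alphabet A = {0, 1}:
        entry (t, i) is 1 iff neuron i fired at least once in time bin t."""
        n_bins = int(np.ceil(t_max / bin_size))
        X = np.zeros((n_bins, n_neurons), dtype=int)
        for neuron, times in spike_times.items():
            bins = (np.asarray(times) // bin_size).astype(int)
            X[bins[bins < n_bins], neuron] = 1
        return X

    # three neurons observed for 1 second, binned at 50 ms
    spikes = {0: [0.012, 0.013, 0.44], 1: [0.41, 0.90], 2: []}
    print(binarize_spikes(spikes, n_neurons=3, t_max=1.0, bin_size=0.05))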

Title: "DETECTING REGIME CHANGES IN MARKOV MODELS"
Author: Verónica Andrea González-López
Abstract: [PDF]

Title: " Regret Analysis of a Bandit Problem for Normal Distributions with Unknown Means and Variances"
Author: Junya Honda
Abstract: We consider a stochastic multiarmed bandit problem in which the reward of each arm follows a normal distribution with unknown mean and variance. This normal distribution model is one of the most natural settings for stochastic bandit problems, and many algorithms have been proposed for it. However, although some algorithms have been numerically shown to perform near the theoretical bound, it has been unknown whether they can actually achieve the theoretical regret bound. In this talk, we analyze the UCB algorithm considered by Burnetas and Katehakis and derive a finite-time regret bound which achieves the theoretical bound asymptotically. We also consider the extension to general exponential families.

Title: "Optimization of probability measure and its applications in information theory"
Author: Shiro Ikeda
Abstract: Since Shannon's seminal work, channel capacity has been a fundamental quantity in information theory. The general definition is formulated as an optimization problem over a probability measure under some moment constraints. Similarly, the definition of the rate-distortion function is formulated as a probability measure optimization problem. We have observed that the optimal measure becomes discrete even if continuous measures are allowed. The same phenomena are observed in Bayesian statistics, where the channel capacity problem is deeply related to the reference prior. In this talk, the background of the problem is introduced with some examples, and its applications to communication theory are discussed.
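
As a concrete instance of optimizing over a probability measure, the capacity of a discrete memoryless channel can be computed by the classical Blahut-Arimoto iteration. The sketch below is a generic textbook example, not the continuous-measure setting of the talk.

    import numpy as np

    def blahut_arimoto(W, n_iter=200):
        """Capacity (in bits) of a discrete memoryless channel with transition
        matrix W[x, y] = p(y|x), by alternating optimization over the input measure."""
        p = np.full(W.shape[0], 1.0 / W.shape[0])      # start from the uniform input distribution
        for _ in range(n_iter):
            q = p[:, None] * W
            q /= q.sum(axis=0, keepdims=True)          # posterior q(x|y) under the current input law
            r = np.exp((W * np.log(q + 1e-300)).sum(axis=1))
            p = r / r.sum()                            # re-weighted input distribution
        q = p[:, None] * W
        q /= q.sum(axis=0, keepdims=True)
        capacity = (p[:, None] * W * np.log2(q / p[:, None] + 1e-300)).sum()
        return capacity, p

    # binary symmetric channel with crossover probability 0.1; capacity = 1 - h(0.1) ~ 0.531 bits
    W = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
    print(blahut_arimoto(W))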

Title: "Bayesian predictive densities when the distributions of data and target variables are different"
Author: Fumiyasu Komaki
Abstract: Bayesian predictive densities are investigated from the viewpoint of information geometry. The performance of predictive densities is evaluated by the Kullback-Leibler divergence. When the observed data x and the target variable y have different distributions, a new metric, which we call the predictive metric, and the volume element based on it play essential roles corresponding to the Fisher-Rao metric and the Jeffreys prior in the conventional setting.

Title: TBA
Author: Anjali Mazumder
Abstract: TBA

Title: "Network Coding and Its Application to Content Centric Networking"
Author: Shigeki Miyake & Hitoshi Asaeda
Abstract: Content Centric Networking (CCN) is one of the predominant proposals for the next-generation content distribution platform.
In CCN, each router in the network is equipped with a large cache memory and becomes a content router, which can be regarded as a temporary content server.
This mechanism can reduce the concentration of data traffic at the original content server.

To enhance the merits of CCN, we propose an application of network coding (NWC) to CCN.
By applying network coding, distributed storage of cached data can be realized, and multi-path transmission between content routers and an access router (or an end user) becomes possible.
CCN with NWC can provide higher throughput for end users and reduce the network load.
Some issues to be solved to realize CCN with NWC are also specified.
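
A minimal sketch of the random linear network coding idea underlying CCN with NWC (our illustration, not the authors' protocol): a content router serves random XOR-combinations of cached chunks, and a receiver decodes as soon as it has collected k linearly independent coded packets, regardless of which routers or paths delivered them.

    import numpy as np

    rng = np.random.default_rng(0)

    def gf2_eliminate(M, ncols):
        """Gauss-Jordan elimination over GF(2) on the first `ncols` columns.
        Returns the reduced matrix and its rank with respect to those columns."""
        M = np.array(M, dtype=np.uint8) % 2
        rank = 0
        for col in range(ncols):
            pivot = next((r for r in range(rank, len(M)) if M[r, col]), None)
            if pivot is None:
                continue
            M[[rank, pivot]] = M[[pivot, rank]]
            for r in range(len(M)):
                if r != rank and M[r, col]:
                    M[r] ^= M[rank]          # row addition over GF(2) is XOR
            rank += 1
        return M, rank

    # a content router caches k chunks and serves coded packets = random XOR-combinations
    k, chunk_bits = 4, 16
    chunks = rng.integers(0, 2, size=(k, chunk_bits), dtype=np.uint8)

    received = []                            # each packet carries its coding vector plus the payload
    while True:
        c = rng.integers(0, 2, size=k, dtype=np.uint8)   # random GF(2) coding vector
        payload = (c @ chunks) % 2                       # XOR of the selected chunks
        received.append(np.concatenate([c, payload]))
        # decode as soon as k innovative (GF(2)-independent) packets have arrived
        reduced, rank = gf2_eliminate(np.array(received), k)
        if rank == k:
            break

    recovered = reduced[:k, k:]
    print(np.array_equal(recovered, chunks), "after", len(received), "packets")

In deployed systems the coefficients are usually drawn from a larger field such as GF(2^8), so that almost every received packet is innovative; GF(2) merely keeps the sketch short.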

Title: "Distance-based Change-Point Detection with Entropy Estimation"
Author: Noboru Murata, Kensuke Koshijima & Hideitsu Hino
Abstract: Change-point detection is the problem of finding time points where the stochastic properties of a time series suddenly change. This problem has been actively discussed in various contexts in the area of data mining. In this talk, we introduce a non-parametric method for change-point detection based on an entropy estimator. Since our entropy estimator is based on distances between data points, it can be applied not only to ordinary vectorial data but also to variable-length and non-vectorial data. We will demonstrate its validity through numerical experiments with real-world data.
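
The abstract does not specify the estimator's exact form; the sketch below uses a standard nearest-neighbour (Kozachenko-Leonenko) entropy estimate on two sliding windows, purely to illustrate how a distance-based entropy estimator yields a change-point score.

    import numpy as np
    from math import gamma, log, pi

    EULER = 0.5772156649015329

    def kl_entropy(X):
        """Kozachenko-Leonenko nearest-neighbour entropy estimate (in nats)
        for an (n x d) sample; it only uses pairwise distances."""
        X = np.atleast_2d(np.asarray(X, dtype=float))
        n, d = X.shape
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(dist, np.inf)
        eps = dist.min(axis=1)                       # distance to the nearest neighbour
        c_d = pi ** (d / 2) / gamma(d / 2 + 1)       # volume of the unit d-ball
        return d * np.mean(np.log(eps + 1e-12)) + log(c_d) + EULER + log(n - 1)

    def change_scores(x, w=50):
        """Score each time t by the absolute entropy difference between the
        window ending at t and the window starting at t; peaks suggest changes."""
        x = np.asarray(x, dtype=float).reshape(len(x), -1)
        return np.array([abs(kl_entropy(x[t - w:t]) - kl_entropy(x[t:t + w]))
                         for t in range(w, len(x) - w)])

    # toy series: the variance changes at t = 300
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0, 1.0, 300), rng.normal(0, 3.0, 300)])
    w = 50
    scores = change_scores(x, w=w)
    print("detected change near t =", w + int(np.argmax(scores)))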

Title: "Information Bounds and Poisson Inference"
Author: Joseph A. O'Sullivan
Abstract: A wide range of problems in classification and imaging rely on Poisson data whose mean is linearly related to underlying quantities of interest. In imaging, these include optical, hyperspectral, scatter tomography, and positron emission tomography. Classification problems use such indirect measurements as the basis for decisions. The performance of both classification and imaging problems using Poisson data can be addressed using a unified framework. The performance degradation relative to direct measurement is determined by a combination of the object being measured and the point spread function. This same framework provides a way to quantify the impact of errors in the model on system performance.

Title: "Information capacity of full-body movements"
Author: Teemu Roos
Abstract: We present a novel metric for information capacity of full-body movements where throughput (bits per second) is calculated as mutual information in repeated motor sequences. It is affected by the complexity of movements and the precision with which an actor reproduces them. Computation requires decorrelating co-dependencies of movement features (e.g., wrist and elbow) and temporal alignment of sequences. Human-computer interaction researchers can use the metric as an analysis tool when designing and studying user interfaces. This is joint work with Antti Oulasvirta, Arttu Modig, and Laura Leppanen.

Title: "Using the Method of Nearby Measures in Superposition Coding with a Bernoulli Dictionary"
Author: Cynthia Rush & Andrew Barron
Abstract: The work discussed in this presentation is motivated by the problem of decoding superposition codes built from a Bernoulli dictionary. Under this encoding scheme, the output Y is a linear combination of the Bernoulli random variables and noise. Decoding then requires the study of the conditional distribution of the Bernoulli random variables given the output Y. We find this distribution analogous to that of summands given the sum, as studied by Cover and Van Campenhout and by Imre Csiszár. Motivated by this work, we wish to bound the Rényi relative entropy distance between the true conditional distribution and the product of tilted distributions, in order to show that events that are exponentially rare under the product distribution remain rare under the true distribution.

Title: "Estimation in Markov processes"
Author: Narayana Prasad Santhanam
Abstract: We observe a length-n sample generated by an unknown, stationary ergodic Markov process ("model") over a finite alphabet A. Motivated by applications in what are known as backplane channels, given any string w of symbols from A we want estimates of the conditional probability distribution of symbols following w ("model parameters"), as well as the stationary probability of w.

Two distinct problems that complicate estimation in this setting are (i) long memory, and (ii) slow mixing, which could happen even with only one bit of memory. For our problem, any consistent estimator can only converge pointwise over the class of all ergodic Markov models. Namely, given any estimator and any sample size n, the underlying model could be such that the estimator performs poorly on a sample of size n with high probability. But can we look at a length-n sample and identify if an estimate is likely to be accurate?

Since the memory is unknown a priori, a natural approach is to estimate a potentially coarser model with memory k_n = O(log n). As n grows, the estimates get refined and this approach is consistent, with the above scaling of k_n also known to be essentially optimal. But while this approach is effective asymptotically, the situation is quite different when we want the best answers possible from a length-n sample, rather than just consistency. Combining results in universal compression with Aldous' coupling arguments, we obtain sufficient conditions on the length-n sample (even for slow mixing models) to identify when naive (i) estimates of the model parameters and (ii) estimates related to the stationary probabilities are accurate; and we also bound the deviations of the naive estimates from the true values.
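
As a baseline illustration of the "naive estimates" discussed above, the conditional and stationary probabilities for contexts w of length k_n can be obtained by simple counting; the notation and the particular choice of k_n below are ours.

    import math
    from collections import Counter, defaultdict

    def markov_estimates(sample, k=None):
        """Naive estimates from a length-n sample over a finite alphabet:
        conditional probabilities p(a | w) for contexts w of length k, and the
        empirical (stationary) probability of each context w."""
        n = len(sample)
        if k is None:
            k = max(1, int(math.log2(n) / 4))   # memory k_n = O(log n); the constant is arbitrary
        ctx_counts = Counter()
        trans_counts = defaultdict(Counter)
        for i in range(n - k):
            w = sample[i:i + k]
            ctx_counts[w] += 1
            trans_counts[w][sample[i + k]] += 1
        stationary = {w: c / (n - k) for w, c in ctx_counts.items()}
        conditional = {w: {a: c / ctx_counts[w] for a, c in succ.items()}
                       for w, succ in trans_counts.items()}
        return stationary, conditional

    stat, cond = markov_estimates("0110100110010110" * 50, k=2)
    print(cond["01"])    # estimated distribution of the symbol following context "01"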

Title: "Lossless contour compression using chain-code representations and context tree coding"
Author: Ionut Schiopu & Ioan Tăbuş
Abstract: In this paper we experimentally study the context tree coding of contours in two alternative chain-code representations. The contours of the image regions are intersecting, resulting in many contour segments, each contour segment being represented by its starting point and a subsequence of chain-code symbols. A single stream of symbols is obtained, under a given ordering of the segments, by concatenating their subsequences. We present efficient ways to handle the challenging task of ordering the contour segments. The necessary starting points and the chain-code symbols are encoded using context tree coding. The coding is performed in a semi-adaptive way, designing the context tree at the encoder by dynamic programming and encoding its structure as side information. The symbols are encoded using adaptively collected distributions at each context of the encoded tree. For depth map images and fractal images, we study the tree depth of the optimal context tree and the similarity of the context trees that turn out to be optimal for each image.
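
For readers unfamiliar with the representation, a contour segment can be turned into chain-code symbols by recording the move from each pixel to the next. The simple 4-connectivity variant below is only an illustration; the paper's two alternative representations are not reproduced here.

    # map from (dx, dy) to a chain-code symbol, 4-connectivity: right, up, left, down
    MOVES = {(1, 0): 0, (0, -1): 1, (-1, 0): 2, (0, 1): 3}

    def chain_code(segment):
        """Encode a contour segment (list of adjacent pixel coordinates) as its
        starting point plus a sequence of chain-code symbols."""
        (x0, y0) = segment[0]
        symbols = []
        for (xa, ya), (xb, yb) in zip(segment, segment[1:]):
            symbols.append(MOVES[(xb - xa, yb - ya)])
        return (x0, y0), symbols

    # one stream of symbols is obtained by concatenating the segments in a chosen order
    segments = [[(0, 0), (1, 0), (1, 1), (0, 1)], [(5, 2), (5, 3), (4, 3)]]
    starts, stream = [], []
    for seg in segments:
        start, syms = chain_code(seg)
        starts.append(start)
        stream.extend(syms)
    print(starts, stream)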

Title: "The MDL principle for arbitrary data: either discrete or continuous or none of them"
Author: Joe Suzuki
Abstract: The MDL principle selects a model that minimizes the description length given a way of describing a data sequence. However, the existing methods realizing the principle assume that each random variable in the sequence takes values in a finite set (finite alphabet). In general, X : Ω → R is a random variable if X is F-measurable for the sample and event spaces (Ω, F), and a measure q over the entire sequences of length n is universal if

(1/n) log( dp^n / dq^n (X^n) ) → 0

as n → ∞ with probability one for any probability measure p. This paper extends the definition of universality in terms of Radon-Nikodym derivatives and constructs a universal measure for the extended MDL principle, so that the random variable may be arbitrary: discrete, continuous, or neither. The extended notion of the MDL principle has many applications, such as estimating the structure of a Bayesian network and feature selection in pattern recognition when discrete and continuous variables are present. In fact, in reality, any database contains both discrete and continuous fields.

Title: "Dual Averaging and Proximal Gradient Descent for Online Alternating Direction Multiplier Method"
Author: Taiji Suzuki
Abstract: We develop new stochastic optimization methods that are applicable to a wide range of structured regularizations. Basically, our methods are combinations of basic stochastic optimization techniques and the Alternating Direction Multiplier Method (ADMM). ADMM is a general framework for optimizing a composite function and has a wide range of applications. We propose two types of online variants of ADMM, which correspond to online proximal gradient descent and regularized dual averaging, respectively. The proposed algorithms are computationally efficient and easy to implement. It is shown that our methods yield O(1/\sqrt{T}) convergence of the expected risk, and O(log(T)/T) convergence for a strongly convex loss. Numerical experiments show the effectiveness of our methods in learning tasks with structured sparsity such as overlapped group lasso.
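
For context, here is a minimal batch ADMM iteration for the lasso, a typical composite objective. It is only the template the talk builds on, not the proposed online variants.

    import numpy as np

    def soft_threshold(v, t):
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def admm_lasso(A, b, lam, rho=1.0, n_iter=200):
        """Batch ADMM for min_x 0.5*||Ax - b||^2 + lam*||z||_1  subject to  x = z."""
        n = A.shape[1]
        x = z = u = np.zeros(n)
        AtA_rho = np.linalg.inv(A.T @ A + rho * np.eye(n))   # cached factor for the x-update
        Atb = A.T @ b
        for _ in range(n_iter):
            x = AtA_rho @ (Atb + rho * (z - u))              # quadratic subproblem
            z = soft_threshold(x + u, lam / rho)             # proximal step for the l1 term
            u = u + x - z                                    # dual (multiplier) update
        return z

    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 20))
    x_true = np.zeros(20); x_true[:3] = [2.0, -1.5, 1.0]     # sparse ground truth
    b = A @ x_true + 0.1 * rng.normal(size=100)
    print(np.round(admm_lasso(A, b, lam=5.0), 2))

The online variants in the talk replace the exact quadratic subproblem with stochastic proximal-gradient or dual-averaging steps so that each update touches only one (or a few) data points.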

Title: "Time- and space-varying context tree models for image coding and analysis"
Author: Ioan Tăbuş & Jorma Rissanen
Abstract: The talk will present ways to refine the context tree coding traditionally used for image compression, by considering time dependency of the strings at the contexts in the model, and by grouping the contexts according to spatial information. The practical outcome of the approach is improved coding performance; additionally, it brings the possibility of using the refined models for image analysis tasks, e.g. for refining an initially given segmentation.

Title: Asymptotically Minimax Prediction for Markov Sources
Author: Jun'ichi Takeuchi
Abstract: We consider the problem of sequential prediction by the Jeffreys mixture for Markov sources with a finite alphabet. For the memoryless case, a slight modification of the Jeffreys mixture was shown to be asymptotically minimax with respect to logarithmic cumulative regret. For the binary alphabet {0,1}, it yields the Laplace-like rule (k+1/2)/(n+1), where n is the length of the given sequence and k is the number of occurrences of `1' in it. This is extended to the case of a Finite State Machine (FSM; a certain kind of Markov model), where the Jeffreys prior contains stationary probabilities and differs from the Dirichlet prior. As a result, the prediction rule is not of the form (k+alpha)/(n+beta) and strict computation is not easy. In this talk, we explain this situation and further discuss the case of the parametric model defined by a context tree. We also report results of numerical experiments using an approximation formula for the Jeffreys rule and the Monte Carlo method. This talk is based on joint work with Andrew Barron and Tsutomu Kawabata.
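
For the binary memoryless case, the rule quoted above has a one-line implementation:

    def jeffreys_binary_predict(sequence):
        """Probability that the next symbol is 1 under the Jeffreys (Dirichlet(1/2, 1/2))
        mixture for a binary memoryless source: (k + 1/2) / (n + 1)."""
        n = len(sequence)
        k = sum(1 for s in sequence if s == 1)
        return (k + 0.5) / (n + 1)

    print(jeffreys_binary_predict([1, 0, 1, 1]))   # (3 + 0.5) / (4 + 1) = 0.7

As the abstract notes, the FSM and context-tree cases do not admit such a closed form.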

Title: Combinatorial online prediction by continuous relaxation
Author: Eiji Takimoto
Abstract: We study online linear optimization problems over concept classes that are defined in some combinatorial ways.
Typically, those concept classes contain a finite but exponentially large number of concepts, and hence a complexity issue arises.
In this paper, we give a universal and efficient algorithm that works for a wide family of concept classes such that the corresponding offline problems can be efficiently approximated by continuous relaxation.

Title: "Convex Optimization for Tensor Decomposition"
Author: Ryota Tomioka
Abstract: A tensor is a natural extension of a matrix and appears widely in many application areas. Conventionally, tensor decomposition has been tackled through non-convex optimization, leaving the optimality of the solution unclear. In this talk, I introduce a class of structural regularization terms that extends the nuclear norm and enables estimation of low-rank tensors through convex optimization problems. I will talk about different formulations, algorithms, and performance guarantees for them. Furthermore, I will discuss the limitations of the current approach and possible solutions.
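
One member of this class of regularizers is the overlapped trace norm, the sum of the nuclear norms of the mode unfoldings. The sketch below evaluates only the penalty; the full convex estimation procedure is not shown.

    import numpy as np

    def unfold(T, mode):
        """Mode-k unfolding: move axis `mode` to the front and flatten the rest."""
        return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

    def overlapped_trace_norm(T):
        """Sum over modes of the nuclear norm of each unfolding; small values
        favour tensors that are low-rank in every mode simultaneously."""
        return sum(np.linalg.norm(unfold(T, k), ord='nuc') for k in range(T.ndim))

    rng = np.random.default_rng(0)
    # a rank-1 tensor incurs a smaller penalty than a random tensor of the same Frobenius norm
    a, b, c = rng.normal(size=5), rng.normal(size=6), rng.normal(size=7)
    low_rank = np.einsum('i,j,k->ijk', a, b, c)
    noise = rng.normal(size=(5, 6, 7))
    noise *= np.linalg.norm(low_rank) / np.linalg.norm(noise)
    print(overlapped_trace_norm(low_rank), overlapped_trace_norm(noise))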

Title: "Achievability of Asymptotic Minimax Optimality in Online and Batch Coding"
Author: Kazuho Watanabe, Teemu Roos & Petri Myllymäki
Abstract: The normalized maximum likelihood model achieves the minimax coding (log-loss) regret for data of fixed sample size $n$. However, it is a batch strategy, i.e., it requires that $n$ be known in advance. Furthermore, it is computationally infeasible for most statistical models, and several computationally feasible alternative strategies have been devised. We characterize the achievability of asymptotic minimaxity by batch strategies (i.e., strategies that depend on $n$) as well as online strategies (i.e., strategies independent of $n$). On the one hand, we conjecture that for a large class of models, no online strategy can be asymptotically minimax. We prove that this holds under a slightly stronger definition of asymptotic minimaxity. Our numerical experiments support the conjecture about non-achievability by so-called last-step minimax algorithms, which are independent of $n$. On the other hand, we show that in the multinomial model, a Bayes mixture defined by the conjugate Dirichlet prior with a simple dependency on $n$ achieves asymptotic minimaxity for all sequences, thus providing a simpler asymptotic minimax strategy compared to earlier work by Xie and Barron. The numerical results also demonstrate superior finite-sample behavior by a number of novel batch and online algorithms.
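
To make the regret comparison concrete, the sketch below computes, for the Bernoulli model, the worst-case coding regret (in nats) of a Dirichlet(alpha, alpha) Bayes mixture and the exact NML (minimax) regret. The specific $n$-dependent prior analyzed in the talk is not reproduced here; alpha is left generic.

    import numpy as np
    from scipy.special import gammaln

    def log_max_lik(k, n):
        """log max_theta p_theta(x^n) for a Bernoulli sequence with k ones out of n."""
        k = np.asarray(k, dtype=float)
        with np.errstate(divide='ignore', invalid='ignore'):
            ones = np.where(k > 0, k * np.log(k / n), 0.0)
            zeros = np.where(n - k > 0, (n - k) * np.log((n - k) / n), 0.0)
        return ones + zeros

    def nml_regret(n):
        """Minimax (NML) regret: log of the Shtarkov sum for the Bernoulli model."""
        k = np.arange(n + 1)
        log_terms = gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1) + log_max_lik(k, n)
        return np.logaddexp.reduce(log_terms)

    def dirichlet_mixture_regret(n, alpha):
        """Worst-case regret over sequences of the Dirichlet(alpha, alpha) Bayes mixture."""
        k = np.arange(n + 1)
        log_mix = (gammaln(k + alpha) + gammaln(n - k + alpha) + gammaln(2 * alpha)
                   - gammaln(n + 2 * alpha) - 2 * gammaln(alpha))
        return np.max(log_max_lik(k, n) - log_mix)

    n = 1000
    print("NML regret            :", nml_regret(n))
    print("Jeffreys (alpha = 1/2):", dirichlet_mixture_regret(n, 0.5))
    print("Laplace  (alpha = 1)  :", dirichlet_mixture_regret(n, 1.0))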

Title: "WAIC and WBIC are Information Criteria for Singular Statistical Model Evaluation"
Author: Sumio Watanabe
Abstract: Many learning machines with hierarchical structures such as neural networks, normal mixtures, Boltzmann machines, and hidden Markov models are not regular but singular statistical models. In singular models, neither AIC, BIC, nor DIC can be used in statistical model evaluation, because the log likelihood function cannot be approximated by any quadratic form even asymptotically. Recently, we proposed the widely applicable information criterion (WAIC) and the widely applicable Bayesian information criterion (WBIC) for singular statistical model evaluation.

In this talk, we compare AIC with WAIC, BIC with WBIC, and WAIC with WBIC. WAIC and WBIC are asymptotically unbiased estimators of the Bayes generalization loss and the minus log marginal likelihood even if the posterior distribution cannot be approximated by any normal distribution. For regular statistical models, WAIC and WBIC are asymptotically the same random variables as AIC and BIC, respectively, whereas in singular cases they are different.

Title: Stochastic Complexity for Piecewise Stationary Memoryless Sources.
Author: Kenji Yamanishi, Hiroki Kanazawa
Abstract: A piecewise stationary memoryless source (PSMS) is a probabilistic model for which the parameter value changes over segments. We are concerned with the issue of tracking and detecting parameter changes of PSMSs. As for the performance measure, we consider the minimax regret for PSMSs and derive stochastic complexity for PSMSs as an optimal solution to it.
Since the stochastic complexity is hard to calculate exactly, we introduce an algorithm that approximately attains it.
We design it with the technique of discretizing the parameter space and derive an upper bound on its total code-length, which is related to the complexity of PSMSs.

Title: "Distribution-Based Estimation of the Latent Variables and its Accuracy"
Author: Keisuke Yamazaki
Abstract: Hierarchical probabilistic models such as a mixture of distributions and a hidden Markov model are widely used for unsupervised learning. They consist of observable and latent variables, which represent the observed data and the underlying data-generating process, respectively. There are two types of use depending on which variable is estimated: prediction of future/unseen data is the observable-variable (OV) estimation, and analysis of how the given data were generated is the latent-variable (LV) estimation. The asymptotic accuracy of the OV estimation has been elucidated for many models and estimation methods such as the maximum-likelihood (ML) and Bayes methods. On the other hand, the LV estimation has not been sufficiently studied. In this talk, an error function to measure the accuracy of the LV estimation is formulated, and its asymptotic form is derived for the ML and Bayes methods. The results provide a distribution-based evaluation of unsupervised learning and show that the Bayes method has better accuracy than the ML method.

Title: "Large Alphabet Coding and Prediction"
Author: Xiao Grace Yang
Abstract: This paper introduces a convenient strategy for coding and predicting sequences of independent, identically distributed random variables generated from a large alphabet of size m. In particular, the size of the sample is allowed to be random. The employment of a Poisson model and tilting simplifies the implementation and analysis through independence. The resulting strategy is optimal within the class of distributions satisfying a moment condition, and is close to optimal for a smaller class, the class of distributions with an analogous condition on the counts. Moreover, the method can be used to code and predict sequences in a subset with the tail counts satisfying a given condition. It can also be applied to classes of distributions with some given feature, for example, an envelope class.