Summary
Multivariate density estimation is a fundamental problem in applied statistics and machine learning. Given a collection of data sampled from an unknown distribution, the task is to approximately reconstruct the generative distribution. There are two approaches to the problem: the parametric approach, in which the approximate distribution is represented by a model from a predetermined family, and the non-parametric approach. In this thesis, we adopt the parametric approach and investigate the use of a model family called latent tree models for density estimation. Latent tree models are tree-structured Bayesian networks in which leaf nodes represent observed variables and internal nodes represent hidden variables. Such models can represent complex relationships among observed variables while still admitting efficient inference. Consequently, they are a desirable tool for density estimation.

This thesis is the first to study latent tree models for density estimation; they have previously been investigated for clustering and latent structure discovery. Several algorithms for learning latent tree models have been proposed. The state of the art is an algorithm called EAST, which determines model structures through principled, systematic search and determines model parameters using the EM algorithm. EAST has been shown to achieve a good trade-off between fit to data and model complexity, and it is capable of discovering latent structures behind data. Unfortunately, it has a high computational complexity, which limits its applicability to density estimation problems.

In this thesis, we propose two algorithms for learning latent tree models specifically for density estimation. The two algorithms have distinct characteristics and are suitable for different applications. The first algorithm is called HCL.
HCL assumes a predetermined bound on model complexity and restricts itself to binary tree structures. It first builds a binary tree structure based on mutual information and then runs the EM algorithm once on the resulting structure to determine the parameters. As such, it is efficient and can handle large applications. The second algorithm is called Pyramid. Pyramid assumes no predetermined bound on model complexity and is not restricted to binary tree structures. It builds model structures using heuristics based on mutual information and local search. Pyramid is slower than HCL, but it is faster than EAST and produces models of only slightly lower quality.

We also study two applications of the density estimation techniques that we develop. The first application is approximate probabilistic inference in Bayesian networks. A Bayesian network represents a joint distribution over a set of random variables. It often happens that the network structure is so complex that performing inference directly on the network is computationally intractable. We propose to approximate the joint distribution with a latent tree model and to exploit the latent tree model for faster inference. The idea is to sample data from the Bayesian network, learn a latent tree model from the data offline, and then, online, make inference with the latent tree model instead of the original Bayesian network. HCL is used here because the sample size needs to be large to produce accurate approximations and because a bound on the online running time can be determined in advance. Empirical evidence shows that this method achieves good approximation accuracy at low online computational cost.

The second application is classification. A common approach to this task is to formulate it as a density estimation problem: one constructs a class-conditional density for each class and then applies Bayes' rule to classify new instances.
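To make the Bayes-rule formulation concrete, here is a minimal sketch in which hypothetical one-dimensional Gaussian class-conditional densities stand in for the learned models; the class names, priors, and parameters are illustrative assumptions, not taken from the thesis.

```python
import math

# Hypothetical class-conditional densities: each class gets a prior and a
# 1-D Gaussian. In the thesis these densities would be latent tree models;
# the Gaussians here are placeholders for illustration only.
CLASS_PARAMS = {
    "a": {"prior": 0.6, "mean": 0.0, "std": 1.0},
    "b": {"prior": 0.4, "mean": 3.0, "std": 1.0},
}

def gaussian_pdf(x, mean, std):
    """Density of N(mean, std^2) evaluated at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def classify(x):
    """Bayes rule: pick the class maximizing prior * class-conditional density.

    This is equivalent to maximizing the posterior P(class | x), since the
    normalizing constant P(x) is the same for every class.
    """
    return max(
        CLASS_PARAMS,
        key=lambda c: CLASS_PARAMS[c]["prior"]
        * gaussian_pdf(x, CLASS_PARAMS[c]["mean"], CLASS_PARAMS[c]["std"]),
    )
```

Swapping in a richer density estimator for each class (such as a latent tree model) changes only how the class-conditional densities are computed; the decision rule stays the same.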
We propose to estimate those class-conditional densities using either EAST or Pyramid. Empirical evidence shows that this method yields good classification performance. Moreover, the latent tree models built for the class-conditional densities are often meaningful, which helps build user confidence. A comparison between EAST and Pyramid reveals that Pyramid is significantly more efficient than EAST while achieving essentially the same classification performance.
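Both HCL and Pyramid build model structures from empirical mutual information between observed variables. As an illustration of that quantity alone (not of either algorithm), here is a minimal estimator, assuming the data for a pair of discrete variables is given as a list of (x, y) samples.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information (in nats) between two discrete
    variables, estimated from a list of (x, y) samples."""
    n = len(pairs)
    joint = Counter(pairs)          # counts of (x, y) pairs
    px = Counter(x for x, _ in pairs)  # marginal counts of x
    py = Counter(y for _, y in pairs)  # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p_xy * n * n / (px * py) equals p(x, y) / (p(x) * p(y))
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi
```

For perfectly correlated binary variables this yields log 2, and for independent variables it yields 0; in structure learning, pairs with high mutual information are the natural candidates to place close together in the tree.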