Mixture models are used for datasets that may contain data from several populations. Mixture modeling attempts to unravel such datasets: to identify the different components, and the distribution of data within each component.
Fitting mixture models, and performing statistical inference on the results, is a complex and challenging problem. The most basic question is "how many components?" Our work has focused on developing hypothesis tests that are valid for testing arbitrary numbers of components in smooth families of distributions. The methods we have developed use a score process, which measures the evidence for a mixture component centered at any given location. Considering the maximal evidence over all possible locations leads to problems involving the distribution of extreme values for certain stochastic processes. These distributions are approximated using results from work on boundary crossing probabilities.
We are also developing computational algorithms for fitting mixture models. Our current interest is in developing fast versions of the EM algorithm suitable for use in high-dimensional spaces.
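As a point of reference, the basic EM iteration for a mixture can be sketched in a few lines. The example below is only the textbook algorithm for a two-component, one-dimensional Gaussian mixture, not one of the fast high-dimensional variants under development; the function names, the quartile-based starting values, and the fixed iteration count are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, n_iter=200):
    """Textbook EM for a two-component 1-D Gaussian mixture (illustrative only)."""
    n = len(data)
    ordered = sorted(data)
    # Crude starting values (an assumption): quartiles for the means,
    # the overall standard deviation for both components.
    mu = [ordered[n // 4], ordered[(3 * n) // 4]]
    mean = sum(data) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    sigma = [sd, sd]
    weight = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior probability that each point came from each component.
        resp = []
        for x in data:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = max(p[0] + p[1], 1e-300)  # guard against underflow
            resp.append((p[0] / total, p[1] / total))
        # M-step: weighted updates of the mixing proportions, means and sds.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weight[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)  # avoid a degenerate component
    return weight, mu, sigma


# Simulated example: two well-separated populations of equal size.
random.seed(0)
sample = [random.gauss(-3.0, 1.0) for _ in range(300)]
sample += [random.gauss(3.0, 1.0) for _ in range(300)]
weights, means, sds = em_two_gaussians(sample)
print(weights, means, sds)
```

Each iteration touches every observation for every component, which is the cost that faster variants aim to reduce.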
My interest in boundary crossing problems was stimulated by
undergraduate work in Brownian motion, and extended to
discrete space processes through my thesis work on Poisson
process change points.
Mainly, I use methods based on approximating first hitting time distributions. This approach requires computing two components: first, the probability that the process is on the boundary at time t; and second, given that it is on the boundary at time t, the probability that this is the first hit. These questions lead to recursive methods and integral equations that can be used to compute the crossing probabilities, and also to simple approximations based on large deviation methods and the related tangent approximation.
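For the simplest case, a binary random walk and a constant boundary, the first hitting time distribution can be computed exactly by a forward recursion. The sketch below is an illustration under those assumptions only; it is not the algorithm used in any particular software, and the function names are hypothetical. For the symmetric walk, the reflection principle gives a closed form that serves as a check.

```python
from math import comb


def first_hit_probs(n_steps, boundary, p_up=0.5):
    """f[t] = P(walk first reaches `boundary` at step t), for t = 0..n_steps.

    The walk starts at 0 and moves +1 with probability p_up, else -1.
    `boundary` is assumed to be a positive integer.
    """
    alive = {0: 1.0}  # position -> probability of being there, boundary not yet hit
    f = [0.0] * (n_steps + 1)
    for t in range(1, n_steps + 1):
        nxt = {}
        for pos, prob in alive.items():
            for step, q in ((1, p_up), (-1, 1.0 - p_up)):
                new = pos + step
                if new >= boundary:
                    f[t] += prob * q  # mass absorbed: first hit occurs at step t
                else:
                    nxt[new] = nxt.get(new, 0.0) + prob * q
        alive = nxt
    return f


def reflection_crossing_prob(n_steps, boundary):
    """Symmetric walk only: P(max_k S_k >= b) = P(S_n >= b) + P(S_n >= b + 1)."""
    def tail(m):
        count = sum(comb(n_steps, k) for k in range(n_steps + 1)
                    if 2 * k - n_steps >= m)
        return count / 2 ** n_steps
    return tail(boundary) + tail(boundary + 1)


f = first_hit_probs(10, 2)
print(sum(f), reflection_crossing_prob(10, 2))
```

The recursion carries over to curved boundaries by letting the absorption test depend on t; the closed-form check does not.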
My Crossings software provides
subroutines implementing numerical algorithms for several different
classes of process. The present code supports binary random walks;
Poisson and empirical processes; Gaussian Markov processes such as
Brownian motion; and pattern matching problems and discrete scan
statistics.
My work on local regression has covered theory, computational methods,
inferential procedures, and relation to other smoothing techniques
such as kernel methods.
I have also extended the local regression idea
to many local likelihood settings; in particular,
density estimation and several models for censored survival data.
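The basic local regression idea can be sketched briefly: to estimate the curve at a point x0, fit a straight line by weighted least squares, with weights that decay to zero as observations move away from x0. The example below is a minimal illustration, assuming a tricube weight function and a fixed bandwidth h; the function names are hypothetical and the code is independent of LOCFIT.

```python
def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 on (-1, 1), zero elsewhere."""
    a = abs(u)
    return (1.0 - a ** 3) ** 3 if a < 1.0 else 0.0


def local_linear(x0, xs, ys, h):
    """Local linear estimate at x0 with bandwidth h (illustrative sketch).

    Solves the 2x2 weighted normal equations for y ~ a + b * (x - x0)
    and returns the intercept a, which is the fitted value at x0.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for x, y in zip(xs, ys):
        d = x - x0
        w = tricube(d / h)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * y
        t1 += w * d * y
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("too few distinct points within the bandwidth")
    return (s2 * t0 - s1 * t1) / det


# A local linear fit reproduces a straight line exactly, including at the boundary.
xs = [i / 20 for i in range(21)]
ys = [2.0 + 3.0 * x for x in xs]
print(local_linear(0.5, xs, ys, 0.3), local_linear(0.0, xs, ys, 0.3))
```

Local likelihood replaces the weighted least squares criterion with a weighted log-likelihood, keeping the same localization by weights.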
Inferential issues have been a major interest of mine. I developed methods for finding simultaneous confidence bands for smooth regression curves, based on volume-of-tubes formulae. The resulting bands are fairly straightforward to compute, and in numerical studies I found that they significantly outperformed other methods (in particular, the bootstrap), even when the underlying normality assumption was not satisfied.
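To give a sense of why such bands are straightforward to compute, the sketch below solves for the critical value c in the one-dimensional, known-variance form of the tube approximation, P(sup |Z(x)| > c) ≈ (κ0/π) exp(-c²/2) + 2(1 - Φ(c)). Here κ0 is a geometric constant (the length of the path traced on the unit sphere by the normalized weight vectors of the smoother) and is treated as a given input; this form of the approximation, the value of κ0 used, and the function names are assumptions made for illustration.

```python
import math


def tube_tail(c, kappa0):
    """Approximate P(sup |Z(x)| > c): 1-D case, known variance (assumed form)."""
    return (kappa0 / math.pi) * math.exp(-0.5 * c * c) + math.erfc(c / math.sqrt(2.0))


def critical_value(alpha, kappa0, lo=0.0, hi=20.0):
    """Solve tube_tail(c, kappa0) = alpha for c by bisection.

    tube_tail is decreasing in c, so bisection on [lo, hi] is sufficient.
    """
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tube_tail(mid, kappa0) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# kappa0 = 0 recovers the usual pointwise normal critical value;
# kappa0 = 5.0 is an arbitrary illustrative value.
c_pointwise = critical_value(0.05, 0.0)
c_simultaneous = critical_value(0.05, 5.0)
print(c_pointwise, c_simultaneous)
```

The band is then the fitted curve plus or minus c times the pointwise standard error, so the only extra work over a pointwise interval is computing κ0 and solving one scalar equation.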
Another related interest is in model selection issues. My focus has been on methods such as cross validation, which have often been roundly criticized as being "too variable" in comparison with "modern" methods. In fact, my research showed that cross validation methods, when properly interpreted and understood, are far more informative than the "modern" methods; in particular, the latter pay for their reduced variability by missing important features when given difficult smoothing problems. I have also extended cross validation and AIC to local likelihood methods, and developed methods for locally adaptive bandwidth choice based on these criteria.
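The cross validation criterion itself is simple to state in code. The sketch below is a minimal illustration, assuming a Gaussian-kernel local constant smoother and a single global bandwidth chosen from a small candidate grid; it is not the locally adaptive procedure described above, and the function names and simulated data are hypothetical.

```python
import math
import random


def kernel_fit(x0, xs, ys, h, skip=None):
    """Local constant (kernel-weighted average) fit at x0, optionally omitting one point."""
    num = den = 0.0
    for j, (x, y) in enumerate(zip(xs, ys)):
        if j == skip:
            continue
        w = math.exp(-0.5 * ((x - x0) / h) ** 2)
        num += w * y
        den += w
    return num / den


def loocv_score(xs, ys, h):
    """Leave-one-out cross validation: mean squared error of predicting each
    y_i from a fit that excludes observation i."""
    n = len(xs)
    return sum((ys[i] - kernel_fit(xs[i], xs, ys, h, skip=i)) ** 2
               for i in range(n)) / n


def choose_bandwidth(xs, ys, candidates):
    """Return the candidate bandwidth with the smallest cross validation score."""
    return min(candidates, key=lambda h: loocv_score(xs, ys, h))


# Simulated example: a sine curve observed with noise.
random.seed(1)
xs = [i / 49 for i in range(50)]
ys = [math.sin(2.0 * math.pi * x) + random.gauss(0.0, 0.2) for x in xs]
grid = [0.02, 0.05, 0.1, 0.3, 1.0]
scores = {h: loocv_score(xs, ys, h) for h in grid}
print(scores, choose_bandwidth(xs, ys, grid))
```

Inspecting the whole score profile over the grid, rather than only its minimizer, is closer in spirit to interpreting cross validation than to treating it as a black-box selector.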
Much of my work in local regression is covered in my book, Local Regression and Likelihood, Springer, 1999. I have also developed an extensive software system, LOCFIT, implementing local regression and likelihood methods.
Applications of the tube formula in statistics are numerous, particularly in the area of simultaneous inference and testing. Despite the mathematical beauty and simplicity of the end results, the method remains underutilized in statistics, with many authors preferring simulations that are more computationally intensive, less accurate, and non-reproducible.
I have developed an implementation of the tube formula for statistical problems, in conjunction with my work on simultaneous confidence bands, and related work in local regression. It has now matured to the stage where it can be used as a separate linkable library, libtube.