Catherine Loader


Research Areas

Mixture Models

(Joint work with Professor Ramani S. Pilla, Case Western Reserve University.)

Mixture models are used for datasets that may contain mixtures of data from several populations. Mixture modeling attempts to unravel such datasets: identify the different components, and the distribution of data from each of the components.

Fitting mixture models, and performing statistical inference on the results, is a complex and challenging problem. The most basic question is "how many components?" Our work has focused on developing hypothesis tests that are valid for testing arbitrary numbers of components in smooth families of distributions. The methods we have developed use a score process, which measures the evidence for a mixture component centered at any given location. Considering the maximal evidence over all possible locations leads to problems involving the distribution of extreme values for certain stochastic processes. These distributions are approximated using results from work on boundary crossing probabilities.

We are also developing computational algorithms for fitting mixture models. Our current interest is in developing fast versions of the EM algorithm suitable for use in high-dimensional spaces.
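For readers unfamiliar with EM, the following Python sketch shows the textbook algorithm for a two-component, one-dimensional Gaussian mixture. It is only an illustration of the basic E-step and M-step, not the fast high-dimensional variants mentioned above; the function names, starting values, and simulated data are all assumptions made for the example.

```python
import math
import random


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def em_two_gaussians(data, n_iter=200):
    """Textbook EM for a two-component 1-D Gaussian mixture (illustrative)."""
    n = len(data)
    lo, hi = min(data), max(data)
    # Crude starting values (an assumption): equal weights, means at the extremes.
    weight = 0.5                      # mixing proportion of component 1
    means = [lo, hi]
    sds = [(hi - lo) / 4.0] * 2
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for x in data:
            a = (1.0 - weight) * normal_pdf(x, means[0], sds[0])
            b = weight * normal_pdf(x, means[1], sds[1])
            resp.append(b / (a + b))
        # M-step: weighted proportions, means and standard deviations.
        n1 = sum(resp)
        n0 = n - n1
        weight = n1 / n
        means[0] = sum((1.0 - r) * x for r, x in zip(resp, data)) / n0
        means[1] = sum(r * x for r, x in zip(resp, data)) / n1
        var0 = sum((1.0 - r) * (x - means[0]) ** 2 for r, x in zip(resp, data)) / n0
        var1 = sum(r * (x - means[1]) ** 2 for r, x in zip(resp, data)) / n1
        # Floor the spreads so a component cannot collapse onto a single point.
        sds = [max(math.sqrt(var0), 1e-6), max(math.sqrt(var1), 1e-6)]
    return weight, means, sds


# Simulated data: 300 draws from N(0, 1) and 200 draws from N(5, 1).
rng = random.Random(0)
sample = ([rng.gauss(0.0, 1.0) for _ in range(300)]
          + [rng.gauss(5.0, 1.0) for _ in range(200)])
weight, means, sds = em_two_gaussians(sample)
print(weight, means, sds)
```

The sketch fixes the number of components at two; deciding how many components the data support is the testing problem described above and is not addressed here.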

Boundary Crossing Probabilities

My interest in boundary crossing problems was stimulated by undergraduate work on Brownian motion, and extended to discrete-space processes through my thesis work on Poisson process change points.

Mainly, I use methods based on approximating first hitting time distributions. This approach requires computing two components: firstly, what is the probability of the process being on the boundary at time t; and secondly, given that it is on the boundary at time t, what is the probability that this is the first hit? These questions lead to recursive methods and integral equations that can be used to compute the crossing probabilities, and also to simple approximations based on large deviation methods and the related tangent approximation.
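The two-component decomposition can be illustrated in the simplest setting: a binary random walk with steps +1 and -1 and a constant integer boundary. The event "the walk is on the boundary at time t" splits according to the time s of the first hit, which gives a recursion for the first-hit probabilities. The Python sketch below is a minimal illustration under those assumptions; it is not the Crossings code, and the function names are hypothetical.

```python
from math import comb


def walk_pmf(k, n, p):
    """P(S_n = k) for a walk from 0 with steps +1 (prob p) and -1 (prob 1 - p)."""
    if abs(k) > n or (n + k) % 2:
        return 0.0
    ups = (n + k) // 2
    return comb(n, ups) * p ** ups * (1.0 - p) ** (n - ups)


def first_hit_probs(level, horizon, p=0.5):
    """f[t] = P(first visit to the constant boundary `level` >= 1 is at time t).

    Uses P(S_t = level) = sum_{s <= t} f[s] * P(S_{t-s} = 0): the walk is on the
    boundary at time t only if it first hit at some s <= t and then returned to
    the boundary t - s steps later.
    """
    f = [0.0] * (horizon + 1)
    for t in range(1, horizon + 1):
        on_boundary = walk_pmf(level, t, p)
        # Subtract the paths that are on the boundary now but first hit earlier.
        earlier = sum(f[s] * walk_pmf(0, t - s, p) for s in range(1, t))
        f[t] = on_boundary - earlier
    return f


def crossing_prob(level, horizon, p=0.5):
    """Probability that the walk reaches `level` at or before `horizon`."""
    return sum(first_hit_probs(level, horizon, p))


print(crossing_prob(3, 25))
```

Because the steps are of size one, the walk cannot jump over an integer boundary, so hitting and crossing coincide. For curved boundaries or continuous-time processes the same idea leads to the integral equations mentioned above.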

My Crossings software provides subroutines implementing numerical algorithms for several different classes of process. The present code supports binary random walks; Poisson and empirical processes; Gaussian Markov processes such as Brownian motion; and pattern matching problems and discrete scan statistics.

Computing Probability Functions

For large sample sizes, standard algorithms for computing binomial probabilities and other statistical distributions can be inaccurate. I have developed an alternative set of algorithms, based on saddle point methods and Stirling's formula, that significantly improve accuracy in large samples. The accompanying code provides implementations for the Binomial, Poisson, Hypergeometric, Gamma and Chi-Square distributions.
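To give a flavour of the approach, the Python sketch below evaluates the binomial probability as a saddle point leading term multiplied by a correction built from the error in Stirling's formula. It is a simplified illustration, not the distributed code: the names are hypothetical, small arguments are handled through `math.lgamma` rather than tabulated values, and only the case 0 < p < 1 is covered.

```python
import math

HALF_LOG_2PI = 0.5 * math.log(2.0 * math.pi)


def stirling_error(n):
    """delta(n) = log(n!) - log(sqrt(2*pi*n) * (n/e)**n), for n >= 1."""
    if n < 16:
        # Small n: direct evaluation through lgamma (a simplification).
        return math.lgamma(n + 1.0) - (n + 0.5) * math.log(n) + n - HALF_LOG_2PI
    n2 = n * n
    # Leading terms of the Stirling series: 1/(12n) - 1/(360n^3) + ...
    return (1.0 / 12.0
            - (1.0 / 360.0 - (1.0 / 1260.0 - (1.0 / 1680.0) / n2) / n2) / n2) / n


def deviance_term(x, m):
    """x*log(x/m) + m - x, using a series when x is close to m."""
    d = (x - m) / m
    if abs(d) < 0.1:
        # m * ((1+d)*log(1+d) - d) = m * sum_{k>=2} (-d)**k / (k*(k-1))
        total, power = 0.0, -d
        for k in range(2, 40):
            power *= -d
            total += power / (k * (k - 1))
        return m * total
    return x * math.log(x / m) + m - x


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p), with integer x and 0 < p < 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("this sketch assumes 0 < p < 1")
    if x < 0 or x > n:
        return 0.0
    if x == 0:
        return math.exp(n * math.log1p(-p))
    if x == n:
        return math.exp(n * math.log(p))
    q = 1.0 - p
    log_correction = (stirling_error(n) - stirling_error(x) - stirling_error(n - x)
                      - deviance_term(x, n * p) - deviance_term(n - x, n * q))
    return math.sqrt(n / (2.0 * math.pi * x * (n - x))) * math.exp(log_correction)


print(binomial_pmf(300000, 1000000, 0.3))
```

The point of the rearrangement is that no large factorials or tiny powers are ever formed; every quantity passed to `exp` is of moderate size, so accuracy does not degrade as n grows.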

Local Regression and Likelihood

My work on local regression has covered theory, computational methods, inferential procedures, and relation to other smoothing techniques such as kernel methods. I have also extended the local regression idea to many local likelihood settings; in particular, density estimation and several models for censored survival data.
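As a concrete picture of the basic idea, the Python sketch below computes a local linear fit at a single point: a straight line is fitted by weighted least squares, with weights that decay with distance from the fitting point, and the intercept is taken as the estimate. The tricube weight function, the fixed bandwidth `h`, and all names are choices made for this illustration; the sketch is not LOCFIT and does not reflect its interface.

```python
import math
import random


def tricube(u):
    """Tricube weight: positive for |u| < 1, zero outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear(xs, ys, x0, h):
    """Local linear estimate of the regression function at x0, bandwidth h."""
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xi, yi in zip(xs, ys):
        d = xi - x0                  # centre the design at the fitting point
        w = tricube(d / h)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yi
        t1 += w * d * yi
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("need at least two distinct points within the bandwidth")
    # Intercept of the weighted least squares line, i.e. the fitted value at x0.
    return (s2 * t0 - s1 * t1) / det


# Noisy sine curve on an even grid (simulated data).
rng = random.Random(1)
xs = [i / 50.0 for i in range(51)]
ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.2) for x in xs]
print(local_linear(xs, ys, 0.25, 0.2))
```

Repeating the fit over a grid of values of `x0` traces out the smooth curve. Replacing the weighted least squares criterion with a weighted log-likelihood gives the local likelihood extensions mentioned above.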

Inferential issues have been a major interest of mine. I developed methods for finding simultaneous confidence bands for smooth regression curves, based on volume-of-tubes formulae. The resulting bands are fairly straightforward to compute, and in numerical studies I found that they significantly outperformed other methods (in particular, the bootstrap) even when the underlying normality assumption is not satisfied.

Another related interest is in model selection issues. My focus has been methods such as cross validation, which have often been roundly criticized as being "too variable" in comparison with "modern" methods. In fact, my research showed that cross validation methods, when properly interpreted and understood, are far more informative than the "modern" methods; in particular, the latter pay for their reduced variability by missing important features when given difficult smoothing problems. I have also extended cross validation and AIC to local likelihood methods, and developed methods for locally adaptive bandwidth choice based on these criteria.
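For a local linear fit with a fixed bandwidth, the leave-one-out cross validation score can be computed without refitting, because the deleted residual equals the ordinary residual divided by one minus the leverage of the point. The Python sketch below illustrates this and checks it against brute-force deletion. It is a minimal example under stated assumptions (tricube weights, a single global bandwidth `h`, hypothetical function names), and it only evaluates the score; how the resulting curve should be interpreted is a separate question.

```python
import math
import random


def tricube(u):
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def fit_with_leverage(xs, ys, x0, h, skip=None):
    """Return (local linear fit at x0, leverage of an observation located at x0).

    If `skip` is an index, that observation is left out of the fit.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        if j == skip:
            continue
        d = xj - x0
        w = tricube(d / h)
        s0 += w
        s1 += w * d
        s2 += w * d * d
        t0 += w * yj
        t1 += w * d * yj
    det = s0 * s2 - s1 * s1
    # An observation at x0 has d = 0 and weight tricube(0) = 1, so its
    # coefficient in the fitted value is s2 / det.
    return (s2 * t0 - s1 * t1) / det, s2 / det


def cv_shortcut(xs, ys, h):
    """Leave-one-out CV score from a single fit per point."""
    total = 0.0
    for xi, yi in zip(xs, ys):
        fit, leverage = fit_with_leverage(xs, ys, xi, h)
        total += ((yi - fit) / (1.0 - leverage)) ** 2
    return total / len(xs)


def cv_brute_force(xs, ys, h):
    """Leave-one-out CV score by actually deleting each point in turn."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        fit, _ = fit_with_leverage(xs, ys, xi, h, skip=i)
        total += (yi - fit) ** 2
    return total / len(xs)


rng = random.Random(1)
xs = [i / 29.0 for i in range(30)]
ys = [math.sin(2.0 * math.pi * x) + rng.gauss(0.0, 0.2) for x in xs]
for h in (0.15, 0.2, 0.3, 0.4, 0.6):
    print(h, cv_shortcut(xs, ys, h))
```

The shortcut relies on the weights of the remaining points not changing when one point is deleted, which holds for a fixed bandwidth but not, in general, for nearest-neighbour bandwidths.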

Much of my work in local regression is covered in my book, Local Regression and Likelihood, Springer, 1999. I have also developed an extensive software system, LOCFIT, implementing local regression and likelihood methods.

Simultaneous Inference, Tube Formula

The volume-of-tubes formula, originally developed by Hotelling and Weyl in 1939, computes the volumes of tubular neighbourhoods of manifolds. Given a suitable set (curve, surface, etc.) lying in n-dimensional space, the tube formula gives an expression for the volume of the set of all points within a specified radius of the set. With some additional mathematics, the tube formula provides accurate distributional approximations for the extreme values of Gaussian and other stochastic processes.
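In the simplest case, a Gaussian process indexed by one parameter with known variance, the approximation takes the form P(sup |Z(x)| > c) ≈ (κ0/π) exp(-c²/2) + 2(1 - Φ(c)), where κ0 is the length of the normalized curve traced out on the unit sphere. The Python sketch below solves this equation for the critical value c of a simultaneous band. It treats κ0 as a given input, which is an assumption of the example since computing κ0 is the substantive part of any real application, and the function names are hypothetical rather than taken from libtube.

```python
import math


def tube_tail(c, kappa0):
    """One-dimensional, known-variance tube approximation to P(sup |Z| > c)."""
    # 2 * (1 - Phi(c)) is written as erfc(c / sqrt(2)).
    return (kappa0 / math.pi) * math.exp(-0.5 * c * c) + math.erfc(c / math.sqrt(2.0))


def critical_value(kappa0, alpha=0.05):
    """Solve tube_tail(c, kappa0) = alpha for c by bisection."""
    lo, hi = 0.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if tube_tail(mid, kappa0) > alpha:
            lo = mid          # tail still too heavy: the root lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


for kappa0 in (0.0, 1.0, 5.0, 10.0):
    print(kappa0, critical_value(kappa0))
```

With κ0 = 0 the equation reduces to the usual pointwise two-sided normal critical value, and the critical value grows as the curve gets longer. Bisection is used only because the tail is monotone in c and the sketch avoids external dependencies.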

Applications of this formula in statistics are numerous, particularly in the area of simultaneous inference and testing. Despite the mathematical beauty and simplicity of the end results, the method remains underutilized in statistics, with simulations that are more computationally expensive, less accurate and non-reproducible being the preferred choice of many authors.

I have developed an implementation of the tube formula for statistical problems, in conjunction with my work on simultaneous confidence bands, and related work in local regression. It has now matured to the stage where it can be used as a separate linkable library, libtube.