"Solving Unusual Problems in Statistics"
In practice, models of data are often inadequate for
problems such as regression, clustering, and data mining.
A few algorithms hint at the challenges (and the
opportunities). For example, algorithms exist for
fitting "mixtures of regressions." The mclust
algorithm in Splus allows for Poisson noise while
fitting a mixture of normal densities.
In each case, an unknown subset of the data is "bad."
How can we know the correct model for the bad data? What is
an appropriate model for the regression curve of the
bad data? What if the bad data are not Poisson in
the mixture problem? How can one identify multiple
outliers in a data mining setting?
I describe some algorithms that require only a
model of the good data and that, in many situations,
are unaffected by the particular shape of the bad data.
Applications of these ideas are surprisingly diverse.
For example, in a high-dimensional clustering problem,
can I find the leading eigenvector of the covariance
of one cluster without fitting a complete mixture model?
Come to this talk and see what is possible today.