Finding Outliers in Models of Spatial Data
Statistical models fit to data often require extensive
and challenging re-estimation before reaching their final
form. For example, outliers can adversely affect
fits. In other cases involving spatial data, a cluster
may exist for which the model is incorrect, also adversely
affecting the fit to the ``good'' data. In both cases,
the estimated residuals must be checked and rechecked until
the data are cleaned and an appropriate model is found.
%
In this article, we demonstrate an algorithm that
fits a model to the largest subset of the data for which
the model is appropriate. For example, if a hypothesized linear
regression model fits ninety percent of the data, our
algorithm not only finds an excellent fit, as if
only those ``good'' data were presented, but also
highlights the remaining ten percent of ``bad'' data that the model does not fit.
%
Our work in digital government has focused on mapping
data. We therefore illustrate the approach with models fit to
census tract data, and show how the data in the ``bad'' set
can be viewed spatially with ArcView or other tools.
%
This approach greatly simplifies the task of modeling
spatial data, and makes use of advanced map visualization
tools to understand the nature of subsets of the data
for which the model is not appropriate.