next up previous index
Next: ims.15 Up: Institute of Mathematical Statistics Previous: ims.13



Session Slot: 2:00- 3:50 Tuesday

Estimated Audience Size: 125-175

AudioVisual Request: Two Overheads

Session Title: DATA MINING

Theme Session: No

Applied Session: No

Session Organizer: Land, Stephanie Carnegie Mellon University

Address: Department of Statistics Carnegie Mellon University Pittsburgh, PA 15213

Phone: (412) 268-1880

Fax: (412) 268-7828


Session Timing: 110 minutes total:

Opening Remarks by Chair - 2 minutes
First Speaker - 30 minutes
Floor Discussion - 5 minutes
Second Speaker - 30 minutes
Floor Discussion - 5 minutes
Third Speaker - 30 minutes
Floor Discussion - 8 minutes

Session Chair: Land, Stephanie Carnegie Mellon University

Address: Department of Statistics Carnegie Mellon University Pittsburgh, PA 15213

Phone: (412) 268-1880

Fax: (412) 268-7828


1. Probabilistic Models for Finding Patterns in Large Time-Series and Image Databases

Smyth, Padhraic,   University of California, Irvine

Address: Information and Computer Science, University of California, Irvine, CA 92697-3425

Phone: (714) 824-2558

Fax: (714) 824-4056


Abstract: Large databases containing multiple time series and multiple images are relatively commonplace in applications such as finance, medicine, remote sensing, and engineering. In such applications it is common to want to match a prototype pattern of interest against the database, e.g., "find all images or time-series segments in the database which are similar to this one." This is often referred to as "retrieval by content." Existing methods can be characterized by (1) representation (how the time-series or image data are modeled), (2) similarity criteria (how similarity is quantified), and (3) search (how to avoid exponential time and space complexity when searching for patterns). Relatively little of this prior work is set within a probabilistic framework, leading to ad hoc algorithms and systems. In this context, probabilistic models for time-series pattern matching are introduced. The probabilistic approach provides a natural semantics for describing prior knowledge of pattern deformations and, in turn, implicitly defines a similarity metric. Results on a variety of databases demonstrate orders-of-magnitude speed-up over direct search.
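
For illustration, the prototype-matching idea can be sketched with a simple Gaussian deformation model, where the likelihood of a segment under the model serves as the similarity score. This is a hypothetical Python sketch; the function names and the i.i.d. noise model are assumptions, not the speaker's models:

```python
import math

def log_likelihood(segment, prototype, sigma=1.0):
    """Log-likelihood of a segment under a simple deformation model:
    segment[i] = prototype[i] + Gaussian noise N(0, sigma^2).
    Higher log-likelihood means a closer match."""
    n = len(segment)
    sse = sum((s - p) ** 2 for s, p in zip(segment, prototype))
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sse / (2 * sigma ** 2)

def best_match(series, prototype, sigma=1.0):
    """Slide the prototype over the series; return (start index, log-lik)
    of the most likely matching segment (direct search, for illustration)."""
    m = len(prototype)
    start = max(range(len(series) - m + 1),
                key=lambda i: log_likelihood(series[i:i + m], prototype, sigma))
    return start, log_likelihood(series[start:start + m], prototype, sigma)

series = [0, 0, 1, 2, 3, 2, 1, 0, 0, 0]
proto = [1, 2, 3, 2, 1]
start, ll = best_match(series, proto)  # start == 2, the exact occurrence
```

Note the prior on deformations (here, only additive noise) is what implicitly defines the similarity metric; richer priors (warping, scaling) fit the same framework.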

2. Identifying Exceptions in Large Datasets

Ng, Raymond,   University of British Columbia

Address: Department of Computer Science, 2366 Main Mall, Vancouver, B.C. V6T 1Z4, Canada

Phone: (604) 822-2394

Fax: (604) 822-5485


Abstract: "A person's noise is another person's signal." Most data mining techniques focus on finding patterns that fit the majority. Relatively little attention, except in the statistical community, has been given to the "dual" problem of finding exceptions.

In this talk, we shall consider two approaches for finding exceptions in large datasets with multiple attributes. The first approach is based on Tukey's definition of depth contours. We shall present algorithms for computing depth contours in 2 and 3 dimensions/attributes, and outline how to generalize the algorithms to higher dimensions. The algorithms are designed to handle millions of objects/tuples residing on disk.
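
Tukey (halfspace) depth, the notion underlying the first approach, can be sketched naively in 2-D. This is an O(n^2) illustration of the definition only; the talk's algorithms are designed to scale far beyond this:

```python
import math

def tukey_depth(q, points, eps=1e-6, tol=1e-12):
    """Halfspace (Tukey) depth of q in a 2-D point set: the minimum number
    of data points in any closed halfplane whose boundary passes through q.
    It suffices to test boundaries at (or just beside) the directions
    determined by the data points themselves."""
    vecs = [(p[0] - q[0], p[1] - q[1]) for p in points]
    candidates = []
    for dx, dy in vecs:
        if dx == 0 and dy == 0:
            continue
        a = math.atan2(dy, dx)
        # this point lies on the boundary when the normal is at a +/- pi/2
        for t in (a + math.pi / 2, a - math.pi / 2):
            candidates.extend((t - eps, t, t + eps))
    depth = len(points)
    for theta in candidates:
        nx, ny = math.cos(theta), math.sin(theta)
        count = sum(1 for dx, dy in vecs if dx * nx + dy * ny >= -tol)
        depth = min(depth, count)
    return depth

pts = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)]
# the centre is deepest; the corners lie on the outermost depth contour
```

A depth contour is then the boundary of the region of points whose depth is at least k; deep exceptions are exactly the points on shallow contours.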

While the first approach can be considered order-based, the second approach is distance-based. Given a function defining the distance between every pair of objects, an object O is a (p,D)-outlier if at least p% of the other objects in the dataset are at least distance D away from O. We shall present algorithms for computing such outliers in high-dimensional datasets. We shall compare and contrast the two approaches.
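
The (p,D)-outlier definition lends itself to a direct nested-loop sketch. This is illustrative only; the talk's algorithms for large, high-dimensional datasets are far more efficient:

```python
def pd_outliers(data, p, dist_threshold, dist):
    """(p, D)-outliers: objects O such that at least fraction p of the
    other objects lie at distance >= D from O.  Naive O(n^2) nested loop,
    for illustration of the definition only."""
    n = len(data)
    outliers = []
    for i, o in enumerate(data):
        far = sum(1 for j, x in enumerate(data)
                  if j != i and dist(o, x) >= dist_threshold)
        if far >= p * (n - 1):
            outliers.append(o)
    return outliers

data = [1.0, 1.1, 0.9, 1.2, 10.0]
outs = pd_outliers(data, 0.9, 5.0, lambda a, b: abs(a - b))
# 10.0 is far from all four other points, so it is the only outlier
```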

3. Very Sparse Structures for Caching Contingency Tables Applied to Computationally Efficient Statistics and Data Mining

Moore, Andrew,   Carnegie Mellon University

Address: Robotics Institute EDSH 221 Carnegie Mellon University Pittsburgh, PA 15213

Phone: (412) 268-7599

Fax: (412) 268-5571


Abstract: In this talk we introduce new data structures and algorithms that can empirically give a 50-fold to 1000-fold computational speedup for machine learning methods such as Bayes net structure finders, rule learners and feature selectors operating on large datasets. Many machine learning algorithms need to do vast numbers of counting queries (e.g. "How many records have Color=Blue, Nationality=British, Smoker=False, Status=Married, Nose=Big?"). Similarly, many algorithms need to build and test many contingency tables. We provide methods for which, subject to certain assumptions, the costs of these kinds of operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table.
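
A counting query of the kind quoted above can be sketched as a naive full scan, whose cost grows linearly with the number of records; this is the baseline such caching structures are designed to beat (illustrative Python, not the speaker's code):

```python
def count_query(records, **conditions):
    """Naive counting query: scan every record and count those matching
    all attribute=value conditions.  Cost is linear in the dataset size."""
    return sum(all(rec.get(k) == v for k, v in conditions.items())
               for rec in records)

records = [
    {"Color": "Blue", "Smoker": False, "Status": "Married"},
    {"Color": "Blue", "Smoker": True,  "Status": "Single"},
    {"Color": "Red",  "Smoker": False, "Status": "Married"},
]
count_query(records, Color="Blue", Smoker=False)  # -> 1
```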

We provide a very sparse data structure, the ADTree (all-dimensions tree), to minimize memory use. We provide analytical worst-case memory and construction-cost bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves.
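
Idea (a), allocating nothing for zero counts, can be illustrated with a toy flat count cache. This is a simplification for exposition only: the real ADTree is a tree structure that additionally exploits savings (b) and (c):

```python
from itertools import combinations

def build_count_cache(records, attributes):
    """Toy illustration of idea (a): precompute the count of every
    conjunctive query over the given attributes, but never store an
    entry for a combination whose count is zero."""
    attributes = sorted(attributes)
    cache = {}
    for rec in records:
        for r in range(len(attributes) + 1):
            for attrs in combinations(attributes, r):
                key = tuple((a, rec[a]) for a in attrs)
                cache[key] = cache.get(key, 0) + 1
    return cache

def cached_count(cache, **conditions):
    # absence from the cache means the count is zero
    return cache.get(tuple(sorted(conditions.items())), 0)

records = [
    {"Color": "Blue", "Smoker": "No",  "Status": "Married"},
    {"Color": "Blue", "Smoker": "Yes", "Status": "Single"},
    {"Color": "Red",  "Smoker": "No",  "Status": "Married"},
]
cache = build_count_cache(records, ["Color", "Smoker", "Status"])
cached_count(cache, Color="Blue")   # -> 2
cached_count(cache, Color="Green")  # -> 0, and no memory was spent on it
```

After the one-time build, each query is a dictionary lookup, independent of the number of records; sparsity keeps the cache size proportional to the combinations that actually occur.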

We show how the ADTree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADTree methods against traditional direct counting approaches. We also discuss the possible uses of ADTrees in other machine learning methods, and discuss the merits of ADTrees in comparison with alternative representations such as kd-trees, R-trees and frequent sets.

List of speakers who are nonmembers:
