"Tools for Discovering Patterns in Data" Wednesday, May 14, 1997 9:00-4:30 (break 12:00-1:30) John Elder; Chief Scientist, Quantitative Solutions * Course Description Find the useful information hidden in your data! This course surveys the leading computer-intensive methods for data analysis and inductive modelling, drawn from Statistics, Machine Learning, and Data Mining. Dr. Elder will describe the key inner workings of various algorithms, compare their merits, and (briefly) demonstrate their relative effectiveness on practical applications. We'll first review classical statistical techniques, both linear and nonparametric, then outline the ways in which these basic tools are modified and combined into more modern methods. The course pays particular attention to four powerful approaches: neural networks, polynomial networks, kernels, and decision trees, and uses actual scientific and business problems to demonstrate useful accompanying techniques (such as scientific visualization, resampling, and bundling) employed by experienced analysts. * Handouts Comprehensive notes and the recent book chapter, "A Statistical Perspective on Knowledge Discovery in Databases", by Elder & Pregibon. * Instructor John Elder is Chief Scientist of Quantitative Solutions, a Data Mining research firm in Charlottesville, Virginia, and an Adjunct Professor at the University of Virginia. He has over a decade of experience developing and applying adaptive, data-driven techniques to practical problems. He has been a researcher at Rice University, and Director of Research at an engineering consulting firm and for an investment management company. Dr. Elder has authored four book chapters and numerous articles on pattern discovery, and is the technical chair of the Adaptive and Learning Systems Group of the IEEE Systems, Man, and Cybernetics Society. * Who Should Attend? 
Those from industry and academia who work with data and wish to understand recent developments in pattern discovery, data mining, and inductive modeling. At the conclusion of this course, one should be able to discern the basic strengths of competing methods and select the appropriate tools for one's applications. Participants should have prior working experience with computers and knowledge of, or interest in, applied statistical techniques.

* Course Outline
  *Pattern Discovery: An Overview
    *Inducing Models from Data: Benefits and Dangers
    *The Data Mining Process
  *Classical Statistical Techniques
    *Regression
    *Discriminant Analysis
    *Nonparametric:
      *Scatterplot Smoothers
      *Nearest Neighbors
      *Kernels
  *Modern Methods
    *Neural Networks
    *Polynomial Networks
    *Decision Trees
  *Key General Tools:
    *Scientific Visualization
    *Resampling
    *Optimization
  *Data Issues
    *Case Diagnostics (Outlying, Influential, Leverage, & Missing points)
    *Feature Creation and Selection
  *(Brief) Outline of Other Methods
    *Projection Pursuit
    *ASH (Average Shifted Histograms)
    *MARS (Multivariate Adaptive Regression Splines)
    *RBF (Radial Basis Functions)
  *Comparing and Combining Methods
    *Matching an algorithm to your application
    *Bundling & Fusing models

* A note about the course scope:
Each of the major topics discussed could clearly comprise a semester-long course if presented in full detail! What this (admittedly intensive) short course provides, however, is a broad overview of the highlights, drawing connections between major developments in the diverse fields that contribute to the emerging discipline of Data Mining. Previous participants have found this "big picture" to be particularly useful for identifying avenues worthy of further exploration, whether for research or practical problem-solving.