Data Mining Course Outline

Parts of this course are based on the textbook by Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 1999 and 2nd Edition (2005), referred to below as W&E. The course uses the Weka software, and the final project is a KDD-Cup-style competition to analyze DNA microarray data.

The course is organized as 19 modules (lectures) of 75 minutes each.
An asterisk (*) marks advanced topics that can be skipped in a less advanced course.

Modules

M1: Introduction: Machine Learning and Data Mining

Data Flood
Data Mining Application Examples
Data Mining and Knowledge Discovery
Data Mining Tasks

Study: Course Notes,
Introduction to KDD (AI Mag 1996) (KDnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf)

M2: Machine Learning and Classification

Machine Learning and Classification
Examples
Learning as Search
Bias
Weka

Study: W&E, Chapter 1.

M3: Input: Concepts, instances, attributes

What is a concept?
What is an example?
What is an attribute?
Preparing the data

Study: W&E, Chapter 2.

M4: Output: Knowledge Representation

Decision tables
Decision trees
Decision rules
Rules involving relations
Instance-based representation

Study: W&E, Chapter 3.

M5: Classification: Basic methods

OneR
NaiveBayes

Study: W&E, Chapter 4

M6: Classification: Decision Trees

Top-Down Decision Trees
Choosing the Splitting Attribute
Information Gain and Gain ratio

Study: W&E, Chapter 4
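
As a small illustration of the splitting criteria listed in M6, the following Python sketch computes information gain and gain ratio for a single nominal attribute. The function names and the toy weather-style data are assumptions made for this example, not course material or Weka code.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain_and_ratio(values, labels):
    """Information gain and gain ratio of splitting `labels` on a nominal attribute."""
    total = len(labels)
    # Group the class labels by attribute value.
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted average entropy of the subsets after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information: entropy of the attribute's own value distribution.
    split_info = entropy(values)
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Hypothetical toy data: six instances, one attribute, binary class.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast"]
play = ["no", "no", "yes", "yes", "no", "yes"]
gain, ratio = info_gain_and_ratio(outlook, play)
print(f"gain={gain:.3f} bits, gain ratio={ratio:.3f}")
```

Gain ratio divides the gain by the split information, which penalizes attributes that break the data into many small subsets.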

M7: Classification: C4.5

Handling Numeric Attributes
Finding Best Split
Dealing with Missing Values
Pruning
Pre-pruning, Post-Pruning, Estimating Error Rates
From Trees to Rules

Study: W&E, Chapter 5

M8: Classification: CART

CART Overview and Gymtutor Tutorial Example
Splitting Criteria
Handling Missing Values
Pruning
Finding Optimal Tree

Study: CART Tutorial, CART Manual, www.salford-systems.com

M9: Classification: more methods

Rules
Regression
Instance-based (Nearest neighbor)

Study: W&E, Chapter 4

M10: Evaluation and Credibility

Introduction
Classification with Train, Test, and Validation sets
Handling Unbalanced Data; Parameter Tuning
*Predicting Performance
Evaluation on "small data": Cross-validation
*Bootstrap
Comparing Data Mining Schemes
*Choosing a Loss Function

Study: W&E, Chapter 5.
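
To make the cross-validation topic in M10 concrete, here is a minimal Python sketch that partitions instance indices into k folds. The helper name `k_fold_indices` is hypothetical, and the sketch assumes the data has already been shuffled; it does not stratify by class.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation over n instances."""
    indices = list(range(n))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# Each instance appears in exactly one test fold.
folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print(len(train), len(test))
```

Each instance is used for testing exactly once and for training in the remaining k-1 rounds, which is why the method suits small data sets.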

M11: Evaluation: Lift and Costs

Lift and Gains charts
*ROC
Cost-sensitive learning
Evaluating numeric predictions
MDL principle and Occam's razor

Study: W&E, Chapter 5.
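
The lift measure from M11 can be sketched in a few lines of Python. The function name `lift_at` and the scores and outcomes below are invented for illustration; the sketch assumes binary outcomes coded as 0/1 and at least one positive case.

```python
def lift_at(scores, actuals, fraction):
    """Lift in the top `fraction` of cases, ranked by descending model score."""
    ranked = [a for _, a in sorted(zip(scores, actuals),
                                   key=lambda pair: pair[0], reverse=True)]
    k = max(1, round(fraction * len(ranked)))
    top_rate = sum(ranked[:k]) / k            # response rate among the top k cases
    base_rate = sum(ranked) / len(ranked)     # response rate in the whole list
    return top_rate / base_rate

# Hypothetical model scores and actual responses (1 = responder).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actuals = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
for fraction in (0.2, 0.5, 1.0):
    print(fraction, round(lift_at(scores, actuals, fraction), 2))
```

A lift of 1 means the model selects responders no better than random targeting; the lift always returns to 1 when the whole list is selected.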

M12: Data Preparation for Knowledge Discovery

Data understanding
Data cleaning
Data transformation
Discretization
False "predictors" (information leakers)
Feature reduction, leaker detection
Randomization
Learning with unbalanced data

Study: Course notes

M13: Clustering

Introduction
K-means
Hierarchical

Study: W&E, Course notes

M14: Associations

Transactions
Frequent itemsets
Association rules
Applications

Study: Course notes

M15: Visualization

Graphical excellence and lie factor
Representing data in 1, 2, and 3 dimensions
Representing data in 4+ dimensions
- Parallel coordinates
- Scatterplots
- Stick figures
- ...

Study: Course notes

M16: Summarization and Deviation Detection

Summarization
KEFIR: Key Findings Reporter
WSARE: What is Strange About Recent Events

Study: KEFIR book chapter and demo,
Rule-based Anomaly Pattern Detection for Detecting Disease Outbreaks, by Weng-Keen Wong et al. (about the WSARE system).

M17: Applications: Targeted Marketing and Customer Modeling

Direct Marketing Review
Evaluation: Lift, Gains
KDD Cup 1997
Lift and Benefit estimation
KDD Cup 1998

Study: KDD Cup 1997 report, KDD Cup 1998 report,
G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, Proc. KDD-99, ACM.

M18: Applications: Genomic Microarray Data Analysis

Study: SIGKDD Explorations Special Issue on Microarray Data Mining,
Capturing Best Practice for Microarray Gene Expression Data Analysis, G. Piatetsky-Shapiro, T. Khabaza, S. Ramaswamy, in Proceedings of KDD-2003.

M19: Data Mining and Society; Future Directions

Data Mining and Society: Ethics, Privacy, and Security issues
Future Directions for Data Mining
Web mining, text mining, multimedia data
Course Summary

Study: Knowledge Discovery in Databases vs. Personal Privacy Symposium, editor Gregory Piatetsky-Shapiro, IEEE Expert, April 1995.

Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003.