The course is organized as 19 modules (lectures) of 75 minutes each.
(*) marks more advanced topics which can be skipped for a less advanced course.
Modules
M1: Introduction: Machine Learning and Data Mining
- Data Flood
- Data Mining Application Examples
- Data Mining and Knowledge Discovery
- Data Mining Tasks
Introduction to KDD (AI Mag 1996) (KDnuggets.com/gpspubs/aimag-kdd-overview-1996-Fayyad.pdf)
M2: Machine Learning and Classification
- Machine Learning and Classification
- Examples
- Learning as Search
- Bias
- Weka
M3: Input: Concepts, instances, attributes
- What is a concept?
- What is an example?
- What is an attribute?
- Preparing the data
M4: Output: Knowledge Representation
- Decision tables
- Decision trees
- Decision rules
- Rules involving relations
- Instance-based representation
M5: Classification: Basic methods
- OneR
- NaiveBayes
M6: Classification: Decision Trees
- Top-Down Decision Trees
- Choosing the Splitting Attribute
- Information Gain and Gain ratio
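As an illustration of the splitting criteria listed for M6, the following is a minimal sketch of information gain and gain ratio. It is not taken from the course materials: the function names and the six-row toy dataset are assumptions made up for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus its weighted entropy after splitting on `attribute`."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def gain_ratio(rows, attribute, target):
    """Information gain divided by the entropy of the split itself (split info)."""
    split_info = entropy([r[attribute] for r in rows])
    gain = information_gain(rows, attribute, target)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data, invented for this sketch.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "no"},
    {"outlook": "overcast", "windy": True, "play": "yes"},
]

for attr in ("outlook", "windy"):
    print(attr, information_gain(rows, attr, "play"), gain_ratio(rows, attr, "play"))
```

On this toy data "outlook" has the higher gain, so a top-down tree builder using this criterion would split on it first.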
M7: Classification: C4.5
- Handling Numeric Attributes
Finding Best Split
- Dealing with Missing Values
- Pruning
Pre-pruning, Post-Pruning, Estimating Error Rates
- From Trees to Rules
M8: Classification: CART
- CART Overview and Gymtutor Tutorial Example
- Splitting Criteria
- Handling Missing Values
- Pruning
Finding Optimal Tree
M9: Classification: more methods
- Rules
- Regression
- Instance-based (Nearest neighbor)
M10: Evaluation and Credibility
- Introduction
- Classification with Train, Test, and Validation sets
Handling Unbalanced Data; Parameter Tuning
- *Predicting Performance
- Evaluation on "small data": Cross-validation
- *Bootstrap
- Comparing Data Mining Schemes
- *Choosing a Loss Function
M11: Evaluation - Lift and Costs
- Lift and Gains charts
- *ROC
- Cost-sensitive learning
- Evaluating numeric predictions
- MDL principle and Occam's razor
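As an illustration of the lift measure listed for M11, the sketch below computes cumulative lift for the top-scoring fraction of a ranked list. It is not taken from the course materials: the function name, the scores, and the 0/1 outcomes are hypothetical, and the code assumes at least one positive outcome so that the base rate is non-zero.

```python
def cumulative_lift(scores, actuals, fraction):
    """Response rate in the top `fraction` of cases (ranked by score)
    divided by the overall response rate."""
    ranked = sorted(zip(scores, actuals), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(actual for _, actual in ranked[:top_n]) / top_n
    base_rate = sum(actuals) / len(actuals)  # assumed non-zero
    return top_rate / base_rate


# Hypothetical model scores and actual 0/1 responses, invented for this sketch.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
actuals = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

for fraction in (0.2, 0.5, 1.0):
    print(fraction, cumulative_lift(scores, actuals, fraction))
```

Lift at 100% of the list is 1 by construction; a model that ranks well shows lift above 1 for the smaller fractions.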
M12: Data Preparation for Knowledge Discovery
- Data understanding
- Data cleaning
- Data transformation
- Discretization
- False "predictors" (information leakers)
- Feature reduction, leaker detection
- Randomization
- Learning with unbalanced data
Study: Course notes
M13: Clustering
- Introduction
- K-means
- Hierarchical
Study: W&E, Course notes
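As an illustration of the K-means topic listed for M13, here is a minimal sketch of the standard assign-then-update loop in plain Python. It is not taken from the course materials: the function name, the naive choice of the first k points as initial centroids, and the sample points are assumptions made for this example.

```python
def kmeans(points, k, iterations=100):
    """Basic K-means on tuples of numbers; returns (centroids, clusters)."""
    # Naive initialization (an assumption of this sketch): use the first k points.
    centroids = [tuple(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: assignments no longer change
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

Results depend on the initial centroids; practical implementations typically use random restarts or a smarter seeding scheme.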
M14: Associations
- Transactions
- Frequent itemsets
- Association rules
- Applications
Study: Course notes
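As an illustration of the frequent itemset and association rule topics listed for M14, the sketch below computes support and confidence over a small set of transactions. It is not taken from the course materials: the function names and the five market-basket transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent together) divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market-basket transactions, invented for this sketch.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
```

An itemset is called frequent when its support meets a chosen minimum threshold; rules are then usually filtered by a minimum confidence.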
M15: Visualization
- Graphical excellence and lie factor
- Representing data in 1, 2, and 3-D
- Representing data in 4+ dimensions
- Parallel coordinates
- Scatterplots
- Stick figures
- ...
Study: Course notes
M16: Summarization and Deviation Detection
- Summarization
- KEFIR: Key Findings Reporter
- WSARE: What is Strange About Recent Events
Rule-based Anomaly Pattern Detection for Detecting Disease Outbreaks, by Weng-Keen Wong et al (about WSARE system).
M17: Applications: Targeted Marketing and Customer Modeling
- Direct Marketing Review
- Evaluation: Lift, Gains
- KDD Cup 1997
- Lift and Benefit estimation
- KDD Cup 1998
G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, Proc. KDD-99, ACM.
M18: Applications: Genomic Microarray Data Analysis
Study: SIGKDD Explorations Special Issue on Microarray Data Mining,
Capturing Best Practice for Microarray Gene Expression Data Analysis, G. Piatetsky-Shapiro, T. Khabaza, S. Ramaswamy, in Proceedings of KDD-2003.
M19: Data Mining and Society; Future Directions
- Data Mining and Society: Ethics, Privacy, and Security issues
- Future Directions for Data Mining
web mining, text mining, multi-media data
- Course Summary
Bayardo & Srikant, Technological Solutions for Protecting Privacy, IEEE Computer, Sep 2003.