KDD Nuggets 95:15, e-mailed 95-06-29

Contents:
* R. Uthurusamy, KDD-95 final program,
  http://www-aig.jpl.nasa.gov/kdd95program.html
* GPS, IEEE Expert Mini-symposium: KDD vs Privacy
* Y. Reich, ECOBWEB: a public domain clustering tool,
  http://or.eng.tau.ac.il:7777/topics/ecobweb.html
* R. Quinlan, New Releases of C4.5 and FOIL,
  ftp://ftp.cs.su.oz.au/pub/ml/patch.tar.Z
* H. Roberts, Communications Week on Data Mining
* S. Tafolla, Peter Clark's Machine Learning Software,
  http://www.cs.utexas.edu/users/pclark/software.html

The KDD Nuggets is a moderated mailing list for news and information
relevant to Knowledge Discovery in Databases (KDD), also known as Data
Mining, Knowledge Extraction, etc. Relevant items include tool
announcements and reviews, summaries of publications, information
requests, interesting ideas, clever opinions, etc. Please include a
descriptive subject line in your submission.
Nuggets frequency is approximately bi-weekly.

Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
references, FAQ, and other KDD-related information are available at
Knowledge Discovery Mine, URL http://info.gte.com/~kdd/
or by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README

E-mail add/delete requests to kdd-request@gte.com
E-mail contributions to kdd@gte.com

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories)   *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
But of the tree of the knowledge of good and evil, thou shalt not eat of it:
for in the day that thou eatest thereof thou shalt surely die.
    Genesis 2:17

The desire of knowledge, like the thirst of riches, increases ever with the
acquisition of it.
Laurence Sterne, Tristram Shandy [1760] >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Date: Wed, 28 Jun 1995 11:05:54 -0500 (EST) From: "R. Uthurusamy" Subject: KDD-95 final program The First International Conference on -------------------------------------- Knowledge Discovery and Data Mining (KDD95) ------------------------------------------- Montreal, Canada, August 20-21, 1995 ==================================== Sponsored by AAAI and in Cooperation with IJCAI, Inc. Co-located with IJCAI-95. Co-sponsored by: AT&T Global Information Solutions NASA - Jet Propulsion Laboratory GTE Laboratories Inc. Conference Co-Chairs: ==================== Usama M. Fayyad (Jet Propulsion Lab, California Institute of Technology) Ramasamy Uthurusamy (General Motors Research) Program Committee ================= Rakesh Agrawal (IBM Almaden Research Center, USA) Tej Anand (AT&T Global Information Solutions, USA) Ron Brachman (AT&T Bell Laboratories, USA) Wray Buntine (NASA AMES Research Center, USA) Nick Cercone (University of Regina, Canada) Peter Cheeseman (NASA AMES Research Center, USA) Greg Cooper (University of Pittsburgh, USA) Brian Gaines (University of Calgary, Canada) Clark Glymour (Carnegie-Mellon University, USA) David Hand (Open University, UK) David Heckerman (Microsoft Corporation, USA) Se June Hong (IBM T.J. 
Watson Research Center, USA) Larry Jackel (AT&T Bell Labs, USA) Larry Kerschberg (George Mason University, USA) Willi Kloesgen (GMD, Germany) David Madigan (University of Washington, USA) Chris Matheus (GTE Laboratories, USA) Heikki Mannila (University of Helsinki, Finland) Gregory Piatetsky-Shapiro (GTE Laboratories, USA) Daryl Pregibon (AT&T Bell Laboratories, USA) Arno Siebes (CWI, Netherlands) Evangelos Simoudis (Lockheed Research Center, USA) Andrzej Skowron (University of Warsaw, Poland) Padhraic Smyth (Jet Propulsion Laboratory, USA) Alex Tuzhilin (NYU Stern School, USA) Xindong Wu (Monash University, Australia) Wojciech Ziarko (University of Regina, Canada) Jan Zytkow (Wichita State University, USA) Publicity Chair: Padhraic Smyth, Jet Propulsion Laboratory Industry Liaison: Gregory Piatetsky-Shapiro, GTE Laboratories Demo Sessions Chair: Tej Anand, AT&T Global Information Solutions CONTACT INFORMATION: Please send KDD-95 conference registration and related inquiries to: ------------------------------------------------------------------- KDD-95 American Association for Artificial Intelligence (AAAI) 445 Burgess Drive Menlo Park, CA 94025-3496. U.S.A. Phone: (+1 415) 328-3123; Fax: (+1 415) 321-4457 Email: kdd@aaai.org Please send KDD-95 Publicity and related inquiries to: ----------------------------------------------------- Padhraic Smyth (KDD-95) email: kdd95@aig.jpl.nasa.gov Jet Propulsion Laboratory, 525-3660, California Institute of Technology 4800 Oak Grove Drive, Pasadena, CA 91109 U.S.A. 
Phone: (+1 818) 306-6422 Fax: (+1 818) 306-6912 Inquiries about KDD-95 sponsorship and industry participation to: ---------------------------------------------------------------- Gregory Piatetsky-Shapiro, e-mail: gps@gte.com GTE Laboratories, MS-45 tel: 617-466-4236 40 Sylvan Road fax: 617-466-2960 Waltham MA 02154-1120 USA URL: http://info.gte.com/~kdd/ ---------------------------------------------------------------------------- Technical Program ----------------- ***************** Sunday - August 20, 1995 DAY 1 ********************* 7:30 - 8:30 Registration 8:30 - 9:00 WELCOME, Opening remarks, Overview of KDD (U. Fayyad) 9:00 - 10:15 SESSION 1: Databases and Data Mining Session Chair: Heikki Mannila Applying a Data Miner To Heterogeneous Schema Integration Son Dao and Brad Perry, Hughes Research Laboratories Active Data Mining Rakesh Agrawal and Giuseppe Psaila, IBM Almaden Research Center A Database Interface for Clustering in Large Spatial Databases Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu University of Munich, Germany 10:15 - 10:30 SPOTLIGHT SESSION 1 -- 6 poster summaries (P1 through P6) 10:30 - 10:50 COFFEE BREAK 10:50 - 11:00 SPOTLIGHT SESSION 2 -- 4 poster summaries (P7 through P10) 11:00 - 11:50 INVITED SPEAKER: David Haussler, UCSC Using Hidden Markov Models to Search Biosequence Databases 11:50 - 12:00 SPOTLIGHT SESSION 3 -- 4 poster summaries (P11 through P14) 12:00 - 1:30 LUNCH BREAK 1:30 - 2:30 PANEL SESSION Commercial KDD Applications: The Secret Ingredients for Success Panel Chairs: Gregory Piatetsky-Shapiro, GTE Labs and Evangelos Simoudis, IBM Almaden Research 2:30 - 3:20 SESSION 2: Causality and Bayes Networks Session Chair: Alex Tuzhilin Available Technology for Discovering Causal Models, Building Bayes Nets, and Selecting Predictors: The TETRAD II Program Clark Glymour, Carnegie Mellon University Learning Bayesian Networks with Discrete Variables from Data Peter Spirtes and Christopher Meek, Carnegie Mellon University 3:20 - 3:30 
SPOTLIGHT SESSION 4 -- 3 poster summaries (P15 through P17) 3:30 - 3:50 COFFEE BREAK 3:50 - 4:00 SPOTLIGHT SESSION 5 -- 4 poster summaries (P18 through P21) 4:00 - 6:00 PARALLEL SESSION 3A PARALLEL SESSION 3B =================== =================== Session Chair: Session Chair: Jan Zytkow Willi Kloesgen 6:00 - 8:00 KDD-95 RECEPTION POSTER SESSION 1 DEMO SESSION Demo Session Chair: Tej Anand, AT&T Global Info. Solutions ***************** MONDAY - August 21, 1995 DAY 2 ********************* 7:30 - 8:30 Registration 8:30 - 9:20 SESSION 4: Temporal Databases Session Chair: Wray Buntine Fast Spatio-Temporal Data Mining of Large Geophysical Datasets Paul Stolorz, JPL, et al. Discovering Frequent Episodes in Sequences H. Mannila, H. Toivonen, and A.I. Verkamo, Univ. of Helsinki 9:20 - 9:30 SPOTLIGHT SESSION 6 -- 4 poster summaries (P22 through P25) 9:30 - 10:30 INVITED SPEAKER: Tomasz Imielinski, Rutgers University A Database Perspective on Knowledge Discovery 10:30 - 10:50 COFFEE BREAK 10:50 - 11:00 SPOTLIGHT SESSION 7 -- 4 poster summaries (P26 through P29) 11:00 - 11:50 SESSION 5: Inductive Learning Session Chair: Xindong Wu MDL-Based Decision Tree Pruning M. Mehta, J. Rissanen, and R. Agrawal, IBM Almaden Res. Center Estimating the Robustness of Discovered Knowledge Chun-Nan Hsu and Craig A. Knoblock, U.S.C. 11:50 - 12:00 SPOTLIGHT SESSION 8 -- 4 poster summaries (P30 through P33) 12:00 - 1:30 LUNCH BREAK 1:30 - 2:30 INVITED SPEAKER: Jerome Friedman, Stanford University Intelligent Local Learning: Statistical Algorithms for Prediction with High Dimensional Data 2:30 - 3:20 SESSION 6: KDD and STATISTICS Session Chair: Padhraic Smyth A Statistical Perspective On Knowledge Discovery In Databases John Elder, Rice Univ. and Daryl Pregibon, AT&T Bell Labs. 
Discriminant Adaptive Nearest Neighbor Classification Trevor Hastie, Stanford University and Robert Tibshirani, University of Toronto 3:20 - 3:50 COFFEE BREAK 3:30 - 5:30 POSTER SESSION 2 DEMO SESSION Repeated 5:30 - 6:00 CONCLUDING REMARKS, SUMMARY and WRAP-UP Session (R. Uthurusamy) *************************************************************** PARALLEL SESSION 3A: Rough Sets and Databases ============================================== ***** Sunday, August 20, 1995: 4:00 - 6:00PM ***** Discovery of Concurrent Data Models from Experimental Tables: A Rough Set Approach Andrzej Skowron, Warsaw Univ. and Zbigniew Suraj, Pedagogical Univ., Poland Automated Discovery of Functional Components of Proteins from Amino-Acid Sequences Based on Rough Sets and Change of Representation Shusaku Tsumoto and Hiroshi Tanaka, Tokyo Medical and Dental Univ., Japan Using Rough Sets as Tools for Knowledge Discovery Ning Shan, Wojciech Ziarko, Howard J. Hamilton, and Nick Cercone, University of Regina, Canada Exploiting Upper Approximation in the Rough Set Methodology Jitender S. Deogun, University of Nebraska at Lincoln; Vijay V. Raghavan and Hayri Sever, University of Southwestern Louisiana A Perspective on Databases and Data Mining Marcel Holsheimer and Martin Kersten, CWI Database Res. Group, The Netherlands Heikki Mannila and Hannu Toivonen, University of Helsinki, Finland Compression-Based Evaluation of Partial Determinations Bernhard Pfahringer and Stefan Kramer, Austrian Research Inst. for AI, Austria PARALLEL SESSION 3B: Supervised Learning: Issues and Applications ================================================================== ***** Sunday, August 20, 1995: 4:00 - 6:00PM ***** Knowledge Discovery in Telecommunication Services Data Using Bayesian Network Models Kazuo J. Ezawa and Steve W. Norton, AT&T Bell Laboratories Analyzing the Benefits of Domain Knowledge in Substructure Discovery Surnjani Djoko, Diane J. Cook, and Lawrence B. 
Holder, University of Texas at Arlington Decision Tree Induction: How Effective is the Greedy Heuristic? Sreerama K. Murthy and Steven Salzberg, Johns Hopkins University Feature Subset Selection Using the Wrapper Method: Overfitting and Dynamic Search Space Topology Ron Kohavi and Dan Sommerfield, Stanford University Learning Arbiter and Combiner Trees from Partitioned Data for Scaling Machine Learning Philip K. Chan and Salvatore J. Stolfo, Columbia University Are We Losing Accuracy While Gaining Confidence in Induced Rules: An Assessment of PrIL F. Ozden Gur-Ali, GE Corporate Research and Development and William A. Wallace, Rensselaer Polytechnic Institute ------------------------------------------------------------------------------ DEMO SESSION : ***** Sunday, August 20, 1995: 6:00 - 8:00PM ***** ============== Knowledge Discovery from Multiple Databases James Ribiero, George Mason University Knowledge Discovery in Textual Databases Ronen Feldman, Bar-Ilan University Exploiting Visualization in Knowledge Discovery Hing-Yan Lee, Hwee-Leng Ong and Lee-Hian Quek Information Technology Institute KEFIR: The Key Findings Reporter for the analysis of healthcare information Christopher Matheus and Gregory Piatetsky-Shapiro, GTE Labs. Automated Large-scale Data Mining by Forty-Niner (49er) Arun Sanjeev and Jan Zytkow POSTER SESSION 1: ***** Sunday, August 20, 1995: 6:00 - 8:00PM ***** ================= SPOTLIGHT SESSION 1: P1: STAR: A General Architecture for the Support of Distortion Oriented Displays Paul Anderson, Ray Smith, and Zhongwei Zhang, Monash University, Australia P2: Learning First Order Logic Rules with a Genetic Algorithm S. Augier, G. Venturini, and Y. Kodratoff, Univ. Paris-Sud, France P3: Discovery and Maintenance of Functional Dependencies by Independencies Siegfried Bell, University Dortmund, Germany P4: Intelligent Instruments: Discovering How to Turn Spectral Data into Information Wray L. 
Buntine and Tarang Patel, NASA Ames Research Center P5: Designing Neural Networks from Statistical Models: A New Approach to Data Exploration Antonio Ciampi, McGill University, Canada and Yves Lechevallier INRIA-Rocquencourt, France P6: Capacity and Complexity Control in Predicting the Spread Between Borrowing and Lending Interest Rates Corinna Cortes, Harris Drucker, Dennis Hoover, and Vladimir Vapnik, AT&T Bell Laboratories SPOTLIGHT SESSION 2: P7: Limits on Learning Machine Accuracy Imposed by Data Quality Corinna Cortes, L. D. Jackel, and Wan-Ping Chiang, AT&T Bell Laboratories P8: Knowledge Discovery in a Water Quality Database Saso Dzeroski, Jozef Stefan Institute and Jasna Grbovic, Hydrometeorological Institute of Slovenia P9: Data Mining for Loan Evaluation at ABN AMRO: A Case Study A. J. Feelders and A. J. F. le Loux, University of Twente; J. W. van't Zand, ABN AMRO Bank, The Netherlands P10: Knowledge Discovery in Textual Databases (KDT) Ronen Feldman and Ido Dagan, Bar-Ilan University, Israel SPOTLIGHT SESSION 3: P11: Optimization and Simplification of Hierarchical Clusterings Doug Fisher, Vanderbilt University P12: Structured and Unstructured Induction with EDAGs Brian R. Gaines, University of Calgary, Canada P13: Restructuring Databases for Knowledge Discovery by Consolidation and Link Formation Henry G. Goldberg and Ted E. Senator, Financial Crimes Enforcement Network (FinCEN), U.S. Dept. of Treasury P14: Rough Sets Similarity-Based Learning from Databases Xiaohua Hu and Nick Cercone, University of Regina, Canada SPOTLIGHT SESSION 4: P15: Efficient Algorithms for Attribute-Oriented Induction Hoi-Yee Hwang and Wai-Chee Fu, Chinese University of Hong Kong P16: Robust Decision Trees: Removing Outliers from Databases George H. John, Stanford University P17: Conceptual Clustering in Structured Databases: A Practical Approach A. Ketterlin, P. Gancarski, and J. Korczak, LSIIT, Univ. 
Louis Pasteur, France POSTER SESSION 2: ***** Monday, August 21, 1995: 3:30-5:30PM ***** ================= SPOTLIGHT SESSION 5: P18: Anonymization Techniques for Knowledge Discovery in Databases Willi Kloesgen, German National Research Center for Info. Technology (GMD) P19: Exploiting Visualization in Knowledge Discovery Hing-Yan Lee, Hwee-Leng Ong, and Lee-Hian Quek, Information Technology Institute, Singapore P20: Knowledge-Based Scientific Discovery in Geological Databases Cen Li and Gautam Biswas, Vanderbilt University P21: An Iterative Improvement Approach for the Discretization of Numeric Attributes in Bayesian Classifiers Michael J. Pazzani, University of California, Irvine SPOTLIGHT SESSION 6: P22: Knowledge Discovery from Multiple Databases James S. Ribeiro, Kenneth A. Kaufman, and Larry Kerschberg, George Mason University P23: Discovering Enrollment Knowledge in University Databases Arun P. Sanjeev and Jan M. Zytkow, Wichita State University P24: Extracting Support Data for a Given Task Bernhard Schoelkopf, Chris Burges, and Vladimir Vapnik, AT&T Bell Labs. P25: Feature Extraction for Massive Data Mining V. Seshadri and Raguram Sasisekharan, AT&T Bell Laboratories; Sholom M. Weiss, Rutgers University SPOTLIGHT SESSION 7: P26: Data Surveying: Foundations of an Inductive Query Language Arno Siebes, CWI, Database Research Group, The Netherlands P27: On Subjective Measures of Interestingness in Knowledge Discovery Avi Silberschatz, AT&T Bell Labs and Alexander Tuzhilin, New York Univ. 
P28: Using Recon for Data Cleaning Evangelos Simoudis, IBM Almaden Research Center; Brian Livezey and Randy Kerber, Lockheed Palo Alto Research Laboratories P29: Accelerated Quantification of Bayesian Networks with Incomplete Data Bo Thiesson, Aalborg University, Denmark SPOTLIGHT SESSION 8: P30: Automated Selection of Rule Induction Methods Based on Recursive Iteration of Resampling Methods and Multiple Statistical Testing Shusaku Tsumoto and Hiroshi Tanaka, Tokyo Medical and Dental Univ., Japan P31: Fuzzy Interpretation of Induction Results Xindong Wu, Monash University, Australia and Petter Mahlen, Royal Institute of Technology, Sweden P32: Resource and Knowledge Discovery in Global Information Systems: A Preliminary Design and Experiment Osmar R. Zaiane and Jiawei Han, Simon Fraser University, Canada P33: Toward a Multi-Strategy and Cooperative Discovery System Ning Zhong, The Univ. of Tokyo and Setsuo Ohsuga, The Waseda Univ., Japan ------------------------------------------------------------------------------ Additional details can be found at http://www-aig.jpl.nasa.gov/kdd95/ and at http://www.aaai.org/ ------------------------------------------------------------------------------ >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Date: Mon, 19 Jun 1995 09:47:19 -0400 From: gps0 (Gregory Piatetsky-Shapiro) Subject: IEEE Expert April 1995 I am pleased to announce the publication in IEEE Expert April 1995 issue of a mini-symposium on Knowledge Discovery in Personal Data Versus Privacy, based on the paper by Dan O'Leary on the subject. 
The mini-symposium consists of a paper by Dan O'Leary on the subject and responses from an international panel of experts: Yew-Tuan Khaw and Hing-Yan Lee from Singapore; Willi Kloesgen from GMD, Germany; Wojtek Ziarko from the University of Regina, Canada; and Steven Bonorris from the Office of Technology Assessment, USA.

Below is a condensed version of my introduction to the mini-symposium, in LaTeX.

-- GPS
------------------------
\title{Knowledge Discovery in Personal Data vs. Privacy \\ a mini-symposium}
\author{Gregory Piatetsky-Shapiro \\ \ \\ GTE Laboratories Incorporated\\ 40 Sylvan Rd., Waltham MA 02254\\ {\em gps@gte.com} }
\date{April 7, 1995}
\maketitle

\begin{flushright}
But of the tree of the knowledge of good and evil, thou shalt not eat of it: for in the day that thou eatest thereof thou shalt surely die. Genesis 2:17 \ \\
The desire of knowledge, like the thirst of riches, increases ever with the acquisition of it. Laurence Sterne, Tristram Shandy [1760]
\end{flushright}

Dr. Chandrasekaran, during his tenure as IEEE Expert Editor-in-chief, asked me to put together a mini-symposium on the issues of Knowledge Discovery in Databases and Privacy, based on the paper by Dan O'Leary on the subject. I am very pleased to have been able to assemble a distinguished panel of experts in the areas of Knowledge Discovery in Databases. This panel, international by design to reflect the geographical differences in the privacy issue, consists of Yew-Tuan Khaw and Hing-Yan Lee from Singapore; Willi Kloesgen from GMD, Germany; and Wojtek Ziarko from the University of Regina, Canada. Steven Bonorris from the Office of Technology Assessment gives the legal perspective. Here I briefly review the recent successes of Knowledge Discovery and highlight some of the important areas where it may conflict with privacy desires. The other articles follow.
The world-wide computerization of many business and government transactions in the developed countries and their increasing storage and availability on-line have created mountains of data that contain potentially valuable knowledge. Finding nuggets of knowledge in this data is the focus of the rapidly growing field known as Data Mining or Knowledge Discovery in Databases (Piatetsky-Shapiro and Frawley 1991, Piatetsky-Shapiro 1991, Cercone and Tsuchiya 1993, Fayyad and Uthurusamy 1994, Piatetsky-Shapiro et al. 1994, Piatetsky-Shapiro 1995, Fayyad and Uthurusamy 1995, Fayyad et al. 1995).

While successful Knowledge Discovery in Databases (KDD) applications have been developed for scientific and other non-personal databases, most of the public attention has been focused on the analysis of databases of personal information. Database marketing, which is the application of KDD tools to customer data in order to find patterns of customers who buy particular products, has even appeared on the cover of Business Week (Sep 5, 1994).

Database marketing, while apparently very successful, has sometimes been controversial. The Wall Street Journal warned readers to avoid the dark side of database marketing: too much personalization increases customers' annoyance (Rosenfield 1994). In 1990 Lotus developed and planned to sell a CD-ROM with data on about 100 million American households. This plan generated such a firestorm of protest over privacy issues that Lotus was forced to cancel the product (Rosenberg 1992).

Privacy concerns have long been expressed with regard to basic data collection and retrieval, and a number of guidelines for privacy protection have already been proposed in most developed countries. The guidelines and the existing privacy protections differ significantly around the world, and they also differ with respect to private and public data collectors.
The strongest data protection currently exists in European Union countries, most of which adopted the Organisation for Economic Co-operation and Development (OECD) guidelines which are the subject of Daniel O'Leary's article. In the USA there are privacy laws regulating the government usage of data, but very few laws dealing with private corporations' use of data. There are, however, the NII "Draft Principles for Providing and Using Personal Information", discussed in Steven Bonorris's article.

While concerns for privacy issues have long predated Knowledge Discovery, the vastness of existing databases and the sophistication of the advanced KDD methods have opened new potential vulnerabilities in personal privacy protection. We can divide the privacy issues in the analysis of personal data into three types:
\begin{enumerate}
\item Privacy vs Basic Storage and Retrieval
\item Privacy vs Pattern Discovery
\item Privacy vs Combination of Group Patterns
\end{enumerate}
These issues are reviewed below.

\section{Privacy vs Basic Storage and Retrieval}
The most fundamental privacy issues deal with basic storage and retrieval of personal data, which precede any discovery. Who can find out "What widgets did X buy on April 7, 1995?" Both OECD guidelines and NII Draft Principles suggest limiting the collection of sensitive data and limiting the access to personal data. They suggest limiting the data use to the purposes for which either there is an advance consent of the data subject or the use is authorized by law.

\section{Privacy vs Pattern Discovery}
If retrieval of specific information, such as "What widgets did X buy on April 7, 1995" is allowed, then it is technically possible to find patterns such as how frequently X buys widgets, what brand X prefers, etc. The technical equivalence between allowing retrieval and pattern discovery is a point that should be considered in establishing privacy guidelines.
The NII Draft Principles permit the use of "transactional records," such as phone numbers called, credit card payments, etc., as long as such use is compatible with the original notice. The use of transactional records probably also includes the discovery of patterns.

We should also note that discovered patterns in personal data may involve very controversial fields, such as race, sex, religion, and sexual orientation. A recent example is the debate over the research by Murray and Herrnstein which ranked different racial groups with respect to their IQ (New Republic, 1994). However, the First Amendment guarantees the freedom of speech, and even though some patterns can be very controversial, and can be illegal to discriminate upon, they can still be discovered and debated.

\section{Privacy vs Combination of Group Patterns}
\begin{flushright}
Even if you are paranoid, it does not mean they are not after you -- anonymous
\end{flushright}
In many cases (e.g. medical research, socio-economic studies) the goal is to discover patterns not about specific individuals but about groups, e.g. which group is more likely to buy a widget, which group has a high unemployment rate, or which group has a low incidence of AIDS. It would appear that such aggregate patterns are not covered by the restrictions on personal data. The problem arises because the combination of several such patterns, especially in small datasets, may allow identification of specific personal information, either with certainty or with high probability. For example, by learning that in the selected sample
\begin{itemize}
\item "people with code=A don't have AIDS"
\item "people with code=B don't have AIDS"
\item there are 10 people with code not equal to A or B
\item there are 9 cases of AIDS
\item person X has code=C
\end{itemize}
it is possible to infer that X has AIDS with probability 0.9.
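The arithmetic behind this inference can be spelled out as a short worked calculation. It is only a sketch: it assumes that, before the patterns are combined, X is equally likely to be any of the people whose code is neither A nor B, and the symbols $n$ and $k$ are introduced here purely for illustration. Since nobody with code A or B has AIDS, all $k = 9$ cases must fall among the $n = 10$ people with other codes, and X (code=C) is one of them:
\[
P(\mbox{X has AIDS} \mid \mbox{code}(X) \notin \{A, B\}) = \frac{k}{n} = \frac{9}{10} = 0.9
\]
Under the same assumption, a larger group (say $n = 1000$ with the same $k = 9$) would give only $0.009$, which is consistent with the remark that small datasets are where the risk is greatest.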
A number of technical solutions have been proposed (see Kloesgen's article) that would allow discovery of aggregate patterns while avoiding the potential invasion of privacy. These solutions include
\begin{itemize}
\item Removing or replacing identifying fields in the data, such as telephone numbers, names, and addresses (however, a person could still be identified from secondary fields).
\item Replacing direct querying of data with querying on a randomly selected (and each time different) sample. This, however, may still allow identification by a determined intruder.
% ref ??
\item Combining similar (in some way) individuals into groups and only storing data on those groups. This does not allow identification of individual data but may lose some interesting aggregate patterns.
\item Generating synthetic data which has the same marginal distribution as the original data (however, it is very difficult to generate such data for a large number of variables).
\end{itemize}
These topics, which pose interesting research issues, are discussed more by Kloesgen.
\ \\
I hope that this mini-symposium will shed light on the issues of privacy in knowledge discovery in personal databases and will help in generating guidelines that protect both the individual privacy and the society's right to know.

{\bf Acknowledgements}: I want to thank Dr. Chandrasekaran for suggesting a symposium on this topic, and Lance Hoffman for useful comments on O'Leary's paper.

\section{References}
\parindent 0pt

N. Cercone and M. Tsuchiya, 1993. Guest editors, Special Issue on Learning and Discovery in Databases, {\em IEEE Trans. on Knowledge and Data Engineering}, 5(6), Dec.

U. Fayyad and R. Uthurusamy, 1994. Editors, Proceedings of KDD-94: the AAAI-94 workshop on Knowledge Discovery in Databases, AAAI Press report 94-WS-03.

U. Fayyad and R. Uthurusamy, 1995. Editors, Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining, AAAI Press.

U. Fayyad, G.
Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, 1995. Editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press. New Republic, Oct 31, 1994, Special Issue on Murray and Herrnstein's The Bell Curve. G. Piatetsky-Shapiro and W. Frawley, 1991. Editors, {\em Knowledge Discovery in Databases}, Cambridge, Mass.: AAAI/MIT Press. G. Piatetsky-Shapiro, 1991. Report on AAAI-91 workshop on Knowledge Discovery in Databases, {\em IEEE Expert}, 6(5): 74--76. G. Piatetsky-Shapiro, C. Matheus, P. Smyth, and R. Uthurusamy, 1994. KDD-93: Progress and Challenges in Knowledge Discovery in Database, {\em AI Magazine}, 15:3, 77--87. G. Piatetsky-Shapiro, 1995. Editor, Special issue on Knowledge Discovery in Databases, {\em J. of Intelligent Information Systems} 4:1, January. J. Rosenfield, Avoid Dark Side of Database Marketing, Wall Street Journal, Oct 3, 1994, p. A20. See also KDD Nugget 94:20, http://info.gte.com/~kdd/nuggets/94/n20.txt M. Rosenberg, 1992. Protecting Privacy, Inside Risks column, {\em Communications of ACM}, 35(4), p. 164. \end{document} >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Return-Path: Date: Tue, 20 Jun 1995 18:25:41 +0300 (IDT) From: Yoram Reich To: KDD Nuggets Moderator Subject: New entry to siftware I'd like to add an entry to the siftware list. It is already in an HTML format. Thanks in advance. Yoram *Name: ECOBWEB
*Description: ECOBWEB is a concept formation program for the creation of hierarchical classification trees. It implements several extensions to Fisher's COBWEB program. In particular, it can work well with numeric attributes, it can perform simple constructive induction, it has a procedure for mitigating order effects, it has an experimentation procedure, and it has several methods for classification that make it suitable for design domains. ECOBWEB employs multistrategy learning; it is a concept formation program that includes case-based reasoning capabilities.
ECOBWEB was implemented in Common Lisp. I expect it to run on most implementations of the language. A longer description, with relevant publications and code, is available at http://or.eng.tau.ac.il:7777/topics/ecobweb.html
*Discovery methods: Clustering.
*Platform(s): Unix (but it may also run on other operating systems that have a Common Lisp implementation).
*Contact: Yoram Reich, Faculty of Engineering, Tel Aviv University, Ramat Aviv 69978, Israel, yoram@eng.tau.ac.il, phone: +972-3-640-7385, fax: +972-3-640-7617
*Status: public domain.
*Updated by: Yoram Reich on 1995-6-20
------------------------------------------------------------------------ Yoram Reich, Department of Solid Mechanics, Materials and Structures, Faculty of Engineering, Tel Aviv University, Ramat Aviv 69978, Israel Tel: + 972 3 6407385, Fax: + 972 3 6407617, email: yoram@eng.tau.ac.il Yoram Reich >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From: Ross Quinlan Date: Fri, 16 Jun 1995 11:08:49 +1000 Subject: New Releases of C4.5 and FOIL C4.5 Release 7 The latest release of C4.5 is now available. If you have Release 5 (i.e. the disk from Morgan Kaufmann), you can obtain the altered files by anonymous ftp from ftp.cs.su.oz.au, directory pub/ml, file patch.tar.Z. The file Modifications summarizes the changes since Release 5. Needless to say, it is advisable to retain the old files until you are satisfied with Release 7! FOIL Version 6.3 This version fixes several bugs and incorporates some improvements. It is available by anonymous ftp from ftp.cs.su.oz.au, directory pub, file foil6.sh. Please report any problems to quinlan@cs.su.oz.au. >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Return-Path: <@bt-web.bt.co.uk:roberts_h_d@bt-web.bt.co.uk> X-Vms-To: R11F::GTE.COM::EUREKA::KDD To: kdd <@gte.com:kdd@eureka> From: roberts_h_d Subject: data mining agents Date: Mon, 26 Jun 1995 18:10:08 +0100 Content-Type: text Content-Length: 1578 Gregory - Below is my summary of the Communications Week article, for possible inclusion in KDD Nuggets. On re-reading, the article does not actually mention data mining, but is more about agent and intelligent information retrieval. But, the future plans mentioned, and the database framework they have set up may be of interest/relevance. 
Regards, Huw Roberts Data Mining Group BT Laboratories ----------------------- Communications Week International, 12th June 1995, reports (in "Nabisco Unleashes Agents") that Nabisco is planning to develop and deploy intelligent agent software on more than 5,000 employee desktops. The agents search various company databases on consumer buying patterns and company and competitor sales, analyze the data, and recommend courses of action. The data comes from two main sources: an internal database of company sales and customer data, and an Express DBMS (from Information Resources Inc.) holding general food industry information. The databases are integrated using Axsys middleware from Information Advantage Inc., and in-house agent technology. The agents find, filter and present the results to 300 Nabisco executives in near-real time. Smarter agents are planned which will "provide concise analyses of the data and its implications for a specific decision-making process". To reduce network traffic, Nabisco plan to store as much of the key data as close to the user as possible. "As little as 20% of the data accessed by a given user typically provides 80% of the answers he or she is looking for." >~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From: SUSAN.F.TAFOLLA@sam.usace.army.mil Date: Tue, 13 Jun 95 12:23:09 CST To: kdd@gte.com Peter Clark - Machine Learning Software

Software which co-workers and I have developed. To be used solely at your own risk! We'd appreciate an acknowledgement if you use any of these packages in your research. We'd also be interested in hearing any comments or results you have from using this code. Have fun!

Keep up to date: Don't waste time reinventing new features! If you would like to be notified of upgrades or bug-fixes to the s/w below, please send me an email and I'll add you to a "users list" to keep you up to date. Similarly, if you have any questions or problems with the s/w, please email me (pclark@cs.utexas.edu).

Contents:

  1. Guiding Inductive Learning with a Qualitative Model
  2. LPE - Lazy Partial Evaluation
  3. CN2 - Rule induction from examples

1. Guiding Inductive Learning with a Qualitative Model

Overview

Input: a set of training examples and a qualitative model. Output: a set of propositional if...then... classification rules which are also "explainable" by the qualitative model. This package allows a qualitative model to bias induction of propositional if...then... rules (using CN2), so that only rules which are also "explainable" by the qualitative model (approximately: having a corresponding path in the influence graph) are found. This is important for practical application of ML, where we wish to use domain knowledge as well as training data to guide rule learning.

Learning occurs in two phases. First, a specialisation lattice containing all and only the rules "explainable" by the QM is explicitly enumerated. Second, the CN2 induction algorithm is used to learn rules from training data, but with CN2's specialisation operator restricted to work on the QM-generated specialisation lattice. (NB: other implementations of this method, eg. ones which don't explicitly enumerate the lattice a priori, would be equally valid).
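
The restriction described above can be illustrated with a small Python sketch. This is not the Quintus Prolog package: the graph encoding, the function names, and the reading of "explainable" as "all attributes in the rule lie on one path to the class variable in the influence graph" are assumptions made for illustration. The sketch also tests lattice membership on demand rather than enumerating every rule beforehand, which is the variant the NB above calls equally valid.

```python
def paths_to(graph, target):
    """Return all acyclic paths in the influence graph that end at `target`.

    `graph` maps each node to the list of nodes it influences (assumed encoding).
    """
    parents = {}
    for src, dsts in graph.items():
        for dst in dsts:
            parents.setdefault(dst, []).append(src)

    paths = []

    def extend(path):
        # Grow the path backwards, one influencing node at a time.
        for p in parents.get(path[0], []):
            if p not in path:
                new_path = [p] + path
                paths.append(new_path)
                extend(new_path)

    extend([target])
    return paths


def explainable_attribute_sets(graph, target, observable):
    """Phase 1 (toy version): one set of observable attributes per path to target."""
    allowed = set()
    for path in paths_to(graph, target):
        attrs = frozenset(n for n in path if n in observable)
        if attrs:
            allowed.add(attrs)
    return allowed


def restricted_specialisations(conds, selectors, allowed_sets):
    """Phase 2 hook: add one (attribute, value) test, keeping only children
    whose attribute set is contained in some QM-derived set."""
    used = {attr for attr, _ in conds}
    children = []
    for attr, val in selectors:
        if attr in used:
            continue
        if any((used | {attr}) <= s for s in allowed_sets):
            children.append(conds | {(attr, val)})
    return children
```

A CN2-style search would call restricted_specialisations in place of its usual "add any selector" step, so rules mentioning attributes with no path to the class variable are never generated.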

The authors are Stan Matwin (stan@csi.uottawa.ca) and myself.

Software

The algorithm is implemented in Quintus Prolog. It was made available on WWW in Jan 1995 so has not been extensively tested outside our lab yet. Contact us if you have questions. The software contains source code, the domain models and data sets used in the ML93 paper (below), and documentation. Knowledge of Prolog isn't needed to use the software. The software is public domain and freely available.

For those without Quintus Prolog -- we also provide Sun Sparc executables of this software (ie. compiled Prolog, without the Prolog development environment). These do not require a Quintus Prolog licence (nor even any knowledge of Prolog) to run, but of course require a Sparc machine. A licence may be needed for commercial use of these executables; contact me for info.

To download, follow the link on the web page (http://www.cs.utexas.edu/users/pclark/software.html). The software is tar'ed - to unbundle it, do "tar xf <file>" where <file> is the file that you stored the downloaded code in.

References

  • P. Clark and S. Matwin. Using qualitative models to guide inductive learning. In P. Utgoff, editor, Proc. Tenth Int. Machine Learning Conference (ML-93), pages 49-56, CA, 1993. Kaufmann.

2. LPE - Lazy Partial Evaluation

Overview

Lazy partial evaluation is a form of speed-up learning for reasoning with a domain theory. It is a hybrid between:
  • partial evaluation (PE), where a procedure is "unwound" in all possible ways and the results cached and indexed.
  • explanation-based learning (EBL), where just execution paths through the procedure which prove specific theorems are identified and cached.
LPE does "partial evaluation on demand". It can be advantageous over PE as it avoids redundant expansion of a procedure (hence saving memory and CPU time). It can be advantageous over EBL as it avoids proving theorems from scratch with the (slow) original domain theory (when no cached solution applies), and avoids the "masking effect" where suboptimal, cached solutions are chosen in preference to better solutions implicit in the domain theory (when a cached solution applies). It is described in detail in the paper below. The authors are Rob Holte (holte@csi.uottawa.ca) and myself.
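
As a rough illustration of "partial evaluation on demand", here is a toy Python sketch. It is not the algorithm from the paper or the Prolog implementation: the propositional, non-recursive domain theory, the class name LazyUnfolder, and its methods are all assumptions chosen to show the idea. A goal is unwound into conjunctions of operational (leaf) predicates only when it is first queried; all of its expansions are then cached, not just the one that proved the query.

```python
from itertools import product


class LazyUnfolder:
    """Toy on-demand unfolding of a propositional, non-recursive domain theory.

    `rules` maps a goal to a list of alternative bodies, each body being a list
    of subgoals. Anything that is not a key of `rules` is treated as an
    operational predicate that can be tested directly against the facts.
    """

    def __init__(self, rules):
        self.rules = rules
        self.cache = {}  # goal -> list of conjunctions of operational predicates

    def unfold(self, goal):
        if goal not in self.rules:
            return [(goal,)]  # operational predicate: nothing to unwind
        if goal not in self.cache:
            expansions = []
            for body in self.rules[goal]:
                # Unwind each subgoal, then combine the alternatives.
                parts = [self.unfold(sub) for sub in body]
                for combo in product(*parts):
                    expansions.append(tuple(p for conj in combo for p in conj))
            self.cache[goal] = expansions
        return self.cache[goal]

    def holds(self, goal, facts):
        """True if some cached expansion of `goal` is satisfied by `facts`."""
        return any(all(p in facts for p in conj) for conj in self.unfold(goal))
```

In this toy setting, full PE would correspond to calling unfold on every goal in the theory up front, while here goals that are never queried are never expanded. Because every alternative for a queried goal is kept, a later query is not limited to the path that happened to succeed first.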

Implementation

LPE is implemented in Quintus Prolog. It comes with documentation, demos, and the domain theories used in the paper below. The software is public domain and freely available. To download, follow the link on the web page (http://www.cs.utexas.edu/users/pclark/software.html). The software is tar'ed - to unbundle it, do "tar xf <file>" where <file> is the file that you stored the downloaded code in.

References

  • P. Clark and R. Holte. Lazy partial evaluation: An integration of explanation-based generalisation and partial evaluation. In D. Sleeman and P. Edwards, editors, Proc. Ninth Int. Machine Learning Conference (ML-92), pages 82-91, CA, 1992. Kaufmann.

3. CN2 - Rule induction from examples

Overview

This algorithm inductively learns a set of propositional if...then... rules from a set of training examples. To do this, it performs a general-to-specific beam search through rule-space for the "best" rule, removes training examples covered by that rule, then repeats until no more "good" rules can be found. The original algorithm (Machine Learning Journal paper below) defined "best" using a combination of entropy and a significance test. The algorithm was later improved to replace this evaluation function with the Laplace estimate (EWSL-91 paper, below), and also to induce unordered rule sets as well as ordered rule lists ("decision lists"). The software implements the latest version (ie. using the Laplace heuristic), but has flags which can be set to return it to the original version. The algorithm was designed by Tim Niblett (tim.niblett@turing.gla.ac.uk) and myself.
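
The covering loop and the Laplace estimate described above can be sketched in Python as follows. This is a simplified illustration, not the C implementation: the function names, the data layout (a list of (attribute-dict, class) pairs), the beam width, and the min_score stopping threshold are assumptions, and the threshold stands in for the real stopping criteria. It learns an ordered rule list, scoring each candidate rule with the Laplace estimate (n_c + 1) / (n_tot + k), where n_c is the number of covered examples in the predicted class, n_tot the number covered in total, and k the number of classes.

```python
from collections import Counter


def covers(conds, example):
    """A rule body covers an example if every (attribute, value) test matches."""
    return all(example.get(attr) == val for attr, val in conds)


def laplace(label, covered, n_classes):
    """Laplace accuracy estimate for predicting `label` on the covered examples."""
    n_c = sum(1 for _, y in covered if y == label)
    return (n_c + 1) / (len(covered) + n_classes)


def find_best_rule(examples, selectors, classes, beam_width=3):
    """General-to-specific beam search; returns (score, conds, label) or None."""
    best = None
    beam = [frozenset()]  # start from the most general (empty) rule body
    while beam:
        candidates = {}
        for conds in beam:
            used = {attr for attr, _ in conds}
            for sel in selectors:
                if sel[0] in used:
                    continue  # at most one test per attribute
                new = conds | {sel}
                covered = [(x, y) for x, y in examples if covers(new, x)]
                if not covered:
                    continue
                label = Counter(y for _, y in covered).most_common(1)[0][0]
                score = laplace(label, covered, len(classes))
                candidates[new] = score
                if best is None or score > best[0]:
                    best = (score, new, label)
        # Keep only the highest-scoring specialisations for the next round.
        beam = sorted(candidates, key=candidates.get, reverse=True)[:beam_width]
    return best


def learn_rule_list(examples, selectors, classes, min_score=0.6):
    """Find the best rule, remove the examples it covers, and repeat."""
    rules, remaining = [], list(examples)
    while remaining:
        best = find_best_rule(remaining, selectors, classes)
        if best is None or best[0] < min_score:
            break  # no more "good" rules (threshold is an assumed criterion)
        _, conds, label = best
        rules.append((conds, label))
        remaining = [(x, y) for x, y in remaining if not covers(conds, x)]
    default = Counter(y for _, y in examples).most_common(1)[0][0]
    return rules, default


def classify(rules, default, example):
    """Ordered rule list: the first rule that fires decides the class."""
    for conds, label in rules:
        if covers(conds, example):
            return label
    return default
```

Compared with raw accuracy, the Laplace estimate penalises rules that cover very few examples, so the search is less inclined to pick highly specific rules.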

Software

The revised version of CN2 was implemented in C in 1990 by Robin Boswell (robin@csd.abdn.ac.uk). Email me if you would like further information on obtaining a copy (pclark@cs.utexas.edu).

References

  • P. Clark and R. Boswell. Rule induction with CN2: Some recent improvements. In Y. Kodratoff, editor, Machine Learning - EWSL-91, pages 151-163, Berlin, 1991. Springer-Verlag.
  • P. Clark and T. Niblett. The CN2 Induction Algorithm. Machine Learning, 3(4):261-283, 1989.

pclark@cs.utexas.edu
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~