KDD Nuggets 95:15, e-mailed 95-06-29
Contents: 
	* R. Uthurusamy, KDD-95 final program,
		http://www-aig.jpl.nasa.gov/kdd95program.html
	* GPS, IEEE Expert Mini-symposium: KDD vs Privacy
	* Y. Reich, ECOBWEB: a public domain clustering tool,
		http://or.eng.tau.ac.il:7777/topics/ecobweb.html
	* R. Quinlan, New Releases of C4.5 and FOIL,
		ftp://ftp.cs.su.oz.au/pub/ml/patch.tar.Z 
	* H. Roberts, Communications Week on Data Mining
	* S. Tafolla, Peter Clark's Machine Learning Software, 
		http://www.cs.utexas.edu/users/pclark/software.html

The KDD Nuggets is a moderated mailing list for news and information
relevant to Knowledge Discovery in Databases (KDD), also known as
Data Mining, Knowledge Extraction, etc.  Relevant items include
tool announcements and reviews, summaries of publications, information
requests, interesting ideas, clever opinions, etc.
Please include a descriptive subject line in your submission.

Nuggets frequency is approximately bi-weekly. 

 Back issues of Nuggets, a catalog of S*i*ftware (data mining tools), 
 references, FAQ, and other KDD-related information are available 
 at Knowledge Discovery Mine, URL http://info.gte.com/~kdd/  or 
 by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README

E-mail add/delete requests to kdd-request@gte.com
E-mail contributions to kdd@gte.com
	-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) * 
* and not necessarily of their respective employers (or GTE Laboratories)   *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
		But of the tree of the knowledge of good and evil, 
		thou shalt not eat of it: 
		for in the day that thou eatest thereof thou shalt surely die.
								Genesis 2:17

		The desire of knowledge, like the thirst of riches, 
		increases ever with the acquisition of it.

			Laurence Sterne, Tristram Shandy [1760]
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Wed, 28 Jun 1995 11:05:54 -0500 (EST)
From: "R. Uthurusamy" <SAMY@gmr.com>
Subject: KDD-95 final program

                  The  First International Conference on
                  --------------------------------------
                Knowledge Discovery and Data Mining (KDD95)
                -------------------------------------------

                 Montreal, Canada, August 20-21, 1995
                 ====================================

	Sponsored by AAAI and in Cooperation with IJCAI, Inc.
	Co-located with IJCAI-95. 
Co-sponsored by:  AT&T Global Information Solutions
                  NASA - Jet Propulsion Laboratory
                  GTE Laboratories Inc.

Conference Co-Chairs:
====================
      Usama M. Fayyad (Jet Propulsion Lab, California Institute of Technology)
      Ramasamy Uthurusamy (General Motors Research)

Program Committee
=================
        Rakesh Agrawal            (IBM Almaden Research Center, USA)
        Tej Anand                 (AT&T Global Information Solutions, USA)
        Ron Brachman              (AT&T Bell Laboratories, USA)
        Wray Buntine              (NASA AMES Research Center, USA)
        Nick Cercone              (University of Regina, Canada)
        Peter Cheeseman           (NASA AMES Research Center, USA)
        Greg Cooper               (University of Pittsburgh, USA)
        Brian Gaines              (University of Calgary, Canada)
        Clark Glymour             (Carnegie-Mellon University, USA) 
        David Hand                (Open University, UK)
        David Heckerman           (Microsoft Corporation, USA)
        Se June Hong              (IBM T.J. Watson Research Center, USA)
        Larry Jackel              (AT&T Bell Labs, USA)
        Larry Kerschberg          (George Mason University, USA)
        Willi Kloesgen            (GMD, Germany)
        David Madigan             (University of Washington, USA)
        Chris Matheus             (GTE Laboratories, USA)
        Heikki Mannila            (University of Helsinki, Finland)
        Gregory Piatetsky-Shapiro (GTE Laboratories, USA)
        Daryl Pregibon            (AT&T Bell Laboratories, USA)
        Arno Siebes               (CWI, Netherlands)
        Evangelos Simoudis        (Lockheed Research Center, USA)
        Andrzej Skowron           (University of Warsaw, Poland)
        Padhraic Smyth            (Jet Propulsion Laboratory, USA)
        Alex Tuzhilin             (NYU Stern School, USA)
        Xindong Wu                (Monash University, Australia)
        Wojciech Ziarko           (University of Regina, Canada)
        Jan Zytkow                (Wichita State University, USA)

Publicity Chair:     Padhraic Smyth, Jet Propulsion Laboratory
Industry Liaison:    Gregory Piatetsky-Shapiro, GTE Laboratories
Demo Sessions Chair: Tej Anand, AT&T Global Information Solutions

CONTACT INFORMATION:

Please send KDD-95 conference registration and related inquiries to:
-------------------------------------------------------------------
KDD-95
American Association for Artificial Intelligence (AAAI)
445 Burgess Drive Menlo Park, CA 94025-3496.  U.S.A.
Phone: (+1 415) 328-3123; Fax: (+1 415) 321-4457    Email: kdd@aaai.org

Please send KDD-95 Publicity and related inquiries to:
-----------------------------------------------------
Padhraic Smyth (KDD-95)         email: kdd95@aig.jpl.nasa.gov
Jet Propulsion Laboratory, 525-3660, California Institute of Technology
4800 Oak Grove Drive, Pasadena, CA 91109 U.S.A.
Phone: (+1 818) 306-6422  Fax: (+1 818) 306-6912 

Inquiries about KDD-95 sponsorship and industry participation to: 
----------------------------------------------------------------
Gregory Piatetsky-Shapiro,      e-mail: gps@gte.com  
GTE Laboratories, MS-45         tel: 617-466-4236 
40 Sylvan Road                  fax:  617-466-2960 
Waltham MA 02154-1120 USA       URL: http://info.gte.com/~kdd/          
----------------------------------------------------------------------------

			    Technical Program
			    -----------------

*****************   Sunday - August 20, 1995  DAY 1 *********************

 7:30 -  8:30	Registration

 8:30 -  9:00	WELCOME, Opening remarks, Overview of KDD   (U. Fayyad)

 9:00 - 10:15	SESSION 1:  Databases and Data Mining
		Session Chair: Heikki Mannila

		Applying a Data Miner To Heterogeneous Schema Integration
		Son Dao and Brad Perry, Hughes Research Laboratories 

		Active Data Mining
		Rakesh Agrawal and Giuseppe Psaila, IBM Almaden Research Center

		A Database Interface for Clustering in Large Spatial Databases
		Martin Ester, Hans-Peter Kriegel, and Xiaowei Xu
		University of Munich, Germany

10:15 - 10:30	SPOTLIGHT SESSION 1  -- 6 poster summaries (P1 through P6)

10:30 - 10:50	COFFEE BREAK

10:50 - 11:00	SPOTLIGHT SESSION 2  -- 4 poster summaries (P7 through P10)

11:00 - 11:50	INVITED SPEAKER:    David Haussler, UCSC
		Using Hidden Markov Models to Search Biosequence Databases

11:50 - 12:00	SPOTLIGHT SESSION 3  -- 4 poster summaries (P11 through P14)

12:00 -  1:30	LUNCH BREAK

 1:30 -  2:30	PANEL SESSION

		Commercial KDD Applications: The Secret Ingredients for Success
		Panel Chairs:	Gregory Piatetsky-Shapiro, GTE Labs and
				Evangelos Simoudis, IBM Almaden Research

 2:30 -  3:20	SESSION 2:  Causality and Bayes Networks
		Session Chair: Alex Tuzhilin

		Available Technology for Discovering Causal Models, Building 
		Bayes Nets, and Selecting  Predictors: The TETRAD II Program
		Clark Glymour, Carnegie Mellon University 

		Learning Bayesian Networks with Discrete Variables from Data
		Peter Spirtes and Christopher Meek, Carnegie Mellon University

 3:20 -  3:30	SPOTLIGHT SESSION 4  -- 3 poster summaries (P15 through P17)

 3:30 -  3:50	COFFEE BREAK

 3:50 -  4:00	SPOTLIGHT SESSION 5  -- 4 poster summaries (P18 through P21)

 4:00 -  6:00	PARALLEL SESSION 3A		PARALLEL SESSION 3B
		===================		===================
		Session Chair:			Session Chair:
		Jan Zytkow			Willi Kloesgen

 6:00 -  8:00	KDD-95 RECEPTION
 
		POSTER SESSION 1

		DEMO SESSION

		Demo Session Chair: Tej Anand, AT&T Global Info. Solutions

*****************   MONDAY - August 21, 1995  DAY 2 *********************

 7:30 -  8:30	Registration

 8:30 -  9:20	SESSION 4:  Temporal Databases
		Session Chair: Wray Buntine

		Fast Spatio-Temporal Data Mining of Large Geophysical Datasets
		Paul Stolorz, JPL, et al.

		Discovering Frequent Episodes in Sequences
		H. Mannila, H. Toivonen, and A.I. Verkamo, Univ. of Helsinki

 9:20 -  9:30	SPOTLIGHT SESSION 6 -- 4 poster summaries (P22 through P25)

 9:30 - 10:30	INVITED SPEAKER:  Tomasz Imielinski, Rutgers University
		A Database Perspective on Knowledge Discovery

10:30 - 10:50	COFFEE BREAK

10:50 - 11:00	SPOTLIGHT SESSION 7 -- 4 poster summaries (P26 through P29)

11:00 - 11:50	SESSION 5:  Inductive Learning
		Session Chair: Xindong Wu

		MDL-Based Decision Tree Pruning
		M. Mehta, J. Rissanen, and R. Agrawal, IBM Almaden Res. Center

		Estimating the Robustness of Discovered Knowledge
		Chun-Nan Hsu and Craig A. Knoblock, U.S.C.

11:50 - 12:00	SPOTLIGHT SESSION 8 -- 4 poster summaries (P30 through P33)

12:00 -  1:30	LUNCH BREAK

 1:30 -  2:30	INVITED SPEAKER:  Jerome Friedman, Stanford University
		Intelligent Local Learning: Statistical Algorithms 
		for Prediction with High Dimensional Data

 2:30 -  3:20	SESSION 6:  KDD and STATISTICS
		Session Chair: Padhraic Smyth

		A Statistical Perspective On Knowledge Discovery In Databases
		John Elder, Rice Univ. and Daryl Pregibon, AT&T Bell Labs.

		Discriminant Adaptive Nearest Neighbor Classification
		Trevor Hastie, Stanford University and 
		Robert Tibshirani, University of Toronto

 3:20 -  3:50	COFFEE BREAK

 3:30 -  5:30	POSTER SESSION 2

		DEMO SESSION Repeated

 5:30 -  6:00	CONCLUDING REMARKS, SUMMARY and WRAP-UP Session (R. Uthurusamy)

***************************************************************

PARALLEL SESSION 3A:  Rough Sets and Databases
==============================================
	*****  Sunday, August 20, 1995: 4:00 - 6:00PM  *****

Discovery of Concurrent Data Models from Experimental Tables: 
A Rough Set Approach
Andrzej Skowron, Warsaw Univ. and Zbigniew Suraj, Pedagogical Univ., Poland

Automated Discovery of Functional Components of Proteins from Amino-Acid 
Sequences Based on Rough Sets and Change of Representation
Shusaku Tsumoto and Hiroshi Tanaka, Tokyo Medical and Dental Univ., Japan

Using Rough Sets as Tools for Knowledge Discovery
Ning Shan, Wojciech Ziarko, Howard J. Hamilton, and Nick Cercone, 
University of Regina, Canada

Exploiting Upper Approximation in the Rough Set Methodology
Jitender S. Deogun, University of Nebraska at Lincoln; 
Vijay V. Raghavan and Hayri Sever, University of Southwestern Louisiana

A Perspective on Databases and Data Mining
Marcel Holsheimer and Martin Kersten, CWI Database Res. Group, The Netherlands
Heikki Mannila and Hannu Toivonen, University of Helsinki, Finland

Compression-Based Evaluation of Partial Determinations
Bernhard Pfahringer and Stefan Kramer, Austrian Research Inst. for AI, Austria

PARALLEL SESSION 3B:  Supervised Learning: Issues and Applications
==================================================================
	*****  Sunday, August 20, 1995: 4:00 - 6:00PM  *****

Knowledge Discovery in Telecommunication Services Data 
Using Bayesian Network Models
Kazuo J. Ezawa and Steve W. Norton, AT&T Bell Laboratories

Analyzing the Benefits of Domain Knowledge in Substructure Discovery
Surnjani Djoko, Diane J. Cook, and Lawrence B. Holder, 
University of Texas at Arlington

Decision Tree Induction: How Effective is the Greedy Heuristic?
Sreerama K. Murthy and Steven Salzberg, Johns Hopkins University

Feature Subset Selection Using the Wrapper Method:  
Overfitting and Dynamic Search Space Topology
Ron Kohavi and Dan Sommerfield, Stanford University 

Learning Arbiter and Combiner Trees from Partitioned Data for 
Scaling Machine Learning
Philip K. Chan and Salvatore J. Stolfo, Columbia University

Are We Losing Accuracy While Gaining Confidence in Induced Rules:
An Assessment of PrIL
F. Ozden Gur-Ali, GE Corporate Research and Development and 
William A. Wallace, Rensselaer Polytechnic Institute
------------------------------------------------------------------------------
DEMO SESSION :   *****  Sunday, August 20, 1995: 6:00 - 8:00PM  *****
==============

Knowledge Discovery from Multiple Databases
James Ribeiro, George Mason University

Knowledge Discovery in Textual Databases 
Ronen Feldman, Bar-Ilan University

Exploiting Visualization in Knowledge Discovery 
Hing-Yan Lee, Hwee-Leng Ong, and Lee-Hian Quek, Information Technology Institute

KEFIR: The Key Findings Reporter for the analysis of healthcare information 
Christopher Matheus and Gregory Piatetsky-Shapiro, GTE Labs.

Automated Large-scale Data Mining by Forty-Niner (49er) 
Arun Sanjeev and Jan Zytkow

POSTER SESSION 1:  *****  Sunday, August 20, 1995: 6:00 - 8:00PM  *****
=================

SPOTLIGHT SESSION 1:

P1:  STAR: A General Architecture for the Support of 
     Distortion Oriented Displays
     Paul Anderson, Ray Smith, and Zhongwei Zhang, Monash University, Australia

P2:  Learning First Order Logic Rules with a Genetic Algorithm
     S. Augier, G. Venturini, and Y. Kodratoff, Univ. Paris-Sud, France

P3:  Discovery and Maintenance of Functional Dependencies by Independencies
     Siegfried Bell, University Dortmund, Germany

P4:  Intelligent Instruments: Discovering How to Turn 
     Spectral Data into Information
     Wray L. Buntine and Tarang Patel, NASA Ames Research Center 

P5:  Designing Neural Networks from Statistical Models: 
     A New Approach to Data Exploration
     Antonio Ciampi, McGill University, Canada and 
     Yves Lechevallier, INRIA-Rocquencourt, France

P6:  Capacity and Complexity Control in Predicting the Spread Between 
     Borrowing and Lending Interest Rates
     Corinna Cortes, Harris Drucker, Dennis Hoover, and Vladimir Vapnik,
     AT&T Bell Laboratories

SPOTLIGHT SESSION 2:

P7:  Limits on Learning Machine Accuracy Imposed by Data Quality
     Corinna Cortes, L. D. Jackel, and Wan-Ping Chiang, AT&T Bell Laboratories 

P8:  Knowledge Discovery in a Water Quality Database
     Saso Dzeroski, Jozef Stefan Institute and Jasna Grbovic, 
     Hydrometeorological Institute of Slovenia

P9:  Data Mining for Loan Evaluation at ABN AMRO: A Case Study
     A. J. Feelders and A. J. F. le Loux, University of Twente; 
     J. W. van't Zand, ABN AMRO Bank, The Netherlands 

P10: Knowledge Discovery in Textual Databases (KDT)
     Ronen Feldman and Ido Dagan, Bar-Ilan University, Israel

SPOTLIGHT SESSION 3:

P11: Optimization and Simplification of Hierarchical Clusterings
     Doug Fisher, Vanderbilt University 

P12: Structured and Unstructured Induction with EDAGs
     Brian R. Gaines, University of Calgary, Canada 

P13: Restructuring Databases for Knowledge Discovery by 
     Consolidation and Link Formation  
     Henry G. Goldberg and Ted E. Senator, 
     Financial Crimes Enforcement Network (FinCEN), U.S. Dept. of Treasury

P14: Rough Sets Similarity-Based Learning from Databases
     Xiaohua Hu and Nick Cercone, University of Regina, Canada

SPOTLIGHT SESSION 4:

P15: Efficient Algorithms for Attribute-Oriented Induction
     Hoi-Yee Hwang and Wai-Chee Fu, Chinese University of Hong Kong 

P16: Robust Decision Trees:  Removing Outliers from Databases
     George H. John, Stanford University  

P17: Conceptual Clustering in Structured Databases: A Practical Approach
     A. Ketterlin, P. Gancarski, and J. Korczak, LSIIT, 
     Univ. Louis Pasteur, France

POSTER SESSION 2:   *****  Monday, August 21, 1995: 3:30-5:30PM *****
=================

SPOTLIGHT SESSION 5:

P18: Anonymization Techniques for Knowledge Discovery in Databases
     Willi Kloesgen, German National Research Center for Info. Technology (GMD)

P19: Exploiting Visualization in Knowledge Discovery
     Hing-Yan Lee, Hwee-Leng Ong, and Lee-Hian Quek, 
     Information Technology Institute, Singapore 

P20: Knowledge-Based Scientific Discovery in Geological Databases
     Cen Li and Gautam Biswas, Vanderbilt University

P21: An Iterative Improvement Approach for the Discretization of Numeric
     Attributes in Bayesian Classifiers
     Michael J. Pazzani, University of California, Irvine 

SPOTLIGHT SESSION 6:

P22: Knowledge Discovery from Multiple Databases
     James S. Ribeiro, Kenneth A. Kaufman, and Larry Kerschberg, 
     George Mason University

P23: Discovering Enrollment Knowledge in University Databases
     Arun P. Sanjeev and Jan M. Zytkow, Wichita State University

P24: Extracting Support Data for a Given Task
     Bernhard Schoelkopf, Chris Burges, and Vladimir Vapnik, AT&T Bell Labs.

P25: Feature Extraction for Massive Data Mining
     V. Seshadri and Raguram Sasisekharan, AT&T Bell Laboratories; 
     Sholom M. Weiss, Rutgers University

SPOTLIGHT SESSION 7:

P26: Data Surveying: Foundations of an Inductive Query Language
     Arno Siebes, CWI, Database Research Group, The Netherlands

P27: On Subjective Measures of Interestingness in Knowledge Discovery
     Avi Silberschatz, AT&T Bell Labs and Alexander Tuzhilin, New York Univ.

P28: Using Recon for Data Cleaning
     Evangelos Simoudis, IBM Almaden Research Center; Brian Livezey and 
     Randy Kerber, Lockheed Palo Alto Research Laboratories

P29: Accelerated Quantification of Bayesian Networks with Incomplete Data
     Bo Thiesson, Aalborg University, Denmark

SPOTLIGHT SESSION 8:

P30: Automated Selection of Rule Induction Methods Based on Recursive 
     Iteration of Resampling Methods and Multiple Statistical Testing
     Shusaku Tsumoto and Hiroshi Tanaka, Tokyo Medical and Dental Univ., Japan

P31: Fuzzy Interpretation of Induction Results
     Xindong Wu, Monash University, Australia and 
     Petter Mahlen, Royal Institute of Technology, Sweden

P32: Resource and Knowledge Discovery in Global Information Systems:  
     A Preliminary Design and Experiment
     Osmar R. Zaiane and Jiawei Han, Simon Fraser University, Canada

P33: Toward a Multi-Strategy and Cooperative Discovery System
     Ning Zhong, The Univ. of Tokyo and Setsuo Ohsuga, The Waseda Univ., Japan
------------------------------------------------------------------------------
Additional details can be found at http://www-aig.jpl.nasa.gov/kdd95/  and
at http://www.aaai.org/
------------------------------------------------------------------------------


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Mon, 19 Jun 1995 09:47:19 -0400
From: gps0 (Gregory Piatetsky-Shapiro)
Subject: IEEE Expert April 1995

I am pleased to announce the publication, in the April 1995 issue of
IEEE Expert, of a mini-symposium on Knowledge Discovery in Personal Data
Versus Privacy.  The mini-symposium consists of a paper by Dan O'Leary
on the subject and responses from an international panel of experts: 
	Yew-Tuan Khaw and Hing-Yan Lee from Singapore;
	Willi Kloesgen from GMD, Germany; 
	Wojtek Ziarko from University of Regina, Canada;
	and Steven Bonorris from Office of Technology Assessment, USA


Below is a condensed version of my introduction to the mini-symposium, in LaTeX.

-- GPS

------------------------

\title{Knowledge Discovery in Personal Data vs. Privacy \\
a mini-symposium}

\author{Gregory Piatetsky-Shapiro \\
\ \\
GTE Laboratories Incorporated\\
40 Sylvan Rd., Waltham MA 02254\\
{\em gps@gte.com} }
\date{April 7, 1995}
\maketitle

\begin{flushright}
		But of the tree of the knowledge of good and evil, 
		thou shalt not eat of it: 
		for in the day that thou eatest thereof thou shalt surely die.
								Genesis 2:17
\ \\
		The desire of knowledge, like the thirst of riches, 
		increases ever with the acquisition of it.

			Laurence Sterne, Tristram Shandy [1760]
\end{flushright}

Dr. Chandrasekaran, during his tenure as IEEE Expert Editor-in-chief,
asked me to put together a mini-symposium on the issues of
Knowledge Discovery in Databases and Privacy, based on the paper by
Dan O'Leary on the subject.  I am very pleased to have been able to
assemble a distinguished panel of experts in the areas of Knowledge
Discovery in Databases.  This panel, international by design to
reflect the geographical differences in the privacy issue, consists of
Yew-Tuan Khaw and Hing-Yan Lee from Singapore; Willi Kloesgen from
GMD, Germany; and Wojtek Ziarko from University of Regina, Canada.
Steven Bonorris from Office of Technology Assessment gives the legal
perspective.

Here I briefly review the recent successes of Knowledge Discovery 
and highlight some of the important areas where it may conflict 
with privacy desires.  The other articles follow. 

The world-wide computerization of many business and government
transactions in the developed countries and their increasing storage
and availability on-line have created mountains of data that contain
potentially valuable knowledge.  Finding nuggets of knowledge in this
data is the focus of the rapidly growing field known as Data Mining or
Knowledge Discovery in Databases (Piatetsky-Shapiro and
Frawley 1991, Piatetsky-Shapiro 1991, Cercone and Tsuchiya 1993, 
Fayyad and Uthurusamy 1994, Piatetsky-Shapiro et al 1994, 
Piatetsky-Shapiro 1995, Fayyad and Uthurusamy 1995, Fayyad et al 1995).

While successful Knowledge Discovery in Databases (KDD) applications
have been developed for scientific and other non-personal databases,
most of the public attention has been focused on the analysis of
databases of personal information. Database marketing, which is the
application of KDD tools to customer data in order to find patterns of
customers who buy particular products, has even appeared on the cover
of Business Week (Sep 5, 1994).

Database marketing, while apparently very successful, has sometimes
been controversial.  The Wall Street Journal warned readers to avoid the
dark side of database marketing: too much personalization increases
customers' annoyance (Rosenfield 1994).
In 1990 Lotus developed and was planning to sell a CD-ROM with data
on about 100 million American households.  This plan generated such a
firestorm of protests over the privacy issues, that Lotus was forced to
cancel the product (Rosenberg 1992).

Privacy concerns have long been expressed with regard to basic data
collection and retrieval, and a number of guidelines for privacy
protection have already been proposed in most developed countries.
The guidelines and the existing privacy protections differ
significantly around the world, and they also differ with respect to
private and public data collectors.  The strongest data protection
currently exists in European Union countries, most of which adopted
the Organization for Economic Cooperation and Development (OECD)
guidelines, which are
the subject of Daniel O'Leary's article.  In the USA there are privacy
laws regulating the government usage of data, but very few laws
dealing with private corporations' use of data.  There are, however,
the NII "Draft Principles for Providing and Using Personal
Information", discussed in Steven Bonorris's article.

While concerns for privacy issues have long predated Knowledge
Discovery, the vastness of existing databases and the sophistication
of the advanced KDD methods have opened new potential vulnerabilities
in personal privacy protection.  We can divide the privacy issues in the
analysis of personal data into three types:

\begin{enumerate}
\item Privacy vs Basic Storage and Retrieval

\item Privacy vs Pattern Discovery

\item Privacy vs Combination of Group Patterns

\end{enumerate}

These issues are reviewed below.

\section{Privacy vs Basic Storage and Retrieval}

 The most fundamental privacy issues deal with basic storage and retrieval 
of personal data, which precede any discovery.  
Who can find out "What widgets did X buy on April 7, 1995?"
Both OECD guidelines and NII Draft Principles
suggest limiting the collection of sensitive data and limiting the 
access to personal data.  They suggest limiting the data use to the purposes 
for which either there is an advance consent of the data subject or the use
is authorized by law. 


\section{Privacy vs Pattern Discovery}

 If retrieval of specific information, such as "What widgets did X buy
on April 7, 1995" is allowed, then it is technically possible to find
patterns such as how frequently X buys widgets, what brand X prefers,
etc.  The technical equivalence between allowing retrieval and pattern
discovery is a point that should be considered in establishing
privacy guidelines.
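
The equivalence between retrieval and pattern discovery can be illustrated
with a minimal Python sketch.  Everything here is an assumption made for
illustration: `lookup(person, day)` is a hypothetical single-day retrieval
interface, and the toy transaction data is invented.  The point is only
that repeated point queries are enough to recover purchase frequency and
brand preference.

```python
from collections import Counter
from datetime import date, timedelta

def purchase_pattern(lookup, person, start, end):
    """Derive a buying pattern using only single-day retrieval queries.

    `lookup(person, day)` is a hypothetical retrieval interface that
    returns the list of brands the person bought on that day.
    Returns (fraction of days with a purchase, most frequent brand or None).
    """
    if end < start:
        raise ValueError("end must not precede start")
    brand_counts = Counter()
    days_with_purchase = 0
    day = start
    while day <= end:
        brands = lookup(person, day)
        if brands:
            days_with_purchase += 1
            brand_counts.update(brands)
        day += timedelta(days=1)
    total_days = (end - start).days + 1
    preferred = brand_counts.most_common(1)[0][0] if brand_counts else None
    return days_with_purchase / total_days, preferred

# Toy data standing in for a transaction database (invented for illustration).
toy_db = {
    ("X", date(1995, 4, 3)): ["Acme"],
    ("X", date(1995, 4, 7)): ["Acme", "Zenith"],
}
freq, brand = purchase_pattern(
    lambda p, d: toy_db.get((p, d), []), "X", date(1995, 4, 1), date(1995, 4, 10)
)
print(freq, brand)  # 0.2 Acme
```

No query in the sketch asks for more than one day of one person's
purchases, yet the result is a pattern about that person.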

 The NII Draft Principles permit the use of "transactional
records," such as phone numbers called, credit card payments, etc, as
long as such use is compatible with the original notice.
The use of transactional records probably also includes
 the discovery of patterns.

We should also note that discovered patterns in personal data may
involve very controversial fields, such as race, sex, religion, and
sexual orientation.  A recent example is the debate over the research
by Murray and Herrnstein which ranked different racial groups with
respect to their IQ (New Republic, 1994).  However, the First
Amendment guarantees the freedom of speech, and even though some
patterns can be very controversial, and it can be illegal to
discriminate on their basis, they can still be discovered and debated.


\section{Privacy vs Combination of Group Patterns}
\begin{flushright}
	Even if you are paranoid, it does not mean they are not after you 
					-- anonymous
\end{flushright}

In many cases (e.g. medical research, socio-economic studies) the goal
is to discover patterns not about specific individuals, but about
groups, e.g. which group is more likely to buy a widget, which
group has high unemployment rate, or which group has low incidence of
AIDS.  It would appear that such aggregate patterns are not covered by
the restrictions on personal data.

The problem arises because the combination of several such patterns,
especially in small datasets, may allow identification of specific
personal information, either with certainty or with high probability.

E.g. by learning that in the selected sample
\begin{itemize}
\item "people with code=A don't have AIDS"
\item "people with code=B don't have AIDS"
\item there are 10 people with code not equal to A or B
\item there are 9 cases of AIDS 
\item person X has code=C
\end{itemize}

it is possible to infer that X has AIDS with probability 0.9. 
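
The arithmetic behind this inference can be sketched in a few lines of
Python.  The function name and its arguments are illustrative assumptions,
not part of any cited tool.  The sketch assumes that every case falls in
the residual group (codes other than A or B) and that, absent other
information, X is equally likely to be any member of that group.

```python
from fractions import Fraction

def disclosure_probability(residual_group_size, cases_in_group):
    """Probability that a given member of the residual group is a case.

    Assumes all cases lie in the residual group and that each member of
    the group is equally likely to be one of the cases.
    """
    if residual_group_size <= 0:
        raise ValueError("residual group must be non-empty")
    if not 0 <= cases_in_group <= residual_group_size:
        raise ValueError("cases must fit inside the residual group")
    return Fraction(cases_in_group, residual_group_size)

# Numbers from the example: 10 people with code other than A or B, 9 cases.
p = disclosure_probability(10, 9)
print(float(p))  # 0.9
```

With all 10 members of the residual group being cases, the same
calculation gives probability 1, i.e. identification with certainty.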

A number of technical solutions have been proposed (see Kloesgen's article)
that would allow discovery of aggregate patterns while avoiding the 
potential invasion of privacy.  These solutions include 

\begin{itemize}

\item  Removing or replacing identifying fields in the data,
such as telephone numbers, names, and addresses (however, a person could
still be identified from secondary fields).

\item Replacing direct querying of data with querying on a randomly selected
(and each time different) sample.  This, however, may still allow 
identification by a determined intruder. % ref ??

\item Combining similar (in some way) individuals into groups and only storing 
data on those groups.  This does not allow identification of individual
data but may lose some interesting aggregate patterns. 

\item Generating synthetic data which has the same marginal distribution 
as the original data (however, it is very difficult to generate such data
for a large number of variables).
\end{itemize}
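
As one illustration of the last option, here is a deliberately naive
Python sketch.  It is an assumption-laden toy, not a method taken from
Kloesgen's article: it permutes each column independently, which
preserves only the one-variable marginal distributions and destroys all
joint patterns across columns.  The function name and the sample records
are invented for illustration.

```python
import random
from collections import Counter

def shuffle_columns(records, seed=None):
    """Return synthetic records with each column independently permuted.

    Every column keeps exactly the same multiset of values, so each
    one-variable marginal distribution is unchanged, but the link between
    the fields of any one original record is broken.
    """
    rng = random.Random(seed)
    columns = [list(col) for col in zip(*records)]
    for col in columns:
        rng.shuffle(col)
    return [tuple(row) for row in zip(*columns)]

# Invented sample records: (code, age, smoker).
original = [("A", 34, "yes"), ("B", 51, "no"), ("C", 29, "no"), ("A", 47, "yes")]
synthetic = shuffle_columns(original, seed=0)

# Each column's value counts are identical before and after.
for i in range(3):
    assert Counter(r[i] for r in original) == Counter(r[i] for r in synthetic)
```

Preserving joint distributions over several variables at once requires
far more than this sketch does.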

These topics, which pose interesting research issues,
 are discussed more by Kloesgen.

\ \\
I hope that this mini-symposium will shed light on the 
issues of privacy in knowledge discovery in personal
databases and
will help in generating guidelines that protect both individual
privacy and society's right to know.


{\bf Acknowledgements}: I want to thank Dr. Chandrasekaran for suggesting
a symposium on this topic, and Lance Hoffman for useful comments on 
O'Leary's paper. 

\section{References}
\parindent 0pt

N. Cercone and M. Tsuchiya, 1993.  Guest editors, 
Special Issue on Learning and Discovery in Databases, 
{\em IEEE Trans. on Knowledge and Data Engineering}, 5(6), Dec.

U. Fayyad and R. Uthurusamy, 1994. Editors, 
Proceedings of KDD-94: the AAAI-94 workshop on Knowledge Discovery
in Databases, AAAI Press report 94-WS-03.

U. Fayyad and R. Uthurusamy, 1995. Editors, 
Proceedings of KDD-95: First International Conference on Knowledge
Discovery and Data Mining, AAAI Press. 

U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, 1995. 
Editors, Advances in Knowledge Discovery and Data Mining, 
AAAI/MIT Press.

New Republic, Oct 31, 1994, Special Issue on Murray and Herrnstein's
 The Bell Curve.

G. Piatetsky-Shapiro and W. Frawley, 1991.
Editors, {\em Knowledge Discovery in Databases}, 
Cambridge, Mass.: AAAI/MIT Press. 

G. Piatetsky-Shapiro, 1991.
Report on AAAI-91 workshop on Knowledge Discovery in Databases, 
{\em IEEE Expert}, 6(5): 74--76.

G. Piatetsky-Shapiro, C. Matheus, P. Smyth, and  
R. Uthurusamy, 1994. KDD-93: Progress and Challenges in 
Knowledge Discovery in Databases, {\em AI Magazine}, 15:3, 77--87.

G. Piatetsky-Shapiro, 1995.  Editor, 
Special issue on Knowledge Discovery in Databases, 
{\em J. of Intelligent Information Systems} 4:1, January.

J. Rosenfield, 1994. Avoid Dark Side of Database Marketing,
{\em Wall Street Journal}, Oct 3, p. A20.  
See also KDD Nugget 94:20, http://info.gte.com/~kdd/nuggets/94/n20.txt

M. Rosenberg, 1992. Protecting Privacy, Inside Risks column,
{\em Communications of ACM}, 35(4), p. 164. 

\end{document}

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Return-Path: <yoram@or.eng.tau.ac.il>
Date: Tue, 20 Jun 1995 18:25:41 +0300 (IDT)
From: Yoram Reich <yoram@eng.tau.ac.il>
To: KDD Nuggets Moderator <kdd%eureka@gte.com>
Subject:  New entry to siftware

I'd like to add an entry to the siftware list.
It is already in an HTML format.

Thanks in advance.

Yoram


<A NAME="ECOBWEB">
*Name: ECOBWEB</a>
<br>*Description: 
ECOBWEB is a concept formation program for the creation of hierarchical
classification trees. It implements several extensions to Fisher's COBWEB
program. In particular, it can work well with numeric attributes, it can
perform simple constructive induction, it has a procedure for mitigating
order effects, it has an experimentation procedure, and it has several
methods for classification that make it suitable for design
domains. ECOBWEB employs multistrategy learning; it is a concept formation
program that includes case-based reasoning capabilities.
<BR>
ECOBWEB was implemented in Common Lisp. I expect it to run on most
implementations of the language. 

<A HREF="http://or.eng.tau.ac.il:7777/topics/ecobweb.html">Longer
description with relevant publications and code are here.</A>
<br>*Discovery methods: Clustering.
<br>*Platform(s): Unix (but may run on other operating systems
with Common Lisp).
<br>*Contact: <i>Yoram Reich, 
	         Faculty of Engineering, 
		 Tel Aviv University,
		 Ramat Aviv 69978, 
		 Israel, yoram@eng.tau.ac.il, 
		 phone: +972-3-640-7385,
		 fax: +972-3-640-7617</i> 
<br>*Status: public domain.
<br>*Updated by: <i>Yoram Reich</i> on 1995-6-20 
<HR>


------------------------------------------------------------------------
Yoram Reich, Department of Solid Mechanics, Materials and Structures,
   Faculty of Engineering, Tel Aviv University, Ramat Aviv 69978, Israel
      Tel: + 972 3 6407385, Fax: + 972 3 6407617, email: yoram@eng.tau.ac.il
          <A HREF="http://or.eng.tau.ac.il:7777/"><EM> Yoram Reich</EM></A>


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: Ross Quinlan <quinlan@ml2.cs.su.oz.au>
Date: Fri, 16 Jun 1995 11:08:49 +1000
Subject: New Releases of C4.5 and FOIL

C4.5 Release 7
The latest release of C4.5 is now available.  If you have Release 5 (i.e.
the disk from Morgan Kaufmann), you can obtain the altered files by anonymous
ftp from ftp.cs.su.oz.au, directory pub/ml, file patch.tar.Z.  The file
Modifications summarizes the changes since Release 5.

Needless to say, it is advisable to retain the old files until you are
satisfied with Release 7!


FOIL Version 6.3
This version fixes several bugs and incorporates some improvements.  It
is available by anonymous ftp from ftp.cs.su.oz.au, directory pub, file
foil6.sh.

Please report any problems to quinlan@cs.su.oz.au.

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: roberts_h_d <roberts_h_d@bt-web.bt.co.uk>
Subject: data mining agents
Date: Mon, 26 Jun 1995 18:10:08 +0100


Gregory -

Below is my summary of the Communications Week article, for possible 
inclusion in KDD Nuggets.  On re-reading, the article does not 
actually mention data mining, but is more about agents and intelligent 
information retrieval.  However, the future plans mentioned and the database 
framework they have set up may be of interest.

Regards,

Huw Roberts
Data Mining Group
BT Laboratories
-----------------------
Communications Week International, 12th June 1995, reports (in "Nabisco 
Unleashes Agents") that Nabisco is planning to develop and deploy 
intelligent agent software on more than 5,000 employee desktops.  The 
agents search various company databases on consumer buying patterns 
and company and competitor sales, analyze the data, and recommend 
courses of action.  The data comes from two main sources: an internal 
database of company sales and customer data, and an Express DBMS (from 
Information Resources Inc.) holding general food industry information. 
The databases are integrated using Axsys middleware from Information 
Advantage Inc., and in-house agent technology.  The agents find, 
filter and present the results to 300 Nabisco executives in near-real 
time.  Smarter agents are planned which will "provide concise analyses 
of the data and its implications for a specific decision-making 
process".  To reduce network traffic, Nabisco plans to store as much of 
the key data as close to the user as possible. "As little as 20% of the 
data accessed by a given user typically provides 80% of the answers he 
or she is looking for."


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: SUSAN.F.TAFOLLA@sam.usace.army.mil
Date: Tue, 13 Jun 95 12:23:09 CST
To: kdd@gte.com
     
<title>Peter Clark - Machine Learning Software</title>
     
<h1>
<img src=http://www.cs.utexas.edu/users/pclark/icons/cl-32/disk.xpm> 
Peter Clark - Machine Learning Software</h1>
     
Software which co-workers and I have developed. To be used
solely at your own risk! We'd appreciate an acknowledgement if you
use any of these packages in your research. We'd also be interested in 
hearing any comments or results you have from using this code. Have fun! 
<p>
<b>Keep up to date:</b> 
Don't waste time reinventing new features! If you would like to 
be notified of upgrades or bug-fixes to the s/w below, please send me an 
email and I'll add you to a "users list" to keep you up to date. 
Similarly, if you have any questions or problems with the s/w, please email me 
(pclark@cs.utexas.edu).
     
<h2>Contents:</h2>
<ol>
<li> <a href=#qm>Guiding Inductive Learning with a Qualitative Model</a> 
<li> <a href=#lpe>LPE - Lazy Partial Evaluation</a>
<li> <a href=#cn2>CN2 - Rule induction from examples</a> 
</ol>
     
<h1><a name=qm>
1. Guiding Inductive Learning with a Qualitative Model</a></h1>
     
<h2>Overview</h2>
     
<b>Input:</b> a set of training examples and a qualitative model. 
<b>Output:</b> a
set of propositional if...then... classification rules which are 
also "explainable" by the qualitative model.
This package allows a qualitative model to bias induction of 
propositional if...then... rules (using CN2), so that only rules which 
are also "explainable" by the qualitative model (approximately: 
having a corresponding path in the influence graph) 
are found. This is important
for practical application of ML, where we wish to use domain knowledge 
as well as training data to guide rule learning.
<p>
Learning occurs in two phases: First, a specialisation lattice containing 
only (and all) rules "explainable" by the QM is explicitly enumerated. 
Second, the CN2 induction algorithm is used to learn rules from training 
data, but with CN2's specialisation operator restricted to work on the 
QM-generated specialisation lattice. (NB: other implementations of this 
method, e.g. ones which don't explicitly enumerate the lattice a priori, 
would be equally valid). 
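<p>
The two phases can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not the Quintus Prolog implementation:
the influence graph and attribute names are hypothetical, a rule is reduced
to the set of attributes it tests, and "explainable" is approximated as
"every tested attribute has a directed path to the class attribute".

```python
from itertools import combinations

# Hypothetical influence graph: an edge a -> b means "a influences b".
INFLUENCES = {
    "rainfall": ["soil_moisture"],
    "soil_moisture": ["yield"],
    "fertiliser": ["yield"],
    "paint_colour": [],          # no path to the class attribute
}

def has_path(graph, start, goal, seen=None):
    """Depth-first check for a directed path start -> ... -> goal."""
    if start == goal:
        return True
    seen = set() if seen is None else seen
    seen.add(start)
    return any(has_path(graph, nxt, goal, seen)
               for nxt in graph.get(start, []) if nxt not in seen)

def explainable_lattice(graph, attributes, target):
    """Phase 1: enumerate every attribute set whose members all reach target."""
    ok = sorted(a for a in attributes
                if a != target and has_path(graph, a, target))
    return {frozenset(c)
            for n in range(len(ok) + 1) for c in combinations(ok, n)}

def specialise(rule, attributes, lattice):
    """Phase 2 hook: add one attribute, but only if the result is in the lattice."""
    return sorted((rule | {a} for a in attributes
                   if a not in rule and (rule | {a}) in lattice), key=sorted)

attrs = list(INFLUENCES) + ["yield"]
lattice = explainable_lattice(INFLUENCES, attrs, "yield")
print(specialise(frozenset(), attrs, lattice))
```

An induction algorithm that calls <tt>specialise</tt> instead of its usual
operator can never propose a rule testing <tt>paint_colour</tt>, because no
such rule was enumerated in phase 1.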
<p>
The authors are Stan Matwin (stan@csi.uottawa.ca) and myself.
     
<h2>Software</h2>
     
The algorithm is implemented in Quintus Prolog. It was made available on 
the WWW in Jan 1995, so it has not yet been extensively tested outside our lab. 
Contact us if you have questions. The software contains source code,
the domain models and data sets used in the ML93 paper (below), and 
documentation. Knowledge of Prolog isn't needed to use the software. 
The software is public domain and freely available.
<p>
For those without Quintus Prolog -- we also provide Sun Sparc executables of 
this software (i.e. compiled Prolog, without the Prolog development 
environment). These do not require a Quintus Prolog licence (nor even
any knowledge of Prolog) to run, but of course require a Sparc machine. 
A licence may be needed to use these executables commercially; 
contact me for info.
<p>
To download, click below. The software is tar'ed - to unbundle it,
do "<tt>tar xf &lt;file&gt;</tt>" where <tt>&lt;file&gt;</tt> is the file 
that you stored the downloaded code in. 
<ul>
<li><a href=ftp://ftp.cs.utexas.edu/pub/porter/pclark/qmlearn_v1.4.tar.Z> 
Learning with QM software</a> (1.6MB tar'ed and compressed). 
</ul>
     
<h2>References</h2>
     
<ul>
<li>P. Clark and S. Matwin.
         Using qualitative models to guide inductive learning.
         In P. Utgoff, editor, <i>Proc. Tenth Int. Machine Learning
 Conference (ML-93)</i>, pages 49-56, CA, 1993. Kaufmann.
(<a href=http://www.cs.utexas.edu/users/pclark/papers/ml93.abs>Abstract</a> 
and 
<a href=http://www.cs.utexas.edu/users/pclark/papers/ml93.ps>postscript</a>). 
</ul>
     
<h1><a name=lpe>2. LPE - Lazy Partial Evaluation</a></h1>
     
<h2>Overview</h2>
     
Lazy partial evaluation is a form of speed-up learning for reasoning 
with a domain theory. It is a hybrid between:
<ul>
<li> <b>partial evaluation (PE)</b>, 
where a procedure is "unwound" in all possible 
ways and the results cached and indexed. 
<li> <b>explanation-based learning (EBL)</b>, 
where just execution paths through 
the procedure which prove specific theorems are identified and cached. 
</ul>
LPE does "partial evaluation on demand". It can be advantageous over PE 
as it avoids redundant expansion of a procedure (hence saving memory and 
CPU time). It can be advantageous over EBL as it avoids proving theorems 
from scratch with the (slow) original domain theory (when no cached 
solution applies), and avoids the "masking effect" where suboptimal, cached 
solutions are chosen in preference to better solutions implicit in the 
domain theory (when a cached solution applies).
It is described in detail in the paper below. The authors are Rob Holte 
(holte@csi.uottawa.ca) and myself.
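<p>
The "on demand" idea can be illustrated with a small Python sketch. It is a
loose illustration only, not the algorithm from the paper: the domain theory
is a hypothetical, non-recursive propositional one, and once a goal is
demanded the sketch unfolds all of its alternatives, which may be more eager
than the real method. What it does show is that nothing is expanded or
cached until a query needs it.

```python
# Hypothetical propositional domain theory: head -> list of alternative bodies.
THEORY = {
    "can_fly": [["bird", "healthy"], ["has_ticket"]],
    "bird": [["has_feathers", "lays_eggs"]],
    "healthy": [["no_fever"]],
    "mammal": [["has_fur"]],     # never queried below, so never expanded
}

class LazyUnfolder:
    """Unfold goals into conjunctions of operational (leaf) facts, on demand."""

    def __init__(self, theory):
        self.theory = theory
        self.cache = {}          # goal -> list of frozensets of leaf facts

    def unfold(self, goal):
        if goal not in self.theory:          # operational fact: nothing to unfold
            return [frozenset([goal])]
        if goal not in self.cache:           # expand only when first demanded
            expansions = []
            for body in self.theory[goal]:
                partial = [frozenset()]
                for subgoal in body:
                    partial = [p | q for p in partial
                               for q in self.unfold(subgoal)]
                expansions.extend(partial)
            self.cache[goal] = expansions
        return self.cache[goal]

    def prove(self, goal, facts):
        """True if some unfolded expansion of goal is satisfied by the facts."""
        return any(exp <= facts for exp in self.unfold(goal))

unfolder = LazyUnfolder(THEORY)
print(unfolder.prove("can_fly", {"has_feathers", "lays_eggs", "no_fever"}))
print(sorted(unfolder.cache))
```

Full partial evaluation would also have unfolded <tt>mammal</tt> up front;
here it stays untouched because no query asked for it.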
     
<h2>Implementation</h2>
     
LPE is implemented in Quintus Prolog. It comes with documentation, demos, 
and the domain theories used in that paper.
The software is public domain and freely available.
To download, click below. The software is tar'ed - to unbundle it,
do "<tt>tar xf &lt;file&gt;</tt>" where <tt>&lt;file&gt;</tt> is the file 
that you stored the downloaded code in.
     
<ul>
<li><a href=ftp://ftp.cs.utexas.edu/pub/porter/pclark/lpe.tar.Z>LPE software</a>
</ul>
     
<h2>References</h2>
     
<ul>
<li> P. Clark and R. Holte.
 Lazy partial evaluation: An integration of explanation-based
  generalisation and partial evaluation.
 In D. Sleeman and P. Edwards, editors, <i>Proc. Ninth Int. Machine
  Learning Conference (ML-92)</i>, pages 82-91, CA, 1992. Kaufmann.
(<a href=http://www.cs.utexas.edu/users/pclark/papers/lpe.abs>Abstract</a> 
and 
<a href=http://www.cs.utexas.edu/users/pclark/papers/lpe.ps>postscript</a>). 
</ul>
     
<h1><a name=cn2>3. CN2 - Rule induction from examples</a></h1>
     
<h2>Overview</h2>
     
This algorithm inductively learns a set of propositional if...then... rules 
from a set of training examples. To do this, it 
performs a general-to-specific beam search through 
rule-space for the "best" rule, removes 
training examples covered by that rule, then repeats until no more "good" 
rules can be found. The original algorithm (Machine Learning 
Journal paper below) defined "best" using a combination of entropy and a 
significance test. The algorithm was later improved to replace this evaluation 
function with the Laplace estimate (EWSL-91 paper, below), and also to
induce unordered rule sets as well as ordered rule lists ("decision lists"). 
The software implements the latest version (i.e. using the Laplace heuristic), 
but has flags which can be set to return it to the original version. The 
algorithm was designed by Tim Niblett (tim.niblett@turing.gla.ac.uk) and 
myself.
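<p>
The covering loop described above can be sketched in Python as follows. This
is a simplified illustration, not the C implementation: it learns an ordered
rule list only, treats a rule as a set of attribute=value tests, and assumes
that a rule is "good" when its Laplace estimate beats that of the default
(empty) rule. The significance test of the original version is omitted, and
the function names and beam width are hypothetical.

```python
from collections import Counter

def laplace(covered, n_classes):
    """Laplace accuracy estimate: (n_majority + 1) / (n_covered + n_classes)."""
    counts = Counter(label for _, label in covered)
    return (max(counts.values(), default=0) + 1) / (len(covered) + n_classes)

def covers(rule, example):
    """A rule is a frozenset of (attribute, value) tests; all must hold."""
    return all(example.get(a) == v for a, v in rule)

def best_rule(examples, n_classes, beam_width=3):
    """General-to-specific beam search for the rule with the best Laplace score."""
    selectors = sorted({(a, v) for ex, _ in examples for a, v in ex.items()})
    beam, best, best_score = [frozenset()], None, laplace(examples, n_classes)
    while beam:
        candidates = {r | {s} for r in beam for s in selectors if s not in r}
        scored = []
        for rule in candidates:
            covered = [(ex, y) for ex, y in examples if covers(rule, ex)]
            if covered:                      # drop rules that cover nothing
                scored.append((laplace(covered, n_classes), sorted(rule), rule))
        scored.sort(key=lambda t: (-t[0], t[1]))
        if scored and scored[0][0] > best_score:
            best_score, best = scored[0][0], scored[0][2]
        beam = [rule for _, _, rule in scored[:beam_width]]
    return best                              # None if nothing beats the default

def cn2(examples, beam_width=3):
    """Sequential covering: learn a rule, remove what it covers, repeat."""
    n_classes = len({y for _, y in examples})
    remaining, rule_list = list(examples), []
    while remaining:
        rule = best_rule(remaining, n_classes, beam_width)
        if rule is None:
            break
        covered = [(ex, y) for ex, y in remaining if covers(rule, ex)]
        label = Counter(y for _, y in covered).most_common(1)[0][0]
        rule_list.append((rule, label))
        remaining = [(ex, y) for ex, y in remaining if not covers(rule, ex)]
    default = Counter(y for _, y in (remaining or examples)).most_common(1)[0][0]
    return rule_list, default

def classify(rule_list, default, example):
    """Ordered rule list: the first rule that fires decides the class."""
    for rule, label in rule_list:
        if covers(rule, example):
            return label
    return default
```

Each training example is a pair of an attribute dictionary and a class label,
e.g. <tt>({"outlook": "sunny", "windy": "no"}, "play")</tt>; <tt>cn2</tt>
returns the rule list and a default class, which <tt>classify</tt> then
applies in order.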
     
<h2>Software</h2>
     
The revised version of CN2 was implemented in C in 1990 by Robin Boswell 
(robin@csd.abdn.ac.uk). Email me if you would like further 
information on obtaining a copy (pclark@cs.utexas.edu).
     
<h2>References</h2>
     
<ul>
<li>
P. Clark and R. Boswell.
         Rule induction with CN2: Some recent improvements.
         In Y. Kodratoff, editor, <i>Machine Learning - EWSL-91</i>, pages
  151-163, Berlin, 1991. Springer-Verlag.
(<a href=http://www.cs.utexas.edu/users/pclark/papers/newcn.abs>Abstract</a> 
and 
<a href=http://www.cs.utexas.edu/users/pclark/papers/newcn.ps>postscript</a>).
     
     
<li> P. Clark and T. Niblett.
         The CN2 Induction Algorithm.
         <i>Machine Learning</i>, 3(4):261-283, 1989.
(<a href=http://www.cs.utexas.edu/users/pclark/papers/cn2.abs>Abstract</a> 
and 
<a href=http://www.cs.utexas.edu/users/pclark/papers/cn2.ps>postscript</a>). 
</ul>
     
<hr>
<address>
<a href=http://www.cs.utexas.edu/users/pclark>pclark@cs.utexas.edu</a></address>
>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~