KDD Nuggets 95:4, e-mailed 95-02-17

Contents:
 * U. Fayyad, KDD-95 conference -- Final Reminder
 * D. Fisher, Clustering tech report
 * GPS, J. of Intelligent Information Systems, KDD spec. issue is out
 * R. Rymon, SE-Learn -- SE-tree-based learning package
 * J. J. Cannat, New Data Mining Tool AC2
 * D. Aha, slides for ML tutorial at AI & Stats 1995

The KDD Nuggets is a moderated mailing list for news and information
relevant to Knowledge Discovery in Databases (KDD), also known as Data
Mining, Knowledge Extraction, etc. Relevant items include tool
announcements and reviews, summaries of publications, information
requests, interesting ideas, clever opinions, etc.

******** Note for Submissions ********************************************
* Please have a descriptive Subject line in your contribution,           *
* e.g. "A nearest monster algorithm application to the Loch Ness         *
* problem" or "An ABCD-95 workshop on non-monotonic discovery of data    *
* in knowledge", instead of "Subject: a submission" or                   *
* "Subject: a workshop".                                                 *
*                                                                        *
* Workshop, conference, and other meeting announcements should be        *
* relevant to Knowledge Discovery in Databases.                          *
**************************************************************************

Nuggets frequency is approximately bi-weekly.

Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
references, FAQ, and other KDD-related information are now available at
the Knowledge Discovery Mine, URL http://info.gte.com/~kdd/
or by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README

E-mail add/delete requests to kdd-request@gte.com
E-mail contributions to kdd@gte.com

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories)   *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quotes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Correction: the quote
   Some people are like foxes -- they know many little things,
   and others are like hedgehogs -- they know one big thing.
is not Isaac Bashevis Singer, but Isaiah Berlin
(thanks to B. Chandrasekaran for the correction)

--
Change is difficult, uncomfortable and frightening.
That's how we know we are changing.
        Bernie Siegel, M.D. - Love, Medicine and Miracles

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--------------------------------------------
Date: Thu, 16 Feb 95 10:43:57 PST
From: fayyad@aig.jpl.nasa.gov (Usama Fayyad)
Subject: KDD-95 -- papers due March 3, 1995 -- final reminder

=========================================================================
                     C a l l   F o r   P a p e r s
=========================================================================

          The First International Conference on
          Knowledge Discovery and Data Mining (KDD-95)
          --------------------------------------------
          Montreal, Canada, August 20-21, 1995
          ====================================

Sponsored by AAAI and in cooperation with IJCAI, Inc.
Co-located with IJCAI-95.

Knowledge Discovery in Databases (KDD) and Data Mining are areas of
common interest to researchers in machine learning, machine discovery,
statistics, intelligent databases, knowledge acquisition, data
visualization, high-performance computing, and expert systems.
The rapid growth of data and information has created both a need and an
opportunity for extracting knowledge from databases, and both
researchers and application developers have been responding to that
need. KDD applications have been developed for astronomy, biology,
finance, insurance, marketing, medicine, and many other fields. Core
problems in KDD include representation issues, search complexity, the
use of prior knowledge, statistical inference, and algorithms for the
analysis of data that are massive in both size and dimensionality.

Due to the strong demand for participation and the growing need for
formal proceedings, it has become necessary to change the format of the
previous KDD workshops to a conference with open attendance. This
conference will continue in the tradition of the 1989, 1991, 1993, and
1994 KDD workshops by bringing together researchers and application
developers from different areas, and focusing on unifying themes such
as the use of domain knowledge, managing uncertainty, interactive
(human-oriented) presentation, and applications.

The topics of interest include:

   Foundational Issues and Core Problems in KDD
   Data Mining Tools and Applications
   Computationally Efficient Search for Structure in Data
   Interactive Data Exploration and Discovery
   Knowledge Representation Issues in KDD
   Data and Knowledge Visualization
   Data and Dimensionality Reduction
   Prior Domain Knowledge and Re-use of Discovered Knowledge
   Statistical and Probabilistic Aspects of KDD
   Dependency Models and Inference
   Machine Learning/Discovery Algorithms for Large Databases
   Managing Model Selection and Model Uncertainty
   Assessment of Model Predictive Performance
   Integrated Discovery Systems and Theories
   Parallel Techniques for Data Management and Search
   Security and Privacy Issues in Machine Discovery

This list of topics is not intended to be exhaustive, but an indication
of typical topics of interest. Prospective authors are encouraged to
submit papers on any topics of relevance to Knowledge Discovery and
Data Mining. We also invite working demonstrations of discovery
systems.

The conference program will include invited talks, a demo and poster
session, and panel discussions. An active discussion format will be
encouraged to maintain the workshop feel that previous participants
found valuable and constructive. The conference proceedings will be
published by AAAI. As in previous KDD workshops, a selected set of
KDD-95 papers will be considered for publication in journal special
issues and as chapters in a book.

PAPER SUBMISSION INFORMATION:

Please submit 5 *hardcopies* of a short paper (a maximum of 9
single-spaced pages, not including the cover page but including the
bibliography, with 1 inch margins and 12pt font) by March 3, 1995. A
cover page must include the authors' full addresses, e-mail, a 200-word
abstract, and up to 5 keywords. This cover page must accompany the
paper. IN ADDITION, an ASCII text version of the cover page MUST BE
SENT BY E-MAIL to kdd95@aig.jpl.nasa.gov by March 3, 1995.

Please mail the papers to:
   KDD-95
   AAAI
   445 Burgess Drive
   Menlo Park, CA 94025-3496
   U.S.A.

Send e-mail queries regarding submission logistics to: kdd@aaai.org

********  I m p o r t a n t   D a t e s  **********
**  Submissions Due:          March 3, 1995      **
**  Acceptance Notice:        April 10, 1995     **
**  Camera-ready paper due:   May 12, 1995       **
***************************************************

NOTE: information regarding local arrangements, registration, and the
technical program will be announced at a later date.

Conference Co-Chairs:
====================
Usama M. Fayyad (Jet Propulsion Lab, California Institute of Technology)
Ramasamy Uthurusamy (General Motors Research Laboratories)
Program Committee
=================
Rakesh Agrawal (IBM Almaden Research Center, USA)
Tej Anand (AT&T Global Information Solutions, USA)
Ron Brachman (AT&T Bell Laboratories, USA)
Wray Buntine (NASA Ames Research Center, USA)
Nick Cercone (University of Regina, Canada)
Peter Cheeseman (NASA Ames Research Center, USA)
Greg Cooper (University of Pittsburgh, USA)
Brian Gaines (University of Calgary, Canada)
Clark Glymour (Carnegie-Mellon University, USA)
David Hand (Open University, UK)
David Heckerman (Microsoft Corporation, USA)
Se June Hong (IBM T.J. Watson Research Center, USA)
Larry Jackel (AT&T Bell Laboratories, USA)
Larry Kerschberg (George Mason University, USA)
Willi Kloesgen (GMD, Germany)
David Madigan (University of Washington, USA)
Chris Matheus (GTE Laboratories, USA)
Heikki Mannila (University of Helsinki, Finland)
Gregory Piatetsky-Shapiro (GTE Laboratories, USA)
Daryl Pregibon (AT&T Bell Laboratories, USA)
Arno Siebes (CWI, Netherlands)
Evangelos Simoudis (Lockheed Research Center, USA)
Andrzej Skowron (University of Warsaw, Poland)
Padhraic Smyth (Jet Propulsion Laboratory, USA)
Alex Tuzhilin (NYU Stern School, USA)
Xindong Wu (Monash University, Australia)
Wojciech Ziarko (University of Regina, Canada)
Jan Zytkow (Wichita State University, USA)

Publicity Chair:  Padhraic Smyth, Jet Propulsion Laboratory
Industry Liaison: Gregory Piatetsky-Shapiro, GTE Laboratories

CONTACT INFORMATION:

Please send KDD-95 conference registration and related inquiries to:
-------------------------------------------------------------------
   KDD-95
   American Association for Artificial Intelligence (AAAI)
   445 Burgess Drive
   Menlo Park, CA 94025-3496, U.S.A.
   Phone: (+1 415) 328-3123; Fax: (+1 415) 321-4457
   Email: kdd@aaai.org

Please send technical program related queries to the Program Co-Chairs:
------------------------------------------------------------------
   Usama M. Fayyad
   Machine Learning Systems Group
   Jet Propulsion Lab M/S 525-3660
   California Institute of Technology
   Pasadena, CA 91109, U.S.A.
   Phone: (+1 818) 306-6197; Fax: (+1 818) 306-6912

   Ramasamy Uthurusamy
   Computer Science Department, AP/50
   General Motors Research, Bldg 1-6
   30500 Mound Road, Box 9055
   Warren, MI 48090-9055, U.S.A.
   Phone: (+1 810) 986-1989; Fax: (+1 810) 986-9356

   Email: kdd95@aig.jpl.nasa.gov

Please send KDD-95 publicity and related inquiries to:
-----------------------------------------------------
   Padhraic Smyth (KDD-95)
   Jet Propulsion Laboratory, 525-3660
   California Institute of Technology
   4800 Oak Grove Drive, Pasadena, CA 91109, U.S.A.
   Email: kdd95@aig.jpl.nasa.gov
   Phone: (+1 818) 306-6422; Fax: (+1 818) 306-6912

Inquiries about KDD-95 sponsorship and industry participation to:
----------------------------------------------------------------
   Gregory Piatetsky-Shapiro, e-mail: gps@gte.com
   GTE Laboratories, MS-45          tel: 617-466-4236
   40 Sylvan Road                   fax: 617-466-2960
   Waltham MA 02154-1120 USA

URL: http://info.gte.com/~kdd/kdd95.html
----------------------------------------------------------------------------

--------------------------------------------
Date: Fri, 17 Feb 1995 07:42:48 +0600
From: dfisher@vuse.vanderbilt.edu (Douglas H. Fisher)
To: kdd@gte.com
Subject: Clustering tech report

A technical report on some new results in clustering that may be of
interest to some in machine learning and knowledge discovery is
available at

   http://www.vuse.vanderbilt.edu/~dfisher/tech-reports/tr-95-01.html

Your comments and literature pointers are welcome; at your discretion
(and mine) they may be linked in and may inform subsequent revisions.
The title and abstract follow.

   Iterative Optimization and Simplification of Hierarchical Clusterings
   Technical Report CS-95-01

   Doug Fisher
   Department of Computer Science
   Box 1679, Station B
   Vanderbilt University
   Nashville, TN 37235

ABSTRACT: Clustering is often used for discovering structure in data.
Clustering systems differ in the objective function used to evaluate
clustering quality and in the control strategy used to search the space
of clusterings. Ideally, the search strategy should consistently
construct clusterings of high quality, but be computationally
inexpensive as well. In general, we cannot have it both ways, but we
can partition the search so that a system inexpensively constructs a
`tentative' clustering for initial examination, followed by iterative
optimization, which continues to search in the background for improved
clusterings. Given this motivation, we evaluate an inexpensive
`sorting' strategy coupled with several control strategies for
iterative optimization, each of which repeatedly modifies an initial
clustering in search of a better one. One of these optimization
strategies, inspired by work on macro-operator learning, appears to be
novel in the clustering literature. Once a clustering has been
constructed, it is judged by analysts -- often according to
task-specific criteria. Several authors have abstracted these criteria
and posited a generic performance task akin to pattern completion,
where the error rate over completed patterns is used to `externally'
judge clustering utility. Given this performance task, we adapt
resampling-based pruning strategies used by supervised learning systems
to the task of simplifying hierarchical clusterings, thus promising to
ease post-clustering analysis. Finally, we propose a number of
objective functions, based on attribute-selection measures for
decision-tree induction, that might perform well on the error rate and
simplicity dimensions.

Keywords: clustering, iterative optimization, cluster validation,
resampling, pruning, objective functions.
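To make the flavor of the iterative-optimization phase concrete, here
is a minimal Python sketch of the general idea: hill-climbing over flat
clusterings by repeatedly moving a single point between clusters. It is
illustrative only -- the report's algorithms operate on hierarchical
clusterings and use different objective functions; the names
iterative_optimize and within_cluster_fit, the toy objective, and the
toy data are all invented for this example.

    import random

    def within_cluster_fit(points, assign):
        """Toy objective (higher is better): negative sum of
        within-cluster squared deviations for 1-D points."""
        clusters = {}
        for p, a in zip(points, assign):
            clusters.setdefault(a, []).append(p)
        total = 0.0
        for pts in clusters.values():
            mean = sum(pts) / len(pts)
            total += sum((p - mean) ** 2 for p in pts)
        return -total

    def iterative_optimize(points, assign, objective, k=3, steps=2000, seed=0):
        """Hill-climb over flat clusterings: propose moving one point to
        another cluster; keep the move only if the objective improves."""
        rng = random.Random(seed)
        best = objective(points, assign)
        for _ in range(steps):
            i = rng.randrange(len(points))
            old = assign[i]
            assign[i] = rng.randrange(k)
            score = objective(points, assign)
            if score > best:
                best = score          # keep the improving move
            else:
                assign[i] = old       # undo the move
        return assign, best

    points = [1.0, 1.2, 0.9, 5.0, 5.1, 9.8, 10.1]
    assign, score = iterative_optimize(points, [0] * len(points),
                                       within_cluster_fit)
    print(assign, score)   # points end up grouped roughly by proximity

As in the abstract, a cheap initial clustering (here, everything in one
cluster) is handed off to an optimizer that keeps searching for a
better one.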
dfisher@vuse.vanderbilt.edu
http://www.vuse.vanderbilt.edu/~dfisher/dfisher.html
(615) 343-4111

--------------------------------------------
From: gps@gte.com (Gregory Piatetsky-Shapiro)
Subject: Special issue of JIIS on KDD
Date: 10 Feb 1995

The Journal of Intelligent Information Systems, 5(1), January 1995, a
special issue on Knowledge Discovery in Databases, guest edited by
Gregory Piatetsky-Shapiro, has been published.

Contents:

Foreword: Knowledge Discovery in Databases -- from Research to Applications
   Gregory Piatetsky-Shapiro -- 1
Automated Analysis and Exploration of Image Databases: Results,
Progress, and Challenges
   Usama Fayyad, Padhraic Smyth, Nick Weir, S. Djorgovski -- 3
Opportunity Explorer: Navigating Large Databases Using Knowledge
Discovery Templates
   Tej Anand -- 23
Selecting among Rules Induced from a Hurricane Database
   John Major and John Mangano -- 35
Efficient Discovery of Interesting Statements in Databases
   Willi Kloesgen -- 51
A Bayesian Method for Learning Probabilistic Networks that Contain
Hidden Variables
   Gregory F. Cooper -- 73
Discovering Dynamics: From Inductive Logic Programming to Machine Discovery
   Saso Dzeroski and Ljupco Todorovski -- 95

---

\begin{article}
\authorrunninghead{{Gregory Piatetsky-Shapiro}}
\titlerunninghead{Foreword}
\title{Guest Editor's Introduction: Knowledge Discovery in Databases --
from Research to Applications}
\author{Gregory Piatetsky-Shapiro}
\email{gps@gte.com}
\affil{GTE Laboratories, 40 Sylvan Road, Waltham MA 02154}

The notion of Knowledge Discovery in Databases (KDD) has been given
various names, including data mining, knowledge extraction, data
pattern processing, data archaeology, information harvesting, siftware,
and even (when done poorly) data dredging. Whatever the name, the
essence of KDD is the {\em nontrivial extraction of implicit,
previously unknown, and potentially useful information from data}
(Frawley et al. 1992). KDD encompasses a number of different technical
approaches, such as clustering, data summarization, learning
classification rules, finding dependency networks, analyzing changes,
and detecting anomalies (see Matheus et al. 1993).

Interest in KDD continues to increase, driven by the rapid growth in
the number and size of large databases and the application-driven
demand to make sense of them. The theory of KDD is of growing interest
to researchers in machine learning, statistics, intelligent databases,
and knowledge acquisition, as evidenced by the number of recent
workshops and publications (Piatetsky-Shapiro 1992, Zytkow 1993,
Cercone and Tsuchiya 1993, Parsaye and Chignell 1993, Ziarko 1994). KDD
applications have been developed for astronomy, agriculture, insurance,
marketing, software engineering, medicine, manufacturing, stock market
analysis, and many other fields.

This special issue is based on selected papers from the third KDD
workshop (Piatetsky-Shapiro 1993), held at AAAI-93 in Washington, D.C.,
and attended by over 60 researchers from 10 countries. A major trend
evident at the workshop was the transition to applications in the core
KDD area of discovery of relatively simple patterns in relational
databases; the most successful applications, such as the ones described
in the first two papers in this issue, are appearing in the areas of
greatest need, where the databases are so large that manual analysis is
impossible. Progress was also facilitated by the availability of
commercial KDD tools, both for generic discovery and for
domain-specific applications such as marketing. At the same time,
progress is slowed by problems such as lack of statistical rigor,
overabundance of patterns, and poor integration.

The first paper, by Fayyad, Smyth, Weir, and Djorgovski, shows how the
use of several innovative machine learning methods enabled much better
recognition of sky images than was possible with manual methods. The
second paper, by Anand, describes A.C. Nielsen's recent work on a
commercial product for identifying and reporting on trends and
exceptional events in extremely large supermarket sales databases. Both
applications have been deployed and are in regular use.

The pioneers of data mining have long realized that ``not all that
glitters is gold''. The next two papers address the critical issue of
how to select the best (real gold) patterns and rules among the many
glittering ones. Major and Mangano present a rule refinement strategy
which defines rule ``interestingness'' via rule accuracy, coverage,
simplicity, novelty, and significance. They were able to reduce 161
generated rules to 10 meaningful ones.
Kl\"{o}sgen describes innovative rule refinement and optimization strategies in Explora, an interactive system for discovery of interesting patterns in databases. Finally, the last two papers address research areas that are just beginning to be applied. Cooper gives the latest results in his work on the use of Bayesian statistical methods for the learning of causal probabilistic network models that contain hidden variables. D\v zeroski uses inductive logic programming methods to discover laws that govern the behavior of dynamical systems. This issue shows that research on discovery of relatively simple patterns in data has matured sufficiently to be transferred to applications. At the same time, complex application needs are driving further research on discovery of more sophisticated models. \acknowledgements I thank Chris Matheus for many useful suggestions and stimulating discussions, and Shri Goyal for his support and encouragement. \ \\ {\bf References} \parindent 0pt N. Cercone and M. Tsuchiya, 1993. Guest editors, Special Issue on Learning and Discovery in Databases, {\em IEEE Trans. on Knowledge and Data Engineering}, 5(6), Dec. \smallskip W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, 1992. Knowledge Discovery in Databases: An Overview. {\em AI Magazine}, Fall 1992. \smallskip C. Matheus, P. Chan, G. Piatetsky-Shapiro, 1993. Systems for Knowledge Discovery in Databases, {\em IEEE Trans. on Knowledge and Data Engineering}, 5(6), Dec. \smallskip K. Parsaye and M. Chignell, 1993. {\em Intelligent Database Tools \& Applications}. NY: John Wiley. \smallskip G. Piatetsky-Shapiro, 1992. Editor, Special issue on Knowledge Discovery in Databases, {\em Int. J. of Intelligent Systems} 7:7, Sep. \smallskip G. Piatetsky-Shapiro, 1993. Editor, {\em Proceedings of KDD-93: the AAAI-93 workshop on Knowledge Discovery in Databases}, AAAI Press report WS-02. \smallskip W. Ziarko, 1994. {\em Rough Sets and Knowledge Discovery}, Springer-Verlag. \smallskip J. Zytkow, 1993. Guest editor, Special Issue on Machine Discovery, {\em Machine Learning}, 12(1-3). \end{article} \end{document} -------------------------------------------- Return-Path: Date: Fri, 17 Feb 1995 13:48:07 -0500 From: Ron Rymon Posted-Date: Fri, 17 Feb 1995 13:48:07 -0500 To: kdd@gte.com Subject: SE-Learn Cc: rymon@isp.pitt.edu Hi, I would like to add an entry in your SIFTWARE catalog, to SE-Learn an SE-tree-based learning package. A LISP version if publicly available; a commercial version, in C, is under development. Enclosed is the comp.ai FAQ describing SE-Learn. Set-Enumeration (SE) Trees for Induction/Classification Ron Rymon Intelligent Systems Program University of Pittsburgh Significant research in Machine Learning, and in Statistics, has been devoted to the induction and use of decision trees as classifiers. An induction framework which generalizes decision trees using a Set-Enumeration (SE) tree is outlined in (Rymon, 1993). In this framework, called SE-Learn, rather than splitting according to a single attribute, one recursively branches on all (or most) relevant attributes. An induced SE-tree can be shown to economically embed many decision trees, thereby supporting a more expressive hypothesis representation. Also, by branching on many attributes, SE-Learn removes much of the algorithm-dependent search bias. Implementations of SE-Learn can benefit from many techniques developed for decision trees (e.g., attribute-selection and pruning measures). 
In particular, SE-Learn can be tailored to start off with one's
favorite decision tree, and then improve upon it by further exploring
the SE-tree. This hill-climbing algorithm allows trading time/space for
added accuracy. Current studies (as yet unpublished) show that SE-trees
are particularly advantageous in domains where (relatively) few
examples are available for training, and in noisy domains. Finally,
SE-trees can provide a unified framework for combining induced
knowledge with knowledge available from other sources (Rymon, 1994).

NOTE: A LISP implementation of SE-Learn is publicly available; please
write to Rymon@ISP.Pitt.edu. A commercial version in C is currently
under development.

Rymon, R. (1993), An SE-tree-based Characterization of the Induction
Problem. In Proc. of the Tenth International Conference on Machine
Learning, Amherst MA, pp. 268-275.

Rymon, R. (1994), On Kernel Rules and Prime Implicants. In Proc. of the
Twelfth National Conference on Artificial Intelligence, Seattle WA,
pp. 181-186.

Thanks,
-Ron

--------------------------------------------
Date: Wed, 15 Feb 95 14:36:37 +0100
From: cannat@is23.isoft.fr (Jean-Jacques Cannat)
To: kdd@gte.com
Subject: Add Data Mining Tool AC2

FROM: Jean-Jacques Cannat
      ISoft
      Chemin de Moulon
      91190 Gif sur Yvette
      tel: +33 (1) 69 41 27 77
      fax: +33 (1) 69 41 25 32
      e-mail: cannat@isoft.fr
TO: kdd@gte.com
SUBJECT: Addition of a DATA MINING tool, namely AC2

Dear Professor Gregory Piatetsky-Shapiro,

Please add the following description of a new Data Mining tool to your
S*i*ftware catalog. AC2 belongs to the topic: Classification /
Decision tree approach. Feel free to contact me if necessary.

Best Regards,
Jean-Jacques Cannat

-----------------DESCRIPTION ACCORDING TO TEMPLATE--------------

*Name: AC2
*Description: AC2 is a decision tree classification tool developed in
C++. AC2 allows the user to create and manipulate decision trees built
from data sets containing symbolic, numeric, noisy, and unknown values.
The scientific grounding of AC2 lies in its discriminatory methods and
in the representation language used for the data set.
  • AC2 integrates several discriminatory methods, including regression
    methods (CART) as well as splitting criteria such as gain ratio
    (J.R. Quinlan), Gini (CART), information gain (Shannon), and
    information class and distance (Mantaras), all of which have been
    extensively used and tested. (A small worked example of one of
    these criteria appears after this description.) AC2 provides
    confusion and cost matrices, pre- and post-pruning methods to avoid
    overfitting, and true error rate estimation based on powerful
    statistical procedures such as cross-validation and bootstrapping.
  • Data can be flat (the usual matrix format) or structured, using a
    representation language based on an object-oriented representation
    extended with relationships between objects. The representation
    language allows the user to bring domain knowledge to bear, so that
    the semantics of the domain can inform the classification process.
  • AC2 has been designed for the analysis of "real-world" data sets in
    areas such as banking, marketing, risk analysis, decision support
    systems, quality control, science, medical diagnosis and
    epidemiology, and population analysis and typology.
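As a small worked example of one of the splitting criteria listed
above, here is the Shannon information gain computation in Python. This
is a generic illustration of the measure, not AC2's implementation; the
helper names entropy and information_gain and the toy data are invented
for this sketch.

    from collections import Counter
    from math import log2

    def entropy(labels):
        """Shannon entropy, in bits, of a sequence of class labels."""
        n = len(labels)
        return -sum(c / n * log2(c / n) for c in Counter(labels).values())

    def information_gain(labels, groups):
        """Entropy reduction from partitioning `labels` into `groups`,
        e.g. by testing one attribute at a decision-tree node."""
        n = len(labels)
        return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

    # Six examples split perfectly by a binary attribute: gain = 1.0 bit.
    parent = ["yes", "yes", "yes", "no", "no", "no"]
    print(information_gain(parent, [["yes"] * 3, ["no"] * 3]))

A decision-tree builder evaluates such a criterion for every candidate
attribute at a node and splits on the one with the best score.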
    *Discovery methods: Classification, regression and discriminatory methods, decision tree approach.
    *Comments: The system has a well-designed and attractive interface
    that supports strong interaction with the user. The decision tree
    is displayed as a graph, allowing the user to inspect nodes, make
    changes, and run tests easily.

    References:
  • MLT: Machine Learning Toolbox, Esprit Project 2154, Deliverable
    D2.2, Specification of the CKRL of MLT.
  • StatLog: Comparative Testing of Statistical and Logical Learning,
    Esprit Project 5170, Deliverable D3.11, Description of AC2.
  • T. Brunet, 1993: Le probleme de la resistance aux valeurs inconnues
    dans l'induction : une foret de branches [The problem of robustness
    to unknown values in induction: a forest of branches], JFA-93.
  • T. Brunet, 1994: Le probleme de la resistance aux valeurs inconnues
    dans l'induction [The problem of robustness to unknown values in
    induction], doctoral thesis, Universite de Paris VI, France.

    *Source: ISoft S.A.
    *Platform(s): AC2, coded in C++, is available on PC under Windows
    3.1 and on Unix workstations (SUN, IBM RS6000, BULL DPX20, HP 700,
    DEC Alpha).
    *Contact: H. Perdrix, ISoft S.A., Chemin de Moulon, 91190 Gif sur
    Yvette, France; e-mail: hp@isoft.fr, ac2@isoft.fr;
    tel: +33 (1) 69.41.27.77; fax: +33 (1) 69.41.25.32
    *Status: commercial product
    *Updated by: ISoft S.A. on 1995-01-23
    ------------------- END OF DESCRIPTION ----------------------

--------------------------------------------
From <@watstat.uwaterloo.ca:aha@AIC.NRL.Navy.Mil> Wed Jan 11 15:00:02 1995
Subject: Annotated bibliography available
To: ai-stats@watstat.uwaterloo.ca

Hi - I presented the Machine Learning tutorial at AI & Stats 1995. I
promised some folks an annotated bibliography to go along with the
tutorial. It and the tutorial slides are now both available at
http://www.aic.nrl.navy.mil/~aha/slides.html.

Thanks,
David Aha