KDD Nuggets 95:2, e-mailed 95-01-30

Contents:
* GPS, update on KDD-95 conference
* Winton Davies, on Software Agents for Mining Databases
* G. Hebrail, Statistical analysis of textual data, 2 abstracts
* D. Jensen, Statistical Evaluation of Classifiers
--- CFPs ---
* P. Ozturk, Int. Conference on Case-Based Reasoning 1995 (ICCBR-95)
* B. Julien, IJCAI-95 W'shop on ML in Engineering, 2nd CFP

The KDD Nuggets is a moderated mailing list for news and information relevant to Knowledge Discovery in Databases (KDD), also known as Data(base) Mining, Knowledge Extraction, etc. Relevant items include tool announcements and reviews, summaries of publications, information requests, interesting ideas, clever opinions, etc.

******** Note for Submissions ********************************************
* Please have a descriptive Subject line in your contribution,           *
* e.g. A nearest monster algorithm application to the Loch Ness problem  *
* or an ABCD-95 workshop on non-monotonic discovery of data in knowledge *
* instead of "Subject: a submission" or "Subject: a workshop"            *
*                                                                        *
* Workshop, Conference, and other Meetings announcements should be       *
* relevant to Data Mining and Knowledge Discovery.                       *
**************************************************************************

Nuggets frequency is approximately bi-weekly.
Back issues of Nuggets, a catalog of S*i*ftware (data mining tools), references, FAQ, and other KDD-related information are now available at Knowledge Discovery Mine, URL http://info.gte.com/~kdd/ or by anonymous ftp to ftp.gte.com, cd /pub/kdd, get README E-mail add/delete requests to kdd-request@gte.com E-mail contributions to kdd@gte.com -- Gregory Piatetsky-Shapiro (moderator) ********************* Official disclaimer *********************************** * All opinions expressed herein are those of the writers (or the moderator) * * and not necessarily of their respective employers (or GTE Laboratories) * ***************************************************************************** ~~~~~~~~~~~~ Quotable Quotes ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ "The goal of all inanimate objects is to resist man and ultimately defeat him." Russell Baker "The significant problems we have cannot be solved at the same level of thinking we were at when we created them." Albert Einstein ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -------------------------------------------- From: gps@gte.com (Gregory Piatetsky-Shapiro) Subject: KDD-95 Update Date: Jan 30, 1995 Reminder -- The Knowledge Discovery and Data Mining (KDD-95) Conference is approaching. The paper submission deadline is March 3, 1995. I have been informed by Padhraic Smyth that the first submission has already been received! Here is the submission info ( see KDD Nuggets 94:22 or http://info.gte.com/~kdd/kdd95.html for full call for papers and details). Please submit 5 *hardcopies* of a short paper (a maximum of 9 single-spaced pages not including cover page but including bibliography, 1 inch margins, and 12pt font) by March 3, 1995. A cover page must include author(s) full address, E-MAIL, a 200 word abstract, and up to 5 keywords. This cover page must accompany the paper. IN ADDITION, an electronic version of the cover page MUST BE SENT BY E-MAIL to kdd95@aig.jpl.nasa.gov by March 3, 1995. 
Please mail the papers to:

  KDD-95
  AAAI
  445 Burgess Drive
  Menlo Park, CA 94025-3496
  U.S.A.

Send e-mail queries regarding submission logistics to: kdd@aaai.org

--------------------------------------------
Date: Tue, 17 Jan 1995 17:25:00 GMT
To: gps@gte.com
From: wdavies@csd.abdn.ac.uk (Winton Davies)
Subject: Re: KDD Nuggets 95:1

Hi Gregory,

Regarding the discussion about data-mining, I'd like to mention the work I'm undertaking at Aberdeen - and wonder if there are any comments.

I think there is another way of mining distributed data, and that is to use software agents that mine local databases and then collaborate in refining local theories. Having spoken to Evangelos Simoudis from Lockheed, I understand the power of being able to rely on the database engine's capabilities. However, I still feel that if local mining can be done, then far less network traffic is required, and distributed computing power can be put to use. It also (this was Evangelos' point) may well provide an insight into how to integrate mined knowledge from different tools.

So far I have combined the Knowledge Sharing Effort with Agent-0 from Yoav Shoham, to provide the basic software agents. I have just started work on using FOCL to provide first-order mining as well as kn. int. capabilities.

I'm well aware that I'm not really in the same league as Lockheed et al, and that first-order learning is not going to be easy to apply to data mining (due to the preponderance of numeric data out there) or even particularly useful. I have yet to identify a nice application area that requires first-order learning.
Cheers,
Winton Davies, University of Aberdeen

--------------------------------------------
From: Georges.Hebrail@der.edf.fr ( Georges HEBRAIL )
Subject: Contribution to the kdd mailing list
To: kdd@gte.com
Date: Fri, 27 Jan 1995 14:18:58 +0100 (MET)

Dear Gregory,

You will find below the abstracts of two papers describing work we have done at Electricite de France and which may be relevant to the kdd community. They both describe work on applying statistical data analysis techniques to textual data. One paper was presented at the 92 ACM-SIGIR conference on Information Retrieval and the other at the 92 IFCS conference of the International Federation of Classification Societies.

With each abstract, you'll find the ftp address where people can get the Postscript version of the full paper. Any comments or remarks should be sent to: Georges.Hebrail@der.edf.fr

Best,
G.Hebrail

AUTOMATIC DOCUMENT CLASSIFICATION: NATURAL LANGUAGE PROCESSING, STATISTICAL ANALYSIS, AND EXPERT SYSTEM TECHNIQUES USED TOGETHER.

Blosseville M.J., Hebrail G., Monteil M.G., Penot N.
ELECTRICITE DE FRANCE Research Center
1, Av. du Gal de Gaulle
92141 CLAMART CEDEX FRANCE

Abstract: In this paper we describe an automated method of classifying research project descriptions: a human expert classifies a sample set of projects into a set of disjoint and pre-defined classes, and then the computer learns from this sample how to classify new projects into these classes. Both textual and non-textual information associated with the projects are used in the learning and classification phases. Textual information is processed by two methods of analysis: a natural language analysis followed by a statistical analysis. Non-textual information is processed by a symbolic learning technique. We present the results of some experiments done on real data: two different classifications of our research projects.
available on ftp : ftp://edf.edf.fr/hebrail/sigir92.ps.Z EXPERIMENTS OF TEXTUAL DATA ANALYSIS AT ELECTRICITE DE FRANCE G.Hebrail, J.Marsais ELECTRICITE DE FRANCE Research Center 1, Av. du Gal de Gaulle 92141 CLAMART CEDEX FRANCE Abstract: We present here some results of applying data analysis methods to datasets describing textual data. The dataset we consider is a set of research project descriptions from our research center. A natural language processor extracts keywords from the texts. These keywords are then replaced by more general concepts using a structured thesaurus. These steps lead to a description of the texts with a tractable number of variables. Then, we apply correspondence analysis and hierarchical clustering to our dataset. Correspondence analysis leads to a very synthetic view of the activity of the research center. Clustering discovers a structure which is very close to the actual organizations of our research center. available on ftp : ftp://edf.edf.fr/hebrail/ifcs92.ps.Z -------------------------------------------- From: "Jensen, David" Subject: Statistical Evaluation of Classifiers Date: Sun, 29 Jan 95 16:27:00 PST In Machine Learning List 7(2), Olivier Gascuel calls attention to the statistical problem of multiple tests. The assumption underlying most tests of statistical significance is that a single, isolated test is performed. Performing multiple tests violates this assumption and results in what statisticians refer to as a Type I error: the null hypothesis will be rejected, even though it is true. Put in practical terms, an experimenter will believe that one system is superior to another, even though there is really no difference between them. The problem of multiple tests is present in both: 1) experiments to determine whether one algorithm is superior to another; and 2) induction algorithms that employ search (i.e., essentially all induction algorithms). 
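
As a rough illustration of the multiple-tests problem described above, the sketch below computes the chance of at least one spurious rejection when several tests are run at the same significance level. It assumes that the tests are independent and that every null hypothesis is true; the function name `familywise_error_rate` and the 5% level are illustrative choices, not taken from the post.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false rejection (Type I error) among
    n_tests independent tests, each run at significance level alpha,
    when every null hypothesis is actually true.

    Independence is an assumption of this sketch.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if n_tests < 0:
        raise ValueError("n_tests must be non-negative")
    # Each test avoids a false rejection with probability (1 - alpha);
    # all n_tests avoid it together with probability (1 - alpha) ** n_tests.
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    # Show how quickly the overall error rate grows with the number of tests.
    for n in (1, 5, 10, 20, 50):
        rate = familywise_error_rate(0.05, n)
        print(f"{n:3d} tests at the 5% level: {rate:.3f}")
```

Under these assumptions, 20 tests at the 5% level give roughly a 64% chance of at least one spurious "significant" result, which is why a single-test significance level cannot be read at face value when many comparisons are made.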
Both of these situations are of interest to researchers studying induction algorithms. For example, the problem of multiple tests was noted at the last Knowledge Discovery in Databases Workshop (KDD-93) (1). Gascuel suggests one approach to dealing with multiple tests: dividing "the level of significance (usually 5%) by the number of tests". This is a vital first step toward greater statistical awareness in machine learning. However, this approach, sometimes known as a Bonferroni adjustment, assumes that the individual tests are independent (i.e., not correlated). If the tests are correlated, then using a Bonferroni adjustment will result in what statisticians call a Type II error: the null hypothesis will be accepted, even though it is false. An excellent paper on the problems with, and specialized alternatives to, the Bonferroni adjustment was presented at the recent AI & Statistics Workshop (2). Another alternative is a randomization test (3,4) -- a computational procedure that applies induction algorithms to randomized data to estimate what would be expected by chance alone, and then compares the actual results to the randomized results. While this approach is computationally intensive, it is effective, general, relatively simple, and escapes a variety of assumptions made by other methods. David Jensen djensen@ota.gov (1) Piatetsky-Shapiro, G., et al, "KDD-93: Progress and Challenges in Knowledge Discovery in Databases," AI Magazine, Spring 1994, 77-82. (2) Feelders, A. and W. Verkooijen, "Which method learns the most from data? Methodological issues in the analysis of comparative studies" Preliminary papers of the Fifth International Workshop on Artificial Intelligence and Statistics, January 4-7, 1995, 219-225. (3) Jensen, D., "Knowledge discovery through induction with randomization testing," Proceedings of the 1991 Knowledge Discovery in Databases Workshop, G. Piatetsky-Shapiro (Ed.), AAAI, 1991, 148-159. 
(4) Jensen, D., "Induction with Randomization Testing: Decision-oriented Analysis of Large Data Sets," Doctoral dissertation, Washington University, St. Louis, Missouri, May 1992.

--------------------------------------------
---------------- *********** CFP section ****************** --------------

Date: Fri, 13 Jan 1995 11:50:31 +0100
To: kdd@gte.com
From: Pinar.Ozturk@ifi.unit.no
Subject: Call for Papers, ICCBR-95

INTERNATIONAL CONFERENCE ON CASE-BASED REASONING 1995 (ICCBR-95)
Oct. 23-26, 1995
Sesimbra - Portugal

C A L L   F O R   P A P E R S

We welcome submissions to ICCBR-95, the first international conference on case-based reasoning. The conference follows and extends the CBR workshops that have taken place in the United States since 1988 and Europe since 1993. ICCBR is planned to become a biennial event, encouraging the more specific workshops to be held in the years between. This year's conference will be held in Europe, in the beautiful small town of Sesimbra, south of Lisbon.

Case-based reasoning is now an established area of AI, with a continually growing world-wide research community and a number of fielded applications. ICCBR-95 will be a broad-scoped and high-quality conference, where we will sum up the status of the field in a tutorial, discuss recent research results through scientific paper and poster sessions, and in application sessions focus on system building tools and industrial and other types of applications.

The overall aim of ICCBR-95 is to advance the scientific and application-oriented state of the CBR field by bringing researchers and system builders together for in-depth discussions and general exchange of views and ideas. As the first international event within the case-based reasoning field, ICCBR-95 should lead to a consolidation of our common platform, and to a strengthening of the CBR community and field in general.
The conference will start with an 'application day' consisting of an introductory tutorial and a set of presentations focusing on the practical uses of CBR technology. This is followed by three days of scientific paper presentations, invited talks, panel discussions, and poster sessions.

SUBMISSION OF PAPERS:

Submissions are invited on all aspects of CBR and analogical reasoning including (but not restricted to):
- Indexing and retrieval
- Case modification
- Learning
- Cognitive modelling
- Knowledge representation
- Case memory structures
- Integrated problem solving and learning
- Integrating CBR with other methods
- Case-based reasoning and design
- Case-based knowledge engineering
- Evaluation methods
- Comparisons of CBR systems
- System architectures
- Application-related methods for, e.g., Reuse of experience, Corporate memories, Information filtering/retrieval, Knowledge management, Active decision support, Education and tutoring.

Papers will be carefully reviewed for relevance to CBR, originality, significance, technical quality, and clarity of presentation. Scientific papers may address theoretical issues, empirical studies, or novel approaches related to cognitive models, computational methods, tool-oriented research, or development methodologies. Application papers may describe fielded or close-to-be-fielded systems, system building experience, or tool studies. Papers may be accepted for plenary or poster presentation. All accepted papers will be published in the conference proceedings.

All submissions should be in the form of a printed paper, written in English. Papers should be printed on 8.5" x 11" or A4 sized paper. They must be a maximum of 12 pages long, including figures and bibliography. Each page must have no more than 43 lines, with lines at most 140mm (5.5") long, set in 12 point type.
The papers should have a separate page, in addition to the 12 pages mentioned above, with the title of the paper, the names, postal and email addresses of the authors, a 200-word abstract, and a list of keywords. Please indicate on the front page whether your paper is a scientific paper or an application paper. Five copies of submitted papers should be sent by surface mail before April 10th to one of the program co-chairs:

PROGRAM CONFERENCE CHAIRS

Prof. Agnar Aamodt (program co-chair)
University of Trondheim, Department of Informatics
N-7055 Dragvoll, Norway
email: agnar@ifi.unit.no
phone: +47-7359-1838, fax: +47-7359-1733

Prof. Manuela Veloso (program co-chair)
Carnegie Mellon University, Department of Computer Science
Pittsburgh PA 15213-3891, USA
email: mmv@cs.cmu.edu
phone: (412) 268-8464, fax: (412) 268-5576

LOCAL ORGANIZATION CHAIR

Carlos Bento
Univ. de Coimbra - Lab. de Informatica e Sistemas
Vila Franca - Pinhal de Marrocos
3030 Coimbra - PORTUGAL
email: bento@mercurio.uc.pt
phone: +351 39 7000030, fax: +351 39 701266

IMPORTANT DATES

Submission Deadline                        April 10th 1995
Notification of Acceptance or Rejection    May 20th 1995
Camera-Ready Copy                          June 20th 1995

APPLICATIONS AND TOOL DEMOS

We also invite demos of systems and commercial tools during ICCBR-95. This will be organized by Stefan Wess, University of Kaiserslautern and Inference Germany:

Stefan Wess
Inference GmbH
Lise-Meitner-Strasse 3
D-85716 Unterschleissheim, Germany
email: wess@inference.co.uk
phone: ++49 89 3218180
fax: ++49 89 32181830

ARRANGEMENT

ICCBR-95 is organized by APPIA, the Portuguese Association for Artificial Intelligence, in cooperation with the University of Coimbra.

Sesimbra is a small town 35 Km south of Lisbon. It has great fishing traditions and is famous for its fish and seafood. Nearby is a spectacular region of natural forest which is part of the national heritage. Accommodation will be at Hotel do Mar, overlooking the beautiful beach of Sesimbra.
The bedrooms, with large terraces facing the ocean, are equipped with telephone, satellite television and air conditioning. The Hotel includes a broad set of sport facilities and a play area for children.

We are determined to keep the costs at an acceptable level, so that students and people from academia can also participate on a large scale. The prices are favorable in Portugal at this time of the year. There may be student scholarships available.

The conference will be sponsored by Acknosoft, Paris and some other companies and organisations yet to be decided or formally confirmed.

PROGRAM COMMITTEE

Agnar Aamodt (co-chair)     University of Trondheim
Manuela Veloso (co-chair)   Carnegie Mellon University
David Aha                   Naval Research Laboratory, Washington DC
Klaus Althoff               University of Kaiserslautern
Kevin Ashley                University of Pittsburgh
Ray Bareiss                 ILS, Northwestern University
Brigitte Bartsch-Spoerl     BSR Consulting, Munich
Jeff Berger                 University of Chicago
Karl Branting               University of Wyoming
Ernesto Costa               University of Coimbra
Paul Compton                University of New South Wales
Kris Hammond                University of Chicago
James Hendler               University of Maryland
Tom Hinrichs                ILS, Northwestern University
Carl Gustaf Jansson         Stockholm University
Surma Jerzy                 University of Economics, Wroclaw PL
Eric Jones                  Victoria University of Wellington N.Z
Mark Keane                  University of Dublin
James King                  AT&T GIS, USA
Janet Kolodner              Georgia Institute of Technology
David Leake                 University of Indiana
Michel Manago               Acknosoft, Paris
Enric Plaza                 IIIA, Spanish Scientific Research Council
Ashwin Ram                  Georgia Institute of Technology
Michael Richter             University of Kaiserslautern
Chris Riesbeck              ILS, Northwestern University
Edwina Rissland             University of Massachusetts
Derek Sleeman               University of Aberdeen
Ian Smith                   EPFL, Lausanne
Gerhard Strube              University of Freiburg
Katia Sycara                Carnegie Mellon University
Henry Tirri                 University of Helsinki
Shusaku Tsumoto             Tokyo Medical and Dental University
Angi Voss                   GMD, St. Augustin
Ian Watson                  University of Salford

-----------------------------------------------------------------------
PRELIMINARY REGISTRATION FORM

Please return the following information to Carlos Bento (bento@mercurio.uc.pt) as soon as possible.

Full Name:
Title:
Organization:
Address:
Tel:
Fax:
e-mail:

Check the appropriate boxes below:
[ ] I intend to submit a scientific paper to ICCBR-95.
[ ] I intend to submit an application-oriented paper to ICCBR-95.
[ ] I do not intend to submit a paper, but I would like to attend the conference.
[ ] I do not intend to attend the whole conference, but I would like to participate on the first day (tutorial and application day)

-------------------------------
Date: Fri, 27 Jan 95 11:31:34 EST
From: julien@magnum.crim.ca (Benoit Julien)
To: kdd@gte.com
Subject: Machine Learning in Engineering - 2nd Call for Participation

===========================================================================
SECOND CALL FOR PARTICIPATION (Extended Deadlines)

*** Workshop on Machine Learning in Engineering ***

International Joint Conference on Artificial Intelligence 1995
IJCAI-95
Montreal, Quebec, Canada
August 19-25, 1995
===========================================================================

WORKSHOP OBJECTIVES

The last ten years have seen a significant increase in the development of knowledge-based systems for engineering applications. As in other domains, the success of knowledge-based approaches in engineering depends critically on the quality of the knowledge acquisition process. In the early nineties, computer-aided engineering system developers quickly recognized the potential offered by emerging machine learning techniques. As machine learning moves from "toy" problems to "real" engineering applications, a concerted R&D effort becomes essential to identify and overcome critical engineering knowledge acquisition bottlenecks.
In that perspective, this workshop will bring together researchers applying or developing machine learning techniques for various engineering disciplines in order to establish important commonalities and differences in engineering learning problems. This forum will permit the definition of basic engineering learning tasks and their relationships with appropriate machine learning strategies. By presenting the state-of-the-art in machine learning applications to engineering, this event should also bridge many gaps between machine learning theory and engineering practice. TOPICS OF INTEREST All researchers and practitioners actively applying or developing machine learning techniques to engineering problems are encouraged to submit papers for this workshop. Topics of interest include, but are not limited to, the following: * Case studies Case studies of application of machine learning in engineering, with analysis of successes and failures. Examples of application topics: - Knowledge mining of engineering databases; - Engineering learning apprentice systems; - Semi-automated engineering knowledge acquisition; - Constructive induction in engineering; - Engineering knowledge discovery systems; - Engineering model acquisition and refinement; - Learning from sensory data. * Comparative studies Comparative studies of machine learning techniques solving similar engineering learning tasks. * Overviews Overviews of the state-of-the-art of machine learning in engineering. * Position papers on key issues Position papers discussing and proposing methodologies for solving important engineering learning issues. 
Examples of key issues:
- Prior knowledge in engineering learning problems;
- Tracking engineering concept drifts (dynamic knowledge);
- Mapping of generic engineering tasks with learning techniques;
- Multistrategy learning for engineering problems;
- Machine learning for engineering data analysis;
- Learning from very small or very large training sets;
- Learning from noisy and incomplete information;
- Integration of machine learning and interactive knowledge acquisition.

Papers describing strictly case studies of manual knowledge acquisition and maintenance are discouraged. This workshop does not cover applications of subsymbolic learning techniques such as neural networks and genetic algorithms.

SUBMISSION GUIDELINES

Submitted papers should not exceed 15 pages. The organizers intend to publish a selection of the accepted papers as a book or a special issue of a journal. The authors should take this into account while preparing their papers. In order to encourage the submission of work-in-progress reports, 5-page extended abstracts will also be accepted for submission. However, the accepted extended abstracts will not be considered for later publication. Copies of the workshop proceedings containing all accepted papers and extended abstracts will be prepared and made available by IJCAI at the workshop.

Each submitted paper and extended abstract should provide a clear description of the engineering task and the learning problem so that other participants not familiar with the application can easily understand the key characteristics and objectives of the research. The papers should also define all technical terms and make explicit the research methodology and the underlying characteristics and assumptions of the learning problem(s) and technique(s). The authors should also discuss important future issues as well as implications and possible extensions of their work to other engineering domains.
Each submitted paper and extended abstract will be reviewed by at least three members of the international program committee and will be judged on significance, originality, and clarity. Papers submitted simultaneously to other conferences or journals must state so on the title page.

DEADLINES

Four (4) hard copies of the papers or extended abstracts must be received by the workshop organizer by March 10, 1995 (extended deadline). Alternatively, electronic submissions in PostScript are encouraged. FAX submissions are not accepted. Notification of acceptance or rejection will be sent to the first (or designated) author with the reviewers' comments by March 31, 1995. Final camera-ready papers and extended abstracts should arrive by April 21, 1995. This one-day workshop will be held between Saturday 19 August and Monday 21 August 1995.

WORKSHOP CHAIRS

Benoit Julien, Centre de recherche informatique de Montréal (CRIM), Canada
Steven J. Fenves, Carnegie Mellon University, United States
Tomasz Arciszewski, George Mason University, United States

INTERNATIONAL PROGRAM COMMITTEE

Jerzy Bala, George Mason University, United States
James H. Garrett Jr., Carnegie Mellon University, United States
D. Gunarathnam, University of Sydney, Australia
Yves Kodratoff, University of Paris-Sud, France
Stan Matwin (not confirmed), University of Ottawa, Canada
Mahamad Mustafa, Savannah State College, United States
Yoram Reich, Tel-Aviv University, Israel
Wojciech Ziarko, University of Regina, Canada

ADDRESS FOR CORRESPONDENCE

Benoit Julien
Centre de recherche informatique de Montréal (CRIM)
1801, McGill College avenue, Suite 800
Montreal (Quebec) H3A 2N4
Canada
Phone : 1-514-398-5862
Fax : 1-514-398-1244
E-mail : julien@crim.ca

PAPER FORMAT

Submissions must be clearly legible, with good quality print. Papers and extended abstracts are limited to 15 and 5 pages respectively, including title page, bibliography, tables and figures.
Papers must be printed on 8.5 x 11 inch paper or A4 paper using 12 point type (10 characters per inch) with 1 inch margins and no more than 40 lines per page. The title page must include the names, postal and electronic (e-mail) addresses and phone and FAX numbers of all authors together with an abstract (200 words maximum) and a list of key words. The first key words should specify the engineering domain (e.g., electrical, civil, mechanical, industrial, chemical, environmental, metallurgy, mining), the engineering generic task (e.g., classification, scheduling, control, maintenance, planning, design), and the machine learning technique(s) used (e.g., case-based learning, conceptual clustering, explanation-based learning, rule induction, inductive predicate logic). Papers not in this format will not be reviewed. To save paper and postage costs please use double-sided printing or, preferably, send a PostScript file via internet to the workshop organizer.

WORKSHOP FORMAT

The format of the workshop will be paper sessions with discussion at the end of each session. The day will be divided into four (4) thematic sessions of an hour and a half each. A commentator from the program committee will be assigned to each presentation to initiate and supervise the discussions. The workshop will conclude with a panel discussion. The panel discussion will be instrumental in establishing guidelines for future integrations and collaborations and a research agenda for the next five years based on the key multidisciplinary issues identified.

The number of participants in the workshop is limited to 40. All workshop participants are expected to register for the main IJCAI conference and to pay an additional fee ($US 50) for the workshop. Those who would like to attend the workshop without giving a presentation should send a 1 page description of relevant research interests with a short list of selected publications.

Please send general inquiries to julien@crim.ca.
Further information about IJCAI-95 and related activities can be obtained from IJCAI home page on the web (http://ijcai.org/) or by e-mail at ijcai@aaai.org.