Data Mining and Knowledge Discovery Nuggets 96:6, e-mailed 96-02-14

Contents:
News:

* M. Hernandez, Results on The Data Cleaner

http://www.cs.columbia.edu/~mauricio

* B. Masand, Thinking Machines exits Chapter 11

http://www.think.com/
Meetings:

* S. McClean, UNICOM Data Mining Conference, London 25-26 APRIL 1996

* R. Golan, CIFEr 1996 call for participation, March 24-26, NYC

http://www.ieee.org/nnc/conferences/cfp/cifer96.html

--
Nuggets is a newsletter for the Data Mining and Knowledge Discovery community,
focusing on the latest research and applications.

Contributions are most welcome and should be emailed,
with a DESCRIPTIVE subject line (and a URL, when available) to (kdd@gte.com).
E-mail add/delete requests to (kdd-request@gte.com).

Nuggets frequency is approximately weekly.
Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
and a wealth of other information on Data Mining and Knowledge Discovery
are available at the Knowledge Discovery Mine site, URL http://info.gte.com/~kdd.

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories) *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Metadata -- what you wish you had in your data values
Ken Orr, a Data Warehousing expert

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 2 Feb 1996 12:23:19 -0500 (EST)
From: 'Mauricio A. Hernandez' (mauricio@cs.columbia.edu)
Subject: Results on The Data Cleaner

Results on The Data Cleaner:

In March of 1995 Timothy Clark, Computer Information Consultant for the
Office of Children Administrative Research (OCAR) of the Department of
Social and Health Services, posted a request on the KDD-nuggets [95:7]
asking for assistance analyzing one of their databases. Clark was looking
for software tools to apply to their entire 4.5+ million record database
to identify records belonging to the same child.

This problem was complicated by the fact that no reliable unique
identifier exists in the data to accurately identify a client (a common
problem for most 'real world' data sets). Although the records include
fields for a name, birth date, and social security number, it is not
uncommon that parts of these fields are erroneous, incomplete, or entirely
missing.

This problem is called out specifically as 'data scrubbing' in the
recent paper by Silberschatz, Stonebraker and Ullman reporting on
the NSF Workshop on the Future of Database Systems Research
(http://db.stanford.edu/pub/ullman/lagii.ps). The task of merging and
correlating data is a crucial step in any DM/KDD process and is
exemplified by this particular task of cleaning data on clients or
customers. Some corporate entities call this the 'merge/purge' problem.

We answered Clark's request [KDD-nuggets 95:8] and offered him an
experimental Merge/Purge system we had under development here at Columbia.
(We also promised to report our results on KDD nuggets. This note serves
to live up to that promise.)

The system, which we have now come to call The Data Cleaner, was
developed under the auspices of Citicorp. (The problem is of course common
to all banks.)

The Data Cleaner provides a rule-based 'equational-theory' component
allowing a user to define domain-specific criteria of record equivalence.
The system limits the search for matching records to a small number of
possible candidate records at a time. It also repeats this procedure a
number of times (in some cases only twice) using different criteria to
group possible candidates. It then computes the transitive closure over
its independent results, rapidly boosting the accuracy of record
clustering in toto.
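
The procedure described above (windowed comparison over sorted records, repeated under different sort keys, followed by a transitive closure) can be sketched roughly as follows. This is only an illustration under stated assumptions, not The Data Cleaner's actual code: the record fields, the rule in same_child(), the sort keys and the window size are all hypothetical, and a union-find structure is used here as one convenient way to compute the closure.

```python
# Minimal sketch of a multi-pass, windowed merge/purge.
# All names and rules below are hypothetical illustrations.

def same_child(a, b):
    """Hypothetical equivalence rule: equal non-empty SSNs, or
    equal last name (case-insensitive) and equal birth date."""
    if a["ssn"] and a["ssn"] == b["ssn"]:
        return True
    return a["last"].lower() == b["last"].lower() and a["dob"] == b["dob"]

def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def windowed_pass(records, key, window, parent):
    """Sort by one key and compare each record only with the
    next (window - 1) records in that order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if same_child(records[i], records[j]):
                ri, rj = find(parent, i), find(parent, j)
                if ri != rj:
                    parent[rj] = ri  # merge the two clusters

def merge_purge(records, keys, window=3):
    """Run one pass per sort key; sharing the union-find structure
    across passes yields the transitive closure of all matches."""
    parent = list(range(len(records)))
    for key in keys:
        windowed_pass(records, key, window, parent)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())

# Made-up sample data: record 2 has no SSN and a differently
# spelled name than record 0, so it never matches record 0 directly.
RECORDS = [
    {"last": "Smith", "dob": "1990-01-02", "ssn": "111"},
    {"last": "Smyth", "dob": "1990-01-02", "ssn": "111"},
    {"last": "Smyth", "dob": "1990-01-02", "ssn": ""},
    {"last": "Jones", "dob": "1985-05-05", "ssn": "222"},
]
KEYS = [lambda r: r["last"].lower(), lambda r: r["dob"]]

print(merge_purge(RECORDS, KEYS, window=2))
```

In this sample, records 0 and 1 match on SSN and records 1 and 2 match on name and birth date, so the closure places all three in one cluster even though records 0 and 2 fail the rule when compared directly.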

The OCAR staff have reported that their initial tests of the accuracy of
record matching are surpassing their expectations. An initial test of our
system using a sample of OCAR's data exhibited an accuracy rate of about
97%, a substantial improvement over OCAR's prior results, which achieved
no more than 90% accuracy rate. Accuracy is measured simply as the
percentage of correctly merged records out of the total number possible.
The accuracy rate was established by the OCAR staff after a laborious
'eyeballing' of the resultant 'cleaned' database.

OCAR is currently in the process of deploying our Data Cleaner in their
operational systems. OCAR's staff report that they are especially pleased
with the rule-based equational-theory since it can be easily modified and
recompiled as new characteristics of the data are discovered and as the
characteristics of the data change in the future.

Our implementation has been ported to Unix-based workstation clusters,
Linux and DOS. If you are interested in the details of The Data Cleaner,
point your browser to url:

http://www.cs.columbia.edu/~sal
http://www.cs.columbia.edu/~mauricio

for links to a recent paper describing our work.

Mauricio Hernandez and Sal Stolfo
Department of Computer Science
Columbia University
New York, NY 10027

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 13 Feb 1996 15:13:16 -0500
From: brij@darwin (Brij Masand)
Subject: [martha@Think.COM: TMC Exits Chapter 11]

From: martha@Think.COM (Martha Keeley)
To: Multiple recipients of list (thunko@wais.com)
Subject: TMC Exits Chapter 11

Wanted the list to hear this from the source--

FOR IMMEDIATE RELEASE

THINKING MACHINES CORPORATION
EMERGES FROM BANKRUPTCY PROTECTION

BEDFORD, Mass. (February 13, 1996) -- Thinking Machines
Corporation announced today it has successfully emerged from
Chapter 11 protection, three months after submitting a
reorganization plan to the U.S. Bankruptcy Court. It is now
a recapitalized, debt-free company focused on parallel
software tools and data mining solutions for multiprocessor
computing.

The company's creditors and the court have approved Thinking
Machines' reorganization plan, which will leverage its
acknowledged leadership in parallel technology to address
achievable market opportunities. The new Thinking Machines
will provide parallel solutions to mainstream businesses
on open hardware platforms for true networked computing and
intensive data mining efforts.

'In the nineties, parallel processing and data mining
technologies are evolving into strategic computing tools for
business, enabling companies to successfully compete in a
complex world,' said Robert L. Doretti, company president
and CEO. 'Our core competencies--parallel computing and
advanced data mining software--position the new Thinking
Machines to powerfully address the market demand for high
performance computing solutions.'

Thinking Machines recently completed a profitable year,
ending December 31, 1995. The company received a $10
million capital infusion from an investment group led by the
New York-based investment and merchant banking company of
Ladenburg, Thalmann Group Inc., and its affiliates. The
investment will be used to fund the company's marketing and
product development plans.

'The new Thinking Machines is a highly focused
organization,' added Doretti. 'Thanks to the extraordinary
talent of our employees and the competitive strength of our
products, the company is now well-poised for future growth.'

Over the next year, Thinking Machines will widen its
strategic alliances with mainstream hardware vendors, as
well as software companies and systems integrators, to
market its innovative and industry-leading parallel and data
mining software solutions on open computing platforms. In
November, the company announced the first of these
alliances, with Sun Microsystems.

'The confirmation of Thinking Machines' reorganization plan
successfully concludes one of the most complex, yet
cooperatively productive, Chapter 11 reorganizations in New
England,' said Charles Dougherty, a member of the law firm
of Hill & Barlow, Counsel to Thinking Machines in their
Chapter 11 case.

Founded in 1983, Thinking Machines is a pioneer in the
revolutionary computing technology field known as massively
parallel processing, one of the most powerful approaches to
addressing large-scale, data-intensive computing
problems. The company's leading-edge technology is employed
in fields from petroleum exploration to operations research
and database marketing. Today, the company's parallel
solutions are available through its product partnerships
with leading open hardware platform vendors, such as Sun
Microsystems, as well as through its standalone system, the
Connection Machine. Based in Bedford, Massachusetts,
Thinking Machines has offices worldwide.

>~~~Meetings:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: 'Prof Sally McClean' (sally@causeway.infc.ulst.ac.uk)
Organization: Informatics, University of Ulster
Date: Thu, 1 Feb 1996 18:21:04 GMT
Subject: UNICOM Data Mining Conference

DATA MINING '96

LONDON, 25-26 APRIL 1996

Sponsored by:
BCS SGES
AI Watch - the Newsletter of AI INTELLIGENCE

Background and Objectives

Many organisations have collected large amounts of data recording
their past activities. Buried within these databases is knowledge,
from which can be learnt important lessons which, in turn, can be
exploited to improve future performance. The extraction of this
knowledge, often in the form of a number of rules which describes how
one or more fields are related to other fields, is known as data
mining or KDD (Knowledge Discovery in Databases).

The techniques used in KDD exploit some of the most recent research in
artificial intelligence and machine learning. A fundamental purpose
of this Seminar is to gather together both academics and
representatives from industry in order to review the current
techniques and to discuss their practical application.

Industry and commerce have begun to see the potential of these
techniques and have started to exploit them in a wide range of
applications such as market segmentation, risk analysis, credit rating
and customer profiling. Data mining techniques have also been used by
social service departments and there is huge potential for medical
data mining. Case studies for a wide range of applications will be
presented.

Anyone wishing to apply these tools needs to be aware of the
availability and use of data mining toolkits. There are a number on
the market and some software is available on the World Wide Web. These
will be assessed both in case studies and in comparative studies.

Topics covered include:

Database manipulation
Tools
Who should attend?
Professor Sally McClean, University of Ulster
Mr Ken Totton, BT
Mr Tony Bowden, Tony Bowden & Associates

Programme
Day one
Keynote presentation

Inductive Query Languages
Arno Siebes, Centre for Mathematics and Computer Science, The Netherlands
State of the art in the data mining area
Inductive Query Languages: data mining technology for users rather than
data mining experts
'Interest Subsets' as an Inductive Query Language
The generality of interesting subsets
The KESO project: building a second generation data mining system

State of the art in Data Mining and Knowledge Discovery in Databases
Willi Kloesgen, GMD, Sankt Augustin, Germany
Definition of KDD and data mining
Why is KDD necessary
When can KDD be applied
The KDD process and its tasks
Methods for KDD tasks
KDD
- tools and systems
- applications
- architectures

Data Mining Using Modern Heuristic Techniques
V J Rayward Smith, University of East Anglia
Viewing data mining as an optimisation problem
Applying heuristic optimisation techniques to data mining:
traditional local search
genetic algorithms
simulated annealing
tabu search
hybrid systems
Case studies

Choosing the Right Data Mining Solution
Sarabjot S Anand et al, University of Ulster
Navigating the hype, buzz words and acronyms
Asking the right questions:
- Can I get what I want from the data I have?
- Is my data in a state that allows me to mine it?
- How will my running a Data Mining system on my data affect my
existing OLTP operations?

Distributed Database Management for Uncertainty Handling in Data Mining
Sally McClean and Bryan Scotney, University of Ulster
Combining data from different databases to produce new knowledge
integrating the aggregates
achieving lower levels of granularity
identifying new knowledge which could not have been found previously

Automatic Induction of Rules from Examples: a Critical Review
Max Bramer, University of Portsmouth
The knowledge elicitation bottleneck
The ID3 algorithm and its derivatives
Rule induction in practice
Strengths and weaknesses of the ID3 approach
Inducing decision trees v. modular rules - The Prism Project
Using knowledge to guide rule induction - The Cupid Project
Using induction to capture non-verbal skills
Directions for further research

The Specification and Implementation of Data-Defined Problems
Derek Partridge, University of Exeter
Data mining as the domain of data-defined problems
Software engineering and data-defined problems
Inductive programming techniques
A multiversion methodology for reliable data mining
Some examples
Conclusion

Knowledge Discovery in Large Databases on Data Compression and
Conceptual Clustering
Salem al-Naemi and Jorge Bocca, University of Birmingham
Using the process of conceptual clustering in KDD
Defining a clustering algorithm measure based on entropy
AOCCA - An Attributed Oriented Conceptual Clustering Algorithm
A KDD methodology to extract various qualitative and/or quantitative
knowledge rules
Implementation and analysis
Conclusion

Day Two
Keynote Presentation

From Data Mining to Knowledge Discovery: the Roadmap
Gregory Piatetsky-Shapiro, GTE Laboratories, USA
The rapidly growing databases are overwhelming the traditional, ad-hoc
methods of data analysis, while hiding many potentially valuable
nuggets of knowledge. This creates a need for a new, automated
approach for making sense of the data - the domain of an emerging
field called Data Mining and Knowledge Discovery in Databases (KDD).
KDD combines techniques of machine learning, expert systems,
databases, statistics and data visualisation to create a new
generation of intelligent and automated tools for discovery in data,
which are already being applied in many areas of business, science and
government all around the world. This presentation provides an
overview of KDD, focusing on:
Data Mining goals and methods
Survey of available data mining tools and Internet Resources
The steps of the knowledge discovery process
Application development challenges and pitfalls
Examples of successful data mining applications

Data Mining in BT
Ken Totton and Huw Roberts, Data Mining Group, BT Laboratories
Overview of BT's approach to data mining
The data mining process
Case Studies

Data Mining for Data Owners
Colin Shearer, Integral Solutions Ltd
Data owners: who they are and why they are the key to successful data
mining projects
Barriers to data owner involvement: technology, complexity and
accessibility issues
Presenting data mining technology to data owners
Case studies: successful examples of data mining by data owners

Experiences of Data Mining in a Financial Services Company
J C W DeBuse and B de la Iglesia, University of East Anglia
Using data mining tools for commercial data: the key features
Survey of currently available tools and algorithms
Experiences of applying these tools to commercial databases
Performance evaluation

Improving Customer Retention (through Knowledge Guided Data Mining)
Rob Milne, Intelligent Applications Ltd
Much more profit can be made from existing customers than new customers
This increase in profitability can be very high
Knowledge Guided Data Mining techniques are very effective at predicting
those customers most likely to change
A case study will be presented in which over 90% accuracy of
predictions was achieved - this would have been impossible with
traditional approaches. Guiding the data mining with knowledge
provided the critical success factor.

Data Mining and Data Visualisation with Silicon Graphics Technology
Chris Hardy, Silicon Graphics
Data mining tools for visualising and analyzing data with demonstrations
Case Studies

Business Applications of Statistics for Data Mining - Getting the
Basics Right
Jon Petersen, SPSS UK Ltd
An overview of statistical tools for data mining
The advantages of statistical techniques
Predictive tools and Classification tools
The use of one statistical technique, CHAID, which is especially
suitable to data mining problems
The use of statistical techniques to complement other techniques such
as Neural Networks
Examples of the use of statistics to gain competitive advantage,
illustrated by case studies

End of Conference

This event is part of a series of Seminars and Tutorials, to be held
in London from 22-26 April 1996, at Chelsea Village, Fulham Road.
Other topics include:
Intelligent Systems for Finance and Commerce 25-26 April
Intelligent Data Management 24-25 April
Uncertainty in Information Systems 24 April
Building the Data Warehouse 22-23 April
Developments in Database Technology 24 April
Data Warehousing and Parallel DB Servers 24-25 April
Enterprise Client/Server 24-25 April
Rapid Application Delivery for Client/Server 26 April
Middleware 26 April
OLAP Tutorial and EIS & OLAP Seminar 23-25 April

This series is complemented by an exhibition of related products and
services

Price: One day £395; 2 days £695; 3 days £950; 4 days £1275; 5 days
£1550. V.A.T. at 17.5% is charged on all fees.

SUBSTANTIAL ACADEMIC DISCOUNTS AVAILABLE. APPLY TO UNICOM FOR DETAILS

For further information on attending, exhibiting products and services,
contributing a paper, purchasing the proceedings or details of
UNICOM's Data Mining or Data Warehousing Reports, please contact
UNICOM @UNICOM.CO.UK. telephone +44 1895 256 484 fax +44 895 813 095.

-----------------------------------------------
Professor Sally McClean,
Division of Mathematics,
School of Information and Software Engineering,
University of Ulster,
Coleraine,
Northern Ireland BT52 1SA.
Telephone 44-1265-324602
www: http://www.infc.ulst.ac.uk/informatics/personnel/si.mcclean.html
Fax number 44-1265-324916
e-mail SI.McClean@ulst.ac.uk
----------------------------------------------

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: 7 Feb 1996 12:33:11 +0500
From: 'Robert Golan' (Robert_Golan@mail.tcpl.ca)
Subject: Re: CIFEr Call for Participa

Note: there is a session on Financial Data Mining.

--------------------------------------
$$$$$$$$$$$$$$$$$$$$$$
$ $
$ CIFEr 1996 $
$ $
$$$$$$$$$$$$$$$$$$$$$$

IEEE/IAFE Computational Intelligence
in Financial Engineering

full info at
http://www.ieee.org/nnc/conferences/cfp/cifer96.html

March 24-26 1996
Crowne Plaza Manhattan, New York

Sponsored by

- The Institute of Electrical and Electronics Engineers (IEEE):
Neural Networks Council
- International Association of Financial Engineers (IAFE)

----------------
Conference Scope
----------------

The IEEE/IAFE CIFEr Conference is the second annual collaboration between
the professional engineering and financial communities, and is one of the
leading forums for new technologies and applications in the intersection of
computational intelligence and financial engineering. Intelligent
computational systems have become indispensable in virtually all financial
applications, from portfolio selection to proprietary trading to risk
management.

--------
Sponsors
--------

Sponsorship for the CIFEr Conference is being provided by the IAFE
(International Association of Financial Engineers) and the IEEE Neural
Networks Council. The IEEE (Institute of Electrical and Electronics
Engineers) is the world's largest engineering and computer science
professional non-profit association and sponsors hundreds of technical
conferences and publications annually. The IAFE is a professional
non-profit financial association with members worldwide specializing in new
financial product design, derivative structures, risk management
strategies, arbitrage techniques, and application of computational
techniques to finance.

----------------------------
CONFERENCE AND TUTORIAL FEES
----------------------------

REGISTRATION FEES

EARLY BIRD CONFERENCE REGISTRATION THROUGH MARCH 8, 1996

IEEE & IAFE MEMBERS ............................ $400
NON-MEMBERS .................................... $550
FULL-TIME STUDENTS* ............................ $190
KEYNOTE SPEECH LUNCHEON MEAL TICKET
(Monday, March 25) ........................ $ 10

AFTER MARCH 8, 1996

IEEE & IAFE MEMBERS ............................ $450
NON-MEMBERS .................................... $600
FULL-TIME STUDENTS* ............................ $240
KEYNOTE SPEECH LUNCHEON MEAL TICKET
(Monday, March 25) ........................ $ 30

*Students must submit evidence of full-time enrollment on
University letterhead.

Conference registration fee includes refreshments, the
cocktail reception (on Sunday, March 24 at 5:15 P.M.) and
the conference proceedings. Be sure to attend the keynote
speech luncheon. You may send a check or money order for
your registration fee, or pay by credit card. Please make
your check payable to 'IEEE & IAFE CIFEr '96 Conference' and
print the attendee(s) name(s) on the face of the check.

>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~