Carol Hamilton, KDD-98 attendance statistics
Sal Stolfo, Some 'random' thoughts about KDD99
Nitin Agrawal, Urban Science Summary of their KDD-CUP-98 results
Publications:
William Shannon, Classification Society of North America Newsletter
Ronny Kohavi, Information Week story on data mining
Tools/Services:
Sergei Ananyan, PolyAnalyst PRO/Power -- new release of a
leading DM solution
Courses:
Eric King, DATA MINING: PRINCIPLES AND PRACTICE,
November 4-6, Dallas, Texas; January 27-29, Orlando, Florida
--
KDNuggets is an electronic newsletter focusing on the latest news,
publications, tools, meetings, and other relevant items in the Data Mining
and Knowledge Discovery field. KDNuggets is currently reaching over 5500
readers in 70+ countries twice a month.
Items relevant to data mining and knowledge discovery are welcome
and should be emailed to gps
in ASCII text or HTML format.
An item should have a subject line which clearly describes
what it is about to KDNuggets readers.
Please keep calls for papers and meeting announcements
short (50 lines or fewer, up to 80 characters per line), and provide a web
site for details, such as paper submission guidelines.
All items may be edited for size.
Back issues of KDNuggets, a catalog of data mining tools
('Siftware'), pointers to data mining companies, relevant websites,
meetings, etc. are available at KDNuggets home at http://www.kdnuggets.com/
********************* Official disclaimer ***************************
All opinions expressed herein are those of the contributors and not
necessarily of their respective employers (or of KDNuggets)
*********************************************************************
~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Some data miners use statistics as a drunk
uses a lamp post -- for support, not illumination.
Peter Huber
Date: Tuesday, September 29, 1998 1:47 PM
From: Carol Hamilton [hamilton@aaai.org]
Subject: KDD-98 attendance statistics
We have now completed the registration process for KDD-98. It breaks down
as follows:
619 Paid Technical Registrants
53 Complimentary Technical Registrants
29 Complimentary registrations for Panelists, Invited Speakers and
Tutorial Speakers
14 Student Scholars (Comp Registration)
10 PC Members (Comp Registrations)
33 Paid Workshop Only Registrants
68 Exhibit or Demo Personnel (No technical registration)
---
773 Total Attendees
From: Sal Stolfo
Subject: Some 'random' thoughts about KDD99
I mentioned to you at KDD98 that I would send this note regarding
thoughts I have about papers in KDD99, with the intent of starting a
dialogue to maintain and upgrade the scientific quality of KDD as a
field, especially now that SIGKDD is a reality.
By all means, share this with Usama and the other PC chairs (whose email
addresses I don't yet have).
a) One of the basic principles of the 'scientific method' is verification
and repeatability of experiments. In the context of KDD, much of this is
thwarted by the 'NDA' agreements between researchers/authors and
participating corporate sponsors (in some cases this also applies to
government agencies). It is not that every reported experiment ought to
be repeated for verification, but there should be a preference for
papers where this is at least possible, or at least addressed in some
sensible way. In my particular case, I have 1MM credit card records that
are covered by an NDA...and hence I cannot distribute the data. However,
I managed to get permission to publish a high level description of the
schema of the data, and others who wish to use the data are allowed to
do so at Columbia after signing the NDA. Charles Elkan did this! The
point is NOT to scare away corporate sponsors or researchers from
KDD....rather there should be an attempt to at least appear as if the
scientific method is the preferred standard for KDD publications, and
every attempt should be made to present specific results in terms that
are general and clear enough for others to attempt, including perhaps
schema information, parameter settings, and any other unique information
that may allow one to infer applicability to other applications.
b) Related to a) and to also address another issue raised at the
conference: what is potentially unique about KDD that might elevate it
as a discipline, and easily differentiate it from other disciplines,
rather than perhaps just being viewed as the application arm of ML,
statistics, Db, etc. Ted Senator has strong and valid opinions about
this with respect to the Innovative Application of AI
conference....papers MUST state and back up with sufficient evidence
(including literature search) the true novelty of the paper's
contributions. This of course can be many things....I do not need to
enumerate the obvious...but the call for papers should state that
authors MUST write such a claim and back it up. This will also help
reviewers properly judge a new contribution, rather than possibly
inferring it with perhaps incomplete knowledge.
Hence, authors must be guided to report their results in a manner where
their goals are clearly stated and their importance clarified. Yet
another paper reporting on application X using method Y with accuracy
results Z is no one's idea of a good paper, unless a particular
scientific or methodological goal is set forth and clearly demonstrated
as novel and achieved. For example, application X may be an important
goal in and of itself, and no one has ever tried it before. Or,
COST-based results are achieved by method Y not previously matched by
other competing methods. AND, please, no more ACCURACY COMPARISON papers
for 'real world' problems, unless ACCURACY is provably equivalent to
COST.
c) On the topic of KDD novelty as a field. Foster had an interesting
observation that the field ought to report upon truly novel pieces of
knowledge gleaned from some KDD process. (It is perhaps true that truly
novel learning algorithms will first be submitted and published in the
older established conferences.....maybe in time KDD will be viewed as the
place to publish this type of work...) 'We should focus on the 'KD' part
of KDD' is what I think I heard him say. Not a bad idea, unless the
true machine generated pieces of knowledge are so esoteric as to make
them uninteresting except for the authors (or their corporate sponsors
who would be unlikely to allow its disclosure anyway). One of the best
KDD papers I saw some years ago was not at all devoted to any novel
methods, nor to a novel application. But the results achieved using
rather simple methods produced a very interesting new piece of medical
knowledge that anyone can understand. Some Doctor (I think in Chicago
working with some computer scientist) processed two separate bodies of
medical literature (generated by two sub-specialities with no
discernible professional connection) using simple keyword extraction and
modeling techniques. They 'intersected' the two bodies of literature to
find some 'causal link' and they indeed did achieve their goal. I am not
a doctor so don't recall the specifics, other than some causal link
between the onset of Alzheimer's disease and some biochemical
abnormality. This was really neat in demonstrating a new useful piece of
knowledge by a rather simple method applied to two independent online
sources. The intrinsic value of the new knowledge is clear. The methods
used were not...but do point to new ways of attempting KDD to
'generate links' between sources, with a call for new and improved
methods that might be needed.
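This literature-intersection idea can be sketched in a few lines. The tiny corpora, the stop list, and the shared-term rule below are all invented for illustration; the original work used far richer keyword extraction and modeling.

```python
# Sketch of literature intersection: find terms shared by two otherwise
# unconnected bodies of literature, as candidate links between them.
import re
from collections import Counter

STOPWORDS = {"the", "of", "in", "and", "a", "to", "with", "is", "on"}

def keywords(docs):
    """Extract lowercase word tokens from each document, dropping stopwords."""
    counts = Counter()
    for doc in docs:
        for tok in re.findall(r"[a-z]+", doc.lower()):
            if tok not in STOPWORDS:
                counts[tok] += 1
    return set(counts)

corpus_a = [  # literature of sub-speciality A (invented)
    "magnesium deficiency linked to vascular spasm",
    "vascular spasm observed during migraine attacks",
]
corpus_b = [  # literature of sub-speciality B (invented)
    "migraine patients show low magnesium levels",
    "epilepsy and magnesium channel function",
]

# Terms present in both literatures are candidate 'causal links'.
shared = keywords(corpus_a) & keywords(corpus_b)
print(sorted(shared))  # ['magnesium', 'migraine']
```

A real system would weight terms, restrict to medical vocabulary, and rank candidate links, but the 'intersect two sources' core is this simple.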
d) The KDD Cup as well was somewhat problematic this year. Getting a
sufficiently stressful data set was hard to do. The results of the study
are perhaps important to the corporate participants (who now have a new
marketing sound bite), but it is not very clear what intrinsic value
the results might have for the researchers (without perhaps breaking
into the trade secrets of the corporate systems).
Well, perhaps this suggestion might be useful to achieve a few
concurrent goals for KDD:
The Chief Statistician of the Federal Government, Catherine Wallman,
some years ago directed the US Government statistical agencies to cross
link on the web. She created what is known as http://www.fedstats.gov.
Fedstats is a wonderful new national resource. It is an index into 70
(yes, SEVENTY) statistical agencies that gather and present data on
every conceivable topic. Some of the data is accessible via browser
scraping, others have query processors, others have forms/applications
to fill out to get data sent to you. It is a somewhat chaotic mess at
the moment, but will improve in time. The point is that fedstats is a
treasure-trove of data sources about many topics. Visit it yourself to
see. The government wishes to make all public data accessible to the
public, and needs to provide analysis tools for the public to analyze
this data. This activity is done by the staffs of policy makers, private
companies who package analyses for resale, by public-interest groups,
political organizations, students in classrooms, social/political
scientists, researchers in a variety of disciplines, etc. etc. KDD can
make a significant impact broadly across many disciplines if attention
were turned towards these sources, and relationships built between
specialists and KDD data miners at large...and the KDD cup may have an
easier time of finding suitable sources of a public nature!
Repeatability of experiments is possible. New knowledge gleaned from
public sources can be reported, and the heterogeneous and large scale
aspects of the sources provide a rich suite of perplexing problems for
KDD researchers to chew on...with the potential of real 'public good'.
[Since there was not enough time at the KDD-98 conference to describe their
results, I have offered to include their statement in KDnuggets.
Silver (SAS) and Bronze (Quadstone) medalists were also invited to
describe their results. For more details on KDD-CUP-98 see http://www.kdnuggets.com/kdd98/
-- GPS]
Urban Science wins the KDD-98 Cup (A second straight victory for GainSmarts)
Background
GainSmarts is a premier data-mining tool that provides solutions to
database marketers, analysts, and statisticians. GainSmarts
has been developed by Drs. Jacob Zahavi and Nissan Levin of Tel Aviv
University and Urban Science. GainSmarts is a fully automated tool
consisting of several suites. These suites cover all aspects of
data-mining, ranging from data import, sampling, data cleaning,
preprocessing, automatic transformations, and feature selection to model
building, cross-validation, scoring, and reporting. For further
information on GainSmarts visit our web page at http://www.urbanscience.com
and select GainSmarts.
Algorithm/Model
The competition for the KDD-98 cup was based upon actual data donated
by The Paralyzed Veterans of America (PVA). Each record in the
training PVA dataset represented a previously lapsed donor and
included their response to a recent mailing campaign, including the
donation amount (if applicable). The competitors were asked to
calibrate a model using their data-mining tool to predict the donation
amount. The competitors were evaluated based upon maximizing the net
donations for the campaign (total donations minus contact
costs). GainSmarts applied a two-stage regression model (similar to
Heckman's model) to predict the donation amount. The first step of the
two stage model is a classification model (we used Logistic
Regression) applied to all prospects, where each prospect is assigned
a probability of donation. The second step is an estimation model (we
used Linear Regression) applied to the responding donors. This second
model produces a conditional donation amount. The product of the
probability of donation (from step 1) and the conditional donation
amount (from step 2) produces an unconditional prediction of donation
amount.
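The two-stage approach can be sketched as follows, using scikit-learn as a stand-in for the proprietary GainSmarts engines; the synthetic data, the single feature, and the $0.68 contact cost are all invented for illustration.

```python
# Minimal sketch of a two-stage (Heckman-style) donation model.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)

# Synthetic prospects: one feature, a response flag, and a donation amount.
X = rng.normal(size=(500, 1))
responded = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0.5).astype(int)
amount = np.where(responded == 1, 10 + 2 * X[:, 0], 0.0)

# Stage 1: classification model, fit on all prospects, gives P(donate).
clf = LogisticRegression().fit(X, responded)
p_donate = clf.predict_proba(X)[:, 1]

# Stage 2: estimation model, fit on responders only, gives E[amount | donate].
reg = LinearRegression().fit(X[responded == 1], amount[responded == 1])
cond_amount = reg.predict(X)

# Unconditional prediction = P(donate) * E[amount | donate].
expected = p_donate * cond_amount

# Mail only prospects whose expected donation exceeds the contact cost.
contact_cost = 0.68  # assumed per-piece cost, not stated in the original text
mail = expected > contact_cost
```

Maximizing net donations then reduces to mailing exactly those prospects whose unconditional expected donation exceeds the cost of contacting them.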
Modeling Process
1.Split the dataset into train (calibration) and test (validation).
2.Explode raw variables into predictors using transformations. A
variable such as AGE can be used to create four binary categorical
variables based upon the distribution of AGE by quartile. Several
transformations are created for each variable. For example, AGE can
also be transformed into: Chi-Square categories, a LOG transform,
and a Piece-Wise Linear transform. Each type of transformation of an
individual variable is referred to as a set of
predictors. GainSmarts arranges these predictors hierarchically and
then tests each set to determine the 'best' transformation to
represent the variable in the subsequent modeling processes.
                    --------------------------------------
                 _- | Piece-wise Linear Transform of AGE |
               _-   --------------------------------------
 -------      -     ------------------------------------
 | AGE |  --------  | Chi-square categorization of AGE |
 -------      -_    ------------------------------------
                -_  ------------------------
                  - | LOG transform of AGE |
                    ------------------------
3.Univariate analysis by individual predictor
4.Correlation analysis by predictor (within the hierarchy) to
eliminate highly correlated predictors.
5.GainSmarts selects the best available representation for each
attribute using an expert system (rule based) approach, thereby
selecting either AGE by QUARTILES, or Piece-Wise Linear transform
for AGE, or ...etc.
6.Select the best set of attributes using a stepwise methodology.
7.Correlation analysis across all remaining attributes to remove
highly correlated attributes.
8.Select the final set of predictors in the model, using a rule based
mechanism, to eliminate overfitting. This is achieved by limiting
the number of coefficients (or weights), proper setting of
parameters and introducing/eliminating entire representations of
variables.
9.Parameter estimation and calibration
10.Cross validation and generate output (to EXCEL)
11.Model scoring (or code generation)
Note: The process from 2-10 was repeated for both stages of the modeling
process. Therefore, each stage of the modeling process could contain its
own unique variables with unique transformations.
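The variable-explosion step (2) can be sketched with pandas, assumed here as a stand-in for GainSmarts' own transformation machinery; the AGE data below is synthetic.

```python
# Sketch of exploding a raw variable into alternative candidate predictors.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
age = pd.Series(rng.integers(18, 90, size=200), name="AGE")

# Quartile-based categorical representation (four bins).
age_quartile = pd.qcut(age, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# LOG transform as another representation of the same variable.
age_log = np.log(age)

# One-hot encode the quartile bins into four binary predictors.
binaries = pd.get_dummies(age_quartile, prefix="AGE")
print(binaries.columns.tolist())  # ['AGE_Q1', 'AGE_Q2', 'AGE_Q3', 'AGE_Q4']
```

Each such set of derived columns is one candidate representation of AGE; the selection steps (3-8) then pick the single best representation per raw variable before fitting.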
Results
------------------------------------------------------------
                                Projected       Actual as
                                from TEST       reported by KDD
                                file            Cup Committee
------------------------------------------------------------
GainSmarts Net Donation         $14,844         $14,712
Net Donation if the entire      $10,500         $10,560
  file is mailed
Increase in Net Donation         $4,344          $4,152
% Increase                       41.37%          39.32%
------------------------------------------------------------
A comparison between the projected and actual results (less than 1%
error) indicates that the model developed was very robust and
reliable.
Conclusion
Urban Science attributes our KDD cup successes to our feature
selection expert system. This expert system includes (implicitly) the
many years of experience of Drs. Zahavi and Levin in developing models
and data mining systems. GainSmarts also practically automates the
entire modeling process. The manual labor consisted of running 3 types
of models/algorithms and then comparing the results. Urban Science
invites data-miners to request a trial version of our software and run
it themselves on the PVA database (once it becomes public, as
planned).
------------------------------------------------------------------------
For further information or to comment upon the competition, please
feel free to email (niagrawal@urbanscience.com)
or call Nitin Agrawal
Data Mining Project Manager at Urban Science (+313-259-9900 or
800-321-6900 toll free in the U.S.)
The September issue of the 'Classification Society of North America'
(CSNA) Newsletter, as well as back issues, can be obtained through the
society's web page http://www.pitt.edu/~csna/.
The newsletter contains
information of general interest to anyone working in the field of
clustering and classification. We invite everyone to take a look.
The CSNA is a nonprofit interdisciplinary organization whose purposes
are to promote the scientific study of classification and clustering
(including systematic methods of creating classifications from data),
and to disseminate scientific and educational information related to its
fields of interest. The
CSNA is a member of the International Federation of Classification
Societies (IFCS).
CSNA is highly interdisciplinary with members from mathematics, computer
science, statistics, management, biology, and psychology, as well as
many other disciplines.
--
William D. Shannon, Ph.D.
Assistant Professor of Biostatistics in Medicine
Division of General Medical Sciences
Assistant Professor of Biostatistics
Division of Biostatistics
Washington University School of Medicine
Campus Box 8005, 660 S. Euclid
St. Louis, MO 63110
From: Ronny Kohavi
Subject: Information Week story on data mining
Stories are beginning to come out. Take a look at the Information Week
story 'Data Mining Muscle.' They interviewed several of our clients and
wrote a nice story:
Sept 16, 1998 -- FOR IMMEDIATE RELEASE -- Megaputer Intelligence, USA
Megaputer announced a new release of its award-winning data mining solutions:
PolyAnalyst PRO for Win NT and PolyAnalyst Power for Win 95. The new
comprehensive multi-strategy solutions utilize an additional self-learning
algorithm - PolyNet Predictor - a hybrid between the GMDH and Neural Net
approaches, most efficient when processing large volumes of data. Both
PolyAnalyst PRO and Power feature a similar graphical user interface. In
addition to utilizing enhanced data manipulation, visualization and report
generating capabilities, users of PolyAnalyst PRO/Power take advantage of the
following machine learning algorithms: PolyNet Predictor, Find Laws*, Cluster,
Find Dependencies, Classify, Discriminate, and MLR.
*Available only in PolyAnalyst PRO.
PolyAnalyst PRO and Power provide the following capabilities:
Data Access:
Both systems can directly access data held in Oracle, DB2, Informix, Sybase,
MS SQL Server, or any other ODBC-compliant database. Data and exploration
results can be exchanged with MS Excel 7.0 or 97. New data can be added to the
project when necessary. A customized version of PolyAnalyst PRO or Power comes
merged with the IBM Visual Warehouse or ORACLE Express.
Data Manipulation and Cleansing:
Records can be selected according to multiple criteria. A union, intersection,
or complement of datasets can be created. Exceptional records can be filtered
out. The drill-through feature helps select data points for a new dataset
visually from a chart. Rules, automatically discovered by PolyAnalyst or
entered by the user, can be used to produce new fields. Data can be split into
n-tile percentage intervals for any numerical variable.
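For readers unfamiliar with these operations, here is a rough pandas sketch of the manipulations described; PolyAnalyst's actual interface is graphical, and all names and data below are invented.

```python
# Illustrative data-manipulation operations on two small record sets.
import pandas as pd

a = pd.DataFrame({"id": [1, 2, 3, 4], "income": [20, 55, 90, 130]})
b = pd.DataFrame({"id": [3, 4, 5], "income": [90, 130, 42]})

# Select records according to multiple criteria.
high = a[(a["income"] > 50) & (a["id"] > 1)]

# Union and intersection of datasets (by record id).
union = pd.concat([a, b]).drop_duplicates("id")
inter = a[a["id"].isin(b["id"])]

# Derive a new field from a rule.
a["wealthy"] = a["income"] > 100

# Split a numerical variable into n-tile intervals (here, quartiles).
a["income_quartile"] = pd.qcut(a["income"], q=4, labels=False)
```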
Machine Learning:
PolyAnalyst PRO and Power provide a broad selection of self-learning
algorithms for data analysis. With the new PolyNet Predictor, the system
features
seven unique exploration engines for predicting and modeling. As always, the
statistical significance of the results obtained by each engine of PolyAnalyst
is rigorously checked.
Visualization:
PolyAnalyst has an object-oriented graphical user interface. Data and
exploration results can be visualized in numerous formats: histograms, line
and point plots with zoom and drill-through capabilities, colored charts for
three dimensions, interactive Rule-Graphs with sliders for effective
presentation of multidimensional relations. In addition, there is a special
Frequencies function providing for a quick and thorough visualization of the
distribution of categorical, integer, or yes/no variables.
Results Reporting:
Discovered relations are readily incorporated in existing DSS or EIS systems.
The Print Form feature provides for the generation of an advanced output
including a mixture of text, graphics, and system reports. A project file
contains all the results of the performed data exploration. Created datasets
and summary statistics can be exported to MS Excel.
Hands-on Evaluation:
An evaluation copy of PolyAnalyst supplemented by a series of interactive
lessons in data mining from various application fields is available for
downloading from http://www.megaputer.com
or http://www.megaputer.ru
Platforms: PolyAnalyst Power -- MS Win 95 or NT; PRO -- MS Win NT
Pricing (limited time web promotion):
PolyAnalyst Power: $987 (40% discount off regular price $1,645);
PolyAnalyst PRO: $3,740 (30% discount off regular price $5,340);
===============================================
PolyAnalyst is a complete multi-strategy data mining environment utilizing the
latest achievements in automated knowledge discovery in databases. A broad
selection of exploration engines allows the user to predict values of
continuous variables, explicitly model complex phenomena, determine the most
influential independent variables, and solve classification and clustering
tasks. The ability of PolyAnalyst to present the discovered relations in
explicit symbolic form has no world analogs. An object-oriented design, point-
and-click GUI, versatile data manipulation, visualization, and reporting
capabilities, a minimum of statistics, and a simple interface to various data
storage architectures make PolyAnalyst a very easy-to-use system.
Date: Mon, 21 Sep 1998 14:03:29 -0400
From: Eric King, eric@heuristics.com
Subject: Gordian Institute Course: DATA MINING: PRINCIPLES AND PRACTICE
DATA MINING: PRINCIPLES AND PRACTICE
A broad-brush, intensive introduction to
methods, applications, tools and techniques
offered by
The Gordian Institute
November 4-6, Dallas, Texas
January 27-29, Orlando, Florida
___________________________________________________
WHAT MAKES THIS COURSE UNIQUE?
This course focuses on actual use and implementation of data mining
methods. The instructor will also show how to evaluate tools and
products. Attendees will receive a binder of course slides and notes,
two texts, and a CD full of sample data, evaluation packages and
references to other resources and tools.
Hands-on workshop exercises will reveal impressive results from the
same tool or method that may have failed on other problem categories. The
workshops will save immeasurable time and effort in assessing and
selecting which suite of tools and techniques will perform best for
your application.
WHAT YOU WILL LEARN
- The basic principles of data mining
- The different methods of data mining and how they compare
- How to prepare raw data for data mining
- How to analyze and validate the results
- What questions data mining can answer
- What the pitfalls are and how to avoid them
- What commercial products are available and how to evaluate them
REQUEST FULL COURSE DETAILS
You will quickly receive complete details, including pricing, course
outline, instructor background, site logistics and registration form
through any of the following:
- Email: agent@gordianknot.com
Send an Email message with your request in the subject field:
- DATA MINING COURSE DETAILS
- GORDIAN'S QUARTERLY ELECTRONIC NEWSLETTER
- Toll Free: 800-405-2114
- Direct: 281-364-9882
- Fax: 281-754-4014
- http://www.gordianknot.com