Knowledge Discovery Nuggets Index



Knowledge Discovery Nuggets(tm) 98:21, e-mailed 98-09-30


News:
  • (text) Carol Hamilton, KDD-98 attendance statistics
  • (text) Sal Stolfo, Some 'random' thoughts about KDD99
  • (text) Nitin Agrawal, Urban Science Summary of their KDD-CUP-98 results

    Publications:
  • (text) William Shannon, Classification Society of North America Newsletter
  • (text) Ronny Kohavi, Information Week story on data mining

    Tools/Services:
  • (text) Sergei Ananyan, PolyAnalyst PRO/Power -- new release of a
    leading DM solution

    Courses:
  • (text) Eric King, DATA MINING: PRINCIPLES AND PRACTICE,
    November 4-6, Dallas, Texas; January 27-29, Orlando, Florida
    --
    KDNuggets is an electronic newsletter focusing on the latest news,
    publications, tools, meetings, and other relevant items in the Data
    Mining and Knowledge Discovery field. KDNuggets currently reaches
    over 5500 readers in 70+ countries twice a month.

    Items relevant to data mining and knowledge discovery are welcome
    and should be emailed to gps in ASCII text or HTML format.
    An item should have a subject line that clearly describes
    what it is about to KDNuggets readers.
    Please keep calls for papers and meeting announcements
    short (50 lines or fewer, up to 80 characters per line), and provide a
    web site for details, such as paper submission guidelines.
    All items may be edited for size.

    To subscribe, see http://www.kdnuggets.com/subscribe.html

    Back issues of KDNuggets, a catalog of data mining tools
    ('Siftware'), pointers to data mining companies, relevant websites,
    meetings, etc are available at KDNuggets home at
    http://www.kdnuggets.com/

    -- Gregory Piatetsky-Shapiro (editor)
    gps

    ********************* Official disclaimer ***************************
    All opinions expressed herein are those of the contributors and not
    necessarily of their respective employers (or of KDNuggets)
    *********************************************************************

    ~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Some data miners use statistics as a drunk
    uses a lamp post -- for support, not illumination.
    Peter Huber

    Date: Tuesday, September 29, 1998 1:47 PM
    From: Carol Hamilton [hamilton@aaai.org]
    Subject: KDD-98 attendance statistics

    We have now completed the registration process for KDD-98. It breaks down
    as follows:

    619 Paid Technical Registrants
    53 Complimentary Technical Registrants
    29 Complimentary registrations for Panelists, Invited Speakers and
    Tutorial Speakers
    14 Student Scholars (Comp Registration)
    10 PC Members (Comp Registrations)

    33 Paid Workshop Only Registrants
    68 Exhibit or Demo Personnel (No technical registration)
    ---
    773 Total Attendees

    Carol


    Date: Mon, 14 Sep 1998 14:13:14 -0400
    From: Sal Stolfo sal@cs.columbia.edu
    Subject: Some 'random' thoughts about KDD99

    Greg

    I mentioned to you at KDD98 that I would send this note regarding
    thoughts I have about papers in KDD99, with the intent of starting a
    dialogue to maintain and upgrade the scientific quality of KDD as a
    field, especially now that SIGKDD is a reality.

    By all means, share this with Usama and the other PC chairs (whose email
    addresses I don't yet have).

    a) One of the basic principles of the 'scientific method' is verification
    and repeatability of experiments. In the context of KDD, much of this is
    thwarted by the 'NDA' agreements between researchers/authors and
    participating corporate sponsors (in some cases this also applies to
    government agencies). It is not that every reported experiment ought to
    be repeated for verification, but there should be a preference for
    papers where this is at least possible, or at least addressed in some
    sensible way. In my particular case, I have 1MM credit card records that
    are covered by an NDA...and hence I cannot distribute the data. However,
    I managed to get permission to publish a high level description of the
    schema of the data, and others who wish to use the data are allowed to
    do so at Columbia after signing the NDA. Charles Elkan did this! The
    point is NOT to scare away corporate sponsors or researchers from
    KDD....rather there should be an attempt to at least appear as if the
    scientific method is the preferred standard for KDD publications, and
    every attempt should be made to present specific results in terms that
    are general and clear enough for others to attempt, including perhaps
    schema information, parameter settings, and any other unique information
    that may allow one to infer applicability to other applications.

    b) Related to a) and to also address another issue raised at the
    conference: what is potentially unique about KDD that might elevate it
    as a discipline, and easily differentiate it from other disciplines,
    rather than perhaps just being viewed as the application arm of ML,
    statistics, Db, etc. Ted Senator has strong and valid opinions about
    this with respect to the Innovative Application of AI
    conference....papers MUST state and back up with sufficient evidence
    (including literature search) the true novelty of the paper's
    contributions. This of course can be many things....I do not need to
    enumerate the obvious...but the call for papers should state that
    authors MUST write such a claim and back it up. This will also help
    reviewers properly judge a new contribution, rather than possibly
    inferring it with perhaps incomplete knowledge.
    Hence, authors must be guided to report their results in a manner where
    their goals are clearly stated, and its importance clarified. Yet
    another paper reporting on application X using method Y with accuracy
    results Z is no one's idea of a good paper, unless a particular
    scientific or methodological goal is set forth and clearly demonstrated
    as novel and achieved. For example, application X may be an important
    goal in and of itself, and no one has ever tried it before. Or,
    COST-based results are achieved by method Y not previously matched by
    other competing methods. AND, please, no more ACCURACY COMPARISON papers
    for 'real world' problems, unless ACCURACY is provably equivalent to
    COST.
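Sal's accuracy-versus-cost point can be made concrete with a small hypothetical example (all counts and payoff values below are invented for illustration): with asymmetric payoffs, the more accurate of two classifiers can be the less profitable one.

```python
# Hypothetical mailing-campaign payoffs: each mailed piece costs $0.68,
# each responder donates $15.  All confusion-matrix counts are invented.

def profit(tp, fp, fn, tn, gain=15.0, mail_cost=0.68):
    """Net return when every predicted positive is mailed."""
    return tp * gain - (tp + fp) * mail_cost

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

# Classifier A mails almost no one: high accuracy (responders are rare),
# but it captures few donations.
a = dict(tp=10, fp=40, fn=490, tn=9460)
# Classifier B mails broadly: much lower accuracy, far higher profit.
b = dict(tp=300, fp=3000, fn=200, tn=6500)

print(f"A: accuracy={accuracy(**a):.3f}  profit=${profit(**a):,.2f}")
print(f"B: accuracy={accuracy(**b):.3f}  profit=${profit(**b):,.2f}")
```

Here A wins on accuracy (0.947 vs 0.680) while B wins on profit (about $2,256 vs $116), which is exactly why cost-based evaluation, not raw accuracy, is the right yardstick for such problems.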

    c) On the topic of KDD novelty as a field. Foster had an interesting
    observation that the field ought to report upon truly novel pieces of
    knowledge gleaned from some KDD process. (It is perhaps true that truly
    novel learning algorithms will first be submitted and published in the
    older established conferences.....maybe in time KDD will be viewed as the
    place to publish this type of work...) 'We should focus on the 'KD' part
    of KDD' is what I think I heard him say. Not a bad idea, unless the
    true machine generated pieces of knowledge are so esoteric as to make
    them uninteresting except for the authors (or their corporate sponsors
    who would be unlikely to allow its disclosure anyway). One of the best
    KDD papers I saw some years ago was not at all devoted to any novel
    methods, nor to a novel application. But the results achieved using
    rather simple methods produced a very interesting new piece of medical
    knowledge that anyone can understand. Some Doctor (I think in Chicago
    working with some computer scientist) processed two separate bodies of
    medical literature (generated by two sub-specialities with no
    discernible professional connection) using simple keyword extraction and
    modeling techniques. They 'intersected' the two bodies of literature to
    find some 'causal link' and they indeed did achieve their goal. I am not
    a doctor so don't recall the specifics, other than some causal link
    between the onset of Alzheimer's disease and some biochemical
    abnormality. This was really neat in demonstrating a new useful piece of
    knowledge by a rather simple method applied to two independent online
    sources. The intrinsic value of the new knowledge is clear. The methods
    used were not...but do point to new ways of attempting KDD to
    'generate links' between sources, with a call for new and improved
    methods that might be needed.

    d) The KDD Cup as well was somewhat problematic this year. Getting a
    sufficiently stressful data set was hard to do. The results of the study
    are perhaps important to the corporate participants (who now have a new
    marketing sound bite), but it is not very clear what intrinsic value
    the results might have for the researchers (without perhaps breaking
    into the trade secrets of the corporate systems).

    Well, perhaps this suggestion might be useful to achieve a few
    concurrent goals for KDD:

    The Chief Statistician of the Federal Government, Katherine Wallman,
    some years ago directed the US Government statistical agencies to cross
    link on the web. She created what is known as http://www.fedstats.gov.
    Fedstats is a wonderful new national resource. It is an index into 70
    (yes, SEVENTY) statistical agencies that gather and present data on
    every conceivable topic. Some of the data is accessible via browser
    scraping, others have query processors, others have forms/applications
    to fill out to get data sent to you. It is a somewhat chaotic mess at
    the moment, but will improve in time. The point is that fedstats is a
    treasure-trove of data sources about many topics. Visit it yourself to
    see. The government wishes to make all public data accessible to the
    public, and needs to provide analysis tools for the public to analyze
    this data. This activity is done by the staffs of policy makers, private
    companies who package analyses for resale, by public-interest groups,
    political organizations, students in classrooms, social/political
    scientists, researchers in a variety of disciplines, etc. etc. KDD can
    make a significant impact broadly across many disciplines if attention
    were turned towards these sources, and relationships built between
    specialists and KDD data miners at large...and the KDD cup may have an
    easier time of finding suitable sources of a public nature!
    Repeatability of experiments is possible. New knowledge gleaned from
    public sources can be reported, and the heterogeneous and large-scale
    aspects of the sources provide a rich suite of perplexing problems for
    KDD researchers to chew on...with the potential of real 'public good'.

    Sal Stolfo


    Date: Wed, 23 Sep 1998 14:36:09 -0400
    From: Nitin Agrawal, niagrawal@URBANSCIENCE.com
    Subject: Urban Science Summary of their KDD-CUP-98 results
    Web: http://www.kdnuggets.com/kdd98/

    [Since there was not enough time at the KDD-98 conference to describe their
    results, I have offered to include their statement in KDnuggets.
    Silver (SAS) and Bronze (Quadstone) medalists were also invited to
    describe their results. For more details on KDD-CUP-98 see
    http://www.kdnuggets.com/kdd98/
    -- GPS]

    Urban Science wins the KDD-98 Cup (A second straight victory for GainSmarts)

    Background

    GainSmarts is a premier data-mining tool that provides solutions to
    database marketers, analysts, and statisticians. GainSmarts
    has been developed by Drs. Jacob Zahavi and Nissan Levin of Tel Aviv
    University and Urban Science. GainSmarts is a fully automated tool
    consisting of several suites. These suites cover all aspects of
    data-mining, ranging from data import, sampling, data cleaning,
    preprocessing, automatic transformations, feature selection, model
    building, cross-validation, scoring and reporting. For further
    information on GainSmarts visit our web page at
    http://www.urbanscience.com and select GainSmarts.

    Algorithm/Model

    The competition for the KDD-98 cup was based upon actual data donated
    by The Paralyzed Veterans of America (PVA). Each record in the
    training PVA dataset represented a previously lapsed donor and
    included their response to a recent mailing campaign, including the
    donation amount (if applicable). The competitors were asked to
    calibrate a model using their data-mining tool to predict the donation
    amount. The competitors were evaluated based upon maximizing the net
    donations for the campaign (total donations minus contact
    costs). GainSmarts applied a two-stage regression model (similar to
    Heckman's model) to predict the donation amount. The first step of the
    two-stage model is a classification model (we used Logistic
    Regression) applied to all prospects, where each prospect is assigned
    a probability of donation. The second step is an estimation model (we
    used Linear Regression) applied to the responding donors. This second
    model produces a conditional donation amount. The product of the
    probability of donation (from step 1) and the conditional donation
    amount (from step 2) produces an unconditional prediction of donation
    amount.
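The combination step of the two-stage model can be sketched in a few lines (a hypothetical illustration of the arithmetic only; the actual GainSmarts models and their fitted coefficients are not public, so the probabilities and amounts below are invented):

```python
MAIL_COST = 0.68  # assumed per-piece contact cost for the campaign

def expected_donation(p_donate, amount_if_donates):
    """Unconditional prediction: stage-1 probability of donation times
    the stage-2 conditional donation amount."""
    return p_donate * amount_if_donates

def should_mail(p_donate, amount_if_donates, cost=MAIL_COST):
    """Mail a prospect only when the expected donation beats the cost."""
    return expected_donation(p_donate, amount_if_donates) > cost

# Three invented prospects: (stage-1 probability, stage-2 amount)
for p, amt in [(0.05, 20.0), (0.02, 25.0), (0.10, 5.0)]:
    print(f"p={p:.2f}  amt=${amt:5.2f}  "
          f"expected=${expected_donation(p, amt):.2f}  "
          f"mail={should_mail(p, amt)}")
```

In the real entry, stage 1 would be a logistic regression fitted over all prospects and stage 2 a linear regression fitted over responders only; here both are replaced by hand-set numbers so that only the combination arithmetic is shown.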


    Modeling Process

    1. Split the dataset into train (calibration) and test (validation) sets.

    2. Explode raw variables into predictors using transformations. A
    variable such as AGE can be used to create four binary categorical
    variables based upon the distribution of AGE by quartile. Several
    transformations are created for each variable. For example, AGE can
    also be transformed into: Chi-Square categories, a LOG transform,
    and a Piece-Wise Linear transform. Each type of transformation of an
    individual variable is referred to as a set of
    predictors. GainSmarts arranges these predictors hierarchically and
    then tests each set to determine the 'best' transformation to
    represent the variable in the subsequent modeling processes.


                   --------------------------------------
               --- | Piece-wise Linear transform of AGE |
              |    --------------------------------------
     -------  |    ------------------------------------
     | AGE | -+--- | Chi-square categorization of AGE |
     -------  |    ------------------------------------
              |    ------------------------
               --- | LOG transform of AGE |
                   ------------------------


    3. Univariate analysis by individual predictor.

    4. Correlation analysis by predictor (within the hierarchy) to
    eliminate highly correlated predictors.

    5. GainSmarts selects the best available representation for each
    attribute using an expert system (rule-based) approach, thereby
    selecting either AGE by QUARTILES, or the Piece-Wise Linear
    transform of AGE, or ...etc.

    6. Select the best set of attributes using a stepwise methodology.

    7. Correlation analysis across all remaining attributes to remove
    highly correlated attributes.

    8. Select the final set of predictors in the model, using a rule-based
    mechanism, to eliminate overfitting. This is achieved by limiting
    the number of coefficients (or weights), proper setting of
    parameters, and introducing/eliminating entire representations of
    variables.

    9. Parameter estimation and calibration.

    10. Cross-validation and output generation (to EXCEL).

    11. Model scoring (or code generation).

    Note: Steps 2-10 were repeated for both stages of the modeling
    process. Therefore, each stage of the modeling process could contain
    its own unique variables with unique transformations.
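Step 2's 'explosion' of a raw variable into candidate predictor sets can be sketched as follows (a hypothetical re-creation for illustration; the actual GainSmarts transformation and hierarchy logic is proprietary):

```python
import math

def quartile_dummies(values):
    """Explode a numeric variable into four binary quartile indicators --
    one candidate predictor set in the step-2 hierarchy."""
    s = sorted(values)
    n = len(s)
    q1, q2, q3 = s[n // 4], s[n // 2], s[3 * n // 4]  # crude cut points
    return [(int(x <= q1),
             int(q1 < x <= q2),
             int(q2 < x <= q3),
             int(x > q3)) for x in values]

def log_transform(values):
    """A second candidate predictor set for the same variable."""
    return [math.log(x) for x in values]

ages = [22, 31, 40, 47, 55, 63, 71, 80]
print(quartile_dummies(ages))
print([round(v, 2) for v in log_transform(ages)])
```

Each transformation yields one 'set of predictors'; the hierarchy steps (3-5) would then score these sets against each other and keep only the best representation of AGE for the later modeling stages.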

    Results
    ------------------------------------------------------------
                                    Projected      Actual as
                                    from TEST      reported by KDD
                                    file           Cup Committee
    ------------------------------------------------------------
    GainSmarts Net Donation           $14,844        $14,712
    Net Donation if the entire        $10,500        $10,560
      file is mailed
    Increase in Net Donation           $4,344         $4,152
    % Increase                         41.37%         39.32%
    ------------------------------------------------------------

    A comparison between the projected and actual results (less than 1%
    error) indicates that the model developed was very robust and
    reliable.


    Conclusion

    Urban Science attributes our KDD cup successes to our feature
    selection expert system. This expert system includes (implicitly) the
    many years of experience of Drs. Zahavi and Levin in developing models
    and data mining systems. GainSmarts also practically automates the
    entire modeling process. The manual labor consisted of running 3 types
    of models/algorithms and then comparing the results. Urban Science
    invites data-miners to request a trial version of our software and run
    it themselves on the PVA database (once it becomes public, as
    planned).

    ------------------------------------------------------------------------
    For further information or to comment upon the competition, please
    feel free to email (niagrawal@urbanscience.com) or call Nitin Agrawal,
    Data Mining Project Manager at Urban Science (+313-259-9900 or
    800-321-6900 toll free in the U.S.)



    Date: Fri, 18 Sep 1998 11:19:05 -0500
    From: William Shannon shannon@osler.wustl.edu
    Subject: Classification Society of North America Sept. Newsletter
    Web: http://www.pitt.edu/~csna/

    The September issue of the 'Classification Society of North America'
    (CSNA) Newsletter, as well as back issues, can be obtained through the
    society's web page http://www.pitt.edu/~csna/. The newsletter contains
    information of general interest to anyone working in the field of
    clustering and classification. We invite everyone to take a look.

    The CSNA is a nonprofit interdisciplinary organization whose purposes
    are to promote the scientific study of classification and clustering
    (including systematic methods of creating classifications from data),
    and to disseminate scientific and educational information related to its
    fields of interest. The
    CSNA is a member of the International Federation of Classification
    Societies (IFCS).

    CSNA is highly interdisciplinary with members from mathematics, computer
    science, statistics, management, biology, and psychology, as well as
    many other disciplines.

    --
    William D. Shannon, Ph.D.
    Assistant Professor of Biostatistics in Medicine
    Division of General Medical Sciences

    Assistant Professor of Biostatistics
    Division of Biostatistics

    Washington University School of Medicine
    Campus Box 8005, 660 S. Euclid
    St. Louis, MO 63110

    Phone: 314-454-8356
    Fax: 314-454-5113
    e-mail: shannon@osler.wustl.edu
    web page: http://osler.wustl.edu/~shannon


    Date: Tue, 22 Sep 1998 10:46:23 -0700 (PDT)
    From: Ronny Kohavi ronnyk@starry.engr.sgi.com
    Subject: Information Week story on data mining
    Web: http://www.informationweek.com/695/95iudat.htm

    Stories are beginning to come out. Take a look at Information Week,
    Data Mining Muscle. They interviewed several of our clients and
    wrote a nice story:

    http://www.informationweek.com/695/95iudat.htm


    Date: Wed, 16 Sep 1998 12:11:42 EDT
    From: Sergei Ananyan, Megaputers@aol.com
    Subject: PolyAnalyst PRO/Power -- new release of a leading DM solution
    Web: http://www.megaputer.com

    Sept 16, 1998 -- FOR IMMEDIATE RELEASE -- Megaputer Intelligence, USA

    Megaputer announced a new release of its award-winning data mining solutions:
    PolyAnalyst PRO for Win NT and PolyAnalyst Power for Win 95. The new
    comprehensive multi-strategy solutions utilize an additional self-learning
    algorithm - PolyNet Predictor - a hybrid between the GMDH and Neural Net
    approaches, most efficient when processing large volumes of data. Both
    PolyAnalyst PRO and Power feature a similar graphical user interface. In
    addition to utilizing enhanced data manipulation, visualization and report
    generating capabilities, users of PolyAnalyst PRO/Power take advantage of the
    following machine learning algorithms: PolyNet Predictor, Find Laws*, Cluster,
    Find Dependencies, Classify, Discriminate, and MLR.
    *Available only in PolyAnalyst PRO.

    PolyAnalyst PRO and Power provide the following capabilities:

    Data Access:
    Both systems can directly access data held in Oracle, DB2, Informix, Sybase,
    MS SQL Server, or any other ODBC-compliant database. Data and exploration
    results can be exchanged with MS Excel 7.0 or 97. New data can be added to the
    project when necessary. A customized version of PolyAnalyst PRO or Power comes
    merged with the IBM Visual Warehouse or ORACLE Express.

    Data Manipulation and Cleansing:
    Records can be selected according to multiple criteria. A union, intersection,
    or complement of datasets can be created. Exceptional records can be filtered
    out. A drill-through feature helps select data points for a new dataset
    visually from a chart. Rules, automatically discovered by PolyAnalyst or
    entered by the user, can be used to produce new fields. Data can be split into
    n-tile percentage intervals for any numerical variable.

    Machine Learning:
    PolyAnalyst PRO and Power provide a broad selection of self-learning
    algorithms for data analysis. With a new PolyNet Predictor the system features
    seven unique exploration engines for predicting and modeling. As always, the
    statistical significance of the results obtained by each engine of PolyAnalyst
    is rigorously checked.

    Visualization:
    PolyAnalyst has an object-oriented graphical user interface. Data and
    exploration results can be visualized in numerous formats: histograms, line
    and point plots with zoom and drill-through capabilities, colored charts for
    three dimensions, interactive Rule-Graphs with sliders for effective
    presentation of multidimensional relations. In addition, there is a special
    Frequencies function providing for a quick and thorough visualization of the
    distribution of categorical, integer, or yes/no variables.

    Results Reporting:
    Discovered relations are readily incorporated in existing DSS or EIS systems.
    The Print Form feature provides for the generation of an advanced output
    including a mixture of text, graphics, and system reports. A project file
    contains all the results of the performed data exploration. Created datasets
    and summary statistics can be exported to MS Excel.

    Hands-on Evaluation:
    An evaluation copy of PolyAnalyst, supplemented by a series of interactive
    lessons in data mining from various application fields, is available for
    downloading from
    http://www.megaputer.com or http://www.megaputer.ru

    Platforms: PolyAnalyst Power -- MS Win 95 or NT; PRO -- MS Win NT
    Pricing (limited time web promotion):
    PolyAnalyst Power: $987 (40% discount off regular price $1,645);
    PolyAnalyst PRO: $3,740 (30% discount off regular price $5,340);
    ===============================================
    PolyAnalyst is a complete multi-strategy data mining environment utilizing the
    latest achievements in the automated knowledge discovery in databases. A broad
    selection of exploration engines allows the user to predict values of
    continuous variables, explicitly model complex phenomena, determine the most
    influential independent variables, and solve classification and clustering
    tasks. The ability of PolyAnalyst to present the discovered relations in
    explicit symbolic form is unique worldwide. An object-oriented design, point-
    and-click GUI, versatile data manipulation, visualization, and reporting
    capabilities, minimum of statistics, and a simple interface to various data
    storage architectures make PolyAnalyst a very easy-to-use system.


    Date: Mon, 21 Sep 1998 14:03:29 -0400
    From: Eric King, eric@heuristics.com
    Subject: Gordian Institute Course: DATA MINING: PRINCIPLES AND PRACTICE

    DATA MINING: PRINCIPLES AND PRACTICE
    A broad-brush, intensive introduction to
    methods, applications, tools and techniques
    offered by
    The Gordian Institute

    November 4-6, Dallas, Texas
    January 27-29, Orlando, Florida
    ___________________________________________________

    WHAT MAKES THIS COURSE UNIQUE?
    This course focuses on actual use and implementation of data mining
    methods. The instructor will also show how to evaluate tools and
    products. Attendees will receive a binder of course slides and notes,
    two texts, and a CD full of sample data, evaluation packages and
    references to other resources and tools.

    Hands-on workshop exercises will show how a tool or method that
    produces impressive results in one problem category may fail in another. The
    workshops will save immeasurable time and effort in assessing and
    selecting which suite of tools and techniques will perform best for
    your application.

    WHAT YOU WILL LEARN
    - The basic principles of data mining
    - The different methods of data mining and how they compare
    - How to prepare raw data for data mining
    - How to analyze and validate the results
    - What questions data mining can answer
    - What are the pitfalls and how to avoid them
    - What commercial products are available and how to evaluate them

    REQUEST FULL COURSE DETAILS
    You will quickly receive complete details, including pricing, course
    outline, instructor background, site logistics and registration form
    through any of the following:

    - Email: agent@gordianknot.com
    Send an Email message with your request in the subject field:
    - DATA MINING COURSE DETAILS
    - GORDIAN'S QUARTERLY ELECTRONIC NEWSLETTER
    - Toll Free: 800-405-2114
    - Direct: 281-364-9882
    - Fax: 281-754-4014
    - http://www.gordianknot.com
