Data Mining and Knowledge Discovery Nuggets 96:17, e-mailed 96-05-29

Contents:
News:
* G. Grinstein, Information Exploration Shootout
http://www.cs.uml.edu/shootout/
* GPS, May 1996 LAN Magazine Cover Story on Data Mining:
http://www.lanmag.com/
* J. Gallman, KDD applied to a corrective action data base
Publications:
* C. Brodley, preprint available: 'Applying classification
algorithms in practice', http://yake.ecn.purdue.edu/~brodley/
* A. Pryke, Searchable Bibliography of Online KDD papers,
http://www.cs.bham.ac.uk/~anp/TheDataMine.html
Siftware:
* M. Kiselev, PolyAnalyst v2.01 data mining system,
http://mosca.sai.msu.su/~mp/megapute.html
* C. Turnquist, New Knowledge Discovery Company www.kd1.com

--
Nuggets is a newsletter for the Data Mining and Knowledge Discovery community,
focusing on the latest research and applications.

Contributions are most welcome and should be emailed,
with a DESCRIPTIVE subject line (and a URL, when available) to (kdd@gte.com).
E-mail add/delete requests to (kdd-request@gte.com).

Nuggets frequency is approximately weekly.
Back issues of Nuggets, a catalog of S*i*ftware (data mining tools),
and a wealth of other information on Data Mining and Knowledge Discovery
are available at the Knowledge Discovery Mine site, URL http://info.gte.com/~kdd.

-- Gregory Piatetsky-Shapiro (moderator)

********************* Official disclaimer ***********************************
* All opinions expressed herein are those of the writers (or the moderator) *
* and not necessarily of their respective employers (or GTE Laboratories) *
*****************************************************************************

~~~~~~~~~~~~ Quotable Quote ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
' Yet all the right tools don't necessarily add to a successful data
warehouse. Particularly if empowerment-minded end-users are not brought into
the process early enough or end up with another form of data bondage:
Ill-fitting and ill-conceived warehouses foisted on [end-users]
by copycat top managers who read about data warehouses in the ... press.
I guess that's when a data warehouse truly becomes a data jail'.
Alan Alper, in ComputerWorld Client-Server Journal, April 1996.


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 24 May 1996 13:33:12 -0400
Subject: Information Exploration Shootout http://www.cs.uml.edu/shootout
From: ggg@mail11.mitre.org (Georges G. Grinstein)

Information Exploration Shootout (a.k.a. Information Exploration Benchmarks)

Over the past year many users have requested more serious comparative
evaluations of the various data exploration techniques: analysis, knowledge
discovery and data mining, statistics and grand tours, database tools,
visualization, or combinations thereof.

We all recognize that mining for information and knowledge from large
databases and documents will have the next fundamental impact on database
systems, knowledge discovery, and visualization. This is considered an
important area for major cost savings and potential revenue, and it has
immediate applications in decision systems, intelligence, information
management, business, and communication, in the form of both on-line services
and the World Wide Web. Data mining now draws from fields including
databases, statistics, information technology, data visualization, and
artificial intelligence, especially machine learning and knowledge-based
systems. There is a clear sense that, to achieve the next increase in
knowledge exploitation, individual data exploration approaches must work
together.

There have been promising developments. In 1995 a 'shootout' was held for
the statistical community. The knowledge discovery in databases (KDD)
community has meanwhile made numerous data sets publicly available for timing
'benchmarks'. There has not, however, been any comparative evaluation of
techniques across domains, and definitely none permitting hybrid approaches.

How does one discover information and knowledge in datasets (e.g., databases,
archives, document collections, television news reports, the Web)? What
process do analysts and other data explorers use in discovering non-trivial
patterns? How do, or should, knowledge discovery, statistics, and
visualization work together to support the human exploration process? What
are the procedures for using visualization and analytic agents, in
context with the human operator, to achieve timely, computationally
responsive discoveries in data?

There is now a plethora of techniques to explore data. They range from
purely statistical approaches to neural networks, machine learning, and
knowledge discovery as batch processes. Integrated approaches use applied
perception (e.g., glyphs) with interactive grand tours, and purely geometric
systems such as parallel coordinates that, integrating little mathematics,
rely more on human participation. Which techniques are better? Which work
on what kind of data sets? Are certain combinations
better? The questions abound.

Several datasets have been identified and selected to be made publicly
available for exploration and discovery. The first dataset to be released
consists of human-generated network intrusion attempts and a baseline dataset
with no intrusions. There were four intrusions over a period of time, and these
have been tracked in separate datasets. Information explorers are to discover
these intrusions.

The second dataset to be released shortly thereafter will consist of
newspaper data set up as a collection of web pages.


Go to http://www.cs.uml.edu/shootout/ for further information and details.

The results will be reviewed by a group that includes Georges Grinstein
(UMass Lowell and the MITRE Corporation), Gregory Piatetsky-Shapiro (GTE) and
Graham Wills (AT&T). The panelists and domain experts will discuss the data
sets and the reporting and selection mechanism, and will present their
preliminary analysis of the results at various conferences, including
KDD'96, the 9th Annual DAMA-NCR Data Management Symposium, and perhaps the
IEEE Visualization'96 Conference. Participants will receive credit,
as well as copies of the results and various documents.


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Tue, 28 May 1996 09:13:16 -0400
From: Gregory Piatetsky-Shapiro (gps0@gte.com)
Subject: Unearthing Underground Data

A recent issue of LAN Magazine (devoted to networking -- see http://www.lanmag.com)
featured a cover story on data mining by Cheryl D. Krivda.
See http://www.lanmag.com/9605mine.htm for the story (also reproduced here).

-- GPS

Cover Story

by Cheryl D. Krivda


Unearthing Underground Data


Will it yield the information motherlode of the millennium? Nobody knows for sure, but data mining will likely have many network users digging in their organization's databases for buried treasure.



Perhaps you were watching a ball game or an episode of Friends or 60 Minutes recently and they caught your eye: those subtitled, perfectly coiffed runway models discussing the benefits of data mining. As part of IBM's 'Solutions for a Small Planet' advertising campaign, the models in that commercial make data mining sound as easy as running a spreadsheet and as commonplace as white bread.



If you have stumbled upon this commercial, you probably wondered whether your site could benefit from data mining. You were probably also curious as to which data needs mining, and why. If so, you are not alone.



Data mining deserves an award for generating the most powerful industry buzz since artificial intelligence or computer-aided software engineering (CASE). The difference is that the two previous trends were embraced primarily by small vendors; companies such as IBM and AT&T are holding the reins of the horses that are pulling the data mining bandwagon.



But even though the industry giants are embracing it, data mining is still no more than an early market. It is so immature, in fact, that vendors and industry analysts have not yet agreed on a definition for data mining, much less whether or not a given tool or database application provides data mining capabilities. This information may come as a surprise if you've been reading glowing industry trade press reports that describe how businesses can use data mining to improve profitability by better understanding their customers.



In the most successful implementations, powerful parallel processors and applications based on sophisticated algorithms search for previously unrecognized patterns in a company's data. Once discovered, such patterns enable businesses to quickly anticipate a buyer's needs, resolve client dissatisfaction, and otherwise keep their customers happy. However, the public relations teams scripting these press reports artfully ignore the realities of issues such as network design and system storage. And, what some vendors promote as data mining may be something else altogether, depending on your view of the technology. In other words, data mining is not yet spreadsheet simple.



Thoroughly confused? Read on.



DATA MINING DEFINED



Data mining is performed by the most sophisticated class of tools in a discipline known as decision support. The simplest tools include executive information systems that can generate a limited number of high-level reports, says Michael Saylor, president and CEO of software vendor MicroStrategy (Vienna, VA).



More complex are structured decision support systems. These systems allow users to select parameters when generating larger numbers of reports. Ad hoc decision support systems offer users the ability to build reports on the fly using arbitrary filters.



The most complex tools-those used for data mining-enable users to produce exception or scanning reports. These systems are notable because, unlike users of other decision support systems, data mining users launch the search process in order to seek answers they don't already have.



The greatest data mining success stories involve businesses, often retailers, that use this process to expose previously unrecognized patterns in their customers' buying behaviors. For example, credit card companies, upon spotting clients who bought swimsuits and signed up for scuba lessons at the YMCA, have then sent them discount coupons for a Caribbean cruise. Grocery chains have analyzed customers' baskets of purchases and learned that cosmetics buyers typically also purchase greeting cards. They've subsequently increased sales in both product categories by redesigning store layouts to ensure that the two product lines were positioned in the same aisle.
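To make the basket-analysis idea concrete, here is a minimal Python sketch of the kind of co-occurrence counting involved, assuming a toy list of shopping baskets (the items and thresholds are invented for illustration, not taken from any vendor's product):

    from itertools import combinations
    from collections import Counter

    # Toy shopping baskets; in practice these come from point-of-sale records.
    baskets = [
        {"cosmetics", "greeting card", "shampoo"},
        {"cosmetics", "greeting card"},
        {"bread", "milk"},
        {"cosmetics", "greeting card", "bread"},
    ]

    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        item_counts.update(basket)
        pair_counts.update(combinations(sorted(basket), 2))

    # Report item pairs that co-occur often enough to be worth acting on.
    for (a, b), n in pair_counts.items():
        support = n / len(baskets)          # fraction of all baskets with both items
        confidence = n / item_counts[a]     # fraction of baskets with 'a' that also have 'b'
        if support >= 0.5 and confidence >= 0.75:
            print(f"{a} -> {b}  support={support:.2f} confidence={confidence:.2f}")

On this toy data the sketch reports the cosmetics/greeting-card association; real association-discovery systems do the same counting at much larger scale and with smarter pruning.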



Each of these information 'nuggets' is used by the 'miner' to improve profits, enhance customer service, and ultimately achieve a competitive advantage. Some industry vendors maintain that every business has, in its operational data, at least one nugget that will justify the cost of what can be an expensive data mining system.



Early data mining successes spawned a surge in market demand for applications and tools. Five years ago, the data mining arena held no more than 10 vendors. Today, that market boasts 50 or more small to midsized companies, not including the industry giants.



However, most of these companies are marketing tools that analysts regard as faux data mining tools. Instead of searching for unrecognized data patterns, many of these solutions summarize the operational data in new ways and then allow users to submit queries using sophisticated tools. Because the main function of these verification-based systems is to check statistics and allow the analyst to create hypotheses to prove or disprove, analysts consider them merely the newest generation of sophisticated decision support tools, not true data mining tools.



''Data mining' is a term that's been misused by vendors and users alike,' says Bruce Love, research director for industry analysis firm the Gartner Group (Stamford, CT). 'It is a fairly narrow process that is, by definition, discovery,' not verification of data patterns, he says. The Gartner Group defines data mining as follows:



'Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories and by using pattern recognition technologies as well as statistical and mathematical techniques.'



Discovery-driven data mining can include systems that perform predictive modeling, clustering systems, association discovery systems, and deviation detection systems. (See Figure 1 for the Gartner Group's rendition of the juxtaposition of data mining tools and database vendors.)



[Figure 1: the Gartner Group's positioning of data mining tools and database vendors -- not reproduced here]



Discovery-based data mining is currently being used only 'in the rarefied regions of the most sophisticated user organizations,' says Robert Moran, director of decision support research for the Aberdeen Group (Boston). 'We're talking about mighty sophisticated technology' that presently resides in the realm of analytical people, not the average business user, according to Moran.



'We're in the 'gee-whiz' phase right now,' agrees Brian McGill, development manager for one of Kenan Systems' (Cambridge, MA) data mining software product lines. 'Most people aren't too serious about pursuing [genuine] data mining solutions.'



Even if they're not yet interested in discovery-based data mining solutions, many LAN users do want to mine their data to find the information nugget with maximum bottom-line impact. Vendors are drawing the interest of prospective clients by introducing a host of data mining-style products and solutions. Before searching the product shelves for data mining offerings, however, LAN managers need to consider their hardware platform and their network design.



DESIGN TO MINE


A processing-intensive data mining system has no tolerance for design flaws. Even slight redundancies and inefficiencies become bottlenecks when multigigabyte- or terabyte-sized data repositories are queried by complex multidimensional requests. 'The value of the architecture increases nonlinearly with the size of the data set,' says MicroStrategy's Saylor, not unlike the flaw that is inconsequential to a one-story building, but that could cause a 50-story high-rise to collapse. 'Brute force is seldom, if ever, the solution with these sorts of systems.'



The keystone of successful data mining systems is parallel processing. Because each data mining request pulls data from various storage repositories, processes it using I/O devices, and performs iterative sorts and merges, serial processors return query responses only after days or weeks, if at all. However, parallel processors divide the request into bite-sized chunks and then distribute them among multiple CPUs, which retrieve the information in parallel.
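A rough sketch of that divide-and-scan idea, reduced to a toy Python example (the chunk size, worker count, and filter predicate are invented; a real system would partition data across disks and processors):

    from multiprocessing import Pool

    def scan_chunk(chunk):
        # Each worker scans its own slice of the data independently.
        return [row for row in chunk if row["amount"] > 2400]

    if __name__ == "__main__":
        rows = [{"id": i, "amount": (i * 7) % 2500} for i in range(100_000)]
        chunks = [rows[i:i + 10_000] for i in range(0, len(rows), 10_000)]
        with Pool(processes=4) as pool:
            parts = pool.map(scan_chunk, chunks)   # chunks are scanned in parallel
        hits = [row for part in parts for row in part]
        print(len(hits), "matching rows")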



The introduction of affordable parallel processing systems has enabled data mining vendors to expand the complexity of their software. Traditional decision support applications were often limited to two-dimensional queries, for example, 'show me sales in November [activity and time frame].' With parallel processing, many data mining applications pose multidimensional queries: 'Show me sales of seasonal merchandise in the Northeast region in November [activity, inventory, geography, and time frame].'
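The difference between the two kinds of query can be seen in a toy Python filter over invented sales records (the field names and figures are illustrative only):

    sales = [
        {"region": "Northeast", "month": "Nov", "category": "seasonal", "amount": 120.0},
        {"region": "Northeast", "month": "Nov", "category": "staple",   "amount": 80.0},
        {"region": "West",      "month": "Nov", "category": "seasonal", "amount": 45.0},
    ]

    # Two dimensions: activity (sales) and time frame (November).
    nov_sales = sum(s["amount"] for s in sales if s["month"] == "Nov")

    # Four dimensions: activity, inventory, geography, and time frame.
    ne_seasonal_nov = sum(
        s["amount"] for s in sales
        if s["month"] == "Nov"
        and s["region"] == "Northeast"
        and s["category"] == "seasonal"
    )
    print(nov_sales, ne_seasonal_nov)   # 245.0 120.0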



Once parallelism is accepted, network designers must consider the volume of data that will traverse the network. If users will be permitted to submit complex queries or to search large portions of the corporate database, larger servers and more powerful processors will probably be needed. For example, a catalog-based retailer initiating a new mail campaign may search as much as 50GB of data that relates to its 25 million customers. On an inadequately designed system, such a search could kill system throughput.



Beyond the volume of data is the issue of the number of users that may concurrently run checks against the data. Evangelos Simoudis, director of data mining solutions for IBM, cites the example of a banking organization using data mining in its research and strategy department. As many as 15 users might execute one type of operation, with some of them looking at as much as 30GB of data at a time. 'That's a lot of movement,' Simoudis says.



Savvy network designers generally structure data mining networks in one of two ways: They either use a large central server to store data and access the required data sectors for processing only, or they break up the components from the data warehouse and transfer relevant pieces into a smaller data mart. A typical configuration allows users to generate requests from a workstation; the requests are then sent across the LAN (usually in some form of SQL query) to a superserver, which performs the analysis and returns a response.



Most network designers divvy up the data. Some use specialized, smaller data warehouses, called data marts, or even smaller data mines. These smaller repositories can segregate data by corporate function (for example, inventory data) or by customer name (for example, customers whose names begin with the letters A through F). They allow data mining operations to take place on smaller volumes of data, saving processing time and effort.


At some sites, data marts that were initially designed to be mined are instead being segregated by business function. Each corporate group has a separate data repository that is provided by the central data warehouse or even the larger data marts. Breaking up the data in this way can improve data mining performance and allow LAN managers to supply the repository with the information needed by specific applications. A field that was not originally included in the data warehouse, for example, can be included in the data mart so information can be tracked at the departmental level, says Kamran Parsaye, Ph.D., president and CEO of Information Discovery (Hermosa Beach, CA), a data mining vendor and consultancy. 'I see [data marts] working well in LANs,' he says.



Some companies make smaller data marts available to certain applications while continuing to offer access to other corporate data. Others build a series of small data marts off the central data warehouse, limiting access according to user. If the data warehouse is well built, says Donna Prlich, market development manager for Sun Microsystems (Mountain View, CA), 'There is no reason why users of a small LAN couldn't go after all of the data as needed.'



According to Simoudis, IBM's data mining consulting business (which focuses on discovery-driven data mining) recommends a classic, three-tier logical architecture to support data mining applications. In this design, the client launches applications and graphically presents data mining results. The application server, the heart of the data mining application, stores business- or application-specific information and processes the data using specially designed tools. The data server stores the operational or summarized data.
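Reduced to a toy Python sketch, the three tiers might divide the work roughly as follows (the function names and the summary logic are invented for illustration, not IBM's design):

    def data_server(min_purchases):
        # Tier 3: holds the operational or summarized data.
        table = [{"customer": c, "purchases": c % 7} for c in range(1, 101)]
        return [row for row in table if row["purchases"] >= min_purchases]

    def application_server(segment):
        # Tier 2: applies the business-specific mining step to data from tier 3.
        rows = data_server(min_purchases=5)
        return {"segment": segment, "size": len(rows)}

    def client(segment):
        # Tier 1: launches the application and presents the small result set.
        result = application_server(segment)
        print(f"Segment '{result['segment']}': {result['size']} customers")

    client("frequent buyers")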



The Gartner Group's Love maintains that an appropriately designed, three-tier architecture solves most throughput problems for LANs that handle data mining. 'A server working through the actual mining process and producing smallish data sets will keep your LAN traffic reduced to relatively small and infrequent messages and allow your desktop and local server to do the things they are designed to do,' he says. Conversely, 'if you bring down some huge application and load it on a small server with 30 users, it's going to die. There's no way you're going to get the performance you want,' he says.



MAKE ROOM FOR DATA



Although proper system design is central to maintaining optimum data mining performance, LAN managers must also consider other issues regarding the management of the congestion that such a system will generate. Proper placement and storage of the data vary according to site and system needs, but a solid plan is essential for efficient data management.



Determining whether to store the data in the data warehouse, a data mart, or on remote storage media can be a difficult decision. Most network managers make several attempts to achieve the proper balance before hitting the right combination, says Jim Ashbrook, president of Sunnyvale, CA-based Prism Solutions, a data warehouse software vendor.



Over time, network managers can measure the usage of certain data, moving various data sets to appropriate locations. Less frequently used data can be moved to less available storage mechanisms, while commonly used data can be stored in a more central location.



Network managers often find themselves striking a balance between accessibility and overall network performance, consolidating and relocating data throughout the life of the data mining application. Vendors insist that this method is valid. 'It is not necessary to make all the decisions up front, before realizing returns,' says IBM's Simoudis.



Another possibility is to invest in data mining tools that use optimization techniques. The current generation of tools is proficient in translating SQL into business language. Most also provide query governors, which can prevent users from generating 'the query from hell,' says Aberdeen's Moran.
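In spirit, a query governor is just a cost check applied before a query is allowed to run; a bare-bones Python sketch under that assumption (the limit and the cost estimator are invented):

    MAX_ESTIMATED_ROWS = 1_000_000   # illustrative limit an administrator might set

    def estimate_rows(table_rows, selectivity):
        # Stand-in for the optimizer's estimate; a real governor asks the DBMS.
        return table_rows * selectivity

    def governed_run(query, run):
        estimated = estimate_rows(query["table_rows"], query["selectivity"])
        if estimated > MAX_ESTIMATED_ROWS:
            raise RuntimeError(f"rejected: ~{estimated:,.0f} rows exceeds governor limit")
        return run(query)

    # A 'query from hell' is refused before it can swamp the server.
    try:
        governed_run({"table_rows": 25_000_000, "selectivity": 0.9}, run=lambda q: "...")
    except RuntimeError as err:
        print(err)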



Some vendors say response times may be enhanced depending on the hardware you choose. Companies such as IBM and Sun promote the need for a hardware platform that can grow from a few linked workstations to a network of super-processors. Such a platform enables not only scalability but more efficient processing. IBM's SP2 machine, also known as the 'LAN in a can,' uses multiple modular RISC processors connected by a high-speed bridge to process complex queries in parallel. Sun's Solaris product line scales from workstations to high-powered symmetric multiprocessing (SMP)-based networks.



'As you start performing more sophisticated data mining operations, you move more data around,' says David Gelardi, manager of commercial parallel systems marketing for the RS/6000 division of IBM. 'The tighter the integration of those components, the better.'



SAFE AND SOUND



Planned integration can be an important data storage issue. Managers of networks on which large volumes of data will be mined-from hundreds of gigabytes to terabytes-need a cost-effective plan regarding the storage and management of that data. For LANs with many types of storage and CPUs, a centralized storage strategy may be more efficient than individualized storage devices.



The fortunes of storage vendor EMC (Hopkinton, MA) have risen dramatically in the last two years, thanks to sites that need to store larger volumes of data for on-line access. EMC's Symmetrix line of storage products enables sites to plug multiple servers into one central storage device. A storage facility that transmits queries quickly and efficiently is critical to the success of data mining applications, explains Roy Sanford, EMC's director of solutions for partner programs.



Yet the cost of storage solutions, centralized or distributed, can be prohibitive for some organizations. Extremely cost-conscious LAN managers can be scared away by the price tags affixed to the ability to keep gigabytes or terabytes of data at users' fingertips. Sanford estimates that storage accounts for 30 percent of the cost of a data mining system. Industry experts suggest that storing 500GB of data can cost upward of $1 million. 'The industry looks at storage as a hidden cornerstone of the computing environment,' he says. 'It shouldn't be hidden anymore.'



The cost catches some by surprise, but the overall cost of data mining solutions may be less than expected. 'Data mining is becoming very affordable now,' confirms Information Discovery's Parsaye. Improvements in the price and performance of parallel processing platforms such as RISC machines and SMP devices are enabling buyers to obtain millions of instructions per second at the lowest prices to date.



As smaller servers and large workstations support increasing power, they will be able to handle more of the data mining analysis workload that has so far required superservers. 'What took a $10-million Teradata system five years ago is now increasingly being run on Hewlett-Packard, Sun Solaris, Digital Alphas, and Silicon Graphics [workstations],' Love says.



PICKS AND SHOVELS



The spectacular growth in the number of data mining vendors and products is a sign that end users are willing to try a variety of approaches to achieve simpler, wider access to heretofore mysterious corporate operational data. Because the market is so new, data mining tools are not yet clearly categorized by type. There's considerable overlap from product to product in terms of functionality. Steve Smith, director of advanced analytics for Pilot Software (Cambridge, MA), says that one way to group tools is in three categories:



Tools that provide database access (typically by using a GUI on top of a SQL query structure)


Tools that produce data reports, from which users can formulate more detailed questions and 'drill down' to more specific information


Multidimensional database environments, which allow queries to be posed in multiple dimensions



Adding to the confusion in this market is the inevitable formation of partnerships. Some vendors are pursuing such arrangements in hopes of combining multiple features and findings and providing users with specific data mining solutions.



For example, Pilot Software has partnered with Dun & Bradstreet Information Services (Wilton, CT) to produce sales and marketing applications for the transportation, pharmaceutical, and wireless industries. The new products, scheduled for release this summer, are intended to integrate leading-edge data mining capabilities with large-scale data stores, says Paul Buta, product manager for marketing intelligence at Pilot. 'Companies can leverage the data mining experiences of our partners' to get up and running faster with data mining tools, he says.



Because Pilot's LightShip product suite is an on-line analytical processing (OLAP) environment, it uses operational data that has been summarized into specific formats suitable for multidimensional searches, Buta says. OLAP solutions also allow users to drill down to more detailed levels of information or to zoom up to higher, more general levels.
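Drilling down and zooming up amount to re-aggregating the same summarized data at finer or coarser levels of a hierarchy; a toy Python illustration (the region/city hierarchy and figures are invented):

    from collections import defaultdict

    summary = [
        {"region": "Northeast", "city": "Boston",   "sales": 120.0},
        {"region": "Northeast", "city": "New York", "sales": 200.0},
        {"region": "South",     "city": "Atlanta",  "sales": 90.0},
    ]

    def rollup(rows, level):
        totals = defaultdict(float)
        for row in rows:
            totals[row[level]] += row["sales"]
        return dict(totals)

    print(rollup(summary, "region"))   # zoomed up: totals per region
    print(rollup(summary, "city"))     # drilled down: totals per city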



One of Pilot's partners is Lightbridge (Waltham, MA), a provider of customer acquisition and retention services for the wireless communications industry. Working with Pilot, Lightbridge will offer a data mining system that can provide detailed intelligence for wireless carriers, which typically experience huge customer turnover after initial service contracts expire. The application is designed to help carriers better understand their clients, reduce customer turnover, and decrease acquisition costs. Over time, a wireless carrier could use the product to generate predictions based on previous customer activity, ultimately identifying customers at risk for defection. Such customers could then be approached with proactive offers such as service discounts.
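One very simple way such a prediction could work is a weighted score over a few features of past activity; the Python sketch below is purely illustrative (the features, weights, and cutoff are invented and are not Lightbridge's or Pilot's method):

    def churn_risk(c):
        # Naive linear score over illustrative features of past activity.
        score = 0.0
        score += 0.5 if c["months_to_contract_end"] <= 2 else 0.0
        score += 0.3 if c["support_calls_last_90d"] >= 3 else 0.0
        score += 0.2 if c["usage_trend"] < 0 else 0.0
        return score

    customers = [
        {"id": "A", "months_to_contract_end": 1, "support_calls_last_90d": 4, "usage_trend": -0.2},
        {"id": "B", "months_to_contract_end": 9, "support_calls_last_90d": 0, "usage_trend": 0.1},
    ]
    at_risk = [c["id"] for c in customers if churn_risk(c) >= 0.5]
    print("approach with a retention offer:", at_risk)   # ['A']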



These types of partnerships are only one indication that vendors are becoming more sophisticated in their data mining product offerings. Even technical enhancements are flowing fast and furious. 'We are seeing a considerable amount of sophistication among tool suppliers in dealing with LAN traffic,' says Moran of the Aberdeen Group. 'They are learning to build the appropriate multiple threads for that universe. If you have a large body of users banging away, you must be aware of how things are routed around the LAN,' he says.



With so much change in the data mining tools market, how can a prospective buyer make the right choice? The majority of the tools support Windows, although some processing-intensive tools, such as the Statistical Analysis System (SAS), are better suited to run in RISC or Unix environments. Considering a tool's connectivity before purchasing it may prevent the network manager from becoming trapped by tools that are unable to grow with the site's data mining needs.



LAN managers who are selecting data mining tools should also consider the needs, perceived or otherwise, of business users. Ask users what they expect to do with the data mining tools now, and what they expect to do six months from now. Such users are often unsure about the types of queries that they will want to ask, so a tool that provides flexibility is critical. Most users access a mere 20 percent of the information in a database 80 percent of the time, says Donna Ruben, data warehousing technology manager for Sun. With that knowledge, network managers can select tools that allow quick access to routinely searched data.



GIGABYTE GOLD RUSH?



Once a site begins data mining, Ruben warns, there's no turning back. 'It's like a drug; people can't get enough,' she says. Faced with the possible 'addiction' of network users, how can a LAN manager accurately plan for the growth of a site's data mining applications?



'Assume that if you are successful, your needs will grow dramatically,' says IBM's Gelardi. '[They will] double and triple within 18 months to two years. Look at ceilings, not at floors.'



When first considering the addition of data mining capabilities, most LAN managers express an interest in putting historical operational data on-line. Such data volumes can be enormous and costly to bring on-line. And determining how much data is there requires an in-depth assessment of the operational data and the answers to such questions as: How much duplication exists within the data? Are there substantial amounts of erroneous data? Is there data that would be superfluous to the actual data mining application?



Without fail, data mining experts advise LAN managers to perform a data modeling exercise before diving in head first. The data modeling exercise begins with the examination of a small, representative portion of data. The data should receive whatever cleansing or summarization techniques have been determined to give optimum performance.



The network manager then applies the appropriate data mining tools and judges the results. Wherever the outcome of the modeling is less than satisfactory, the process can be refined. For some sites, this means reevaluating the data available for mining.



Some find that their data contains far more errors or duplication than expected. Most quickly realize that they've underestimated the total amount of data needed on-line. 'They guess wrong, and they're always low,' Gelardi says.



MicroStrategy's Saylor relates the experience of a client who assumed that a 20GB data set would be sufficient for his site's data mining needs. Once the data mining began, however, the data set doubled every six weeks until it reached 500GB.
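For a sense of how fast that is, a back-of-the-envelope calculation (only the 20GB starting point, the 500GB endpoint, and the six-week doubling period come from the anecdote):

    import math

    start_gb, end_gb, weeks_per_doubling = 20, 500, 6
    doublings = math.log2(end_gb / start_gb)      # about 4.6 doublings
    weeks = doublings * weeks_per_doubling        # roughly 28 weeks, i.e. about 7 months
    print(f"{doublings:.1f} doublings, ~{weeks:.0f} weeks")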



Growth can be accommodated if the initial pilot is well planned. Rather than throwing 500GB of data into a warehouse and hoping to mine it, LAN managers should hire an experienced consultant to help create a pilot that is one-tenth or one-twentieth the expected size of the final data mining operation, advises Saylor. With each success, the operation can safely expand.



Using the power and tangible results that data mining capabilities provide, a company can stockpile opportunities for future growth and additional development. Offered by firms ranging from small vendors to industry giants, true data mining tools and data mining-style products and services are likely to continue growing at phenomenal rates.



Sites already using the technology are expected to expand their data mining projects as users become more focused in specific subject areas, says Prism Solutions' Ashbrook. Data mining is also expected to be adopted by scores of new users as the technology trickles down from deep-pocketed Fortune 1000 companies to smaller enterprises.



Perhaps the most important aspect of the data mining phenomenon is that it may swell as companies discover the technology's benefits. After all, if the mining of one key nugget is enough to pay for the system, a company that devotes even a moderate effort to data mining is bound to experience success. 'In data mining, you must win only 10 percent of the time to really win,' explains Parsaye. 'If you find that pattern during your first attempt, you are suddenly a hero.' Companies that continue the effort, says Parsaye, 'will find that the payback curve is linear.'



Cheryl D. Krivda is a technical journalist who specializes in information systems topics. She can be reached via the Internet at 5309513@mcimail.com.



>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Wed, 29 May 1996 10:58:25 -0500
From: 'Jim Gallman' (jgallma1@146.61.102.20)
Subject: KDD applied to a corrective action data base

I am interested in applying data mining techniques to a corrective action
database. The database consists of 7 years of equipment repair documentation
(approximately 30,000/year) and personnel performance issues (approx.
1500/year). Each entry is linked to design, procurement and procedural
databases and are coded for issues. Many of the fields are keyword type (i.e.,
hard coded from a list of possible codes), but some of the important information
is free narrative. The increasing amount of data makes eye integration
unreliable for ensuring that all important trends are known.

I'd like at least two capabilities. The simplest: given a group of records
which I've determined to have some common theme, I'd like the software to
generate a 'signature' that defines that theme and then use that signature to
find other records belonging to the same group. I assume that the signature
would be fuzzy, with group membership being a probabilistic assessment, since
much of the data is in narrative form and not consistently coded.
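One common way to build such a signature from free-text narratives is to pool word counts over the labeled group and score other records by cosine similarity against that pooled vector; the Python sketch below illustrates the idea only and is not a recommendation of any particular tool (the example narratives are invented):

    import math
    from collections import Counter

    def vectorize(text):
        return Counter(text.lower().split())

    def cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def signature(records):
        # Pool word counts over the records known to share a theme.
        total = Counter()
        for text in records:
            total.update(vectorize(text))
        return total

    theme = ["pump seal leak during startup", "seal failure on feed pump"]
    candidates = ["valve actuator failed to open", "leaking seal found on charge pump"]
    sig = signature(theme)
    for text in candidates:
        print(f"{cosine(sig, vectorize(text)):.2f}  {text}")   # higher = closer to the theme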

The second capability is to have the software periodically analyze
the data for newly emerging issues (i.e., looking back over periods of time,
what new relationships are emerging).

My questions are: What tools are available? Does anyone have experience
applying these techniques to corrective action databases? Do the tools handle
free-text analysis (many of the issues will be buried, unrecognized, in the
narratives)?

Jim Gallman JGallma1@tuelectric.com 817-897-5673



>~~~Publications:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from Machine Learning List: Vol. 8, No. 8
Subject: Preprint available: 'Applying classification algorithms in practice'
Date: Wed, 24 Apr 1996 08:19:17 -0500
From: Carla Brodley (brodley@ecn.purdue.edu)

In light of the recent concerns about applications papers in ML, we would
like to announce that our paper

'Applying classification algorithms in practice,'
Brodley, C. E. and Smyth, P. (To appear) Statistics and Computing.

is available at: http://yake.ecn.purdue.edu/~brodley/

In particular, the paper discusses how applications can be a source of
important new research problems for theorists, and argues that maximizing
predictive performance is only one of many factors that influence
success in practical applications.

Carla Brodley and Padhraic Smyth

Applying Classification Algorithms in Practice

Carla E. Brodley (Purdue University), brodley@ecn.purdue.edu
Padhraic Smyth (UC Irvine and JPL), smyth@ics.uci.edu

ABSTRACT

In this paper we present a perspective on the overall process of developing
classifiers for real-world classification problems. Specifically, we
identify, categorize and discuss the various problem-specific factors that
influence the development process. Illustrative examples are provided to
demonstrate the iterative nature of the process of applying classification
algorithms in practice. In addition, we present a case study of a large scale
classification application using the process framework described, providing an
end-to-end example of the iterative nature of the application process. The
paper concludes that the process of developing classification applications for
operational use involves many factors not normally considered in the typical
discussion of classification models and algorithms.



>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
From: A.N.Pryke@cs.bham.ac.uk
Date: Tue, 28 May 1996 16:30:56 +0100
Subject: Searchable Bibliography of Online KDD papers

The Data Mine http://www.cs.bham.ac.uk/~anp/TheDataMine.html
provides information about Data Mining and Knowledge Discovery in
Databases (KDD).

Contributions of relevant information about other sites on the web,
papers/publications on data mining, bibliographies, conferences,
software, etc. are welcome and can now be entered using a new forms
interface: http://www.cs.bham.ac.uk/~anp/contributions.

The searchable bibliographies include about 40 online papers and about 130
offline references. There is currently a bias towards
machine-learning papers. The bibliographies can also be downloaded in
BibTeX or BibTeX/HTML format.

BibTeX-format information on relevant online papers (including a 'url'
field) is particularly welcome and can be submitted via the forms.
Collections of formatted entries are welcome via email.

Andy Pryke

---
Andy Pryke, Research Student, Computer Science, Birmingham University
Data Mining Information - http://www.cs.bham.ac.uk/~anp/TheDataMine.html


>~~~Siftware:~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Fri, 24 May 96 18:12:13 +0400 (WSU DST)
From: 'Mikhail V.Kiselev' (megaputer@glas.apc.org)
Subject: new data mining product PolyAnalyst v2.01

The software solutions company Megaputer Intelligence, Ltd.
releases its new data mining product, PolyAnalyst v2.01.

PolyAnalyst is a convenient integrated environment for database
exploration. It has a friendly object-oriented user interface.
PolyAnalyst incorporates a set of powerful tools - exploration engines
- for intelligent data analysis.

--The main engine, Core PolyAnalyst, finds the exact form of
multi-parameter functional dependencies in data, expressing them as
mathematical formulae and/or structural algorithms including IF and FOR
blocks, as well as other constructions. A unique feature of
PolyAnalyst is its ability to discover empirical laws of a great
variety of forms. In particular, it can work with structured data,
which are not necessarily represented as just sets of attribute
values.

--The second engine, ARNAVAC, detects the presence of dependencies in data
with the help of statistical methods, displaying the results of
exploration in tabular form. ARNAVAC also separates the
sub-population of points obeying the found dependence from a diffuse
component considered to be noise or database errors.

--Other engines use multiple regression and data
visualization as discovery methods.

Summary of PolyAnalyst features and characteristics:

1. PolyAnalyst can work with databases consisting of up to 16000 records and
up to 1000 attribute fields.

2. It can export and import files in DBF and CSV (comma separated values)
formats.

3. Data from different sources can be combined using a mechanism of keys and
references.

4. All contents of a specific database exploration task, including data,
graphs, results obtained by the exploration engines, rules, and laws, are
stored in a separate project file. This allows one to use a single copy of
PolyAnalyst in many different research projects. Upon loading a project
file, you can continue your work on the project exactly at the step at which
you left it the previous time.

5. Analyzed data may be a mixture of numerical, boolean or categorical values.
PolyAnalyst can work with partially missing data.

6. Data can be easily split into several subsets, or datasets, which can be
explored separately. New datasets can be created by splitting data according
to various methods and criteria, or they can result from Boolean
operations (intersection, union, complement) on existing datasets.

7. All elements of a project (datasets, rules, tables, currently running
exploration engines, their reports, and so on) are represented as objects
depicted by icons. This object-oriented user interface makes PolyAnalyst
easy to learn and operate.

8. Three main exploration engines of PolyAnalyst automatically extract
information, or knowledge, from the data:

--The first engine, Core PolyAnalyst, discovers the exact form of
dependencies in data, expressing them in the language of mathematical
expressions, structural blocks, and other constructions, which are
very intuitive and easy to understand. Core PolyAnalyst can discover
laws of a very broad nature. Discovered rules can be edited; they can also
be combined with rules entered by the user or rules obtained by
other PolyAnalyst exploration engines. Combining rules
discovered by PolyAnalyst with the user's prior knowledge of the field
generally produces elaborate and very effective models.

--The second exploration engine, ARNAVAC, detects the presence of
functional dependencies in data and finds data fields obeying these
dependencies. It also determines the accuracy and significance of the
dependence found, as well as a subset of data points that do not obey the
dependence (possible noise or database errors). ARNAVAC represents the
discovered dependence in a tabular form, revealing its general
structure.

--The third exploration engine provides multiple regression with automated
selection of independent variables. It is a more traditional but still very
useful tool (a rough sketch of this kind of variable selection appears after
this feature list).

9. One additional data exploration method featured by PolyAnalyst is
based on its data visualization capabilities. The user can create
various graphs, manipulate points and datasets with the help of these
graphs, modify data in the graphs using entered and discovered rules,
or graphically analyze multi-dimensional models, varying their
parameters with sliders. This manual data analysis is very helpful in
difficult cases, for example when complex derived parameters of the
original data structures have to be used as independent variables for
obtaining a more precise empirical law from Core PolyAnalyst.

10. Exploration engines can work concurrently.

11. PolyAnalyst maintains strict control of significance of reported
results.

12. PolyAnalyst performs data exploration automatically, while
keeping interaction with the user to a minimum. This feature allows
even an inexperienced user with no mathematical or statistical
training to exercise the full power of PolyAnalyst exploration
engines.
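
As a rough illustration of the automated variable selection mentioned for the third engine (feature 8 above), here is a greedy forward-selection sketch in Python built on ordinary least squares; it is a generic textbook procedure, not PolyAnalyst's actual algorithm, and the synthetic data are invented:

    import numpy as np

    def forward_select(X, y, max_vars=3):
        # Greedily add the variable that most reduces the residual sum of squares.
        chosen, remaining = [], list(range(X.shape[1]))
        best_rss = float(np.sum((y - y.mean()) ** 2))
        while remaining and len(chosen) < max_vars:
            trials = []
            for j in remaining:
                A = np.column_stack([np.ones(len(y)), X[:, chosen + [j]]])
                coef, *_ = np.linalg.lstsq(A, y, rcond=None)
                trials.append((float(np.sum((y - A @ coef) ** 2)), j))
            rss, j = min(trials)
            if rss >= best_rss:        # no candidate improves the fit; stop
                break
            best_rss = rss
            chosen.append(j)
            remaining.remove(j)
        return chosen

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))
    y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + rng.normal(scale=0.1, size=200)
    print("selected columns:", forward_select(X, y))   # columns 1 and 4 should come first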

For further info contact

Dr. Mikhail Kiselev,
Megaputer Intelligence Ltd.,
megaputer@glas.apc.org
tel: + 7 095/231-8079,
tel/fax: + 7 095/485-9354,
38, Bolshaya Tatarskaya,
Moscow 113184
RUSSIA

or in the USA:

Sergei Ananyan
Megaputer Intelligence rep
tel: 804-221-1522
fax: 804-220-3878



>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Date: Wed, 29 May 1996 09:05:15 -0500
To: kdd@gte.com
From: Chris Turnquist (chris@kd1.com)
Subject: New Knowledge Discovery company

Knowledge Discovery One, Inc. (KD1) is a provider of application software,
consulting services, and off-site computing services for knowledge discovery
and data mining projects. This newly formed team of ten professionals is
made up of business and technical consultants with extensive experience in
the data warehouse and decision support marketplace. KD1's charter is the
application of advanced knowledge discovery and data mining tools and
strategies to today's most challenging decision support applications. Through
partnerships with leading knowledge discovery/data mining tool vendors,
together with a fully equipped data center, KD1 has the tools and resources to
deliver advanced knowledge discovery applications and proof of concept projects
today.

For more information, visit our web site at www.kd1.com, or email us at
info@kd1.com


>~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~