Data Mining and Knowledge Discovery Nuggets 96:16, e-mailed 96-05-20
By Julie Bort
Publication Date: April 29, 1996 (Vol. 18, Issue 18)
Deep within the pulsating mass of bits and bytes strung throughout the enterprise lie answers to the most perplexing problems of any business. Which customers will turn to competitors? Which offers will prompt customers to buy more? What are the signs of fraudulent activity?
The relatively new data warehouse concept for client/server architectures is a step in the right direction toward getting those answers. But if an organization really wants huge paybacks from its warehouse or data marts, it will need to turn to data mining. Data mining is the act of drilling through huge volumes of information to discover relationships, or to answer specific questions, that are too broad for traditional query tools.
Fundamentally, data mining is statistical analysis and has been in practice as long as there have been mathematicians. But until recently, statistical analysis was a time-consuming, manual process and accuracy depended heavily on the person performing the analysis. No more. Today, thanks to the maturing of neural networks and other sophisticated technologies, tools exist that automate the process, making data mining a practical solution for a wide range of companies. Couple these tools with a growing base of accessible enterprise data -- often in the form of a data warehouse -- and a company has at its disposal a tool with immense implications.
'We use the HNC [data mining] product to identify customers who are about to leave our bank, an extremely important application for us. It is far easier to keep a customer than to go out and get a new one. This reduces our expenses,' says Bob Esters, vice president of marketing research and database marketing for Star Bank Inc., a regional bank with 250 branches throughout the Midwest and user of the Database Mining Workstation from HNC Software Inc., in San Diego. 'Beyond that, applications for things like site analysis and cross-selling opportunities are ripe for this kind of tool.'
A MODEL ATTRACTION. Tapping into this potential requires a basic understanding of data mining, which is as complex as its manual statistical counterpart.
'There are four operations for doing discovery driven mining. These are predictive modeling, database segmentation, link analysis, and deviation detection,' says Evangelos Simoudis, director of data mining solutions for IBM's World Wide Decision Support Solutions Division, in San Jose, Calif. 'You need a variety of tools to perform these [operations] because various types of data behave differently.'
Predictive modelers attempt to forecast a particular event -- such as which customers of a bank are likely to move to the competition. They assume that a company has a specific question it is trying to answer, and they provide that answer by assigning scores that rank the likelihood of certain outcomes.
Most of the readily available tools perform predictive modeling. In generic terms, a predictive modeler functions something like this: A company decides what it wants to research -- for example, which customers are likely to leave. It takes a sampling of scrubbed data on customers that have left and feeds it to the predictive modeler, telling it this is the sample of 'bad' customers. It also takes a sample of data from longtime customers and feeds it to the modeler, telling it this is the sample of 'good' customers.
The tool then sifts through these samples to uncover variables and combinations of variables that make up the typical 'bad' and typical 'good' customer profiles, and it returns a ranking of those variables. The results may then read as follows: Customers who are over 50, have an income greater than $100,000, are male, drive a Buick, and own their home have a 30 percent chance of leaving. Customers who are 18 to 25 years of age, have an income of less than $25,000, drive a Honda, rent, and are male have a 70 percent chance of leaving.
With these results, a company can run a query against its customer database to draw lists of customers that fit such profiles and design marketing programs to target the defined groups. Furthermore, as the modeler receives more data it will 'learn' and produce increasingly accurate predictions.
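For readers who want to see the shape of this workflow in code, the following is a minimal, hypothetical sketch in Python using the pandas and scikit-learn libraries (modern stand-ins that the article does not mention); the file name, column names, and choice of a logistic-regression model are all assumptions made purely for illustration.

    # Hypothetical sketch of the predictive-modeling workflow described above.
    # The extract, column names, and model choice are illustrative assumptions,
    # not details taken from any product in this article.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Scrubbed extract of customers, labeled 1 ('bad': left the bank) or 0 ('good': stayed).
    customers = pd.read_csv("customer_extract.csv")
    features = pd.get_dummies(customers[["age", "income", "sex", "car_make", "owns_home"]])
    labels = customers["left_bank"]

    train_X, test_X, train_y, test_y = train_test_split(features, labels, test_size=0.3)

    model = LogisticRegression(max_iter=1000)
    model.fit(train_X, train_y)  # learn which variables separate 'bad' from 'good' customers

    # Rank the variables by the weight the model assigned to each of them.
    ranking = sorted(zip(features.columns, model.coef_[0]), key=lambda p: abs(p[1]), reverse=True)
    for name, weight in ranking:
        print(f"{name}: {weight:+.2f}")

    # Score the full customer base: the estimated probability that each customer will leave.
    customers["churn_risk"] = model.predict_proba(features)[:, 1]

The scored list can then be queried for customers above a chosen risk threshold, which corresponds to the "draw lists of customers that fit such profiles" step described above.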
Predictive modeling tools can be segmented into several types, the most common of which are neural network products. Neural networks are computer applications that simulate the function of a human brain. They can be trained and are adept at the nonlinear reasoning that is the hallmark of many 'leap to a conclusion' human beings. Neural network tools include HNC's Database Mining Workstation and the DataCruncher, from DataMind Inc., in Redwood City, Calif.
The neural network predictive modeler is ideal for companies that have a great depth of statistical information and analysts who are already doing their own analyses, because neural networks work far faster than any human being working on a spreadsheet can.
'The beauty of the tool is that it can model in a nonlinear way and the process is fast. It makes the same decisions along the way that an analyst would make regarding which variables to include,' Esters says.
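As a rough illustration of the nonlinear modeling Esters describes, here is a hedged sketch that swaps the simple model above for a small feed-forward neural network. Scikit-learn's MLPClassifier is only a modern stand-in, not the technology inside HNC's or DataMind's products, and the sketch reuses the train_X/train_y split from the earlier example.

    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # A small feed-forward network; the hidden layer lets it learn nonlinear
    # combinations of variables that a purely linear model would miss.
    nn_model = make_pipeline(
        StandardScaler(),  # neural networks generally train better on scaled numeric inputs
        MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),
    )
    nn_model.fit(train_X, train_y)
    print("holdout accuracy:", nn_model.score(test_X, test_y))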
Other users concur.
'With statistics that used to take a month to model, we can have a new model overnight,' says Mike Eichorst, vice president of database marketing for Chase Manhattan Bank Inc., in New York, and a user of HNC's Database Mining Workstation.
Whether the simulated human thinking of a neural network modeler is more accurate than human thought remains debatable. Esters says that neural network products are comparable to, but not better than, functions of the human brain. But Eichorst disagrees.
'We originally used the mining workstation to assign customers to market segments. It consistently outperformed traditional statistical analysis methods,' Eichorst says.
PREDICTIONS INDUCED. The drawback to neural network products is that they are black boxes, users say. Data is fed in and results come out, but the tool doesn't report how it reaches its conclusions. And sometimes, the how is as revealing as the what, users say.
An alternative type of predictive modeling tool relies on inductive reasoning algorithms rather than neural networks. It is exemplified by both IDIS Predictive Modeler (IDIS PM) from Information Discovery Inc., in Los Angeles, and SAS Stat, from SAS Institute Inc., in Cary, N.C.
Users say the inductive reasoning method is a better choice for company analysts who have little interest in extremely complex models and would rather have insight into the data itself.
'We need to determine what the data elements are. We want to understand them,' says Ken Zabel, vice president of business development at Customer Focus International Inc. (CFI), in Diamond Bar, Calif. 'We looked at neural network tools, but with some neural networks you can't really understand why certain choices are made.'
CFI builds customer information systems (CIS) for financial institutions. It uses IDIS PM to sort through a client's data before a targeted CIS warehouse can be created.
'We require a product like IDIS to perform affinity analysis to help our banks determine which variables make people have an affinity for purchasing certain products,' Zabel says.
In addition, inductive tools, also known as rule-based or tree-based modelers, may be more appropriate for dealing with data that is not easily quantified, according to vendors.
'Neural network predictors must quantify all the data, even data that isn't naturally quantified. With rule prediction, the data doesn't need to be numeric. It maintains the nature of the data,' explains Diana Lin, manager of application support at Information Discovery.
Lin offers the example of loan payment predictions. If a neural network were to predict how a loan would be paid -- whether with cash, check, credit card, or fund transfer -- it would assign numbers to those options, then offer a numeric prediction that would have to be interpreted. IDIS PM would generate a prediction of the next payment method by name.
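To make the contrast concrete, here is a hypothetical sketch of the rule/tree-based approach using scikit-learn's DecisionTreeClassifier as a stand-in (the article does not describe how IDIS PM is implemented). The loan data is invented, but the sketch shows the two points Lin makes: the learned rules can be read directly, and the prediction comes back as a named category rather than a number.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Invented loan-payment history; the goal is to predict the next payment method by name.
    history = pd.DataFrame({
        "balance": [5000, 200, 12000, 800, 300, 9000],
        "years_as_customer": [10, 1, 7, 2, 1, 12],
        "last_method": ["check", "cash", "fund transfer", "credit card", "cash", "check"],
        "next_method": ["check", "cash", "fund transfer", "credit card", "cash", "fund transfer"],
    })
    X = pd.get_dummies(history[["balance", "years_as_customer", "last_method"]])
    tree = DecisionTreeClassifier(max_depth=3).fit(X, history["next_method"])

    # Unlike a neural network, the fitted rules can be inspected directly...
    print(export_text(tree, feature_names=list(X.columns)))
    # ...and the prediction is returned as a category name, not a number to interpret.
    print(tree.predict(X.head(1)))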
USER-UNFRIENDLY. The whole genre of tools is not particularly user-friendly. One factor to consider when shopping for a data mining application is how the data will be fed into the modeler. Some tools, such as IDIS PM, work on a separate workstation but can be attached to a LAN.
'It's a very straightforward process. Using IDIS is very intuitive,' Zabel says. 'We draw a subset of information from the enterprise warehouse into the [IDIS] workstation and we can draw it over the LAN. You can extract information from relational database tables -- prejoined conditions. It's very well structured.'
Other tools, such as HNC's Database Mining Workstation, run on stand-alone machines that cannot automate the burdensome task of dumping data.
'To get ready to use the tool, you've got to prepare an extract [from the data warehouse]. Then you've got to go through the laborious task of organizing it and manipulating data to put it into the Database Mining Workstation,' Esters says.
Beyond the data mining system's physical connection, reviewing the model itself can be tricky and requires, at the very least, a person who excels at mathematical analysis and, at best, someone trained in statistical analysis.
'You can't just go into data mining saying I'm going to get a packaged tool off the shelf that I'll grab and dump data into,' says Ramin Makili, manager of the knowledge technology group of Andersen Consulting Inc., in Chicago, and a user of DataMind's DataCruncher.
'Once you have the tool you need to explain the models. You've got to have a room full of geek scientists like me,' Makili -- who was a nuclear physicist prior to becoming Andersen's predictive modeling expert -- adds.
Other users agree with Makili.
'This is not intuitive. You have to be analytical,' Eichorst says. 'And you have to be very insightful. You need to be able to look at any two variables to see the correlation.'
OTHER MINING TECHNIQUES. Beyond predictive modelers there exists a group of products that uncover relationships before a hypothesis has even been formed. These tools can be used ahead of a predictive modeler to uncover facts about your business you wouldn't think to ask about. The classic example of such exploration is the grocery cart analogy. By using exploration tools, a department store discovered that the two items most commonly found in the same shopping cart were diapers and beer. It then used a predictive modeler to find out which customers were likely to buy diapers and beer, so that it could send them marketing materials.
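To make the grocery-cart idea concrete, here is a minimal, hypothetical sketch in plain Python (not a description of any product named in this article) that counts which pairs of items most often land in the same cart, which is the heart of this kind of association discovery.

    from collections import Counter
    from itertools import combinations

    # Invented shopping carts; each inner list is one customer's basket.
    carts = [
        ["diapers", "beer", "chips"],
        ["diapers", "beer"],
        ["milk", "bread"],
        ["diapers", "beer", "milk"],
    ]

    pair_counts = Counter()
    for cart in carts:
        for pair in combinations(sorted(set(cart)), 2):
            pair_counts[pair] += 1

    # The most frequently co-purchased pairs -- the 'diapers and beer' discovery.
    for pair, count in pair_counts.most_common(3):
        print(pair, count)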
One tool that specializes in such association discovery is Information Discovery's IDIS. This tool, separate from the company's predictive modeler, accesses relational databases directly, via agents, to uncover trends such as market clusters and financial patterns. These patterns can then be modeled for further analysis.
Another tool that performs exploratory analysis is SAS Insight, which belongs in the visualization category. Visualization tools let a user assign colors to variables and then spot relationships among them visually. Again, once relationships are uncovered, further analysis or modeling may be employed.
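As a loose analogue of that kind of visual exploration (matplotlib is only a stand-in here; SAS Insight works differently and interactively), one might color a scatter plot of two variables by a third and look for clusters by eye, reusing the hypothetical customers table from the first sketch.

    import matplotlib.pyplot as plt

    # Plot income against age, colored by whether the customer left the bank.
    plt.scatter(customers["age"], customers["income"],
                c=customers["left_bank"], cmap="coolwarm", alpha=0.6)
    plt.xlabel("age")
    plt.ylabel("income")
    plt.title("Customers colored by attrition")
    plt.show()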
However, because exploration tools have an unknown return on investment for many applications, they may be most appropriate as the next step after predictive modeling for known needs has been mastered. For instance, if a company had a model to predict attrition, one to predict fraud, and one to predict cross-sales, exploration for more models might be in order.
Users should also be aware of a movement that is beginning to take shape -- the tool suite. Suites combine various technologies and perform multiple forms of mining.
This month, IBM began beta testing its Intelligent Miner data mining development platform, expected to ship in July. The Intelligent Miner combines kernels of several types of mining technologies, including predictive modeling, association discovery, and visualization. It is aimed at corporations that want to develop their own applications. In addition, IBM will offer several targeted applications, which include a customer segmentation application, a market-basket analysis application, and a fraud-detection system.
SAS Institute is also offering a suite that incorporates several data mining tools into its SAS System for Data Warehousing. It has a neural network predictive modeler in development, although the company already offers some neural network capabilities based on SAS macros.
The result is that data mining turns business mysteries into competitive advantages.
'Unlike some other products, we didn't create these applications and go looking for a market,' IBM's Simoudis says. 'The applications we developed have come through our experiences in performing data mining services. Customers would come to us and say, "We have an attrition problem" or "We need to attract new customers. How can we do that?"'
Data mining is the answer.
Julie Bort is a free-lance writer based in Dillon, Colo.
Vendor contact information
HNC Software Inc.
San Diego
(619) 546-8877
http://www.hncs.com
DataMind Inc.
Redwood City, Calif.
(415) 364-5580
http://www.datamindcorp.com
Information Discovery Inc.
Hermosa Beach, Calif.
(310) 937-3600
http://www.datamining.com
SAS Institute Inc.
Cary, N.C.
(919) 677-8000
http://www.sas.com
IBM's World Wide Decision Support Solutions Division
San Jose, Calif.
http://www.dss.ibm.com
Tips for striking it data rich
The best tools in the world won't find you any gems unless you follow a few simple procedures. Here are some tips for mining well:
* Use only scrubbed data. (See 'Scrubbing dirty data,' Dec. 18, 1995, page 1.)
* Have business analysts, statistical analysts, and IT staff on the original application development team. Business analysts help clarify the importance of variables. The tool may scream that a correlation between two items is important, but it may turn out to be a no-brainer. Statisticians can bring understanding to the results. And IT staff can ease the burden of drawing data samples.
* When doing predictive modeling, test the model twice before relying on it. First test the model by feeding it data in a situation with a known outcome. For example, if you're trying to find out which customers might buy a product, use a list containing customers who already bought it and ones who didn't, and see whether the model points to the correct ones (a minimal sketch of this check appears after this list). Then, test the model with a sample promotion: make the offer to a small sample of the customers indicated by the predictive modeler to see how on target it is with live data.
* Continue to refine the model by feeding it the results of every marketing campaign.
* Add new models gradually as the tool becomes mastered.
* Realize that despite its scientific stance, modeling and all other aspects of data mining are more art than science. How the results of the mining are used will determine the benefits.
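As a rough illustration of the "test against a known outcome" tip above, the following hypothetical sketch (again using scikit-learn as a stand-in, and reusing the model and held-out test split from the first sketch) checks the model's predictions against customers whose behavior is already known.

    from sklearn.metrics import classification_report

    # These customers' outcomes are already known, so the predictions can be verified.
    predicted = model.predict(test_X)
    print(classification_report(test_y, predicted))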
CALL FOR PAPERS: Special issue of _JASIS_ on data mining
It is estimated that the amount of information in the world doubles every 20 months, and many scientific, government, and corporate information systems are being overwhelmed by the flood of data they routinely generate and store. These massive amounts of data exceed human experts' ability to analyze them with traditional tools, even though they contain a potential gold mine of valuable information. Unfortunately, today's database technology offers little functionality for exploring such data, while knowledge discovery techniques for intelligent data analysis are not yet mature enough for large data sets. Systems offering a wide variety of techniques for the automatic (or semi-automatic) discovery of knowledge from databases will therefore play an increasingly important role. The term data mining, also known as database mining or knowledge discovery in databases (KDD), was coined to emphasize the challenges of knowledge discovery in large databases and to motivate researchers and application developers to meet that challenge.
Papers in this area are sought. Specific topics of interest include, but are not limited to, the following:
* Theory and Foundational Issues in Data Mining
This list of topics is not intended to be exhaustive but an indication of typical topics of interest. Prospective authors are encouraged to submit papers on any topic of relevance to data mining.
Inquiries (by voice, fax, or e-mail) and manuscript submissions (four copies of full articles) should be addressed to one of the guest editors. Manuscripts may be submitted in hard copy, by fax, or by e-mail in plain ASCII format. All manuscripts will be reviewed by a select panel of referees, and those accepted will be published in a special issue of _JASIS_. Original artwork and a signed copy of the copyright release form will be required for all accepted papers.
Manuscripts due: October 1, 1996
Acceptance notification: January 15, 1997
Final manuscripts: March 1, 1997
Publication: Late summer 1997
Guest editors:
Professor Vijay Raghavan
Center for Advanced Computer Studies
University of Southwestern Louisiana
P.O. Box 44330
Lafayette, LA 70504
Voice: (318) 482-6603
Fax: (318) 482-5791
E-mail: Vijay V. Raghavan
Dr. Hayri Sever
The Department of Computer Science & Engineering
Hacettepe University
06532 Beytepe, Ankara, Turkey
Fax: 90 312/235 4314
E-mail: Hayri Sever