
News


From: Hand, David J
Date: Wed, 5 Dec 2001 15:26:30 -0000
Subject: D. Hand Commentary on Arnold Goodman's remarks about KDD-2001

It was interesting to see Arnold Goodman's comments about the relationship between statistics and data mining. As the statistician amongst the three co-chairs of KDD-2002, perhaps I can add my view.

There is no doubt that there is mutual ignorance between statisticians and data miners. There is also no doubt that this is detrimental to both disciplines. This mutual ignorance and suspicion between statistics and the computational data-analytic disciplines is not new, and I have explored it in various publications (see the references below). I think one part of the reason lies in a conservatism in statistics (perhaps a consequence of its mathematical past, inducing an inclination towards rigour at the cost of adventurousness) versus a risk-taking attitude in computing (program it and see if it works, without worrying too much about provable properties of the algorithm). Another part (hinted at by Arnold) lies in the modern statistical concern with models and the modern computer science concern with algorithms. Both concerns are perfectly sound, and are natural consequences of the way the disciplines have developed. But to stress one while dismissing the other is to dramatically reduce the potential for discovery and progress.

It is said that those who do not understand statistics are condemned to reinvent it. As far as data analysis goes, there is certainly a lot of truth in this. Here are two examples off the top of my head:

(i) Overfitting problems in early neural network research. This work was characterised by claims of future predictive classification accuracy which turned out to be gross overestimates of the accuracy actually achieved. There was a real danger of a backlash, as customers realised that the proponents of neural networks were not fulfilling their promises. But then understanding dawned - the aim was one of generalisation, not one of fitting the training data. Simple optimisation criteria measuring the discrepancy between the true and predicted classes of the training set were replaced by more sophisticated measures which penalised overcomplex models, using methods such as weight decay. Statisticians, of course, could have explained this at the start. They had explored such phenomena 40 or 50 years ago, in domains such as variable selection in regression, and the comparative superiority of linear over quadratic discriminant analysis in classification problems. In both cases, the more complex model 'ought' to do better - it has more flexibility to model the underlying truth. The reason it didn't was that it was also able to model the peculiarities of the training set. This discovery resulted in a deep and comprehensive understanding of overfitting and model complexity, along with an associated theory of penalised goodness-of-fit criteria. An understanding of the statistical literature could have saved colossal effort, time, and resources in the development of neural network technology, and would have left us years ahead of where we are now.
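To make the phenomenon concrete, here is a minimal Python sketch (illustrative code with invented data, not part of the original discussion). It fits polynomials by least squares and then adds an L2 penalty on the coefficients - the regression analogue of weight decay - showing how the penalty trades a little training fit for better generalisation:

    import numpy as np

    rng = np.random.default_rng(0)

    def truth(x):
        # The underlying relationship is a simple quadratic.
        return 1.0 + 2.0 * x - 1.5 * x ** 2

    x_train = rng.uniform(-1, 1, 20)
    y_train = truth(x_train) + rng.normal(0.0, 0.3, x_train.size)
    x_test = rng.uniform(-1, 1, 200)
    y_test = truth(x_test) + rng.normal(0.0, 0.3, x_test.size)

    def design(x, degree):
        # Polynomial design matrix with columns 1, x, x^2, ..., x^degree.
        return np.vander(x, degree + 1, increasing=True)

    def fit(x, y, degree, ridge=0.0):
        # Penalised least squares: minimise ||Xw - y||^2 + ridge * ||w||^2,
        # solved here as an augmented ordinary least-squares problem.
        X = design(x, degree)
        k = X.shape[1]
        X_aug = np.vstack([X, np.sqrt(ridge) * np.eye(k)])
        y_aug = np.concatenate([y, np.zeros(k)])
        w, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
        return w

    def mse(w, x, y, degree):
        return np.mean((design(x, degree) @ w - y) ** 2)

    # The unpenalised high-degree fit tracks the training data closely but
    # generalises badly; the penalty sacrifices some training accuracy for
    # much better test accuracy.
    for degree, ridge in [(2, 0.0), (15, 0.0), (15, 1.0)]:
        w = fit(x_train, y_train, degree, ridge)
        print(f"degree={degree:2d}  ridge={ridge:3.1f}  "
              f"train MSE={mse(w, x_train, y_train, degree):.3f}  "
              f"test MSE={mse(w, x_test, y_test, degree):.3f}")

The same penalised goodness-of-fit idea underlies the classical regression and discriminant analysis results mentioned above.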

(ii) Causation in association rules. Statisticians have it drummed into them from the cradle that correlation does not imply causation. Understanding of this is now fairly widespread in other intellectual communities, including data mining, but it was not always so. Classically, data miners have come from a computer science background, often tasked with extracting useful information from a database, and this is what they have been good at: extracting information from a huge existing body of data. But in many (maybe most) problems, the aim is not really to describe the database at all. Rather, we want to make inferences from that database - perhaps to the future. How will people behave tomorrow, or next year? With large data sets (as statisticians will tell you) there are sound theoretical reasons why the patterns you have observed in your existing database may reflect an underlying reality, rather than mere sampling fluctuations - so that such patterns may well recur next year. Indeed, statisticians can even put a probability on such an event. But discovering that people who bought item A also bought item B does not imply that inducing people to buy A will increase the chance that they will also buy B. Early data mining work on association rules was often sold with the implicit suggestion that it did. Again, better statistical knowledge would have avoided this mistake; it would have avoided rediscovering statistics.
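To see the distinction concretely, here is a small Python sketch (toy data, invented purely for illustration) computing the standard descriptive measures for an association rule. All of them summarise the existing database; none of them answers the causal question:

    # Toy transaction database: each set is one customer's basket.
    transactions = [
        {"A", "B", "C"}, {"A", "B"}, {"A", "B", "D"}, {"B", "C"},
        {"A"}, {"B"}, {"A", "B"}, {"C", "D"},
    ]
    n = len(transactions)

    def support(itemset):
        # Fraction of baskets containing every item in the itemset.
        return sum(itemset <= basket for basket in transactions) / n

    # Rule "A => B": confidence estimates P(B | A) from the database;
    # lift compares that with the baseline rate of B.
    confidence = support({"A", "B"}) / support({"A"})
    lift = confidence / support({"B"})
    print(f"support(A,B) = {support({'A', 'B'}):.2f}")
    print(f"confidence(A => B) = {confidence:.2f}")
    print(f"lift(A => B) = {lift:.2f}")

    # High confidence or lift says only that A-buyers also bought B in
    # this data - an association, observed retrospectively. Whether
    # persuading customers to buy A would raise their chance of buying B
    # is a causal question the rule, being purely observational,
    # cannot answer.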

The point of these examples is not to try to demonstrate that statisticians are in any sense superior to data miners. That would be an absurd claim (especially since I regard myself as sitting in both camps). Rather, it is to demonstrate that data mining can learn from statistics - that, to a large extent, statistics is fundamental to what data mining is really trying to achieve. And to illustrate the wasted effort which will result if the vast body of statistical knowledge is not integrated into data mining.

Statistics developed over the course of the 20th century, starting in a pre-computer era when data sets that we would now describe as 'large' simply did not exist. This meant that many of the tools needed to handle them - to answer questions relating to them - did not exist either, which is one reason why a new discipline has grown up. It is something of an indictment of the statistical profession that so few statisticians have become deeply involved with data mining. An attitude of superiority is unacceptable. Yes, statisticians have a lot to teach data miners. But data miners have many fascinating new problems which statisticians have not even begun to look at. Arnold is right to criticise data miners for their ignorance of statistics. But he might, with just as much justification, have criticised statisticians for their lack of involvement in the exciting new problems which data mining is tackling.

There is the opportunity for an immensely rewarding synergy between statisticians and data miners. It would be good to see researchers from both communities come together to pool their distinct perspectives and approaches, and to tackle the really important problems facing us in this modern data-rich world. It would be nice to see a mix of disciplines at KDD-2002, with stimulating and productive discussions as researchers bring their particular skills and viewpoints to problems that we all want to solve.

David Hand

Hand D.J. (1997) Intelligent data analysis: issues and opportunities. In Advances in Intelligent Data Analysis: Reasoning about Data, ed. Xiaohui Liu, Paul Cohen, and Michael Berthold. Berlin: Springer, 1-14. Reprinted in Intelligent Data Analysis, 2, 67-79, 1998.

Hand D.J. (1998) Breaking misconceptions - statistics and its relationship to mathematics (with discussion). Journal of the Royal Statistical Society, Series D, 47, 245-250 and 284-286.

Hand D.J. (1998) Data mining: statistics and more? The American Statistician, 52, 112-118.

Hand D.J. (1999) Data mining: new challenges for statisticians. Proceedings of the ASC International Conference, 1999, ed. C. Christie, J. Francis, et al. Association for Survey Computing, 21-29. Also in Social Science Computer Review (2000), 18, 442-449.

Hand D.J. (2000) Methodological issues in data mining. COMPSTAT 2000: Proceedings in Computational Statistics, ed. J.G. Bethlehem and P.G.M. van der Heijden. Physica-Verlag, 77-85.

___________________________

Professor David J. Hand
Department of Mathematics
The Huxley Building
Imperial College
180 Queen's Gate
London SW7 2BZ, UK
http://stats.ma.ic.ac.uk/~djhand/

