Poll Results: More data or Better Algorithm? (KDnuggets News 08:08, item 1, Features)

KDnuggets : News : 2008 : n08 : item1

Features

From: Gregory Piatetsky-Shapiro
Date: 21 Apr 2008
Subject: Poll Results: More data or Better Algorithm?

The previous KDnuggets Poll asked:
What will usually give better improvement in data mining results:
More data or Better algorithm?

45% voted for more data, while 20% voted for a more advanced algorithm, confirming my rule of thumb:

More data (especially more relevant features) produces a larger improvement than a more advanced algorithm
(especially in the initial stages of the project).

Of course, as with all such general sayings, a lot depends on specifics:

Dean Abbott commented: ... just more data is not enough, but better features (particularly multi-variate features) can provide significant model improvement.

Greg Safarz wrote: More attributes and features wins hands down.
When data is very limited (as in many medical applications, where fewer than 100 patients is typical), more advanced algorithms are needed.

Jozo Kovac wrote: But what are "results"? Model accuracy, model benefits in the real world, newly extracted knowledge (rules) about your customers?

Alexandru Floares suggested: If the number of cases is less than 10 times the number of features, and the data quality is reasonable, adding data can improve accuracy. If the data quality is low, adding data can also improve accuracy by increasing the number of informative cases that remain in the data set after pre-processing or cleaning the initial data.
On the algorithm side, balancing unbalanced data (e.g. two classes: Class A 10% and Class B 90%) and using ensemble methods (boosting, bagging, etc.) can each improve the accuracy of the results.
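
As a rough illustration of the two heuristics in this comment, here is a minimal Python sketch. The function names (cases_per_feature_is_low, oversample_minority) are hypothetical, and random oversampling is only one possible way to balance classes; the comment does not say which balancing method was meant. The factor of 10 is taken from the comment and is a rule of thumb, not a guarantee.

```python
import random
from collections import Counter


def cases_per_feature_is_low(n_cases, n_features, min_ratio=10):
    """Return True when there are fewer than min_ratio cases per feature.

    min_ratio=10 mirrors the rule of thumb quoted above; it is a heuristic.
    """
    return n_cases < min_ratio * n_features


def oversample_minority(rows, labels, seed=0):
    """Balance classes by randomly duplicating rows of the smaller classes.

    This is plain random oversampling, chosen here only for illustration.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Rows belonging to this class, used as the sampling pool.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


if __name__ == "__main__":
    # 80 cases and 20 features: fewer than 10 cases per feature.
    print(cases_per_feature_is_low(n_cases=80, n_features=20))

    # The 10% / 90% example from the comment, with made-up row ids.
    labels = ["A"] * 10 + ["B"] * 90
    rows = list(range(100))
    _, balanced_labels = oversample_minority(rows, labels)
    print(Counter(balanced_labels))
```

In practice, oversampling would be applied to the training split only, so that duplicated cases do not leak into the evaluation data.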

For full results and more interesting comments, see KDnuggets 2008 Poll: More data or Better algorithm?
