
Publications

From: Weiss, Gary M, BMCIO gmweiss@att.com
Date: Tue, 13 Feb 2001 10:59:56 -0500
Subject: Weiss on how to deal with the case where one class is rare

Building models when one class is rare can be quite difficult, in part because the advice one hears often rests on unstated assumptions. To make things concrete, imagine we have a training set with 9,900 examples of class "0" and 100 examples of class "1" (a class ratio of 99:1). What training distribution will yield the best model?

The most natural approach is to use all of the data. In this situation, most learning algorithms will generate a classifier that always predicts the majority class and never predicts the minority class. Because this is a trivial model, most people will consider it a poor model. However, the generated model is likely the best model given the assumptions that the training distribution will be the same as the testing distribution and that the goal of the learner is to maximize predictive accuracy-- the trivial model achieves 99% accuracy in the above scenario.
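As a minimal Python sketch of that point (the counts are the ones above; the code itself is only an illustration, not anything from the paper), the trivial majority-class model scores 99% accuracy while never identifying a single minority example:

n_majority, n_minority = 9900, 100          # class "0" and class "1" counts
labels = [0] * n_majority + [1] * n_minority

predictions = [0] * len(labels)             # trivial model: always predict the majority class

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
minority_recall = sum(p == 1 and y == 1 for p, y in zip(predictions, labels)) / n_minority

print(f"accuracy: {accuracy:.2%}")                  # 99.00%
print(f"recall on class 1: {minority_recall:.0%}")  # 0%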

The real problem is that the test distribution may not be the same as the training distribution, and, more importantly, predictive accuracy is often a poor evaluation metric. Accuracy is appropriate only if the cost of a false positive equals the cost of a false negative. However, when the class distribution is highly skewed this is almost never the case-- the rare class is generally considered "more important". That is, assuming the rare class is the positive class, the cost of a false negative will generally be much higher than the cost of a false positive. For example, the cost of diagnosing a person with a treatable form of cancer as being healthy (a false negative) will generally be much higher than the cost of diagnosing a healthy person as having cancer (a false positive); in the latter case a more thorough but expensive test will be run, which will lead to the correct diagnosis.
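A short sketch of what evaluating by cost rather than accuracy looks like (the cost values and error counts here are hypothetical, chosen only to illustrate the point): once a false negative costs much more than a false positive, a less accurate model can easily have lower total cost than the trivial one.

COST_FN = 50.0   # assumed cost of a false negative (e.g. a missed diagnosis)
COST_FP = 1.0    # assumed cost of a false positive (e.g. an extra follow-up test)

def total_cost(fn, fp):
    """Total misclassification cost for given false-negative/false-positive counts."""
    return fn * COST_FN + fp * COST_FP

# Hypothetical error counts on the 10,000-example set above: the trivial model
# misses every class-"1" case; another model catches most of them at the price
# of some false alarms, and is far cheaper despite being less accurate.
print("trivial model cost:       ", total_cost(fn=100, fp=0))    # 5000.0
print("cost-sensitive model cost:", total_cost(fn=20, fp=300))   # 1300.0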

Thus, in theory, given enough computational power, the best choice is to use all of the data (extra data shouldn't hurt) but use a learning algorithm whose evaluation function matches the real-world characteristics of the problem. In practice this might mean passing cost information to the algorithm. Unfortunately, not all learning algorithms accept cost information-- and often this information is not even known. If the learning algorithm cannot handle cost information, one can compensate by modifying the class distribution. If the ratio of costs is 2:1 (misclassifying a minority example is twice as costly as misclassifying a majority example), then you might want to oversample the minority class so the class ratio is twice what it was before. However, in an ideal world, this would never be necessary.
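Here is one simple way such a compensation could be coded (a sketch only; the function name and the choice of sampling with replacement are illustrative, not a prescription from the paper):

import random

def oversample_minority(majority, minority, cost_ratio, seed=0):
    """Return a training set in which the minority class has been sampled
    (with replacement) so that its ratio to the majority class is cost_ratio
    times what it was originally."""
    rng = random.Random(seed)
    target = int(round(len(minority) * cost_ratio))
    boosted = [rng.choice(minority) for _ in range(target)]
    return majority + boosted

# 9,900 majority and 100 minority examples, as in the scenario above; with a
# 2:1 cost ratio the minority share roughly doubles (200 : 9,900).
majority = [(i, 0) for i in range(9900)]
minority = [(i, 1) for i in range(100)]
train = oversample_minority(majority, minority, cost_ratio=2.0)
print(len(train), sum(1 for _, label in train if label == 1))   # 10100 200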

Foster Provost and I have a draft paper called "The Effects of Class Distribution on Classifier Learning" that empirically determines the best class distribution for learning on 25 datasets. In it we use two evaluation metrics-- accuracy and the area under the ROC curve (which measures the quality of a model over a range of class distributions and costs). However, our study assumes that there is a cost associated with getting/cleaning/processing the data, so we actually answer the question: "given a training set of size n, what is the best class distribution for learning?" We find that the naturally occurring class distribution generally does not even maximize predictive accuracy, and that a class ratio close to 1:1 generally maximizes the area under the ROC curve. This result helps explain the common practice of handling a skewed distribution by generating a balanced dataset and learning from that-- it produces a model that performs well over a range of class distributions and costs. However, our result only holds when the dataset size is fixed and the class ratio can be varied-- I still believe that in theory it is best to use as much data as possible but modify the evaluation criteria used by the learning algorithm.
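For concreteness, here is a rough sketch of the fixed-n question; it is not the experimental setup from the paper (which uses C4.5 and 25 real datasets)-- the synthetic data and the scikit-learn decision tree are stand-ins, so the numbers will not reproduce our findings. It simply holds the training-set size constant, varies the minority fraction, and scores each model by accuracy and by area under the ROC curve on a test set with the natural skew:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20000, weights=[0.99, 0.01], random_state=0)
X_pool, y_pool = X[:15000], y[:15000]        # pool to draw training sets from
X_test, y_test = X[15000:], y[15000:]        # test set keeps the natural skew

rng = np.random.RandomState(0)
n = 2000                                     # fixed training-set size
for minority_frac in (0.01, 0.10, 0.30, 0.50):
    n_min = int(n * minority_frac)
    min_idx = rng.choice(np.where(y_pool == 1)[0], n_min, replace=True)
    maj_idx = rng.choice(np.where(y_pool == 0)[0], n - n_min, replace=True)
    idx = np.concatenate([min_idx, maj_idx])
    tree = DecisionTreeClassifier(random_state=0).fit(X_pool[idx], y_pool[idx])
    acc = accuracy_score(y_test, tree.predict(X_test))
    auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])
    print(f"minority fraction {minority_frac:.2f}: accuracy={acc:.3f}, AUC={auc:.3f}")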

Our paper is available on-line in .pdf and .ps format. We would greatly appreciate any comments you may have. The URLs are:

http://www.cs.rutgers.edu/~gweiss/papers/class-distr.pdf
http://www.cs.rutgers.edu/~gweiss/papers/class-distr.ps

I'd like to mention two things that may help clarify some issues. In our paper we introduce the notion of separate learning curves for the minority and majority classes. This allows us to measure the benefit of adding additional minority vs. majority class examples. We generally find that, starting with the naturally occurring distribution, adding minority class examples yields a large improvement in the minority class learning curve, while adding majority class examples generally yields a small improvement (this should not be surprising, since we start with more data points for the majority class). However, if predictive accuracy is the evaluation criterion, one may still want to add majority class examples, since there will be more majority than minority examples in the test set.
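The per-class learning-curve idea can be sketched along these lines (again illustrative only-- synthetic data and a scikit-learn tree rather than the paper's setup, and minority-class recall standing in for the per-class curve): start from a base set at the natural distribution, add k extra examples of one class or the other, and compare how much each kind of addition helps the minority class.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=30000, weights=[0.99, 0.01], random_state=1)
X_tr, y_tr = X[:20000], y[:20000]
X_te, y_te = X[20000:], y[20000:]

rng = np.random.RandomState(1)
base = rng.choice(len(y_tr), 2000, replace=False)   # base set, natural distribution

def minority_recall_after_adding(extra_class, k):
    """Minority-class recall after adding k extra examples of the given class."""
    pool = np.where(y_tr == extra_class)[0]
    idx = np.concatenate([base, rng.choice(pool, k, replace=True)])
    tree = DecisionTreeClassifier(random_state=0).fit(X_tr[idx], y_tr[idx])
    return recall_score(y_te, tree.predict(X_te))

for k in (100, 500, 1000):
    print(f"k={k}: +minority -> {minority_recall_after_adding(1, k):.3f}, "
          f"+majority -> {minority_recall_after_adding(0, k):.3f}")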

Finally, in our paper we cover one important technical concern that is highly relevant to this discussion. When we purposefully change the training class distribution to differ from the test distribution, we compensate by re-evaluating the estimates at each leaf (we use C4.5). For example, if we increase the fraction of minority class examples in the training set so that it is twice that of the natural class distribution, we require twice as many minority class examples as majority class examples at a leaf in order to label the leaf with the minority class. That is, we increase the fraction of minority examples because it allows us to build a classifier that is better able to predict the minority class, but we account for the difference in distributions so that we do not improperly bias the classifier toward predicting the minority class.
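A simplified sketch of that correction follows; the function name and the reduction of the adjustment to a single oversampling factor are illustrative (in the paper the adjustment is applied to the class frequency estimates at each C4.5 leaf):

def corrected_leaf_label(n_majority_at_leaf, n_minority_at_leaf, oversampling_factor):
    """Choose a leaf's class label after the minority class was oversampled.

    oversampling_factor is the minority fraction in the training set divided
    by its fraction in the natural distribution.  Discounting the minority
    count by this factor means that, at a factor of 2, a leaf must contain
    more than twice as many minority as majority examples to be labelled
    with the minority class."""
    adjusted_minority = n_minority_at_leaf / oversampling_factor
    return 1 if adjusted_minority > n_majority_at_leaf else 0

print(corrected_leaf_label(10, 15, oversampling_factor=2.0))   # 0: 15/2 = 7.5 < 10
print(corrected_leaf_label(10, 25, oversampling_factor=2.0))   # 1: 25/2 = 12.5 > 10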

The main point here is that if we do change the distribution, we want to ensure that the model is built to reflect how it will be used.

