The effect of class distribution on classifier learning
Technical documentation   Open access

Gary M. Weiss and Foster Provost
Rutgers University
2001
DOI: https://doi.org/10.7282/t3-v9kt-9510

Abstract

Many of today's large data sets must be reduced in size before inductive algorithms are applied, both because of the costs of procuring and processing the data and because most of these algorithms cannot handle enormous amounts of data. In these cases it is important to select the training data carefully so that the impact on classifier performance is minimized. A tacit assumption behind much research on classifier induction is that the class distribution of the training data should match the "natural" distribution of the data. In this paper we analyze the relationship between training class distribution and classifier performance on 25 data sets and show that the natural distribution is usually not the best distribution for learning: a different class distribution should generally be chosen when the data set size must be limited. We also explain how changing the class distribution of the training set affects classifier learning and why one training distribution might be better than another.
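To make the setting concrete, the following is a minimal sketch of what choosing a training class distribution under a fixed size budget could look like for a two-class problem. It is not taken from the report: the function name `sample_with_class_distribution` and its parameters are hypothetical, and simple random sampling without replacement is an assumption made for illustration only.

```python
import random


def sample_with_class_distribution(examples, labels, positive_label,
                                   positive_fraction, n, seed=0):
    """Draw n examples without replacement so that roughly
    positive_fraction of them carry positive_label.

    Hypothetical helper for illustration; not the report's procedure.
    """
    if not 0.0 <= positive_fraction <= 1.0:
        raise ValueError("positive_fraction must be between 0 and 1")

    rng = random.Random(seed)

    # Split example indices by class.
    pos = [i for i, y in enumerate(labels) if y == positive_label]
    neg = [i for i, y in enumerate(labels) if y != positive_label]

    # Fix the total size n and divide it between the two classes.
    n_pos = round(n * positive_fraction)
    n_neg = n - n_pos
    if n_pos > len(pos) or n_neg > len(neg):
        raise ValueError("not enough examples of one class "
                         "for the requested distribution")

    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)
    return [examples[i] for i in chosen], [labels[i] for i in chosen]


# Example: a pool with a 10% "natural" positive rate, reduced to
# 100 training examples with a balanced class distribution instead.
pool_labels = [1] * 100 + [0] * 900
pool_examples = list(range(1000))
train_x, train_y = sample_with_class_distribution(
    pool_examples, pool_labels, positive_label=1,
    positive_fraction=0.5, n=100)
```

Holding `n` fixed while varying `positive_fraction` gives training sets of equal size that differ only in class distribution, which is the kind of comparison the abstract describes.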
Author's Original (AO) Open Access