Abstract
Many of today's large data sets must be reduced in size before inductive algorithms can be applied, both because of the costs of procuring and processing the data and because most of these algorithms cannot handle enormous amounts of data. In these cases it is important to select the training data carefully so that the impact on classifier performance is minimized. A tacit assumption behind much research on classifier induction is that the class distribution of the training data should match the "natural" distribution of the data. In this paper we analyze the relationship between training class distribution and classifier performance on 25 data sets and show that the natural distribution is usually not the best distribution for learning; a different class distribution should generally be chosen when the data set size must be limited. We also explain how changing the class distribution of the training set affects classifier learning and why one training distribution might be better than another.