Abstract
Addressing noise and uncertainty in training data is an important issue in inductive learning: because training data constitutes the primary basis for generalization, inductive learners are necessarily sensitive to its imperfections. Some of today's more popular off-the-shelf learners ignore the presence of imperfect data or invoke statistically motivated post-processing to compensate for its unwanted effects; none exploits specific knowledge of noise or uncertainty. Several research projects have taken a step in this direction by explicitly addressing noise in training data. Unfortunately, these efforts are limited because they depend on particular models of environmental noise, overly restrictive concept-description languages, and sometimes unrealistic sample complexity. This dissertation describes a knowledge-based approach that uses uncertain reasoning to overcome these limitations. In what follows, learning from imperfect data is formulated as the search for a hypothesis with maximum {\em a posteriori} probability. Implementing the search as incremental probabilistic-evidence combination extends the range of useful uncertainty models to those described by discrete probability distributions. From this formulation I built a novel conjunction learner and, from it, an iterative DNF learner. On standard datasets, where strong knowledge is unavailable, the DNF learner is competitive with conventional learners. In experiments using synthetic data, where strong knowledge is available, the knowledge-based learners are superior to their more familiar, conventional counterparts. To demonstrate that problem-specific uncertainty models can be engineered and used effectively in practical problems, the evidence-combination approach was applied to a difficult open problem in molecular biology: learning to recognize promoter sequences in {\em E.~coli}.
Earlier efforts notwithstanding, the inherent uncertainty about the location of the biologically active regions in the raw DNA data invalidates the direct application of many standard inductive learning methods. Here, knowledge from molecular biology was instead used to engineer models of three domain uncertainties and a mapping from raw sequence data to a plausible and focused evidence representation. The evidence-combination approach then yields classifiers that are accurate and credible, and the best yet developed for this important problem.