Framework
In a typical "inductive learning task", the learner uses a given set
of "labeled instances" (eg, each might be a description of a particular
patient, labeled with the correct diagnosis) to learn a classifier
that will accurately label new (unlabeled) instances drawn from the same
distribution. (Here, this means other patients in this population.)
Much of the formal work is in the context of
"PAC Learning":
The learner is given some class of possible classifiers
(eg, perceptrons, or decision-trees, or ...) as well as error and confidence
constants. It then draws a sufficient number of labeled instances
(from an oracle that produces such instances according to the underlying
distribution, and labels each with the correct label, based on the unknown
target classifier), which it uses to identify a classifier from the given
class. Any classifier will have some error --- which is the probability
that it will misclassify an instance, over instances drawn from the underlying
distribution. (Note the target concept presumably has 0 error.)
For certain classes, we can guarantee that the classifier returned by the
learner will, with high probability, have small error.
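For a finite class of classifiers, this guarantee can be made concrete with the
standard sample-complexity bound: m >= (1/epsilon)(ln |H| + ln(1/delta)) labeled
instances suffice for any learner that returns a classifier consistent with the
sample. A minimal sketch (the bound is standard; the function name is our own):

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Number of labeled instances sufficient for a consistent learner
    over a finite class of `hypothesis_count` classifiers to return,
    with probability at least 1 - delta, a classifier whose error is
    at most epsilon (the standard finite-class PAC bound)."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta))
                     / epsilon)
```

For example, a class of 2^10 classifiers with epsilon = 0.1 and delta = 0.05
needs only about 100 instances; note the dependence on |H| is only logarithmic.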
We often measure the quality of a performance system (or ``agent'') by
how well it performs on average --- eg, how often an expert system
returns the appropriate diagnosis, or a web crawling agent finds the most
relevant articles, etc. A learner's task is to find the best such
agent, often from a large space of possible agents (eg, the learner may
be seeking the best parameter setting, or the most appropriate set of heuristics
to use, etc.). As it is often difficult, or intractable, to find
the globally optimal agent, many practical learning systems instead hill-climb
to a local optimum. Even this task is problematic, as the hill-climber
must know the distribution of tasks that will be encountered to decide
whether to climb from one agent to another;
unfortunately, this information is typically not known a priori.
The Palo algorithm approximates this hill-climbing search when
the ``utility function'' (used to evaluate each agent's performance) can
only be estimated by sampling. It can efficiently return an agent
that is, with high probability, essentially a local optimum.
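The idea behind such statistically-guided hill-climbing can be sketched as
follows (a simplified illustration of the general technique, not the actual
Palo implementation): estimate the utility difference between the current agent
and a neighboring agent by sampling tasks, and move only once a Hoeffding-style
confidence interval shows the observed advantage is statistically real.

```python
import math

def hoeffding_radius(n, delta, utility_range):
    """Half-width of a (1 - delta) confidence interval on the mean of
    n samples bounded in a range of width `utility_range`
    (Hoeffding's inequality)."""
    return utility_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def climb_step(current, neighbor, utility, delta=0.05, max_samples=10000):
    """Decide, by sampling, whether `neighbor` outperforms `current`.
    `utility(agent)` returns that agent's utility, in [0, 1], on one
    randomly drawn task.  Returns the agent that is, with probability
    at least 1 - delta, at least as good as the other."""
    diffs = []
    while len(diffs) < max_samples:
        diffs.append(utility(neighbor) - utility(current))
        n = len(diffs)
        mean = sum(diffs) / n
        eps = hoeffding_radius(n, delta, utility_range=2.0)  # diffs lie in [-1, 1]
        if mean - eps > 0:
            return neighbor   # confidently better: climb
        if mean + eps < 0:
            return current    # confidently not better: stay
    return current  # could not distinguish within the sample budget
```

Iterating `climb_step` over neighbors until none is confidently better yields an
agent that is, with high probability, essentially a local optimum.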
Making efficient use of training samples
Many learners work in batch mode: returning a classifier only after observing
a specified (large) number of labeled training instances. Note this
quantity of instances must be sufficient to PAC-learn any
classifier in the class, given any sequence of training instances.
We consider sequential learners, which can observe these instances
one-at-a-time and decide autonomously whether to halt and return a hypothesis,
or continue training. We prove that these algorithms require many
fewer training samples (on average) than previous PAC-learning techniques,
while maintaining the exact same distribution-free worst-case guarantees.
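One way a sequential learner can decide when to halt is a ``survival run''
stopping rule: refit whenever the current hypothesis misclassifies a fresh
instance, and halt once it has correctly predicted enough consecutive fresh
instances that, by a union bound over runs, the PAC guarantee holds. A sketch
(the run lengths and the delta-splitting scheme below are one simple choice,
not the specific algorithm from our work):

```python
import math

def sequential_learn(draw_labeled, fit, epsilon, delta):
    """Sequential PAC learner.  `draw_labeled()` returns one labeled
    instance (x, y); `fit(examples)` returns a hypothesis consistent
    with `examples`.  Halts once the current hypothesis survives r
    consecutive fresh instances, with r chosen so a hypothesis of
    error > epsilon survives run i with probability <= delta / 2**i;
    summing over runs, the total failure probability is <= delta."""
    examples = []
    run = 0
    while True:
        run += 1
        h = fit(examples)
        # run length: (1 - eps)**r <= exp(-eps * r) <= delta / 2**run
        r = math.ceil((run * math.log(2.0) + math.log(1.0 / delta)) / epsilon)
        survived = 0
        while survived < r:
            x, y = draw_labeled()
            if h(x) == y:
                survived += 1
            else:
                examples.append((x, y))
                break  # refit and start a new, longer run
        else:
            return h  # survived the whole run: PAC guarantee holds
```

On easy targets this halts after far fewer instances than the worst-case batch
bound requires, while retaining the same distribution-free guarantee.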
Most theoretical analyses assume that both training and performance examples
are complete --- ie, that the value of every attribute is known to both
learner and classifier. As noted throughout the learning and data-mining
communities, real-world data is usually incomplete. We address
this discrepancy by formally analyzing the task of learning to classify
incompletely specified performance examples --- considering, for example,
the questions
- If the desired classification algorithm must classify partially-specified
  instances, which learning algorithm should be used?
- Should this learning algorithm use partially-specified instances
  (exactly like the ones its classifier will have to classify), or instances
  that have been ``filled in'' (ie, which have no missing values)?
Most classification algorithms are ``passive'', in that they assign a class-label
to each instance based only on the description given, even if that description
is incomplete. In contrast, an active classifier can
--- at some cost --- obtain the values of missing attributes, before deciding
upon a class label. This can be particularly useful when considering, for
example, sending a web-crawler out to find some critical information,
or performing some expensive information-gathering test or experiment.
The expected utility of using an active classifier depends on both the
cost required to obtain the additional attribute values and the penalty
incurred if the classifier outputs the wrong classification. We analyze
the problem of learning optimal
active classifiers, using a variant of the probably-approximately-correct
(PAC) model.
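As a toy illustration of this tradeoff, consider a single binary attribute:
querying it is worthwhile exactly when the expected reduction in
misclassification penalty exceeds its cost. (The function names and
probabilities below are illustrative assumptions, not from our analysis.)

```python
def expected_utility_passive(p_positive, penalty):
    """Best passive action: predict the more likely label;
    expected utility is -penalty * P(wrong)."""
    return -penalty * min(p_positive, 1.0 - p_positive)

def expected_utility_active(p_value, p_pos_given, cost, penalty):
    """Pay `cost` to observe a binary attribute, then predict the more
    likely label given its value.  p_value = P(attribute = 1);
    p_pos_given[v] = P(label = 1 | attribute = v)."""
    exp_penalty = sum(
        p * penalty * min(p_pos_given[v], 1.0 - p_pos_given[v])
        for v, p in ((1, p_value), (0, 1.0 - p_value)))
    return -cost - exp_penalty

def should_query(p_positive, p_value, p_pos_given, cost, penalty):
    """Query the attribute iff doing so has higher expected utility."""
    return (expected_utility_active(p_value, p_pos_given, cost, penalty)
            > expected_utility_passive(p_positive, penalty))
```

For instance, with a perfectly predictive attribute and a penalty of 10, paying
a cost of 1 to query is worthwhile, but paying 6 is not; an optimal active
classifier must make this comparison for every reachable information state.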
Effectively Exploiting Domain and Context Information
Most learning and data-mining algorithms build new classifiers ``from scratch''.
This is clearly inefficient if one already has a good, but not completely
correct, classifier (or theory); here, a more efficient learner would instead
begin with that initial theory, and revise it as required to accommodate
new, more trusted information. This is the essence of the Machine
Learning area of ``Theory Revision''.
In addition to implementing and evaluating a (now deployed) theory
revision system, we have studied both the sample and computational complexity
of this task, proving, for example, that it is not even approximable
--- ie, assuming P ≠ NP, no efficient algorithm can find a revision
that is even close to optimal.