Framework
In a typical "inductive learning task", the learner uses a given set
of "labeled instances" (e.g., each might be a description of a particular
patient, labeled with the correct diagnosis) to learn a classifier
that will accurately label new (unlabeled) instances drawn from the same
distribution. (Here, this means other patients in this population.)
Much of the formal work is in the context of
"PAC Learning":
The learner is given some class of possible classifiers
(e.g., perceptrons, or decision trees, or ...) as well as error and confidence
constants. It then draws a sufficient number of labeled instances
(from an oracle that produces such instances according to the underlying
distribution, and labels each with the correct label, based on the unknown
target classifier), which it uses to identify a classifier from the given
class. Any classifier will have some error -- the probability
that it will misclassify an instance, over instances drawn from the underlying
distribution. (Note the target concept presumably has zero error.)
For certain classes, we can guarantee that the classifier returned by the
learner will, with high probability, have small error.
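For a finite class of classifiers and a learner that returns a hypothesis consistent with its sample, the textbook PAC bound gives one such guarantee: m >= (1/epsilon)(ln|H| + ln(1/delta)) labeled instances suffice. A minimal sketch of this calculation (the class size and constants below are illustrative, not taken from the text):

```python
import math

def pac_sample_size(hypothesis_count, epsilon, delta):
    """Textbook PAC bound for a finite class and a consistent learner:
    with m >= (1/epsilon) * (ln|H| + ln(1/delta)) labeled instances,
    any hypothesis consistent with the sample has true error <= epsilon
    with probability at least 1 - delta."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

# e.g., 2^20 candidate classifiers, 5% error, 99% confidence
m = pac_sample_size(2**20, epsilon=0.05, delta=0.01)
print(m)  # 370
```

Note that the bound grows only logarithmically in the number of candidate classifiers, which is why even very large finite classes remain learnable from modest samples.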
We often measure the quality of a performance system (or ``agent'') by
how well it performs on average -- e.g., how often an expert system
returns the appropriate diagnosis, or a web-crawling agent finds the most
relevant articles, etc. A learner's task is to find the best such
agent, often from a large space of possible agents (e.g., the learner may
be seeking the best parameter setting, or the most appropriate set of heuristics
to use, etc.). As it is often difficult, or intractable, to find
the globally optimal agent, many practical learning systems instead hill-climb
to a local optimum. Even this task is problematic, as the hill-climber
must know the distribution of tasks that will be encountered to decide
whether to climb from one agent to another;
unfortunately, this information is typically not known a priori.
The Palo algorithm approximates this hillclimbing search when
the ``utility function'' (used to evaluate each agent's performance) can
only be estimated by sampling. It can efficiently return an agent
that is, with high probability, essentially a local optimum.
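A minimal sketch of the underlying idea (in the spirit of Palo, not the published algorithm itself): estimate the utility difference between the current agent and a neighbor by sampling, and use a Hoeffding confidence interval to decide, with high probability, whether to climb or to give up on that neighbor. All names, constants, and the toy agent space below are illustrative.

```python
import math
import random

def hoeffding_radius(n, delta, value_range):
    # half-width of a (1 - delta) Hoeffding confidence interval on the
    # mean of n i.i.d. samples bounded in an interval of width value_range
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def sampled_hillclimb(start, neighbors, sample, epsilon=0.05, delta=0.05,
                      batch=200, max_batches=50):
    """Hill-climb when each agent's utility can only be estimated by
    sampling.  Climb to a neighbor whose sampled mean utility beats the
    current agent's by a statistically confident margin; halt when no
    neighbor appears more than epsilon better (an approximate local
    optimum).  Utilities are assumed to lie in [0, 1]."""
    current = start
    while True:
        improved = False
        for cand in neighbors(current):
            n, diff_sum = 0, 0.0
            for _ in range(max_batches):
                diff_sum += sum(sample(cand) - sample(current) for _ in range(batch))
                n += batch
                mean = diff_sum / n
                rad = hoeffding_radius(n, delta, value_range=2.0)  # diffs lie in [-1, 1]
                if mean - rad > 0.0:        # confidently better: climb
                    current, improved = cand, True
                    break
                if mean + rad < epsilon:    # confidently not epsilon-better: skip
                    break
            if improved:
                break
        if not improved:
            return current

# toy example: agents 0..3, true utility 0.2 * agent, noisy samples
random.seed(0)
utility = lambda a: min(1.0, max(0.0, 0.2 * a + random.uniform(-0.1, 0.1)))
best = sampled_hillclimb(0, lambda a: [a + 1] if a < 3 else [], utility)
```

The key point this sketch captures is that each climb-or-stop decision is made from samples alone, with the confidence parameter delta controlling how often the search moves in the wrong direction.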
Making efficient use of training samples
Many learners work in batch mode: returning a classifier only after observing
a specified (large) number of labeled training instances. Note this
quantity of instances must be sufficient to PAC-learn any
classifier in the class, given any sequence of training instances.
We consider sequential learners, which can observe these instances
one at a time and decide autonomously whether to halt and return a hypothesis,
or continue training. We prove that these algorithms require many
fewer training samples (on average) than previous PAC-learning techniques,
while maintaining the exact same distribution-free worst-case guarantees.
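One generic way to obtain such a sequential learner (a standard mistake-bound-to-PAC style conversion, shown here only to illustrate the early-stopping idea; it is not the specific construction analyzed in the text): test the current hypothesis on fresh instances and halt once it survives a long enough error-free run, where the required run length grows with the number of updates made so far.

```python
import math
import random

def sequential_pac(draw, update, hypothesis, epsilon, delta):
    """Sequential learner: observe labeled instances one at a time and halt
    autonomously.  If the current hypothesis survives k_i consecutive
    instances without a mistake (where i counts updates so far), return it.
    A hypothesis with error > epsilon survives k_i instances with
    probability < delta / 2^i, so a union bound over i gives: with
    probability >= 1 - delta, the output has error <= epsilon.  Easy
    distributions halt far earlier than a worst-case batch sample size."""
    i, streak = 1, 0
    while True:
        k_i = math.ceil((math.log(1.0 / delta) + i * math.log(2.0)) / epsilon)
        x, y = draw()
        if hypothesis(x) == y:
            streak += 1
            if streak >= k_i:
                return hypothesis
        else:
            hypothesis = update(hypothesis, x, y)
            i, streak = i + 1, 0

# toy target: "x >= 0.5" on uniform [0, 1]; the learner keeps the last
# misclassified positive example as its threshold (a consistent update)
random.seed(1)
def draw():
    x = random.random()
    return x, x >= 0.5

make_h = lambda t: lambda x: x >= t
update = lambda h, x, y: make_h(x)   # mistakes only occur on positive x
h = sequential_pac(draw, update, make_h(1.0), epsilon=0.1, delta=0.05)
```

The stopping rule is data-dependent: when the target is easy to pin down (as in this toy run), the learner halts after a short error-free streak rather than consuming a full worst-case batch.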
Most theoretical analyses assume that both training and performance examples
are complete -- i.e., that the value of every attribute is known to both
learner and classifier. As noted throughout the learning and data-mining
communities, real-world data is usually incomplete. We address
this discrepancy by formally analyzing the task of learning to classify
incompletely specified performance examples -- considering, for example,
the questions

If the desired classification algorithm must classify partially specified
instances, which learning algorithm should be used?

Should this learning algorithm use partially specified instances
(exactly like the ones its classifier will have to classify), or instances
that have been ``filled in'' (i.e., which have no missing values)?
Most classification algorithms are ``passive'', in that they assign a class label
to each instance based only on the description given, even if that description
is incomplete. In contrast, an active classifier can
-- at some cost -- obtain the values of missing attributes before deciding
upon a class label. This can be particularly useful when considering, for
example, sending a web crawler out to find some critical information,
or performing some expensive information-gathering test or experiment.
The expected utility of using an active classifier depends on both the
cost required to obtain the additional attribute values and the penalty
incurred if the classifier outputs the wrong classification. We analyze
the problem of learning optimal
active classifiers, using a variant of the probably-approximately-correct
(PAC) model.
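In the simplest case -- one missing attribute and one chance to buy its value -- the cost/penalty trade-off reduces to a one-step value-of-information comparison. A myopic sketch of that comparison (the numbers and names are hypothetical, and this greedy rule is only an illustration, not the optimal policy from the analysis):

```python
def passive_risk(p_pos, penalty):
    """Expected penalty of the best immediate guess when P(class=1) = p_pos."""
    return penalty * min(p_pos, 1.0 - p_pos)

def should_query(p_pos, penalty, cost, outcomes):
    """One-step (myopic) value-of-information test for an active classifier.
    `outcomes` maps each possible value of the missing attribute to
    (probability of observing that value, updated P(class=1) given it).
    Query iff the expected risk after paying `cost` beats guessing now."""
    risk_now = passive_risk(p_pos, penalty)
    risk_after = cost + sum(pr * passive_risk(p_new, penalty)
                            for pr, p_new in outcomes.values())
    return risk_after < risk_now

# hypothetical diagnosis setting: P(disease) = 0.4, misdiagnosis penalty 100,
# test cost 10; the test is positive 40% of the time and fairly informative
informative = {"pos": (0.4, 0.90), "neg": (0.6, 0.07)}
print(should_query(0.4, 100, 10, informative))  # True: risk 18.2 vs 40
```

A nearly uninformative test (say, posteriors 0.42 and 0.38) would fail this check, so the classifier would answer passively; raising the cost or lowering the penalty has the same effect.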
Effectively Exploiting Domain and Context Information
Most learning and data-mining algorithms build new classifiers ``from scratch''.
This is clearly inefficient if one already has a good, but not completely
correct, classifier (or theory); here, a more efficient learner would instead
begin with that initial theory, and revise it as required to accommodate
new, more trusted information. This is the essence of the Machine
Learning area of ``Theory Revision''.
In addition to implementing and evaluating a (now deployed) theory
revision system, we have studied both the sample and computational complexity
of this task and prove, for example, that it is not even approximable
-- i.e., assuming P != NP, no efficient algorithm can find a revision
that is even close to optimal.