Sequential Learning
Most formally-described learning algorithms first compute
how many training samples are required to learn a target concept in
the worst possible case; they then observe that many training examples,
all at once. In many situations, these worst-case bounds lead to very
inefficient use of the training data. Here, we describe a class of sequential
learning algorithms that each draw one labeled training example at a time,
and autonomously decide when to terminate. We show -- both theoretically
and empirically -- that these algorithms are more efficient (on average)
than the standard "batch" algorithms.
(Note that these algorithms are not on-line learners:
An on-line learner will successively draw unlabeled examples, and
propose a label; it is then penalized for each erroneous label it suggests.
By contrast, our algorithms sequentially draw labeled examples,
and decide when to terminate. They are then penalized by the average error
(over the natural distribution) of the classifier they return.)
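To make the contrast concrete, here is a minimal sketch of the two
protocols in Python, assuming a hypothetical labeled-example oracle and
generic update and stopping rules; the specific rules are what the papers
below supply, so this is an illustration of the protocol, not of any
particular algorithm:

    # A sketch of the sequential protocol described above.  The oracle
    # draw_labeled_example() and the update/stopping rules are hypothetical
    # placeholders, not the specific algorithms from the papers below.
    def sequential_learn(draw_labeled_example, update_hypothesis, should_stop):
        hypothesis, samples_seen = None, 0
        while True:
            x, y = draw_labeled_example()              # one labeled example
            samples_seen += 1
            hypothesis = update_hypothesis(hypothesis, x, y)
            if should_stop(hypothesis, samples_seen):  # learner decides to halt
                return hypothesis, samples_seen

    # Contrast: a batch learner fixes its worst-case sample size m in advance.
    def batch_learn(draw_labeled_example, fit, m):
        return fit([draw_labeled_example() for _ in range(m)])

The batch learner must fix m from a worst-case bound before seeing any
data; the sequential learner's sample count is a random variable that can
be far smaller on benign problems.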
Distribution Specific
Computational Learning Theory and Natural Learning Systems:
Making Learning Systems Practical, MIT Press, 1996.
PAC-learning results are often criticized for demanding impractically large
training samples. The common wisdom is that these large samples follow
from the worst-case nature of the analysis, and that PAC-learning,
though desirable, therefore cannot be a practical goal. We consider an
alternative view: perhaps these large sample sizes are due to the learning
strategies used, which make inefficient use of the available training data.
To demonstrate this, we consider sequential learning strategies
that observe training examples one-at-a-time, and autonomously decide when
to stop training. We present several such algorithms for distribution-specific
learning, and prove that they can require far fewer training examples (on
average) than existing fixed-sample-size approaches, and moreover are able
to learn with certainty, not just with high probability. In fact, we
show a simple sequential strategy that is optimally efficient in
many cases.
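For context, the fixed-sample-size bound that such strategies aim to beat
on average is the textbook worst-case bound; for a consistent learner over
a finite hypothesis class H it reads (this standard bound is quoted here
only for illustration, and is not taken from the paper itself):

    m ≥ (1/ε) · (ln |H| + ln(1/δ))

A batch learner must commit to this m before seeing any data, whereas a
sequential learner pays a comparable price only on the hardest
distributions, and can stop much earlier whenever the observed examples
already pin down an ε-good hypothesis.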
Distribution Independent
(Theory)
Proceedings of the Eighth Annual Conference on Computational
Learning Theory (COLT-95), Santa Cruz, July 1995.
A "PAC-learning algorithm" uses labeled samples to produce a hypothesis
whose misclassification rate is at most ε, with probability at
least 1- δ. While standard PAC-algorithms work in "batch" mode, producing
a hypothesis after examining a fixed number of labeled samples,
this report considers sequential PAC-algorithms, which observe one
labeled sample at a time, and can decide to stop, and return a hypothesis,
after seeing any set of samples. We first consider learning when the learner
knows the sample distribution, providing first bounds on the "expected
sample size" of such algorithms (ie, expected number of samples until the
algorithm terminates), then necessary and sufficient conditions under which
one can learn ε-good hypotheses with certainty (ie, with
δ = 0), and finally "truncated sequential learners", that require
only (at most) a specified number of samples to produce an appropriate
ε, δ-good hypothesis. We next consider distribution independent
learning, and provide bounds in terms of the concept class's mistake bound
and then its VCdimension. While our theoretical results prove that the
worst-case expected sample complexity for sequential learners scales in
essentially the same manner as sample complexity scales for fixed-learners
in most situations (meaning essentially that the same classes can be PAC-learned
in both models), we present empirical evidence that shows that a sequential
learner can require significantly fewer samples in practice; often
by orders of magnitude.
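As a concrete illustration of the mistake-bound connection, here is a
hedged sketch of the classical conversion of a mistake-bounded on-line
learner into a sequential PAC-learner; the online_learner object with
predict/update methods and the exact constants are assumptions made for
illustration, and the report's own algorithms and bounds may differ:

    import math

    def online_to_sequential(draw_labeled_example, online_learner, M, eps, delta):
        # Halt once the current hypothesis survives k consecutive correct
        # predictions: a hypothesis with error > eps survives with probability
        # at most (1 - eps)^k <= delta/(M + 1), and a learner with mistake
        # bound M proposes at most M + 1 distinct hypotheses overall, so a
        # union bound gives total failure probability at most delta.
        k = math.ceil(math.log((M + 1) / delta) / eps)
        streak = 0
        while streak < k:
            x, y = draw_labeled_example()
            if online_learner.predict(x) == y:
                streak += 1
            else:
                online_learner.update(x, y)   # at most M mistakes, by assumption
                streak = 0                    # new hypothesis; restart the count
        return online_learner                 # current hypothesis is (eps, delta)-good

In the worst case this loop uses roughly (M+1)·k examples, matching the
fixed-sample-size scaling noted above, but it exits as soon as the data
cooperate, which is where the practical savings arise.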
Distribution Independent
(Experimental)
Proceedings of the Fourteenth International Joint Conference
on Artificial Intelligence (IJCAI-95), Montreal, August 1995.
We present new strategies for "probably approximately correct" (PAC)
learning that use fewer training examples than previous PAC-learning techniques,
while maintaining exactly the same worst-case guarantees. We achieve this
improvement by using *sequential* learning procedures that observe training
examples one-at-a-time, and decide "on-line" whether to halt and return
a hypothesis, or to continue training. We first provide a theoretical worst-case
upper bound on the average number of training examples these procedures
use, and show that it is often better than the fixed number required
by previous "batch" learning algorithms. However, the efficiency of our
approach is primarily established by a series of experiments showing that
a sequential learning algorithm can use many times fewer training examples
in practice than previous techniques. These results are robust to changes
in the target concept, domain distribution, etc., and moreover the reductions
are obtained with little additional computational overhead. Overall,
these results demonstrate that PAC-learning can be achieved far more efficiently
in practice than previously thought, even in the worst case, countering
the claim that PAC-learning can never feasibly be achieved in real applications.