Sequential Learning
Most formally-described learning algorithms first compute
how many training samples are required to learn a target concept in
the worst possible case; they then observe that many training examples,
all at once. In many situations, these worst-case bounds lead to very
inefficient use of the training data. Here, we describe a class of sequential
learning algorithms that each draw one labeled training example at a time,
and autonomously decide when to terminate. We show -- both theoretically
and empirically -- that these algorithms are more efficient (on average)
than the standard "batch" algorithms.
(Note that these algorithms are not on-line learners:
An on-line learner will successively draw unlabeled examples, and
propose a label; it is then penalized for each erroneous label it suggests.
By contrast, our algorithms sequentially draw labeled examples,
and decide when to terminate. They are then penalized by the average error
(over the natural distribution) of the classifier they return.)
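To make the contrast concrete, here is a minimal sketch of the two
protocols in Python, assuming a hypothetical labeled-example oracle and
generic update and stopping rules; the specific rules are what the papers
below supply, so this is an illustration of the protocol, not of any
particular algorithm:

    # A sketch of the sequential protocol described above.  The oracle
    # draw_labeled_example() and the update/stopping rules are hypothetical
    # placeholders, not the specific algorithms from the papers below.
    def sequential_learn(draw_labeled_example, update_hypothesis, should_stop):
        hypothesis, samples_seen = None, 0
        while True:
            x, y = draw_labeled_example()              # one labeled example
            samples_seen += 1
            hypothesis = update_hypothesis(hypothesis, x, y)
            if should_stop(hypothesis, samples_seen):  # learner decides to halt
                return hypothesis, samples_seen

    # Contrast: a batch learner fixes its worst-case sample size m in advance.
    def batch_learn(draw_labeled_example, fit, m):
        return fit([draw_labeled_example() for _ in range(m)])

The batch learner must fix m from a worst-case bound before seeing any
data; the sequential learner's sample count is a random variable that can
be far smaller on benign problems.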
Distribution Specific
Computational Learning Theory and Natural Learning Systems:
Making Learning Systems Practical, MIT Press, 1996.
PAC-learning results are often criticized for demanding impractically large
training samples. The common wisdom is that these large samples follow
from the worst-case nature of the analysis, and that PAC-learning,
though desirable, therefore cannot be a practical goal. We consider an
alternative view: perhaps these large sample sizes are due to the learning
strategies used, which make inefficient use of the available training data.
To demonstrate this, we consider sequential learning strategies
that observe training examples one-at-a-time, and autonomously decide when
to stop training. We present several such algorithms for distribution-specific
learning, and prove that they can require far fewer training examples (on
average) than existing fixed-sample-size approaches, and moreover are able
to learn with certainty, not just with high probability. In fact, we
show a simple sequential strategy that is optimally efficient in
many cases.
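For context, the fixed-sample-size bound that such strategies aim to beat
on average is the textbook worst-case bound; for a consistent learner over
a finite hypothesis class H it reads (this standard bound is quoted here
only for illustration, and is not taken from the paper itself):

    m ≥ (1/ε) · (ln |H| + ln(1/δ))

A batch learner must commit to this m before seeing any data, whereas a
sequential learner pays a comparable price only on the hardest
distributions, and can stop much earlier whenever the observed examples
already pin down an ε-good hypothesis.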
Distribution Independent
(Theory)
Proceedings of the Eighth Annual Conference on Computational
Learning Theory (COLT-95), Santa Cruz, July 1995.
A "PAC-learning algorithm" uses labeled samples to produce a hypothesis
whose misclassification rate is at most ε, with probability at
least 1- δ. While standard PAC-algorithms work in "batch" mode, producing
a hypothesis after examining a fixed number of labeled samples,
this report considers sequential PAC-algorithms, which observe one
labeled sample at a time, and can decide to stop, and return a hypothesis,
after seeing any set of samples. We first consider learning when the learner
knows the sample distribution, providing first bounds on the "expected
sample size" of such algorithms (ie, expected number of samples until the
algorithm terminates), then necessary and sufficient conditions under which
one can learn ε-good hypotheses with certainty (ie, with
δ = 0), and finally "truncated sequential learners", that require
only (at most) a specified number of samples to produce an appropriate
ε, δ-good hypothesis. We next consider distribution independent
learning, and provide bounds in terms of the concept class's mistake bound
and then its VCdimension. While our theoretical results prove that the
worst-case expected sample complexity for sequential learners scales in
essentially the same manner as sample complexity scales for fixed-learners
in most situations (meaning essentially that the same classes can be PAC-learned
in both models), we present empirical evidence that shows that a sequential
learner can require significantly fewer samples in practice; often
by orders of magnitude.
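As a concrete illustration of the mistake-bound connection, here is a
hedged sketch of the classical conversion of a mistake-bounded on-line
learner into a sequential PAC-learner; the online_learner object with
predict/update methods and the exact constants are assumptions made for
illustration, and the report's own algorithms and bounds may differ:

    import math

    def online_to_sequential(draw_labeled_example, online_learner, M, eps, delta):
        # Halt once the current hypothesis survives k consecutive correct
        # predictions: a hypothesis with error > eps survives with probability
        # at most (1 - eps)^k <= delta/(M + 1), and a learner with mistake
        # bound M proposes at most M + 1 distinct hypotheses overall, so a
        # union bound gives total failure probability at most delta.
        k = math.ceil(math.log((M + 1) / delta) / eps)
        streak = 0
        while streak < k:
            x, y = draw_labeled_example()
            if online_learner.predict(x) == y:
                streak += 1
            else:
                online_learner.update(x, y)   # at most M mistakes, by assumption
                streak = 0                    # new hypothesis; restart the count
        return online_learner                 # current hypothesis is (eps, delta)-good

In the worst case this loop uses roughly (M+1)·k examples, matching the
fixed-sample-size scaling noted above, but it exits as soon as the data
cooperate, which is where the practical savings arise.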
Distribution Independent
(Experimental)
Proceedings of the Fourteenth International Joint Conference
on Artificial Intelligence (IJCAI-95), Montreal, August 1995.
We present new strategies for "probably approximately correct" (PAC)
learning that use fewer training examples than previous PAC-learning techniques,
while maintaining exactly the same worst-case guarantees. We achieve this
improvement by using *sequential* learning procedures that observe training
examples one-at-a-time, and decide "on-line" whether to halt and return
a hypothesis, or to continue training. We first provide a theoretical worst-case
upper bound on the average number of training examples these procedures
use, and show that it is often better than the fixed number required
by previous "batch" learning algorithms. However, the efficiency of our
approach is primarily established by a series of experiments showing that
a sequential learning algorithm can use many times fewer training examples
in practice than previous techniques. These results are robust to changes
in the target concept, domain distribution, etc., and moreover the reductions
are obtained with little additional computational overhead. Overall,
these results demonstrate that PAC-learning can be achieved far more efficiently
in practice than previously thought, even in the worst case, countering
the claim that PAC-learning can never feasibly be achieved in real applications.