------------------------------------------------------------------------------
                                 ABSTRACTS
                        Russ Greiner's recent papers
                                [from Aug'90]
------------------------------------------------------------------------------
Available from public ftp site:
    ftp://scr.siemens.com/pub/learning/Papers/greiner/
(or see http://www.cs.toronto.edu/~greiner/ )

{In reverse chronological order}

------------------------------------------------------------------------------
33. Exploiting the Omission of Irrelevant Data
    Russell Greiner, Adam Grove and Alexander Kogan

Most learning algorithms work most effectively when their training data
contain *completely specified* labeled samples. In many diagnostic tasks,
however, the data will include the values of only some of the attributes; we
model this as a *blocking* process that hides the values of those attributes
from the learner. While blockers that remove the values of critical
attributes can handicap a learner, this paper instead focuses on blockers
that remove only *irrelevant* attribute values, ie, values that are *not
needed to classify an instance*, given the values of the other unblocked
attributes. We first motivate and formalize this model of
``superfluous-value blocking,'' and then demonstrate that these omissions
can be useful, by proving that certain classes that seem hard to learn in
the general PAC model --- viz., decision trees and DNF formulae --- are
trivial to learn in this setting. We also show that this model can be
extended to deal with (1) theory revision (ie, modifying an existing
formula); (2) blockers that occasionally include superfluous values or
exclude required values; and (3) other corruptions of the training data.

(to appear, Artificial Intelligence, 1997.) { superfluous-journal.ps }
[Expanded from a paper in "Proceedings of the Thirteenth International
Conference on Machine Learning (IMLC96)", Bari, Italy, July 1996
{ exploit-omit-imlc96.ps }; see also R. Bharat Rao, Russell Greiner and Tom
Hancock, "Exploiting the Absence of Irrelevant Information: What You Don't
Know *Can* Help You", Proceedings of the AAAI Fall Symposium on `Relevance',
New Orleans, November 1994 { superfluous.ps }.]
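
To make the blocking model concrete: if the target concept is a decision
tree, an idealized superfluous-value blocker reveals exactly the attribute
values tested on the path that classifies each instance, and hides the rest.
A minimal Python sketch under that assumption (the tree encoding and names
are ours for illustration, not the paper's):

    def superfluous_blocker(tree, example):
        """Hide exactly the irrelevant values: reveal only the attributes
        tested on the root-to-leaf path that classifies `example`.
        `tree` is either a class label (leaf) or a pair
        (attribute, {value: subtree})."""
        visible = {}
        node = tree
        while isinstance(node, tuple):       # descend until we reach a leaf
            attr, children = node
            visible[attr] = example[attr]    # needed for the decision: unblocked
            node = children[example[attr]]
        return visible, node                 # blocked sample and its label

Each blocked training sample thus displays one root-to-leaf path of the
target tree, which suggests why decision trees become easy to learn in this
setting.

------------------------------------------------------------------------------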
32. Computational Learning Theory and Natural Learning Systems --- Volume IV:
    Making Learning Systems Practical
    R. Greiner, T. Petsche, and S. Hanson (eds)

From the introduction: This is the fourth and final volume of papers from a
series of workshops called "Computational Learning Theory and `Natural'
Learning Systems." The purpose of the workshops was to explore the emerging
intersection of theoretical learning research and natural learning systems.
The workshops drew researchers from three historically distinct styles of
learning research: computational learning theory, neural networks, and
machine learning (a subfield of AI). [...] The editors divide the twenty-one
contributions into four sections. The first three cover critical problem
areas: (1) scaling up from small problems to realistic ones with large input
dimensions, (2) increasing efficiency and robustness of learning methods,
and (3) developing strategies to obtain good generalization from limited or
small data samples. The fourth section discusses examples of real-world
learning systems.

(available from MIT Press, 1997 (ISBN 0-262-57118-8)
{http://www-mitpress.mit.edu/mitp/recent-books/comp/greop.html} )

------------------------------------------------------------------------------
31. Learning Bayesian Nets that Perform Well
    R. Greiner, A. Grove, D. Schuurmans

A Bayesian net (BN) is more than a succinct way to encode a probabilistic
distribution; it also corresponds to a function used to answer queries. A BN
can therefore be evaluated by the accuracy of the answers it returns. Many
algorithms for learning BNs, however, attempt to optimize another criterion
(usually likelihood, possibly augmented with a regularizing term), which is
independent of the distribution of queries that are posed. This paper takes
the ``performance criteria'' seriously, and considers the challenge of
computing the BN whose performance --- read ``accuracy over the distribution
of queries'' --- is optimal. We show that many aspects of this learning task
are more difficult than the corresponding subtasks in the standard model.

(appears in "Proceedings of the Thirteenth Conference on Uncertainty in
Artificial Intelligence (UAI97)", Providence, Aug 1997)
{ ggs-bnpw-uai97.ps }

------------------------------------------------------------------------------
30. Why Experimentation can be better than `Perfect Guidance'
    T. Scheffer, R. Greiner and C. Darken

Many problems correspond to the classical control task of determining the
appropriate control action to take, given some (sequence of) observations.
One standard approach to learning these control rules, called *behavior
cloning*, involves watching a perfect operator operate a plant, and then
trying to emulate its behavior. In the *experimental learning* approach, by
contrast, the learner first guesses an initial observation-to-action policy
and tries it out. If this policy performs sub-optimally, the learner can
modify it to produce a new policy, and recur. This paper discusses the
relative effectiveness of these two approaches, especially in the presence
of perceptual aliasing, showing in particular that the experimental learner
can often learn more effectively than the cloning one.

(appears in "Proceedings of the Fourteenth International Conference on
Machine Learning (IMLC-97)", Nashville, July 1997)
{ sgd-experiment-ml97.ps }

------------------------------------------------------------------------------
29. The Complexity of Revising Logic Programs
    Russell Greiner

A rule-based program will return a set of answers to each query. An *impure*
program, which includes the Prolog cut ``!'' and ``not(.)'' operators, can
return different answers if its rules are re-ordered. There are also many
reasoning systems that return only the *first* answer found for each query;
these first answers, too, depend on the rule order, even in pure rule-based
systems. A theory revision algorithm, seeking a revised rule-base whose
*expected accuracy*, over the distribution of queries, is optimal, should
therefore consider modifying the order of the rules. This paper first shows
that a polynomial number of training ``labeled queries'' (each a query
paired with its correct answer) provides the distribution information
necessary to identify the optimal ordering. It then proves, however, that
the task of determining which ordering is optimal, once given this
distributional information, is intractable even in trivial situations; eg,
even if each query is an atomic literal, we are seeking only a ``perfect''
theory, and the rule base is propositional. We also prove that this task is
not even approximable: Unless $P=NP$, no polynomial time algorithm can
produce an ordering of an $n$-rule theory whose accuracy is within
$n^{\gamma}$ of optimal, for some $\gamma > 0$.
We next prove similar hardness, and non-approximability, results for the
related tasks of determining, in these impure contexts, (1) the optimal
*ordering of the antecedents*; (2) the optimal set of *new rules to add*;
and (3) the optimal set of *existing rules to delete*.

(accepted, The Journal of Logic Programming, 1997.) { impure-journal.ps }
[Expanded from R. Greiner, "The Challenge of Revising an Impure Theory",
Proceedings of the Twelfth International Conference on Machine Learning
(ML95), Lake Tahoe, July 1995 { impure-ml95.ps }.]

------------------------------------------------------------------------------
28. Learning Active Classifiers
    Russell Greiner, Adam Grove and Dan Roth

Many classification algorithms are ``passive'', in that they assign a
class-label to each instance based only on the description given, even if
that description is incomplete. In contrast, an *active* classifier can ---
at some cost --- obtain the values of missing attributes, before deciding
upon a class label. The expected utility of using an active classifier
depends on both the cost required to obtain the additional attribute values
and the penalty incurred if it outputs the wrong classification. This paper
considers the problem of *learning* near-optimal active classifiers, using a
variant of the probably-approximately-correct (PAC) model. After defining
the framework --- which is perhaps the main contribution of this paper ---
we describe a situation where this task can be achieved efficiently, but
then show that the task is often intractable.

(appeared in "Proceedings of the Thirteenth International Conference on
Machine Learning (IMLC96)", Bari, Italy, July 1996)
{ active-class-imlc96.ps }

------------------------------------------------------------------------------
27. Practical PAC Learning
    Dale Schuurmans and Russell Greiner

We present new strategies for ``probably approximately correct'' (pac)
learning that use fewer training examples than previous pac-learning
techniques, while maintaining exactly the same worst-case guarantees. We
achieve this improvement by using *sequential* learning procedures that
observe training examples one-at-a-time, and decide ``on-line'' whether to
halt and return a hypothesis, or continue training. We first provide a
theoretical worst-case upper-bound on the average number of training
examples these procedures will use, and show that it is often better than
the fixed number required by previous ``batch'' learning algorithms.
However, the efficiency of our approach is primarily established by a series
of experiments which show a sequential learning algorithm that uses many
times fewer training examples in practice than previous techniques. These
results are robust to changes in the target concept, domain distribution,
etc., and moreover, the reductions are obtained with little (additional)
computational overhead. Overall, these results demonstrate how pac-learning
can be achieved far more efficiently in practice than previously thought,
even in the worst case, countering the claim that pac-learning can never be
feasibly achieved in real applications.

(appeared in "Proceedings of the Fourteenth International Joint Conference
on Artificial Intelligence (IJCAI95)", Montreal, August 1995)
{ pracpac-ijcai95.ps }
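
One way to see how a sequential learner can stop early is the standard
conversion of a mistake-bound ("on-line") learner into a sequential PAC
learner: halt as soon as the current hypothesis survives enough consecutive
examples without error. A minimal sketch of that conversion, not the paper's
algorithm; the online_learner interface here is hypothetical, and the
learner is assumed to be conservative (it changes its hypothesis only on
mistakes):

    import math

    def sequential_pac_learn(examples, online_learner, eps, delta, mistake_bound):
        """Halt once the current hypothesis survives k consecutive examples;
        k is set so each of the <= mistake_bound + 1 hypotheses that might
        ever be current gets failure budget delta / (mistake_bound + 1)."""
        k = math.ceil(math.log((mistake_bound + 1) / delta) / eps)
        streak = 0
        for x, y in examples:                  # examples arrive one at a time
            if online_learner.predict(x) == y:
                streak += 1
                if streak >= k:                # hypothesis looks eps-good: stop
                    return online_learner.hypothesis()
            else:
                online_learner.update(x, y)    # mistake: revise, reset streak
                streak = 0
        return online_learner.hypothesis()     # ran out of data

A hypothesis with error above eps survives k examples with probability below
(1-eps)^k <= delta/(mistake_bound+1), so a union bound over the at most
mistake_bound+1 candidate hypotheses yields the usual (eps, delta) guarantee.

------------------------------------------------------------------------------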
26. The Challenge of Revising an Impure Theory
    Russell Greiner

A *pure* rule-based program will return a set of answers to each query; and
will return the same answer set even if the rules are re-ordered. To work
effectively, however, many real-world programs are ``impure'', ie, include
the Prolog cut ``!'' and not(.) operators. The answers returned by such
programs *do* depend on the rule ordering. There are also many reasoning
systems that return only the *first* answer found for each query; these
first answers, too, depend on the rule order, even in *pure* rule-based
systems. A theory revision algorithm, seeking the revised rule-based system
that works best in such impure contexts, should therefore consider modifying
the order of the rules, in addition to adding and deleting rules and
antecedents. Such revision algorithms apply a sequence of modifications,
seeking a rule-base whose expected accuracy, over the distribution of
queries, is optimal. This paper first shows that a polynomial number of
training labeled queries (each a query coupled with its correct answer) is
sufficient to provide the distribution information necessary to identify the
optimal ordering. It then proves, however, that the task of determining
which ordering is optimal, once given this information, is intractable even
in trivial situations; eg, even if each query is an atomic literal, we are
seeking only a ``perfect'' theory, and the knowledge base is propositional.
We also prove that this task is not even approximable: Unless P=NP, no
polynomial time algorithm can produce an ordering whose accuracy is within
$n^{\epsilon}$ of optimal, for some $\epsilon > 0$, where $n$ is the number
of rules. We also prove similar hardness, and non-approximability, results
for the related tasks of determining, in these impure contexts, (1) the
optimal ordering of the antecedents; (2) the optimal set of rules to add or
(3) to delete; and (4) the optimal priority values for a set of defaults.

(appeared in "Proceedings of the Twelfth International Conference on Machine
Learning (ML95)", Lake Tahoe, July 1995)
{ impure-ml95.ps -- see also impure-journal.ps }

------------------------------------------------------------------------------
25. Sequential PAC Learning
    Dale Schuurmans and Russell Greiner

A "PAC-learning algorithm" uses labeled samples to produce a hypothesis
whose misclassification rate is at most $\epsilon$, with probability at
least $1-\delta$. While standard PAC-algorithms work in "batch" mode,
producing a hypothesis after examining a *fixed* number of labeled samples,
this report considers *sequential* PAC-algorithms, which observe one labeled
sample at a time, and can decide to stop, and return a hypothesis, after
seeing any set of samples. We first consider learning when the learner knows
the sample distribution, providing first bounds on the "expected sample
size" of such algorithms (ie, expected number of samples until the algorithm
terminates), then necessary and sufficient conditions under which one can
learn $\epsilon$-good hypotheses *with certainty* (ie, with $\delta = 0$),
and finally "truncated sequential learners", which require only (at most) a
specified number of samples to produce an appropriate
$(\epsilon,\delta)$-good hypothesis. We next consider distribution
*independent* learning, and provide bounds in terms of the concept class's
mistake bound and then its VC-dimension.
While our theoretical results prove that the worst-case expected sample
complexity for sequential learners scales in essentially the same manner as
the sample complexity of fixed-sample-size learners in most situations
(meaning essentially that the same classes can be PAC-learned in both
models), we present empirical evidence that shows that a sequential learner
can require *significantly* fewer samples in practice, often by orders of
magnitude.

(appeared in "Proceedings of the Eighth Annual Conference on Computational
Learning Theory (COLT95)", Santa Cruz, July 1995.)
{ seqpac-colt95.ps }

------------------------------------------------------------------------------
24. The Complexity of Theory Revision
    Russell Greiner

A knowledge-based expert system uses its database (a.k.a. its "theory") to
produce answers to the queries it receives. Unfortunately, these systems
will produce inaccurate answers if their theories include erroneous
information. Standard "theory revision" systems use a given set of "labeled
queries" (each a query paired with its correct answer) to incrementally
transform the given theory, by adding and/or deleting rules and/or
antecedents, into a related one that is as accurate as possible. This paper
uses both sample and computational complexity arguments to analyze this
process. After formally defining the theory revision task, we use the
observation that we have access to only a limited number of labeled queries
to constrain the space of new theories that can be considered; in
particular, we describe when a polynomial number of such queries will be
sufficient to obtain the information required to identify a theory whose
accuracy is close to optimal with high probability. We then consider the
computational complexity of finding the best theory within this space, and
prove that, unless $P=NP$, no polynomial time algorithm can identify this
near-optimal theory, even given the exact distribution of queries. We also
present conditions under which no polynomial-time algorithm can produce a
theory whose (in)accuracy is even close (ie, within a particular polynomial
factor) to optimal. These results justify the standard practice of using
these types of transformations to hill-climb to a locally-optimal theory,
based on a given set of labeled samples.

(appeared in "Proceedings of the Fourteenth International Joint Conference
on Artificial Intelligence (IJCAI95)", Montreal, 1995)
{ comp-tr-ijcai95.ps; comp-tr-AIJ.ps is the current version of the paper
accepted, subject to revision, by the "Artificial Intelligence" journal }

------------------------------------------------------------------------------
23. Learning to Select Useful Landmarks
    Russell Greiner and Ramana Isukapalli

To navigate effectively, an autonomous agent must be able to quickly and
accurately determine its current location. Given an initial estimate of its
position (perhaps based on dead-reckoning) and an image taken of a known
environment, our agent first attempts to locate a set of landmarks
(real-world objects at known locations), then uses their angular separation
to obtain an improved estimate of its current position. Unfortunately, some
landmarks may not be visible, or worse, may be confused with other
landmarks, resulting in both time wasted in searching for invisible
landmarks, and in further errors in the agent's estimate of its position.
To address these problems, we propose a method that uses previous
experiences to learn a selection function that, given the set of landmarks
that might be visible, returns the subset which can probably be found
correctly, and so provides an accurate registration of the agent's position.
We use statistical techniques to prove that the learned selection function
is, with high probability, effectively at a local optimum in the space of
such functions. This report also presents empirical evidence, using
real-world data, that demonstrates the effectiveness of our approach.

(appeared in "Proceedings of the Annual National Conference on Artificial
Intelligence (AAAI-94)", Seattle, August 1994)
{ useful-lms-aaai.ps }

---------------- ---------------- ---------------- ----------------
23a. Learning to Select Useful Landmarks
     Russell Greiner and Ramana Isukapalli

(A revised journal-length version of 23 above. Appeared in "IEEE
Transactions on Systems, Man and Cybernetics", June 1996, pp. 437--449.)
{ useful-lms-smc.ps }

------------------------------------------------------------------------------
22. Theory Revision in Fault Hierarchies
    Pat Langley, George Drastal, R. Bharat Rao and Russell Greiner

The fault hierarchy representation is widely used in expert systems for the
diagnosis of complex mechanical devices. On the assumption that an
appropriate bias for a knowledge representation language is also an
appropriate bias for learning in this domain, we have developed a theory
revision method that operates directly on a fault hierarchy. This domain
presents several challenges: A typical training instance is missing most
feature values, and the pattern of missing features is significant, rather
than merely an effect of noise. Moreover, features that are present may be
corrupted. Finally, the accuracy of a candidate theory is measured by
considering both the sequence of tests required to arrive at a diagnosis,
and its agreement with the diagnostic endpoints provided by an expert. We
describe the DELTA algorithm for theory revision of fault hierarchies, its
application in knowledge base maintenance, and report on experiments with a
diagnostic system in current use.

(appeared in "The Fifth International Workshop on Principles of Diagnosis
(Dx94)", New York, October 1994.)
{ th-rev.ps } [see also th-rev-update.ps]

------------------------------------------------------------------------------
21. Learning Default Concepts
    Dale Schuurmans and Russell Greiner

Classical concepts, based on necessary and sufficient defining conditions,
cannot classify logically insufficient object descriptions. Many reasoning
systems avoid this limitation by using "default concepts" to classify
incompletely described objects. This paper addresses the task of learning
such *default concepts* from observational data. We first model the
underlying performance task --- classifying incomplete examples --- as a
probabilistic process that passes random test examples through a "blocker"
that can hide object attributes from the classifier. We then address the
task of learning accurate default concepts from random training examples.
After surveying the learning techniques that have been proposed for this
task in the machine learning and knowledge representation literatures, and
investigating their relative merits, we present a more data-efficient
learning technique, developed from well-known statistical principles.
Finally, we extend Valiant's PAC-learning framework to this context and
obtain a number of useful learnability results.

(appeared in "Proceedings of the Tenth Canadian Conference on Artificial
Intelligence (CSCSI-94)", Banff, May 1994.)
{ default.ps }
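
The blocker model above is easy to state operationally: each test example
passes through a process that may hide attribute values, and a default
concept must still commit to a label. A toy Python sketch of the performance
task only (not the paper's learning technique; the rule representation here
is our own):

    import random

    def block(example, p_hide=0.3, rng=random):
        """The blocker: independently hide each attribute value with
        probability p_hide (None marks a hidden value)."""
        return {a: (None if rng.random() < p_hide else v)
                for a, v in example.items()}

    def default_classify(blocked, rules, default_label):
        """A toy default concept: fire the first rule whose conditions are
        all visible and satisfied; otherwise fall back on the default.
        Revealing more attributes can change the label, so classification
        is nonmonotonic, a behavior that entry 19 below formalizes."""
        for conditions, label in rules:
            if all(blocked.get(a) == v for a, v in conditions.items()):
                return label
        return default_label

------------------------------------------------------------------------------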
(appeared in "Proceedings of the Tenth Canadian Conference on Artificial Intelligence (CSCSI-94)", Banff, May 1994.) { default.ps } ------------------------------------------------------------------------------ 20. Fast Distribution-Specific Learning Dale Schuurmans and Russell Greiner PAC-learning results are often criticized for demanding impractically large training samples. The common wisdom is that these large samples follow from the worst case nature of the analysis, and therefore PAC-learning, though desirable, must not be a practical goal. We however consider an alternative view: perhaps these large sample sizes are due to the presumed learning strategies which make inefficient use of the available training data. To demonstrate this, we consider *sequential* learning strategies that autonomously decide when to stop training based on observing training examples as they arrive. We show that for distribution specific learning these algorithms require far fewer training examples (on average) than existing fixed sample size approaches, and are able to learn with *certainty* not just high probability. In fact, a simple sequential strategy is *optimally* efficient in many cases. (appears in "Computational Learning Theory and Natural Learning Systems: Making Learning Systems Practical", MIT Press, 1997.) { fast-disp-spec.ps } ------------------------------------------------------------------------------ 19. Learning to Classify Incomplete Examples Dale Schuurmans and Russell Greiner An inductive inference system uses labeled training examples to learn a classification function for labeling subsequent unlabeled performance examples. Most theoretical analyses assume that both training and performance examples are complete, in that the value of every attribute is known to both learner and classifier. Real-world data, however, is usually incomplete. This paper addresses this discrepancy by formally analyzing the task of learning to classify incompletely specified performance examples, based on completely- and incompletely-specified training examples, respectively. This formalism requires an extension of the classical notion of concept definition, called "default concept definition" (dcd), whose classification behavior can be nonmonotonic. We first present a formal account of these dcds and show that they are similar, but not identical, to important existing ideas from both learnability and AI knowledge representation formalisms. We next define a generalization of Valiant's probabilistic model of instance presentation that allows attribute values to be hidden from the classifier. Finally, we use this model to develop an extension to the PAC learning framework that can learn dcds from examples, and prove a number of learnability results, both positive and negative, which collectively suggest the appropriate ways of treating missing attribute values. (appears in "Computational Learning Theory and Natural Learning Systems: Making Learning Systems Practical", MIT Press, 1997.) { missing.ps } ------------------------------------------------------------------------------ 18. Adaptive Derivation Processes R. Greiner Many reasoning systems must reach conclusions based on stored information; we can often model this as deriving logical conclusions from a given knowledge base of facts. We of course prefer derivation systems that draw all and only the correct conclusions, and that reach these conclusions as quickly as possible. 
Unfortunately, a sound and complete derivation process can be intractable,
if not undecidable, in the worst case. This position paper discusses the
general challenge of producing a derivation process that is as effective as
possible, and argues for using a (cautious) adaptive derivation process
here.

(appeared in "Workshop on Knowledge Compilation and Speedup Learning",
Amherst, Massachusetts, June 1993.)
{ adaptive.ps }

------------------------------------------------------------------------------
16. Probably Approximately Optimal Satisficing Strategies
    Russell Greiner and Pekka Orponen

A "satisficing search problem" consists of a set of probabilistic
experiments to be performed in some order, seeking a satisfying
configuration of successes and failures. The expected cost of performing the
experiments depends both on the success probability of the individual
experiments, and on the "search strategy", which specifies the order in
which the experiments are to be performed. A strategy is optimal if its
expected cost is minimal. Earlier work has provided "optimizing functions"
that find optimal strategies for certain classes of search problems, when
given the probabilities. This paper extends those results by addressing the
complexities of identifying an approximately optimal strategy when the
probability values are not known, within a fully general model. We present
an algorithm, PAO, that first obtains estimates of the probabilities by
observing a computed number of trials of each undetermined experiment, and
then uses these estimates (and the proper optimizing function) to identify a
strategy whose cost is, with high probability, close to optimal. We also
show that PAO can gather the necessary statistics while solving relevant
problems.

(appeared in "Artificial Intelligence", 82:1--2, April 1996, pp. 21--44.)
{ pao.ps }

------------------------------------------------------------------------------
15. Book Review: Building Large Knowledge-Based Systems: Representation and
    Inference in the CYC Project
    Russell Greiner and Charles Elkan

The book under review here, "Building Large Knowledge-Based Systems:
Representation and Inference in the Cyc Project", describes progress so far
in an attempt to build a system that is intended to exhibit general
common-sense reasoning ability. This review first discusses aspects of the
Cyc system, with a focus on important decisions made in designing its
knowledge representation language, and on how claims about the performance
of the system might be validated. The review then turns to the book itself,
discussing both its merits and its faults.

(appeared in "Artificial Intelligence", 61:1 (1993) 41--52.)
{ cyc.ps }

------------------------------------------------------------------------------
14. Learning an Optimally Accurate Representational System
    Russell Greiner and Dale Schuurmans

A default theory can sanction different, mutually incompatible, answers to
certain queries. We can identify each such theory with a set of related
credulous theories, each of which produces but a single response to each
query, by imposing a total ordering on the defaults. Our goal is to identify
the credulous theory with optimal "expected accuracy", averaged over the
natural distribution of queries in the domain. There are two obvious
complications: First, the expected accuracy of a theory depends on the query
distribution, which is usually not known. Second, the task of identifying
the optimal theory, even given that distribution information, is
intractable.
This paper presents a method, OptAcc, that side-steps these problems by
using a set of samples to estimate the unknown distribution, and by
hill-climbing to a local optimum. In particular, given any error and
confidence parameters $\epsilon, \delta > 0$, OptAcc produces a theory whose
expected accuracy is, with probability at least $1-\delta$, within
$\epsilon$ of a local optimum.

(appeared in "ECAI Workshop on Theoretical Foundations of Knowledge
Representation and Reasoning", Springer Verlag (LNAI series), 1993.)
{ accuracy-ecai.ps }
[Expanded from the workshop proceedings, Vienna, August 1992.]
{ accurate.ps }

------------------------------------------------------------------------------
13. A Statistical Approach to Solving the EBL Utility Problem
    Russell Greiner and Igor Jurisica

Many "learning from experience" systems use information extracted from
problem solving experiences to modify a performance element PE, forming a
new one PE' that can solve these and similar problems more efficiently.
However, as transformations that improve performance on one set of problems
can degrade performance on other sets, the new PE' is not always better than
the original PE; this depends on the distribution of problems. We therefore
seek the performance element whose expected performance, over this
distribution, is optimal. Unfortunately, the actual distribution, which is
needed to determine which element is optimal, is usually not known.
Moreover, the task of finding the optimal element, even knowing the
distribution, is intractable for most interesting spaces of elements. This
paper presents a method, PALO, that side-steps these problems by using a set
of samples to estimate the unknown distribution, and by using a set of
transformations to hill-climb to a local optimum: Given any parameters
$\epsilon, \delta > 0$, PALO produces an element PE' whose expected
performance is, with probability $\geq 1-\delta$, within $\epsilon$ of a
local optimum. This process is based on a mathematically rigorous form of
utility analysis: in particular, it uses statistical techniques to determine
whether the result of a proposed transformation will be better than the
original system. We also present an efficient way of implementing this
learning system in the context of a general class of performance elements,
and include empirical evidence that this approach can work effectively.

(appeared in "Proceedings of the Annual National Conference on Artificial
Intelligence (AAAI-92)", San Jose, July 1992.)
{ utility.ps }

------------------------------------------------------------------------------
12. A Correction to the Algorithm in Reiter's Theory of Diagnosis
    Russell Greiner, Barbara A. Smith and R. W. Wilkerson

Reiter [1987] has developed a general theory of diagnosis based on first
principles. His algorithm computes all diagnoses which explain the
differences between the predicted and observed behavior of a given system.
Unfortunately, Reiter's description of the algorithm is incorrect in that
some diagnoses can be missed under certain conditions. This note presents a
revised algorithm and a proof of its correctness.

(appeared in "Artificial Intelligence", 41:1 (79--88), November 1989;
reprinted in "Readings in Model-based Diagnosis", edited by W. Hamscher,
J. de Kleer and L. Console, Morgan Kaufmann, 1992.)
{ correct.ps }
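
In Reiter's theory, each diagnosis is a minimal hitting set of the
collection of conflict sets, and the algorithm in question computes them via
a pruned HS-tree; the correction repairs a pruning step that could discard
valid diagnoses. The specification the corrected algorithm must meet fits in
a few lines of Python (brute force, for illustration only, not the HS-tree
construction itself):

    from itertools import chain, combinations

    def minimal_hitting_sets(conflicts):
        """Return all minimal hitting sets of a collection of conflict sets;
        in Reiter's theory these are exactly the diagnoses."""
        universe = sorted(set(chain.from_iterable(conflicts)))

        def hits(s):
            return all(s & c for c in conflicts)

        found = []
        for r in range(len(universe) + 1):     # smallest candidates first,
            for cand in map(set, combinations(universe, r)):
                if hits(cand) and not any(h <= cand for h in found):
                    found.append(cand)         # so every survivor is minimal
        return found

    # e.g. minimal_hitting_sets([{"A1", "A2"}, {"A2", "A3"}])
    #   -> [{'A2'}, {'A1', 'A3'}]

------------------------------------------------------------------------------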
11. Learning Efficient Query Processing Strategies
    Russell Greiner

A query processor QP uses the rules in a rule base to reduce a given query
to a series of attempted retrievals from a database of facts. The QP's
"expected cost" is the time it requires to find an answer, averaged over its
anticipated set of queries. This cost depends on the QP's "strategy", which
specifies the order in which it considers the possible rules and retrievals.
This paper provides two related learning algorithms, PIB and PAO, for
improving the QP's strategy, ie, for producing new strategies with lower
expected costs. Each algorithm first monitors the QP's operations over a set
of queries, observing how often each path of rules leads to a sufficient set
of successful retrievals, and then uses these statistics to suggest a new
strategy. PIB hill-climbs to strategies that are, with high probability,
successively better; and PAO produces a new strategy that is probably
approximately optimal. We describe how to implement both learning systems
unobtrusively, discuss their inherent time and space complexities, and use
methods from mathematical statistics to prove their correctness. We also
discuss additional applications of these approaches to several other
database tasks.

(appeared in "Eleventh Symposium on Principles of Database Systems
(PODS-92)", San Diego, June 1992)
{ qp.ps }

------------------------------------------------------------------------------
10. Probabilistic Hill-Climbing: Theory and Applications
    Russell Greiner

Many learning tasks search through a space of possible performance elements,
seeking an element whose expected utility (ie, utility averaged over a
distribution P) is high. As the task of finding the globally optimal element
is usually intractable, many practical learning systems use hill-climbing to
find a local optimum. Unfortunately, even this is difficult, as the
distribution P is typically unknown. This paper addresses the task of
approximating this hill-climbing search when the utility function can only
be estimated by sampling. We present and prove correct an algorithm that,
with provably high probability, returns an element arbitrarily close to a
local optimum. We then demonstrate the generality of this algorithm by
sketching three meaningful applications that find an element whose
efficiency, accuracy or completeness (respectively) is nearly optimal. These
results suggest approaches to solving the utility problem from
explanation-based learning, the multiple extension problem from
nonmonotonic reasoning and the tractability/accuracy tradeoff problem from
knowledge representation.

(appeared in "Proceedings of the Ninth Canadian Conference on Artificial
Intelligence (CSCSI-92)", Vancouver, May 1992)
{ phc-applic.ps }

---------------- ---------------- ---------------- ----------------
10a. PALO: A Probabilistic Hill-Climbing Algorithm
     Russell Greiner

Many learning systems search through a space of possible performance
elements, seeking an element whose expected utility, over the distribution
of problems, is high. As the task of finding the globally optimal element is
usually intractable, many practical learning systems use hill-climbing to
find a local optimum. Unfortunately, even this is problematic as the
underlying distribution of problems, required to determine an element's
expected utility and hence essential for comparing different elements, is
typically unknown.
This paper addresses the task of approximating this hill-climbing search
when the utility function can only be estimated by sampling. We present an
algorithm that returns an element that is, with provably high probability,
essentially a local optimum. We then demonstrate the generality of this
algorithm by sketching three meaningful applications that respectively find
an element whose efficiency, accuracy or completeness is nearly optimal.
These results suggest approaches to solving the utility problem from
explanation-based learning, the multiple extension problem from
nonmonotonic reasoning and the tractability/completeness tradeoff problem
from knowledge representation.

(A revised journal-length version of 10 above. Appeared in "Artificial
Intelligence", 84:1--2, July 1996, pp. 177--204.)
{ palo-aij.ps }

------------------------------------------------------------------------------
9. A Formal Analysis of Solution Caching
   Vinay Chaudhri and Russell Greiner

Many knowledge representation systems and deductive databases store and
maintain the conclusions found during a derivation process, in a form that
allows these conclusions to be used during subsequent derivations. As this
approach, called "solution caching", allows the system to avoid repeating
these derivations, it can reduce the system's overall cost for query
answering. Unfortunately, as there is a cost for storing these conclusions,
it is not always desirable to cache every solution found --- this depends on
whether the savings achieved by performing fewer inference steps for these
future queries exceed the storage overhead incurred. This paper formally
characterizes this tradeoff, and presents an efficient algorithm that
produces an optimal caching strategy: i.e., given an inference graph of a
knowledge base, the anticipated number of queries at each node, and various
parameters, it determines which nodes in the inference graph should cache
their solutions.

(appeared in "Proceedings of the Ninth Canadian Conference on Artificial
Intelligence (CSCSI-92)", Vancouver, May 1992)
{ cache.ps }

------------------------------------------------------------------------------
8. Learning Useful Horn Approximations
   Russell Greiner and Dale Schuurmans

While the task of answering queries from an arbitrary propositional theory
is intractable in general, it can typically be performed efficiently if the
theory is Horn. This suggests that it may be more efficient to answer
queries using a "Horn approximation"; ie, a Horn theory that is semantically
similar to the original theory. The utility of any such approximation
depends on how often it produces answers to the queries that the system
actually encounters; we therefore seek an approximation whose expected
"coverage" is maximal. Unfortunately, there are several obstacles to
achieving this goal in practice: (i) the optimal approximation depends on
the query distribution, which is typically not known a priori; (ii)
identifying the optimal approximation is intractable, even given the query
distribution; and (iii) the optimal approximation might be too large to
guarantee tractable inference. This paper presents an approach that
overcomes (or side-steps) each of these obstacles. We define a learning
process, AdComp, that uses observed queries to estimate the query
distribution "online", and then uses these estimates to hill-climb,
efficiently, in the space of size-bounded Horn approximations, until
reaching one that is, with provably high probability, effectively at a local
optimum.

(appeared in "Proceedings of the Third International Conference on Knowledge
Representation and Reasoning (KR92)", Boston, October 1992.)
[Expanded from "Learning Near-Optimal Horn Approximations", in "AAAI Spring
Symposium on Knowledge Assimilation", Stanford, March 1992.]
{ horn.ps }
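
The tractability claim that motivates Horn approximations is easy to
exhibit: entailment of an atom from a definite Horn theory can be decided by
forward chaining in polynomial time. A minimal sketch (our own toy encoding,
not the AdComp algorithm):

    def horn_entails(rules, facts, query):
        """Decide whether a definite Horn theory entails an atom, by forward
        chaining; `rules` is a list of (body, head) pairs, where each body
        is a set of atoms."""
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for body, head in rules:
                if head not in derived and all(b in derived for b in body):
                    derived.add(head)          # rule fires: head is entailed
                    changed = True
        return query in derived

    # e.g. horn_entails([({"a", "b"}, "c"), ({"c"}, "d")], {"a", "b"}, "d")
    #   -> True

Each atom is added at most once, so the loop runs polynomially many times;
that is the efficiency an approximation is meant to buy.

------------------------------------------------------------------------------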
(appeared in "Proceedings of the Third International Conference on Knowledge Representation and Reasoning (KR92)", Boston, October 1992.) [Expanded from "Learning Near-Optimal Horn Approximations", in "AAAI Spring Symposium on Knowledge Assimilation", Stanford, March 1992.] { horn.ps } ------------------------------------------------------------------------------ 7. Probabilistic Hill-Climbing William W. Cohen, Russell Greiner and Dale Schuurmans Many learning tasks involve searching through a discrete space of performance elements, seeking an element whose future utility is expected to be high. As the task of finding the global optimum is often intractable, many practical learning systems use simple forms of hill-climbing to find a locally optimal element. However, hill-climbing can be complicated by the fact that the utility value of a performance element can depend on the distribution of problems, which typically is unknown. This paper formulates the problem of performing hill-climbing search in settings where the required utility values can only be estimated on the basis of their performance on random test cases. We present and prove correct an algorithm that returns a performance element that is arbitrarily close to a local optimum with arbitrarily high probability. (appeared in "Computational Learning Theory and Natural Learning Systems", Volume 2, MIT Press, 1993.) { clnl-phc.ps } ------------------------------------------------------------------------------ 6. Measuring and Improving the Effectiveness of Representations Russell Greiner and Charles Elkan This report discusses what it means to claim that a representation is an effective encoding of knowledge. We first present dimensions of merit for evaluating representations, based on the view that usefulness is an external property, and is necessarily relative to a specified task. We then provide methods (based on results of mathematical statistics) for reliably measuring effectiveness empirically, and hence for comparing different representations. We also discuss weak but guaranteed methods of improving inadequate representations. Our results are an application of the ideas of formal learning theory to concrete knowledge representation formalisms. (appeared in "Proceedings of the Twelfth International Joint Conference on Artificial Intelligence (IJCAI91)", August 1991) { ijcai-repn.ps } ------------------------------------------------------------------------------ 5. Probably Approximately Optimal Derivation Strategies Russell Greiner and Pekka Orponen An inference graph can have many "derivation strategies", each a particular ordering of the steps involved in reducing a given query to a sequence of database retrievals. An "optimal strategy" for a given distribution of queries is a complete strategy whose "expected cost" is minimal, where the expected cost depends on the conditional probabilities that each requested retrieval succeeds, given that a member of this class of queries is posed. This paper describes the PAO algorithm that first uses a set of training examples to approximate these probability values, and then uses these estimates to produce a "probably approximately optimal" strategy --- i.e., given any $\epsilon, \delta>0$, PAO produces a strategy whose cost is within $\epsilon$ of the cost of the optimal strategy, with probability greater than $1-\delta$. 
4. Classical and Logic-Based Dynamic Observers
   P.E. Caines, R. Greiner and S. Wang

This report formulates the state estimation problem for a finite partially
observed input-state-output automaton in terms of an "observer automaton"
whose nodes each correspond to the set of states consistent with a
particular sequence of observations. It next provides several complexity
results about these observers --- bounding their convergence time and size.
The final part shows how to encode these observers in predicate calculus,
and illustrates some of the advantages of this approach.

(appeared in "IMA Journal on Control and Information", 8 (45--80), 1991.)

------------------------------------------------------------------------------
3. Finding Optimal Derivation Strategies in Redundant Knowledge Bases
   Russell Greiner

A backward chaining process uses a collection of rules to reduce a given
goal to a sequence of data-base retrievals. A "derivation strategy" is an
ordering on these steps, specifying when to use each rule and when to
perform each retrieval. Given the costs of reductions and retrievals, and
the *a priori* likelihood that each particular retrieval will succeed, one
can compute the expected cost of any strategy, for answering a specific
query from a given knowledge base. Smith [ControlBC] presents an algorithm
that finds the minimal cost strategy in time (essentially) linear in the
number of rules, for any disjunctive, irredundant knowledge base. This paper
proves that the addition of redundancies renders this task NP-hard. Many
Explanation-Based Learning systems work by adding in redundancies; this
shows the complexities inherent in their task.

(appeared in "Artificial Intelligence", 50:1 (1991) 95--115.)
{ redund.ps }

------------------------------------------------------------------------------
2. A Distributed Plan Verifier
   Yan Xiao and Russell Greiner

This paper describes the architecture of an efficient plan verifier that can
first detect faults in a planner's plans, and then use these observed errors
to identify possible problems in the planner's knowledge base and suggest
appropriate corrections.

(appeared in "Proceedings of Third UNB Artificial Intelligence Workshop",
Oct 1990.)
{ distrib.ps }

------------------------------------------------------------------------------
1. On the Sample Complexity of Finding Good Search Strategies
   Russell Greiner and Pekka Orponen

A satisficing search problem consists of a set of probabilistic experiments
to be performed in some order, without repetitions, until a satisfying
configuration of successes and failures has been reached. The cost of
performing the experiments depends on the order chosen. Earlier work has
concentrated on finding optimal search strategies in special cases of this
model, such as search trees and and-or graphs, when the cost function and
the success probabilities for the experiments are given. In contrast, we
study the complexity of "learning" an approximately optimal search strategy
when some of the success probabilities are not known at the outset.
Working in the fully general model, we show that if $n$ is the number of
unknown probabilities, and $C$ is the maximum cost of performing all the
experiments, then $2(\frac{nC}{\epsilon})^2 \ln \frac{2n}{\delta}$ trials of
each undetermined experiment are sufficient to identify, with confidence
$1-\delta$, a search strategy whose cost is within $\epsilon$ of the
optimal.

(appeared in "Proceedings of the Third Annual Workshop on Computational
Learning Theory", Aug 1990.)
{ colt_search.ps }
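
The bound quoted above is directly computable; a small helper (hypothetical
name, written only to make the quadratic dependence on $nC/\epsilon$
concrete):

    import math

    def trials_needed(n, C, eps, delta):
        """Trials of each undetermined experiment sufficient, per the bound
        in the abstract, to find (with confidence 1-delta) a strategy whose
        cost is within eps of optimal."""
        return math.ceil(2 * (n * C / eps) ** 2 * math.log(2 * n / delta))

    # For example, n=5 experiments, C=10, eps=1, delta=0.05 gives
    # trials_needed(5, 10.0, 1.0, 0.05) == 26492 trials per experiment.

------------------------------------------------------------------------------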