Hengshuai Yao

Email: username@cs.ualberta.ca (username:hengshua)
RLAI Lab: CSC 305,  Computing Science Department, University of Alberta


I am a PhD candidate in the Department of Computing Science at the University of Alberta. I work with Dr. Rich Sutton and Dr. Csaba Szepesvári on reinforcement learning. I also work with Dr. Davood Rafiei and Dr. Rich Sutton on information retrieval and Web search.

Interests

I am interested in Reinforcement Learning, Information Retrieval and Web Search.

What have I done in reinforcement learning? In general, I am interested in making efficient and optimal decisions in unknown environments.
Planning is a model-based approach to decision making that uses models of the world to improve learning and decision making. Examples of planning include dynamic programming, heuristic search, and Dyna. Dyna planning was originally proposed by Dr. Sutton in the 1980s. Dyna is an integrated architecture for acting, learning, and planning, in all of which the state is represented by a lookup table. In 2008, Dr. Sutton and his colleagues generalized Dyna to linear Dyna-style planning to handle problems with large state spaces. The key is that a state is encoded by a set of feature functions. In addition, a set of compressed world models of the actions is built. Linear Dyna-style planning has a sub-procedure that models the world, replacing the experience recording in Dyna. I have a paper on how to implement linear Dyna and do multi-step planning.
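
To make the description of linear Dyna-style planning concrete, here is a minimal sketch for policy evaluation only. It is an illustration under stated assumptions, not the code from my papers or from the LAM package: the function name, the step size, the number of planning steps, and the four-state chain with one-hot features are all hypothetical choices. The model is a matrix F that predicts the next feature vector and a vector b that predicts the reward, both learned by gradient descent; planning then applies TD-style updates to imagined transitions generated by that model.

```python
import numpy as np

def linear_dyna_policy_evaluation(transitions, n_features, gamma=0.9,
                                  alpha=0.5, planning_steps=10,
                                  sweeps=50, seed=0):
    """Sketch of linear Dyna-style planning for evaluating a fixed policy.

    transitions: list of (phi, reward, phi_next) feature-space samples.
    All names and default values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_features)             # value-function weights
    F = np.zeros((n_features, n_features))   # model: expected next features
    b = np.zeros(n_features)                 # model: expected reward
    basis = np.eye(n_features)               # planning starts from unit vectors
    for _ in range(sweeps):
        for phi, r, phi_next in transitions:
            # Direct learning: TD(0) on the real transition.
            td_error = r + gamma * theta @ phi_next - theta @ phi
            theta += alpha * td_error * phi
            # Model learning: gradient steps on the prediction errors.
            F += alpha * np.outer(phi_next - F @ phi, phi)
            b += alpha * (r - b @ phi) * phi
            # Planning: TD-style updates on transitions imagined by the model.
            for i in rng.integers(n_features, size=planning_steps):
                x = basis[i]
                target = b @ x + gamma * theta @ (F @ x)
                theta += alpha * (target - theta @ x) * x
    return theta, F, b

# Hypothetical chain 0 -> 1 -> 2 -> 3 -> terminal, reward 1 on termination.
eye = np.eye(4)
transitions = [(eye[s], 0.0, eye[s + 1]) for s in range(3)]
transitions.append((eye[3], 1.0, np.zeros(4)))  # terminal state: zero features
theta, F, b = linear_dyna_policy_evaluation(transitions, n_features=4)
print(np.round(theta, 3))  # exact values are 0.9**3, 0.9**2, 0.9 and 1
```

With one-hot features this reduces to tabular Dyna; the linear form matters when the number of features is much smaller than the number of states.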
 
In general, the ability to evaluate one way of behaving using data collected by behaving in another way is called off-policy learning. This points to an important idea: we can evaluate many (perhaps millions of) policies with a single stream of data. Off-policy learning is thus an important way of achieving both data efficiency and computational efficiency. It is also much more convenient than on-policy learning. Off-policy learning is more challenging, however, especially with function approximation. Refer to the gradient temporal-difference learning papers by our group (GTD, GTD2, TDC, and GQ) for an overview of the field and the latest algorithms and results. I designed a model-based off-policy learning framework (the first model-based approach to off-policy learning ever), and together with Csaba we provided empirical validation and an error bound. Compared to the GTD algorithms, our method is more data efficient, though the computational complexity is higher because we used LSTD for policy evaluation. A strength of our framework is that one can plug in any off-policy learning algorithm (including the GTD algorithms) for policy evaluation.
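
The idea of evaluating many policies from one batch of data can be sketched as follows. This is a rough illustration in the spirit of the framework, not the published algorithm or the LAM package: the helper names, the ridge term, and the toy four-state chain are assumptions made for the example. A linear model (F[a], f[a]) is fit for each action by least squares from data gathered by a random behaviour policy; any target policy is then evaluated by an LSTD-style solve on transitions predicted by the models.

```python
import numpy as np

def fit_action_models(data, n_features, n_actions, ridge=1e-8):
    """Least-squares linear model per action (assumes each action appears)."""
    F = np.zeros((n_actions, n_features, n_features))  # next-feature models
    f = np.zeros((n_actions, n_features))               # reward models
    for a in range(n_actions):
        rows = [(phi, r, nxt) for phi, act, r, nxt in data if act == a]
        Phi = np.array([phi for phi, _, _ in rows])
        R = np.array([r for _, r, _ in rows])
        Nxt = np.array([nxt for _, _, nxt in rows])
        C = Phi.T @ Phi + ridge * np.eye(n_features)    # small ridge term
        F[a] = np.linalg.solve(C, Phi.T @ Nxt).T
        f[a] = np.linalg.solve(C, Phi.T @ R)
    return F, f

def evaluate_policy(data, F, f, policy, gamma=0.9):
    """LSTD-style evaluation of `policy` on model-predicted transitions."""
    n = F.shape[1]
    A = np.zeros((n, n))
    b = np.zeros(n)
    for phi, _, _, _ in data:
        a = policy(phi)                  # action the target policy would take
        A += np.outer(phi, phi - gamma * F[a] @ phi)
        b += phi * (f[a] @ phi)
    return np.linalg.solve(A, b)

# Hypothetical chain: action 1 moves right, action 0 moves left,
# and the reward is 1 whenever the next state is state 3.
def step(s, a):
    nxt = min(s + 1, 3) if a == 1 else max(s - 1, 0)
    return (1.0 if nxt == 3 else 0.0), nxt

rng = np.random.default_rng(0)
eye = np.eye(4)
data = []
for _ in range(200):                     # one batch, random behaviour policy
    s, a = int(rng.integers(4)), int(rng.integers(2))
    r, nxt = step(s, a)
    data.append((eye[s], a, r, eye[nxt]))

F, f = fit_action_models(data, n_features=4, n_actions=2)
v_right = evaluate_policy(data, F, f, lambda phi: 1)  # always move right
v_left = evaluate_policy(data, F, f, lambda phi: 0)   # always move left
print(np.round(v_right, 3), np.round(v_left, 3))
```

Both target policies are evaluated from the same 200 samples, none of which had to be generated by either policy. The LSTD-style solve could be swapped for another policy evaluation method without changing the model-fitting step.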


Publications

   

Conference and Workshop

Yao, H. and Szepesvári, Cs.  Approximate Policy Iteration with Linear Action Models. Twenty-Sixth Conference on Artificial Intelligence. AAAI. Toronto, Canada. 2012. [pdf]

Yao, H.  Off-policy learning with linear action models: an efficient "One-Collection-For-All-Solution". In the workshop on "Planning and Acting with Uncertain Models" at the 28th ICML, Bellevue, Washington, USA. 2011. [pdf] [slides]

Yao, H.  
Linear least-squares Dyna-style planning. Technical Report TR11-04, Department of Computing Science, University of Alberta. 2011.

Yao, H., Bhatnagar, S., and Diao, D.  
Multi-step linear Dyna-style planning. Advances in Neural Information Processing Systems (NIPS) 22, Vancouver, BC, Canada. 2009. [retyped pdf] [supplementary material: computation details on Mountain-car]

Yao, H., Bhatnagar, S., and Szepesvári, Cs.  
LMS-2: towards an algorithm that is as cheap as LMS and almost as efficient as RLS. The Forty-eighth IEEE Conference on Decision and Control (CDC), Shanghai, China. December 2009. [pdf]

Yao, H., Sutton, R. S., Bhatnagar, S., Diao, D., and Szepesvári, Cs.  
Dyna(k): A multi-step Dyna planning. Abstraction in Reinforcement Learning. Montreal, Canada. June 2009. [pdf][slides]

Yao, H., Bhatnagar, S., and Szepesvári, Cs.  
Temporal difference learning by direct preconditioning. Multidisciplinary Symposium on Reinforcement Learning (MSRL), Montreal, Canada. June 2009.  [pdf]

Yao, H., and Liu, Z-Q.  
Preconditioned temporal difference learning. The 25th International Conference on Machine Learning (ICML), Helsinki, Finland. June 2008. [pdf]

Yao, H., and Liu, Z-Q.  
Minimal residual approaches for policy evaluation in large sparse Markov chains. The Tenth International Symposium on Artificial Intelligence and Mathematics (ISAIM), Fort Lauderdale, USA. January 2008. [pdf]



Software and Code


Linear Action Model (LAM)
This is a package for reinforcement learning written in Matlab:

The package plus two domains: LAMAPI.zip

MISC

A cool way of parsing URLs in C++ (absolutely sound!)
Hadoop Map-Reduce
SVN
Data Sets
Working in Mac OS
VIM
Collection of Research-ology
C++ books read and to read (always!)
Java books


UofA

Basketball
Academic Calendar

---
hsh, Last updated on 2012/04/17

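The way of great learning lies in illuminating bright virtue, in loving the people, and in resting in the highest good.
Knowing where to rest, one becomes settled; being settled, one can be calm; being calm, one can be at ease; being at ease, one can deliberate; deliberating, one can attain.
Things have roots and branches; affairs have ends and beginnings. To know what comes first and what comes last is to draw near to the Way.  ---<The Great Learning>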