Hengshuai Yao
Email: username@cs.ualberta.ca
(username:hengshua)
RLAI
Lab: CSC 305, Computing
Science Department, University of Alberta
I am a PhD candidate in the Department of Computing Science at the
University of Alberta. I work with Dr. Rich
Sutton and Dr. Csaba Szepesvári on reinforcement learning. I also
work with Dr. Davood Rafiei and Dr. Rich Sutton on information
retrieval and Web search.
Interests
I am interested in Reinforcement
Learning, Information Retrieval and Web Search.
What have I done in
reinforcement learning? In general, I am interested in making
efficient and optimal decisions in unknown environments.
- Planning is a very effective approach towards this
goal.
Planning is a model-based approach to decision
making that uses models of the world to improve
learning and decision making. Examples of planning include
dynamic programming, heuristic search, and Dyna.
Dyna planning was originally
proposed by Dr. Sutton in the 1980s. Dyna is an integrated
architecture for acting, learning, and planning, in all of
which state is represented by a lookup table. In 2008, Dr. Sutton and
his colleagues generalized Dyna to linear Dyna-style planning to handle problems with a
large state space. The key is that a state is encoded by a
set of feature functions. In addition, a set of compressed
world models, one per action, is built. Linear Dyna-style
planning includes a sub-procedure that learns a model of the world,
which replaces the recording of experience in Dyna. I have a paper
on how to implement linear Dyna and do multi-step
planning.
More
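The linear Dyna-style idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation from the papers: the names `update_model`, `planning_step` and `run_demo`, the two-state toy chain, the one-hot features and the step sizes are all hypothetical. The model is a matrix F and a vector b such that F·phi approximates the expected next feature vector and b·phi approximates the expected reward; planning then applies TD(0)-style updates to imagined transitions generated by the model.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def update_model(F, b, phi, r, phi_next, beta):
    """One LMS step so that F*phi tracks phi_next and b.phi tracks r."""
    pred_next = matvec(F, phi)
    r_err = r - dot(b, phi)
    for i in range(len(F)):
        err = phi_next[i] - pred_next[i]
        for j in range(len(phi)):
            F[i][j] += beta * err * phi[j]
    for j in range(len(b)):
        b[j] += beta * r_err * phi[j]

def planning_step(theta, F, b, phi, gamma, alpha):
    """One TD(0)-style update on a transition imagined by the model."""
    delta = dot(b, phi) + gamma * dot(theta, matvec(F, phi)) - dot(theta, phi)
    for j in range(len(theta)):
        theta[j] += alpha * delta * phi[j]

def run_demo(gamma=0.9, episodes=50, sweeps=200):
    # Hypothetical toy chain: state 0 -> state 1 (reward 0) -> terminal (reward 1).
    # One-hot features; the terminal state is encoded as the zero vector.
    e0, e1, terminal = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]
    episode = [(e0, 0.0, e1), (e1, 1.0, terminal)]
    F = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    theta = [0.0, 0.0]
    # Learning: fit the model from real experience.
    for _ in range(episodes):
        for phi, r, phi_next in episode:
            update_model(F, b, phi, r, phi_next, beta=0.5)
    # Planning: update the value weights from the model alone.
    for _ in range(sweeps):
        for phi in (e0, e1):
            planning_step(theta, F, b, phi, gamma, alpha=0.5)
    return theta

theta = run_demo()
print(theta)
```

In this toy chain the true values are 0.9 for state 0 and 1.0 for state 1, so the planned weights should settle close to those numbers. With richer features the same two routines apply unchanged; only the feature vectors differ.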
In general, the
ability to evaluate one way of behaving using data
collected by behaving in another way is called off-policy learning.
This suggests an important possibility: we can evaluate many
(perhaps millions of) policies with a single stream of data.
Off-policy learning is thus an important way of achieving
both data efficiency and computational efficiency. It is also
much more convenient than on-policy learning. Off-policy
learning is more challenging, however, especially with the use
of function approximation. Refer to the gradient temporal-difference
learning papers by our group (GTD,
GTD2
and TDC, GQ)
for an overview of the field and the latest algorithms and
results. I designed a model-based off-policy learning framework
(to my knowledge, the first model-based approach to
off-policy learning), and together
with Csaba we provided empirical validation and an error bound.
Compared to the GTD algorithms, our method is more data efficient,
though its computational complexity is higher because we used LSTD
for policy evaluation. A benefit of our framework is that one can plug in
any off-policy learning algorithm (including the GTD algorithms) for policy
evaluation.
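To make the "one stream of data, many policies" idea concrete, here is a small Python sketch. It is an illustration under stated assumptions rather than the framework from the papers: it uses a tabular model (the one-hot special case of linear features) and plain iterative evaluation instead of LSTD, and the three-state chain, the uniform random behaviour policy and the function names `collect`, `fit_action_models` and `evaluate` are all hypothetical.

```python
import random

N_STATES, N_ACTIONS, GAMMA = 3, 2, 0.9  # toy chain; action 0 = left, 1 = right

def step(s, a):
    """Hypothetical deterministic dynamics: reward 1 for arriving in state 2."""
    s_next = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return (1.0 if s_next == N_STATES - 1 else 0.0), s_next

def collect(n_steps, seed=0):
    """A single stream of data from a uniformly random behaviour policy."""
    rng = random.Random(seed)
    data, s = [], 0
    for _ in range(n_steps):
        a = rng.randrange(N_ACTIONS)
        r, s_next = step(s, a)
        data.append((s, a, r, s_next))
        s = s_next
    return data

def fit_action_models(data):
    """Estimate one transition model P[a] and one reward model R[a] per action."""
    counts = [[[0.0] * N_STATES for _ in range(N_STATES)] for _ in range(N_ACTIONS)]
    rsum = [[0.0] * N_STATES for _ in range(N_ACTIONS)]
    for s, a, r, s_next in data:
        counts[a][s][s_next] += 1.0
        rsum[a][s] += r
    P = [[[0.0] * N_STATES for _ in range(N_STATES)] for _ in range(N_ACTIONS)]
    R = [[0.0] * N_STATES for _ in range(N_ACTIONS)]
    for a in range(N_ACTIONS):
        for s in range(N_STATES):
            n = sum(counts[a][s])
            if n > 0:
                P[a][s] = [c / n for c in counts[a][s]]
                R[a][s] = rsum[a][s] / n
    return P, R

def evaluate(policy, P, R, sweeps=500):
    """Evaluate a target policy from the action models alone; policy[s][a] is a probability."""
    V = [0.0] * N_STATES
    for _ in range(sweeps):
        V = [sum(policy[s][a] * (R[a][s] + GAMMA * sum(p * v for p, v in zip(P[a][s], V)))
                 for a in range(N_ACTIONS))
             for s in range(N_STATES)]
    return V

P, R = fit_action_models(collect(2000))
always_right = [[0.0, 1.0]] * N_STATES
always_left = [[1.0, 0.0]] * N_STATES
v_right = evaluate(always_right, P, R)
v_left = evaluate(always_left, P, R)
print(v_right, v_left)
```

The data are collected once, and each new target policy costs only another call to `evaluate`; no further interaction with the environment is needed. Any other policy evaluation routine could be substituted for `evaluate` without touching the data collection or model fitting.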
- Approximate policy iteration. This work uses
linear action models to find an optimal policy with function
approximation. Both the empirical and the theoretical results can
be found in a forthcoming paper.
Publications
Conference and Workshop
Yao, H., and Szepesvári, Cs. Approximate Policy Iteration with Linear
Action Models. Twenty-Sixth AAAI Conference on Artificial Intelligence
(AAAI), Toronto, Canada. 2012. [pdf]
Yao, H. Off-policy learning with linear action models:
an efficient "One-Collection-For-All-Solution". In workshop
on "Planning and Acting with Uncertain Models" at the
28th ICML, Bellevue, Washington, USA. 2011. [pdf]
[slides]
Yao, H. Linear least-squares
Dyna-style planning. Technical Report TR11-04, Department of
Computing Science, University of Alberta. 2011.
Yao, H., Bhatnagar, S., and Diao, D. Multi-step linear Dyna-style planning. Advances in
Neural Information Processing Systems (NIPS) 22, Vancouver, BC,
Canada. 2009. [retyped pdf]
[supplementary material: computation details on Mountain-car]
Yao, H., Bhatnagar, S., and Szepesvári, Cs. LMS-2: towards an algorithm that is as cheap as LMS and
almost as efficient as RLS. The 48th IEEE
Conference on Decision and Control (CDC), Shanghai, China. December
2009. [pdf]
Yao, H., Sutton, R. S., Bhatnagar, S., Diao, D., and Szepesvári,
Cs. Dyna(k): A multi-step
Dyna planning. Abstraction in Reinforcement Learning. Montreal,
Canada. June 2009. [pdf][slides]
Yao, H., Bhatnagar, S., and Szepesvári, Cs. Temporal difference learning by direct preconditioning.
Multidisciplinary Symposium on Reinforcement Learning (MSRL),
Montreal, Canada. June 2009. [pdf]
Yao, H., and Liu, Z-Q. Preconditioned temporal difference learning.
The 25th International Conference on Machine
Learning (ICML), Helsinki, Finland. June 2008. [pdf]
Yao, H., and Liu, Z-Q. Minimal
residual approaches for policy evaluation in large sparse Markov
chains. The Tenth International Symposium on Artificial
Intelligence and Mathematics (ISAIM), Fort Lauderdale, USA.
January 2008. [pdf]
Software and Code
Linear Action Model (LAM)
This is a package for reinforcement learning written in Matlab:
- fast algorithms for learning LAM from data
- fast algorithms for approximate policy iteration with LAM
- see the linear Dyna paper and the LAM-API paper for details on LAM
The package plus two domains: LAMAPI.zip
MISC
A cool way of parsing URLs in C++ (absolutely sound!)
Hadoop Map-Reduce
SVN
Data Sets
working in Mac OS
VIM
Collection of Research-ology
C++ books read and to read (always!)
Java books
UofA Basketball
Academic Calendar
---
hsh, Last updated on 2012/04/17
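The way of great learning lies in manifesting bright virtue, in loving the people, and in resting in the highest good.
Knowing where to rest, one becomes settled; settled, one can be calm; calm, one can be at ease; at ease, one can deliberate; deliberating, one can attain.
Things have roots and branches; affairs have ends and beginnings. To know what comes first and what comes last is to be near the Way. ---The Great Learning (大学)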