CS 277: Control and Reinforcement Learning
Winter 2024
Schedule
Note: the planned schedule is subject to change.
Course logistics
- When: Tuesdays and Thursdays, 2:00–3:20pm.
- Where: DBH 1200.
- Format:
- Lectures: there will be a lecture each class covering topics in control and reinforcement learning. Lectures will be recorded when possible and the videos linked on this page (access requires a uci.edu account). For some topics, there will also be videos from previous years. Attendance is optional but recommended (see reasons below).
- Quizzes: most weeks, there will be a quiz about that week’s topics, due by the following Monday. Quizzes consist of multiple-choice questions intended to encourage you to think more deeply about the topics, and are graded for completion rather than correctness: half the score for submitting a complete quiz, and half for doing better than random guessing. Week 1’s quiz covers useful background concepts in math, algorithms, and machine learning.
- Exercises: there will be 5 exercises, due roughly every other week. Only the best 4 exercises will be averaged for the final grade, but a bonus will be given for scoring at least 50% on every exercise.
- Class discussions: we will discuss each quiz and exercise in a class following its deadline. There will also be recaps, deep dives, and freeform discussions.
- Ed Discussion:
- Please use the forum for course-related discussions.
- Important course announcements will be posted there as well (not on Canvas).
- Please use the forum (not email) to privately message course staff about course-related matters.
- Please note that the identity of anonymous posters is visible to the course staff.
- Gradescope:
- Quizzes and exercises will be posted on this page and submitted on Gradescope.
- We encourage you to submit PDF files, preferably written in LaTeX.
- Instructor: Prof. Roy Fox
- Office hours can be scheduled here.
- You are welcome to schedule 15-minute slots (more than once if needed) with at least 4 hours' notice.
- You are welcome to attend in person (DBH 4064) or over Zoom, individually or with classmates.
- Please let me know if you cannot make any of the available slots.
- Teaching assistant: Armin Karamzade
- Office hours can be scheduled here.
Grading policy
- Exercises: 80% (+5% bonus)
- Best 4 exercises: 20% each (see the worked example below).
- Score at least 50% on every exercise: 5% bonus.
- Late submission policy: 5 grace days total for all exercises.
- Quizzes: 16% (+2% bonus)
- 6 quizzes: 3% each.
- Deadlines on Mondays (end of day). No late submissions allowed.
- This grading policy may change if we add more quizzes.
- Participation: 4%
- Class or forum participation: 2%.
- To get full points, occasionally participate in class discussions or office hours by asking thoughtful on-topic questions or sharing quiz answers.
- Alternatively, post on the forum at least a few on-topic (not administrative) questions, answers, thoughts, or useful links.
- Course evaluation: 2%.
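For concreteness, here is a small sketch of how the components above combine into a final grade. The Python function and the example scores are hypothetical, for illustration only; the official grade is whatever Gradescope reports.

    # Hypothetical illustration of the grading arithmetic above (all inputs are fractions in [0, 1]).
    def final_grade(exercise_scores, quiz_scores, participation, course_eval):
        # Exercises: best 4 of 5 count, 20% each (80% total).
        best4 = sorted(exercise_scores, reverse=True)[:4]
        exercises = 20.0 * sum(best4)
        # Bonus: 5% for scoring at least 50% on every exercise.
        exercise_bonus = 5.0 if all(s >= 0.5 for s in exercise_scores) else 0.0
        # Quizzes: 6 quizzes at 3% each (counted as 16% + 2% bonus).
        quizzes = 3.0 * sum(quiz_scores)
        # Participation: 2% class/forum participation + 2% course evaluation.
        return exercises + exercise_bonus + quizzes + 2.0 * participation + 2.0 * course_eval

    # Example: five exercise scores, six quiz scores (half for completion, half for beating random guessing).
    print(final_grade(
        exercise_scores=[0.9, 0.8, 0.55, 1.0, 0.7],
        quiz_scores=[1.0, 1.0, 0.5, 1.0, 1.0, 1.0],
        participation=1.0,
        course_eval=1.0,
    ))  # 68 + 5 + 16.5 + 4 = 93.5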
Compute Resources
Students enrolled in the course have GPU quota on the HPC3 cluster.
RL Resources
Courses
- Sergey Levine (Berkeley)
- Pieter Abbeel (Berkeley)
- Dimitri Bertsekas (MIT; also available in book form; also see 2017 book)
- David Silver (UCL)
Books
- Vincent François-Lavet et al., An Introduction to Deep Reinforcement Learning
- Richard Sutton & Andrew Barto, Reinforcement Learning: An Introduction (2nd edition)
- Csaba Szepesvári, Algorithms for Reinforcement Learning
RL libraries
- RLlib: industry-oriented library
- Spinning Up: RL introductory material and code
- Acme: research-oriented library
- MushroomRL: another research-oriented library
More resources
- Awesome Deep RL: miscellaneous great resources
Further reading
Imitation Learning
- Behavior Cloning with Deep Learning, Bojarski et al., CVPR 2016
- DAgger, Ross et al., AISTATS 2011
- Goal-conditioned imitation learning, Ding et al., NeurIPS 2019
- DART, Laskey et al., CoRL 2017
Temporal-difference methods
- Q-Learning, Watkins and Dayan, ML 1992
- Sampling-based Fitted Value Iteration, Munos and Szepesvári, JMLR 2008
- DQN, Mnih et al., Nature 2015
- Double Q-learning, van Hasselt, NeurIPS 2010
- Double DQN, van Hasselt et al., AAAI 2016
- Clipped double Q-learning, Fujimoto et al., ICML 2018
- Prioritized Experience Replay, Schaul et al., ICLR 2016
- Dueling networks, Wang et al., ICML 2016
- Rainbow DQN, Hessel et al., AAAI 2018
- NAF, Gu et al., ICML 2016
- R2D2, Kapturowski et al., ICLR 2019
Policy-gradient methods
- REINFORCE, Williams, ML 1992
- Policy-Gradient Theorem, Sutton et al., NeurIPS 2000
- DPG, Silver et al., ICML 2014
- DDPG, Lillicrap et al., ICLR 2016
- A3C, Mnih et al., ICML 2016
- GAE, Schulman et al., ICLR 2015
- Off-policy policy evaluation, Precup et al., ICML 2000
- Off-policy policy gradient, Liu et al., UAI 2018
- TRPO, Schulman et al., ICML 2015
- PPO, Schulman et al., arXiv 2017
Exploration
- UCB in MAB, Auer et al., ML 2002
- Thompson sampling in MAB, Agrawal and Goyal, COLT 2012
- Count-based exploration, Bellemare et al., NeurIPS 2016
- Thompson sampling in RL, Gopalan and Mannor, COLT 2015
Model-based methods
- MCTS, Kocsis and Szepesvári, ECML 2006
- ILQR, Li and Todorov, ICINCO 2004
- DDP, Mayne, IJC 1966
- Dyna, Sutton, ACM SIGART Bulletin 1991
- E3, Kearns and Singh, ML 2002
- R-max, Brafman and Tennenholtz, JMLR 2002
- Local models, Levine and Abbeel, NeurIPS 2014
Inverse Reinforcement Learning
- Feature matching, Abbeel and Ng, ICML 2004
- MaxEnt IRL, Ziebart et al., AAAI 2008
- GAIL, Ho and Ermon, NeurIPS 2016
Bounded Reinforcement Learning
- LMDP, Todorov, NeurIPS 2007
- Full-controllability duality, Todorov, CDC 2008
- Information–value tradeoff, Rubin et al., Decision Making with Imperfect Decision Makers 2012
- Variational Inference duality, Levine, arXiv 2018
- Soft Q-learning, Fox et al., UAI 2016
- Deep SQL, Haarnoja et al., ICML 2017
- Soft Actor–Critic, Haarnoja et al., ICML 2018
Structured Control
- Options framework, Precup et al., AI 1999
- Option–critic method, Bacon et al., AAAI 2017
- Skill trees, Konidaris et al., NeurIPS 2010
- Spectral method, Mahadevan and Maggioni, JMLR 2007
- FuN, Vezhnevets et al., ICML 2017
Offline RL
- The Bitter Lesson, Sutton, 2019
- Doubly robust reinforcement learning, Jiang and Li, ICML 2016
- GenDICE, Zhang et al., ICLR 2020
- Explicit policy constraining, Wu et al., arXiv 2019
- Implicit policy constraining (AWR), Peng et al., arXiv 2019
- IQL, Kostrikov et al., ICLR 2022
- CQL, Kumar et al., NeurIPS 2020
Multi-Task Learning
- Fine-tuning SQL, Haarnoja et al., ICML 2017
- Learning with pre-trained perceptual features, Levine et al., JMLR 2016
- Learning with model transfer, Fu et al., IROS 2016
- Domain randomization, Peng et al., ICRA 2018
- Domain adaptation, Bousmalis et al., ICRA 2018
- Goal generation for curriculum learning, Florensa et al., ICML 2018
- Multi-task policy distillation, Teh et al., NeurIPS 2017
- Multi-task hierarchical imitation learning, Fox et al., CASE 2019
Multi-Agent RL
- MADDPG, Lowe et al., NeurIPS 2017
- NFSP, Heinrich and Silver, DRL @ NeurIPS 2016
- Double Oracle, McMahan et al., ICML 2003
- PSRO, Lanctot et al., NeurIPS 2017
- XDO, McAleer et al., NeurIPS 2021
Academic integrity
Don’t cheat. Academic dishonesty includes partially copying answers from other students or online resources, allowing other students to partially copy your answers, communicating information about exam answers to other students during an exam, or attempting to use disallowed notes or other aids during an exam. The biggest downside to such behavior is that it necessarily becomes part of who you are: a dishonest person. Such behavior is also easier to detect than you think, and the consequences can be severe, including failing this course. Trust me, it's not worth it.