University of Essex

MA338-6-SP-CO:
Dynamic programming and reinforcement learning

The details

Year: 2024/25

Department: Mathematics, Statistics and Actuarial Science (School of)

Campus: Colchester Campus

Term: Spring

Level: Undergraduate: Level 6

Status: Current

Start date: Monday 13 January 2025

End date: Friday 21 March 2025

Credits: 15

Last updated: 29 May 2024

Requisites for this module

Pre-requisites: (none)

Co-requisites: (none)

Pre and / or co-requisites: (none)

Prohibited modules: (none)

Key module (requisite for): (none)

Key module for

(none)

Module description

Machine learning has become a prominent tool in data analytics. One of its major categories, reinforcement learning, is widely used in industry to maximise a notion of cumulative reward. This module is concerned with the conceptual background of reinforcement learning, namely Markov decision processes (MDPs) and dynamic programming.

Reinforcement learning has been covered under dynamic programming for decades. Dynamic programming, designed on a divide-and-conquer basis, provides the structure for all the methods that have been developed in recent years. It describes the status of a problem by stages, states and transition matrices, while allowing decisions to be made throughout the whole process.

Some programming experience (in any language) is recommended.

Module aims

The aims of this module are:

To teach the basic mathematical concepts behind dynamic programming and reinforcement learning.

To link the solution approach with statistics, computing, simulation, calculus and standard optimization techniques.

Module learning outcomes

By the end of this module, students will be expected to have obtained:

The ability to formulate sequential decision problems, write Bellman’s equations and convert them to the form required for policy and/or value iteration.

A basic understanding of the advantages and limitations of different formulations and algorithms.

Awareness of concepts such as the “curse of dimensionality” and the “exploration/exploitation trade-off”, as well as an understanding of the difference between exact and approximate methods and when to use each of them.

An understanding of techniques such as backward induction, and the ability to write pseudocode to implement them on small-scale problems.

An ability to evaluate different methodologies/approaches/software packages to solve typical applications.

Module information

Dynamic programming is both a mathematical optimisation method and an algorithmic paradigm based on breaking a problem down into several sub-problems, with the aim of finding an overall optimal solution. In this module, that solution will be either a ‘value function’ that represents how good or bad different states and actions are, or a ‘policy’ that recommends the best action to take given the current information.
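
As an illustration of the two kinds of solution described above, the following is a minimal Python sketch of value iteration. The two-state, two-action problem, its transition probabilities and rewards, the discount factor and the function names are all hypothetical choices made for this example; they are not taken from the module materials.

```python
# Hypothetical 2-state, 2-action decision problem (illustration only).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # assumed discount factor


def q_value(P, V, s, a, gamma):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=GAMMA, tol=1e-10, max_iter=10_000):
    """Apply the Bellman optimality update until the values stop changing."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iter):
        new_V = {s: max(q_value(P, V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    # The policy picks, in each state, the action with the highest value.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


V, policy = value_iteration(P)
print("value function:", V)
print("policy:", policy)
```

Here `V` plays the role of the value function (one number per state) and `policy` the role of the policy (one recommended action per state). Note that the update reads the transition probabilities in `P` directly, so this approach assumes the dynamics are fully known.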

To use dynamic programming we need full knowledge of the dynamics of the problem (for example, knowing the probability that an autonomous car will not crash if it turns right at any point in time). This assumption rarely holds in real-life problems. To tackle this issue we will explore other methods that do not need such information and instead consider agents that learn from experience (trial and error).
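
To make the idea of learning from experience concrete, here is a minimal sketch of tabular Q-learning in Python. The corridor task, the slip probability, the hyperparameters and all function names are assumptions invented for this example. The key point is that the agent only ever calls `step` and observes the outcome; it never reads the probability `SLIP` that governs the dynamics.

```python
import random

N_STATES = 5          # hypothetical corridor with cells 0..4
GOAL = N_STATES - 1   # reaching cell 4 ends the episode with reward 1
SLIP = 0.1            # hidden from the agent: chance the chosen move is reversed


def step(state, action, rng):
    """Environment: action 0 = left, 1 = right. Returns (next_state, reward, done)."""
    move = 1 if action == 1 else -1
    if rng.random() < SLIP:
        move = -move
    next_state = min(max(state + move, 0), GOAL)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def choose_action(Q, state, epsilon, rng):
    """Epsilon-greedy choice with random tie-breaking between equal values."""
    if rng.random() < epsilon:
        return rng.randrange(2)
    best = max(Q[state])
    return rng.choice([a for a in range(2) if Q[state][a] == best])


def q_learning(episodes=2000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(200):  # step cap per episode
            action = choose_action(Q, state, epsilon, rng)
            next_state, reward, done = step(state, action, rng)
            # Q-learning target: observed reward plus best estimated future value.
            target = reward + (0.0 if done else gamma * max(Q[next_state]))
            Q[state][action] += alpha * (target - Q[state][action])
            state = next_state
            if done:
                break
    return Q


Q = q_learning()
greedy_policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(GOAL)]
print(greedy_policy)
```

After training, the greedy policy is expected to favour action 1 (move right) in every non-goal cell, even though the agent was never told how the corridor behaves.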

We will also learn about modern reinforcement learning approaches and typical applications that are designed for large-scale problems and have shown excellent results in recent years. A very good example is the algorithm developed by DeepMind, based on Deep Reinforcement Learning and Monte Carlo Tree Search, that beat professional Go players. Currently, most Large Language Models (LLMs, such as ChatGPT) use some version of Reinforcement Learning, namely Reinforcement Learning from Human Feedback (RLHF).

The theory behind these problems and all the required concepts will be discussed in the lectures and complemented with ad hoc experiments in the laboratory sessions, which will use Python and build upon libraries such as OpenAI Gym.

Indicative Syllabus

Basics of sequential decision processes: stage, state, action, objective, policy, etc. Applications to multi-armed bandits.

Markov Decision Process (MDP), Dynamic Programming and the Bellman Equation.

Algorithmic concepts: Value iteration, (general) Policy iteration.

Reinforcement learning techniques: Monte Carlo Methods, Temporal difference methods (Q-Learning and SARSA).

Approximate methods: Deep Q-Networks, Policy Gradient methods.

Convergence/Divergence and the Deadly Triad.

Applications: Revenue Management, Inventory Optimisation, Energy Auctions, Large Language Models, etc.
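
As a small taste of the first syllabus topic, the sketch below shows an epsilon-greedy agent on a multi-armed bandit, the simplest setting in which the exploration/exploitation trade-off appears. The arm success probabilities, the value of epsilon and the function name are hypothetical and chosen only for illustration.

```python
import random

TRUE_MEANS = [0.2, 0.5, 0.8]  # hypothetical Bernoulli arms, unknown to the agent


def run_bandit(steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    n_arms = len(TRUE_MEANS)
    counts = [0] * n_arms          # how often each arm was pulled
    estimates = [0.0] * n_arms     # running average reward per arm
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = 1.0 if rng.random() < TRUE_MEANS[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward
    return counts, estimates, total_reward


counts, estimates, total_reward = run_bandit()
print(counts, [round(e, 2) for e in estimates])
```

With a small epsilon the agent spends most of its pulls on the arm it currently believes is best, while the occasional random pull keeps its estimates of the other arms from going stale.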

Learning and teaching methods

Teaching in the School will be delivered using a range of face to face lectures, classes and lab sessions as appropriate for each module. Modules may also include online only sessions where it is advantageous, for example for pedagogical reasons, to do so.

Bibliography*

Sutton, R.S. and Barto, A.G. (2018) Reinforcement learning: an introduction. Second edition. Cambridge, Massachusetts: The MIT Press. Available at: https://ebookcentral.proquest.com/lib/universityofessex-ebooks/detail.action?docID=3338821.
Szepesvári, C. (2010) Algorithms for reinforcement learning. [San Rafael, CA]: Morgan & Claypool. Available at: https://ebookcentral.proquest.com/lib/universityofessex-ebooks/detail.action?docID=881218.
Puterman, M.L. (2005) Markov Decision Processes. New York: John Wiley & Sons Inc. Available at: https://app.kortext.com/Shibboleth.sso/Login?entityID=https://idp0.essex.ac.uk/shibboleth&target=https://app.kortext.com/borrow/894556.

The above list is indicative of the essential reading for the course.
The library makes provision for all reading list items, with digital provision where possible, and these resources are shared between students.
Further reading can be obtained from this module's reading list.

Assessment items, weightings and deadlines

Coursework / exam	Description	Coursework weighting
Coursework	Lab report	33.3%
Coursework	Project	66.7%
Exam	Main exam: In-Person, Open Book (Restricted), 120 minutes during Summer (Main Period)
Exam	Reassessment Main exam: In-Person, Open Book (Restricted), 120 minutes during September (Reassessment Period)

Exam format definitions

Remote, open book: Your exam will take place remotely via an online learning platform. You may refer to any physical or electronic materials during the exam.
In-person, open book: Your exam will take place on campus under invigilation. You may refer to any physical materials such as paper study notes or a textbook during the exam. Electronic devices may not be used in the exam.
In-person, open book (restricted): The exam will take place on campus under invigilation. You may refer only to specific physical materials such as a named textbook during the exam. Permitted materials will be specified by your department. Electronic devices may not be used in the exam.
In-person, closed book: The exam will take place on campus under invigilation. You may not refer to any physical materials or electronic devices during the exam. There may be times when a paper dictionary, for example, may be permitted in an otherwise closed book exam. Any exceptions will be specified by your department.

Your department will provide further guidance before your exams.

Overall assessment

Coursework	Exam
30%	70%

Reassessment

Coursework	Exam
30%	70%

Module supervisor and teaching staff

Supervisor: Dr Felipe Maldonado, email: felipe.maldonado@essex.ac.uk.

Teaching staff: Dr Felipe Maldonado

Contact details: maths@essex.ac.uk

Availability

Available to incoming Essex Abroad / Exchange students: Yes

Available to Outside Option: Yes

Available to Audit: Yes

External examiner

Name: Dr Yinghui Wei

Institution: University of Plymouth

Resources

Teaching materials: Available via Moodle

Lecture recording:* Of 13 hours, 13 (100%) hours available to students:
0 hours not recorded due to service coverage or fault;
0 hours not recorded due to opt-out by lecturer(s), module, or event type.

Further information

Department website: Mathematics, Statistics and Actuarial Science (School of)

* Please note: due to differing publication schedules, items marked with an asterisk (*) base their information upon the previous academic year.

Disclaimer: The University makes every effort to ensure that this information on its Module Directory is accurate and up-to-date. Exceptionally it can be necessary to make changes, for example to programmes, modules, facilities or fees. Examples of such reasons might include a change of law or regulatory requirements, industrial action, lack of demand, departure of key personnel, change in government policy, or withdrawal/reduction of funding. Changes to modules may for example consist of variations to the content and method of delivery or assessment of modules and other services, to discontinue modules and other services and to merge or combine modules. The University will endeavour to keep such changes to a minimum, and will also keep students informed appropriately by updating our programme specifications and module directory.

The full Procedures, Rules and Regulations of the University governing how it operates are set out in the Charter, Statutes and Ordinances and in the University Regulations, Policy and Procedures.
