Reinforcement Learning of Risk-Constrained Policies in Markov Decision Processes

Brázdil, Tomáš; Chatterjee, Krishnendu; Novotný, Petr; Vahala, Jiří

Authors	BRÁZDIL Tomáš; CHATTERJEE Krishnendu; NOVOTNÝ Petr; VAHALA Jiří
Year of publication	2020
Type	Article in Proceedings
Conference	The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020
MU Faculty or unit	Faculty of Informatics
Citation
web	https://aaai.org/ojs/index.php/AAAI/article/view/6531
Doi	http://dx.doi.org/10.1609/aaai.v34i06.6531
Keywords	reinforcement learning; Markov decision processes; Monte Carlo tree search; risk aversion
Description	Markov decision processes (MDPs) are the de facto framework for sequential decision making in the presence of stochastic uncertainty. A classical optimization criterion for MDPs is to maximize the expected discounted-sum payoff, which ignores low-probability catastrophic events with a highly negative impact on the system. On the other hand, risk-averse policies require the probability of undesirable events to be below a given threshold, but they do not account for optimization of the expected payoff. We consider MDPs with discounted-sum payoff and with failure states that represent catastrophic outcomes. The objective of risk-constrained planning is to maximize the expected discounted-sum payoff among risk-averse policies that ensure the probability of encountering a failure state is below a desired threshold. Our main contribution is an efficient risk-constrained planning algorithm that combines UCT-like search with a predictor learned through interaction with the MDP (in the style of AlphaZero) and with risk-constrained action selection via linear programming. We demonstrate the effectiveness of our approach with experiments on classical MDPs from the literature, including benchmarks on the order of 10^6 states.
Related projects:	Algorithms for discrete systems and games with infinitely many states; Pushing the limits in automated NMR structure determination using a single 4D NOESY spectrum and machine learning methods; Verification and analysis of probabilistic programs; Large-scale computing systems: models, applications and verification IX
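
The Description mentions risk-constrained action selection via linear programming. The sketch below illustrates that idea for a single decision step only, and it is not the authors' algorithm: it assumes that per-action estimates of expected payoff and of failure probability are already available (the names `payoffs`, `risks`, and `threshold` are hypothetical), and it looks for a randomized choice of action that maximizes the expected payoff while keeping the expected failure probability at or below the threshold. Because this small linear program has only the simplex constraint and one risk constraint, an optimum exists that mixes at most two actions, so the sketch enumerates those candidates directly instead of calling an LP solver.

```python
from itertools import combinations


def select_action_distribution(payoffs, risks, threshold, eps=1e-12):
    """Solve: maximize sum(x[a] * payoffs[a])
    subject to sum(x[a] * risks[a]) <= threshold, sum(x) == 1, x >= 0.

    Returns (distribution, expected_payoff), or None if no action mix
    satisfies the risk threshold. All inputs are hypothetical estimates.
    """
    n = len(payoffs)
    best = None  # (distribution, value) of the best vertex found so far

    def consider(dist):
        nonlocal best
        value = sum(x * p for x, p in zip(dist, payoffs))
        if best is None or value > best[1] + eps:
            best = (dist, value)

    # Candidate vertices that play a single action with probability 1.
    for a in range(n):
        if risks[a] <= threshold + eps:
            dist = [0.0] * n
            dist[a] = 1.0
            consider(dist)

    # Candidate vertices that mix a safe and a risky action so that the
    # risk constraint holds with equality.
    for a, b in combinations(range(n), 2):
        safe, risky = (a, b) if risks[a] < risks[b] else (b, a)
        if risks[safe] < threshold < risks[risky]:
            w = (threshold - risks[safe]) / (risks[risky] - risks[safe])
            dist = [0.0] * n
            dist[risky] = w
            dist[safe] = 1.0 - w
            consider(dist)

    return best


if __name__ == "__main__":
    # Three hypothetical actions: safe/low payoff, moderate, risky/high payoff.
    print(select_action_distribution([1.0, 5.0, 10.0], [0.0, 0.1, 0.5], 0.2))
```

In the example call, the third action alone exceeds the risk threshold, so the best choice mixes it with the second action. For larger problems with more constraints, a general-purpose LP solver would be the more natural tool; the enumeration here only serves to keep the sketch self-contained.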