We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process in which transitions have a finite support. We prove an upper bound on the number of calls to the generative model needed for MDP-GapE to identify a near-optimal action with high probability. This problem-dependent sample complexity result is expressed in terms of the sub-optimality gaps of the state-action pairs that are visited during exploration. Our experiments reveal that MDP-GapE is also effective in practice, in contrast with other algorithms with sample complexity guarantees in the fixed-confidence setting, that are mostly theoretical.

Paper and Bibtex



Jonsson A., Kaufmann E., Ménard P., Domingues O., Leurent E. and Valko M., 2020. Planning in Markov Decision Processes with Gap-Dependent Sample Complexity. In Advances in Neural Information Processing Systems.


    title={Planning in Markov Decision Processes
    	with Gap-Dependent Sample Complexity},
    author={Anders Jonsson and Emilie Kaufmann and
    	Pierre Ménard and Omar Darwiche Domingues and
    	Edouard Leurent and Michal Valko},
    booktitle={Advances in Neural Information
    	Processing Systems 33},
    publisher={Curran Associates, Inc.},



  1. Install the finite-mdp environment

pip install --user git+https://github.com/eleurent/finite-mdp

  1. Install the rl-agents implementations.

pip install --user git+https://github.com/eleurent/rl-agents


The experiments can be reproduced by running:

Table 4 and Figure 1

python planners_evaluation_confidence.py --accuracies=[1, 0.5, 0.2] --seeds=200

Figure 2

python planners_evaluation_budget.py --budgets=1,5,9 --seeds=200

The figures and data will appear in the scripts/out directory.


We acknowledge the support of the European CHIST-ERA project DELTA. Anders Jonsson is partially supported by the Spanish grants TIN2015-67959 and PCIN-2017-082.