Hidden-mode Markov decision processes
|Authors||Choi, Samuel P. M.; Zhang, Nevin Lianwen|
|Source||Proceedings of the 16th international joint conference on artificial intelligence (IJCAI-99), workshop on neural, symbolic, and reinforcement methods for sequence learning, Stockholm, Sweden, 1999, pp. 9-14|
|Summary||Traditional reinforcement learning (RL) assumes that environment dynamics do not change over time (i.e., that the environment is stationary). This assumption, however, is unrealistic in many real-world applications. This paper proposes a formal model for an interesting subclass of nonstationary environments. The model, called the hidden-mode Markov decision process (HM-MDP), assumes that environmental changes are always confined to a small number of hidden modes. A mode indexes a Markov decision process (MDP) and evolves over time according to a Markov chain. The HM-MDP is a special case of the partially observable Markov decision process (POMDP). Nevertheless, modeling an HM-MDP environment with the more general POMDP model unnecessarily increases problem complexity. The paper discusses the conversion from the former to the latter. Learning a model of an HM-MDP is the first of two steps in nonstationary model-based RL. The paper shows how model learning can be achieved with a variant of the Baum-Welch algorithm. Compared with the POMDP approach, empirical results show that the HM-MDP approach significantly reduces both computation time and the amount of data required.|
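To make the model in the summary concrete, the following is a minimal Python sketch of an HM-MDP and of its flattening into a POMDP. It is an illustration under stated assumptions, not the authors' implementation: the toy numbers, the function names (`make_hmmdp`, `step`, `to_pomdp`), and the ordering of events within a time step (the state moves under the current mode's MDP, then the mode moves according to its Markov chain) are all assumed here and may differ from the paper's exact formulation.

```python
import random


def make_hmmdp():
    """Toy HM-MDP with 2 modes, 2 states, 2 actions (all numbers are made up)."""
    # mode_trans[m][m2]: Markov chain over the hidden modes.
    mode_trans = [[0.9, 0.1],
                  [0.2, 0.8]]
    # state_trans[m][s][a][s2]: one MDP transition table per mode.
    # In mode 0, action a leads to state a; in mode 1 the effect is reversed.
    state_trans = [
        [[[1.0, 0.0], [0.0, 1.0]],
         [[1.0, 0.0], [0.0, 1.0]]],
        [[[0.0, 1.0], [1.0, 0.0]],
         [[0.0, 1.0], [1.0, 0.0]]],
    ]
    return mode_trans, state_trans


def sample(probs, rng):
    """Draw an index according to a discrete probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def step(mode, state, action, mode_trans, state_trans, rng):
    """One environment step (assumed ordering: state first, then mode)."""
    next_state = sample(state_trans[mode][state][action], rng)
    next_mode = sample(mode_trans[mode], rng)
    return next_mode, next_state


def to_pomdp(mode_trans, state_trans):
    """Flatten the HM-MDP into a POMDP over hidden pairs (mode, state).

    The agent observes only the state component, so the observation
    function is deterministic: observe[(m, s)] == s.
    """
    n_modes = len(mode_trans)
    n_states = len(state_trans[0])
    n_actions = len(state_trans[0][0])
    hidden = [(m, s) for m in range(n_modes) for s in range(n_states)]
    trans = {}
    for (m, s) in hidden:
        for a in range(n_actions):
            # Joint transition factorizes under the assumed ordering.
            trans[(m, s), a] = {
                (m2, s2): mode_trans[m][m2] * state_trans[m][s][a][s2]
                for (m2, s2) in hidden
            }
    observe = {h: h[1] for h in hidden}
    return hidden, trans, observe
```

The flattened POMDP has (number of modes) x (number of states) hidden states, and its generic transition table no longer exposes the factorization into a mode chain and per-mode MDPs. This is one way to read the summary's remark that the more general POMDP model unnecessarily increases problem complexity.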