Efficient Reinforcement Learning with Generalized-Reactivity Specifications (APSEC 2022 - Technical Track)

Who

Chenyang Zhu, Yujie Cai, Can Hu, Jia Bi

Track

APSEC 2022 Technical Track

When

Fri 9 Dec 2022 13:00 - 13:20 (GMT+09:00) at Room2 - Machine Learning 3. Chair(s): Atul Gupta

Abstract

Reinforcement learning (RL) has been used to solve sequential decision-making problems in intelligent systems. However, current RL approaches suffer from slow convergence and reward sparsity, and their reward mechanisms struggle to handle complex task specifications. Temporal logic can describe non-Markovian task specifications, and a strategy synthesized from such a specification can serve as a priori knowledge for training agents to interact with the environment efficiently. This paper considers an intelligent agent that reacts to the environment under a high-level reactive temporal logic specification called Generalized Reactivity of rank 1 (GR(1)). We first use the synthesized GR(1) strategy to construct a Markov Decision Process with a potential-based reward machine, which integrates the environment with the high-level reactive temporal specification. We then develop a topological-sort-based reward shaping approach to calculate the potential functions of the reward machine, based on which we use Q-learning to train the agents. Experiments on multi-task learning show that the proposed approach outperforms state-of-the-art algorithms in learning rate and optimal rewards. Moreover, compared with value-iteration-based reward shaping approaches, our topological-sort-based approach can handle cases where the synthesized strategies take the form of directed cyclic graphs.
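
The general idea of potential-based reward shaping over a reward machine can be illustrated with a small sketch. This is not the authors' algorithm: it assumes an acyclic strategy graph (the abstract states that the paper's method also covers cyclic ones, which this sketch rejects), it assumes the potential of a state is the negated number of transitions to a single accepting state, and all names (topological_order, potentials, shaped_reward, the example graph) are hypothetical.

```python
from collections import deque


def topological_order(edges, nodes):
    """Kahn's algorithm. Raises ValueError if the graph contains a cycle."""
    indegree = {n: 0 for n in nodes}
    for u in nodes:
        for v in edges.get(u, ()):
            indegree[v] += 1
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in edges.get(u, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle; this sketch only handles DAGs")
    return order


def potentials(edges, nodes, accepting):
    """Assumed potential: Phi(u) = -(fewest transitions from u to `accepting`)."""
    dist = {n: float("inf") for n in nodes}
    dist[accepting] = 0
    # Visit states in reverse topological order so every successor of u
    # already has its final distance when u is processed.
    for u in reversed(topological_order(edges, nodes)):
        for v in edges.get(u, ()):
            dist[u] = min(dist[u], dist[v] + 1)
    return {n: -d for n, d in dist.items()}


def shaped_reward(reward, phi, u, u_next, gamma=0.9):
    """Standard potential-based shaping: r + gamma * Phi(u') - Phi(u)."""
    return reward + gamma * phi[u_next] - phi[u]


# Hypothetical four-state strategy graph with one accepting state.
nodes = ["u0", "u1", "u2", "goal"]
edges = {"u0": ["u1", "u2"], "u2": ["u1"], "u1": ["goal"]}
phi = potentials(edges, nodes, accepting="goal")
```

The shaped reward would then replace the raw reward in an ordinary Q-learning update. Because each state is visited once in topological order, the potentials are computed in a single pass rather than by repeated sweeps as in value iteration.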

Chenyang Zhu

Yujie Cai

Changzhou University

China

Can Hu

Changzhou University

China

Jia Bi

University of Southampton

United Kingdom