Adaptive Action Advising with Different Rewards
Action advising is a teacher-student approach to knowledge transfer in reinforcement learning: a teacher with a pre-trained policy observes the student's states and advises actions computed from that policy, thereby improving the student's task performance. An important requirement is that the teacher be able to adapt robustly and give effective advice in new environments whose reward differs from the one on which the teacher was trained. This issue has not been addressed in the current teacher-student literature: most existing work requires the teacher to be pre-trained with the same reward the student receives and therefore cannot generalize its advice beyond that policy; moreover, the reward the student collects through interaction with the environment is passed directly to the teacher, regardless of the student's exploration process. To fill this gap, our proposed method enhances action advising by allowing the teacher to learn from data gathered by observing the student and to adapt its own reward function. We empirically evaluate our method on three environments, a Gridworld, ALE Skiing, and Pacman, and find that it improves policy returns and sample efficiency.