Human-Guided Phasic Policy Gradient in Minecraft: Exploring Deep Reinforcement Learning with Human Preferences in Complex Environments
Abstract
This study presents a novel approach to enhancing the performance of artificial agents in complex environments like Minecraft, where traditional reward-based learning strategies can be challenging to apply. To improve the efficacy and efficiency of fine-tuning a foundation model for complex tasks, we propose the Human-Guided Phasic Policy Gradient (HPPG) algorithm, which combines human preference learning with the Phasic Policy Gradient technique. Our key contributions include validating the use of behavioral cloning to improve agent performance and introducing the HPPG algorithm, which employs a reward predictor network to estimate rewards based on human preferences. We further explore the challenges associated with the HPPG algorithm and propose strategies to mitigate its limitations. Through our experiments, we demonstrate significant improvements in the agent’s performance when executing complex tasks in Minecraft, laying the groundwork for future developments in reinforcement learning algorithms for complex, real-world tasks without defined rewards. Our findings contribute to the broader goal of bridging the gap between artificial agents and human-like intelligence.