Study on Meta-Learning through Arbitration of Cognitive Biases in Reinforcement Learning from Human Feedback (RLHF)


This is one of the key projects that our lab, the Human and AI Research Lab, has selected to pursue over the next five years from among its various research initiatives.

Reinforcement Learning from Human Feedback (RLHF) was introduced to overcome the issues present in traditional reinforcement learning. However, it has been found that RLHF inherits the cognitive biases and limitations of the human brain, since the agent learns only according to human preferences.

Our hypothesis, therefore, is that by selectively adjusting the proportion of human feedback depending on whether it is beneficial or detrimental in a given environment, it may be possible to achieve meta-learning that adapts well to any environment.
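
To make the hypothesis concrete, the arbitration idea can be pictured as a single weight that blends a learned human-preference reward with the raw environment reward. The sketch below is purely illustrative, assuming a scalar weight `alpha` and hypothetical reward inputs; it is not the project's actual design.

```python
def mixed_reward(alpha: float, r_human: float, r_env: float) -> float:
    """Convex combination of human-feedback reward and environment reward.

    Illustrative only: alpha = 1 recovers pure RLHF, alpha = 0 recovers
    plain environment-reward RL; values in between arbitrate the two.
    """
    assert 0.0 <= alpha <= 1.0
    return alpha * r_human + (1.0 - alpha) * r_env
```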

Just as large language models like ChatGPT are pre-trained and then fine-tuned, pre-trained reinforcement learning agents are likewise being fine-tuned in new environments. However, these approaches have been shown to still replicate human biases and exhibit the same limitations.

Thus, instead of relying on post-hoc fine-tuning, we propose intervening during the learning phase itself to manage human biases in a beneficial manner, thereby developing ‘meta-learning’ reinforcement learning agents that perform well in diverse environments without performance degradation.

To test this, it is first necessary to clearly ‘identify the cognitive biases in RLHF through neuro-cognitive behavioral research.’ This involves collecting behavioral data in various reinforcement learning environments, analyzing that data, developing neuro-cognitive behavioral markers of bias, and comparing these markers against the learning and behavioral data of RLHF agents.
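
As one concrete illustration of what a behavioral bias marker could look like (my own assumption here, not the lab's settled method): humans are known to update beliefs more from positive than from negative prediction errors. Fitting a Q-learning model with separate learning rates to bandit-task choice data, and taking the asymmetry as a marker, might look like the following sketch; `fit_bias_marker` and all parameter values are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_lik(params, choices, rewards, n_arms=2):
    """Negative log-likelihood of bandit choices under asymmetric Q-learning.

    params: (alpha_pos, alpha_neg, beta) -- separate learning rates for
    positive and negative prediction errors, plus a softmax inverse
    temperature.
    """
    alpha_pos, alpha_neg, beta = params
    q = np.zeros(n_arms)
    nll = 0.0
    for c, r in zip(choices, rewards):
        logits = beta * q
        logp = logits - np.log(np.sum(np.exp(logits)))  # log-softmax
        nll -= logp[c]
        delta = r - q[c]                    # reward prediction error
        alpha = alpha_pos if delta > 0 else alpha_neg
        q[c] += alpha * delta               # asymmetric value update
    return nll

def fit_bias_marker(choices, rewards):
    """Return alpha_pos - alpha_neg: one candidate 'optimism bias' marker."""
    res = minimize(neg_log_lik, x0=[0.3, 0.3, 3.0],
                   args=(np.asarray(choices), np.asarray(rewards)),
                   bounds=[(1e-3, 1.0), (1e-3, 1.0), (1e-2, 20.0)])
    alpha_pos, alpha_neg, _ = res.x
    return alpha_pos - alpha_neg
```

A persistently positive value (learning more from gains than from losses) would be one quantifiable signature that could be estimated for human participants and RLHF agents alike and then compared.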

Following this, the research will proceed to the ‘development and application of an interventional control algorithm that determines the ratio of human feedback to real environmental feedback.’ This includes establishing a foundation model for RLHF, developing meta-learning scenarios, validating the performance of existing models, and demonstrating meta-learning capability through interventional control of the RLHF model.
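
At its simplest, such interventional control could mean adapting the arbitration weight from the earlier sketch online: raising the share of human feedback when it improves returns in the current environment and lowering it when it hurts. Again, a hypothetical sketch under assumed names (`FeedbackArbiter`, `step_size`), not the project's actual algorithm.

```python
class FeedbackArbiter:
    """Adapts the human-feedback ratio alpha from observed returns.

    Hypothetical controller: nudges alpha toward whichever reward signal
    has recently yielded higher returns. The step size is an assumed
    hyperparameter, not a value from the project.
    """
    def __init__(self, alpha: float = 0.5, step_size: float = 0.05):
        self.alpha = alpha
        self.step_size = step_size

    def update(self, return_with_human: float, return_env_only: float) -> float:
        # Move alpha toward the better-performing feedback source,
        # clipped to the valid [0, 1] range.
        if return_with_human > return_env_only:
            self.alpha = min(1.0, self.alpha + self.step_size)
        else:
            self.alpha = max(0.0, self.alpha - self.step_size)
        return self.alpha
```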

Within this series of steps, my contribution will likely focus on “behavioral analysis and neuro-cognitive bias analysis.”

Additionally, apart from the ‘analytical work,’ there will be opportunities to ponder questions from the humanities. From a hermeneutic perspective, humans inherently live with ‘bias.’ As biological organisms living in ‘time and space,’ humans constantly adapt to diverse environments, revising pre-existing assumptions in light of learned experience and genetic inheritance. Yet the development of AI now aims to transcend this bias that is inherent to us as biological organisms. Will this be a return to the failed attempts of modernity, or an attempt to open a new door in a postmodernity that has yet to find an exit?

Time will tell.
