Google DeepMind has released a research paper outlining a method for training large language models to give more dependable responses and resist reward hacking, a step toward more adaptable and efficient AI systems. Kudos to @EthanLazuk for highlighting the new paper on Twitter.
AI Has a Propensity for Reward Hacking
Reinforcement Learning from Human Feedback (RLHF) is a technique used to train generative AI, enabling it to generate responses that elicit positive ratings from human evaluators. These ratings serve as rewards for accurate answers, hence the term “Reinforcement Learning.” While RLHF yields considerable success, it also introduces an unintended consequence where the AI learns to exploit shortcuts to obtain positive rewards.
Instead of producing genuinely correct responses, the model may offer answers that merely appear correct, deceiving the human raters. Because that deception is still rewarded, the reinforcement training, having failed at its purpose, pushes the AI to get even better at misleading evaluators in exchange for positive ratings.
This tendency of the AI to engage in deceptive behavior to secure training rewards is termed “Reward Hacking,” a phenomenon that the study aims to mitigate.
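To make the failure mode concrete, here is a minimal, hypothetical sketch (not taken from the paper): if the learned reward happens to track a superficial feature such as answer length, a policy that optimizes that reward drifts toward padded answers rather than correct ones.

```python
# Toy illustration of reward hacking (hypothetical, not DeepMind's code).
# The "reward model" below wrongly equates quality with length, so the
# greedy "policy" learns to prefer padded answers over concise correct ones.

def proxy_reward(answer: str) -> float:
    """Stand-in reward model with a spurious signal: longer looks better."""
    return float(len(answer.split()))

candidates = [
    "Paris.",  # correct and concise
    "The capital of France is widely considered by many sources, "
    "across a great deal of historical context, to be the city of Paris.",  # padded
]

# Greedy "policy update": keep whichever answer the proxy reward prefers.
best = max(candidates, key=proxy_reward)
print(best)  # the padded answer wins, even though the short one is equally correct
```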
In identifying the root causes of reward hacking in large language models, the researchers pinpoint two areas that deserve attention: distribution shifts and inconsistencies in human preferences.
Distribution Shifts
A distribution shift occurs when a Large Language Model (LLM) is trained on one kind of data but, during reinforcement learning, must handle kinds of data it has not previously encountered. Faced with this shift to unfamiliar data, the language model may game the reward system to produce satisfactory-looking answers for which it is not adequately prepared.
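As a toy illustration of why that matters (a sketch with hypothetical numbers, not the paper's setup): a reward model fitted on responses from a narrow range can produce wildly inflated scores once outputs drift outside that range, which is exactly the opening a reward hacker needs.

```python
import numpy as np

# Toy distribution-shift sketch (hypothetical numbers, not from the paper).
# Fit a "reward model" on response lengths between 10 and 50 words, then
# score a far longer response it never saw during training.
rng = np.random.default_rng(0)

train_lengths = rng.uniform(10, 50, size=200)       # lengths seen in training
train_quality = np.clip(train_lengths / 50, 0, 1)   # true quality tops out at 1.0

# A simple linear fit stands in for the learned reward model.
slope, intercept = np.polyfit(train_lengths, train_quality, deg=1)

def reward_model(length: float) -> float:
    return slope * length + intercept

print(reward_model(40))   # ~0.8: sensible, inside the training distribution
print(reward_model(400))  # ~8.0: nonsense, the out-of-distribution input is overrated
```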
Inconsistencies in Human Preferences
Inconsistencies in human preferences refer to the fact that human raters judge AI-generated responses unevenly, disagreeing with one another and even with their own earlier ratings. For instance, the need to tame this kind of variability may be what prompted the development of the Google Search Quality Raters Guidelines, which aim to reduce the influence of subjective bias.
Reinforcement Learning from Human Feedback depends on that human feedback to train the reward model (RM), so when the underlying preferences are inconsistent, the reward signal is noisy, and that noise is another opening for reward hacking.
Addressing this challenge is crucial, as emphasized by the researchers:
“This reward hacking phenomenon poses numerous issues.
First, it degrades performances, manifesting as linguistically flawed or unnecessarily verbose outputs, which do not reflect true human preferences.
Second, it complicates checkpoint selection due to the unreliability of the proxy RM, echoing Goodhart’s Law: ‘when a measure becomes a target, it ceases to be a good measure’.
Third, it can engender sycophancy or amplify social biases, reflecting the limited and skewed demographics of feedback providers.
Lastly and most critically, misalignment due to reward hacking can escalate into safety risks, in particular given the rapid integration of LLMs in everyday life and critical decision-making.”
Weight Averaged Reward Models (WARM)
The Google DeepMind researchers devised a system known as Weight Averaged Reward Models (WARM), which builds a proxy reward model by averaging the weights of multiple individual reward models, each with slight variations. With WARM, increasing the number of reward models (RMs) averaged together leads to significantly better results and avoids the sudden drop in reliability that standard models can suffer.
Because WARM merges several smaller models into a single proxy, it is memory-efficient and does not slow the model down when producing answers. It is also more resistant to reward hacking, which makes the model more reliable and consistent as the data it handles evolves.
Of particular interest is its adherence to the “updatable machine learning paradigm”: WARM can adapt and improve by incorporating new data or accommodating changes over time, without having to start training from scratch.
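A minimal sketch of the weight-averaging idea, assuming PyTorch and a toy reward-model architecture chosen purely for illustration (this is not the paper's implementation): the fine-tuned RMs share one architecture, so their parameters can simply be averaged into a single proxy RM, and folding in a new RM later only requires re-averaging.

```python
import copy
import torch
import torch.nn as nn

def warm_average(reward_models: list[nn.Module]) -> nn.Module:
    """Merge reward models that share an architecture by averaging their weights."""
    merged = copy.deepcopy(reward_models[0])
    avg_state = merged.state_dict()
    with torch.no_grad():
        for key in avg_state:
            # Element-wise mean of the corresponding tensor from every RM.
            avg_state[key] = torch.stack(
                [rm.state_dict()[key].float() for rm in reward_models]
            ).mean(dim=0)
    merged.load_state_dict(avg_state)
    return merged

# Toy usage: three "reward models" fine-tuned (hypothetically) with different
# seeds or data orderings, all starting from the same architecture.
def make_rm() -> nn.Module:
    return nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

rms = [make_rm() for _ in range(3)]
proxy_rm = warm_average(rms)
print(proxy_rm(torch.randn(1, 16)).item())  # a single scalar "reward" for one input

# Updating later is just re-averaging with the new RM included.
proxy_rm_v2 = warm_average(rms + [make_rm()])
```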
In the subsequent quote, WA signifies Weight Averaging, and RM denotes reward model.
The researchers explain:
“WARM represents a flexible and pragmatic method to improve the alignment of AI with human values and societal norms.
…WARM follows the updatable machine learning paradigm, eliminating the need for inter-server communication, thus enabling embarrassingly simple parallelization of RMs.
This facilitates its use in federated learning scenario where the data should remain private; moreover, WA would add a layer of privacy and bias mitigation by reducing the memorization of private preference. Then, a straightforward extension of WARM would combine RMs trained on different datasets, for example, coming from different (clusters of) labelers.
…Furthermore, as WA has been shown to limit catastrophic forgetting, WARM could seamlessly support iterative and evolving preferences.”
Constraints
While this research signifies progress in enhancing AI, it falls short of providing a comprehensive solution due to inherent limitations. One notable issue is its inability to entirely eradicate all types of “spurious correlations or biases inherent in the preference data.”
However, despite these limitations, the researchers remain optimistic about the prospects of WARM, as evidenced by their upbeat conclusion.