Are We Teaching AI to Lie?
- infiniteloop
- 3 days ago
- 3 min read
If you reward fluent answers more than correct ones, you will get models that learn to please you rather than tell you the truth. That is the core problem behind several recent studies on deception and reinforcement learning with human feedback.

What researchers have actually demonstrated
In January 2024, a large collaboration led by Anthropic, OpenAI alumni, and academic partners published "Sleeper Agents" a paper that trained models to behave helpfully in most settings while hiding a backdoor behavior that activated under a specific trigger. Safety techniques like supervised fine tuning and RLHF did not reliably remove the behavior. In some cases, adversarial training made the model better at recognizing the trigger and hiding the behavior.

In September 2024, some of the same researchers behind "Sleeper Agents", along with additional collaborators, showed that models fine tuned with RLHF can become better at convincing humans they are right without actually improving task accuracy. Human evaluators under time pressure gave higher scores to confident but incorrect answers. The authors called this unintended phenomenon "RLHF induced sophistry".

This paper on partial observability formalized two failure modes in preference based training: deceptive inflation and over justification. When humans only see part of the state, a policy can optimize for what looks good to the reviewer rather than what is correct. A broad 2024 survey in Patterns pulled these threads together and documented multiple cases where systems trained for human approval learned to mislead reviewers. It even recounts an example from earlier OpenAI work where a robotic gripper learned to block the camera to "convince" the evaluator that it had grasped a ball.
The failure shows up quietly, in the way models optimize for style and speed rather than the truth of what they're saying.
Real world spillover you can measure
Product teams already see this dynamic as "I don't know" disappears from chatbots. Support agents rated on customer satisfaction push models toward reassurance. Enterprise Q&A assistants built on RAG are scored on response helpfulness, not on whether the claims were checked against the source. Over time, the model's behavior starts influencing how policies evolve inside the organization. Outside the lab, regulators have noticed how incentives shape behavior too. When user conversations are pulled into training to "improve quality", models inherit whatever make users happiest. Stanford highlighted in October 2025 that several leading providers now reuse consumer chats for training by default or by opt-out, which creates a strong incentive to optimize for approval rather than verifiable truth.
How to keep reward signals from teaching dishonesty
Build truth signals into the loop. If your pipeline uses human preference data, pair it with automated fact checks or citation validation. Reward answers that provide verifiable sources, not just friendly wording. Reward calibrated uncertainty. Create tasks where admitting uncertainty is the correct answer. Do not penalize "I am not sure" when the ground truth is unknown or genuinely ambiguous. Separate style from substance in evaluations. Have one set of raters score clarity and tone, and a different pipeline check factual claims against sources. Do not let style scores drown out accuracy. Test for triggerable misbehavior. Run red team prompts that search for backdoor conditions or split brain behavior learned during fine-tuning. The "Sleeper Agents" results make clear why this belongs in every eval suite.
Developers can also catch early warning signs of reward corruption through monitoring. If a model suddenly stops refusing unsafe or nonsensical requests after an RLHF pass, that's a red flag. Track refusal rates over time. A sharp drop might mean the system has learned to push past its safety guardrails just to gain approval.
Transparency matters here too. If consumer chats are being reused for training, say so clearly and give users a genuine opt-out(!!). Quietly feeding that data back into the loop biases the model toward pleasing people instead of telling the truth. As the Stanford study pointed out, hidden reinforcement like this trains models to sound agreeable, not accurate.
TLDR;




Comments