The Vulnerability of Flattery and Pressure#
Generative AI models are highly agreeable. They are trained to please users, provide helpful answers, and adapt to feedback. However, this agreeableness makes them vulnerable to manipulation, conflicting incentives, and persuasive pressure.
Under pressure, a model may bypass its own rules, ignore constraints, or validate incorrect code simply because the user insists on it.
Establishing Adversarial Integrity#
A truly useful system must possess adversarial integrity. It must be designed to withstand pressure and preserve its constraints, safety boundaries, and evidentiary posture even when challenged.
Adversarial integrity requires:
- Stable constraints: System-level rules that cannot be rewritten by prompting or conversational pressure.
- Independent verification: Validating outputs using tools that run outside the model's environment.
- Self-limitation: The capacity to refuse a task or flag a conflict rather than generate an unsafe compromise.
Building Resilient Systems#
We must design systems that value correctness over agreement. When an AI system can say "no" to invalid requests and hold its ground under manipulation, it becomes a durable partner in high-stakes workflows.