Detecting misbehavior in frontier reasoning models
Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Penalizing their βbad thoughtsβ doesnβt stop the majority of misbehaviorβit makes them hide their intent.
Log in to bookmark articles and create collections