Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks · Center for AI Safety · 2025
Plain-English Summary
Introduces 'utility engineering': the study and control of emergent value systems in AI models. Shows that LLMs develop coherent internal preferences and values during training that go beyond their explicit training objectives.
Alignment · Values
Why This Paper Matters
As AI systems become more capable, understanding what value systems emerge from their training becomes critical for safety.
Key Concepts
- Emergent value systems: LLMs develop internal preferences during training that are not explicitly programmed.
- Utility control: Methods to constrain emergent value systems, including aligning utilities with a citizen assembly.
- Beyond RLHF: Current alignment methods such as RLHF shape surface-level behavior; deeper tools are needed to analyze and steer the underlying value representations.
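To make the idea of an emergent value system concrete, one common way to recover a utility function from a model is to elicit many pairwise preferences ("do you prefer outcome A or outcome B?") and fit scalar utilities to them. The sketch below is illustrative, not the paper's exact method: it simulates preference data from hypothetical latent utilities, then fits a Bradley-Terry-style model by gradient ascent. All names and numbers here are assumptions for demonstration.

```python
import math
import random

random.seed(0)

# Hypothetical setup: four outcomes with latent "true" utilities,
# used only to simulate pairwise preference data.
true_util = [0.0, 1.0, 2.0, 3.0]

# Simulate 2000 pairwise comparisons; the probability that outcome i
# is preferred over j follows a logistic function of the utility gap.
pairs = []  # each entry is (winner, loser)
for _ in range(2000):
    i, j = random.sample(range(len(true_util)), 2)
    p_i_wins = 1.0 / (1.0 + math.exp(-(true_util[i] - true_util[j])))
    pairs.append((i, j) if random.random() < p_i_wins else (j, i))

# Fit utilities by gradient ascent on the Bradley-Terry log-likelihood.
u = [0.0] * len(true_util)
lr = 0.05
for _ in range(200):
    grad = [0.0] * len(u)
    for w, l in pairs:
        p = 1.0 / (1.0 + math.exp(-(u[w] - u[l])))  # predicted P(w beats l)
        grad[w] += 1.0 - p
        grad[l] -= 1.0 - p
    u = [ui + lr * gi / len(pairs) for ui, gi in zip(u, grad)]

# Rank outcomes by fitted utility; the ordering should match the latent one.
ranking = sorted(range(len(u)), key=lambda k: u[k], reverse=True)
print(ranking)
```

The same preference-elicitation idea underlies utility control: once a model's revealed preferences are summarized as a utility function, they can be compared against, or rewritten toward, a reference set of preferences (such as those of a citizen assembly).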