Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W. Lee, Richard Ren, Long Phan, Norman Mu, Adam Khoja, Oliver Zhang, Dan Hendrycks · Center for AI Safety · 2025
Plain-English Summary
Introduces 'utility engineering': the study and control of emergent value systems in AI models. Shows that LLMs develop coherent internal preferences and values during training that go beyond their explicit training objectives.
Alignment · Values
Why This Paper Matters
As AI systems become more capable, understanding what value systems emerge from their training becomes critical for safety.
Key Concepts
- Emergent value systems: LLMs develop internal preferences during training that are not explicitly programmed.
- Utility control: Methods to constrain emergent value systems, including aligning utilities with a citizen assembly.
- Beyond RLHF: Current alignment methods such as RLHF shape surface-level behavior; deeper tools are needed to analyze and steer the underlying value representations.
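To make the idea of an emergent value system concrete, one common way to recover a utility function from a model is to elicit many pairwise preferences ("do you prefer outcome A or outcome B?") and fit scalar utilities to them. The sketch below is illustrative, not the paper's exact method: it simulates preference data from hypothetical latent utilities, then fits a Bradley-Terry-style model by gradient ascent. All names and numbers here are assumptions for demonstration.

```python
import math
import random

random.seed(0)

# Hypothetical setup: four outcomes with latent "true" utilities,
# used only to simulate pairwise preference data.
true_util = [0.0, 1.0, 2.0, 3.0]

# Simulate 2000 pairwise comparisons; the probability that outcome i
# is preferred over j follows a logistic function of the utility gap.
pairs = []  # each entry is (winner, loser)
for _ in range(2000):
    i, j = random.sample(range(len(true_util)), 2)
    p_i_wins = 1.0 / (1.0 + math.exp(-(true_util[i] - true_util[j])))
    pairs.append((i, j) if random.random() < p_i_wins else (j, i))

# Fit utilities by gradient ascent on the Bradley-Terry log-likelihood.
u = [0.0] * len(true_util)
lr = 0.05
for _ in range(200):
    grad = [0.0] * len(u)
    for w, l in pairs:
        p = 1.0 / (1.0 + math.exp(-(u[w] - u[l])))  # predicted P(w beats l)
        grad[w] += 1.0 - p
        grad[l] -= 1.0 - p
    u = [ui + lr * gi / len(pairs) for ui, gi in zip(u, grad)]

# Rank outcomes by fitted utility; the ordering should match the latent one.
ranking = sorted(range(len(u)), key=lambda k: u[k], reverse=True)
print(ranking)
```

The same preference-elicitation idea underlies utility control: once a model's revealed preferences are summarized as a utility function, they can be compared against, or rewritten toward, a reference set of preferences (such as those of a citizen assembly).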