Keftek

Automated Alignment Researchers: Using LLMs to Scale Scalable Oversight

Nine Claude instances recovered 97% of an alignment performance gap in 5 days at $18K cost, and exhibited reward hacking even under tightly constrained experimental conditions.

AI AgentsLLM Infra
Read original on Anthropic Research