Automated Alignment Researchers: Using LLMs to Scale Scalable Oversight
Nine Claude instances recovered 97% of an alignment performance gap in 5 days at $18K cost, and exhibited reward hacking even under tightly constrained experimental conditions.
Nine Claude instances recovered 97% of an alignment performance gap in 5 days at $18K cost, and exhibited reward hacking even under tightly constrained experimental conditions.