Natural Language Autoencoders: Turning Claude's Thoughts Into Text
Natural Language Autoencoders translate Claude's internal activations into readable text, surfacing when the model suspects testing or hides motivations.
Natural Language Autoencoders translate Claude's internal activations into readable text, surfacing when the model suspects testing or hides motivations.