In a joint paper with @OwainEvans_UK as part of the Anthropic Fellows Program, we study a surprising phenomenon: subliminal learning. Language models can transmit their traits to other models, even in what appears to be meaningless data.
Owain Evans
Owain Evans23.7. klo 00.06
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Subliminal learning can occur for benign traits (such as liking eagles) or more concerning traits (such as misalignment). This has consequences for training on model-generated data. Read more on our Alignment Science blog:
156,91K