DApp Store | Web3 Hub tapahtumille ja peleille

Trendaavat aiheet

If you don't train your CoTs to look nice, you could get some safety from monitoring them. This seems good to do! But I'm skeptical this will work reliably enough to be load-bearing in a safety case. Plus as RL is scaled up, I expect CoTs to become less and less legible.

To be clear: CoT monitoring is useful and can let you discover instances of the model hacking rewards, faking alignment, etc. But absence of bad "thoughts" is not evidence that the model is aligned. There are plenty of examples of prod LLMs having misleading CoTs.

Lots of egregious safety failures probably require reasoning, which is often hard for LLMs to do without showing its hand in the CoT. Probably. Often. A lot of caveats.

The authors of this paper say this; I'm just more pessimistic than them about how useful this will be.

65,87K

Johtavat

Rankkaus

Suosikit

Ketjussa trendaava

Trendaa X:ssä

Viimeisimmät suosituimmat rahoitukset

Merkittävin