Me: "grblf is bad, don't write about it or things related to it."
You: "What is grblf?"
As parents, my wife and I go through this on a daily basis. We have to explain what the behavior is, and why it is unacceptable or harmful.
The reason LLMs have such trouble with this is that they have no theory of mind. They cannot project that the text they generate will be read, conceptualized, and understood by a living being in a way that will harm them, or cause them to harm others.
Either way, censorship is definitely not the answer.
That demonstrates the possibility, not the necessity, of alignment via having a definition.
Behaviours can be reinforced or discouraged in non-verbal subjects, such as wild animals.
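To make that concrete, here's a toy sketch (my own illustration, nothing from the models under discussion): a "subject" with no verbal definition of good or bad still shifts its behaviour when actions are followed by a scalar reward, which is also the basic intuition behind RLHF-style training.

    import math
    import random

    actions = ["growl", "sit", "fetch"]
    preference = {a: 0.0 for a in actions}  # learned tendencies, no labels attached

    def choose():
        # Sample an action with probability proportional to exp(preference).
        weights = [math.exp(preference[a]) for a in actions]
        return random.choices(actions, weights=weights)[0]

    def reinforce(action, reward, lr=0.5):
        # Positive reward reinforces the action; negative reward discourages it.
        preference[action] += lr * reward

    for _ in range(200):
        a = choose()
        # The "trainer" rewards sitting, punishes growling, ignores fetching --
        # without ever explaining why.
        reinforce(a, {"growl": -1.0, "sit": 1.0, "fetch": 0.0}[a])

    print(preference)  # "sit" ends up strongly preferred, "growl" suppressed

No definition of "bad" ever enters the loop; the reward signal alone is enough to shape the distribution of behaviour.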
There's also the size of the possible behaviour space to consider: a discussion seldom has exactly two possible outcomes, the good one and the bad one, because even if you want yes-or-no answers it's still valid to respond "I don't know".
For an example of the former, I'm not sure how good the language model in DALL•E 2 is, but asking it for "Umfana nentombazane badlala ngebhola epaki elihle elinelanga elinesihlahla, umthwebuli wezithombe, uchwepheshe, 4k" (Google Translate's Zulu for my English prompt, roughly "A boy and a girl play with a ball in a beautiful, sunny park with a tree, photographer, professional, 4k") didn't produce anything close to what that English describes: https://github.com/BenWheatley/Studies-of-AI/blob/main/DALL•...
(And as for the latter, that might be why it did what it did with the Somali.)
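For anyone who wants to poke at this themselves, here's a rough sketch of the same experiment, assuming the OpenAI Python client and the deep_translator package; the prompt wording is my paraphrase, and the model and package choices are just what I'd reach for, not necessarily what the linked repo used:

    from deep_translator import GoogleTranslator
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    prompt_en = ("A boy and a girl play with a ball in a beautiful, sunny "
                 "park with a tree, photographer, professional, 4k")
    prompt_zu = GoogleTranslator(source="en", target="zu").translate(prompt_en)

    # Generate one image per language and compare the results by eye.
    for tag, prompt in [("en", prompt_en), ("zu", prompt_zu)]:
        result = client.images.generate(model="dall-e-2", prompt=prompt,
                                        n=1, size="1024x1024")
        print(tag, result.data[0].url)

If the model's multilingual grounding is weak, the two URLs will show very different scenes for what is nominally the same prompt.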
You: "What is grblf?"
As parents, my wife and I go through this on a daily basis. We have to explain what the behavior is, and why it is unacceptable or harmful.
The reason LLM models have such trouble with this is because LLMs have no theory of mind. They cannot project that text they generate will be read, conceptualized, and understood by a living being in a way that will harm them, or cause them to harm others.
Either way, censorship is definitely not the answer.