At the moment, we're still trying to turn the question from its easy and obvious natural-language form ("give me what I meant to ask for") into one that is amenable to being solved in the world of weights and biases.[0]
Several of the AI labs have published work on this topic, but it looks like e.g. "interpretable models" or philosophy of ethics, etc.
So what we're doing today with AI alignment research is the equivalent of asking Ada Lovelace how to prevent the Therac-25 deaths, with the best response she could give being "Why would you even do that? Just don't write the wrong commands on the punched cards."
[0] Some, myself included, are optimistic that early AI can help us do that. Yudkowsky appears to be dismissive of all such options, but I don't see why any near-term AI would care to insert errors into later types of AI, so we've only got our own blind spots to look out for, and we had those already…
…but all the AI doomers I hang out with seem to consider me a terrible optimist, so YMMV.