It could work like this: a large number of AIs are constructed. A small subset of these, those that can only be used for good, forms the core of a training set; a number of AIs that can also be used for bad are added to it. The AI builder is exposed to this training set repeatedly, and on each exposure he is rewarded for correctly categorizing each AI by its potential for good or bad use. Once the builder demonstrates that he can reliably distinguish the AIs that can only be used for good from those that can be used for either, he is set loose to construct a new AI, after which he is compelled to render (and publicize) a judgement on its potential for good or bad use. Alternatively, the builder can be tasked with picking out only-good AIs from a larger mixed set of good and bad ones.
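The scheme above is essentially supervised binary classification with the builder in the classifier's seat. A minimal sketch, treating the builder as a simple perceptron; the feature representation, the labels, and the reward-as-weight-update rule are all illustrative assumptions, not part of the original proposal:

```python
import random

random.seed(0)

def make_ai(dual_use):
    # Toy stand-in for an AI: two numeric "capability" features.
    # Assumption: dual-use AIs tend to score higher on the second feature.
    features = [random.gauss(0, 1), random.gauss(0, 1)]
    if dual_use:
        features[1] += 2.0
    return features, dual_use

# Training set: good-only AIs plus AIs that can also be used for bad.
training_set = ([make_ai(dual_use=False) for _ in range(50)]
                + [make_ai(dual_use=True) for _ in range(50)])

# The "builder" as a perceptron, corrected on each mis-categorized exposure.
weights, bias = [0.0, 0.0], 0.0
for _ in range(20):  # repeated exposures to the training set
    random.shuffle(training_set)
    for features, dual_use in training_set:
        score = sum(w * x for w, x in zip(weights, features)) + bias
        if (score > 0) != dual_use:  # wrong call: no reward, adjust
            sign = 1 if dual_use else -1
            weights = [w + sign * x for w, x in zip(weights, features)]
            bias += sign

# Only after demonstrating competence is the builder trusted to judge a new AI.
accuracy = sum((sum(w * x for w, x in zip(weights, f)) + bias > 0) == d
               for f, d in training_set) / len(training_set)
new_ai, _ = make_ai(dual_use=True)
judgement = sum(w * x for w, x in zip(weights, new_ai)) + bias > 0
print(f"training accuracy: {accuracy:.2f}, "
      f"judgement on new AI: {'dual-use' if judgement else 'good-only'}")
```

Note the sketch also exposes the scheme's weak point: the builder only ever learns the labels it was trained on, so the quality of the final judgement is bounded by whoever labeled the training set in the first place.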
Quite a tautology.