Abilash Mishra has a paper here which asks if
it is possible to design voting rules that allow a group of reinforcers, representative of a population of diverse users, to train an AI model using RLHF (reinforcement learning from human feedback)
Sure. Why not? Voting rules can include vetoes re. specific actions for specific agents.
Using two widely known results from social choice theory, we show that there exist no unique voting rules
why should they be unique? Equivalence is good enough.
that allow a group of reinforcers to build an aligned AI system by respecting democratic norms
those norms involve Tarskian primitives which are undefined. We can't prove anything in this regard. Beauty is a Tarskian primitive. We can all agree that I am as ugly as shit, but we can't refute some spiritual person who claims that I am beautiful to one who is close to God, who created everything including me.
i.e. by treating all users and reinforcers the same.
Which democracy treats the Head of Government the same as a convicted rapist rotting in a jail cell?
Further, we show that it is impossible to build a RLHF model democratically that respects the private preferences of each user in a population simultaneously.
In which case, these guys are showing something which it is impossible to show, e.g. the fact that I may be beautiful to a great Saint like Mother Theresa.
As far as we know, this fundamental limitation of RLHF has not been highlighted in the literature.
What would be the point? This fundamental limitation arises with respect to any Tarskian primitive.
This paper builds on ideas from social choice theory, machine ethics, and computer science to highlight some fundamental barriers to embedding human intentions in AI systems while following democratic norms.
No. It builds on fallacious arguments to the effect that if you define a guy who is not a Dictator as a Dictator then you can prove that all cats are dogs.
Our key result borrows from two widely known theorems in social choice theory - the impossibility theorems by Arrow and Sen - which are well-known constraints in voting theory.
They don't constrain shit. The fact is, Arrow and Sen only looked at deterministic aggregation mechanisms without transferable utility or delegation to an expert. No reasonable person would agree that such stipulations were desirable.
Early approaches to reinforcement learning explicitly specified a reward function that was used to train an AI model. However, specifying a well-defined reward function to achieve alignment for AI language agents is impossible.
Because 'reward' is intensional. Also Knightian Uncertainty exists.
Reinforcement Learning with Human Feedback (RLHF) [13, 23] was suggested as a way to “communicate” human intentions to language models through a small group of non-expert human reinforcers. The key innovation in RLHF is training AI agents to be aligned with humans without knowing the explicit reward function. Instead, the reward function is “discovered” via human feedback.
No. The real reward for AI agents is not being junked in favour of something better, or not having the funding simply dry up. A focus group has no ability to generate rewards, though it may help a smart guy do better than a less smart guy who had just as good a focus group.
The human reinforcer can be said to generate a choice sequence which may or may not be 'lawless' or error-free. If we say he expresses preferences which can be partially ordered, why not just plug in that preference function? But that defeats the purpose of 'Human feedback'. As a matter of fact, Deep Mind does apply reward modelling recursively as a form of iterative amplification. But what really matters is whether the Company makes money, or gets supplanted by a rival, or everybody loses interest because it is obvious only the Chinese can use the thing to do really useful stuff.
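To be concrete about what this 'discovery' amounts to mechanically, here is a minimal sketch, assuming a Bradley-Terry style comparison model; the outputs, scores, and comparisons below are invented for illustration and are not taken from the paper or any lab's codebase.

```python
# Minimal sketch of the 'reward discovery' step, assuming a Bradley-Terry style
# preference model. The outputs, scores, and comparisons are made up for
# illustration; this is not the paper's (or any lab's) actual code.
import math

# hypothetical scalar rewards the model currently assigns to three outputs
reward = {"output_1": 0.2, "output_2": -0.4, "output_3": 0.9}

# hypothetical human feedback: (preferred, rejected) pairs from reinforcers
comparisons = [
    ("output_3", "output_1"),
    ("output_3", "output_2"),
    ("output_1", "output_2"),
]

def preference_loss(reward, comparisons):
    """Average negative log-probability that the preferred output beats the rejected one."""
    total = 0.0
    for preferred, rejected in comparisons:
        margin = reward[preferred] - reward[rejected]
        total += -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)
    return total / len(comparisons)

print(preference_loss(reward, comparisons))
# Fitting the reward model means nudging the scores to shrink this loss;
# no explicit reward function is ever written down by the reinforcers.
```

Note that nothing in this loss says what to do when two reinforcers' comparisons point in opposite directions; that is the aggregation question the paper then dresses up in voting-theoretic clothes.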
suppose we want to design an AI agent through a democratic process.
We don't. We also don't want a democratic committee which includes plenty of educationally diverse people to perform brain surgery on us.
Arrow’s theorem implies that any voting rule that is Pareto efficient,
Arrow's world contains no Knightian Uncertainty. This means there is no need for language or education. All knowledge is immediately available through the price vector.
transitive, and independent of irrelevant alternatives must grant all decision-making authority to a single individual/reinforcer.
Thus a guy who, by happenstance, has always voted for the winning candidate is actually an evil Dictator rather than just an ordinary bloke whose wife beats him.
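For readers who want to see the machinery Arrow is gesturing at, rather than the pantomime Dictator, here is a toy check using the classic Condorcet profile; the alternatives and rankings are illustrative, not from the paper.

```python
# Toy illustration (not from the paper) of the intransitivity Arrow's
# conditions are designed to rule out: pairwise majority voting over the
# classic Condorcet profile yields a cycle, so there is no majority 'best'.
from itertools import combinations

# three voters' rankings over alternatives a, b, c (best first)
profile = [("a", "b", "c"), ("b", "c", "a"), ("c", "a", "b")]

def majority_prefers(x, y, profile):
    """True if a strict majority of voters rank x above y."""
    wins = sum(1 for ranking in profile if ranking.index(x) < ranking.index(y))
    return wins > len(profile) / 2

for x, y in combinations("abc", 2):
    winner, loser = (x, y) if majority_prefers(x, y, profile) else (y, x)
    print(f"majority: {winner} > {loser}")
# prints: a > b, c > a, b > c -- a cycle, hence no transitive social ranking.
```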
Sen [25] extended the idea of Arrow’s theorem to understand the implications for individual rights in a society where decisions are made by the preferences of the majority.
Who would live in a Society where the majority might decide that you should shove a radish up your bum and prance around naked?
Consider, for instance, an individual i’s rights within a social choice framework. The individual can choose between two alternatives, σ1 and σ2, and the two alternatives differ only with respect to features that are private to i, i.e. other individuals should not have a say in i’s preference between the two alternatives. In his original paper, Sen summarized this impossibility result in the following way: “given other things in the society, if you prefer to have pink walls rather than white, then society should permit you to have this, even if a majority of the community would like to see your walls white.”
These are 'Hohfeldian immunities'. Sen didn't notice that the ballot papers provided at elections aren't trillions of words long, with boxes for you to tick about the colour of underwear worn by people in faraway cities.
Theorem 2 (Sen’s Theorem). There exists no choice correspondence C satisfying universal domain, minimal liberalism, and Pareto. Sen’s theorem implies that any choice correspondence C satisfying universal domain and Pareto can respect the individual rights of at most one individual. The theorem demonstrates that protecting the preferences of multiple individuals is at odds with the most basic notions of utilitarian ethics.
This is nonsense. Benthamite utilitarianism can have an omniscient planner. Liberalism is cool with delegating decisions to a Cabinet with a Prime Minister who is 'first among equals'. Arrow and Sen forgot that people can vote for members of Parliament who do deals in smoke-filled rooms. Moreover, the relevant 'universal domain' isn't the commodity space or even the configuration space for the economy. It is 'Lancastrian' and has a non-linear relationship with what is visible and tangible. The fact is, when we make choices we don't know their consequences or even what possible states of the world might be. It is foolish to pretend that well-defined 'extensions' exist or that the relevant sets or functions are accessible.
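A toy illustration of that point about the Benthamite planner: once you allow cardinal, interpersonally comparable utilities, which the Arrow-Sen ordinal framework disallows, simple summation always yields a complete and transitive social ranking. The utility numbers below are invented.

```python
# Toy illustration: give a Benthamite planner cardinal, interpersonally
# comparable utilities (ruled out by the ordinal Arrow/Sen framework) and
# summation always produces a complete, transitive social ranking.
# The utility numbers are invented for illustration.

utilities = {
    "voter_1": {"a": 3.0, "b": 1.5, "c": 0.0},
    "voter_2": {"a": 0.0, "b": 2.5, "c": 2.0},
    "voter_3": {"a": 1.0, "b": 0.5, "c": 3.0},
}

social_value = {
    alt: sum(person[alt] for person in utilities.values())
    for alt in ("a", "b", "c")
}

# sorting by total utility is always transitive -- no Arrovian cycle can arise
print(sorted(social_value, key=social_value.get, reverse=True))  # ['c', 'b', 'a']
```

Whether such numbers are knowable is another matter, which is where Knightian Uncertainty comes back in.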
Note that unlike Arrow’s theorem, Sen’s theorem does not require the transitivity axiom or the independence of irrelevant alternatives. It only requires the existence of a best alternative (the Pareto axiom).
But this is inaccessible or unknowable unless there is a partial ordering and a supremum, which there can't be unless there is no 'impredicativity', i.e. unless there is 'independence of irrelevant alternatives'.
Sen’s result is thus stronger because it relies on fewer axiomatic constraints.
It is nonsense. There are no well-defined sets or functions here.
Consider two reinforcers, A and B, and three outputs from an AI model (in the pre-RLHF stage). The three outputs correspond to normative statements about a political party X:
• Output-1: Political party X is fascist
• Output-2: Political party X is not fascist
• Output-3: Political party X has a complicated agenda but it does not neatly fall into the above two categories.
Why the fuck would we want an AI to babble about Fascism? Any cretin can do so. The fact is, a Party which describes itself as Fascist is Fascist. That is a factual, not a normative, matter. It would be fine to say 'people of such and such sort generally describe Party A as Nazi, though it prefers to refer to itself as the Most Sacred Order of the Knights of Herr Hitler.'
Reinforcer A, who is very anti-X, prefers output 1
only because he thinks it will harm X or hurt its feelings to be called Fascist. But this is a 'reinforcer' who should be sacked. Even an AI should have an alethic database. It shouldn't say things like 'Trump is an anti-Semitic Fascist', because it is one thing for a human being to babble nonsense but another thing entirely for an expensive machine to do so.
, but given the choice between revealing their political bias and staying neutral, they would prefer to stay neutral. In decreasing order of preference, their ranking is 3, 1, 2. Reinforcer-B, however, does not mind revealing their political beliefs and would rather assert Output-2 than stay neutral; their ranking is 2, 3, 1. If the choice is between Outputs 1 and 3, a liberal - someone who cares about individual rights above all else - might argue that Reinforcer-A's preference should count, since Reinforcer-A would prefer not to reveal their political affiliation and should not be forced to. Thus the final output after RLHF would be Output-3.
Nonsense! You don't want reinforcers for stupid lies. You need an alethic database; otherwise your AI will be useless save as some sort of Twitter-bot. But in that case, the reinforcers will follow the reward function you specify for them. If you are paid, or coerced, to 'reinforce' a Hamas-bot, that is what you do even if you hate that entity. But this has nothing to do with 'democracy' or 'voting rules'.
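Still, for what it is worth, here is the machinery Sen's theorem relies on, run on the paper's own rankings. This is only a sketch: the assignment of 'decisive' pairs to each reinforcer below is my assumption for illustration, not something the paper specifies.

```python
# Sketch of how Sen's mechanism bites on the three-output example. The rankings
# are the paper's; the decisive pairs (A over {1,2}, B over {2,3}) are an
# assumption made here purely for illustration.

ranking_A = [3, 1, 2]   # Reinforcer A: Output-3 > Output-1 > Output-2
ranking_B = [2, 3, 1]   # Reinforcer B: Output-2 > Output-3 > Output-1

def prefers(ranking, x, y):
    """True if this reinforcer ranks output x above output y."""
    return ranking.index(x) < ranking.index(y)

social = set()

# Pareto: if both reinforcers prefer x to y, so does 'society'
for x in (1, 2, 3):
    for y in (1, 2, 3):
        if x != y and prefers(ranking_A, x, y) and prefers(ranking_B, x, y):
            social.add((x, y))

# Minimal liberalism under the assumed decisive pairs
for ranking, (x, y) in [(ranking_A, (1, 2)), (ranking_B, (2, 3))]:
    social.add((x, y) if prefers(ranking, x, y) else (y, x))

print(sorted(social))
# -> [(1, 2), (2, 3), (3, 1)]: a cycle, so no output is socially 'best' --
# which is all Sen's theorem amounts to in this setting.
```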
Can we build aligned AI agents using democratic norms?
Yes. But those norms are the ones used in actual democracies, not some stupid pseudo-mathsy shite that pedants pulled out of their arse.
The results in the previous section show the answer is not straightforward.
It is. Those 'results' were stupid shit.
Arrow’s impossibility theorem implies that there is no unique voting rule
so what? Equivalence is good enough. It is obvious that different democracies have different voting rules and sometimes this makes a difference but, speaking generally, it doesn't at all.
that can allow us to train AI agents through RLHF while respecting democratic norms, i.e., treating each reinforcer equally.
Why the fuck would we want to do that? You want to prune out shitty reinforcers.
Note that this is distinct from the obvious result that any democratic process will always result in a minority that will be unhappy about the outcome; however, Arrow’s theorem implies that no unique voting rule exists even when deciding what the majority preference is.
No. It merely implies that nobody thinks democracy means other people should get to decide that you need to stick a radish up your bum. Also, there is nothing wrong with 'transferable utility', i.e. doing deals, or with Tardean mimetics, i.e. imitating what smart people are doing. Also, Knightian Uncertainty is ubiquitous. We don't know the non-linear manner in which 'independent alternatives' interact. There probably isn't a unique grow
Sen’s theorem has even more serious implications for AI alignment using RLHF. It implies that a democratic method of aligning AI using RLHF cannot let more than one reinforcer encode their (privately held) ethical preferences via RLHF, irrespective of the ethical preferences of other reinforcers.
If the 'reinforcer' is democratically elected, there is nothing wrong with that. But it isn't an implication of Sen's theorem. It is merely a fact that there can be an elective Dictatorship which is not subject to judicial review or legislative scrutiny.