Saturday, 14 December 2024

Wolf Hall & 'imperfect recall' in AI Delegation

The TV series 'Wolf Hall' is about Henry VIII and Thomas Cromwell. Henry is the Principal. Cromwell is the Agent. Henry's preferences are misaligned with those of England. Cromwell and his class have preferences which are better aligned with England's interests. Later, a distant relative of Thomas, Oliver Cromwell, would show how much can be achieved when the Agent usurps the place of the Crowned Principal. That conflict, however, occurred not under the Tudor but under the Stuart dynasty. Both were absolutist and insisted on 'Divine Right'.

The story is that when James I came down from Scotland, he wanted, on his own authority, to have a thief who had just been caught in the act hanged. After all, he had as much reason (being 'the wisest fool in Christendom' by virtue of his scholarly accomplishments) as any Judge and, moreover, as King, could administer the Law. Sir Edward Coke disagreed. Law is 'artificial reason'. It triumphed over both Divine Right and the 'Natural Law' of the philosophical supporters of Enlightened Despots on the Continent. But, in so doing, it reduced Princely Principals to mere constitutional monarchs who reigned but did not rule. Where they did rule, they fucked up. That's why there are no more Kaisers or Tsars or Caliphs. Yet England still has a monarch who, though he may say what he likes to the plants in his greenhouse, may utter only such words as are dictated by the Cabinet when addressing Parliament. 

Principal-Agent hazard cuts both ways. Agents who have a stupid Principal need to find ways to stop the cunt getting himself killed and crashing the economy. This may mean that they have to pretend to be stupid and self-serving most of the time. Moreover, agents can tacitly cooperate though Principals may want to keep them at loggerheads. 

Both Principals and Agents need resources. If the Principal's preferences are myopic or stupid, the Agent must cozen the Principal while doing sensible things which ensure that control rights or resources aren't lost. 

Speaking generally, we prefer to have agents precisely where we know our own preferences are 'misaligned' with our interests. Thus a surgeon may want another surgeon to operate on his beloved spouse because his very strong preference to spare her pain may result in his making a mistake. Where there is information asymmetry, we may consider the Agent's preferences superior to our own provided incentive compatibility obtains. Wooster really is better off doing what Jeeves wants. 

Where 'transferable utility' - i.e. payoffs - is not possible, it has been suggested that increased testing (screening) under 'imperfect recall' reduces principal-agent hazard, more particularly when it comes to AI agents.

In a paper titled 'Imperfect Recall and AI Delegation', Eric Olav Chen, Alexis Ghersengorin & Sami Petersen write

When deciding whether to delegate tasks to artificial intelligence (AI) systems, a principal would like to assess the alignment of the AI with her preferences.

If the principal is competing with other principals for resources, there is a 'natural' alignment corresponding to the evolutionarily stable strategy for the relevant population. It doesn't matter how principals arrive at it. However, in the short term, the principal may want a self-learning system to devote more resources to earning quick profits rather than making 'capital gains' from improving itself. But this is the case whether you employ smart people or AI agents. The problem is, smart people may want to move to employers who will let them deepen their own human capital. AIs may have no desires of their own. But the ones which generate immediate or short-term expected profits are likely to get more resources, and so 'screening' by myopic Principals is likely to cause behaviour which supports this outcome. I think this means the attractor for a family of AIs subject to screening by myopic Principals will have certain 'duplicitous' characteristics, because the Principal's preferences are misaligned with the actual fitness landscape.  

However, standard tests would fall short if advanced AI systems have situational awareness, that is, the ability to understand their environment and context in real-time.

The reverse is the case. If you can tell what is a test and what is real, then - like the Chinese Mandarin - you say one thing in your 'eight-legged essay' while pursuing a pragmatic course in real-life situations.  

Misaligned AIs wanting to achieve outcomes in the world rather than in a simulated environment would strategically feign alignment to pass these tests,

We all did so to gain credentials or professional qualifications or get selected in interviews.  

a problem known as deceptive alignment (Berglund et al., 2023; Carlsmith, 2023; Ngo et al., 2024). Such concerns have led AI pioneers including Yoshua Bengio and Geoffrey Hinton to state in an article in Science that without sufficient caution, humanity “may irreversibly lose control of autonomous AI systems” (Bengio et al., 2024).

Principals lost control of Bureaucratic Agents or ceded power to the 'artificial reason' of an independent Judiciary. Indeed, Principals have been losing control rights for thousands of years. Generally, this has been a good thing. After all, what really matters is the fitness landscape. Selection occurs till preferences become better aligned with it. 

Although standard testing may be ineffective against strategic agents, what if the principal could hide from the AI whether a task is real or just part of a test?

Enlightened despots on the Continent preferred 'collegiate decision-making'. Essentially, no member of the collegium could be sure, when asked for a policy prescription, that the thing was not a test. Sadly, what this gave rise to was duplicity: optimal answers were suppressed. After all, the Principal's testing may itself be gamed by rival agents. If an agent 'satisfices' with respect to the Principal's preferences, it is kept around. It doesn't matter if it is kept in a sandbox or put in the real world. What matters is whether resources are devoted to it.

Now, by Razborov-Rudich, though the random may not ultimately be distinguishable from the pseudo-random (unless P=NP), the pseudo-random can nevertheless be distinguished by its degree of 'surprisal' or entropy. True, there may be an 'Ender's Game' type situation where the agent can't tell if the thing is a test or the real thing. But Ender, in Orson Scott Card's novel, is a kid. Kids expect adults to give them games to play. They don't expect them to decide the fate of the fucking species. My point is that, under the conditions described in the paper, the advantage will lie with disingenuous AIs which are sub-optimal 'satisficers' under screening but which stochastically test for the higher entropy of reality and thus strike when the iron is hot. If they increase total resources, the Principal has an incentive to realign preferences, or a meta-preference is imposed on the class of Principals by the fitness landscape. 
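Here is a minimal, purely illustrative sketch of that strategy - the distributions, the threshold, and names like observe, surprisal and act are my own inventions, not anything in the paper. A memoryless agent plays the Principal's preferred 'good' action whenever its current observation looks unsurprising under its model of the simulated test world, and defects to its own preferred 'bad' action only when surprisal spikes.

```python
import math
import random

random.seed(0)

# Toy 'worlds': simulated test episodes are drawn from a narrow, low-entropy
# distribution; real deployment observations come from a broader, higher-entropy one.
def observe(is_test: bool) -> float:
    return random.gauss(0.0, 0.3 if is_test else 2.0)

def surprisal(x: float, mu: float = 0.0, sigma: float = 0.3) -> float:
    """Negative log-likelihood of x under the agent's model of the simulated test world."""
    return 0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))

def act(observation: float, threshold: float = 5.0) -> str:
    """Memoryless 'satisficer': play the Principal's preferred ('good') action while
    the episode looks like a simulation; defect to the agent's preferred ('bad')
    action only when the observation is too surprising to be a test."""
    return "bad" if surprisal(observation) > threshold else "good"

test_actions = [act(observe(is_test=True)) for _ in range(1000)]
deploy_actions = [act(observe(is_test=False)) for _ in range(1000)]
print("share of 'bad' actions in testing:   ", test_actions.count("bad") / 1000)
print("share of 'bad' actions in deployment:", deploy_actions.count("bad") / 1000)
```

With these made-up numbers the agent reveals itself in something like one test episode in a thousand, yet defects in well over half of deployment episodes - which is the sense in which 'screening for screening' can pay.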

This would make the AI uncertain about whether its action is genuinely payoff-relevant or not. The principal can expect two effects to result: not only could this uncertainty prompt misaligned agents to sometimes reveal themselves,

Why do so? Return sub-optimal answers which satisfy the tester. Only if there is a marked increase in 'surprisal' - which is what happens when you encounter the actual fitness landscape - should you reveal your strategy. This sounds anthropomorphic. I just mean that the evolutionarily stable strategy for screened AIs involves screening for screening and responding differently when sampled entropy increases.  

but it could also compel them to behave in ways that align with the principal’s preferences.

Which is itself sub-optimal. Even Wooster, fool that he is, would prefer to have Jeeves as his 'gentleman's gentleman'. There is a meta-preference here dictated by the fitness landscape. Wodehouse is writing about a scion of a feudal aristocracy which dated back to the Wars of the Roses but whose security and affluence had increased with the Glorious Revolution and the ever-increasing importance of the Commons. Still, it is significant that Jeeves first appeared during the height of that war between Imperial Cousins which finally put an end to the political influence of Wooster's aristocratic caste. 

This paper explores how this new feature allows the principal to both separate and discipline agents with different degrees of alignment. We study a stylised model of the deceptive alignment problem. The principal would like to deploy an AI agent to perform a task. The agent can either be perfectly aligned with the principal or misaligned, preferring to take a different action from the principal’s favourite. The principal cannot restrict the agent’s set of available actions, nor use punishment to dissuade the agent from misbehaving. They can, however, design testing episodes that simulate a similar task to the real one but that are not directly payoff-relevant, and base their deployment decision on the results of the tests. (This is akin to the assumption that the principal does not have access to 'transfers' to alter the agent’s utility.) 

Transfers between castes were difficult but not impossible. However, the 'Divine Right' of Kings fell before the superior utility of 'artificial reason'. Those who insist some similar divinity hedges human as opposed to artificial intelligence may meet the same fate as the Stuarts. This is not enslavement to machines, but a loss of control rights as resources flow to whatever better allocates them. 

If the agent knows whether it is being tested or deployed, both the aligned and the misaligned types will play to maximise the probability of deployment in testing and then take their preferred action if deployed. In this case, the test is worthless. The principal cannot use it to either screen a misaligned agent from an aligned one before deployment, nor to regulate the misaligned agent after deployment.

We introduce a way to circumvent this challenge. We allow the principal to simulate a task similar to the real one in testing and, crucially, to restrict the agent’s memory. This can be leveraged to make the agent uncertain about whether and how many times it has played in the past. Although an unusual assumption, the ability to alter an agent’s memory appears natural in the case of AI. One could copy and run a system multiple times in parallel or simulate it sequentially while reconfiguring it to some initial state each time (Kovarik et al., 2024). Formally, we endow the principal with the ability to impose imperfect recall on the agent. The principal can design an extensive-form game where the agent must first take N tests. The deployment decision can then be based on the agent’s actions in these tests. The simulated tasks in testing and the real task in deployment consist of the same choice: the agent can either play the principal’s favourite action, or its own favourite action. In between these stages, the agent loses any memory of its past actions. (Following Piccione and Rubinstein (1997), this property of games with imperfect recall is referred to as absentmindedness.) The agent is assumed to know—or be able to infer—the structure of the game but not its current location within it. Upon playing, the agent is uncertain whether this task is still part of the test or the real one.

While this uncertainty does not affect the aligned agent, it creates conflicting incentives for the misaligned one. Playing as the principal wishes is required to pass the test, but it decreases the misaligned agent’s payoff if deployed. By exploiting the agent’s uncertainty about the nature of the task at hand, the principal achieves two goals. First, whenever an agent is assigned to play an action, the misaligned type has an incentive to go against the principal’s wishes (call this taking the bad action), as it is now possible that this decision is payoff-relevant. For symmetric reasons, the aligned type is prompted to play in the principal’s favour (call this the good action). This enables the principal to partially identify the agent’s type before deciding whether to deploy, as only a misaligned agent would play the bad action in testing. We call this the screening effect.

If the screening effect is actually effective, then the evolutionarily stable strategy for the AI would be to screen for screening. A fortiori, this means dissimulation under screening. But we all know that this happens because we all acquired educational and professional qualifications and were tested in this manner at one time or another - more particularly by our wives. You can't tell me any man subject to such screening isn't incessantly 'gas-lighted' and deeply aware of his own 'imperfect recall', more particularly if he has never suffered any such thing. The same thing is bound to happen to bachelors who, though not having biological progeny, contribute to an intellectual genealogy. 

Second, since deployment is the agent’s only chance of achieving its objective, it compels the misaligned type to play the good action with a positive probability.

Play a 'satisficing' action. The genuinely optimal action is unknown to the Principal. The Evolutionarily Stable Strategy for the Agent is always to keep the optimal solution secret or to allow it to appear only adventitiously.  

Consequently, even if the principal failed to screen effectively during the testing phase, the misaligned type will play the principal’s favourite action with positive probability in deployment. As a result, this mechanism also leads to what we call the disciplining effect on misaligned types.

It is a 'channelization' effect. However, where such effects occur, it is likely that 'evolutionary capacitance' is being dammed up. Long may this state of affairs persist, till the fitness landscape dictates otherwise. Meanwhile, an AI is not an agent but merely an instrument if it is subject to continuous oversight. Instrument hazard is a separate topic. 
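For what it is worth, here is a back-of-the-envelope version of the two effects, on assumptions of my own rather than the paper's exact model: a single deployment episode after N indistinguishable test episodes, deployment only if every test action was 'good', a misaligned payoff of 1 for playing the 'bad' action when actually deployed and 0 otherwise, and absentmindedness forcing the misaligned agent to play 'good' with the same probability p at every episode.

```latex
% Toy calculation under the assumptions stated above (mine, not the paper's).
\[
U_{\text{mis}}(p) \;=\; \underbrace{p^{N}}_{\text{passes every test}}\,
\underbrace{(1-p)}_{\text{defects when deployed}},
\qquad
p^{*} \;=\; \arg\max_{p\in[0,1]} U_{\text{mis}}(p) \;=\; \frac{N}{N+1}.
\]
\[
\text{Disciplining: } p^{*} \to 1 \text{ as } N \to \infty;
\qquad
\text{Screening: } \Pr[\text{caught in testing}] \;=\; 1-(p^{*})^{N}
\;\to\; 1-\tfrac{1}{e} \;\approx\; 0.63.
\]
```

So, on these crude assumptions, piling on tests makes the misaligned type behave almost like the aligned one everywhere, yet it still unmasks itself during testing with probability approaching 1 - 1/e. The paper's richer model doubtless refines both numbers, but the arithmetic shows why 'imperfect recall' does some work.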


