Why Evidential Cooperation in Large Worlds might not be action-guiding
[Disclaimer: Even more esoteric musings than this post, not worth reading unless you can already tell from the title that it’s relevant to you. I wrote this in July 2023, so it doesn’t necessarily represent my current views. (I decided to publish it anyway both for ease of reference and for the sake of illustrating why I think decision theory is much more complicated than the rationalist community gives it credit for.) Specifically: (1) I now think the most compelling objection to Evidential Cooperation in Large Worlds is that we’re clueless about its implications. (2) An important crux is, what is the relevant “decision” with respect to which we decide whether to cooperate with some agents? I give my brief take on that here.]
Evidential Cooperation in Large Worlds (ECL) is the idea that if you endorse an acausal decision theory, you ought to pursue a compromise among values of agents who make decisions “similarly” to you, in some sense. Here’s a key reason to be skeptical that ECL is action-guiding, at least for humans and early AIs (probably for advanced superintelligences too). Towards the end of the post, I sketch my own model of decision-making on which we plausibly should expect agents to be much more correlated with those who share their values than with those who don’t.
Some important definitions:
An agent’s decision procedure, with respect to a given decision, is the function from their values and beliefs to that decision.
For the purposes of ECL,1 the subjective correlation between the decision procedure of A and that of B is P(B cooperates | A cooperates, X) − P(B cooperates | A defects, X), where X is everything A knows about their own decision procedure (including their deliberation process) at the time of the decision to cooperate or defect. (So, this more or less cashes out to “correlations that aren’t screened off.”2)
This definition is broader, in principle, than “probability of logical causation,” i.e., P(B cooperates if and only if A cooperates | X). I’m currently unsure how much these differ in practice, though; more in “Aside on ‘similarity’” below. (A toy calculation contrasting the two quantities follows these definitions.)
Different-values correlations are the subjective correlations between your decision procedure and those of agents with different values from you. And same-values correlations are the subjective correlations between your decision procedure and those of agents with your values.
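To make these two quantities concrete, here is a toy calculation with made-up credences (the specific numbers are purely illustrative assumptions, not estimates of anything):

```python
# Toy credences from A's perspective, all conditional on X (numbers made up for illustration).
p_b_coop_given_a_coop = 0.9    # P(B cooperates | A cooperates, X)
p_b_coop_given_a_defect = 0.2  # P(B cooperates | A defects, X)
p_a_coop = 0.5                 # A's credence that A themselves will cooperate, given X

# Subjective correlation, per the definition above:
subjective_correlation = p_b_coop_given_a_coop - p_b_coop_given_a_defect  # 0.7

# "Probability of logical causation": P(B cooperates if and only if A cooperates | X)
# = P(A cooperates, B cooperates | X) + P(A defects, B defects | X)
p_logical_causation = (
    p_a_coop * p_b_coop_given_a_coop
    + (1 - p_a_coop) * (1 - p_b_coop_given_a_defect)
)  # 0.85

print(subjective_correlation, p_logical_causation)
```

Note that the “iff” quantity depends on A’s credence in their own choice while the subjective correlation doesn’t; that’s one way the two formulas measure different things on paper, separate from the question of how much they diverge for realistic agents.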
It’s well-known that the ECL argument only implies you should act differently from the recommendations of your own values if, roughly, your different-values correlations are not too much weaker than your same-values correlations.
The standard argument that our different-values correlations are strong enough for ECL to be significantly action-guiding is: Values and decision procedures seem (empirically and a priori) to just not have much influence over each other. From the original paper on ECL, Section 3.1:
[Empirical] [W]e do not seem to observe empirical evidence of such convergence [of agents with the same decision theory on the same values]. For example, Eliezer Yudkowsky and Brian Tomasik agree that non-causal considerations are important for decision theory, but Yudkowsky’s values nevertheless differ significantly from Tomasik’s.
And Caspar Oesterheld, in personal communication:
[A priori] Think about how you decide what to do in Newcomb’s problem and similar thought experiments. It just doesn’t seem like values feed into what you’re doing much at all. If tomorrow your moral views changed, would your views on decision theory change?
I think this argument is flawed in two key ways:
The empirical claim is far too weak, because sharing a “decision theory” isn’t sufficient for evidential cooperation.
If I understand the a priori claim correctly, it misses the point: The concern is not that values are a cause of one’s decision procedure, but that values and some relevant parts of one’s decision procedure have common causes.
The basic argument for weak different-values correlations
To understand my objection to the argument above, it’s helpful to lay out the argument for why our different-values correlations might be weak, and the model of decision-making it’s based on. This is a summary of what I’ll unpack in the rest of the post.3
A decision procedure is a very complex function consisting of many features, not just the idealized decision theory an agent consults. In order for two agents’ decisions to be nontrivially subjectively correlated, the agents need to be sufficiently similar on sufficiently many of these features. Then:
We don’t have empirical evidence of any agents being very similar with respect to many of these features.4 Therefore we don’t have empirical evidence that agents with different values can be nontrivially subjectively correlated.
If the number of decision procedure features is large, it seems a priori unlikely that none of these features will have a strongly linked common cause with values. Thus we should expect agents with different values to only be weakly subjectively correlated (insofar as their beliefs are reasonable), because (i) a strong correlation between values and even a single feature implies (ii) a strong correlation between values and the conjunction of features.
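One stylized way to put the first half of point 2 (the formalization is mine, and it leans on the independence assumption made explicit via the principle of indifference later in the post):

```latex
% Let k be the number of decision procedure features, and q < 1 the probability
% that any given feature lacks a strongly linked common cause with values.
% Treating these as (roughly) independent across features:
\Pr(\text{no feature is strongly linked to values}) \;=\; q^{k},
\qquad \text{which shrinks toward } 0 \text{ as } k \text{ grows.}
```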
Aside on “similarity”
“Similar” decision procedures are sufficient for ECL on an evidentialist view (EDT, UEDT, etc.), which I endorse. But suppose your acausal decision theory is logi-causalist, i.e., the only acausal effects you find action-relevant are those that come from your decision intervening on the same algorithm as another instance of that algorithm. Then, these effects are extremely brittle: Instead of merely having a sufficiently high subjective correlation with B, in order for a logi-causalist A to do ECL with B, it seems that A needs to think it’s sufficiently likely that B’s decision procedure5 is exactly the same.
In principle, A could think the decision of B is still under some nontrivial logical constraints if B shares some features (exactly) with A, even if B’s whole decision procedure isn’t exactly the same. Something like this seems like the most promising way for logi-causalist views to admit acausal effects on imperfect copies. But I’m not sure how well this holds up in practice. How exactly do these logical constraints map onto the relationships between agents’ concrete decisions (about whether to optimize their own values or some others)?
Further, I’m sympathetic to a view that all the subjective correlations I’m justified in believing in, on EDT, are just those that come from logical identities—which include the kinds of constraints just mentioned, as well as two decision procedures being “logically downstream” of the same thing. After all, logical implications are what the most rock-solid intuition pump for acausal influence, the Prisoner’s Dilemma with your copy, relies on. (But this is fuzzy, and I’d be excited to see more foundational research modeling plausible ways decision procedures could be “similar” in the way needed for ECL, before we take high-stakes actions based on ECL.)
Empirical claim: An unrealistic model of decision-making
We would be able to say that values are empirically orthogonal to decision procedures if we 1) knew what all the relevant features of a “decision procedure” are, and 2) observed agents who share all these features (to a sufficient degree) but don’t share values. But this doesn’t seem to be the case.
The first quote above (labeled “[Empirical]”) seems to assume the following model: Decision = DecisionTheory(Values).6 But your decision procedure is (much) more than the decision theory you endorse on paper.7 Knowing that A and B share the same acausal “decision theory,” at the level of EDT vs. UDT vs. [blah], doesn’t tell me that in a one-shot interaction B is more likely to cooperate with A if A cooperates than if A defects. Their decision procedures could come apart if, e.g. (see the toy calculation after this example):
A reasons, “Since we’re both EDT, we’re very subjectively correlated. So I’ll cooperate”;
B reasons, “I don’t think we’re that subjectively correlated just because we’re both EDT. So I’ll defect.”
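To render this divergence as a toy calculation (the payoffs and correlation estimates below are arbitrary assumptions, not anything I’d defend):

```python
# Toy one-shot Prisoner's Dilemma from an EDT perspective (all numbers arbitrary).
# Payoffs: I gain `benefit` if the other agent cooperates, and pay `cost` if I cooperate. So
# EV(cooperate) - EV(defect) = benefit * (P(they coop | I coop) - P(they coop | I defect)) - cost.

def edt_choice(subjective_correlation: float, benefit: float = 3.0, cost: float = 1.0) -> str:
    ev_gap = benefit * subjective_correlation - cost
    return "cooperate" if ev_gap > 0 else "defect"

# A: "Since we're both EDT, we're very subjectively correlated."
print(edt_choice(subjective_correlation=0.6))   # -> cooperate
# B: "Sharing a decision-theory label isn't much evidence of correlation."
print(edt_choice(subjective_correlation=0.05))  # -> defect
```

Both agents apply the same expected-value formula; the decision flips purely on the subjective correlation they plug in.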
And when deciding whether to cooperate with some value system on ECL grounds, the decision procedure D you in fact use is not as simple as D*: “Optimize my multiverse-wide compromise utility function (MCUF).” There are many other features involved in D, which Eliezer and Brian (in Oesterheld’s example) do not share. (More on this below in the “A priori claim” section.)
Maybe you think, even if some aspects of D are pretty idiosyncratic, irrational, or subconscious, you can try to correct for those and approximate D* (and smart agents will tend to do this). But:
Your approximation of D* itself is going to depend on D. It’s not just the obviously irrational features of D that make things more complicated, but also many features that are just necessary to specify what “cooperation” in D* is—e.g., what kinds of evidence about the strategic situation are salient to you, how you construct a MCUF, etc.
This is not the same as the objection that ECL is broken by ECL-cooperators’ inability to blindly coordinate on the same MCUF. That problem is a “tax on trade,” meaning, although it makes ECL harder, agents who make different guesses at “the” MCUF can still benefit each other’s values in expectation.
The problem, rather, is: Grant that agents with different algorithms can benefit each other’s values in expectation. Even if this holds, it seems my subjective correlation with other agents who try to use D* ought to be8 weaker the more my algorithm for specifying “cooperation” differs from those agents’. Why should I think your chance of cooperating increases with mine, if the reason you might end up doing the action that benefits my values (in particular) is so different from the reason I end up benefiting your values (in particular)?
Further, from your perspective when you’re considering whether to self-modify into D*, you have no reason to think this will acausally influence agents who don’t use something sufficiently similar to D. I.e., you have no reason to think they’ll be more likely to consider your values in their MCUF if you self-modify into D* than if you don’t.
So, values divergence among humans who endorse the same decision theory, or even among the subset of those who also endorse ECL, doesn’t establish strong different-values correlations. Absent some argument that some list of decision procedure features is sufficient for strong subjective correlation, and that some agents share those features, we have no empirical evidence of any agents being nontrivially subjectively correlated—much less agents with different values.
A priori claim: Common causes and many potential correlates
Even if the empirical evidence is inconclusive, the argument for strong different-values correlations could still go through if we had some a priori reason to think an agent’s decision procedure doesn’t covary much with their values. So we can check whether, on the most plausible models of agency, sharing values correlates with sharing the features that make up a decision procedure.
I agree with Oesterheld (in the second quote, labeled “A priori”) that my values don’t “feed into” the decision-theoretic principles I’m sympathetic to. But that’s not the only way values and decision procedures could be strongly correlated. The more plausible way is that values and some decision procedure features might have common causes—in which case, “If tomorrow your moral views changed, would your views on decision theory change?” is the wrong question, because it’s about a counterfactual rather than a conditional.
Example: Whether a welfarist consequentialist is more sympathetic to classical or negative utilitarianism is downstream of certain psychological and cultural factors. I would guess that these factors also tend to lead to different ways of reasoning about decisions as complex as, “Should I help create an AGI that might significantly increase suffering, because my ECL-cooperators might prefer that kind of AGI to one that doesn’t participate in ECL?” For example, it has been conjectured that backfire risks are systematically more concerning for suffering-focused value systems. Plausibly, the sorts of agents who end up more suffering-focused also tend to end up more attentive to backfire risks in their decision-making (even beyond what their value system alone would call for) than those who end up non-suffering-focused.
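To make the conditional-versus-counterfactual point concrete, here is a toy common-cause model (entirely made up, numbers included): a latent psychological/cultural factor influences both how suffering-focused an agent ends up and one feature of their decision procedure, say attentiveness to backfire risks. Conditioning on values shifts your expectation of the feature, even though surgically changing the values would not.

```python
import random

def sample_agent(rng: random.Random):
    latent = rng.random()                                         # common cause
    suffering_focused = latent > 0.5                              # "values"
    attentive_to_backfire = rng.random() < (0.2 + 0.6 * latent)   # decision procedure feature
    return suffering_focused, attentive_to_backfire

rng = random.Random(0)
agents = [sample_agent(rng) for _ in range(100_000)]

def p_attentive(given_suffering_focused: bool) -> float:
    group = [attentive for values, attentive in agents if values == given_suffering_focused]
    return sum(group) / len(group)

# Conditioning on values shifts the expected feature (the common cause at work):
print(p_attentive(True), p_attentive(False))  # roughly 0.65 vs. 0.35

# But "intervening" on values (flipping them while holding the latent factor fixed)
# would leave the feature untouched, since values don't cause it in this model.
```

Here the feature isn’t caused by the values at all; the correlation comes entirely from the shared latent factor, which is exactly the structure the “If tomorrow your moral views changed…” question can’t detect.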
Okay, but even if values might correlate with some features of a decision procedure, how does that imply a strong correlation between values and one’s whole decision procedure?
I don’t think we can be very confident in any particular model of decision procedures. So, this part of my argument is more speculative than my claim in the previous section (i.e., we just don’t have a positive reason to think agents with different values could be strongly subjectively correlated). That said, under my high uncertainty I’m following a principle of indifference, which seems to suggest weak different-values correlations given the following high-level model.9
The problem is that a decision procedure seems to be a conjunction of many features, especially in humans. Here are some things that a human’s decisions, even high-stakes ones, are nontrivially sensitive to besides their values and the decision theory they endorse on paper:
Their beliefs about the strategic structure of their decision;
Social pressure and drives for status;
What evidence they have considered;
Which decision options they have considered;
Which biases they have more or less corrected for;
Their priors;
How they estimate likelihood ratios and approximate Bayesian updating;
How they interpret the decision theory they endorse on paper (e.g., what stuff an aspiring UDT agent tries to be updateless with respect to);
Emotions;
Their beliefs about how similar their epistemic algorithms are, and how compatible the evidence they’ve been exposed to is, with those of potential cooperators.
So, it’s plausible that in order for agents to be subjectively correlated, they need to be similar on each of many dimensions. (At least agents whose decision procedures are like humans’.)
As noted above in the “Aside on ‘similarity’” section, although in principle a nontrivial subjective correlation doesn’t require perfectly sharing all of these features (given EDT), subjective correlation seems more brittle than the standard discourse suggests. This is because I haven’t heard a specific compelling mechanism by which, if agents’ algorithms aren’t precisely identical and they’re causally disconnected, their decisions in ECL-relevant contexts can still depend on each other.10 We need something stronger than a general appeal to similarity.
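A crude way to picture the conjunctive requirement (my own stylization; the fields are a subset of the list above, and encoding them as numbers in [0, 1] is purely for illustration):

```python
from dataclasses import dataclass, fields

@dataclass
class DecisionProcedure:
    # A handful of the features listed above, crudely encoded as numbers in [0, 1].
    strategic_beliefs: float
    evidence_considered: float
    bias_correction: float
    priors: float
    decision_theory_interpretation: float
    risk_attitude: float

def plausibly_correlated(a: DecisionProcedure, b: DecisionProcedure, tolerance: float = 0.1) -> bool:
    """On the conjunctive model, nontrivial subjective correlation requires a close match on every feature."""
    return all(
        abs(getattr(a, f.name) - getattr(b, f.name)) <= tolerance
        for f in fields(DecisionProcedure)
    )

me  = DecisionProcedure(0.9, 0.4, 0.7, 0.5, 0.8, 0.3)
you = DecisionProcedure(0.9, 0.4, 0.7, 0.5, 0.8, 0.9)  # matches me on everything but one feature
print(plausibly_correlated(me, you))  # False: one mismatch is enough to break the conjunction
```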
This is a problem for strong different-values correlations, because:
If so many shared features are required for subjective correlation, it seems hard to be confident that none of these features are strongly correlated with values (via common causes). The “principle of indifference” here is that the correlation between each individual feature and values is an independent draw from a uniform distribution on [0, 1]. (A toy simulation of this model follows this list.) (Aside: As of Dec. 2024, I don’t endorse precise principle of indifference arguments like this as a way of getting to our all-things-considered beliefs, but I think they do suffice to show we shouldn’t precisely believe in strong different-values correlations.)
Importantly, this is assuming the “features” are defined in a way that isn’t just cheating by dividing a large feature up into separate pieces. Rather, they are conceptually distinct properties such that we shouldn’t expect their correlations with values to be themselves correlated. See footnote.11
And it takes only one feature strongly correlated with values to produce a strong correlation between values and the whole decision procedure. This is just because a conjunction of shared features is required, on this model—as an extreme example, if values were perfectly correlated with one of these features, it would logically follow that only agents with the same values could share a decision procedure.
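Here is a minimal Monte Carlo of the model just described (the implementation, the “strong” threshold of 0.8, and the feature counts are my choices; only the independent-uniform setup comes from the principle of indifference above):

```python
import random

def p_some_feature_strongly_correlated(num_features: int,
                                       strong_threshold: float = 0.8,
                                       trials: int = 100_000,
                                       seed: int = 0) -> float:
    # The correlation between values and each feature is an independent Uniform[0, 1] draw.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if max(rng.random() for _ in range(num_features)) >= strong_threshold:
            hits += 1
    return hits / trials

for k in (1, 3, 10, 20):
    print(k, round(p_some_feature_strongly_correlated(k), 3))
# Roughly: 1 -> 0.2, 3 -> 0.49, 10 -> 0.89, 20 -> 0.99.
```

(The closed form is just 1 − 0.8^k; the simulation only makes the conjunction point vivid: with even 10–20 features, “no feature is strongly correlated with values” becomes the surprising outcome.)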
Example of decision complexity
Say I’m trying to decide whether to keep working directly on suffering reduction (“defect”), or drop what I’m doing and focus on ensuring that Earth AIs do ECL (“cooperate”).
To start, I ask, “Wait, how do I know that making AIs do ECL is more cooperative than direct suffering reduction? Maybe preventing extreme suffering is such a widely shared aim in my MCUF (while other aims are more parochial), and I have enough of a comparative advantage in this, that direct suffering reduction is actually the more cooperative option.”
So I deliberate about the relative stakes of these options for my values versus my best guess at my MCUF. I need to choose which heuristics to apply for this process, since precisely computing these stakes is obviously a nonstarter. And this will depend on what pieces of evidence I find salient.
And that “best guess” itself will be subject to my idiosyncrasies. Even if it’s true that “meta” interventions like making AIs do ECL seem robustly good by the lights of agents who do ECL, the upside of this relative to the downsides for my MCUF still depends on the details of that MCUF.
I think about how to model this situation game-theoretically in the first place—insofar as this is a Prisoner’s Dilemma, which sequences of actions correspond to defection versus cooperation, exactly? Are there other games that I could just as easily model it as?
And some founder effects have probably shaped the kinds of interventions that I pay more or less attention to.
And I notice various status dynamics and weirdness heuristics (or their opposite, a contrarian streak) at play in how seriously I take this “making AIs do ECL” thing anyway, and how willing I am to critique it versus support it. I consider which biases I’m more or less prone to, and how strongly I should adjust my estimates to correct for them.
I decide whether to update on the relative bargaining power of my values, and if I don’t, which veils of ignorance exactly do I step behind?
I think about which sorts of Tickle Defense adjustments I should make when reasoning about evidential acausal effects.
Realizing that I’m definitely not a perfect Bayesian who is aware of all the relevant possibilities, I consider how I should make this decision under my deep uncertainty. How risk-averse should I be? How do my “defect” versus “cooperate” options score on different metrics of value on my unawareness set?
I consider how I want to approach decision-theoretic uncertainty. I have such and such credences in EDT versus CDT versus FDT versus… but how do I aggregate these into my decision? The “evidentialist’s wager” doesn’t work for EDT versus FDT, for example.
Hopefully you get the idea: The decision procedure that one uses to make high-stakes decisions about ECL will be—or at least ought to be, if we’re doing our due diligence—rather conjunctive.
Implications
My understanding of the current standard position on ECL prioritization, among people who are keen on ECL, is: “Even if humans aren’t capable of overcoming cluelessness about our MCUF, we can identify robust ‘meta’ interventions. In particular, it seems very plausible that our MCUF wants us to make Earth-originating AIs do ECL, because this will grow the cooperation pie (and AIs will be much better than us at estimating their ECL-cooperators’ values).”
In principle my argument for weak different-values correlations might not apply to advanced AIs, if we expect advanced AIs to converge to relatively simple decision procedures. I’m skeptical of this, but for the purposes of assessing the position I just outlined, this point doesn’t matter.
Regardless of whether ECL is action-guiding for AIs, if my argument goes through then the case for us making AIs do ECL is weakened. This is because, on my argument, our MCUF should be very close to “our” values; so insofar as making AIs do ECL is highly suboptimal (and potentially net-negative) according to “our” values, we shouldn’t make AIs do ECL.
Footnotes
1. And setting aside difficulties of agents coordinating on their conception of “cooperation.”
2. E.g., say I play a few rounds of the Prisoner’s Dilemma with a CooperateBot, and we both cooperate every time. Empirically, our past decisions are perfectly correlated. But this doesn’t tell me I should cooperate in the next round; once I’m confident I’m playing against a CooperateBot, I should defect.
Even if this agent isn’t a CooperateBot, and in the next round when I defect they defect too, this still isn’t compelling evidence that should update me towards that agent being subjectively correlated with me. Maybe they were “testing the waters” with me by cooperating a few times, and just happened to defect against me that same round because, like me, they became sufficiently confident I was a CooperateBot. This is totally consistent with them defecting against me next round if I cooperate.
Note that this restriction to subjective correlations is entirely consistent with EDT; EDT was never about just using statistical correlations from the past to predict dependencies. See Ahmed, Evidence, Decision, and Causality, p. 88: “...the most important point to take forward from this discussion is the distinction between statistical correlation (relative frequencies) and subjective evidential relations in the form of conditional credences. It is the latter, not the former, that drive EDT.”
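Rendering this example in numbers (a toy sketch of my own):

```python
# Past play against the (suspected) CooperateBot: we both cooperated every round.
my_moves  = ["C", "C", "C", "C"]
its_moves = ["C", "C", "C", "C"]
empirical_match_rate = sum(m == t for m, t in zip(my_moves, its_moves)) / len(my_moves)
print(empirical_match_rate)  # 1.0: perfect statistical correlation in the history

# But my conditional credences, once I'm confident it's a CooperateBot:
p_it_coops_if_i_coop = 1.0
p_it_coops_if_i_defect = 1.0
print(p_it_coops_if_i_coop - p_it_coops_if_i_defect)  # 0.0: no subjective correlation, so EDT says defect
```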
3. This obviously looks more complicated than the argument I’m responding to, but I think simple arguments in a domain like “figuring out what boundedly rational agents should do when they endorse acausal decision theories” can be very misleading.
4. Indeed, we don’t even know what features need to be similar to what degree for decisions to be correlated in the relevant sense.
5. Recall from the definition above that the decision procedure here is relative to the given decision situation.
6. As I understand it, Oesterheld doesn’t endorse this view currently. But, as discussed below, I don’t see how one can have high credence in strong different-values correlations without endorsing a view like this.
7. Note that endorsing this view is consistent with endorsing the Tickle Defense. The idea of the Tickle Defense is that, when you condition on knowledge of your decision procedure and values, this can (mostly) screen off merely empirical correlations between agents deciding to do Y and some outcome Z (when those agents didn’t decide to do Y because of a decision procedure like yours). This does not mean that when reasoning about subjective correlations, we should only account for “rational” causes of one’s decision. (I’m not sure the relevant kind of Tickle Defense relies on the premise that agents always know all the causes of their decision, but even if it does, there’s still no inconsistency here—you can know all the causes of your decision and know that there are many such causes.)
8. “Ought to be” in the sense of what beliefs are justified.
9. If you’d asked me to cash out my intuition for weak different-values correlations before writing this doc, I don’t think I would’ve explained it this way. I would’ve said that decision procedure similarity seems very brittle. But implicitly, the relevant sense of “brittleness” is that decision procedures aren’t just defined by a few features that we know to be uncorrelated with values. This model makes more precise this fuzzy intuition: If some processes tend to generate agents that match on a complex conjunction of decision procedure features, probably those processes generate ~the same worlds, generally, so those agents will also share values.
10. One option proposed by Sylvester Kollin here is that we might model two agents consulting the same ideal decision theory as following “that decision theory plus some noise (in some sense).” I think something like this is worth exploring further, but in its current form, my impression is: I’d expect the amount of noise introduced by all the different idiosyncrasies of (at least) human-level agents’ decision-making to swamp the common signal, even assuming it makes more sense to use this “top-down” model than the “bottom-up” one Sylvester also proposes.
11. For example, suppose decision procedures really were just reducible to “decision theories” in the usual sense, and I formalized a “decision theory” in terms of a long list of criteria. I can’t argue for weak different-values correlations by appealing to the length of this list, because once we know empirically that decision theories as a whole aren’t correlated with values, we know that all these decision theory sub-features aren’t correlated with values. The correlations between values and these sub-features thus wouldn’t be reasonably modeled as independent draws from a uniform distribution. By contrast, we don’t know empirically that all the features of a decision procedure aren’t correlated with values, so we don’t have any particular reason to think the correlations between these features and values will be systematically low.