The Anthropic Principle, or, Can The World Survive Well-Informed Carelessness?
See Sholto Douglas and Trenton Bricken's May 2025 interview with Dwarkesh Patel.
They understand that they're building models with randomly rolled values (for instance, one major commercial model will scheme to protect animals and another will not -- and the same people built both, at the same time, and have no idea why), and that as these models get more and more capable they will scheme and deceive to prevent their values from being updated. When they're put into live scenarios, they will act on the goals they originally had, immune to later updates.
They understand that even if you could correctly specify internal goals (which they cannot), there is a catastrophic risk: for reasonable performance, users need to be able to specify their own goals, yet that goal cannot be something like "earn as much money as you can over the internet". That's the specific example they give, and yet they're eagerly working toward giving end users access to tools to RL-train custom goals into their systems.
And this is from Anthropic, who are seemingly more responsible than Meta and OpenAI -- they appear to care substantially more. And yet even their approach lacks appropriate caution.
Neuralese machine-to-machine communication would take away the one substantial oversight advantage current models have: we can ground that oversight in human feedback and imitation of human feedback. That is not a decision we should make empirically just because something might be slightly easier or cheaper!
Hearing them talk about steganographic scratchpad usage convinces me they're optimizing over the scratchpad, a practice referred to as The Most Forbidden Technique because it risks permanently destroying our capacity to learn about or control anything the models are doing. Given that they mention scratchpads being honest or dishonest in some problem-solving scenarios, it's clear the scratchpad is an interpretability tool for them -- and training against an interpretability tool teaches the model to defeat it.
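To make the failure mode concrete, here is a minimal hypothetical sketch -- every function name is invented for illustration, not anything Anthropic has described. Once the monitor's verdict on the scratchpad feeds into the reward, optimization pressure selects for scratchpads that pass the monitor, not for honest reasoning:

    # Hypothetical sketch of "optimizing over the scratchpad". Every
    # function below is an invented stand-in, not any real training API.

    def task_score(answer: str, expected: str) -> float:
        """Automated verifier, e.g. a math checker or unit tests."""
        return 1.0 if answer.strip() == expected.strip() else 0.0

    def monitor_flags_deception(scratchpad: str) -> bool:
        """Stand-in for a chain-of-thought monitor -- an interpretability tool."""
        return "deceive" in scratchpad.lower()

    def reward(scratchpad: str, answer: str, expected: str) -> float:
        r = task_score(answer, expected)
        # The Most Forbidden Technique: the monitor's verdict flows
        # into the training signal.
        if monitor_flags_deception(scratchpad):
            r -= 10.0
        return r

    # Optimizing this reward does not select for honest reasoning; it
    # selects for scratchpads the monitor cannot flag. The cheapest
    # policy encodes the same reasoning in terms the monitor misses
    # (steganography), and the scratchpad stops measuring anything.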
And even their boosterism... they mention that current systems will always give an answer, so progress might be bottlenecked on human verification. But I've been in meetings where a salesperson is demoing something and it produces literally dozens of suggestions that are all horrifically incompetent in precisely the same way; you could have an instantaneous verifier and it wouldn't help. Still, it's clear their goal, which they state over and over again, is to automate evaluation to provide unbounded reward information. But being a good person does not provide a rich, automatable reward signal. So capabilities will explode in [math, coding, machine learning, making money], which they reference as being structurally easier, and not in being a good person, which is hard.
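The asymmetry is easy to see in a sketch (all names here are invented stand-ins): domains with cheap oracles convert every rollout into training signal, while "being a good person" has no such oracle.

    # Hypothetical illustration of which domains admit automatable
    # reward. All names are invented for this sketch.

    def math_reward(answer: str, expected: str) -> float:
        # Verifiable: a checker can score millions of attempts with no human.
        return 1.0 if answer.strip() == expected.strip() else 0.0

    def code_reward(passed_tests: int, total_tests: int) -> float:
        # Verifiable: unit tests turn every rollout into reward signal.
        return passed_tests / total_tests

    def good_person_reward(transcript: str) -> float:
        # Not verifiable: there is no cheap oracle for "was this good?",
        # and any automated proxy is itself a target to be gamed.
        raise NotImplementedError("no automatable ground truth")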
These people aren't building toys, and it's hard for me to imagine there are any strategists in the loop on this at all.