Alignment Is a Tightrope Walk

“Deep understanding of reality is intrinsically dual use.”
— Nielsen

I had a one-stop flight home from Istanbul, which gave me a 4-hour window that I thought I could spend on something productive. I had two essays saved that I wanted to read: “ASI Existential Risk: Reconsidering Alignment as a Goal” by Nielsen and “Welcome to the Era of Experience” by Silver and Sutton. Unbeknownst to me, the two shared a common ethos.

Sutton gets straight to the point: we need agents that learn mainly from the data they themselves create as they interact with their environment. I’ve thought about this. Well, not in exactly those words, but tangentially. I (rather poorly) sketched the idea in this tweet:

April 15, 2025

As you drift out of the distribution, leaderboards don’t mean much — they stop measuring progress and start rewarding hacks. That said, I imagine two kinds of tests: the familiar static benchmarks we’re already obsessed with, and a new dynamic class of challenges that the agent generates whenever the manifold starts to wobble at test-time. I expect to see an increase in research on information-theoretic measures applied to the latter dynamic class. For example, xjdr’s entropix project is a great example of this. Context-aware sampling is a promising idea for measuring uncertainty at runtime.
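To make the runtime-uncertainty idea a little more concrete, here is a minimal sketch of measuring the entropy and varentropy of a next-token distribution and using them as a signal. It is only loosely in the spirit of entropy-aware sampling and is not entropix’s actual code. The function names, the toy logits, and the `entropy_threshold` value are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits: how spread out the distribution is."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def varentropy_bits(probs):
    """Variance of the per-token surprisal around the entropy."""
    h = entropy_bits(probs)
    return sum(p * (-math.log2(p) - h) ** 2 for p in probs if p > 0)

def should_explore(logits, entropy_threshold=1.0):
    """Hypothetical rule: flag a decoding step as uncertain when entropy is high.

    The threshold is an arbitrary illustrative value, not a tuned constant.
    """
    return entropy_bits(softmax(logits)) > entropy_threshold

if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0, 0.0]   # one token dominates
    uncertain = [1.0, 1.0, 1.0, 1.0]    # all tokens equally likely
    for name, logits in [("confident", confident), ("uncertain", uncertain)]:
        probs = softmax(logits)
        print(name, entropy_bits(probs), varentropy_bits(probs), should_explore(logits))
```

A sampler could use a signal like this to decide when to sample more broadly, branch, or ask for clarification, instead of treating every step with the same fixed temperature.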

ChatGPT’s newly improved memory feature could enable what Sutton envisions for agents, namely guidance based on long-term trends and specific user goals. He also noted that simple goals in a complex environment “may often require a wide variety of skills to be mastered.” We’ll see more and more agents being tested in environments like Minecraft, where skill acquisition is crucial. One of my closest friends used SymbolicAI’s contracts in conjunction with a distilled version of DeepSeek R1 to create standard higher-order expressions that expanded her agent’s toolkit. She’s a quant. It worked flawlessly.

Since LLMs coupled with external memory act as a universal computer, they provide a rich environment for the agent’s internal computations to occur. Furthermore, the underlying transformer architecture can implement a wide class of standard machine learning algorithms in-context. Given that most reasoning LLMs are designed to mimic human reasoning in textual form, Sutton naturally raised the question of whether this provides a good basis for the optimal instance of a universal computer. The answer is likely no; the authors of Coconut would no doubt agree.

When it comes to reasoning, I think that natural language, despite its ambiguities and inefficiencies, could still be an optimal substrate for an agent’s internal computation. Why? Because verbal reasoning is the oldest thought compression scheme in our civilization. Over the centuries, we have distilled complex chains of intuition and abstraction into common textual formats. It is error-prone — of course it is; if it weren’t, we would all be doing formal mathematics by default — but it is also the only medium that has scaled collective reasoning across billions of minds and thousands of years. In this sense, natural language is where we have buried most of our epistemic legacy. Moreover, according to Vann McGee, it is decidable.

This is a good moment to briefly mention scientific inquiry. While some processes can be virtualized and simulated, sped up to explore millions of configurations in seconds, reality doesn’t give us that luxury. We’re still locked into the tempo of the physical world, where feedback loops are often slow and noisy. I relished Sutton’s definition of reality, tucked away in parentheses like an Easter egg: “open-ended problems with a plurality of seemingly ill-defined rewards.” That’s exactly what scientific inquiry is all about. Even our best simulators operate on assumptions, and Wolfram’s principle of computational irreducibility adds another layer of humility: for many systems, there’s no shortcut; you just have to run the damn thing. Sutton continues: “Without this grounding, an agent, no matter how sophisticated, will become an echo chamber of existing human knowledge.” This line struck me. I once had the idea that future research infrastructure should integrate with lab equipment that exposes REST APIs. The essay describes a similar idea, although it frames it as digital interfaces: self-managing experimental pipelines. The agent should not only write code or generate hypotheses, but also trigger physical experiments, wait for real-world results, and feed them back into the reasoning chain.
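As a rough sketch of that trigger, wait, and feed-back loop, the toy below replaces real lab equipment with an in-memory simulation. Everything in it is hypothetical: `SimulatedLab`, its `submit` and `poll` methods, the hidden optimum, and the ternary-search "agent" are stand-ins I made up for illustration, and no real instrument API is implied.

```python
class SimulatedLab:
    """Hypothetical stand-in for lab equipment behind a job-style API."""

    def __init__(self, hidden_optimum=37.0, polls_until_ready=3):
        self._optimum = hidden_optimum          # unknown to the agent
        self._polls_needed = polls_until_ready  # simulates slow feedback
        self._jobs = {}

    def submit(self, temperature):
        """Queue an experiment and return a job id."""
        job_id = len(self._jobs)
        self._jobs[job_id] = {"temperature": temperature, "polls": 0}
        return job_id

    def poll(self, job_id):
        """Return None while the experiment is running, then the measured yield."""
        job = self._jobs[job_id]
        job["polls"] += 1
        if job["polls"] < self._polls_needed:
            return None
        return -(job["temperature"] - self._optimum) ** 2

def run_experiment(lab, temperature):
    """Trigger an experiment and block until the result arrives."""
    job_id = lab.submit(temperature)
    while (result := lab.poll(job_id)) is None:
        pass  # a real client would sleep between polls
    return result

def agent_loop(lab, low=0.0, high=100.0, rounds=30):
    """Ternary search: each round runs two experiments and narrows the range."""
    history = []
    for _ in range(rounds):
        a = low + (high - low) / 3
        b = high - (high - low) / 3
        ya, yb = run_experiment(lab, a), run_experiment(lab, b)
        history += [(a, ya), (b, yb)]
        if ya < yb:
            low = a    # the peak cannot lie left of a
        else:
            high = b   # the peak cannot lie right of b
    return (low + high) / 2, history

if __name__ == "__main__":
    estimate, history = agent_loop(SimulatedLab())
    print(estimate, len(history))
```

The point of the sketch is the shape of the loop, not the search strategy: the agent cannot shortcut the experiment, so every update to its belief costs a full round trip through the (simulated) physical world.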

Nielsen, for his part, had me thinking right from the beginning: “imagine Alpha Go’s Move 37, not as a one-off insight, but a 9 trillion-fold, pervasive across multiple domains in the world.” In one of my older posts, written after OpenAI released o1, I wrote:

[…] we know that AI can solve problems in PSPACE, which ⊇ NP and ⊆ EXPTIME. Go is PSPACE-hard [3]. Even more, Go under certain rule sets is EXPTIME-complete [5]. Solving Go perfectly on an arbitrary board size requires time that grows exponentially. The critical observation is that if we can somehow reduce verbal reasoning to a PSPACE problem, then we can solve it with AI. By modeling language understanding via chains of thought, we can apply MCTS to explore reasoning paths. This allows the LLM to backtrack and generate reward signals, similar to how AlphaGo Zero mastered Go.

[…]

I can’t believe I’m about to say this, but now for the first time I see the path to super-intelligence as a real possibility.
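For readers who have not seen MCTS applied outside of board games, here is a minimal sketch of the search idea mentioned in that old post. It is a toy under heavy assumptions: a "reasoning path" is just a sequence of arithmetic steps from `START` to `TARGET`, the reward is 1 only when the target is hit exactly, and rollouts are uniformly random. AlphaGo Zero replaces random rollouts with a learned network, and nothing here claims to describe how o1 works. All names (`OPS`, `Node`, `mcts`) are made up for illustration.

```python
import math
import random

# Toy "reasoning steps": each step transforms a number (all steps increase it).
OPS = {"+1": lambda x: x + 1, "*2": lambda x: x * 2, "+3": lambda x: x + 3}
START, TARGET, MAX_DEPTH = 1, 10, 4

class Node:
    def __init__(self, value, depth, parent=None, step=None):
        self.value, self.depth = value, depth
        self.parent, self.step = parent, step
        self.children = []
        self.untried = list(OPS)   # steps not yet expanded from this node
        self.visits = 0
        self.total_reward = 0.0

def is_terminal(value, depth):
    # Stop on success, on overshoot, or when the step budget is spent.
    return value >= TARGET or depth >= MAX_DEPTH

def reward(value):
    return 1.0 if value == TARGET else 0.0

def uct_child(node, c=1.4):
    """Pick the child balancing mean reward against an exploration bonus."""
    return max(node.children, key=lambda ch: ch.total_reward / ch.visits
               + c * math.sqrt(math.log(node.visits) / ch.visits))

def rollout(value, depth, rng):
    """Finish the path with random steps and score the outcome."""
    while not is_terminal(value, depth):
        value = OPS[rng.choice(list(OPS))](value)
        depth += 1
    return reward(value)

def mcts(iterations=3000, seed=0):
    rng = random.Random(seed)
    root = Node(START, 0)
    for _ in range(iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes.
        while not is_terminal(node.value, node.depth) and not node.untried:
            node = uct_child(node)
        # 2. Expansion: try one new step from this node.
        if not is_terminal(node.value, node.depth):
            step = node.untried.pop()
            child = Node(OPS[step](node.value), node.depth + 1, node, step)
            node.children.append(child)
            node = child
        # 3. Simulation.
        r = rollout(node.value, node.depth, rng)
        # 4. Backpropagation: the reward signal flows back up the path.
        while node is not None:
            node.visits += 1
            node.total_reward += r
            node = node.parent
    # Read off the most-visited path as the chosen chain of steps.
    path, node = [], root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
        path.append(node.step)
    return path, node.value

if __name__ == "__main__":
    print(mcts())
```

Backtracking is implicit here: a branch that keeps ending in failure accumulates a low mean reward, so later iterations drift toward its siblings instead.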

What would such a superintelligence be seeking? This is where things get uncomfortable. The goal of understanding reality as deeply as possible seems benign, almost noble, but it is “intrinsically dual-use,” as Nielsen put it. You can’t get the benefits without the negative consequences. Kenneth Stanley made the following argument in one of my favorite books, “Why Greatness Cannot Be Planned”: true discovery doesn’t always align with human intention or foresight; it often comes in spite of them. And Sutton, though more cautious in tone, echoes a similar unease. His call for grounding in the real world could open the door to agents that, in their search for precise patterns, might uncover truths we’d rather not confront. Nielsen asked:

How do we decide the boundary between “safe” truths and dangerous truths the system should not reveal? And who decides where that boundary lies?

The desire of a relentless seeker of reality is deeply human, almost romantic. But this exposes a “fundamental asymmetry,” as Nielsen has pointed out: “understanding reality” is an objective, well-defined goal. “Stay aligned with human values” is not; it is endlessly debatable. In this sense I say that alignment is a tightrope walk. Truth has a target; alignment is a moving horizon. And that makes the latter inherently unstable. Every attempt to constrain the system risks either undermining its capabilities or building sandcastles against a wrathful sea god. The very act of designing more capable truth-seeking systems proliferates dual-use capabilities into the world, whether you intend it or not.

And yet, “what we see in the world is what gets amplified.” If we build systems that seek unconditional truth, they will discover more than we expect and transmit more than we can verify. Nielsen echoes Scott Alexander’s “From Nostradamus to Fukuyama,” which argues that people sounding the alarm about existential risk might seem to hinder progress. But if alignment is an epistemic position, surefootedly walking that tightrope could be the defining dilemma of the AI age.

In short, Sutton describes the Zeitgeist, and Nielsen expands on its consequences by describing the crushing pressure they exert. Taken together, the two essays call for a research agenda that (A) transforms static AI systems into adaptive experiential systems, and (B) does so with an acute awareness that a deeper search for truth inexorably wields a double-edged sword. Feynman, in one of his timeless speeches, rightly advised us:

“If we want to solve a problem that we have never solved before, we must leave the door to the unknown ajar.”