Taming Uncertainty: Contracts

“Testing shows the presence, not the absence of bugs.”
— Dijkstra

Since the early days of symbolicai, we have had a return type conversion mechanism (->), a feature that has excited me from the start. I had an intuition that it could be a powerful tool for early error detection, but I never fully explored its implications. That changed recently when I started thinking more deeply about its potential, which led to three weeks of research and experimentation and culminated in a working draft that I will share here.

This post is about using contracts to enforce semantic correctness of a generative process. I will explain how contracts work in symbolicai, the benefits they offer, and the challenges ahead. Most interestingly, I think contracts could be a promising approach to alleviating hallucinations. There’s still a lot to refine, but the potential here is too compelling to ignore.

With the advent of LLMs in software development, correctness is often treated as an afterthought. What if we could enforce correctness by design?

This is where Design by Contract (DbC) comes in. Originally introduced by Bertrand Meyer for the Eiffel programming language, DbC proposes that software components should operate under formal agreements, i.e. defining what must be true before execution (preconditions), what they promise after execution (postconditions), and what remains invariant throughout.

Correctness in traditional programming can be expressed using Hoare triples: {Pre} C {Post}. This means that if a computation C starts in a state satisfying Pre, it will produce a state satisfying Post. However, LLMs introduce uncertainty. Instead of absolute guarantees, we must consider probabilistic correctness. That’s why we need to rely on the probabilistic version of Hoare triples, i.e. {Pre} C {Post} [p], where p represents the probability that C produces a valid output.
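One way to make p concrete is to estimate it empirically: run C many times on inputs that satisfy Pre and count how often Post holds. Here is a minimal sketch of that idea. The noisy_double function is a hypothetical stand-in for an LLM call, and its 90% success rate, the input range, and the estimate_p helper are all assumptions made purely for illustration.

```python
import random


def pre(x: int) -> bool:
    # Precondition: the input must be non-negative.
    return x >= 0


def post(x: int, y: int) -> bool:
    # Postcondition: the output must be exactly twice the input.
    return y == 2 * x


def noisy_double(x: int, rng: random.Random) -> int:
    # Hypothetical stand-in for an LLM call: correct about 90% of the time.
    return 2 * x if rng.random() < 0.9 else 2 * x + 1


def estimate_p(n_trials: int, seed: int = 0) -> float:
    # Monte Carlo estimate of p in {Pre} C {Post} [p].
    rng = random.Random(seed)
    valid = 0
    satisfied = 0
    for _ in range(n_trials):
        x = rng.randint(-5, 100)
        if not pre(x):
            continue  # the triple says nothing about inputs that violate Pre
        valid += 1
        if post(x, noisy_double(x, rng)):
            satisfied += 1
    return satisfied / valid if valid else 0.0


p_hat = estimate_p(10_000)
print(f"estimated p = {p_hat:.3f}")
```

The estimate only describes the input distribution it was sampled from, so a different distribution or a different model would need its own measurement.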

The message I was trying to convey in the SymbolicAI paper is to think of LLMs as semantic parsers. Unlike traditional parsers that process structured languages, LLMs can decompose unstructured human language into meaningful units and transform them into structured forms. This makes them useful tools for bridging symbolic reasoning and generative AI, enabling richer interactions between probabilistic and logic-based systems. A natural next step was to consider whether types could become semantic structures. It’s not hard to see how that could be the case. Types aren’t just syntactic constraints; they can also carry meaning. For example, a class called Person immediately implies a specific concept (a human individual) without requiring much additional explanation. If we work in the OOP paradigm, structuring our codebase in this way helps other programmers understand our intent. Those classes can then be used to write type-safe programs, which, depending on the programming language, adds some guarantees about which manipulations are valid and often prevents many bugs.

Now, I wanted something that emphasizes design choices because I’m the kind of person who likes to work within constraints: enough to keep me creative, but not so many that they stifle me. I wanted an approach that, even when automated, requires more than just reactive validation. With a human in the loop, it would shift the mindset toward proactive structuring. When I (or we, if working with an LLM) define a contract, we architect the flow of information. Every input and output must align with the intended semantics of the type system we’re developing. I’ve found this process deeply motivating, as it forces me to think carefully about what goes in, what comes out, and what must remain invariant throughout the entire process.

The way I like to think about it is that this pattern acts as an interface between my intent and what gets executed. It simply follows a set of constraints that lead to the emergence of behaviors reflecting our original intent. It’s a bit paradoxical when you think about it, but it works. For instance, a poorly designed contract would overly restrict a system, while a well-designed one introduces just enough guardrails to remain flexible. On a more romantic note, I like to think that I’m putting us back in charge, so don’t screw it up. After all, we’re both on team Human, and we might not get another chance.

And if we expand on this idea and start introducing multiple contracts, it becomes easy to see that this enables natural composability. I find this particularly relevant in multi-agent systems, where LLMs act as orchestrators that dynamically generate and refine outputs. By enforcing contracts at this level, we could theoretically ensure that any contextual information exchanged between agents is accurate.

Next, we’ll walk through a concrete implementation. I’ll introduce the @contract decorator and demonstrate how it integrates with symbolicai.

Unlike classic DbC, which enforces invariants, this approach introduces a fallback or continuation mechanism. The @contract decorator is a class decorator, meaning it augments an entire Expression rather than a single function.

In symbolicai, as in PyTorch, each Expression implements a forward method. The @contract decorator modifies forward, but, crucially, never prevents execution. Instead, if a contract fails, the system has the option to take action, for example by returning an expected implicit type (as indicated by the -> return type annotation). If the contract succeeds, it can simply return the validated result, or it can perform additional actions (e.g. logging) before passing the result on. With that in mind, let’s look at how @contract is implemented in symbolicai.
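To make the "validate first, but always run forward" idea tangible before looking at the real thing, here is a toy class decorator. It is not symbolicai’s implementation: toy_contract, the produce method, and the Upper class are hypothetical names invented for this sketch, and the generative step is replaced by a plain string operation.

```python
import functools


def toy_contract(cls):
    # Hypothetical, simplified stand-in for @contract (not symbolicai's code).
    original_forward = cls.forward

    @functools.wraps(original_forward)
    def wrapped_forward(self, *, input, **kwargs):
        # Phase 1: run validation and record the outcome on the instance.
        try:
            candidate = self.produce(input)
            self.post(candidate)
            self.contract_successful = True
            self.contract_result = candidate
        except ValueError:
            self.contract_successful = False
            self.contract_result = None
        # Phase 2: forward always runs, whether or not validation succeeded.
        return original_forward(self, input=input, **kwargs)

    cls.forward = wrapped_forward
    return cls


@toy_contract
class Upper:
    def produce(self, input: str) -> str:
        # Stand-in for the generative step.
        return input.upper()

    def post(self, output: str) -> bool:
        if not output:
            raise ValueError("Output must not be empty")
        return True

    def forward(self, input: str, **kwargs) -> str:
        # On failure, fall back to a valid default of the declared type.
        if not self.contract_successful:
            return ""
        return self.contract_result


ok = Upper()
print(ok.forward(input="hello"), ok.contract_successful)
bad = Upper()
print(repr(bad.forward(input="")), bad.contract_successful)
```

The failing case still returns a str, which is the point: the declared return type holds even when the contract does not.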

The full code of what I’m about to show is available as a gist at the following link. I won’t reproduce it entirely here, but will conceptually outline important aspects of the implementation.

Contracts in symbolicai enforce a strict type system using Pydantic data models. These contracts only work with LLMDataModel objects for both inputs and outputs. LLMDataModel is our custom extension of BaseModel in Pydantic, enhanced with specialized formatting methods that allow us to instruct LLMs. There is nothing more than that for now, but it could be extended in the future.

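from pydantic import Field
# LLMDataModel ships with symbolicai; OntologySchema and Triplet are defined in the gist.

class TripletInput(LLMDataModel):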
    """Input for triplet extraction"""
    text: str = Field(
        description="Text to extract triplets from"
    )
    ontology: OntologySchema = Field(
        description="Ontology schema to use for extraction"
    )

class TripletOutput(LLMDataModel):
    """Collection of extracted triplets forming a knowledge graph"""
    triplets: list[Triplet] | None = Field(
        default=None,
        description="List of extracted triplets"
    )

When defining data models, we strongly recommend using Pydantic’s Field objects to represent model attributes. Fields are particularly relevant in this context because the description parameter serves a dual purpose — beyond documenting the field, these descriptions become part of the prompt sent to the LLM. This creates an elegant way to provide contextual instructions directly into the data structures, guiding the model to the correct results without redundant code.
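As a rough illustration of how field descriptions can end up in a prompt, the sketch below uses standard-library dataclasses instead of Pydantic so that it runs without dependencies. The metadata entry plays the role of Field(description=...), the ontology field is simplified to a string, and render_field_instructions is a hypothetical helper. How LLMDataModel actually formats its fields is not shown in this post, so the output format here is an assumption.

```python
from dataclasses import dataclass, field, fields


@dataclass
class TripletInputSketch:
    # Stand-in for a Pydantic model; metadata mimics Field(description=...).
    text: str = field(metadata={"description": "Text to extract triplets from"})
    ontology: str = field(
        metadata={"description": "Ontology schema to use for extraction"}
    )


def render_field_instructions(model_cls) -> str:
    # Turn each field's name, type, and description into one prompt line.
    lines = []
    for f in fields(model_cls):
        type_name = getattr(f.type, "__name__", str(f.type))
        lines.append(f"- {f.name} ({type_name}): {f.metadata['description']}")
    return "\n".join(lines)


print(render_field_instructions(TripletInputSketch))
```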

When applied, the @contract decorator modifies the class, and in particular its forward method, adding a few crucial members: the contract_successful attribute indicates whether the contract succeeded, the contract_result attribute contains the result (or None if the contract failed), and a new contract_perf_stats method provides detailed timing metrics for each phase of contract validation. The forward method requires a specific signature with an input parameter and must declare a return type annotation of the form -> SomeLLMDataModel. This return type is the contract’s guarantee: your method must always return an object of this type. Furthermore, forward does not accept positional arguments, only keyword arguments. All keyword arguments other than input are sent to the backend, where they reach the current neuro-symbolic engine (say gpt-4o). There’s a whole discussion to be had about how we use kwargs in symbolicai, but you won’t need to understand that detail to use contracts effectively.

def forward(self, input: TripletInput, **kwargs) -> TripletOutput:
    # The contract modified the class, so self now has this new attribute
    if self.contract_result is None:
        return TripletOutput(triplets=None)
    # … do more work here …
    return self.contract_result

A critical aspect of contract implementation is the order of execution: contract validation runs first, but the forward method always runs after that, regardless of the validation results. This design makes it the developer’s responsibility to check that contract_successful is True before proceeding with complex operations, or else to construct an appropriate default return value that matches the declared type. While you could raise an exception when validation fails, the recommended pattern is to return a valid but empty or default object.

Keep this in mind, because it’s very important. There’s nothing more wasteful than waiting for a computation to complete only to have it fail in a way that could have been prevented. In case you’re wondering, I originally set the default to return the contract result, but defaults undermine the very purpose of what I’m building: a system that prioritizes design awareness. Attention is one of the most vital skills. Think of it as a mindfulness exercise.

def pre(self, input: TripletInput) -> bool:
    # No semantic validation for now
    return True

def post(self, output: TripletOutput) -> bool:
    if output.triplets is None:
        # We can consider skipping since the LLM didn't
        # find any triplet for the given input.
        return True
    for triplet in output.triplets:
        if triplet.confidence < self.threshold:
            raise ValueError(
                "Confidence score "
                f"{triplet.confidence} "
                "has to be above threshold "
                f"{self.threshold}! "
                "Extract relationships between entities "
                "that are meaningful and relevant!"
            )
    return True

@property
def prompt(self) -> str:
    return (
        "You are an expert "
        "at extracting semantic relationships from text "
        "according to ontology schemas. "
        "For the given text and ontology:\n"
        "1. Identify entities matching the allowed entity types\n"
        "2. Extract relationships between entities matching the "
        "defined relationship types\n"
        "3. Assign confidence scores based on certainty of extraction\n"
        "4. Ensure all entity and relationship types conform to the ontology\n"
        "5. Do not duplicate triplets\n"
        "6. If triplets can't be found, default to None"
    )

To implement a complete contract, you need to define two key methods and one property. The pre and post methods serve as probabilistic semantic validation containers for the input and output, respectively, and can be as complex as you want. These methods should return True when validation succeeds and raise an error with a descriptive message when validation fails. Raising an error is a deliberate design choice because these error messages, similar to the description argument of Field, serve as instructions that guide the LLM toward self-correction. While you may not always need to semantically validate an input that you designed yourself, this capability becomes relevant when you chain contracts together. In these scenarios, the output of one contract becomes the input of another, and although it passed validation in its original context, it may not meet the standards of the downstream contract. The required prompt property must be a string describing the task you want to perform, and you have complete freedom in how you construct it. Two additional optional properties can enhance contracts: template provides a mechanism for text completion (e.g., “text… {fill this thing}… text”), while payload can contain any additional data relevant to the current computation.
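The chaining scenario can be sketched with two plain functions: an upstream post that accepts any non-empty summary and a downstream pre that is stricter. Everything here (Summary, summarizer_post, translator_pre, check_handoff, the five-word limit) is hypothetical and only meant to show why an output that passed in its own context can still be rejected further down the chain.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    text: str


def summarizer_post(output: Summary) -> bool:
    # Upstream contract: any non-empty summary is acceptable.
    if not output.text.strip():
        raise ValueError("Summary must not be empty.")
    return True


def translator_pre(input: Summary) -> bool:
    # Downstream contract: stricter, it only accepts short summaries.
    if len(input.text.split()) > 5:
        raise ValueError(
            "Summary has more than 5 words; shorten it before translation."
        )
    return True


def check_handoff(summary: Summary) -> str:
    summarizer_post(summary)  # raises if the upstream contract is violated
    try:
        translator_pre(summary)
    except ValueError as err:
        # In a real contract this message would be fed back to the model.
        return f"rejected downstream: {err}"
    return "accepted"


print(check_handoff(Summary("Users keep ownership of content")))
print(check_handoff(Summary("Users keep ownership of everything they post online")))
```

The second summary is valid for the upstream contract and still fails downstream, which is exactly the case where input validation earns its keep.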

@contract(
    pre_remedy=False,
    post_remedy=True,
    verbose=True,
    remedy_retry_params=dict(
        tries=1,
        delay=0.5,
        max_delay=15,
        jitter=0.1,
        backoff=2,
        graceful=False
    )
)

The @contract decorator accepts several parameters that control its behavior. Two boolean parameters, pre_remedy and post_remedy, control whether autocorrection should be applied to inputs and outputs, respectively. When enabled, the system will retry failed validations up to the specified number of times. The remedy_retry_params dictionary configures the two underlying retry mechanisms that attempt to remedy validation failures, one for type validation and the other for semantic validation. It includes parameters like tries (maximum retry attempts) and others to fine-tune the remediation process. I won’t go into detail about the internals of these two mechanisms, but I will say this: under the hood, they use a component called Function, which was originally inspired by Wolfram’s LLMFunction. We liked this high-level approach, so we added it to our framework. The verbose parameter enables detailed logging of the contract’s internal operations.
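The post does not spell out how each retry parameter is interpreted, so the following is only a sketch under a common convention for such parameter names: tries counts total attempts, the wait starts at delay, is multiplied by backoff and increased by jitter after each failure, and is capped at max_delay. The backoff_schedule helper is hypothetical, and graceful is left out because its exact meaning is not described here.

```python
def backoff_schedule(tries, delay, max_delay, jitter, backoff):
    # Waits (in seconds) between consecutive attempts under the assumed
    # convention: multiply by `backoff`, add `jitter`, cap at `max_delay`.
    waits = []
    current = delay
    for _ in range(max(tries - 1, 0)):  # no wait after the final attempt
        waits.append(min(current, max_delay))
        current = current * backoff + jitter
    return waits


# With the values from the decorator call above there is a single attempt,
# so under this reading nothing is ever waited for.
print(backoff_schedule(tries=1, delay=0.5, max_delay=15, jitter=0.1, backoff=2))

# With more attempts the waits grow roughly geometrically until the cap.
print(backoff_schedule(tries=8, delay=0.5, max_delay=15, jitter=0.1, backoff=2))
```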

Contract execution follows a well-defined sequence of operations. First, it validates whether the input is indeed an instance of LLMDataModel, immediately raising an error if this fundamental type check fails. If pre_remedy is enabled, it semantically validates the input according to the pre method, retrying with guidance from error messages when validation fails. Then, the system attempts to create a valid output object that conforms to the specified return type. If post_remedy is enabled, it semantically validates this output using the post method, again with retry capabilities.
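The "retry with guidance from error messages" step is easier to picture with a toy loop. In this sketch run_with_remedy and fake_model are hypothetical: the fake model returns a low confidence on its first call and a higher one once it receives feedback, which stands in for an LLM correcting itself after reading the error text. The real remedy components are more involved than this.

```python
def run_with_remedy(generate, validate, tries):
    # Call `generate`, validate the result, and on failure feed the error
    # message into the next attempt. Returns (result, succeeded).
    feedback = None
    for _ in range(tries):
        candidate = generate(feedback)
        try:
            validate(candidate)
            return candidate, True
        except ValueError as err:
            feedback = str(err)  # the error text becomes guidance
    return None, False


def fake_model(feedback):
    # Hypothetical stand-in for an LLM: it improves only after feedback.
    return 0.4 if feedback is None else 0.9


def post(confidence, threshold=0.7):
    if confidence < threshold:
        raise ValueError(
            f"Confidence score {confidence} has to be above threshold {threshold}!"
        )
    return True


print(run_with_remedy(fake_model, post, tries=1))
print(run_with_remedy(fake_model, post, tries=2))
```

With a single try the loop gives up and reports failure; with two tries the second attempt sees the error message and passes.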

To demonstrate contracts in action, let’s explore a practical example: extracting structured knowledge from legal documents. In this task, we design an ontology that guides the LLM to extract semantic triplets from a Terms and Conditions document (in this case, X’s Terms and Conditions). These triplets then form a knowledge graph that represents the key concepts and relationships of the document. We process the document in chunks to increase the probability of capturing all relevant entities and relationships. This segmentation strategy is important for complete extraction, as attempting to process the entire document at once could cause the model to miss important details or exceed its context limits. We will compare the results of two different models, gpt-4o-mini and deepseek-r1-distill-qwen-32b-q4_k_m (unsloth), demonstrating how different models interpret and extract information under the same contractual constraints.

And the results (clickable) for gpt-4o-mini and deepseek-r1-distill-qwen-32b-q4_k_m.

Well, that’s all for now, but there’s still much more to explore. We’re only at the beginning, and I’m confident that if more people (especially those smarter than me) adopt this pattern, we could build far more reliable systems. There are other ideas I believe hold merit. For instance, if two different LLMs satisfy the same contract, they could be considered functionally equivalent, at least with respect to that contract. I think this idea of functional equivalence is a very promising research direction, because in principle you could replace one LLM with another, or even replace an LLM with something else entirely, and as long as both satisfy the same contract, your application should continue functioning correctly.
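To illustrate what equivalence with respect to a contract could mean, here is a small sketch with two deterministic functions standing in for two models. model_a, model_b, the post condition, and the samples are all hypothetical. The two functions return different outputs, yet both satisfy the same postcondition, so a caller that relies only on that postcondition could swap one for the other.

```python
from typing import Callable


def post(text: str, output: list[str]) -> bool:
    # Contract: output is sorted, has no duplicates, and every item is a
    # lowercase word that occurs in the input text.
    words = set(text.lower().split())
    return output == sorted(set(output)) and all(w in words for w in output)


def model_a(text: str) -> list[str]:
    # Returns every distinct word.
    return sorted(set(text.lower().split()))


def model_b(text: str) -> list[str]:
    # Returns only the distinct words longer than three characters.
    return sorted({w for w in text.lower().split() if len(w) > 3})


def satisfies(model: Callable[[str], list[str]], samples: list[str]) -> bool:
    # A model satisfies the contract on a sample set if post holds everywhere.
    return all(post(s, model(s)) for s in samples)


samples = ["The cat saw the dog", "Terms and Conditions and terms"]
print(satisfies(model_a, samples), satisfies(model_b, samples))
print(model_a(samples[0]) == model_b(samples[0]))
```

Passing on a finite sample set is of course weaker than satisfying the contract in general, which is part of why formalizing the probabilistic side matters.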

Anyway, looking ahead, there are several directions worth exploring:

How can we formalize the probabilistic aspects of contracts, potentially calculating the probability of success for a given contract (perhaps across different models)?
What should the life cycle of a contract be?
In the area of security, can contracts prevent prompt injection/contextual violations?
How can we develop a standardized format for serializing and deserializing contracts, allowing for easier sharing and version control (e.g. .GGUF)?
How can we create a hub where developers can share and discover contracts for common tasks, a collaborative ecosystem around semantic correctness?

Would love to hear your thoughts on this.