
Challenging Agentic AI with SPARK using OpenUxAS
A question we are often asked at AdaCore is: “How well does generative AI do with writing Ada?” The simple answer is: “quite well!” But the details are also interesting, so I encourage you to read on!
For a number of years, AdaCore has worked with the US Air Force Research Laboratory in Dayton, OH, to apply SPARK to their OpenUxAS platform for research into multi-UAV cooperative decision making. The objective of our collaboration is to demonstrate how one might apply software best practices in general, and deductive formal verification in particular, to complex, Air Force-relevant systems.
OpenUxAS is a service-oriented architecture that uses ZeroMQ for publish/subscribe messaging and is implemented in C++, a language for which deductive formal verification is practically impossible. So our approach to date has been to select a C++ service, rewrite it in SPARK, prove absence of run-time errors (AoRTE; also called SPARK Silver), and pose and prove functional-correctness properties (called SPARK Gold). OpenUxAS has only very limited requirements, so in practice, we’ve had to read the C++ very carefully and write the SPARK by hand, finding and fixing problems (in the SPARK alone) as we’ve encountered them. This has, of course, worked, but it’s time- and effort-intensive.
I wanted to know: can I use agentic AI to translate a (new, previously untranslated and unproven) OpenUxAS service from C++ into SPARK, automatically? And how far can I take the proof, automatically? The answers are: yes! and not very far!
Testing Infrastructure
OpenUxAS doesn’t have comprehensive or rigorous requirements. It also doesn’t have comprehensive tests, just a set of examples that demonstrate the use of the platform through predefined scenarios (since these examples require a Java-based simulator and user interface, they’re difficult to use for automated testing).
In prior work, AdaCore developed a Python-based testing framework that sends messages to and receives messages from OpenUxAS services using the ZeroMQ networking bridge. For the Automation Request Validator service, we used this framework to build a set of test cases that achieve nearly 100% statement coverage. Our vision was to exploit this framework to enable back-to-back testing of an existing C++ service against a new SPARK implementation: we would send test data to the C++ service, record the results, send the same data to the SPARK implementation, and compare the results against those from the C++. Unfortunately, this work was not completed in our prior efforts.
As a first step, I used agentic AI to develop this back-to-back testing capability. With a few high-level prompts, I ended up with a testing capability that allowed me to:
- Test and record coverage for an Ada or C++ service
- Run back-to-back tests, with either the Ada or the C++ implementation serving as the test oracle and either serving as the system under test (testing C++ against C++ was a good sanity check that the infrastructure worked as expected)
- XFAIL (expected failure) back-to-back tests, with a failure reported on XPASS (unexpected pass)
This was highly successful. In particular, the agent did an excellent job at entering the rather complex codebase and figuring out how to use some very specific APIs that we leveraged (e3-testsuite). But since the test infrastructure is written in Python, this step, while necessary, didn’t answer my question.
Sensor Manager Service Requirements and C++ Tests
Test infrastructure in hand, I set about completing my test oracle. Since my selected service, the Sensor Manager service, didn’t have existing C++ tests, I asked the agent to write test cases for me. I actually did this in two steps: first, I asked the agent to write requirements; then I asked the agent to write tests based on those requirements.
My thought was to emulate a typical requirements-based testing process as you’d encounter doing certifiable software development. But since I didn’t forbid the agent from looking at the details of the C++, my requirements-based tests, which should have been black-box, ended up very light grey-box: I know the agent went off and read the C++ code and used what it found to influence test-case generation. (This is a worthwhile lesson when working with agents: they will do a lot of digging around, unless you sandbox them carefully.)
After a bit of back and forth, mostly because the agent found an interesting bug or two that had to be reflected in the test results, I had tests for the Sensor Manager that achieved 98% coverage. I was ready to tackle the SPARK implementation.
SPARK Implementation
Using the agent to generate the SPARK implementation was dead easy. OpenUxAS provides a template for use in creating a new SPARK service that includes detailed instructions for use. I pointed the agent to this and asked the agent to follow the requirements and use the template to design and implement the service.
And it worked! The agent quickly generated a SPARK implementation of the Sensor Manager service that compiled and linked. The implementation consisted of: a pure Ada piece that managed service startup, binding to the ZeroMQ networking bridge, and receipt of messages to which the service had subscribed; a “mailbox” layer that allows SPARK code to send messages; and the core service functionality. In all, the agent generated about a thousand lines of implementation.
Iterating with the back-to-back test runs, the agent quickly identified a few bugs and fixed them - usually modifying the SPARK implementation to match an idiosyncrasy of the C++ code.
How’s the code? I’d say it’s a pretty good start. The agent did nice, idiomatic things like:
- Introduced declare blocks to keep constants and variables scoped close to their usage
- Used constants whenever possible
- Used for ... of loops whenever possible (note - this is good Ada, but currently often problematic in SPARK, which needs the loop index to perform induction)
- Chose good variable names
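The for ... of caveat deserves a quick sketch. Here is a minimal, hypothetical example (the names and bounds are invented, not taken from the Sensor Manager) of why SPARK’s induction wants a loop index:

```ada
--  Hypothetical example: accumulating bounded sensor readings.
subtype Reading is Natural range 0 .. 1_000;
type Reading_Array is array (1 .. 100) of Reading;

function Total (Readings : Reading_Array) return Natural is
   Sum : Natural := 0;
begin
   --  "for R of Readings loop" would read more naturally, but the
   --  invariant below needs the index I to say "everything before I
   --  has been accumulated" - the anchor for SPARK's induction.
   for I in Readings'Range loop
      pragma Loop_Invariant (Sum <= (I - 1) * Reading'Last);
      Sum := Sum + Readings (I);
   end loop;
   return Sum;  --  provably at most 100_000, so no overflow is possible
end Total;
```

With the indexed form, the invariant bounds the partial sum at each step; the for ... of form gives the invariant nothing to refer to.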
What’s clear is that the code is not simply a transliteration of the C++ code. So I was pretty happy with the results. Until, that is, I started trying to prove the generated code to SPARK Silver!
SPARK Silver
With my completed implementation in hand, I asked the agent to prove the code to SPARK Silver. And the agent crashed and burned: after a bunch of time and tokens spent, it ended up unable to prove much of anything and littered my code with `pragma Annotate`s justifying the failed checks. (This is not uncommon behavior - agents are known to take the easy way out when they struggle to solve problems.)
So I spent the next couple of weeks (calendar time, not consistent effort!) working with the agent to get much of the implementation to prove at SPARK Silver. Along the way, I learned some interesting things that will possibly become a later blog post and certainly are making their way into a “Skill file” for the agent so that next time someone tries to do this, the agent will be better set up for success! (Stay tuned for a follow-up post on this!)
There were five key reasons the agent failed to make meaningful progress with the Silver proofs out of the box:
- The SPARK code, while overall fairly idiomatic, was poorly architected for proof
- The SPARK types were raw Int32, Real64, or Real32
- Floating-point proofs are simply hard
- SPARK's imprecise support for trigonometric operations required the introduction of axioms
- The code was simply wrong and thus unprovable, even to SPARK Silver
Let’s look at each of these.
Architecture for Proof
Sometimes, writing code to maximize proof success and minimize proof effort requires specific architectural decisions. In general, however, whatever you’d do naturally to make code as modular and readable as possible will make proof easier. You should definitely not do things like nest more than 10 levels deep, including five nested loops - which is what the agent did for the core service functionality.
The first thing I did was ask the agent to break the functionality into subprograms that were as small as possible.
In particular, there are many loops that lack data dependencies amongst iterations (e.g., apply an operation to each element of a collection); I asked the agent to create subprograms for the bodies of these kinds of loops. This is not only easier to read, but it also makes proof much easier: SPARK proves loops using induction - but if there are no data dependencies amongst iterations, induction isn’t needed and just makes the proof more complicated. (Of course, if you need to prove something about the map operation, induction will still be needed to reason about what’s been mapped so far.) By pulling the body out into a subprogram, SPARK can prove this without induction, and a reader can be sure that there are no data dependencies amongst loop iterations.
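As an illustration of the extracted style (the names and ranges here are invented for this sketch, not taken from the service):

```ada
subtype Degrees is Float range -360.0 .. 360.0;
subtype Azimuth is Float range 0.0 .. 360.0;
type Degree_Array  is array (1 .. 16) of Degrees;
type Azimuth_Array is array (1 .. 16) of Azimuth;

--  The loop body, extracted: SPARK proves this once, in isolation,
--  against the Azimuth result subtype - no induction required.
function Wrap (Raw : Degrees) return Azimuth is
  (if Raw < 0.0 then Raw + 360.0 else Raw);

procedure Normalize (Raw : Degree_Array; Result : out Azimuth_Array) is
begin
   --  No data flows between iterations, so no Loop_Invariant is
   --  needed beyond what SPARK generates for the loop index itself.
   for I in Raw'Range loop
      Result (I) := Wrap (Raw (I));
   end loop;
end Normalize;
```

The extracted subprogram also documents, by its very existence, that each iteration is independent of the others.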
I also looked for opportunities to encapsulate computation: bits of code where only the final result was needed and the intermediate computations were otherwise just excess context for SPARK. An excellent example was a surprisingly lengthy bit of code that ensured some input values were converted to the correct units and clamped to an appropriate range: all that mattered were the final, valid ranges, not how they were computed.
Now, this kind of refactoring isn’t without cost: because SPARK is modular, each of these new subprograms needs preconditions and postconditions. But the postconditions would have been assertions anyway (to help breadcrumb the solvers through the excessive context of the original, pre-refactoring computation), and the preconditions actually help with understanding what’s needed elsewhere in the computation to ensure absence of run-time errors.
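The units-conversion example might look like this sketch (the ranges and names are invented): the caller sees only the precondition it must satisfy and the final, valid range; the intermediate arithmetic stays hidden.

```ada
subtype Meters is Float range 0.0 .. 30_000.0;

--  Encapsulates convert-then-clamp: all the caller (and SPARK, at the
--  call site) needs is the Pre and the Meters result range.
function Altitude_In_Meters (Altitude_Feet : Float) return Meters
  with Pre => Altitude_Feet in 0.0 .. 100_000.0
is
   Converted : constant Float := Altitude_Feet * 0.3048;
begin
   --  100_000.0 * 0.3048 is about 30_480.0, so clamping is required.
   return Float'Min (Converted, Meters'Last);
end Altitude_In_Meters;
```

At call sites, SPARK reasons only about the contract; the conversion constant and the clamp are no longer context it has to carry around.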
Use of Raw Numeric Types
The second significant challenge was the agent’s use of raw numeric types like Int32, Real32, or Real64. To prove absence of run-time errors, SPARK requires (because Ada requires!) that computations won’t overflow and that assignments fit within the bounds of their types. If you add, subtract, or multiply two arbitrary numeric values, you have absolutely no reason to assume absence of overflow. You have to be able to constrain the ranges of the inputs so you can prove bounds on the outputs.
For embedded code, this probably isn’t a problem. Embedded code, operating as it does in the constraints of the real world, has to work with real-world limits. Sensors have fixed ranges. Altitudes have physics-based limits. Actuators have fixed limits. So at the boundaries, a well-engineered system will have some sort of interface control document that will clearly specify the expected and acceptable ranges on all of its inputs and outputs.
SPARK (and Ada) was designed with this in mind. But use of raw numeric types discards this information. That’s why we always encourage new SPARK and Ada programmers to introduce numeric subtypes with narrow ranges whenever possible. The intent is encoded in the type (self-documenting code!), Ada will check these ranges at runtime, and SPARK will prove these ranges statically.
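For instance (the ranges here are invented for illustration, not taken from the actual message specification):

```ada
--  Raw: carries no information for the prover.
--     Gimbal_Elevation : Real64;

--  Narrow: the intent is in the type, checked at runtime by Ada and
--  proven statically by SPARK.
subtype Latitude_Deg  is Long_Float range  -90.0 ..  90.0;
subtype Longitude_Deg is Long_Float range -180.0 .. 180.0;
subtype Altitude_M    is Long_Float range -500.0 .. 50_000.0;

--  Arithmetic on these now has provable bounds: the difference of two
--  Latitude_Deg values is known to lie in -180.0 .. 180.0.
```
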
The agent, unfortunately, simply copied the types it found in the C++. I had to ask it to analyze the requirements, the message specification, and ultimately the C++ code itself to reverse engineer the narrowest justifiable type ranges. Once this was done, the proof went from impossible to possible.
This is a key takeaway - and certainly nothing new! See our recent blog post entitled “Never use Float or Integer”!
Floating-Point Woes
Alone, however, the changes above weren’t enough. While automatic provers have gotten much more capable at reasoning about floating-point computations, including nonlinear floating-point computations, such proofs remain much more difficult than their integer counterparts. My target was proof at Level 2, an intermediate GNATprove effort level; higher levels give the provers more time. To reach this level, I had to be very verbose about the known ranges on the results of most computations within key subprograms.
Without these range assertions, I could still prove at Level 3 (which allows longer timeouts) - but I didn’t want to have to wait that long to get results. And, to a degree, I liked being able to read the code and see why I knew what I knew, as one computation built on another. I plan to encourage the agent to use this style; it’s easier to remove intermediate assertions than to add them!
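The “verbose” style looks something like this sketch (names and bounds invented): each intermediate result carries an explicit range, so both the provers and a human reader can follow the bounds step by step.

```ada
subtype Altitude_M is Float range 10.0 .. 20_000.0;

function Footprint_Area
  (Alt : Altitude_M; Half_Angle_Tan : Float) return Float
  with Pre  => Half_Angle_Tan in 0.01 .. 10.0,
       Post => Footprint_Area'Result in 0.0 .. 1.3e11
is
   Radius : constant Float := Alt * Half_Angle_Tan;
begin
   --  Breadcrumb for the provers: bound the intermediate result.
   pragma Assert (Radius in 0.0 .. 200_000.0);
   declare
      Area : constant Float := 3.141_593 * Radius * Radius;
   begin
      --  Pi * 200_000.0 ** 2 is about 1.257e11, within the Post bound.
      pragma Assert (Area <= 1.3e11);
      return Area;
   end;
end Footprint_Area;
```
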
Trigonometric Imprecision
Another significant challenge resulted from SPARK’s imprecision around trigonometric operations. The sine operation, for example, is defined in SPARK as returning a value between -1.0 and 1.0 - true, but not sufficiently precise to prove absence of run-time errors in parts of the code that depended on the result of sine.
So we had to introduce an axiom (a subprogram that is marked Ghost and Import, hence declared but never defined) whose contract offered a more precise range on the outputs given a more precise range of inputs. This was an axiom because we’ve not formally proven that it’s correct: a comment declares that we evaluated the result of sine at the endpoints of the input range and asserts monotonicity along the range - but this hasn’t been proven anywhere. We’d have to use an external theorem-proving system to complete this proof, which we deemed overkill for this example.
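The pattern, sketched here with invented bounds (Sin is Ada.Numerics.Elementary_Functions.Sin; the endpoint values are from hand evaluation, not formal proof):

```ada
with Ada.Numerics.Elementary_Functions;
use  Ada.Numerics.Elementary_Functions;

--  Declared but never defined (Import): SPARK takes the Post on faith.
--  Justification, not proof: Sin is monotonic on [0.01, 0.5], and
--  Sin (0.01) is about 0.0100 while Sin (0.5) is about 0.4794 -
--  checked by hand, not by a theorem prover.
procedure Lemma_Sin_Range (X : Float)
  with Ghost,
       Import,
       Global => null,
       Pre  => X in 0.01 .. 0.5,
       Post => Sin (X) in 0.009 .. 0.48;

--  At a use site, calling the lemma brings the tighter bound into
--  scope for whatever proof follows:
--     Lemma_Sin_Range (Theta);
--     Height := Slant_Range * Sin (Theta);  --  now provably bounded
```

The precondition limits where the axiom may be applied, which is exactly what makes it less risky than a bare assumption.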
Axioms are powerful, but nearly as risky as assumptions. They’re better because we can use the precondition to state precisely where their application is valid - but we still rely on justification of the postcondition, rather than proof. The agent didn’t know to introduce these initially - but after recording in its memory that axioms had been used, it eventually proposed a (correct!) use of a similar axiom as we continued to pursue the Silver proof.
Bugs in the C++
Ultimately, the biggest challenge to completing the Silver proof came down to bugs in the C++. There were simply things the C++ was doing that made no sense - like computing a sensor footprint of (near) infinite or even negative width, because the code incorrectly guarded against a sensor that wasn’t pointed at the ground.
Once I knew to look for these problems, the agent helped me diagnose them and, together, we proposed defensible limits in the SPARK implementation (resulting in XFAILs in the back-to-back test suite) that allowed us to complete the proof. But initially, without identification of these bugs, the code was unprovable because it was erroneous.
And this is a win for formal methods - and a kind of win for the use of agentic AI!
Formal-methods practitioners are frequently asked: Did you find bugs when you did the proofs? And usually the answer is … no. Why? Because the process of preparing the code for proof and writing the contracts typically requires that we think hard(er) about what we’re doing and, as a result, we find and fix bugs. After all, who would blithely press forward writing known-wrong code just to be able to find it by running the prover?
But the agent did exactly this. Or, if you prefer, the agent didn’t “see” the bugs during its initial translation and thus faithfully copied them into its SPARK implementation. Its naiveté was - after a fashion - useful as a means to reveal the power of deductive formal verification. So that’s a kind of win for this approach!
I would like to revisit this example, after I build up a useful “Skill file”, to see if the agent, using the prover, is able to find (and fix) these bugs autonomously. But in the meantime, I think this example has done a great job of illustrating a key point: agentic AI + SPARK makes for a fantastic combination.
AI + SPARK: The Way Things Should Be Done
Agentic AI is really impressive. If you’ve not used it (recently!), you really should give it a try.
But the agent won’t be perfect. It will make mistakes. And it will write a lot of code that you (or a colleague!) will have to review. Which … isn’t great fun, honestly.
SPARK can help. With SPARK, you need only review the specification - the public subprograms and their contracts - carefully. For the implementation, the private/internal subprograms and the code, a cursory review to ensure there aren’t any major red flags should be enough.
Why? Because SPARK will prove that the implementation:
- Doesn’t lead to any runtime errors (SPARK Silver) and
- Complies with its contract (SPARK Gold)
You don’t have to trust the agent for that. You can trust SPARK.
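Concretely, the reviewer’s job reduces to reading specifications like this sketch (names and ranges invented for illustration):

```ada
subtype Gimbal_Elev_Deg is Float range -90.0 .. -1.0;  --  pointed down
subtype Footprint_M     is Float range 0.0 .. 100_000.0;

--  Review this contract carefully; the (agent-written) body gets only
--  a light pass. If SPARK proves the body, Silver guarantees it raises
--  no run-time errors and Gold guarantees the Post holds.
function Footprint_Width
  (Altitude : Float; Elevation : Gimbal_Elev_Deg) return Footprint_M
  with Pre  => Altitude in 10.0 .. 20_000.0,
       Post => Footprint_Width'Result <= Footprint_M'Last;
```

However the agent chooses to implement the body, the proof - not the review - is what guarantees it honors this specification.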
What’s Next
In spite of the agent’s failure to complete the Silver proof automatically, working with the agent was overall a positive experience that allowed me to progress much, much faster than if I had been working alone. And - a teaser - with even just the first draft of my Skill, the agent was able to complete the Silver proof automatically! Stay tuned for more on this!
Want To Try This Yourself?
The concepts that I applied in this blog post are readily repeatable on your own code if you want to give it a try. Alire, the package manager for Ada and SPARK, is freely available as open source. In fact, since you’re already using an LLM, you can ask it to create a SPARK project with Alire and use that as the starting point for a SPARK implementation of your existing project. You’ll be compiling and proving your code in no time.
Author
M. Anthony Aiello

Tony Aiello is a Product Manager at AdaCore. Currently, he manages SPARK Pro, GNAT Pro for Rust, and GNAT IQ.
