Lagrange Engineering Update: Q4 2025

Q4 2025 marked a major inflection point for DeepProve’s proving architecture. Over the quarter, the team focused on scaling the system from a single-node prover into a distributed, GPU-accelerated proving pipeline capable of sustaining real-world inference workloads. This required coordinated advances across cryptographic execution, tensor algebra, memory flow, and performance optimization. The result is a materially more scalable, accurate, and production-ready zkML system.
Distributed Proving Architecture
We re-architected DeepProve’s proving logic to operate over an explicit execution graph. Instead of treating the proving process as a monolithic sequence of steps, the full cryptographic workflow is now represented as a graph whose nodes encode computation and proof obligations.
Key properties of this system include:
- The execution graph can be split into multiple independent partitions.
- Each partition is assigned a “color,” analogous to a network address or routing identifier.
- Workers (servers) are assigned one or more colored partitions.
- Independent subgraphs can execute in parallel, allowing proof generation to scale horizontally.
This architecture allows DeepProve to express proving logic as a parallelizable system rather than a strictly sequential pipeline. We are currently integrating the cryptographic proof logic fully into this distributed execution layer, enabling proofs to be generated cooperatively across multiple machines.
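To make the model concrete, the sketch below shows how colored partitions of an execution graph might be assigned to workers and dispatched independently. The Python types and names (`ProvingNode`, `Partition`, `assign_workers`) are purely illustrative assumptions and are not DeepProve's internal API.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: names and structure are assumptions,
# not DeepProve's actual internals.

@dataclass
class ProvingNode:
    name: str                                   # computation / proof obligation
    deps: list = field(default_factory=list)    # upstream node names

@dataclass
class Partition:
    color: str                                  # routing identifier for this subgraph
    nodes: list = field(default_factory=list)

def assign_workers(partitions, workers):
    """Round-robin colored partitions onto workers; a worker may hold several colors."""
    assignment = {w: [] for w in workers}
    for i, part in enumerate(partitions):
        assignment[workers[i % len(workers)]].append(part.color)
    return assignment

# Two partitions with no cross-partition dependencies can be proved in parallel.
attn = Partition("blue", [ProvingNode("qkv_einsum"),
                          ProvingNode("softmax_lookup", ["qkv_einsum"])])
mlp  = Partition("red",  [ProvingNode("ffn_einsum"),
                          ProvingNode("gelu_lookup", ["ffn_einsum"])])

print(assign_workers([attn, mlp], ["worker-0", "worker-1"]))
# {'worker-0': ['blue'], 'worker-1': ['red']}
```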
Einsum-First Linear Algebra
In the previous update, we reported the completion of an einsum layer used to express select linear operations. In Q4, we extended this approach across the entire model stack. Specifically, all existing linear layers were replaced with explicit einsum instantiations, including dense layers and QKV projections.
The proving logic was generalized to support arbitrary tensor rank configurations. This shift significantly simplified the codebase by eliminating many bespoke linear layer implementations. By unifying linear algebra under einsum, we reduced the number of independent components that must be maintained while improving flexibility for future architectures.
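As an illustration of the idea (not DeepProve's actual code), the numpy sketch below expresses a dense layer and a stacked QKV projection as single einsum calls; the index labels and tensor shapes are assumptions chosen for the example.

```python
import numpy as np

# Illustration only: einsum expressions for a dense layer and a QKV projection.
# Shapes and index labels are assumptions for this example, not DeepProve's layout.

batch, seq, d_model, d_head, n_heads = 2, 8, 64, 16, 4

x     = np.random.randn(batch, seq, d_model)
w_out = np.random.randn(d_model, d_model)
w_qkv = np.random.randn(3, n_heads, d_model, d_head)   # stacked Q, K, V weights

# Dense layer: y[b,s,o] = sum_d x[b,s,d] * W[d,o]
dense = np.einsum("bsd,do->bso", x, w_out)

# QKV projection: one einsum produces Q, K, V for all heads at once.
qkv = np.einsum("bsd,phdk->pbhsk", x, w_qkv)
q, k, v = qkv                                           # each (batch, n_heads, seq, d_head)

assert dense.shape == (batch, seq, d_model)
assert q.shape == (batch, n_heads, seq, d_head)
```

Because every linear layer reduces to an expression of this form, a single einsum prover generalized over tensor ranks can replace the bespoke per-layer implementations.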
Non-Linear Layer Speedups via Unified Lookup
Non-linear layers remain one of the most expensive components of zkML systems due to their reliance on lookup tables. Previously, each non-linear operation (e.g., softmax, ReLU, GELU, layer normalization) had its own separate implementation.
We introduced a single generalized Lookup layer that:
- Can be parameterized to represent different non-linear functions.
- Supports integrated requantization as part of the lookup process.
By folding requantization into the lookup layer itself, we eliminated the need for many standalone requantization stages. This reduced both proving overhead and implementation complexity while improving end-to-end performance.
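The toy sketch below shows the general pattern: one lookup table, parameterized by the non-linear function, with the dequantize-apply-requantize chain baked into the table entries. The scales, bit width, and function choices are illustrative assumptions rather than DeepProve's actual parameters.

```python
import numpy as np

# Toy sketch: a single lookup layer parameterized by an activation function,
# with requantization folded into the table entries. Scales, bit widths, and
# function choices here are assumptions, not DeepProve's parameters.

def build_lookup_table(fn, in_scale, out_scale, bits=8):
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    inputs = np.arange(lo, hi + 1)
    # Dequantize -> apply non-linearity -> requantize, all baked into one table.
    reals = fn(inputs * in_scale)
    return np.clip(np.round(reals / out_scale), lo, hi).astype(np.int64)

def lookup_layer(q_x, table, bits=8):
    lo = -(1 << (bits - 1))
    return table[q_x - lo]          # one table read replaces activation + requantization

relu_table = build_lookup_table(lambda x: np.maximum(x, 0.0), in_scale=0.05, out_scale=0.05)
gelu_table = build_lookup_table(
    lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3))),
    in_scale=0.05, out_scale=0.05)

q_x = np.array([-40, -3, 0, 7, 50])
print(lookup_layer(q_x, relu_table))   # [0 0 0 7 50]
print(lookup_layer(q_x, gelu_table))
```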
Accuracy Improvements
We conducted a comprehensive accuracy evaluation of the models currently supported by DeepProve, with a focus on GPT-2 and Gemma-3. Measured against PyTorch FP32 baselines, GPT-2 shows less than a 1% increase in perplexity, and Gemma-3 approximately a 4% increase. These results demonstrate that the proving pipeline maintains high numerical fidelity even as performance and scalability improve.
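For reference, the figure being reported is the relative perplexity increase over the FP32 baseline, as in the minimal sketch below; the NLL values are placeholders, not measured numbers.

```python
import math

# Minimal sketch of the reported comparison: relative perplexity increase of the
# proving pipeline over an FP32 baseline. The NLL values are placeholders.

def perplexity(mean_nll):
    return math.exp(mean_nll)

baseline_ppl  = perplexity(mean_nll=3.20)   # FP32 reference (placeholder)
deepprove_ppl = perplexity(mean_nll=3.21)   # proving pipeline (placeholder)

rel_increase = (deepprove_ppl - baseline_ppl) / baseline_ppl
print(f"perplexity increase: {rel_increase:.2%}")   # ~1% for these placeholder values
```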
Caching-Friendly Inference
We formalized a more robust caching strategy to better support long-sequence and incremental inference. The system now supports two distinct caches:
- A positional cache that tracks sequence position, used by positional and attention-related layers.
- A tensor concatenation cache that accumulates tensors across inference steps, such as cached K and V matrices from QKV projections.
This structure improves both inference efficiency and proof reuse across sequential model executions.
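A minimal sketch of the two cache types is shown below; the class and field names are illustrative assumptions, not DeepProve's actual interfaces.

```python
import numpy as np

# Sketch of the two cache types described above. Names are illustrative only.

class PositionalCache:
    """Tracks the current sequence position for positional / attention layers."""
    def __init__(self):
        self.position = 0
    def advance(self, n_new_tokens):
        self.position += n_new_tokens
        return self.position

class ConcatCache:
    """Accumulates tensors (e.g. K and V from QKV projections) across inference steps."""
    def __init__(self):
        self.store = {}
    def append(self, key, tensor, axis):
        if key in self.store:
            self.store[key] = np.concatenate([self.store[key], tensor], axis=axis)
        else:
            self.store[key] = tensor
        return self.store[key]

pos_cache, kv_cache = PositionalCache(), ConcatCache()
new_k = np.zeros((1, 4, 1, 16))                       # (batch, heads, new_tokens, head_dim)
pos_cache.advance(1)
full_k = kv_cache.append("layer0.k", new_k, axis=2)   # grows along the sequence axis
```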
Unified Memory Flow
One of the more complex efforts this quarter was a large-scale refactor of tensor management to support a unified memory flow. The goal is to allow tensors to move seamlessly across:
- Persistent storage
- Network transport
- System RAM
- GPU VRAM
The new data flow model allows tensors to be sampled or intercepted at arbitrary stages of the inference and proving pipeline. While this work is foundational, it enables future optimizations in distributed proving, GPU execution, and asynchronous proof generation.
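Conceptually, the flow can be pictured as a tensor handle that records where its data currently lives and lets observers intercept each move. The sketch below uses illustrative names and is not DeepProve's actual type system.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Sketch of a unified memory-flow abstraction. All names are illustrative assumptions.

class Location(Enum):
    DISK = auto()     # persistent storage
    NETWORK = auto()  # in transit between workers
    RAM = auto()      # system memory
    VRAM = auto()     # GPU memory

@dataclass
class TensorHandle:
    name: str
    location: Location
    def move_to(self, target: Location, observers=()):
        # Observers can sample or intercept the tensor at any stage of the flow.
        for observe in observers:
            observe(self, target)
        self.location = target
        return self

def log_transfer(handle, target):
    print(f"{handle.name}: {handle.location.name} -> {target.name}")

h = TensorHandle("layer0.k_cache", Location.DISK)
h.move_to(Location.RAM, observers=[log_transfer]).move_to(Location.VRAM, observers=[log_transfer])
```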
GPU Execution and Optimistic Proving
All model layers are now capable of running on GPU, unlocking significant performance improvements. Beyond reducing end-to-end latency, this enables an important new capability: optimistic proving. Under this model, inference results are returned to users immediately, proof generation proceeds asynchronously, and the proof is delivered once it is ready, without blocking inference. This design is critical for operational settings where low latency is required but cryptographic assurance must still be preserved.
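The flow can be sketched as follows; `run_inference` and `generate_proof` are stand-ins for this example, not DeepProve functions. Inference returns immediately while the proof is produced on a background executor and collected later.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Sketch of the optimistic-proving flow: inference returns immediately,
# the proof is generated asynchronously and delivered once ready.
# run_inference / generate_proof are placeholders, not DeepProve's API.

def run_inference(prompt):
    return f"completion for: {prompt}"            # fast path, returned right away

def generate_proof(prompt, output):
    time.sleep(0.5)                               # stands in for GPU proof generation
    return {"statement": (prompt, output), "proof": b"..."}

executor = ThreadPoolExecutor(max_workers=1)

def serve(prompt):
    output = run_inference(prompt)                # user receives this immediately
    proof_future = executor.submit(generate_proof, prompt, output)
    return output, proof_future                   # proof delivered once ready

output, proof_future = serve("hello")
print(output)                                     # not blocked on the proof
print(proof_future.result()["statement"])         # verifier picks this up later
```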
Performance and Throughput
A comprehensive bottleneck analysis across the proving pipeline led to multiple targeted optimizations. As a result, DeepProve now sustains approximately 1.5 proofs per second. Reaching this level of sustained throughput is significant for two reasons. First, it validates that zkML proving can keep pace with practical inference workloads rather than operating as an offline or batch-only process. Second, it establishes a performance baseline that future parallelization, hardware acceleration, and distributed proving can scale from.
Together, these results validate the architectural direction taken throughout Q4 and demonstrate that DeepProve is moving from experimental performance into sustained, production-scale operation.

