LLM Inference on a Static Manifold: A Gauge-Theoretic Framework
Category: research
Date:
Author: CausalLM
From Sequence Shuffling to Geometric Reconstruction: Posing the Problem
The fundamental contradiction in current large model inference systems is the rigid coupling of computational logic to storage structure. These models forcibly map the attention relationships between tokens onto the contiguous address space of physical memory. This one-dimensional linear assumption, while seemingly efficient for autoregressive generation, fundamentally constrains the system's flexibility. When faced with concurrent multi-path processing, asynchronous data stream fusion, or dynamic context reorganization, every deviation from strictly sequential access triggers memory movement on the order of O(L·d), where L is the sequence length and d is the hidden dimension.
Figure 1: Comparison of KV cache reordering paradigms. The traditional method (left) physically moves data, an expensive O(L·d) operation. The geometric method (right) keeps data static and transforms the query's viewpoint, a cheap O(d) computation.
The crux of the problem is not a lack of engineering optimization but a paradigmatic limitation: compressing a high-dimensional coordinate transformation problem into a one-dimensional linked list. The key to breaking this impasse lies in acknowledging that attention computation is inherently dependent on relative positional relationships, not absolute physical addresses. The KV cache can be viewed as a static geometric basis, while the query vector acts as a dynamic viewpoint. By rotating the query to simulate different positional relationships, the computation graph can be dynamically reconstructed over immutable physical data.
The Group Structure of RoPE: Transformation, Not Labels
The power of Rotary Position Embedding (RoPE) lies in its additive and invertible matrix group properties[1]. For a query q at position m and a key k at position n, RoPE applies positional encoding via an orthogonal rotation matrix R(·):

qₘ = R(m)·q,  kₙ = R(n)·k
The rotation matrices {R(m)} form a one-parameter, commutative Lie group[2] with the key properties:

R(m)·R(n) = R(m+n),  R(m)⁻¹ = R(−m) = R(m)ᵀ,  R(0) = I
This means that the attention score ⟨R(m)·q, R(n)·k⟩ = ⟨q, R(n−m)·k⟩ depends solely on the relative displacement n−m, irrespective of absolute coordinates[1]. The position ID itself is a free parameter in the choice of coordinates and can be arbitrarily remapped without altering the computational result. Traditional systems hardcode physical memory addresses as the default coordinate system, which is equivalent to fixing a global rotational basis and thereby forfeiting the entire exploratory space of coordinate flexibility.
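As a quick sanity check of these identities, here is a minimal NumPy sketch (not part of the original framework; it assumes the standard RoPE frequency schedule θₖ = 10000^(−2k/d)) that verifies the group law and the relative-displacement property numerically:

```python
import numpy as np

def rope_matrix(pos: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """Block-diagonal RoPE rotation matrix R(pos) for an even embedding dimension."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # theta_k = base^(-2k/d)
    R = np.zeros((dim, dim))
    for i, a in enumerate(pos * freqs):
        c, s = np.cos(a), np.sin(a)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

dim, m, n = 8, 5, 12
rng = np.random.default_rng(0)
q, k = rng.standard_normal(dim), rng.standard_normal(dim)

# Group law: R(m)·R(n) = R(m+n)
assert np.allclose(rope_matrix(m, dim) @ rope_matrix(n, dim), rope_matrix(m + n, dim))

# Relative-displacement property: <R(m)q, R(n)k> = <q, R(n-m)k>
lhs = (rope_matrix(m, dim) @ q) @ (rope_matrix(n, dim) @ k)
rhs = q @ (rope_matrix(n - m, dim) @ k)
assert np.allclose(lhs, rhs)
print("score depends only on n - m:", lhs, rhs)
```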
Static Basis + Dynamic Viewpoint: Rotating Queries, Not Moving Data
By treating the KV cache as a static geometric basis, the position encoding matrix can be reinterpreted as a coordinate system shift operator. Attention computation is essentially the projection of a query vector onto this basis, and the group properties of RoPE allow us to simulate different positional relationships through query rotation transformations.
For a cache block at physical coordinate p, if computation is required at a logical coordinate ℓ, no KV data needs to be moved. Instead, we simply apply a transformation to the query:

q′ = R(ℓ−p)·q
Figure 2: The query rotation mechanism. To compute the attention between a query q at logical position ℓ and a KV block at physical position p, the system does not move data. Instead, it applies a rotational transform R(ℓ-p) to the query vector q before the computation, generating a temporary, transformed query q'.
This operation achieves computational equivalence: the calculation in the physical basis, after transformation, produces an attention distribution identical to the calculation in the logical coordinate system. The complexity is reduced from O(L·d) for memory movement to O(B·d) for rotational computation, where B is the number of cache blocks and typically B ≪ L. This complexity reduction has been validated in practice by systems like vLLM through its PagedAttention mechanism[3].
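A minimal sketch of this equivalence, assuming keys are cached once with RoPE applied at their physical positions; the helper and the sign convention for the correction rotation (the query picks up the block's physical-minus-logical offset) are my own illustrative choices:

```python
import numpy as np

def rope_matrix(pos, dim, base=10000.0):
    """Same block-diagonal RoPE rotation helper as in the earlier sketch."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    R = np.zeros((dim, dim))
    for i, a in enumerate(pos * freqs):
        c, s = np.cos(a), np.sin(a)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

dim, block_len = 8, 4
p, l, m = 100, 20, 30        # physical block start, desired logical block start, query's logical position
rng = np.random.default_rng(1)
q_raw = rng.standard_normal(dim)
k_raw = rng.standard_normal((block_len, dim))

# Static physical basis: keys are encoded once at their physical positions p, p+1, ... and never touched again.
k_cache = np.stack([rope_matrix(p + j, dim) @ k_raw[j] for j in range(block_len)])

# Reference ("move/re-encode data"): rebuild the block at its logical positions l, l+1, ...
k_logical = np.stack([rope_matrix(l + j, dim) @ k_raw[j] for j in range(block_len)])
scores_ref = (rope_matrix(m, dim) @ q_raw) @ k_logical.T

# Geometric ("rotate the query"): fold the block's (physical - logical) offset into a single O(d) query rotation.
q_prime = rope_matrix(m + (p - l), dim) @ q_raw
scores_geo = q_prime @ k_cache.T

assert np.allclose(scores_ref, scores_geo)
print("identical attention logits over an untouched cache:", np.round(scores_geo, 3))
```

The cache is never rewritten; only the query is rotated once per block, which is where the O(B·d)-versus-O(L·d) gap comes from.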
The zero-curvature property (the rotations commute, R(a)·R(b) = R(b)·R(a), so the associated connection is flat) guarantees the path-independence of parallel transport[4][5][6]. This ensures that inserting, deleting, or reordering logical views at any point during inference does not disrupt the geometric consistency of historical computations, providing an algebraic foundation for incremental topological editing. Empirical studies have also shown that the manifolds learned by deep generative models often exhibit near-zero curvature[7], lending practical support to this assumption[8].
A Unified Framework: Three Classes of Homomorphic Mapping Strategies
All computational modes share the same set of rotational operators, differing only in their coordinate mapping strategy σ, which maps logical coordinates to physical ones. This is the core idea of homomorphic transformation:
Figure 3: Coordinate mapping strategies (σ) map logical views to shared physical KV blocks. 1. Identity Mapping: Logical and physical coordinates have a one-to-one correspondence, used for standard autoregression. 2. Permutation Mapping: Different requests (A and B) share physical blocks via distinct permutation tables (π), enabling conflict-free concurrency. 3. Spectral Offset Mapping: Different data streams (A and B) are assigned large logical coordinate offsets (Φ), separating them in the "spectrum" for asynchronous stream fusion.
1. Identity Mapping (Standard Autoregression)
The boundary condition is σ(i) = i: logical coordinates coincide with physical coordinates, and the rotation reduces to the identity transformation. This is the fixed-coordinate limiting case, corresponding to the traditional inference model.
2. Permutation Mapping (Conflict-Free Concurrency)
Let N inference instances share a single physical basis of KV blocks. Each instance k defines a permutation π_k that maps its logical block index j to a physical block π_k(j).
When the query of instance k interacts with its logical block j, it reads the physical block π_k(j) and a rotation R(j − π_k(j)) is applied. The computations for all instances proceed in parallel, with different rotations applied only on the query side. This strategy of achieving concurrent sharing through block-level address permutation is already employed in modern inference engines[3][9]. "Conflicts" between instances are automatically resolved because each instance's view occupies its own component in an orthogonal decomposition of the representation space.
Concurrency control is thus transformed into a problem of group representation theory, eliminating the need for locking mechanisms—an algebraic implementation of the spectral isolation principle.
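The following toy sketch illustrates permutation mapping with one cached key per "block" and two hypothetical instances A and B sharing the same physical pool; the permutation tables and per-lookup query rotations are illustrative choices of mine, not an actual engine implementation:

```python
import numpy as np

def rope_matrix(pos, dim, base=10000.0):
    """Same block-diagonal RoPE rotation helper as in the earlier sketches."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    R = np.zeros((dim, dim))
    for i, a in enumerate(pos * freqs):
        c, s = np.cos(a), np.sin(a)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

dim, n_phys = 8, 6
rng = np.random.default_rng(2)
k_raw = rng.standard_normal((n_phys, dim))
# One shared physical basis: every slot is encoded once at its physical index and never rewritten.
k_pool = np.stack([rope_matrix(s, dim) @ k_raw[s] for s in range(n_phys)])

# Two concurrent instances view the pool through hypothetical permutation tables (logical index -> physical slot).
perms = {"A": [3, 0, 5], "B": [1, 4, 2]}
queries = {"A": rng.standard_normal(dim), "B": rng.standard_normal(dim)}
m = 3                        # both queries sit at logical position 3 of their own sequence

for name, pi in perms.items():
    q = queries[name]
    # Reference: a private, contiguous cache re-encoded at logical positions 0, 1, 2.
    ref = np.array([(rope_matrix(m, dim) @ q) @ (rope_matrix(j, dim) @ k_raw[s]) for j, s in enumerate(pi)])
    # Geometric: static shared pool; each lookup only rotates the query by the slot's (physical - logical) offset.
    geo = np.array([(rope_matrix(m + s - j, dim) @ q) @ k_pool[s] for j, s in enumerate(pi)])
    assert np.allclose(ref, geo)
    print(f"instance {name}: scores on shared blocks match, no locks or copies needed")
```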
3. Spectral Offset Mapping (Asynchronous Stream Fusion)
We assign each asynchronous stream s a spectral offset Φ_s, shifting its logical coordinates into a dedicated band:

σ_s(j) = j + Φ_s
The idea of converting temporal disparities in asynchronous streams into controllable phase differences in a spectral domain is analogous to systems like FlowKV, which optimize KV cache transfer[10]. The attention operator then splits into a block-diagonal (intra-stream) component and an off-diagonal (inter-stream) component, a decomposition theoretically grounded in measure-theoretic interpretations of self-attention as an interacting particle system[11].
The constant offset difference ΔΦ = Φ_B − Φ_A acts as a learnable parameter encoding inter-stream dependencies, achieving spectral multiplexing: asynchronicity in the time domain is translated into a controllable phase difference in the frequency domain.
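A toy numerical illustration of spectral offsets (the Φ values below are arbitrary placeholders): intra-stream scores are independent of the choice of Φ, while cross-stream scores depend only on the offset difference:

```python
import numpy as np

def rope_matrix(pos, dim, base=10000.0):
    """Same block-diagonal RoPE rotation helper as in the earlier sketches."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    R = np.zeros((dim, dim))
    for i, a in enumerate(pos * freqs):
        c, s = np.cos(a), np.sin(a)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

dim = 8
rng = np.random.default_rng(3)
q, k = rng.standard_normal(dim), rng.standard_normal(dim)

def score(q_pos, k_pos):
    return (rope_matrix(q_pos, dim) @ q) @ (rope_matrix(k_pos, dim) @ k)

# Spectral-offset mapping sigma_s(j) = Phi_s + j: each stream occupies its own band of logical coordinates.
phi = {"A": 0, "B": 4096}    # hypothetical offsets; Phi_B - Phi_A is the inter-stream phase difference

# Intra-stream scores depend only on the local displacement, so the choice of Phi is free...
assert np.isclose(score(phi["A"] + 7, phi["A"] + 2), score(phi["B"] + 7, phi["B"] + 2))

# ...while cross-stream scores are governed solely by the offset difference Phi_B - Phi_A.
s1 = score(phi["B"] + 7, phi["A"] + 2)          # query in stream B attends to a key in stream A
s2 = score(phi["B"] + 1007, phi["A"] + 1002)    # same local displacement, both shifted together in time
assert np.isclose(s1, s2)
print("intra-stream scores invariant; cross-stream coupling set by dPhi =", phi["B"] - phi["A"])
```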
An Information Geometry Perspective: Attention as a Similarity Metric
Viewing token representations as points on a statistical manifold M, the attention score can be reinterpreted as a probability decay based on distance[8]:

αᵢⱼ = exp(−d²(pᵢ, pⱼ) / (2σ²))
Here, d(·,·) is the geodesic distance on the manifold. This perspective, which defines attention scores as an exponential function of geodesic distance, finds direct theoretical support in Transformer frameworks based on Riemannian and non-Euclidean geometries[4][12][13]. Invariance under coordinate transformations guarantees that d(pᵢ, pⱼ), and hence αᵢⱼ, remains unchanged, meaning that re-encoding does not affect performance because the geometric distance is preserved.
Figure 4: Attention from an information-geometric perspective. Token representations are viewed as points (pᵢ, pⱼ) on a statistical manifold. The attention score is no longer a simple dot product but a function of the geodesic distance d(pᵢ, pⱼ) between two points on the manifold: the shorter the distance, the higher the similarity and the attention score. This geometric distance is invariant under coordinate transformations.
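As a toy model of this view, the following sketch uses the Euclidean distance as a flat-manifold stand-in for the geodesic distance and checks that the attention weights are unchanged under an isometric re-encoding (a random orthogonal map plus a translation):

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n = 8, 5
points = rng.standard_normal((n, dim))    # token representations p_i (stand-in for points on the manifold)
sigma = 1.0

def distance_attention(pts):
    """alpha_ij proportional to exp(-d^2(p_i, p_j) / (2 sigma^2)), with Euclidean d as a flat stand-in."""
    d2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return w / w.sum(axis=-1, keepdims=True)

# An isometry of the flat manifold: a random orthogonal map plus a translation.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
re_encoded = points @ Q + rng.standard_normal(dim)

# Re-encoding by an isometry leaves every attention weight unchanged, since all pairwise distances are preserved.
assert np.allclose(distance_attention(points), distance_attention(re_encoded))
print(distance_attention(points).round(3))
```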
Zero sectional curvature implies that M is locally flat[7], so all topological restructuring operations preserve information entropy. The optimal mapping σ should minimize a cost function closely related to the variational principle for geodesics and optimal transport theory[14][15][16].
The training process becomes a gradient flow on the moduli space, allowing the model to adaptively learn the optimal coordinate mapping.
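Purely as an illustration of such a gradient flow (the target and cost below are stand-ins of my own, not the cost function alluded to above), a stream's spectral offset Φ can be treated as a differentiable parameter and tuned by gradient descent:

```python
import torch

torch.manual_seed(0)
dim = 8
freqs = 10000.0 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)

def rotate(x, pos):
    """Apply RoPE at a (possibly fractional, differentiable) position via channel-wise complex multiplication."""
    xc = torch.view_as_complex(x.reshape(-1, 2).contiguous())
    return torch.view_as_real(xc * torch.exp(1j * pos * freqs)).reshape(-1)

q_b = torch.randn(dim)                      # a query token belonging to stream B
k_a = torch.randn(dim)                      # a key token belonging to stream A (its offset fixed at 0)

phi_b = torch.zeros(1, requires_grad=True)  # learnable spectral offset for stream B (hypothetical setup)
target = torch.tensor(0.25)                 # toy stand-in target for the desired cross-stream coupling
opt = torch.optim.Adam([phi_b], lr=0.05)

for _ in range(300):
    opt.zero_grad()
    logit = rotate(q_b, phi_b) @ rotate(k_a, torch.zeros(1))
    loss = (logit - target) ** 2            # toy cost: steer the inter-stream phase toward the target coupling
    loss.backward()
    opt.step()

print(f"learned offset Phi_B = {phi_b.item():.3f}, cross-stream logit = {logit.item():.3f}")
```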
Hardware Implementation: From Rotations to Complex Multiplication
The core operation is the matrix exponential R(θ) = exp(θJ), with J the 2×2 skew-symmetric generator acting on each channel pair; on modern GPUs this maps to channel-wise complex multiplication[1]: each channel pair (q₂ₖ, q₂ₖ₊₁) is packed as a complex number and multiplied by the unit phasor e^(i·m·θₖ).
By leveraging complex matrix multiplication units, the query vector can be broadcast in parallel to all cache blocks while applying different unitary rotations, thereby localizing computation and delocalizing the perspective.
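A small NumPy check (standard RoPE frequencies assumed) that the block-diagonal matrix form and the channel-wise complex-multiplication form agree:

```python
import numpy as np

dim, pos, base = 8, 17, 10000.0
freqs = base ** (-np.arange(0, dim, 2) / dim)
rng = np.random.default_rng(5)
q = rng.standard_normal(dim)

# Reference: the explicit block-diagonal rotation matrix, one 2x2 block exp(pos * theta_k * J) per channel pair.
R = np.zeros((dim, dim))
for i, a in enumerate(pos * freqs):
    c, s = np.cos(a), np.sin(a)
    R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
q_rot_matrix = R @ q

# Hardware-friendly form: pack channel pairs as complex numbers and multiply by unit phasors e^(i*pos*theta_k).
q_complex = q[0::2] + 1j * q[1::2]
q_rot_complex = q_complex * np.exp(1j * pos * freqs)
q_rot = np.empty(dim)
q_rot[0::2], q_rot[1::2] = q_rot_complex.real, q_rot_complex.imag

assert np.allclose(q_rot_matrix, q_rot)
print("matrix exponential == channel-wise complex multiplication")
```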
In a multi-GPU scenario, global coordinate transformations are synchronized via an algebraic protocol.
A central charge term encodes the phase difference across devices, so AllReduce only needs to synchronize these phase parameters rather than the KV tensors themselves, reducing communication from O(L·d) to O(d).
Paradigm Shift: From Movement to Transformation
The essence of this revolution is to elevate inference from data movement to coordinate programming. Positional encodings are no longer static labels but dynamic coordinate transformations; attention propagation is a geometric instantiation of a group action; different computational modes correspond to different configurations of a spectral measure.
When the attention mechanism is promoted from a tensor contraction to a variational problem on a manifold, the model transforms from a passive sequence predictor into an active topological reconstruction engine. The future direction is to enable the model to autonomously learn the optimal mapping σ[14][15][16].
When the model can evolve stable configurations that yield the smoothest coordinate transformations, inference becomes a constructive process of geometric optimization. This marks a leap from engineering heuristics to mathematical first principles: the core value of a language model is no longer predicting the next token, but acting as a programmable geometric engine that actively shapes the topology of information flow in digital space.
References
[1] Su, Jianlin, et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv (2021).
[2] Liu, Haiping, Lijing Lin, et al. "Rethinking RoPE: A Mathematical Blueprint for N-dimensional Positional Embedding." (2025).
[3] Kwon, Woosuk, Zhuohan Li, et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." Proceedings of the 29th Symposium on Operating Systems Principles (2023).
[4] Ji, Zhongping. "RiemannFormer: A Framework for Attention in Curved Spaces." (2025).
[5] Grindstaff, Gillian. "Geometric data analysis for phylogenetic trees and non-contractible manifolds." (2021).
[6] Kelshaw, Daniel, and Luca Magri. "Computing distances and means on manifolds with a metric-constrained Eikonal approach." arXiv (2024).
[7] Shao, Hang, Abhishek Kumar, et al. "The Riemannian Geometry of Deep Generative Models." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017).
[8] Wang, Sijie, Rui She, et al. "DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition." arXiv (2024).
[9] Zhong, Yinmin, Shengyu Liu, et al. "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving." USENIX Symposium on Operating Systems Design and Implementation (2024).
[10] Li, Weiqing, Guochao Jiang, et al. "FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling." (2025).
[11] Vuckovic, James, A. Baratin, et al. "A Mathematical Theory of Attention." arXiv (2020).
[12] He, Neil, Jiahong Liu, et al. "Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries." (2025).
[13] Li, Peizhuo, Tuanfeng Y. Wang, et al. "Neural Garment Dynamics via Manifold-Aware Transformers." Computer Graphics Forum (2024).
[14] Yu, Yanmin, Yongcai Lai, et al. "The Novel Sequence Distance Measuring Algorithm Based on Optimal Transport and Cross-Attention Mechanism." Shock and Vibration (2021).
[15] Wang, Rui, Chen Hu, et al. "A Grassmannian Manifold Self-Attention Network for Signal Classification." Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (2024).
[16] Burger, Martin, Samira Kabri, et al. "Analysis of mean-field models arising from self-attention dynamics in transformer architectures with layer normalization." (2025).