LLM Inference on a Static Manifold: A Gauge-Theoretic Framework
Category: research
Date:
Author: CausalLM
From Sequence Shuffling to Geometric Reconstruction: Posing the Problem
The fundamental contradiction in current large model inference systems is the rigid coupling of computational logic to storage structure. These models forcibly map the attention relationships between tokens onto the contiguous address space of physical memory. This one-dimensional linear assumption, while seemingly efficient for autoregressive generation, fundamentally constrains the system's flexibility. When faced with concurrent multi-path processing, asynchronous data stream fusion, or dynamic context reorganization, every deviation from strictly sequential access triggers memory movement on the order of O(L·d), where L is the sequence length and d is the hidden dimension.
Figure 1: Comparison of KV cache reordering paradigms. The traditional method (left) physically moves data, an expensive O(L·d) operation. The geometric method (right) keeps data static and transforms the query's viewpoint, a cheap O(d) computation.
The crux of the problem is not a lack of engineering optimization but a paradigmatic limitation: compressing a high-dimensional coordinate transformation problem into a one-dimensional linked list. The key to breaking this impasse lies in acknowledging that attention computation is inherently dependent on relative positional relationships, not absolute physical addresses. The KV cache can be viewed as a static geometric basis, while the query vector acts as a dynamic viewpoint. By rotating the query to simulate different positional relationships, the computation graph can be dynamically reconstructed over immutable physical data.
The Group Structure of RoPE: Transformation, Not Labels
The power of Rotary Position Embedding (RoPE) lies in its additive and invertible matrix group properties[1]. For a query q at position m and a key k at position n, RoPE applies positional encoding via an orthogonal rotation matrix R(·):

qₘ = R(m)·q,  kₙ = R(n)·k
The rotation matrices {R(m)} form a one-parameter, commutative Lie group[2] with the key properties:

R(m)·R(n) = R(m+n),  R(m)⁻¹ = R(−m) = R(m)ᵀ,  R(0) = I
This means that the attention score ⟨R(m)·q, R(n)·k⟩ = ⟨q, R(n−m)·k⟩ depends solely on the relative displacement n−m, irrespective of absolute coordinates[1]. The position ID itself is a free parameter in the choice of coordinates and can be arbitrarily remapped without altering the computational result. Traditional systems hardcode physical memory addresses as the default coordinate system, which is equivalent to fixing a global rotational basis and thereby forfeiting the entire exploratory space of coordinate flexibility.
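As a quick sanity check of these identities, here is a minimal NumPy sketch (not part of the original framework; it assumes the standard RoPE frequency schedule θₖ = 10000^(−2k/d)) that verifies the group law and the relative-displacement property numerically:

```python
import numpy as np

def rope_matrix(pos: int, dim: int, base: float = 10000.0) -> np.ndarray:
    """Block-diagonal RoPE rotation matrix R(pos) for an even embedding dimension."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # theta_k = base^(-2k/d)
    R = np.zeros((dim, dim))
    for i, a in enumerate(pos * freqs):
        c, s = np.cos(a), np.sin(a)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

dim, m, n = 8, 5, 12
rng = np.random.default_rng(0)
q, k = rng.standard_normal(dim), rng.standard_normal(dim)

# Group law: R(m)·R(n) = R(m+n)
assert np.allclose(rope_matrix(m, dim) @ rope_matrix(n, dim), rope_matrix(m + n, dim))

# Relative-displacement property: <R(m)q, R(n)k> = <q, R(n-m)k>
lhs = (rope_matrix(m, dim) @ q) @ (rope_matrix(n, dim) @ k)
rhs = q @ (rope_matrix(n - m, dim) @ k)
assert np.allclose(lhs, rhs)
print("score depends only on n - m:", lhs, rhs)
```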
Static Basis + Dynamic Viewpoint: Rotating Queries, Not Moving Data
By treating the KV cache as a static geometric basis, the position encoding matrix can be reinterpreted as a coordinate system shift operator. Attention computation is essentially the projection of a query vector onto this basis, and the group properties of RoPE allow us to simulate different positional relationships through query rotation transformations.
For a cache block at physical coordinate p, if computation is required at a logical coordinate ℓ, no KV data needs to be moved. Instead, we simply apply a transformation to the query:

q′ = R(ℓ−p)·q
Figure 2: The query rotation mechanism. To compute the attention between a query q at logical position ℓ and a KV block at physical position p, the system does not move data. Instead, it applies a rotational transform R(ℓ-p) to the query vector q before the computation, generating a temporary, transformed query q'.
This operation achieves computational equivalence: the calculation in the physical basis, after transformation, produces an attention distribution identical to the calculation in the logical coordinate system. The complexity is reduced from O(L·d) for memory movement to O(B·d) for rotational computation, where B is the number of cache blocks and typically B ≪ L. This complexity reduction has been validated in practice by systems like vLLM through its PagedAttention mechanism[3].
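A minimal sketch of this equivalence, assuming keys are cached once with RoPE applied at their physical positions; the helper and the sign convention for the correction rotation (the query picks up the block's physical-minus-logical offset) are my own illustrative choices:

```python
import numpy as np

def rope_matrix(pos, dim, base=10000.0):
    """Same block-diagonal RoPE rotation helper as in the earlier sketch."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    R = np.zeros((dim, dim))
    for i, a in enumerate(pos * freqs):
        c, s = np.cos(a), np.sin(a)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

dim, block_len = 8, 4
p, l, m = 100, 20, 30        # physical block start, desired logical block start, query's logical position
rng = np.random.default_rng(1)
q_raw = rng.standard_normal(dim)
k_raw = rng.standard_normal((block_len, dim))

# Static physical basis: keys are encoded once at their physical positions p, p+1, ... and never touched again.
k_cache = np.stack([rope_matrix(p + j, dim) @ k_raw[j] for j in range(block_len)])

# Reference ("move/re-encode data"): rebuild the block at its logical positions l, l+1, ...
k_logical = np.stack([rope_matrix(l + j, dim) @ k_raw[j] for j in range(block_len)])
scores_ref = (rope_matrix(m, dim) @ q_raw) @ k_logical.T

# Geometric ("rotate the query"): fold the block's (physical - logical) offset into a single O(d) query rotation.
q_prime = rope_matrix(m + (p - l), dim) @ q_raw
scores_geo = q_prime @ k_cache.T

assert np.allclose(scores_ref, scores_geo)
print("identical attention logits over an untouched cache:", np.round(scores_geo, 3))
```

The cache is never rewritten; only the query is rotated once per block, which is where the O(B·d)-versus-O(L·d) gap comes from.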
The zero-curvature property (the rotations commute, R(a)·R(b) = R(b)·R(a), so the associated connection is flat) guarantees the path-independence of parallel transport[4][5][6]. This ensures that inserting, deleting, or reordering logical views at any point during inference does not disrupt the geometric consistency of historical computations, providing an algebraic foundation for incremental topological editing. Empirical studies have also shown that the manifolds learned by deep generative models often exhibit near-zero curvature[7], lending practical support to this assumption[8].
A Unified Framework: Three Classes of Homomorphic Mapping Strategies
All computational modes share the same set of rotational operators, differing only in their coordinate mapping strategy σ, which maps logical coordinates to physical ones. This is the core idea of homomorphic transformation:
Figure 3: Coordinate mapping strategies (σ) map logical views to shared physical KV blocks. 1. Identity Mapping: Logical and physical coordinates have a one-to-one correspondence, used for standard autoregression. 2. Permutation Mapping: Different requests (A and B) share physical blocks via distinct permutation tables (π), enabling conflict-free concurrency. 3. Spectral Offset Mapping: Different data streams (A and B) are assigned large logical coordinate offsets (Φ), separating them in the "spectrum" for asynchronous stream fusion.
1. Identity Mapping (Standard Autoregression)
The boundary condition is σ(i) = i: logical coordinates coincide with physical coordinates, and the rotation reduces to the identity transformation. This is the fixed-coordinate limiting case, corresponding to the traditional inference model.
2. Permutation Mapping (Conflict-Free Concurrency)
Let N inference instances share a single physical basis of KV blocks. Each instance k defines a permutation π_k that maps its logical block index j to a physical block π_k(j).
When the query of instance k interacts with its logical block j, it reads the physical block π_k(j) and a rotation R(j − π_k(j)) is applied. The computations for all instances proceed in parallel, with different rotations applied only on the query side. This strategy of achieving concurrent sharing through block-level address permutation is already employed in modern inference engines[3][9]. "Conflicts" between instances are automatically resolved because each instance's view occupies its own component in an orthogonal decomposition of the representation space.
Concurrency control is thus transformed into a problem of group representation theory, eliminating the need for locking mechanisms—an algebraic implementation of the spectral isolation principle.
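The following toy sketch illustrates permutation mapping with one cached key per "block" and two hypothetical instances A and B sharing the same physical pool; the permutation tables and per-lookup query rotations are illustrative choices of mine, not an actual engine implementation:

```python
import numpy as np

def rope_matrix(pos, dim, base=10000.0):
    """Same block-diagonal RoPE rotation helper as in the earlier sketches."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    R = np.zeros((dim, dim))
    for i, a in enumerate(pos * freqs):
        c, s = np.cos(a), np.sin(a)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

dim, n_phys = 8, 6
rng = np.random.default_rng(2)
k_raw = rng.standard_normal((n_phys, dim))
# One shared physical basis: every slot is encoded once at its physical index and never rewritten.
k_pool = np.stack([rope_matrix(s, dim) @ k_raw[s] for s in range(n_phys)])

# Two concurrent instances view the pool through hypothetical permutation tables (logical index -> physical slot).
perms = {"A": [3, 0, 5], "B": [1, 4, 2]}
queries = {"A": rng.standard_normal(dim), "B": rng.standard_normal(dim)}
m = 3                        # both queries sit at logical position 3 of their own sequence

for name, pi in perms.items():
    q = queries[name]
    # Reference: a private, contiguous cache re-encoded at logical positions 0, 1, 2.
    ref = np.array([(rope_matrix(m, dim) @ q) @ (rope_matrix(j, dim) @ k_raw[s]) for j, s in enumerate(pi)])
    # Geometric: static shared pool; each lookup only rotates the query by the slot's (physical - logical) offset.
    geo = np.array([(rope_matrix(m + s - j, dim) @ q) @ k_pool[s] for j, s in enumerate(pi)])
    assert np.allclose(ref, geo)
    print(f"instance {name}: scores on shared blocks match, no locks or copies needed")
```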
3. Spectral Offset Mapping (Asynchronous Stream Fusion)
We assign each asynchronous stream s a spectral offset Φ_s, shifting its logical coordinates into a dedicated band:

σ_s(j) = j + Φ_s
The idea of converting temporal disparities in asynchronous streams into controllable phase differences in a spectral domain is analogous to systems like FlowKV, which optimize KV cache transfer[10]. The attention operator then splits into a block-diagonal (intra-stream) component and an off-diagonal (inter-stream) component, a decomposition theoretically grounded in measure-theoretic interpretations of self-attention as an interacting particle system[11].
The constant offset difference ΔΦ = Φ_B − Φ_A acts as a learnable parameter encoding inter-stream dependencies, achieving spectral multiplexing: asynchronicity in the time domain is translated into a controllable phase difference in the frequency domain.
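A toy numerical illustration of spectral offsets (the Φ values below are arbitrary placeholders): intra-stream scores are independent of the choice of Φ, while cross-stream scores depend only on the offset difference:

```python
import numpy as np

def rope_matrix(pos, dim, base=10000.0):
    """Same block-diagonal RoPE rotation helper as in the earlier sketches."""
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    R = np.zeros((dim, dim))
    for i, a in enumerate(pos * freqs):
        c, s = np.cos(a), np.sin(a)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

dim = 8
rng = np.random.default_rng(3)
q, k = rng.standard_normal(dim), rng.standard_normal(dim)

def score(q_pos, k_pos):
    return (rope_matrix(q_pos, dim) @ q) @ (rope_matrix(k_pos, dim) @ k)

# Spectral-offset mapping sigma_s(j) = Phi_s + j: each stream occupies its own band of logical coordinates.
phi = {"A": 0, "B": 4096}    # hypothetical offsets; Phi_B - Phi_A is the inter-stream phase difference

# Intra-stream scores depend only on the local displacement, so the choice of Phi is free...
assert np.isclose(score(phi["A"] + 7, phi["A"] + 2), score(phi["B"] + 7, phi["B"] + 2))

# ...while cross-stream scores are governed solely by the offset difference Phi_B - Phi_A.
s1 = score(phi["B"] + 7, phi["A"] + 2)          # query in stream B attends to a key in stream A
s2 = score(phi["B"] + 1007, phi["A"] + 1002)    # same local displacement, both shifted together in time
assert np.isclose(s1, s2)
print("intra-stream scores invariant; cross-stream coupling set by dPhi =", phi["B"] - phi["A"])
```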
An Information Geometry Perspective: Attention as a Similarity Metric
Viewing token representations as points on a statistical manifold M, the attention score can be reinterpreted as a probability decay based on distance[8]:

αᵢⱼ = exp(−d²(pᵢ, pⱼ) / (2σ²))
Here, d(·,·) is the geodesic distance on the manifold. This perspective, which defines attention scores as an exponential function of geodesic distance, finds direct theoretical support in Transformer frameworks based on Riemannian and non-Euclidean geometries[4][12][13]. Invariance under coordinate transformations guarantees that d(pᵢ, pⱼ), and hence αᵢⱼ, remains unchanged, meaning that re-encoding does not affect performance because the geometric distance is preserved.
Figure 4: Attention from an information-geometric perspective. Token representations are viewed as points (pᵢ, pⱼ) on a statistical manifold. The attention score is no longer a simple dot product but a function of the geodesic distance d(pᵢ, pⱼ) between two points on the manifold: the shorter the distance, the higher the similarity and the attention score. This geometric distance is invariant under coordinate transformations.
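As a toy model of this view, the following sketch uses the Euclidean distance as a flat-manifold stand-in for the geodesic distance and checks that the attention weights are unchanged under an isometric re-encoding (a random orthogonal map plus a translation):

```python
import numpy as np

rng = np.random.default_rng(4)
dim, n = 8, 5
points = rng.standard_normal((n, dim))    # token representations p_i (stand-in for points on the manifold)
sigma = 1.0

def distance_attention(pts):
    """alpha_ij proportional to exp(-d^2(p_i, p_j) / (2 sigma^2)), with Euclidean d as a flat stand-in."""
    d2 = np.sum((pts[:, None, :] - pts[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return w / w.sum(axis=-1, keepdims=True)

# An isometry of the flat manifold: a random orthogonal map plus a translation.
Q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
re_encoded = points @ Q + rng.standard_normal(dim)

# Re-encoding by an isometry leaves every attention weight unchanged, since all pairwise distances are preserved.
assert np.allclose(distance_attention(points), distance_attention(re_encoded))
print(distance_attention(points).round(3))
```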
Zero sectional curvature implies that M is locally flat[7], so all topological restructuring operations preserve information entropy. The optimal mapping σ should minimize a cost function closely related to the variational principle for geodesics and optimal transport theory[14][15][16].
The training process becomes a gradient flow on the moduli space, allowing the model to adaptively learn the optimal coordinate mapping.
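Purely as an illustration of such a gradient flow (the target and cost below are stand-ins of my own, not the cost function alluded to above), a stream's spectral offset Φ can be treated as a differentiable parameter and tuned by gradient descent:

```python
import torch

torch.manual_seed(0)
dim = 8
freqs = 10000.0 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)

def rotate(x, pos):
    """Apply RoPE at a (possibly fractional, differentiable) position via channel-wise complex multiplication."""
    xc = torch.view_as_complex(x.reshape(-1, 2).contiguous())
    return torch.view_as_real(xc * torch.exp(1j * pos * freqs)).reshape(-1)

q_b = torch.randn(dim)                      # a query token belonging to stream B
k_a = torch.randn(dim)                      # a key token belonging to stream A (its offset fixed at 0)

phi_b = torch.zeros(1, requires_grad=True)  # learnable spectral offset for stream B (hypothetical setup)
target = torch.tensor(0.25)                 # toy stand-in target for the desired cross-stream coupling
opt = torch.optim.Adam([phi_b], lr=0.05)

for _ in range(300):
    opt.zero_grad()
    logit = rotate(q_b, phi_b) @ rotate(k_a, torch.zeros(1))
    loss = (logit - target) ** 2            # toy cost: steer the inter-stream phase toward the target coupling
    loss.backward()
    opt.step()

print(f"learned offset Phi_B = {phi_b.item():.3f}, cross-stream logit = {logit.item():.3f}")
```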
Hardware Implementation: From Rotations to Complex Multiplication
The core operation is the matrix exponential R(θ) = exp(θJ), with J the 2×2 skew-symmetric generator acting on each channel pair; on modern GPUs this maps to channel-wise complex multiplication[1]: each channel pair (q₂ₖ, q₂ₖ₊₁) is packed as a complex number and multiplied by the unit phasor e^(i·m·θₖ).
By leveraging complex matrix multiplication units, the query vector can be broadcast in parallel to all cache blocks while applying different unitary rotations, thereby localizing computation and delocalizing the perspective.
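A small NumPy check (standard RoPE frequencies assumed) that the block-diagonal matrix form and the channel-wise complex-multiplication form agree:

```python
import numpy as np

dim, pos, base = 8, 17, 10000.0
freqs = base ** (-np.arange(0, dim, 2) / dim)
rng = np.random.default_rng(5)
q = rng.standard_normal(dim)

# Reference: the explicit block-diagonal rotation matrix, one 2x2 block exp(pos * theta_k * J) per channel pair.
R = np.zeros((dim, dim))
for i, a in enumerate(pos * freqs):
    c, s = np.cos(a), np.sin(a)
    R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
q_rot_matrix = R @ q

# Hardware-friendly form: pack channel pairs as complex numbers and multiply by unit phasors e^(i*pos*theta_k).
q_complex = q[0::2] + 1j * q[1::2]
q_rot_complex = q_complex * np.exp(1j * pos * freqs)
q_rot = np.empty(dim)
q_rot[0::2], q_rot[1::2] = q_rot_complex.real, q_rot_complex.imag

assert np.allclose(q_rot_matrix, q_rot)
print("matrix exponential == channel-wise complex multiplication")
```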
In a multi-GPU scenario, global coordinate transformations are synchronized via an algebraic protocol.
A central charge term encodes the phase difference across devices, so AllReduce only needs to synchronize these phase parameters rather than the KV tensors themselves, reducing communication from O(L·d) to O(d).
Paradigm Shift: From Movement to Transformation
The essence of this revolution is to elevate inference from data movement to coordinate programming. Positional encodings are no longer static labels but dynamic coordinate transformations; attention propagation is a geometric instantiation of a group action; different computational modes correspond to different configurations of a spectral measure.
When the attention mechanism is promoted from a tensor contraction to a variational problem on a manifold, the model transforms from a passive sequence predictor into an active topological reconstruction engine. The future direction is to enable the model to autonomously learn the optimal mapping σ[14][15][16].
When the model can evolve stable configurations that yield the smoothest coordinate transformations, inference becomes a constructive process of geometric optimization. This marks a leap from engineering heuristics to mathematical first principles: the core value of a language model is no longer predicting the next token, but acting as a programmable geometric engine that actively shapes the topology of information flow in digital space.
References
[1] Su, Jianlin, et al. "RoFormer: Enhanced Transformer with Rotary Position Embedding." arXiv (2021).
[2] Liu, Haiping, Lijing Lin, et al. "Rethinking RoPE: A Mathematical Blueprint for N-dimensional Positional Embedding." (2025).
[3] Kwon, Woosuk, Zhuohan Li, et al. "Efficient Memory Management for Large Language Model Serving with PagedAttention." Proceedings of the 29th Symposium on Operating Systems Principles (2023).
[4] Ji, Zhongping. "RiemannFormer: A Framework for Attention in Curved Spaces." (2025).
[5] Grindstaff, Gillian. "Geometric data analysis for phylogenetic trees and non-contractible manifolds." (2021).
[6] Kelshaw, Daniel, and Luca Magri. "Computing distances and means on manifolds with a metric-constrained Eikonal approach." arXiv (2024).
[7] Shao, Hang, Abhishek Kumar, et al. "The Riemannian Geometry of Deep Generative Models." 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (2017).
[8] Wang, Sijie, Rui She, et al. "DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition." arXiv (2024).
[9] Zhong, Yinmin, Shengyu Liu, et al. "DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving." USENIX Symposium on Operating Systems Design and Implementation (2024).
[10] Li, Weiqing, Guochao Jiang, et al. "FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling." (2025).
[11] Vuckovic, James, A. Baratin, et al. "A Mathematical Theory of Attention." arXiv (2020).
[12] He, Neil, Jiahong Liu, et al. "Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries." (2025).
[13] Li, Peizhuo, Tuanfeng Y. Wang, et al. "Neural Garment Dynamics via Manifold-Aware Transformers." Computer Graphics Forum (2024).
[14] Yu, Yanmin, Yongcai Lai, et al. "The Novel Sequence Distance Measuring Algorithm Based on Optimal Transport and Cross-Attention Mechanism." Shock and Vibration (2021).
[15] Wang, Rui, Chen Hu, et al. "A Grassmannian Manifold Self-Attention Network for Signal Classification." Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (2024).
[16] Burger, Martin, Samira Kabri, et al. "Analysis of mean-field models arising from self-attention dynamics in transformer architectures with layer normalization." (2025).