Spatiotemporal diffusion & the surrogate foundation model
Two ideas, one bench. Diffusion turns a screen of pure noise into a coherent field — coverage, load, CSI — over space and time. A surrogate foundation model is the network you pretrain once across many worlds, then calibrate to one deployment instead of rebuilding. Everything below is framed in the language you already speak: SNR, AWGN, equalizers, path-loss calibration, behavioral models.
A spatiotemporal field, not a number
The thing a RAN twin must produce is a field over a map that evolves in time: RSRP / SINR / PRB-load / interference across a city grid, frame after frame. It has two kinds of structure stacked on top of each other. Spatial structure — neighbouring bins are correlated; coverage is smooth except at cell edges and shadowing boundaries. Temporal structure at two rates — a slow envelope (diurnal load, mobility, slow fading) riding over fast per-TTI fluctuation (fast fading, scheduling churn).
This is exactly why your generator runs two timescales — a slow super-step over c_load / c_traffic and a fast per-TTI Mamba/S4 recurrence. A generative field model has to respect both, or it produces coverage maps that are spatially plausible but temporally incoherent (the failure mode deterministic MSE models fall into).
Think of it as a movie of a heatmap, or a stack of radio maps. One frame is a spatial snapshot; the stack is a coverage trajectory. The "pixels" are bins on your grid; the "channels" are your measured quantities.
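As a minimal sketch of that layout, the stack can be held in one array indexed as field[t, c, y, x]. The axis order, grid size, channel names, and the synthetic envelope below are all assumptions for illustration, not values from any real dataset.

```python
import numpy as np

# Hypothetical sizes: 48 frames, 3 measured quantities, a 20 x 30 bin grid.
T, H, W = 48, 20, 30
CHANNELS = ("rsrp", "sinr", "prb_load")  # illustrative channel names

rng = np.random.default_rng(0)

# Slow envelope: one smooth cycle over the window (stand-in for diurnal load).
slow = 0.5 + 0.5 * np.sin(2 * np.pi * np.arange(T) / T)            # shape (T,)

# Static spatial pattern: a smooth bump centred on the grid.
yy, xx = np.mgrid[0:H, 0:W]
bump = np.exp(-(((yy - H / 2) / 6.0) ** 2 + ((xx - W / 2) / 9.0) ** 2))  # (H, W)

# Fast fluctuation: independent per-frame noise (stand-in for per-TTI churn).
fast = 0.05 * rng.standard_normal((T, len(CHANNELS), H, W))

# field[t, c, y, x]: slow envelope times spatial pattern, plus the fast term.
field = slow[:, None, None, None] * bump[None, None, :, :] + fast

frame = field[10]           # one spatial snapshot, shape (C, H, W)
trajectory = field[:, 2]    # the PRB-load channel over time, shape (T, H, W)
print(field.shape, frame.shape, trajectory.shape)
```

Slicing along the first axis gives a frame; slicing along the second gives one quantity's coverage trajectory.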
Forward = walk SNR to −∞ dB. Reverse = a learned MMSE climb back.
Forward process. Add Gaussian noise in graded steps until the field is pure AWGN. The marginal at any step is the clean signal scaled down plus noise, a planned SNR ramp:
xₜ = √γₜ · x₀ + √(1−γₜ) · ε,  ε ~ N(0, I)
γₜ ∈ [0,1] = signal retention (the ᾱₜ of DDPM). For a unit-power field, SNR(t) = γₜ/(1−γₜ). The "schedule" is just a log-SNR ramp from clean (high SNR) down to −∞ dB (snow).
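A minimal sketch of the forward ramp, assuming the standard variance-preserving marginal xₜ = √γₜ·x₀ + √(1−γₜ)·ε and a unit-power clean field. The schedule endpoints (+20 dB to −20 dB) and the function names are illustrative choices; a practical schedule runs further down toward pure noise.

```python
import numpy as np

def forward_noise(x0, gamma, rng):
    """Sample x_t = sqrt(gamma) * x0 + sqrt(1 - gamma) * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(gamma) * x0 + np.sqrt(1.0 - gamma) * eps, eps

def snr_db(gamma):
    """SNR(t) = gamma / (1 - gamma) in dB (assumes a unit-power x0)."""
    return 10.0 * np.log10(gamma / (1.0 - gamma))

# Hypothetical schedule: a linear ramp in log-SNR from +20 dB down to -20 dB.
log_snr_db = np.linspace(20.0, -20.0, 9)
snr_lin = 10.0 ** (log_snr_db / 10.0)
gammas = snr_lin / (1.0 + snr_lin)     # invert SNR = gamma / (1 - gamma)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64))     # stand-in for a unit-power clean field
for g in gammas:
    xt, _ = forward_noise(x0, g, rng)
    print(f"gamma={g:.4f}  SNR={snr_db(g):+6.1f} dB  var(x_t)={xt.var():.3f}")
```

The printed variance stays near one at every step: the signal power removed by √γₜ is replaced by noise power 1−γₜ, so only the SNR changes.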
Reverse process. Train a network to undo one step: predict the noise ε̂ that was added. By Tweedie's identity the noise prediction is the score up to a known scale and sign, ∇log p(xₜ) = −ε̂/√(1−γₜ), where xₜ is the noisy field at step t. The implied clean estimate x̂₀ = (xₜ − √(1−γₜ)·ε̂)/√γₜ is the posterior mean, i.e. the MMSE denoise of the field given its noisy observation. Sampling is iterative refinement: estimate, step, re-estimate.
Forward is your signal degrading through an AWGN channel as you crank SNR down. Reverse is an iterative MMSE equalizer / turbo decoder: each pass is a soft estimate, the score ∇log p plays the role of the LLR pointing at the likely transmitted field, and you climb SNR back up over many decoding iterations.
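The link between noise prediction, score, and MMSE estimate can be checked in closed form when the clean value has a Gaussian prior. The sketch below assumes a scalar prior N(MU, S2) with made-up parameters and uses the exact E[ε | xₜ] in place of a trained network. The loop is a deterministic DDIM-style update, one common choice of sampler rather than the only one.

```python
import numpy as np

MU, S2 = 1.0, 4.0   # hypothetical Gaussian prior on the clean value: N(MU, S2)

def eps_hat(xt, gamma):
    """Exact E[eps | x_t] under the Gaussian prior (stands in for the network)."""
    var_t = gamma * S2 + (1.0 - gamma)             # marginal variance of x_t
    return np.sqrt(1.0 - gamma) * (xt - np.sqrt(gamma) * MU) / var_t

def score(xt, gamma):
    """grad log p_t(x_t) of the Gaussian marginal N(sqrt(gamma) * MU, var_t)."""
    var_t = gamma * S2 + (1.0 - gamma)
    return -(xt - np.sqrt(gamma) * MU) / var_t

def x0_hat(xt, gamma):
    """Clean estimate implied by the noise prediction."""
    return (xt - np.sqrt(1.0 - gamma) * eps_hat(xt, gamma)) / np.sqrt(gamma)

def mmse(xt, gamma):
    """Textbook linear MMSE estimate of x0 from x_t (exact for Gaussians)."""
    var_t = gamma * S2 + (1.0 - gamma)
    return MU + np.sqrt(gamma) * S2 * (xt - np.sqrt(gamma) * MU) / var_t

# Log-SNR ramp climbing from -30 dB to +30 dB in 100 steps.
snr = 10.0 ** (np.linspace(-30.0, 30.0, 101) / 10.0)
gammas = snr / (1.0 + snr)

rng = np.random.default_rng(0)
g0 = gammas[0]
# Start from the exact low-SNR marginal (almost pure noise).
x = np.sqrt(g0) * MU + np.sqrt(g0 * S2 + 1.0 - g0) * rng.standard_normal(20000)

for g, g_next in zip(gammas[:-1], gammas[1:]):
    e = eps_hat(x, g)                                      # soft noise estimate
    x0 = (x - np.sqrt(1.0 - g) * e) / np.sqrt(g)           # implied clean estimate
    x = np.sqrt(g_next) * x0 + np.sqrt(1.0 - g_next) * e   # step to the next SNR

print(f"sample mean {x.mean():.3f} (prior mean {MU})")
print(f"sample std  {x.std():.3f} (prior std {np.sqrt(S2):.3f})")
```

With a finite number of steps the final spread lands slightly below the prior's; that is discretisation error, and it shrinks as the ramp is refined.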
From one map to a coherent trajectory — and an ensemble
Two ways to put time in. Joint block denoising (video-diffusion style): treat a space×time tensor as one object and denoise it whole, with factorized spatial + temporal attention so the frames stay consistent. Autoregressive next-state diffusion: factor the trajectory into conditional steps and diffuse one frame at a time given the recent past. This is how GenCast generates 15-day global weather, conditioning each 12-h step on the two most recent states and so modelling the joint distribution over space and time.
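The payoff that deterministic models can't give you: because sampling is stochastic, you draw a different plausible future every time. Run it N times → an ensemble whose spread estimates forecast uncertainty. That is uncertainty quantification at the price of N sampling runs, with no separate error model, though the spread still has to be verified against observations before you call it calibrated. It is also why GenCast outperformed ECMWF's operational ensemble (ENS) on most of the probabilistic skill targets it was evaluated on.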
An ensemble is your set of Monte-Carlo fading realisations — re-seed the channel, get a different but statistically valid draw. Spread across members is your confidence interval on coverage, the thing a single ray-trace can never hand you.
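A sketch of the re-seed-and-roll pattern. The function sample_next is a hypothetical stand-in for one conditional sampling call; here it is a plain AR(1) update, not a diffusion model. Grid size, horizon, and member count are arbitrary.

```python
import numpy as np

def sample_next(prev, rng, rho=0.9, sigma=0.5):
    """Hypothetical stand-in for a draw from p(next frame | previous frame).
    A simple AR(1) update, used only to make the rollout runnable."""
    return rho * prev + sigma * rng.standard_normal(prev.shape)

def rollout(x0, n_steps, seed):
    """Autoregressive rollout: each frame is sampled given the one before it."""
    rng = np.random.default_rng(seed)
    frames = [x0]
    for _ in range(n_steps):
        frames.append(sample_next(frames[-1], rng))
    return np.stack(frames)                        # shape (n_steps + 1, H, W)

x0 = np.zeros((8, 8))                              # illustrative initial field
members = np.stack([rollout(x0, n_steps=24, seed=s) for s in range(32)])
# members has shape (N, T, H, W): member, time, grid rows, grid columns.

ens_mean = members.mean(axis=0)                    # central estimate per bin
ens_spread = members.std(axis=0, ddof=1)           # member-to-member spread per bin
lo, hi = np.percentile(members, [5, 95], axis=0)   # empirical 90% band per bin

print("mean spread at step 1: ", ens_spread[1].mean())
print("mean spread at step 24:", ens_spread[24].mean())
```

Spread is zero at the shared initial frame and grows with lead time. Whether that growth matches real error growth is an empirical question for the actual sampler, not something this toy can show.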
Steer the generator with config, geometry, and actions
A conditional diffusion model takes a side input c and biases every denoising step toward fields consistent with it — geometry, antenna config, traffic, or a control action. The "Digital-Twin-of-Channel" line does exactly this: map UE position → statistical CSI, no pilots. Classifier-free guidance lets you dial how hard c constrains the output. Because diffusion models multimodal distributions well, they're a strong fit for action-conditioned control where one action can lead to several plausible outcomes.
This is the formal home of your non-batch, config-injectable mode: c carries the mid-simulation action, and the generator rolls a conditioned, action-aware future. Your batch mode is the same model with c fixed for the whole run: open-loop dataset generation. The batch/non-batch axis you flagged as thesis-critical is, in diffusion terms, just sampling under a fixed condition vs sampling under a condition that is updated mid-rollout.
Conditioning is side information / pilots / known CSI steering the decode. Guidance scale is how much weight you put on that side info versus the learned prior — turn it up and the output snaps to the constraint; turn it down and you get more diversity.
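The guidance dial can be written in one line. This sketch uses the convention where w = 0 ignores c and w = 1 is the plain conditional prediction; some papers offset the scale by one, so treat the parameterisation as an assumption. The two noise predictions are made-up numbers standing in for two calls to the same network, with and without c.

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    toward the conditional one. w = 0 ignores c, w = 1 is plain conditional,
    w > 1 pushes harder toward c at the cost of diversity."""
    return eps_uncond + w * (eps_cond - eps_uncond)

# Hypothetical noise predictions for a 2 x 2 field at one denoising step.
eps_u = np.array([[0.2, -0.1], [0.0, 0.4]])   # network called without c
eps_c = np.array([[0.5, -0.3], [0.1, 0.4]])   # network called with c

for w in (0.0, 1.0, 3.0):
    print(w, guided_eps(eps_u, eps_c, w).round(2).tolist())
```

Bins where the two predictions agree (the bottom-right entry here) are unaffected by w: guidance only amplifies what c actually changes.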
A fast emulator standing in for an expensive simulator
A surrogate learns the input→output map of a costly solver so you can skip the solver. You already build these: a behavioral PA model (memory polynomial, Volterra, neural DPD) replaces circuit-level SPICE; a learned channel model replaces full ray tracing. In SciML the standard surrogates are neural operators — FNO, DeepONet — that approximate a PDE's solution operator directly. The trade is always the same: orders-of-magnitude speed for a bounded fidelity gap you must measure and defend.
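To make the speed-for-fidelity trade concrete, here is a sketch of a memory-polynomial surrogate fitted by least squares and scored on held-out data. The function expensive_pa is a hypothetical stand-in for the costly solver (a compressive nonlinearity with one tap of memory), not a model of any real amplifier. The orders, memory depth, and drive level are arbitrary.

```python
import numpy as np

def expensive_pa(x):
    """Hypothetical stand-in for the costly solver: a compressive
    nonlinearity followed by one tap of memory."""
    f = x / (1.0 + 0.5 * np.abs(x) ** 2)
    y = f.copy()
    y[1:] += 0.2 * f[:-1]
    return y

def mp_basis(x, orders=(1, 3, 5), memory=3):
    """Memory-polynomial regressors x[n-m] * |x[n-m]|**(k-1), one column per (m, k)."""
    cols = []
    for m in range(memory):
        xm = np.concatenate([np.zeros(m, dtype=x.dtype), x[: len(x) - m]])  # delay by m
        for k in orders:
            cols.append(xm * np.abs(xm) ** (k - 1))
    return np.stack(cols, axis=1)

def nmse_db(y_ref, y_hat):
    """Normalised mean-squared error in dB: the measured fidelity gap."""
    err = np.sum(np.abs(y_ref - y_hat) ** 2)
    return 10.0 * np.log10(err / np.sum(np.abs(y_ref) ** 2))

def drive(n, rng):
    """Complex Gaussian baseband drive signal (illustrative)."""
    return 0.5 * (rng.standard_normal(n) + 1j * rng.standard_normal(n))

rng = np.random.default_rng(0)
x_train, x_test = drive(4000, rng), drive(4000, rng)

def fit_and_score(orders):
    """Fit on the training drive, report the gap on the held-out drive."""
    coef, *_ = np.linalg.lstsq(mp_basis(x_train, orders),
                               expensive_pa(x_train), rcond=None)
    return nmse_db(expensive_pa(x_test), mp_basis(x_test, orders) @ coef)

gap = fit_and_score((1, 3, 5))     # nonlinear surrogate
gap_linear = fit_and_score((1,))   # linear FIR baseline
print(f"held-out NMSE: memory polynomial {gap:.1f} dB, linear FIR {gap_linear:.1f} dB")
```

The held-out NMSE is the number to defend: it bounds the fidelity gap only for drive statistics like those seen in training.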
Your generative twin is a surrogate for the expensive RAN simulator — and your own notes already pin the expensive side: Sionna RT isn't natively multi-GPU, so ray-traced datasets are slow to produce. That cost asymmetry is the entire reason a surrogate exists. The fidelity gap is your A+B spine's "sim-to-real / sim-to-surrogate discrepancy" — the thing you decompose rather than wave away.
Pretrain across many worlds, then calibrate — don't rebuild
A classic surrogate is task-specific: train an FNO for one regime, and it breaks the moment the geometry or physics changes, so you retrain from scratch. A foundation surrogate is pretrained at scale on heterogeneous data spanning many scenarios, learning a shared physical latent space, then adapted to a specific instance with light fine-tuning. Poseidon (like its siblings MPP and DPOT) does this for PDEs: pretrained on fluid dynamics, it generalises to unseen equations with roughly an order of magnitude better sample efficiency than training fresh.
It's already landing in your field. LAETwin-XL builds a conditional-diffusion generative foundation model on top of Sionna RT, pretrained to learn transferable channel representations from incomplete observations. The adaptation recipe is the universal one: freeze the backbone, fit a small adapter on local data.
You've already named the adapter: LoRA calibration as the analogue of RF path-loss calibration. That's the whole foundation play — one pretrained spatiotemporal backbone (your calibrated latent state-space generator), a frozen physics structure (the differentiable 3GPP graph Φ), and a per-deployment LoRA fit from UE reports. Calibrate to a cell, don't retrain a city.
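A sketch of the adapter itself, assuming the standard LoRA parameterisation W + (α/r)·B·A with B zero-initialised. The class name, layer size, rank, and α are illustrative, and the fit step (gradient updates to A and B from UE reports) is left out.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, W, rank, alpha, rng):
        d_out, d_in = W.shape
        self.W = W                                          # frozen backbone weight
        self.A = 0.01 * rng.standard_normal((rank, d_in))   # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        # Backbone path plus low-rank correction; x has shape (batch, d_in).
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)) / 16.0    # stand-in for a pretrained weight
layer = LoRALinear(W, rank=4, alpha=8.0, rng=rng)
x = rng.standard_normal((5, 256))

print("matches frozen layer at init:", np.allclose(layer(x), x @ W.T))
print("trainable:", layer.trainable_params(), "of", W.size, "backbone weights")
```

Because B starts at zero, the adapted model begins exactly at the pretrained one, so calibration only moves it as far as the local data demands.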
Diffusion vs your SSM loop — trade, don't replace
Your closed-loop generator is SSM-based for good reasons. The two families have nearly complementary strengths, so the useful question isn't "which one" but "which does what."
| Axis | Spatiotemporal diffusion | SSM closed loop · Mamba/S4 (yours) |
|---|---|---|
| sampling | iterative — many denoise steps per frame | single linear-time pass per step |
| time | block or autoregressive next-state | native streaming recurrence, per-TTI |
| uncertainty | principled ensembles, calibrated spread | needs added stochastic head for spread |
| action-conditioning | via guidance; great on multimodal outcomes | natural in the recurrent state — your strength |
| cost / rollout step | high (NFE-bound) unless distilled | low — built for long fast rollouts |
| best fit in your stack | spatial structure, innovations, initial fields, batch-mode datasets, UQ | the fast per-TTI closed loop, non-batch control |
Three concrete couplings: (1) diffusion as a spatial emission / innovation generator feeding your physics-structured head Φ — it draws the correlated spatial field, Φ imposes 3GPP structure; (2) diffusion for batch-mode ensemble datasets with calibrated spread, the SSM for non-batch streaming control; (3) diffusion to synthesise calibration targets for under-observed cells, then your LoRA fit closes the loop. Diffusion is the field painter; your SSM is the field controller.
Sampling cost, physical consistency, generalisation
Sampling cost. Naïve diffusion needs many function evaluations per sample, a problem for a fast surrogate. The fix is distillation / consistency / few-step samplers that collapse the trajectory to a handful of steps.
Physical consistency. Generative fields can look right and violate physics; your differentiable 3GPP graph Φ is precisely the structural prior that keeps emissions admissible.
Generalisation. A foundation surrogate is only as good as its pretraining coverage: out-of-distribution deployments degrade, which is why calibration and an explicit discrepancy term matter.
This is where your parity work earns its keep: bitwise parity for behaviour-neutral refactors, TOST equivalence for behavioural changes, common random numbers for paired stochastic comparisons. For the ensemble side, add CRPS and spread-skill to check the spread is honest, not just sharp. Sharp-but-wrong is the diffusion trap; calibrated-and-sharp is the bar.
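A sketch of the two ensemble checks, using the plain sample estimator of CRPS and a spread-skill ratio with the usual finite-ensemble correction. The synthetic "calibrated" and "overconfident" ensembles are made up so the two behaviours can be compared; function names are illustrative.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Sample CRPS per verification point: mean|X - y| - 0.5 * mean|X - X'|.
    members: (N, P) draws, obs: (P,) observations. Lower is better.
    The pairwise mean includes i == j, so it is slightly biased for small N."""
    term1 = np.abs(members - obs).mean(axis=0)
    term2 = np.abs(members[:, None] - members[None, :]).mean(axis=(0, 1))
    return term1 - 0.5 * term2

def spread_skill_ratio(members, obs):
    """Ensemble spread over RMSE of the ensemble mean, with the sqrt((N+1)/N)
    finite-ensemble correction. Near 1 when calibrated, below 1 when overconfident."""
    n = members.shape[0]
    spread = np.sqrt(members.var(axis=0, ddof=1).mean())
    rmse = np.sqrt(((members.mean(axis=0) - obs) ** 2).mean())
    return np.sqrt((n + 1) / n) * spread / rmse

rng = np.random.default_rng(0)
n_members, n_points = 20, 2000
centre = rng.standard_normal(n_points)              # predictable part of the field
obs = centre + rng.standard_normal(n_points)        # truth: centre plus unit noise

calibrated = centre + rng.standard_normal((n_members, n_points))
overconfident = centre + 0.3 * rng.standard_normal((n_members, n_points))

crps_cal = crps_ensemble(calibrated, obs).mean()
crps_sharp = crps_ensemble(overconfident, obs).mean()
ssr_cal = spread_skill_ratio(calibrated, obs)
ssr_sharp = spread_skill_ratio(overconfident, obs)

print(f"calibrated:    CRPS {crps_cal:.3f}, spread-skill {ssr_cal:.2f}")
print(f"overconfident: CRPS {crps_sharp:.3f}, spread-skill {ssr_sharp:.2f}")
```

Both ensembles share the same centre, so their ensemble-mean error is nearly identical. Only the spread differs, and CRPS and the spread-skill ratio both penalise the one that is too narrow.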