Study module · generative modelling for a city-scale RAN twin

Spatiotemporal diffusion & the surrogate foundation model

Two ideas, one bench. Diffusion turns a screen of pure noise into a coherent field — coverage, load, CSI — over space and time. A surrogate foundation model is the network you pretrain once across many worlds, then calibrate to one deployment instead of rebuilding. Everything below is framed in the language you already speak: SNR, AWGN, equalizers, path-loss calibration, behavioral models.

Interactive figure: RSRP field on a 200×113 grid, reverse process (noise→structure), with γ, step, SNR and σ readouts. Live forward marginal q(xₜ|x₀) swept across the log-SNR schedule, then resampled: the canonical diffusion dissolve/reform.

CH-01 · the object we are generating

A spatiotemporal field, not a number

The thing a RAN twin must produce is a field over a map that evolves in time: RSRP / SINR / PRB-load / interference across a city grid, frame after frame. It has two kinds of structure stacked on top of each other. Spatial structure — neighbouring bins are correlated; coverage is smooth except at cell edges and shadowing boundaries. Temporal structure at two rates — a slow envelope (diurnal load, mobility, slow fading) riding over fast per-TTI fluctuation (fast fading, scheduling churn).

your stack

This is exactly why your generator runs two timescales — a slow super-step over c_load / c_traffic and a fast per-TTI Mamba/S4 recurrence. A generative field model has to respect both, or it produces coverage maps that are spatially plausible but temporally incoherent (the failure mode deterministic MSE models fall into).

RF read

Think of it as a movie of a heatmap, or a stack of radio maps. One frame is a spatial snapshot; the stack is a coverage trajectory. The "pixels" are bins on your grid; the "channels" are your measured quantities.

CH-02 · the diffusion mechanism

Forward = walk SNR to −∞. Reverse = a learned MMSE climb back.

Forward process. Add Gaussian noise in graded steps until the field is pure AWGN. The marginal at any step is a clean signal scaled down plus noise — a planned SNR ramp:

q(xₜ|x₀) = 𝒩( xₜ ; √γₜ·x₀ , (1−γₜ)·I )

γₜ ∈ [0,1] = signal retention (the ᾱₜ of DDPM). For a field normalised to unit variance, SNR(t) = γₜ/(1−γₜ). The "schedule" is just a log-SNR ramp from clean (high SNR) down to −∞ dB (snow).
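A minimal NumPy sketch of the forward marginal above, assuming the field has been normalised to roughly unit variance so that γₜ/(1−γₜ) can be read as an SNR. The function names, the random stand-in field and its 113×200 shape are illustrative assumptions, not part of any real pipeline.

```python
import numpy as np

def forward_noise(x0, gamma, rng):
    """Draw x_t ~ N(sqrt(gamma) * x0, (1 - gamma) * I) for one retention level gamma."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(gamma) * x0 + np.sqrt(1.0 - gamma) * eps
    return xt, eps

def snr_db(gamma):
    """SNR(t) = gamma / (1 - gamma), in dB (valid for a unit-variance field)."""
    return 10.0 * np.log10(gamma / (1.0 - gamma))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((113, 200))      # stand-in for a normalised RSRP grid
xt, eps = forward_noise(x0, 0.5, rng)     # gamma = 0.5 is the 0 dB point

for gamma in (0.99, 0.5, 0.01):
    print(f"gamma={gamma:.2f}  sigma={np.sqrt(1 - gamma):.3f}  SNR={snr_db(gamma):+.1f} dB")
```

Because the marginal is available in closed form, any noise level can be sampled directly from x₀ without stepping through the intermediate levels.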

Reverse process. Train a network to undo one step: predict the noise ε̂ that was added. The noise prediction is the score up to a scale and a sign, ∇log p(xₜ) = −ε̂/√(1−γₜ), and by Tweedie's identity the implied clean estimate x̂₀ = (xₜ − √(1−γₜ)·ε̂)/√γₜ is the posterior mean E[x₀|xₜ], i.e. the MMSE denoise of the field given its noisy observation. Sampling is iterative refinement: estimate, step, re-estimate.
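The link between noise prediction, score and MMSE estimate can be checked numerically on a toy case. The sketch below assumes x₀ ~ 𝒩(0, I), for which the optimal noise predictor is known in closed form and stands in for a trained network; all function names are illustrative.

```python
import numpy as np

def x0_from_eps(xt, eps_hat, gamma):
    """Clean estimate implied by a noise prediction (posterior mean if eps_hat is optimal)."""
    return (xt - np.sqrt(1.0 - gamma) * eps_hat) / np.sqrt(gamma)

def score_from_eps(eps_hat, gamma):
    """Score of the noisy marginal implied by a noise prediction."""
    return -eps_hat / np.sqrt(1.0 - gamma)

def optimal_eps_unit_gaussian(xt, gamma):
    """Exact E[eps | x_t] when x0 ~ N(0, I). Toy assumption: replaces the trained network."""
    return np.sqrt(1.0 - gamma) * xt

rng = np.random.default_rng(0)
gamma = 0.5
xt = rng.standard_normal(1000)            # with x0 ~ N(0, I), x_t is also N(0, I)
eps_hat = optimal_eps_unit_gaussian(xt, gamma)

x0_hat = x0_from_eps(xt, eps_hat, gamma)  # should equal the MMSE estimate sqrt(gamma) * x_t
score = score_from_eps(eps_hat, gamma)    # should equal -x_t, the score of N(0, I)
```

In this toy the MMSE estimate shrinks the observation by √γₜ, the familiar Wiener-style gain for a unit-variance signal in noise.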

RF read

Forward is your signal degrading through an AWGN channel as you crank SNR down. Reverse is an iterative MMSE equalizer / turbo decoder: each pass is a soft estimate, the score ∇log p plays the role of the LLR pointing at the likely transmitted field, and you climb SNR back up over many decoding iterations.

Interactive figure: diffusion bench. A slider (or sweep) sets the noise level t, from clean field on the left to pure AWGN on the right. Example readout at t = 500: γ (retain) = 0.500, σ (noise) = 0.707, SNR = 0.0 dB.

CH-03 · space × time together

From one map to a coherent trajectory — and an ensemble

Two ways to put time in. Joint block denoising (video-diffusion style): treat a space×time tensor as one object and denoise it whole, with factorized spatial + temporal attention so the frames stay consistent. Autoregressive next-state diffusion: factor the trajectory into conditional steps and diffuse one frame at a time given the most recent ones. This is how GenCast generates 15-day global weather: each 12-h step is sampled conditioned on the preceding states, and chaining the steps models the joint distribution over space and time.

The payoff that deterministic models can't give you: because sampling is stochastic, you draw a different plausible future every time. Run it N times and you have an ensemble whose spread, if the model is well trained, is calibrated. That is uncertainty quantification for the price of N sampling runs, with no separate uncertainty model, and it's why GenCast outscored ECMWF's operational physics ensemble (ENS) on probabilistic skill.
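The ensemble bookkeeping can be sketched independently of the sampler. In the snippet below, `sample_member` is a hypothetical stand-in for one full reverse-diffusion run (it only perturbs a conditional mean), and the 4 dB spread, −100 dBm threshold and 8×8 grid are made-up illustration values.

```python
import numpy as np

def sample_member(cond_mean_dbm, rng, sigma_db=4.0):
    """Hypothetical stand-in for one stochastic sampler run.

    A real sampler would start from noise and denoise iteratively; this just
    perturbs a conditional mean so the ensemble logic is runnable.
    """
    return cond_mean_dbm + sigma_db * rng.standard_normal(cond_mean_dbm.shape)

def ensemble_summary(cond_mean_dbm, n_members, seed=0, threshold_dbm=-100.0):
    # One independent child seed per member, reproducible from a single seed.
    children = np.random.SeedSequence(seed).spawn(n_members)
    members = np.stack([
        sample_member(cond_mean_dbm, np.random.default_rng(child))
        for child in children
    ])                                                  # shape (n_members, H, W)
    mean = members.mean(axis=0)                         # per-bin ensemble mean
    spread = members.std(axis=0, ddof=1)                # per-bin ensemble spread
    p_cover = (members > threshold_dbm).mean(axis=0)    # per-bin coverage probability
    return mean, spread, p_cover

cond_mean = np.full((8, 8), -95.0)
mean, spread, p_cover = ensemble_summary(cond_mean, n_members=200)
```

The per-bin exceedance probability is the quantity a single deterministic map cannot provide: it turns the ensemble into a coverage-probability map.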

Interactive figure: ensemble, three stochastic draws (members 01 to 03) of the same conditioned future.

RF read

An ensemble is your set of Monte-Carlo fading realisations — re-seed the channel, get a different but statistically valid draw. Spread across members is your confidence interval on coverage, the thing a single ray-trace can never hand you.

CH-04 · conditioning & control

Steer the generator with config, geometry, and actions

A conditional diffusion model takes a side input c and biases every denoising step toward fields consistent with it — geometry, antenna config, traffic, or a control action. The "Digital-Twin-of-Channel" line does exactly this: map UE position → statistical CSI, no pilots. Classifier-free guidance lets you dial how hard c constrains the output. Because diffusion models multimodal distributions well, they're a strong fit for action-conditioned control where one action can lead to several plausible outcomes.
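A minimal sketch of the classifier-free guidance combination, assuming one network that can be queried with and without the condition c (in practice obtained by randomly dropping c during training). The function name and the toy vectors are illustrative. Note that some papers write the same rule as (1+w)·ε_cond − w·ε_uncond, which shifts the meaning of w by one.

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, w):
    """Blend unconditional and conditional noise predictions.

    w = 0 -> unconditional, w = 1 -> plain conditional,
    w > 1 -> extrapolate past the conditional prediction (stronger constraint).
    """
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_u = np.array([0.0, 1.0, -1.0])   # toy prediction without the condition
eps_c = np.array([1.0, 1.0, 0.0])    # toy prediction with the condition

for w in (0.0, 1.0, 3.0):
    print(w, guided_eps(eps_u, eps_c, w))
```

The guided prediction simply replaces ε̂ in each denoising step, so the guidance scale can be changed at sampling time without retraining.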

your stack

This is the formal home of your non-batch, config-injectable mode: c carries the mid-simulation action, and the generator rolls a conditioned, action-aware future. Your batch mode is the same model with c fixed: open-loop dataset generation. The batch/non-batch axis you flagged as thesis-critical is, in diffusion terms, just sampling under a fixed condition vs sampling with c updated mid-rollout.

RF read

Conditioning is side information / pilots / known CSI steering the decode. Guidance scale is how much weight you put on that side info versus the learned prior — turn it up and the output snaps to the constraint; turn it down and you get more diversity.

CH-05 · what "surrogate" means

A fast emulator standing in for an expensive simulator

A surrogate learns the input→output map of a costly solver so you can skip the solver. You already build these: a behavioral PA model (memory polynomial, Volterra, neural DPD) replaces circuit-level SPICE; a learned channel model replaces full ray tracing. In SciML the standard surrogates are neural operators — FNO, DeepONet — that approximate a PDE's solution operator directly. The trade is always the same: orders-of-magnitude speed for a bounded fidelity gap you must measure and defend.
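As one concrete instance of the surrogate pattern, the sketch below fits a memory-polynomial behavioral model by least squares and measures the fidelity gap as NMSE on held-out data. `reference_pa` is a hypothetical stand-in for the expensive solver, chosen deliberately inside the model class so the gap comes out near machine precision; a real circuit-level or ray-traced reference would leave a nonzero gap, which is the number to report.

```python
import numpy as np

def reference_pa(x):
    """Hypothetical 'expensive solver': cubic compression plus one memory tap."""
    x_prev = np.concatenate([np.zeros(1, dtype=x.dtype), x[:-1]])
    return x - 0.1 * x * np.abs(x) ** 2 + 0.05 * x_prev

def mp_basis(x, order, memory):
    """Regressors x[n-m] * |x[n-m]|**(k-1) for k = 1..order, m = 0..memory."""
    cols = []
    for m in range(memory + 1):
        xd = np.concatenate([np.zeros(m, dtype=x.dtype), x[:len(x) - m]])
        for k in range(1, order + 1):
            cols.append(xd * np.abs(xd) ** (k - 1))
    return np.stack(cols, axis=1)

def nmse_db(y_true, y_hat):
    """Normalised mean-squared error in dB: the measured fidelity gap."""
    return 10.0 * np.log10(np.sum(np.abs(y_true - y_hat) ** 2) / np.sum(np.abs(y_true) ** 2))

rng = np.random.default_rng(0)

def random_signal(n):
    return 0.5 * (rng.standard_normal(n) + 1j * rng.standard_normal(n)) / np.sqrt(2)

x_train, x_test = random_signal(4000), random_signal(1000)

# Fit the surrogate on solver outputs, then score it on inputs it has not seen.
coef, *_ = np.linalg.lstsq(mp_basis(x_train, 3, 1), reference_pa(x_train), rcond=None)
gap_db = nmse_db(reference_pa(x_test), mp_basis(x_test, 3, 1) @ coef)
print(f"held-out NMSE: {gap_db:.1f} dB")
```

Once fitted, the surrogate costs one matrix-vector product per evaluation, which is the speed side of the trade described above.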

your stack

Your generative twin is a surrogate for the expensive RAN simulator — and your own notes already pin the expensive side: Sionna RT isn't natively multi-GPU, so ray-traced datasets are slow to produce. That cost asymmetry is the entire reason a surrogate exists. The fidelity gap is your A+B spine's "sim-to-real / sim-to-surrogate discrepancy" — the thing you decompose rather than wave away.

Interactive figure: cost per sample, physics solver (ray-trace + full RAN) vs learned surrogate (one forward pass). Order-of-magnitude and illustrative: the point is the asymmetry, not the exact figures.

CH-06 · what makes a surrogate a foundation

Pretrain across many worlds, then calibrate — don't rebuild

A classic surrogate is task-specific: train an FNO for one regime, and it breaks the moment the geometry or physics changes — retrain from scratch. A foundation surrogate is pretrained at scale on heterogeneous data spanning many scenarios, learning a shared physical latent space, then adapted to a specific instance with light fine-tuning. Poseidon (and siblings MPP, DPOT) does this for PDEs — pretrained on fluid dynamics, it generalises to unseen equations with an order-of-magnitude better sample efficiency than training fresh.

It's already landing in your field. LAETwin-XL builds a conditional-diffusion generative foundation model on top of Sionna RT, pretrained to learn transferable channel representations from incomplete observations. The adaptation recipe is the universal one: freeze the backbone, fit a small adapter on local data.

your stack

You've already named the adapter: LoRA calibration as the analogue of RF path-loss calibration. That's the whole foundation play — one pretrained spatiotemporal backbone (your calibrated latent state-space generator), a frozen physics structure (the differentiable 3GPP graph Φ), and a per-deployment LoRA fit from UE reports. Calibrate to a cell, don't retrain a city.
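The adapter idea can be sketched for a single linear layer. This is a generic LoRA-style layer in NumPy under assumed sizes (256×256 weight, rank 4, α = 8), not the layout of any particular backbone; the class name and initialisation scale are illustrative.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, w_frozen, rank, alpha, rng):
        d_out, d_in = w_frozen.shape
        self.w = w_frozen                                            # pretrained, never updated
        self.a = rng.standard_normal((rank, d_in)) / np.sqrt(d_in)   # trainable
        self.b = np.zeros((d_out, rank))                             # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base path plus low-rank correction; the correction is zero until B is trained.
        return x @ self.w.T + self.scale * (x @ self.a.T) @ self.b.T

    def n_trainable(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(0)
w0 = rng.standard_normal((256, 256))
layer = LoRALinear(w0, rank=4, alpha=8.0, rng=rng)
x = rng.standard_normal((32, 256))

print("trainable:", layer.n_trainable(), "of", w0.size)
```

With B initialised to zero the adapted layer reproduces the pretrained one exactly, so calibration starts from the backbone's behaviour and only the small A and B matrices are fitted per deployment.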


CH-07 · honest placement in your design

Diffusion vs your SSM loop — trade, don't replace

Your closed-loop generator is SSM-based for good reasons. The two families have nearly complementary strengths, so the useful question isn't "which one" but "which does what."

Axis	Spatiotemporal diffusion	SSM closed loop · Mamba/S4 (yours)
sampling	iterative — many denoise steps per frame	single linear-time pass per step
time	block or autoregressive next-state	native streaming recurrence, per-TTI
uncertainty	principled ensembles, calibrated spread	needs added stochastic head for spread
action-conditioning	via guidance; great on multimodal outcomes	natural in the recurrent state — your strength
cost / rollout step	high (NFE-bound) unless distilled	low — built for long fast rollouts
best fit in your stack	spatial structure, innovations, initial fields, batch-mode datasets, UQ	the fast per-TTI closed loop, non-batch control

where they compose

Three concrete couplings: (1) diffusion as a spatial emission / innovation generator feeding your physics-structured head Φ — it draws the correlated spatial field, Φ imposes 3GPP structure; (2) diffusion for batch-mode ensemble datasets with calibrated spread, the SSM for non-batch streaming control; (3) diffusion to synthesise calibration targets for under-observed cells, then your LoRA fit closes the loop. Diffusion is the field painter; your SSM is the field controller.

CH-08 · what bites, and how you'd measure it

Sampling cost, physical consistency, generalisation

Sampling cost. Naïve diffusion needs many function evals per sample, a problem for a fast surrogate. The fix is distillation / consistency / few-step samplers that collapse the trajectory to a handful of steps.

Physical consistency. Generative fields can look right and violate physics; your differentiable 3GPP graph Φ is precisely the structural prior that keeps emissions admissible.

Generalisation. A foundation surrogate is only as good as its pretraining coverage. Out-of-distribution deployments degrade, which is why calibration and an explicit discrepancy term matter.
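The sampling-cost trade can be made visible on a toy where the exact denoiser is known. The sketch below assumes a 1-D Gaussian target x₀ ~ 𝒩(MU, S²) (made-up values) and a deterministic DDIM-style update on evenly spaced log-SNR levels; it is not a distilled or consistency sampler, only naive step reduction. In this toy, cutting the number of function evaluations (NFE) keeps the mean but collapses the spread, which is the gap that distillation and consistency methods aim to close.

```python
import numpy as np

MU, S = -1.0, 0.3   # hypothetical target distribution: x0 ~ N(MU, S^2)

def posterior_mean(xt, gamma):
    """Exact E[x0 | x_t] for the Gaussian toy; stands in for a trained denoiser."""
    var_t = gamma * S**2 + (1.0 - gamma)
    return MU + np.sqrt(gamma) * S**2 / var_t * (xt - np.sqrt(gamma) * MU)

def ddim_sample(n_samples, n_steps, rng, gamma_min=1e-4):
    """Deterministic DDIM-style sampler using n_steps denoiser calls (NFE)."""
    logit = np.log(gamma_min / (1.0 - gamma_min))
    # Evenly spaced log-SNR levels, from very noisy to almost clean.
    gammas = 1.0 / (1.0 + np.exp(-np.linspace(logit, -logit, n_steps)))
    x = rng.standard_normal(n_samples)               # start from (near) pure noise
    for g_now, g_next in zip(gammas[:-1], gammas[1:]):
        x0_hat = posterior_mean(x, g_now)
        eps_hat = (x - np.sqrt(g_now) * x0_hat) / np.sqrt(1.0 - g_now)
        x = np.sqrt(g_next) * x0_hat + np.sqrt(1.0 - g_next) * eps_hat
    return posterior_mean(x, gammas[-1])             # final clean estimate

rng = np.random.default_rng(0)
for nfe in (2, 10, 50, 200):
    samples = ddim_sample(20000, nfe, rng)
    print(f"NFE={nfe:4d}  mean={samples.mean():+.3f}  std={samples.std():.3f}  (target std {S})")
```

The sample standard deviation approaches the target only as NFE grows, so a spread check belongs next to any latency number quoted for a few-step sampler.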

evaluation — you already have the rig

This is where your parity work earns its keep: bitwise parity for behaviour-neutral refactors, TOST equivalence for behavioural changes, common random numbers for paired stochastic comparisons. For the ensemble side, add CRPS and spread-skill to check the spread is honest, not just sharp. Sharp-but-wrong is the diffusion trap; calibrated-and-sharp is the bar.
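Both ensemble checks fit in a few lines of NumPy. The sketch below uses the standard empirical CRPS estimator and a simple spread-skill ratio, on synthetic data where the "honest" ensemble is drawn from the same distribution as the observation and the "sharp" one is four times too narrow. All sizes and dB values are made up for illustration.

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS per point: E|X - y| - 0.5 * E|X - X'| over members.

    members: (N, P) ensemble, obs: (P,) observations. Lower is better.
    """
    err = np.abs(members - obs).mean(axis=0)
    pair = np.abs(members[:, None, :] - members[None, :, :]).mean(axis=(0, 1))
    return err - 0.5 * pair

def spread_skill_ratio(members, obs):
    """RMS ensemble spread over RMSE of the ensemble mean; near 1 when calibrated."""
    spread = np.sqrt(members.var(axis=0, ddof=1).mean())
    rmse = np.sqrt(((members.mean(axis=0) - obs) ** 2).mean())
    return spread / rmse

rng = np.random.default_rng(0)
n, p, sigma = 20, 5000, 4.0
centre = rng.normal(-95.0, 6.0, size=p)                        # per-bin conditional mean
obs = centre + sigma * rng.standard_normal(p)                  # "truth"
honest = centre + sigma * rng.standard_normal((n, p))          # correct spread
sharp = centre + 0.25 * sigma * rng.standard_normal((n, p))    # overconfident

for name, ens in (("honest", honest), ("sharp", sharp)):
    print(name, crps_ensemble(ens, obs).mean(), spread_skill_ratio(ens, obs))
```

A ratio well below 1 flags an under-dispersed (overconfident) ensemble, and the CRPS penalises it even though its mean is just as accurate.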

REF · decoder ring · RF ⇄ generative

The same idea in two vocabularies

RF / comms	Diffusion / foundation
AWGN at low SNR	fully-noised state x_T
log-SNR / link budget ramp	noise schedule γₜ
MMSE equalizer / turbo decode	reverse denoising step
soft LLR toward sent symbol	score ∇log p(xₜ)
pilots / known CSI	conditioning input c
Monte-Carlo fading draws	ensemble members
behavioral PA / channel model	surrogate model
universal model, many scenarios	foundation pretraining
path-loss calibration	LoRA / fine-tune adapter
3GPP structure constraints	physics-structured head Φ
