Study module · from confused to a first concrete step

Build a generative-AI digital twin
one rung at a time

You're not building "an AI." You're building a fast generative stand-in for an expensive stochastic RAN simulator — a model that, given a city's geometry and config, emits the same per-cell / per-UE / per-TTI KPI distributions the simulator would, only in milliseconds. This page is the ladder: the background you actually need, the one thing to build first, and the three rungs after it. Everything is mapped to the RF you already know.

Figure: the object you are modelling. A heterogeneous graph (nodes: cells + UEs; edges: serve · interfere) whose typed nodes each emit a per-node KPI distribution (whisker = p10–p90, tick = median), resampled like a Monte-Carlo run.

00 · orient before you build

What the thing actually is — in one breath

Your in-house simulator is a traditional stochastic system-level simulator — Monte-Carlo fading, scheduler logic, traffic processes. No ML inside. It is slow, but it is your teacher and oracle. The job is to train a generative model that learns to sample like the simulator: feed it the same scene, and it should produce KPI samples drawn from the same distribution — fast enough to run a city or a long sweep.

Three properties decide every design choice downstream. The output is graph-shaped (per cell, per UE — typed nodes, not pixels). It is stochastic (per-TTI KPIs scatter — so you predict a distribution, never a single number). And it is temporal (it evolves TTI by TTI). Hold those three and the rest of this page is just "add one property at a time."

first correction

The teacher is a trusted reference with a known error budget (≈ ±3 dB), not "ground truth." You grade the surrogate on fidelity to the simulator's own distribution, not on bit-for-bit equality — stochastic output has no single right answer to match.

your stack

This is the batch-mode, open-loop generator first: scene in → KPI distribution out, config fixed. Non-batch / config-injection (the action-conditioned, closed-loop version) is the same model with a conditioning input added later — don't start there.

BG · the background you actually need

What you already have vs. what's genuinely new

You are not starting from zero — most of the math is RF math wearing different notation. The honest list of prerequisites is short, and only two items are truly new to a wireless engineer. Learn them in this order; each rung of the build needs only the boxes above it.

Probability & distributions · you have it

Gaussians, sampling, expectation, Monte-Carlo. You run fading realisations for a living — this is distributional thinking. Reuse it directly.

Linear algebra & DSP · you have it

Vectors, matrices, convolution, transforms. Embeddings are just feature vectors; a layer is a matrix multiply plus a nonlinearity.

Training loop & autograd · brush up

PyTorch tensors, a forward pass, a loss, loss.backward(), an optimizer step. Conceptually: gradient descent on an error surface. One afternoon to get fluent.

Graphs & message passing · new, core

Nodes, edges, neighbourhoods; how a GNN updates a node from its neighbours. This is the genuinely new mechanic. PyTorch Geometric + HeteroData. §03 teaches it visually.

Distributional output · small reframe

Quantile regression + pinball loss → calibrated bands. You already think in percentiles; this just makes the network emit them. §04.

Latent-variable models · new, for v0.5

VAE, the ELBO (reconstruction − KL), reparameterization. Needed only when you add the shared-shock latent. Defer until the v0.5 rung.
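The "training loop & autograd" item in the list above (forward pass, loss, loss.backward(), optimizer step) can be sketched as follows. This is a minimal illustration, not part of the twin itself: the synthetic line-fit data, the learning rate, and the step count are all assumptions chosen only to make the loop visible.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Synthetic data (assumption): y = 2x - 1 plus a little noise.
x = torch.linspace(-1.0, 1.0, 64).unsqueeze(1)
y = 2.0 * x - 1.0 + 0.05 * torch.randn_like(x)

model = nn.Linear(1, 1)                      # one layer: matrix multiply plus bias
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

losses = []
for step in range(200):
    optimizer.zero_grad()                    # clear gradients from the previous step
    loss = loss_fn(model(x), y)              # forward pass, then the loss
    loss.backward()                          # autograd fills .grad on every parameter
    optimizer.step()                         # gradient-descent update
    losses.append(loss.item())

slope = model.weight.item()
intercept = model.bias.item()
print(f"loss {losses[0]:.3f} -> {losses[-1]:.3f}, slope {slope:.2f}, intercept {intercept:.2f}")
```

Every later rung reuses this same four-line inner loop; only the model and the loss change.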

your stack

Sequence models for time (the v1 rung) you already know from your closed-loop work — Mamba / S4 state-space recurrence. And diffusion / score models are an advanced head, not a prerequisite — they sit at the top of the ladder, covered on the companion page.

RF read

Treat this like learning a new instrument, not a new physics. The quantities are familiar (power, interference, distributions); the only new "knob" is how information moves over a graph instead of over a grid or a time series.

02 · the reframe that fixes everything

Your data is a graph, not an image

The tempting first instinct — "it's a coverage map, so use a U-Net / image model" — is wrong for this output. A raster CNN assumes a dense, regular grid of pixels. But your KPIs live on irregular, typed entities: cells and UEs, at arbitrary positions, with meaningful relationships — which cell serves which UE, which cells interfere, which are adjacent.

That structure is a graph: cells and UEs are typed nodes; serving / interference / adjacency are typed edges. Modelling it as a graph keeps the relationships first-class instead of smearing them onto pixels. Toggle the bench to see the same scene as a raster (lossy) and as a graph (faithful).

correction baked in

U-Net / image-to-image only applies if the output were a dense coverage raster. It is not — it's per-cell / per-UE values. So the backbone is a heterogeneous GNN, not a CNN. (RadioUNet-style raster models are a different problem.)
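The typed-node, typed-edge structure described above can be sketched in plain Python before any ML library is involved. Everything concrete here is a hypothetical stand-in: the positions, the nearest-cell serving rule (in place of strongest-RSRP attachment), and the 600 m interference radius are illustrative assumptions, not values from your simulator.

```python
import math

# Hypothetical toy scene: positions in metres.
cells = {"c0": (0.0, 0.0), "c1": (500.0, 0.0), "c2": (0.0, 500.0)}
ues = {"u0": (50.0, 20.0), "u1": (450.0, 30.0), "u2": (240.0, 260.0)}

def dist(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Typed edges keyed by (source type, relation, destination type).
edges = {
    ("cell", "serves", "ue"): [],
    ("cell", "interferes", "cell"): [],
}

# Serving edge: each UE attaches to its nearest cell
# (assumed stand-in for strongest-RSRP attachment).
for ue, pos in ues.items():
    serving = min(cells, key=lambda c: dist(cells[c], pos))
    edges[("cell", "serves", "ue")].append((serving, ue))

# Interference edge: directed, between distinct cells within an assumed radius.
INTERFERENCE_RADIUS_M = 600.0
for a in cells:
    for b in cells:
        if a != b and dist(cells[a], cells[b]) <= INTERFERENCE_RADIUS_M:
            edges[("cell", "interferes", "cell")].append((a, b))

for edge_type, pairs in edges.items():
    print(edge_type, pairs)
```

The relationships stay explicit and irregular here, which is exactly what a raster would smear away.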

Bench: raster vs. graph, same scene. Legend: cell node · UE node · serving edge · interference edge.

03 · how a GNN actually computes

Message passing — a node learns from its neighbours

A graph neural network does one simple thing, repeated. Each round, every node gathers messages from its neighbours, aggregates them (sum / mean / attention), and updates its own feature vector. After K rounds, a node's embedding has absorbed information from its K-hop neighbourhood. Heterogeneous means the rules are type-aware: a "serving" edge and an "interference" edge use different learned weights (this is HGT / R-GCN; in code, HGTConv over a HeteroData object).
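The gather, aggregate, update cycle described above can be sketched in NumPy. This is a simplified illustration with mean aggregation and random (unlearned) weights, not HGT itself; the toy graph, the feature width, and the tanh update are assumptions. It also checks the K-hop claim: cell 2 is three hops from UE 0 (cell 2 → cell 1 → cell 0 → UE 0), so perturbing cell 2 should leave UE 0 untouched until the third round.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy graph: 3 cells, 2 UEs, 4-dimensional features.
h0 = {"cell": rng.normal(size=(3, 4)), "ue": rng.normal(size=(2, 4))}

# Typed edges as (source index, destination index) pairs.
edges = {
    ("cell", "serves", "ue"): [(0, 0), (1, 1)],
    ("cell", "interferes", "cell"): [(0, 1), (1, 0), (1, 2), (2, 1)],
}

# One weight matrix per edge type (random here, learned in a real model),
# plus one self-update matrix per node type.
W_edge = {etype: rng.normal(scale=0.3, size=(4, 4)) for etype in edges}
W_self = {ntype: rng.normal(scale=0.3, size=(4, 4)) for ntype in h0}

def message_passing_round(h):
    """One round: gather typed messages, mean-aggregate, update each node."""
    agg = {ntype: np.zeros_like(x) for ntype, x in h.items()}
    count = {ntype: np.zeros((x.shape[0], 1)) for ntype, x in h.items()}
    for etype, pairs in edges.items():
        src_type, _, dst_type = etype
        for s, d in pairs:
            agg[dst_type][d] += h[src_type][s] @ W_edge[etype]   # type-aware message
            count[dst_type][d] += 1
    new_h = {}
    for ntype, x in h.items():
        mean_msg = agg[ntype] / np.maximum(count[ntype], 1)       # avoid divide-by-zero
        new_h[ntype] = np.tanh(x @ W_self[ntype] + mean_msg)
    return new_h

def rollout(h, rounds):
    """Apply `rounds` message-passing rounds in sequence."""
    for _ in range(rounds):
        h = message_passing_round(h)
    return h

# Perturb cell 2 and see when UE 0 notices.
h_pert = {k: v.copy() for k, v in h0.items()}
h_pert["cell"][2] += 1.0

unchanged_after_2 = np.allclose(rollout(h0, 2)["ue"][0], rollout(h_pert, 2)["ue"][0])
changed_after_3 = not np.allclose(rollout(h0, 3)["ue"][0], rollout(h_pert, 3)["ue"][0])
print("UE 0 unchanged after 2 rounds:", unchanged_after_2)
print("UE 0 changed after 3 rounds:", changed_after_3)
```

In PyG the loops disappear into the convolution layer, but the computation per round has this shape.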

Bench: message passing, hop 0 to 3. Start from the highlighted cell; each hop, its influence spreads one ring outward.

RF read

This is interference aggregation made learnable. A cell's SINR depends on its neighbours' transmit state; message passing is exactly that "gather from neighbours, update my state" computation — except the aggregation rule is learned from data rather than fixed by a path-loss formula.

your stack

Take the WirelessNet (HMPGNN) design as a reference — typed nodes/edges, per-edge-type weight sharing, trained on system-level sims, generalising to unseen deployments. No public code exists, so you reimplement it in PyG. The one change: swap its deterministic point head for a quantile head (next section).

04 · stochastic output, done honestly

Predict a distribution, not a number

Per-TTI KPIs scatter — run the simulator twice on the same scene and SINR lands differently each time (fast fading, scheduling churn). A model trained with plain MSE collapses all that scatter to its mean and reports a single confident line. That line is sharp and wrong: it throws away the spread that is the whole point of a stochastic twin.

Instead, emit quantiles — say p10, p50, p90 — trained with the pinball (quantile) loss. The band should cover the truth at its stated rate: ~80% of real samples inside the p10–p90 band. Drag the bench: the point model ignores the cloud; the quantile band tracks it.
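To make the pinball loss concrete, here is a minimal NumPy sketch. It does not train a network: it scans candidate values and shows that the value minimising the pinball loss at level tau lands near the tau-quantile of the samples, which is why a head trained with it emits quantiles. The Gaussian SINR scatter (mean 10 dB, spread 4 dB) is a synthetic assumption.

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Mean pinball (quantile) loss for quantile level tau in (0, 1)."""
    err = np.asarray(y) - q_pred
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return float(np.mean(np.maximum(tau * err, (tau - 1.0) * err)))

# Synthetic per-TTI SINR samples in dB for one UE (assumed Gaussian scatter).
rng = np.random.default_rng(0)
sinr_db = rng.normal(loc=10.0, scale=4.0, size=20_000)

# Scan candidate values and keep the minimiser for each quantile level.
candidates = np.linspace(0.0, 20.0, 401)
best = {}
for tau in (0.1, 0.5, 0.9):
    losses = [pinball_loss(sinr_db, c, tau) for c in candidates]
    best[tau] = float(candidates[int(np.argmin(losses))])

# Fraction of samples inside the recovered p10-p90 band.
inside = float(np.mean((sinr_db >= best[0.1]) & (sinr_db <= best[0.9])))
print(best, inside)
```

A quantile head does the same thing per node, with the network's output taking the place of the scanned candidate.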

the mantra

"Build the sampler, grade the distribution." The simulator is itself a sampler; your generator learns to sample like it; you score the two distributions, not individual values. So "give me samples" and "give me uncertainty" are the same ask — never a real choice.

Bench: point estimate vs. quantile band. Target coverage 80%, shown against the empirical coverage.

RF read

You would never report a single fade value and call it the channel — you report the distribution of fades. Same here: the deliverable is the band and its coverage, not a point. A point estimate is a link budget with the margins deleted.

v0 · step 1 · the thing to build first

v0 — heterogeneous GNN with quantile heads

This is your first concrete deliverable, and you build it before anything else — no latent, no time. It gives correct marginals (each node's own distribution) and, crucially, something you can trust and measure. Make this solid; every later rung wraps around it.

The recipe, end to end

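# v0: PyTorch Geometric skeleton (the first thing to build)
import torch
import torch.nn.functional as F
from torch import nn
from torch_geometric.data import HeteroData
from torch_geometric.nn import HGTConv

# cell_feats, ue_feats, serve_idx, intf_idx and the y_* targets come from your data
g = HeteroData()
g['cell'].x = cell_feats   # RT + sim features per cell
g['ue'].x   = ue_feats     # per-UE features
g['cell','serves','ue'].edge_index   = serve_idx       # [2, E]: (cell idx, UE idx)
g['cell','interferes','cell'].edge_index = intf_idx    # [2, E]: (cell idx, cell idx)

QS = torch.tensor([0.1, 0.5, 0.9])   # p10 / p50 / p90
def pinball(q_pred, y):              # q_pred: [N, 3], y: [N]
    err = y.unsqueeze(-1) - q_pred
    return torch.maximum(QS * err, (QS - 1) * err).mean()

# 2–3 type-aware message-passing layers → per-node embeddings h_i
convs = nn.ModuleList([HGTConv(-1, 64, g.metadata(), heads=2) for _ in range(2)])
h = g.x_dict
for conv in convs:
    h = {k: F.relu(v) for k, v in conv(h, g.edge_index_dict).items()}

# type-specific heads, NOT one uniform head; each KPI gets its own weights
rsrp_head, sinr_head = nn.Linear(64, 3), nn.Linear(64, 3)   # continuous → p10/p50/p90
ho_head    = nn.Linear(64, 1)    # handover = discrete event → one logit per UE
sched_head = nn.Linear(64, 3)    # per-cell scheduler KPI

rsrp_q, sinr_q = rsrp_head(h['ue']), sinr_head(h['ue'])
ho    = ho_head(h['ue']).squeeze(-1)
sched = sched_head(h['cell'])

# y_ho must be a float tensor of 0/1 labels
loss = (pinball(rsrp_q, y_rsrp) + pinball(sinr_q, y_sinr)
        + F.binary_cross_entropy_with_logits(ho, y_ho))   # + one term per remaining head
# eval on HELD-OUT DEPLOYMENTS, not just held-out samples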

three things people get wrong here

(1) Heads must be type-appropriate — RSRP/SINR/throughput are continuous regression, but handover is a discrete event (classification / point-process), not a number. (2) Use quantile heads from day one — do not inherit a deterministic point head. (3) Evaluate on held-out deployments (unseen cities/layouts), not just held-out samples from seen ones — that's the only test of generalisation.
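Point (3) above is easy to get wrong in the data loader, so here is a minimal sketch of a split that holds out whole deployments. The sample records, the "deployment" key, and the 25% holdout fraction are hypothetical; adapt them to however your simulator labels its runs.

```python
import random

def split_by_deployment(samples, holdout_fraction=0.25, seed=0):
    """Split so that no deployment appears on both sides of the split."""
    deployments = sorted({s["deployment"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(deployments)
    n_holdout = max(1, round(holdout_fraction * len(deployments)))
    held_out = set(deployments[:n_holdout])
    train = [s for s in samples if s["deployment"] not in held_out]
    test = [s for s in samples if s["deployment"] in held_out]
    return train, test

# Hypothetical corpus: 4 deployments, 5 Monte-Carlo drops each.
samples = [{"deployment": f"city_{c}", "drop": d} for c in "ABCD" for d in range(5)]
train, test = split_by_deployment(samples)

train_deployments = {s["deployment"] for s in train}
test_deployments = {s["deployment"] for s in test}
print(sorted(train_deployments), sorted(test_deployments))
```

A plain random split over samples would put drops from the same city on both sides and overstate generalisation.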

your stack

Let your data's schema drive the graph — node types, edge types, and features come from your simulator's reports and your RT corpus, not from any paper. Useful reference collection: jwwthu/GNN-Communication-Networks. Track runs in Weights & Biases from the start.

v0.5 · step 2 · cross-node correlation

v0.5 — add a shared latent (this is a CVAE)

v0 gets each node's marginal right, but treats nodes as independent given the graph. Reality couples them: a traffic surge or a shared fading realisation moves many nodes together. To capture that joint structure, add a shared latent z sampled once per scene and fed to every head — one draw, one coherent "shock" across the whole map.

Training a latent-variable model needs two networks: a prior p(z | graph) and a train-only posterior q(z | graph, KPIs), optimised with the ELBO = reconstruction − KL. Toggle the bench: without z, nodes twinkle independently; inject a latent shock and they swing in unison.
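The two ELBO terms can be written out in a few lines of NumPy. This is a sketch under stated assumptions: the prior and posterior parameters are hard-coded stand-ins for what the two networks would output, the decoder is a hypothetical linear map with unit-variance Gaussian noise, and nothing here is trained. It only shows how reconstruction, KL, and reparameterization fit together.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, summed over dims."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return float(0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0))

def reparameterize(mu, logvar, rng):
    """z = mu + sigma * eps: the randomness lives in eps, so gradients can reach mu and logvar."""
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def gaussian_log_likelihood(y, y_hat, sigma=1.0):
    """Log-likelihood of y under N(y_hat, sigma^2), summed over nodes."""
    return float(-0.5 * np.sum(((y - y_hat) / sigma) ** 2 + np.log(2 * np.pi * sigma ** 2)))

rng = np.random.default_rng(0)
LATENT_DIM, NUM_NODES = 2, 5

# Stand-ins for network outputs (assumptions): the prior sees only the graph,
# the posterior sees the graph plus the observed KPIs.
mu_p, logvar_p = np.zeros(LATENT_DIM), np.zeros(LATENT_DIM)
mu_q, logvar_q = np.array([0.8, -0.3]), np.log(np.array([0.2, 0.5]))

# Hypothetical linear decoder: every node reads the same z through its own loading.
loadings = rng.normal(size=(NUM_NODES, LATENT_DIM))
y_obs = loadings @ np.array([1.0, -0.5]) + 0.1 * rng.standard_normal(NUM_NODES)

z = reparameterize(mu_q, logvar_q, rng)          # training: z drawn from the posterior
reconstruction = gaussian_log_likelihood(y_obs, loadings @ z)
kl = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
elbo = reconstruction - kl                        # maximise this (minimise -elbo)

z_gen = reparameterize(mu_p, logvar_p, rng)      # generation: z drawn from the prior only
print(f"reconstruction {reconstruction:.3f}, KL {kl:.3f}, ELBO {elbo:.3f}")
```

The KL term is what ties the train-only posterior to the prior, so that sampling from the prior alone at generation time still produces a meaningful z.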

your original sketch, corrected

Your Graph → GNN → z → decode sketch was essentially v0.5 — but missing the posterior and the KL term. A latent model cannot be trained from the generation path alone; you need q(z|·) and the ELBO to learn what z should mean.

Bench: independent nodes vs. shared shock (+ latent z). Each tile is a node's KPI; watch whether they move together or alone.

RF read

z is the shared realisation — the common cause that correlates a whole region: one traffic event, one weather front, one slow-fading draw. v0 can match the per-cell histograms but will never reproduce the joint pattern that z creates. Energy score (not per-node coverage) is what catches the difference.
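The claim that the energy score catches what per-node coverage misses can be checked on synthetic data. In this sketch the "truth" has a shared shock across 16 nodes; one forecaster shares that shock, the other draws every node independently with the same marginal spread. The dimensions, spreads, and ensemble sizes are assumptions, and the sample-based energy score used here is the standard ensemble estimator, E‖X − y‖ − ½·E‖X − X′‖ (lower is better).

```python
import numpy as np

def energy_score(ensemble, obs):
    """Sample-based energy score. ensemble: [m, d], obs: [d]. Lower is better."""
    to_obs = np.linalg.norm(ensemble - obs, axis=1).mean()
    pairwise = np.linalg.norm(ensemble[:, None, :] - ensemble[None, :, :], axis=2).mean()
    return float(to_obs - 0.5 * pairwise)

rng = np.random.default_rng(0)
D, M, N_OBS = 16, 100, 300            # nodes, ensemble size, number of observed scenes
SHOCK_STD, NOISE_STD = 1.0, 0.3       # shared shock vs. per-node scatter (assumed)

def draw_joint(n):
    """One shared shock per scene plus small independent per-node noise."""
    shock = SHOCK_STD * rng.standard_normal((n, 1))
    return shock + NOISE_STD * rng.standard_normal((n, D))

def draw_independent(n):
    """Same per-node marginal spread, but no coupling between nodes."""
    marginal_std = np.sqrt(SHOCK_STD ** 2 + NOISE_STD ** 2)
    return marginal_std * rng.standard_normal((n, D))

truth = draw_joint(N_OBS)             # stand-in for simulator scenes

es_joint = float(np.mean([energy_score(draw_joint(M), y) for y in truth]))
es_indep = float(np.mean([energy_score(draw_independent(M), y) for y in truth]))

def band_coverage(draw):
    """Per-node p10-p90 coverage of the truth, using bands from a large forecast sample."""
    lo, hi = np.quantile(draw(20_000), [0.1, 0.9], axis=0)
    return float(np.mean((truth >= lo) & (truth <= hi)))

cov_joint, cov_indep = band_coverage(draw_joint), band_coverage(draw_independent)
print(f"coverage  joint {cov_joint:.3f}  independent {cov_indep:.3f}")
print(f"energy    joint {es_joint:.3f}  independent {es_indep:.3f}")
```

Both forecasters produce nearly identical per-node coverage because their marginals match; only the joint score separates them.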

v1 · step 3 · per-TTI evolution

v1 — add time (the per-TTI rung)

Now unroll over time. Wrap v0.5 in a transition p(z_t | z_{t-1}, graph_t) stepped TTI by TTI, while the graph itself evolves — UEs move, attach, and detach. The latent carries state forward so the trajectory stays coherent instead of flickering frame to frame.
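As a minimal sketch of what "the latent carries state forward" means, the transition p(z_t | z_{t-1}, graph_t) can be illustrated with a Gaussian AR(1) step. This is an assumed toy form, not a recommendation for the real transition model: `rho` is a hypothetical persistence parameter and `graph_summary` is a hypothetical stand-in for whatever the GNN extracts from graph_t. The comparison against independent per-TTI draws shows the difference between a coherent trajectory and frame-to-frame flicker.

```python
import numpy as np

def transition(z_prev, graph_summary, rng, rho=0.95):
    """Sample z_t ~ p(z_t | z_{t-1}, graph_t) as a Gaussian AR(1) step.

    The latent relaxes toward graph_summary and keeps unit stationary variance.
    """
    mean = rho * z_prev + (1.0 - rho) * graph_summary
    return mean + np.sqrt(1.0 - rho ** 2) * rng.standard_normal(z_prev.shape)

rng = np.random.default_rng(0)
T, LATENT_DIM = 2000, 4
graph_summary = np.zeros(LATENT_DIM)          # held fixed here; it would change as UEs move

# Roll the latent forward TTI by TTI.
z = np.zeros((T, LATENT_DIM))
for t in range(1, T):
    z[t] = transition(z[t - 1], graph_summary, rng)

# Baseline with no state: an independent draw every TTI.
iid = rng.standard_normal((T, LATENT_DIM))

def lag1_autocorr(x):
    """Lag-1 autocorrelation, averaged over latent dimensions."""
    x = x - x.mean(axis=0)
    return float(np.mean(np.sum(x[1:] * x[:-1], axis=0) / np.sum(x * x, axis=0)))

ac_state, ac_iid = lag1_autocorr(z), lag1_autocorr(iid)
print(f"lag-1 autocorrelation: with state {ac_state:.2f}, independent {ac_iid:.2f}")
```

A learned transition replaces the fixed `rho` with a network, but the rollout loop keeps this structure.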

Bench: rolling trajectory. The graph evolves and the latent carries state: UEs drift and re-associate; each node's band is re-sampled per TTI but stays temporally smooth.

the design fork that matters most

Bigger than which GNN flavour you pick: literal per-TTI sample paths vs. per-TTI-resolution distributions (a coarser latent rate). Most RAN use cases need distributions, not literal sub-ms paths — sequence length, compute, and error accumulation are brutal at true TTI granularity. Your two-timescale design (slow super-step + fast recurrence) is exactly the right shape here.

RF read

Think slow envelope over fast fading: a diurnal / mobility trend riding on per-TTI fluctuation. v1 has to respect both rates, or you get maps that look right in a snapshot but jitter incoherently over time — the classic failure mode of deterministic per-frame models.

+ · the advanced generative head

Where diffusion fits — later, not first

Once v0→v1 is trustworthy, you can upgrade the head that turns embeddings into a distribution. A simple quantile head is the floor; a diffusion / score-based head is the expressive ceiling: it can model rich, multimodal, spatially-correlated fields and produces ensembles directly by repeated sampling, though their calibration still has to be checked like any other head's. It is an upgrade to a working backbone, not the starting point. Build the graph + latent + time first; bolt on the generative head when the marginals and the joint structure already check out.

↗ Companion module: the full diffusion + surrogate-foundation deep-dive — forward/reverse process, conditioning, ensembles, and pretrain-then-calibrate. Open “Spatiotemporal diffusion & the surrogate foundation model” →

your stack

In your framing the diffusion head is the spatial emission / innovation generator feeding the physics-structured head Φ, and the SSM remains the fast per-TTI controller. The batch vs non-batch axis maps to unconditional vs guided sampling. None of that changes the first three rungs.

EVAL · how you know it works

You grade the distribution, not the value

Stochastic output rules out bit-for-bit parity: there is no single correct sample to match. So you test distributionally. The first question is calibration / coverage: does the truth land inside the stated band at the stated rate? Plot nominal vs. empirical coverage; a calibrated model sits on the diagonal. Then add sharpness (narrow bands are better, but only if still covered) and proper scoring rules: CRPS for each node's marginal, the energy score for the joint across nodes, plus spread–skill for ensembles.

The bar is calibrated-and-sharp. Sharp-but-miscalibrated is the seductive trap — confident, narrow, and wrong. And remember the floor: the oracle carries a ±3 dB error budget, so no derived metric can claim more precision than the teacher has.
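The coverage check and CRPS can be sketched on synthetic data. Here the "truth" is standard normal; one forecaster matches it and one is sharp but over-confident (spread 0.3). Those distributions and the sample sizes are assumptions for illustration. The CRPS uses the standard sample-based form E|X − y| − ½·E|X − X′|, where lower is better.

```python
import numpy as np

def central_band(samples, nominal):
    """Central interval of the forecast samples at the given nominal coverage."""
    lo, hi = np.quantile(samples, [(1.0 - nominal) / 2.0, (1.0 + nominal) / 2.0])
    return lo, hi

def empirical_coverage(y, lo, hi):
    """Fraction of observations that land inside [lo, hi]."""
    return float(np.mean((y >= lo) & (y <= hi)))

def crps_from_samples(samples, y):
    """Sample-based CRPS for each observation in y."""
    to_obs = np.abs(samples[:, None] - y[None, :]).mean(axis=0)
    spread = np.abs(samples[:, None] - samples[None, :]).mean()
    return to_obs - 0.5 * spread

rng = np.random.default_rng(0)
truth = rng.standard_normal(2000)                       # stand-in for simulator samples
forecasts = {
    "calibrated": rng.standard_normal(2000),            # same spread as the truth
    "overconfident": 0.3 * rng.standard_normal(2000),   # sharp, but bands far too narrow
}

NOMINAL_LEVELS = (0.5, 0.8, 0.9)
reliability = {
    name: {p: empirical_coverage(truth, *central_band(s, p)) for p in NOMINAL_LEVELS}
    for name, s in forecasts.items()
}
mean_crps = {name: float(crps_from_samples(s, truth).mean()) for name, s in forecasts.items()}

for name in forecasts:
    print(name, reliability[name], f"mean CRPS {mean_crps[name]:.3f}")
```

The calibrated forecaster sits near the diagonal of the reliability plot; the over-confident one falls well below it and also pays for it in CRPS, despite being sharper.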

Bench: reliability, empirical vs. nominal coverage. On the diagonal = honest; below = bands too narrow (over-confident).

your stack

Your existing parity rig carries straight over: bitwise parity for behaviour-neutral refactors, TOST equivalence for behavioural changes, common random numbers for paired stochastic comparisons. For the generative side, add CRPS + spread–skill. Same discipline — applied to distributions.

PATH · the whole ladder in one view

From here to a working twin

Background

PyTorch training loop · graphs & message passing · quantile/pinball. Defer latent-variable models & sequence models until you need them.

v0 · Build first · hetero-GNN + quantile heads

HeteroData → HGTConv ×2–3 → type-specific quantile heads → pinball loss → held-out-deployment eval. Correct marginals; trustworthy.

v0.5 · Add shared latent · CVAE

prior p(z|graph) + posterior q(z|graph,KPIs), trained by the ELBO. Captures cross-node correlation v0 can't.

v1 · Add time · per-TTI recurrence

transition p(z_t | z_{t-1}, graph_t), evolving graph. Choose distributions over literal per-TTI paths.

Advanced head & conditioning

diffusion generative head; config-injection / guidance for non-batch mode. Upgrade a working backbone — never the start.

What to do this week

Don't try to see the whole system at once. Get a PyTorch training loop running, then build the v0 HeteroData skeleton on a tiny slice of your simulator's reports — even two node types and one edge type. Make a quantile head emit p10/p50/p90 for one KPI and check coverage on held-out deployments. That single loop, working and measured, is the foundation the entire ladder stands on.

offered next

If useful, the natural follow-ups are a concrete PyG HeteroData v0 code skeleton (real node/edge stubs + HGTConv + pinball head) and a one-page PhD-proposal spine aligning the build to the distributional-fidelity thesis.

REF · decoder ring · RF ⇄ what to learn

The same idea in two vocabularies

RF / wireless you know | GenAI-twin concept
cells, UEs, serving/interference | typed nodes + typed edges (graph)
interference aggregation | message passing / neighbour update
Monte-Carlo fading distribution | predicted KPI distribution
percentiles / margins | quantile heads + pinball loss
shared fading / traffic event | latent z (shared shock, CVAE)
slow envelope over fast fading | temporal transition p(z_t|z_{t-1})
path-loss calibration | fine-tune / LoRA adapter
link-budget margin check | coverage / calibration test
trusted reference ±3 dB | oracle (not "ground truth")
behavioral PA / channel model | surrogate of the simulator
