Build a generative-AI digital twin, one rung at a time
You're not building "an AI." You're building a fast generative stand-in for an expensive stochastic RAN simulator — a model that, given a city's geometry and config, emits the same per-cell / per-UE / per-TTI KPI distributions the simulator would, only in milliseconds. This page is the ladder: the background you actually need, the one thing to build first, and the three rungs after it. Everything is mapped to the RF you already know.
What the thing actually is — in one breath
Your in-house simulator is a traditional stochastic system-level simulator — Monte-Carlo fading, scheduler logic, traffic processes. No ML inside. It is slow, but it is your teacher and oracle. The job is to train a generative model that learns to sample like the simulator: feed it the same scene, and it should produce KPI samples drawn from the same distribution — fast enough to run a city or a long sweep.
Three properties decide every design choice downstream. The output is graph-shaped (per cell, per UE — typed nodes, not pixels). It is stochastic (per-TTI KPIs scatter — so you predict a distribution, never a single number). And it is temporal (it evolves TTI by TTI). Hold those three and the rest of this page is just "add one property at a time."
The teacher is a trusted reference with a known error budget (≈ ±3 dB), not "ground truth." You grade the surrogate on fidelity to the simulator's own distribution, not on bit-for-bit equality — stochastic output has no single right answer to match.
Build the batch-mode, open-loop generator first: scene in → KPI distribution out, config fixed. The non-batch / config-injection variant (action-conditioned, closed-loop) is the same model with a conditioning input added later; don't start there.
What you already have vs. what's genuinely new
You are not starting from zero — most of the math is RF math wearing different notation. The honest list of prerequisites is short, and only two items are truly new to a wireless engineer. Learn them in this order; each rung of the build needs only the boxes above it.
Gaussians, sampling, expectation, Monte-Carlo. You run fading realisations for a living — this is distributional thinking. Reuse it directly.
Vectors, matrices, convolution, transforms. Embeddings are just feature vectors; a layer is a matrix multiply plus a nonlinearity.
PyTorch tensors, a forward pass, a loss, loss.backward(), an optimizer step. Conceptually: gradient descent on an error surface. One afternoon to get fluent.
Nodes, edges, neighbourhoods; how a GNN updates a node from its neighbours. This is the genuinely new mechanic. PyTorch Geometric + HeteroData. §03 teaches it visually.
Quantile regression + pinball loss → calibrated bands. You already think in percentiles; this just makes the network emit them. §04.
VAE, the ELBO (reconstruction − KL), reparameterization. Needed only when you add the shared-shock latent. Defer until §06.
Sequence models for time (the v1 rung) you already know from your closed-loop work — Mamba / S4 state-space recurrence. And diffusion / score models are an advanced head, not a prerequisite — they sit at the top of the ladder, covered on the companion page.
Treat this like learning a new instrument, not new physics. The quantities are familiar (power, interference, distributions); the only new "knob" is how information moves over a graph instead of over a grid or a time series.
Your data is a graph, not an image
The tempting first instinct — "it's a coverage map, so use a U-Net / image model" — is wrong for this output. A raster CNN assumes a dense, regular grid of pixels. But your KPIs live on irregular, typed entities: cells and UEs, at arbitrary positions, with meaningful relationships — which cell serves which UE, which cells interfere, which are adjacent.
That structure is a graph: cells and UEs are typed nodes; serving / interference / adjacency are typed edges. Modelling it as a graph keeps the relationships first-class instead of smearing them onto pixels. Toggle the bench to see the same scene as a raster (lossy) and as a graph (faithful).
A U-Net / image-to-image model would apply only if the output were a dense coverage raster. It is not: the output is per-cell / per-UE values. So the backbone is a heterogeneous GNN, not a CNN. (RadioUNet-style raster models solve a different problem.)
Message passing — a node learns from its neighbours
A graph neural network does one simple thing, repeated. Each round, every node gathers messages from its neighbours, aggregates them (sum / mean / attention), and updates its own feature vector. After K rounds, a node's embedding has absorbed information from its K-hop neighbourhood. Heterogeneous means the rules are type-aware: a "serving" edge and an "interference" edge use different learned weights (this is HGT / R-GCN; in code, HGTConv over a HeteroData object).
This is interference aggregation made learnable. A cell's SINR depends on its neighbours' transmit state; message passing is exactly that "gather from neighbours, update my state" computation — except the aggregation rule is learned from data rather than fixed by a path-loss formula.
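The "gather, aggregate, update" round described above can be sketched in plain Python. This is an illustration only, not HGTConv: features are scalars, the per-edge-type weights are fixed by hand rather than learned, and the node names, edge types, and values are made up for the example.

```python
from collections import defaultdict

def message_passing_round(feats, edges, weights):
    """One round of type-aware mean aggregation over scalar node features.

    feats:   {node_id: float}
    edges:   list of (src, edge_type, dst) triples
    weights: {edge_type: float}, one weight per edge type
             (fixed here for illustration; learned in a real GNN)
    """
    inbox = defaultdict(list)
    for src, etype, dst in edges:
        # The message depends on the edge type, not just on the sender.
        inbox[dst].append(weights[etype] * feats[src])
    updated = {}
    for node, h in feats.items():
        msgs = inbox.get(node, [])
        agg = sum(msgs) / len(msgs) if msgs else 0.0
        updated[node] = h + agg  # residual update; a real layer adds a nonlinearity
    return updated

# Hypothetical toy scene: cell_b interferes with cell_a, which serves ue_1.
feats = {"cell_a": 1.0, "cell_b": 2.0, "ue_1": 0.0}
edges = [
    ("cell_a", "serves", "ue_1"),
    ("cell_b", "interferes", "cell_a"),
]
weights = {"serves": 1.0, "interferes": -0.5}

h1 = message_passing_round(feats, edges, weights)  # ue_1 has seen cell_a only
h2 = message_passing_round(h1, edges, weights)     # ue_1 now also reflects cell_b
```

After one round ue_1 depends only on its serving cell; after two rounds a change in cell_b's feature reaches ue_1 through cell_a, which is the K-hop behaviour in miniature.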
Take the WirelessNet (HMPGNN) design as a reference — typed nodes/edges, per-edge-type weight sharing, trained on system-level sims, generalising to unseen deployments. No public code exists, so you reimplement it in PyG. The one change: swap its deterministic point head for a quantile head (next section).
Predict a distribution, not a number
Per-TTI KPIs scatter — run the simulator twice on the same scene and SINR lands differently each time (fast fading, scheduling churn). A model trained with plain MSE collapses all that scatter to its mean and reports a single confident line. That line is sharp and wrong: it throws away the spread that is the whole point of a stochastic twin.
Instead, emit quantiles — say p10, p50, p90 — trained with the pinball (quantile) loss. The band should cover the truth at its stated rate: ~80% of real samples inside the p10–p90 band. Drag the bench: the point model ignores the cloud; the quantile band tracks it.
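To see why the pinball loss produces quantiles, here is a minimal framework-free sketch. It assumes a single node with made-up SINR samples and searches a grid of constant predictions instead of training a network; `best_constant` and the sample values are illustrative, not part of any library.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss of predictions y_pred at quantile level q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction is charged q per unit, over-prediction (1 - q) per unit.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)

def best_constant(samples, q, candidates):
    """Candidate constant prediction that minimises the pinball loss at level q."""
    return min(candidates, key=lambda c: pinball_loss(samples, [c] * len(samples), q))

# Toy SINR samples in dB for one node (made-up values for illustration).
sinr = [float(v) for v in range(1, 101)]
grid = [float(v) for v in range(0, 102)]

p10 = best_constant(sinr, 0.10, grid)
p50 = best_constant(sinr, 0.50, grid)
p90 = best_constant(sinr, 0.90, grid)

# Fraction of samples that fall inside the p10..p90 band.
inside = sum(p10 <= y <= p90 for y in sinr) / len(sinr)
```

The asymmetric penalty is the whole trick: at q = 0.9 under-shooting costs nine times more than over-shooting, so the minimiser settles where roughly 90% of the samples lie below it. A network trained with the same loss learns that value as a function of the scene.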
"Build the sampler, grade the distribution." The simulator is itself a sampler; your generator learns to sample like it; you score the two distributions, not individual values. So "give me samples" and "give me uncertainty" are the same ask — never a real choice.
You would never report a single fade value and call it the channel — you report the distribution of fades. Same here: the deliverable is the band and its coverage, not a point. A point estimate is a link budget with the margins deleted.
v0 — heterogeneous GNN with quantile heads
This is your first concrete deliverable, and you build it before anything else — no latent, no time. It gives correct marginals (each node's own distribution) and, crucially, something you can trust and measure. Make this solid; every later rung wraps around it.
The recipe, end to end
    # v0: PyTorch Geometric skeleton (the first thing to build)
    from torch_geometric.data import HeteroData
    from torch_geometric.nn import HGTConv

    g = HeteroData()
    g['cell'].x = cell_feats    # RT + sim features per cell
    g['ue'].x = ue_feats        # per-UE features
    g['cell', 'serves', 'ue'].edge_index = serve_idx
    g['cell', 'interferes', 'cell'].edge_index = intf_idx

    # 2–3 type-aware message-passing layers → per-node embeddings h_i
    h = HGTConv(...)(g.x_dict, g.edge_index_dict)    # ×2–3

    # type-specific heads, NOT one uniform head
    rsrp_q = QuantileHead(h['ue'])    # continuous → p10/p50/p90
    sinr_q = QuantileHead(h['ue'])    # continuous
    ho = EventHead(h['ue'])           # handover = discrete event!
    sched = QuantileHead(h['cell'])   # per-cell scheduler KPI

    loss = pinball(rsrp_q, y_rsrp) + pinball(sinr_q, y_sinr) + bce(ho, y_ho) + ...

    # eval on HELD-OUT DEPLOYMENTS, not just held-out samples
(1) Heads must be type-appropriate — RSRP/SINR/throughput are continuous regression, but handover is a discrete event (classification / point-process), not a number. (2) Use quantile heads from day one — do not inherit a deterministic point head. (3) Evaluate on held-out deployments (unseen cities/layouts), not just held-out samples from seen ones — that's the only test of generalisation.
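Point (3) comes down to how the split is made. The sketch below assumes each simulator run is a record tagged with the deployment it came from; the `deployment` key and the `city_*` names are hypothetical stand-ins for whatever your schema uses.

```python
import random

def split_by_deployment(samples, holdout_frac=0.2, seed=0):
    """Split so that no deployment appears in both train and test.

    samples: list of dicts with a 'deployment' key (hypothetical schema).
    """
    deployments = sorted({s["deployment"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(deployments)
    n_test = max(1, round(holdout_frac * len(deployments)))
    test_ids = set(deployments[:n_test])
    train = [s for s in samples if s["deployment"] not in test_ids]
    test = [s for s in samples if s["deployment"] in test_ids]
    return train, test

# 50 made-up runs spread over 5 deployments.
samples = [{"deployment": f"city_{i % 5}", "run": i} for i in range(50)]
train, test = split_by_deployment(samples)
```

A random split over runs would put every city in both sets and measure interpolation within seen layouts; splitting on the deployment ID is what makes the test score a statement about unseen ones.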
Let your data's schema drive the graph — node types, edge types, and features come from your simulator's reports and your RT corpus, not from any paper. Useful reference collection: jwwthu/GNN-Communication-Networks. Track runs in Weights & Biases from the start.
v0.5 — add a shared latent (this is a CVAE)
v0 gets each node's marginal right, but treats nodes as independent given the graph. Reality couples them: a traffic surge or a shared fading realisation moves many nodes together. To capture that joint structure, add a shared latent z sampled once per scene and fed to every head — one draw, one coherent "shock" across the whole map.
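A small simulation makes the difference concrete. It is a toy with assumed numbers: two nodes, unit-variance Gaussian shocks, and a 0.5 idiosyncratic noise scale. Both modes give each node the same marginal; only the shared draw changes.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def sample_scene(rng, shared):
    """Two node KPIs; with shared=True both nodes see one latent draw z per scene."""
    z = rng.gauss(0.0, 1.0) if shared else None
    out = []
    for _ in range(2):
        common = z if shared else rng.gauss(0.0, 1.0)  # per-node draw when not shared
        out.append(common + 0.5 * rng.gauss(0.0, 1.0))
    return out

rng = random.Random(0)
indep = [sample_scene(rng, shared=False) for _ in range(5000)]
coupled = [sample_scene(rng, shared=True) for _ in range(5000)]

rho_indep = pearson([s[0] for s in indep], [s[1] for s in indep])
rho_coupled = pearson([s[0] for s in coupled], [s[1] for s in coupled])
```

Per-node histograms from the two runs look alike, yet the cross-node correlation is near zero in one and strong in the other. That gap is what a single z per scene buys.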
Training a latent-variable model needs two networks: a prior p(z | graph) and a train-only posterior q(z | graph, KPIs), optimised with the ELBO = reconstruction − KL. Toggle the bench: without z, nodes twinkle independently; inject a latent shock and they swing in unison.
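The two ELBO terms can be written out for the simplest possible case. The sketch assumes a scalar latent, Gaussian prior and posterior given as (mean, std) pairs, and a Gaussian observation model; `decode`, `obs_sigma`, and the numbers in the usage lines are hypothetical, and in a real model both distributions would be network outputs conditioned on the graph.

```python
import math
import random

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) in closed form."""
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1): the randomness sits in eps only."""
    return mu + sigma * rng.gauss(0.0, 1.0)

def gaussian_log_lik(y, mean, sigma):
    """Log density of y under N(mean, sigma^2)."""
    return -0.5 * math.log(2.0 * math.pi * sigma ** 2) - (y - mean) ** 2 / (2.0 * sigma ** 2)

def elbo_one_sample(y, prior, posterior, decode, obs_sigma, rng):
    """Single-sample ELBO estimate: reconstruction log-likelihood minus KL(q || p)."""
    z = reparameterize(posterior[0], posterior[1], rng)  # train time: z comes from q
    recon = gaussian_log_lik(y, decode(z), obs_sigma)
    kl = gaussian_kl(posterior[0], posterior[1], prior[0], prior[1])
    return recon - kl

rng = random.Random(0)
prior = (0.0, 1.0)       # stands in for p(z | graph)
posterior = (0.8, 0.3)   # stands in for q(z | graph, KPIs)
value = elbo_one_sample(1.5, prior, posterior, lambda z: 2.0 * z, 0.5, rng)
```

The KL term is what ties the two networks together: it pulls the train-only posterior toward the prior, so that sampling z from the prior at generation time lands in a region the decoder has actually seen.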
Your Graph → GNN → z → decode sketch was essentially v0.5 — but missing the posterior and the KL term. A latent model cannot be trained from the generation path alone; you need q(z|·) and the ELBO to learn what z should mean.
z is the shared realisation — the common cause that correlates a whole region: one traffic event, one weather front, one slow-fading draw. v0 can match the per-cell histograms but will never reproduce the joint pattern that z creates. Energy score (not per-node coverage) is what catches the difference.
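The claim about the energy score can be checked on a toy problem. The sketch below assumes a 20-node scene whose "simulator" couples all nodes through one shock, and compares two forecasters with identical per-node marginals: one coupled, one independent. Sizes, noise scale, and the `draw` helper are illustrative choices.

```python
import math
import random
from itertools import combinations

def energy_score(ensemble, truth):
    """Sample energy score: mean ||X - y|| - 0.5 * mean ||X - X'|| (lower is better)."""
    to_truth = sum(math.dist(x, truth) for x in ensemble) / len(ensemble)
    pairs = list(combinations(ensemble, 2))
    spread = sum(math.dist(a, b) for a, b in pairs) / len(pairs)
    return to_truth - 0.5 * spread

def draw(rng, n_nodes, shared, noise=0.1):
    """One joint KPI vector; both modes share the per-node marginal N(0, 1 + noise^2)."""
    if shared:
        z = rng.gauss(0.0, 1.0)  # one latent shock for the whole scene
        return [z + noise * rng.gauss(0.0, 1.0) for _ in range(n_nodes)]
    sd = math.sqrt(1.0 + noise ** 2)
    return [rng.gauss(0.0, sd) for _ in range(n_nodes)]

rng = random.Random(0)
n_nodes, members, scenes = 20, 20, 2000
es_coupled = es_indep = 0.0
for _ in range(scenes):
    truth = draw(rng, n_nodes, shared=True)  # the "simulator" is coupled
    es_coupled += energy_score([draw(rng, n_nodes, True) for _ in range(members)], truth)
    es_indep += energy_score([draw(rng, n_nodes, False) for _ in range(members)], truth)
es_coupled /= scenes
es_indep /= scenes
```

Per-node coverage would rate the two forecasters the same, because their marginals match. The energy score works on whole vectors, so the forecaster that reproduces the joint pattern ends up with the lower average score.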
v1 — add time (the per-TTI rung)
Now unroll over time. Wrap v0.5 in a transition p(z_t | z_{t-1}, graph_t) stepped TTI by TTI, while the graph itself evolves — UEs move, attach, and detach. The latent carries state forward so the trajectory stays coherent instead of flickering frame to frame.
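As a minimal picture of what a latent transition does, the sketch below swaps the learned p(z_t | z_{t-1}, graph_t) for a hand-written scalar Gaussian AR(1) step and ignores the graph entirely. The coefficient `rho` and the path length are assumptions chosen for illustration.

```python
import math
import random

def rollout(n_steps, rho, rng):
    """Roll a scalar latent forward with a Gaussian AR(1) transition.

    z_t | z_{t-1} ~ N(rho * z_{t-1}, 1 - rho^2), which keeps Var(z_t) = 1.
    """
    z = rng.gauss(0.0, 1.0)
    path = [z]
    for _ in range(n_steps - 1):
        z = rho * z + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        path.append(z)
    return path

def lag1_autocorr(path):
    """Sample lag-1 autocorrelation of a path."""
    mean = sum(path) / len(path)
    num = sum((a - mean) * (b - mean) for a, b in zip(path, path[1:]))
    den = sum((a - mean) ** 2 for a in path)
    return num / den

rng = random.Random(0)
coherent = rollout(20000, rho=0.95, rng=rng)  # latent carries state forward
flicker = rollout(20000, rho=0.0, rng=rng)    # fresh, unrelated draw every step

ac_coherent = lag1_autocorr(coherent)
ac_flicker = lag1_autocorr(flicker)
```

Both paths have the same per-step distribution, so a snapshot cannot tell them apart. Only the one with a transition has memory between steps, and that memory is the difference between a coherent trajectory and frame-to-frame flicker.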
A bigger decision than which GNN flavour you pick is whether to generate literal per-TTI sample paths or per-TTI-resolution distributions (a coarser latent rate). Most RAN use cases need distributions, not literal sub-ms paths; sequence length, compute, and error accumulation are brutal at true TTI granularity. Your two-timescale design (slow super-step + fast recurrence) is exactly the right shape here.
Think slow envelope over fast fading: a diurnal / mobility trend riding on per-TTI fluctuation. v1 has to respect both rates, or you get maps that look right in a snapshot but jitter incoherently over time — the classic failure mode of deterministic per-frame models.
Where diffusion fits — later, not first
Once v0→v1 is trustworthy, you can upgrade the head that turns embeddings into a distribution. A simple quantile head is the floor; a diffusion / score-based head is the expressive ceiling: it can model rich, multimodal, spatially-correlated fields, and it yields ensembles simply by drawing more samples, though their calibration still has to be checked like any other forecast. But it is an upgrade to a working backbone, not the starting point. Build the graph + latent + time first; bolt on the generative head when the marginals and the joint structure already check out.
In your framing the diffusion head is the spatial emission / innovation generator feeding the physics-structured head Φ, and the SSM remains the fast per-TTI controller. The batch vs non-batch axis maps to unconditional vs guided sampling. None of that changes the first three rungs.
You grade the distribution, not the value
Stochastic output rules out bit-for-bit parity: there is no single correct sample to match. So you test distributionally. The first question is calibration / coverage: does the truth land inside the stated band at the stated rate? Plot nominal vs. empirical coverage; a calibrated model sits on the diagonal. Then add sharpness (narrow bands are better, but only if still covered) and proper scores: CRPS for each node's marginal, the energy score for the joint, plus spread–skill for ensembles.
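The nominal-vs-empirical coverage check takes only a few lines. The sketch assumes each test case comes with a forecast ensemble and one true value; the two synthetic forecasters (one matching the truth's spread, one with half of it) are made up to show what each side of the diagonal looks like.

```python
import random

def quantile(sorted_vals, q):
    """Linearly interpolated empirical quantile of an already sorted list."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def empirical_coverage(ensembles, truths, nominal):
    """Fraction of truths inside the central `nominal` interval of each ensemble."""
    lo_q, hi_q = (1.0 - nominal) / 2.0, (1.0 + nominal) / 2.0
    hits = 0
    for ens, y in zip(ensembles, truths):
        s = sorted(ens)
        hits += quantile(s, lo_q) <= y <= quantile(s, hi_q)
    return hits / len(truths)

rng = random.Random(0)
truths = [rng.gauss(0.0, 1.0) for _ in range(2000)]
calibrated = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in truths]
overconfident = [[rng.gauss(0.0, 0.5) for _ in range(200)] for _ in truths]

# One row per nominal level: nominal, calibrated coverage, overconfident coverage.
for nominal in (0.5, 0.8, 0.9):
    print(nominal,
          round(empirical_coverage(calibrated, truths, nominal), 3),
          round(empirical_coverage(overconfident, truths, nominal), 3))
```

The forecaster with the right spread tracks the nominal level closely, while the narrow one falls well below it at every level. On the coverage plot that shows up as a curve sagging under the diagonal.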
The bar is calibrated-and-sharp. Sharp-but-miscalibrated is the seductive trap — confident, narrow, and wrong. And remember the floor: the oracle carries a ±3 dB error budget, so no derived metric can claim more precision than the teacher has.
Your existing parity rig carries straight over: bitwise parity for behaviour-neutral refactors, TOST equivalence for behavioural changes, common random numbers for paired stochastic comparisons. For the generative side, add CRPS + spread–skill. Same discipline — applied to distributions.
From here to a working twin
v0: HeteroData → HGTConv ×2–3 → type-specific quantile heads → pinball loss → held-out-deployment eval. Correct marginals; trustworthy.
v0.5: p(z|graph) + posterior q(z|graph,KPIs), trained by the ELBO. Captures cross-node correlation v0 can't.
v1: p(z_t | z_{t-1}, graph_t), evolving graph. Choose distributions over literal per-TTI paths.
What to do this week
Don't try to see the whole system at once. Get a PyTorch training loop running, then build the v0 HeteroData skeleton on a tiny slice of your simulator's reports — even two node types and one edge type. Make a quantile head emit p10/p50/p90 for one KPI and check coverage on held-out deployments. That single loop, working and measured, is the foundation the entire ladder stands on.
If useful, the natural follow-ups are a concrete PyG HeteroData v0 code skeleton (real node/edge stubs + HGTConv + pinball head) and a one-page PhD-proposal spine aligning the build to the distributional-fidelity thesis.