Engineering Guide · GPU Acceleration

From CPU Bottleneck
to GPU Superpower

A step-by-step guide to thinking, profiling, planning, and implementing GPU acceleration — using real examples from the AI-RSG 5G simulator.

RF Phase Speedup
phase:rf
KPI Phase Speedup
phase:kpi
Datagen Speedup
batch cycles
Overall Simulation
end-to-end
Section 01

The Engineering Mindset

GPU acceleration isn't just about writing CUDA. It's a way of thinking. Before touching any code, every GPU engineer asks the same questions.

🔍

Find the Real Hotspot

Don't guess. 90% of your wall-time lives in 10% of the code. Use a profiler to find it before writing a single kernel.

📦

Think in Batches

GPUs hate one-by-one work. Convert sequential per-item loops into operations over entire arrays at once.

🧵

Expose Parallelism

Ask: "Are these computations independent?" If UE-1's RSRP doesn't affect UE-2's, you can compute them simultaneously.

🚌

Minimize Data Movement

PCIe transfers are expensive. The ideal pattern is: upload once → run many kernels → download once.

📐

Measure, Not Assume

After every change, profile again. GPU wins aren't always where you expect. Let numbers guide decisions.

🔢

Match Precision to Need

Float32 is 2× faster than float64 on most GPU cores. Only use float64 where physics demands it.

The Golden Rule: The best GPU kernel is the one that never runs — because you designed your algorithm to need fewer total operations. Algorithmic complexity beats hardware every time.
Section 02

Finding the Hotspot — Profiling

The AI-RSG simulator ran at ~1,400 cycles total before hitting wall-time limits. Profiling revealed 99% of time lived in a single function.

⏱ Nsight Systems — Phase Timeline (CPU baseline, 1,415 cycles)
phase:rf (RF datagen) 1026.6 s — 99.2%
RF 1,026s
phase:kpi 5.5 s — 0.5%
KPI
phase:output 3.4 s — 0.3%
Out
Key Insight: phase:rf consumed 99.2% of CPU time. This is your target. Everything else is noise.
1

Add NVTX Phase Annotations

Instrument your code with named ranges so the profiler can distinguish phases. Use consistent names so CPU and GPU profiles align.

NSYS_PROFILE_SIM=1 nsys profile --trace=cuda,nvtx,osrt python sim.py
2

Capture with Nsight Systems

Enable --trace-fork-before-exec=true to capture NVTX from forked worker processes. Output lands in /tmp/sim-nsys-{pid}.nsys-rep.

3

Identify the Dominant Phase

Sort phases by wall time. The top entry is your GPU target. In RSG, phase:rf was 726 ms/cycle × 1,415 cycles = 1,026 seconds.

4

Quantify the Inner Loop

Understand why it's slow. RSG's RF phase called _compute_rsrp_batch() once per (cycle × UE × cell) = ~36,000 scalar calls per datagen batch.

⚠ Before — No Visibility
# simulation.py — no annotations
# Profiler sees one opaque block
def run_cycle(self, cycle):
    # RF: black box
    self.radio_model.update()

    # KPI: also black box
    self.kpi_engine.compute()

    # Output: invisible
    self.write_influx()
✓ After — NVTX Annotated
# simulation.py — phases visible in Nsight
from Tera5G.util import nvtx_range as _phase

def run_cycle(self, cycle):
    with _phase("phase:rf", "red"):
        self.radio_model.update()

    with _phase("phase:kpi", "green"):
        self.kpi_engine.compute()

    with _phase("phase:output", "blue"):
        self.write_influx()
Section 03

Planning — What to Offload and How

Not everything benefits from GPU. The decision tree below mirrors how AI-RSG chose what to offload.

Is this computation data-parallel? (independent per item)
Can UE-1's RSRP be computed independently of UE-2's?
YES Is the working set large enough? (>1k elements)
Fewer than ~1k elements → GPU launch overhead dominates. RSRP for 600 UEs × 90 cells = 54,000 pairs → yes, offload
YES Can you batch across time? (multiple cycles at once)
Instead of 1 cycle/call, process 512 cycles per kernel → amortize launch cost → tile by 512 cycles
✓ GPU Use a Numba CUDA kernel (or CuPy for reductions)
rsrp_3gpp_kernel, handover_kernel, kpi_and_link_kernel
NO (too small) Keep on CPU with NumPy — launch overhead not worth it
NO (sequential dependency) Each step depends on the previous (e.g. state machine)
Example: handover carries state cycle-to-cycle. Solution: parallelize across UEs (each thread owns one UE's entire timeline)

Memory Transfer Strategy

The #1 mistake beginners make: transferring data every cycle. Upload once, download once per batch.

🖥 Host (CPU RAM)
UE positions (512 cycles)
Cell config
Beam patterns
LOS grid
→→→
H2D Upload
once per batch
🎮 Device (GPU VRAM)
rsrp_3gpp_kernel
handover_kernel
kpi_and_link_kernel
cell_aggregate_kernel
→→→
D2H Download
once per batch
🖥 Host (CPU RAM)
RSRP table
Serving cells
KPI arrays
Cell aggregates
🔍
What is a "cycle" and what is a "batch"?
Plain-English breakdown of cycle tiling in RSG
A simulation cycle
One tick of simulated time — roughly 100 ms of virtual 5G network life. Each cycle, every UE has moved to a new position, and the network needs fresh RSRP, SINR, and handover decisions for every UE–cell pair. In CPU code this is literally a Python for cycle in range(n_cycles): loop body.
A batch
A chunk of 512 consecutive cycles collected together and sent to the GPU in one go. Instead of calling the GPU 512 separate times (once per cycle), we call it once with all 512 cycles' worth of data. The GPU processes all of them simultaneously — one thread per (UE, cell) pair, looping over the 512 cycles internally.
🐌 CPU — one cycle at a time
▼ processing
726 ms × 512 cycles = ~6 minutes
⚡ GPU — all 512 cycles at once (one batch)
Upload 110 MB + kernel = ~12 ms total
A tile
A tile is one fixed-size chunk that you process at a time. You have 36,000 total cycles but can't fit them all in GPU memory at once — so you cut the work into tiles of 512 cycles each, process one tile, download the results, then move to the next tile. Think of it like reading a book 20 pages at a time because your desk only fits 20 pages open.
Why exactly 512 cycles? — VRAM budget math
RSRP tensor size 512 cycles × 600 UEs × 90 cells × 4 bytes (float32) = 110 MB
GPU VRAM available RTX PRO 6000 Blackwell 96,000 MB
VRAM used by tile 110 MB / 96,000 MB 0.11% ✓
If you tried to do 36,000 cycles at once: 36,000 × 600 × 90 × 4 B = 7.8 GB — still fits here, but on a 8 GB GPU it would overflow. Tiling keeps the code portable and leaves room for intermediate buffers.
Section 04

GPU Acceleration KPIs — When Does It Actually Win?

These are the quantitative thresholds you need to cross before GPU acceleration pays off. Each one is a hard engineering constraint, not a preference.

① The Break-Even Equation

GPU wins only when total GPU time beats CPU time. Total GPU time has three unavoidable terms:

Tgpu = Ttransfer + Tlaunch + Tkernel
GPU is worth it only when Tgpu < Tcpu
Ttransfer = bytes / PCIe bandwidth Tlaunch = ~5–20 µs per kernel (fixed overhead) Tkernel = flops / (SMs × FLOPS/SM × occupancy)
RSG example: phase:rf CPU time = 726 ms/cycle. GPU transfer = ~2 ms, launch = ~0.05 ms, kernel = ~0.6 ms → Tgpu ≈ 2.65 ms. Break-even was at ~4 ms of CPU work — well exceeded.

② Required Degree of Parallelism

You need enough independent work items to keep the GPU busy. An underoccupied GPU is slower than CPU.

Too few — GPU loses
< 1 k
Independent work items. Launch overhead (~10 µs) dominates. CPU wins on a single function call in <1 µs.
Marginal — profile carefully
1 k – 10 k
GPU can win if arithmetic intensity is high. Measure; don't assume. Many warps will be idle.
Solid — GPU wins
10 k – 1 M
Enough to saturate most SMs. RSG: 600 UE × 90 cells × 512 cycles = 27.6 M independent RSRP values per batch.
Ideal — full utilization
> 1 M
All SMs saturated. Memory bandwidth becomes the bottleneck, not compute. Use tiling and shared memory to reduce traffic.
🔍
What is an SM? What is a warp? What does "resident" mean?
Plain-English breakdown of GPU occupancy
SM — Streaming Multiprocessor
Think of an SM as one independent mini-processor on the GPU chip. The RTX 6000 Blackwell has 188 SMs. Each SM can run work completely in parallel with all the others — like having 188 separate CPUs on one chip, except each SM is optimised for running many threads at once rather than one thread very fast.
Warp — a group of 32 threads
A warp is 32 threads that always execute the same instruction at the same time (lockstep SIMD). You write code as if each thread is independent, but the hardware secretly runs them 32 at a time. Your thread block of 256 threads = 8 warps. The SM schedules warps, not individual threads.
Resident warps — warps loaded on an SM right now
When a kernel launches, the GPU assigns some warps to each SM. Those warps are resident — their register state lives on the SM. At any moment, some warps are actively computing; others are stalled waiting for memory. The SM switches between them instantly (zero-cost context switch). This is how the GPU hides memory latency.
Why you need multiple resident warps per SM — latency hiding animation
SM #0
Running (computing)
Waiting for memory (~200 cycles latency)
Ready (next in queue)
⚡ SM is busy — switches instantly when a warp stalls
Time →
With only 1 warp: SM stalls and sits idle while waiting for memory
How to check and ensure good occupancy
Thread block size Use 128 or 256 threads (4–8 warps per block). Too small = underutilised SM. RSG: 256 ✓
Registers per thread Each SM has 65,536 registers. If your kernel uses 64 regs/thread: 65536/64 = 1024 threads max → 4 warps. Keep registers low. nsys shows this
Shared memory per block Each SM has 228 KB shared mem. If a block uses 100 KB, only 2 blocks fit → 2 × 8 warps = 16 resident warps. Fewer is fine if compute-bound. RSG: minimal ✓
Minimum warps to hide latency GPU memory latency ≈ 200–800 cycles. Each warp takes ~4 cycles to issue. Need 200/4 = ~50 warps in flight to fully hide latency (theoretical). Rule: ≥ 4

③ Transfer Budget — The PCIe Tax

Every byte you move between CPU and GPU costs time. This is your most important constraint to calculate upfront.

LinkPeak BWReal-world BW1 GB costs
PCIe Gen 4 x16 (H2D) 32 GB/s ~20 GB/s ~50 ms
PCIe Gen 4 x16 (D2H) 32 GB/s ~18 GB/s ~55 ms
GPU GDDR7 VRAM (RTX 6000) 960 GB/s ~900 GB/s ~1.1 ms
CPU L3 cache bandwidth ~300 GB/s ~250 GB/s ~4 ms

Use this formula to decide if transfer is affordable:

Transfer time = (n_cycles × n_ue × n_cells × 4 bytes) / PCIe_BW
RSG: 512 × 600 × 90 × 4 B = 110 MB → 110 / 20,000 = 5.5 ms H2D + ~4 ms D2H = ~10 ms transfer per batch
vs CPU cost for same batch: 512 cycles × 726 ms = 372 s → Transfer is <0.003% of what we're saving ✓
Transfer budget as % of saved CPU time — lower = more headroom
H2D + D2H (~14 ms/batch)
0.004%
GPU kernel time (~0.65 ms/cycle)
0.09%
CPU cost avoided (726 ms/cycle × 512)
372 s baseline

④ Arithmetic Intensity — Compute vs Memory Bound

The roofline model tells you whether your kernel is limited by compute throughput or memory bandwidth. This determines which axis to optimize.

Memory-bound (low AI) — go faster by reducing loads
Compute-bound (high AI) — go faster by more parallelism or tensor cores
RSG RSRP kernel (~8 FLOP/byte)
RSG handover kernel (~2 FLOP/byte)
KernelFlops / elementBytes readAI (FLOP/B)Bound by
rsrp_3gpp_kernel~40~5 B ~8 Compute
handover_kernel~6~3 B ~2 Memory
kpi_and_link_kernel~25~4 B ~6 Compute
cell_aggregate_kernel~3~8 B ~0.4 Memory
Ridge point for RTX 6000 Blackwell: 82.6 TFLOPS FP32 ÷ 960 GB/s = ~86 FLOP/byte. Any kernel below this line is memory-bound — throwing more SMs at it won't help; you need better memory access patterns (coalescing, tiling, shared memory).

⑤ Minimum Data Dimensions & Shape

These are practical minimums derived from GPU hardware constraints, not arbitrary rules.

📏

Minimum element count

❌ < 1,024 Launch overhead dominates
⚠ 1 k – 32 k GPU may win if compute-heavy
✓ > 32 k Reliable GPU win for most kernels

Why: one CUDA warp = 32 threads. One SM = up to 64 warps = 2,048 threads. RTX 6000 has 188 SMs → need ~386 k threads to saturate all SMs at 100% occupancy.

🧱

Memory layout (critical)

✓ Contiguous C-order (row-major)
❌ Fortran-order / strided access
❌ List of Python objects / dicts

Why: GPU memory controllers load 128-byte cache lines. 32 adjacent threads reading adjacent floats = 1 transaction (coalesced). Non-contiguous = 32 separate transactions = 32× bandwidth waste.

🔢

Data type

✓ float32 — 2× throughput vs float64
⚠ float64 — only for physics requiring it
✓ int32/uint16 — great for indices/masks

RTX 6000: 82.6 TFLOPS FP32 vs 41.3 TFLOPS FP64. RSG uses float32 for RSRP (range −157 to −30 dB, 3 decimal places → float32 is sufficient).

📐

Tensor shape for 2D kernels

✓ Both dims ≥ 32 (fills warps)
⚠ One dim = 1 (degenerate — use 1D kernel)
✓ Dims are multiples of 32 (no tail warp waste)

RSG thread block is 16×16 = 256 threads (8 warps). Grid is ⌈n_cells/16⌉ × ⌈n_UEs/16⌉. With 90 cells and 600 UEs: 6×38 = 228 blocks × 256 threads = 58 k threads → solid.

⑥ Quick Go / No-Go Reference

KPINo-Go thresholdGo thresholdRSG value
Independent work items < 1,000 > 32,000 27.6 M ✓
Transfer time / CPU time saved > 50% < 10% 0.004% ✓
Arithmetic intensity (FLOP/byte) < 0.5 > 2.0 2–8 ✓
Data layout Strided / object list Contiguous float32 array NumPy/CuPy arrays ✓
Tensor size in VRAM > 80% VRAM < 30% VRAM 110 MB / 96 GB ✓
Parallelism type Strict serial dependency Independent or axis-separable UE-parallel ✓
Section 05

Batch Programming — Rewriting for Parallelism

The first step is always: replace per-item scalar loops with operations over entire arrays. This works even before you touch GPU code.

🎬 CPU Sequential vs GPU Parallel — computing RSRP for 48 UE-cell pairs

🐌 CPU — one at a time

UE-0 × Cell-0
UE-1 × Cell-0
UE-2 × Cell-0
UE-3 × Cell-0
UE-4 × Cell-0
UE-5 × Cell-0 … ×48

Sequential · ~726 ms/cycle

⚡ GPU — all at once (16×16 grid)

Parallel · ~0.65 ms/cycle

⚠ Before — NumPy vectorized (still CPU)
# propagation.py — runs on CPU each cycle
def path_loss_uma(d3D, d2D, freq_ghz,
                   h_bs, h_ut, is_los, xp=np):
    d3D  = xp.asarray(d3D, dtype=xp.float64)
    d2D  = xp.asarray(d2D, dtype=xp.float64)
    log_f = xp.log10(freq_ghz)

    d_bp = _breakpoint_uma(h_bs, h_ut, freq_ghz, ...)
    near = d3D <= d_bp

    pl_los = xp.where(near,
        28 + 22 * xp.log10(d3D) + 20 * log_f,
        28 + 40 * xp.log10(d3D) + 20 * log_f
        - 9 * xp.log10(d_bp**2 + (h_bs-h_ut)**2))

    pl_nlos = (13.54 + 39.08 * xp.log10(d3D)
               + 20 * log_f - 0.6 * (h_ut - 1.5))

    # Called 600×90=54,000 times per cycle
    return xp.where(is_los, pl_los,
                   xp.maximum(pl_los, pl_nlos))
✓ After — Numba CUDA device function
# cuda_kernels.py — runs inside GPU kernel
# One thread per (UE, cell) — all in parallel
@cuda.jit(device=True)
def path_loss_uma(d3D, d2D, freq_ghz,
                   h_bs, h_ut, is_los):
    # Scalar: each thread does one pair
    log_f = math.log10(freq_ghz)
    h_e   = 1.0
    d_bp  = (4.0 * (h_bs - h_e) * (h_ut - h_e)
             * freq_ghz * 1e9 / C_LIGHT)

    if d3D <= d_bp:
        pl_los = 28.0 + 22.0*math.log10(d3D) + 20.0*log_f
    else:
        pl_los = (28.0 + 40.0*math.log10(d3D) + 20.0*log_f
                  - 9.0*math.log10(d_bp**2+(h_bs-h_ut)**2))
    if is_los:
        return pl_los
    pl_nlos = (13.54 + 39.08*math.log10(d3D)
               + 20.0*log_f - 0.6*(h_ut - 1.5))
    return max(pl_los, pl_nlos)
⚠ Before — Per-UE sequential loop
# radio_model.py — called once per cycle
for ue in active_ues:
    ue_loc = XYZ(*ue["state"]["xyz"])

    for nb_report in ue.measurements:
        cell_name = nb_report['cell']
        cell = self.cells[cell_name]

        distance = ue_loc.distance(cell['xyz'])
        # ↑ Called ~36,000 times per datagen batch
        rx_power = self.received_power[cell_name](
            cell, distance)
        ue_rf_report = nb_report['value']
        ue_rf_report.update(rx_power)

# Total: 600 UEs × 90 cells = 54,000 iters
# Plus: called per simulation cycle
# Result: 726 ms/cycle on 1 CPU core
✓ After — Batched GPU kernel (all cycles)
# cuda_kernels.py — one kernel for ALL cycles
@cuda.jit
def rsrp_3gpp_kernel(positions, cell_xyz,
    cell_power, ..., n_cycles, n_ue, n_cells):
    # 2D thread grid: (cell, UE)
    ci = (cuda.blockIdx.x * cuda.blockDim.x
          + cuda.threadIdx.x)
    ui = (cuda.blockIdx.y * cuda.blockDim.y
          + cuda.threadIdx.y)
    if ui >= n_ue or ci >= n_cells:
        return
    cx, cy, cz = cell_xyz[ci]
    pwr = cell_power[ci]

    for c in range(n_cycles):  # loop over time
        ux, uy, uz = positions[c, ui]
        dx, dy = ux-cx, uy-cy
        d2D = math.sqrt(dx*dx + dy*dy)
        d3D = math.sqrt(d2D*d2D + (uz-cz)**2)
        pl  = path_loss_uma(d3D, d2D, ...)
        sf  = sos_eval(ux, uy, ci, ...)
        ag  = beam_pattern_gain(dx, dy, dz, ...)
        rsrp_out[c, ui, ci] = max(min(
            pwr - pl - sf + ag, -30.0), -157.0)
# Grid: (⌈n_cells/16⌉ × ⌈n_UEs/16⌉) × 16×16
# All 54,000 pairs in parallel, all 512 cycles
⚠ Before — CuPy einsum (92 MB temp)
# tensor_sim.py — cell aggregation
# Builds large intermediate tensor

# serving_onehot: (cycles × UEs × cells)
# thput:          (cycles × UEs)
cell_thput = cp.einsum(
    'cui,cu->ci',
    serving_onehot,  # 92 MB intermediate!
    thput            # must be materialized
)

# Problem:
# - 92 MB one-hot tensor (bool) allocated
# - cuBLAS dispatch overhead
# - Cannot fuse with other reductions
cell_rank = cp.einsum(
    'cui,cu->ci', serving_onehot, rank)
# ↑ Another 92 MB pass
✓ After — Warp-shuffle reduction kernel
# cuda_kernels.py — zero intermediate tensor
@cuda.jit
def cell_aggregate_kernel(serving, thput, prb,
        rank, cell_thput, cell_prb,
        cell_rank_sum, n_ue):
    cycle = cuda.blockIdx.x
    cell  = cuda.blockIdx.y
    tid   = cuda.threadIdx.x  # 128 threads

    # Each thread accumulates its stripe
    t_sum = 0.0
    for ui in range(tid, n_ue, cuda.blockDim.x):
        if serving[cycle, ui] == cell:
            t_sum += thput[cycle, ui]

    # Warp shuffle: no shared memory needed
    mask = 0xffffffff
    for off in (16, 8, 4, 2, 1):
        t_sum += cuda.shfl_xor_sync(mask,t_sum,off)

    # Write once per warp
    if cuda.threadIdx.x & 31 == 0:
        shared_mem[cuda.threadIdx.x >> 5] = t_sum
    cuda.syncthreads()
    if cuda.threadIdx.x == 0:
        cell_thput[cycle, cell] = (
            shared_mem[0]+shared_mem[1]+
            shared_mem[2]+shared_mem[3])
Section 06

Kernel Anatomy — What Goes Where

A CUDA kernel has four distinct regions. Understanding each lets you optimize systematically.

📐 Handover A3 Kernel Anatomy — handover_kernel

① Thread Index → Work Item

# Each thread owns one UE's entire timeline
ui = (cuda.blockIdx.x * cuda.blockDim.x
      + cuda.threadIdx.x)
if ui >= n_ue:
    return  # guard — always needed

# Key insight: dependency is across time
# (cycle t+1 needs result of cycle t)
# Solution: each thread processes ALL cycles
# for ITS OWN UE — serial in time, parallel in UE

② Load Constants → Registers

# Read cell-level config into registers
# (accessed ~512× per thread, so cache pays)
for nb_idx in range(max_nb):
    nb = nb_table[srv, nb_idx]  # L2 cache
    if nb < 0:
        break

# Stack-resident state (fast):
# srv, cooldown, ho_count all in registers
srv      = serving[0, ui]
cooldown = 0
ho       = 0

③ Main Computation Loop

for c in range(n_cycles):
    if cooldown > 0:
        cooldown -= 1
        serving[c+1, ui] = srv
        continue
    srv_rsrp = rsrp[c, ui, srv]  # global mem
    for nb_idx in range(max_nb):
        nb = nb_table[srv, nb_idx]
        nb_rsrp = rsrp[c, ui, nb]  # global mem
        if nb_rsrp - hyst > srv_rsrp + a3_off:
            srv = nb; ho += 1
            cooldown = cooldown_cyc; break
    serving[c+1, ui] = srv

④ Write Output Once

# Writes happen every cycle (serving cell)
# and once at end (HO count)
# Minimize global writes — they're expensive
serving[c+1, ui] = srv  # per cycle
ho_count[ui]       = ho  # once at end

# Grid: ⌈n_UEs / 256⌉ blocks × 256 threads
# 600 UEs → 3 blocks (tiny launch overhead)
# RTX 6000: 188 SMs → all 3 blocks
# fit on <1 SM, but still parallelizes 600 UEs
⚠ Before — Sequential per-cycle state machine
# gnodeb_model.py — runs per cycle, per UE
for c in range(n_cycles):
    ue_signal = measurements[c]
    serving = ue.serving_cell

    for nb in serving_neighbors:
        nb_rsrp = rsrp_cache[(nb['cell'], ue)]
        if nb_rsrp > serving_rsrp + a3_offset:
            # State update (serial dependency!)
            serving = nb['cell']
            ue.serving_cell = serving
            ho_count += 1

# Problem: cycle t+1 depends on cycle t
# → Cannot parallelize across cycles
# → 1,415 cycles × 600 UEs = 849,000 updates
✓ After — Parallel across UEs, serial in time
# cuda_kernels.py — 600 threads, each owns 1 UE
@cuda.jit
def handover_kernel(rsrp, serving, nb_table,
                     a3_off, hyst, cooldown_cyc,
                     ho_count, n_cycles, n_ue):
    ui = (cuda.blockIdx.x * cuda.blockDim.x
          + cuda.threadIdx.x)
    if ui >= n_ue: return

    cooldown = 0; ho = 0
    srv = serving[0, ui]

    for c in range(n_cycles):  # serial per UE
        srv_rsrp = rsrp[c, ui, srv]
        for nb_idx in range(max_nb):
            nb = nb_table[srv, nb_idx]
            if rsrp[c,ui,nb]-hyst > srv_rsrp+a3_off:
                srv=nb; ho+=1; break
        serving[c+1, ui] = srv

    ho_count[ui] = ho
# All 600 UEs run simultaneously on the GPU
# Each thread is its own state machine
# RTX 6000: 26s total for 43,200 cycles!
Section 07

Benchmarking — Measuring What Matters

A benchmark isn't just timing. It validates correctness, quantifies improvement, and finds your next bottleneck.

📊 Nsight Systems — GPU vs CPU Phase Comparison (normalized per cycle)
phase:rf (RF datagen)
CPU 726 ms/cycle
726 ms
GPU 0.65 ms/cycle
0.65 ms
phase:kpi
CPU 3.9 ms/cycle
3.9 ms
GPU 0.07 ms/cycle
0.07 ms
phase:output
CPU 2.4 ms/cycle
2.4 ms
GPU 0.07 ms/cycle
0.07 ms
⚠ Wrong benchmark approach
# Don't do this — includes JIT warmup
import time

t0 = time.perf_counter()
result = my_gpu_kernel(data)  # includes JIT!
t1 = time.perf_counter()
print(f"Time: {t1-t0:.3f}s")

# Problems:
# 1. First call triggers Numba JIT compilation
#    (can be 10-30 seconds!)
# 2. No warmup = GPU clock not at boost freq
# 3. Single measurement = high variance
# 4. No parity check = maybe wrong answer
✓ Correct benchmark approach
# benchmark.py — from AI-RSG
WARMUP_ROUNDS = 2   # trigger JIT compilation
MEASURE_ROUNDS = 3  # actual measurement

# 1. Parity check first
cpu_result = reference_cpu(data)
gpu_result = gpu_kernel(data)
np.testing.assert_allclose(
    gpu_result, cpu_result, atol=1e-4,
    err_msg="GPU parity failed!")

# 2. Warmup (JIT + GPU clock boost)
for _ in range(WARMUP_ROUNDS):
    _ = gpu_kernel(data)
    cp.cuda.Stream.null.synchronize()

# 3. Timed measurement
times = []
for _ in range(MEASURE_ROUNDS):
    t0 = time.perf_counter()
    _ = gpu_kernel(data)
    cp.cuda.Stream.null.synchronize() # wait!
    times.append(time.perf_counter() - t0)
print(f"Median: {np.median(times)*1000:.1f}ms")
print(f"Speedup: {cpu_time/np.median(times):.0f}×")
Critical: Always call cp.cuda.Stream.null.synchronize() (CuPy) or cuda.synchronize() (Numba) before stopping the timer. GPU kernel launches are asynchronous — timing without sync measures kernel launch, not execution.

GPU Acceleration Checklist

🏁 AI-RSG GPU Pipeline — Final Results (RTX PRO 6000 Blackwell, 188 SMs)
📡
RF
CPU: 726 ms
GPU: 0.65 ms
🔀
Handover
CPU: —
GPU: 0.60 ms
📊
KPI
CPU: 3.9 ms
GPU: 0.07 ms
💾
Output
CPU: 2.4 ms
GPU: 0.07 ms