GPU Acceleration Engineering Guide

Section 01

The Engineering Mindset

GPU acceleration isn't just about writing CUDA. It's a way of thinking. Before touching any code, every GPU engineer asks the same questions.

🔍

Find the Real Hotspot

Don't guess. 90% of your wall-time lives in 10% of the code. Use a profiler to find it before writing a single kernel.

📦

Think in Batches

GPUs hate one-by-one work. Convert sequential per-item loops into operations over entire arrays at once.

🧵

Expose Parallelism

Ask: "Are these computations independent?" If UE-1's RSRP doesn't affect UE-2's, you can compute them simultaneously.

🚌

Minimize Data Movement

PCIe transfers are expensive. The ideal pattern is: upload once → run many kernels → download once.

📐

Measure, Not Assume

After every change, profile again. GPU wins aren't always where you expect. Let numbers guide decisions.

🔢

Match Precision to Need

Float32 is 2× faster than float64 on most GPU cores. Only use float64 where physics demands it.

The Golden Rule: The best GPU kernel is the one that never runs — because you designed your algorithm to need fewer total operations. Algorithmic complexity beats hardware every time.

Section 02

Finding the Hotspot — Profiling

The AI-RSG simulator ran at ~1,400 cycles total before hitting wall-time limits. Profiling revealed 99% of time lived in a single function.

⏱ Nsight Systems — Phase Timeline (CPU baseline, 1,415 cycles)

phase:rf (RF datagen) 1026.6 s — 99.2%

RF 1,026s

phase:kpi 5.5 s — 0.5%

KPI

phase:output 3.4 s — 0.3%

Out

Key Insight: phase:rf consumed 99.2% of CPU time. This is your target. Everything else is noise.

1

Add NVTX Phase Annotations

Instrument your code with named ranges so the profiler can distinguish phases. Use consistent names so CPU and GPU profiles align.

NSYS_PROFILE_SIM=1 nsys profile --trace=cuda,nvtx,osrt python sim.py

2

Capture with Nsight Systems

Enable --trace-fork-before-exec=true to capture NVTX from forked worker processes. Output lands in /tmp/sim-nsys-{pid}.nsys-rep.

3

Identify the Dominant Phase

Sort phases by wall time. The top entry is your GPU target. In RSG, phase:rf was 726 ms/cycle × 1,415 cycles = 1,026 seconds.

4

Quantify the Inner Loop

Understand why it's slow. RSG's RF phase called _compute_rsrp_batch() once per (cycle × UE × cell) = ~36,000 scalar calls per datagen batch.

⚠ Before — No Visibility

# simulation.py — no annotations
# Profiler sees one opaque block
def run_cycle(self, cycle):
    # RF: black box
    self.radio_model.update()

    # KPI: also black box
    self.kpi_engine.compute()

    # Output: invisible
    self.write_influx()

✓ After — NVTX Annotated

# simulation.py — phases visible in Nsight
from Tera5G.util import nvtx_range as _phase

def run_cycle(self, cycle):
    with _phase("phase:rf", "red"):
        self.radio_model.update()

    with _phase("phase:kpi", "green"):
        self.kpi_engine.compute()

    with _phase("phase:output", "blue"):
        self.write_influx()

Section 03

Planning — What to Offload and How

Not everything benefits from GPU. The decision tree below mirrors how AI-RSG chose what to offload.

Is this computation data-parallel? (independent per item)

Can UE-1's RSRP be computed independently of UE-2's?

YES Is the working set large enough? (>1k elements)

Fewer than ~1k elements → GPU launch overhead dominates. RSRP for 600 UEs × 90 cells = 54,000 pairs → yes, offload

YES Can you batch across time? (multiple cycles at once)

Instead of 1 cycle/call, process 512 cycles per kernel → amortize launch cost → tile by 512 cycles

✓ GPU Use a Numba CUDA kernel (or CuPy for reductions)

rsrp_3gpp_kernel, handover_kernel, kpi_and_link_kernel

NO (too small) Keep on CPU with NumPy — launch overhead not worth it

NO (sequential dependency) Each step depends on the previous (e.g. state machine)

Example: handover carries state cycle-to-cycle. Solution: parallelize across UEs (each thread owns one UE's entire timeline)

Memory Transfer Strategy

The #1 mistake beginners make: transferring data every cycle. Upload once, download once per batch.

🖥 Host (CPU RAM)

UE positions (512 cycles)

Cell config

Beam patterns

LOS grid

→→→

H2D Upload
once per batch

🎮 Device (GPU VRAM)

rsrp_3gpp_kernel

handover_kernel

kpi_and_link_kernel

cell_aggregate_kernel

→→→

D2H Download
once per batch

🖥 Host (CPU RAM)

RSRP table

Serving cells

KPI arrays

Cell aggregates

🔍

What is a "cycle" and what is a "batch"?

Plain-English breakdown of cycle tiling in RSG

A simulation cycle

One tick of simulated time — roughly 100 ms of virtual 5G network life. Each cycle, every UE has moved to a new position, and the network needs fresh RSRP, SINR, and handover decisions for every UE–cell pair. In CPU code this is literally a Python for cycle in range(n_cycles): loop body.

A batch

A chunk of 512 consecutive cycles collected together and sent to the GPU in one go. Instead of calling the GPU 512 separate times (once per cycle), we call it once with all 512 cycles' worth of data. The GPU processes all of them simultaneously — one thread per (UE, cell) pair, looping over the 512 cycles internally.

🐌 CPU — one cycle at a time

▼ processing

726 ms × 512 cycles = ~6 minutes

⚡ GPU — all 512 cycles at once (one batch)

Upload 110 MB + kernel = ~12 ms total

A tile

A tile is one fixed-size chunk that you process at a time. You have 36,000 total cycles but can't fit them all in GPU memory at once — so you cut the work into tiles of 512 cycles each, process one tile, download the results, then move to the next tile. Think of it like reading a book 20 pages at a time because your desk only fits 20 pages open.

Why exactly 512 cycles? — VRAM budget math

RSRP tensor size 512 cycles × 600 UEs × 90 cells × 4 bytes (float32) = 110 MB

GPU VRAM available RTX PRO 6000 Blackwell 96,000 MB

VRAM used by tile 110 MB / 96,000 MB 0.11% ✓

If you tried to do 36,000 cycles at once: 36,000 × 600 × 90 × 4 B = 7.8 GB — still fits here, but on a 8 GB GPU it would overflow. Tiling keeps the code portable and leaves room for intermediate buffers.

Section 04

GPU Acceleration KPIs — When Does It Actually Win?

These are the quantitative thresholds you need to cross before GPU acceleration pays off. Each one is a hard engineering constraint, not a preference.

① The Break-Even Equation

GPU wins only when total GPU time beats CPU time. Total GPU time has three unavoidable terms:

T_gpu = T_transfer + T_launch + T_kernel

GPU is worth it only when T_gpu < T_cpu

T_transfer = bytes / PCIe bandwidth T_launch = ~5–20 µs per kernel (fixed overhead) T_kernel = flops / (SMs × FLOPS/SM × occupancy)

RSG example: phase:rf CPU time = 726 ms/cycle. GPU transfer = ~2 ms, launch = ~0.05 ms, kernel = ~0.6 ms → T_gpu ≈ 2.65 ms. Break-even was at ~4 ms of CPU work — well exceeded.

② Required Degree of Parallelism

You need enough independent work items to keep the GPU busy. An underoccupied GPU is slower than CPU.

Too few — GPU loses

< 1 k

Independent work items. Launch overhead (~10 µs) dominates. CPU wins on a single function call in <1 µs.

Marginal — profile carefully

1 k – 10 k

GPU can win if arithmetic intensity is high. Measure; don't assume. Many warps will be idle.

Solid — GPU wins

10 k – 1 M

Enough to saturate most SMs. RSG: 600 UE × 90 cells × 512 cycles = 27.6 M independent RSRP values per batch.

Ideal — full utilization

> 1 M

All SMs saturated. Memory bandwidth becomes the bottleneck, not compute. Use tiling and shared memory to reduce traffic.

🔍

What is an SM? What is a warp? What does "resident" mean?

Plain-English breakdown of GPU occupancy

SM — Streaming Multiprocessor

Think of an SM as one independent mini-processor on the GPU chip. The RTX 6000 Blackwell has 188 SMs. Each SM can run work completely in parallel with all the others — like having 188 separate CPUs on one chip, except each SM is optimised for running many threads at once rather than one thread very fast.

Warp — a group of 32 threads

A warp is 32 threads that always execute the same instruction at the same time (lockstep SIMD). You write code as if each thread is independent, but the hardware secretly runs them 32 at a time. Your thread block of 256 threads = 8 warps. The SM schedules warps, not individual threads.

Resident warps — warps loaded on an SM right now

When a kernel launches, the GPU assigns some warps to each SM. Those warps are resident — their register state lives on the SM. At any moment, some warps are actively computing; others are stalled waiting for memory. The SM switches between them instantly (zero-cost context switch). This is how the GPU hides memory latency.

Why you need multiple resident warps per SM — latency hiding animation

SM #0

Running (computing)

Waiting for memory (~200 cycles latency)

Ready (next in queue)

⚡ SM is busy — switches instantly when a warp stalls

Time →

With only 1 warp: SM stalls and sits idle while waiting for memory

How to check and ensure good occupancy

Thread block size Use 128 or 256 threads (4–8 warps per block). Too small = underutilised SM. RSG: 256 ✓

Registers per thread Each SM has 65,536 registers. If your kernel uses 64 regs/thread: 65536/64 = 1024 threads max → 4 warps. Keep registers low. nsys shows this

Shared memory per block Each SM has 228 KB shared mem. If a block uses 100 KB, only 2 blocks fit → 2 × 8 warps = 16 resident warps. Fewer is fine if compute-bound. RSG: minimal ✓

Minimum warps to hide latency GPU memory latency ≈ 200–800 cycles. Each warp takes ~4 cycles to issue. Need 200/4 = ~50 warps in flight to fully hide latency (theoretical). Rule: ≥ 4

③ Transfer Budget — The PCIe Tax

Every byte you move between CPU and GPU costs time. This is your most important constraint to calculate upfront.

LinkPeak BWReal-world BW1 GB costs

PCIe Gen 4 x16 (H2D) 32 GB/s ~20 GB/s ~50 ms

PCIe Gen 4 x16 (D2H) 32 GB/s ~18 GB/s ~55 ms

GPU GDDR7 VRAM (RTX 6000) 960 GB/s ~900 GB/s ~1.1 ms

CPU L3 cache bandwidth ~300 GB/s ~250 GB/s ~4 ms

Use this formula to decide if transfer is affordable:

Transfer time = (n_cycles × n_ue × n_cells × 4 bytes) / PCIe_BW

RSG: 512 × 600 × 90 × 4 B = 110 MB → 110 / 20,000 = 5.5 ms H2D + ~4 ms D2H = ~10 ms transfer per batch

vs CPU cost for same batch: 512 cycles × 726 ms = 372 s → Transfer is <0.003% of what we're saving ✓

Transfer budget as % of saved CPU time — lower = more headroom

H2D + D2H (~14 ms/batch)

0.004%

GPU kernel time (~0.65 ms/cycle)

0.09%

CPU cost avoided (726 ms/cycle × 512)

372 s baseline

④ Arithmetic Intensity — Compute vs Memory Bound

The roofline model tells you whether your kernel is limited by compute throughput or memory bandwidth. This determines which axis to optimize.

Memory-bound (low AI) — go faster by reducing loads

Compute-bound (high AI) — go faster by more parallelism or tensor cores

RSG RSRP kernel (~8 FLOP/byte)

RSG handover kernel (~2 FLOP/byte)

KernelFlops / elementBytes readAI (FLOP/B)Bound by

rsrp_3gpp_kernel~40~5 B ~8 Compute

handover_kernel~6~3 B ~2 Memory

kpi_and_link_kernel~25~4 B ~6 Compute

cell_aggregate_kernel~3~8 B ~0.4 Memory

Ridge point for RTX 6000 Blackwell: 82.6 TFLOPS FP32 ÷ 960 GB/s = ~86 FLOP/byte. Any kernel below this line is memory-bound — throwing more SMs at it won't help; you need better memory access patterns (coalescing, tiling, shared memory).

⑤ Minimum Data Dimensions & Shape

These are practical minimums derived from GPU hardware constraints, not arbitrary rules.

📏

Minimum element count

❌ < 1,024 Launch overhead dominates

⚠ 1 k – 32 k GPU may win if compute-heavy

✓ > 32 k Reliable GPU win for most kernels

Why: one CUDA warp = 32 threads. One SM = up to 64 warps = 2,048 threads. RTX 6000 has 188 SMs → need ~386 k threads to saturate all SMs at 100% occupancy.

🧱

Memory layout (critical)

✓ Contiguous C-order (row-major)

❌ Fortran-order / strided access

❌ List of Python objects / dicts

Why: GPU memory controllers load 128-byte cache lines. 32 adjacent threads reading adjacent floats = 1 transaction (coalesced). Non-contiguous = 32 separate transactions = 32× bandwidth waste.

🔢

Data type

✓ float32 — 2× throughput vs float64

⚠ float64 — only for physics requiring it

✓ int32/uint16 — great for indices/masks

RTX 6000: 82.6 TFLOPS FP32 vs 41.3 TFLOPS FP64. RSG uses float32 for RSRP (range −157 to −30 dB, 3 decimal places → float32 is sufficient).

📐

Tensor shape for 2D kernels

✓ Both dims ≥ 32 (fills warps)

⚠ One dim = 1 (degenerate — use 1D kernel)

✓ Dims are multiples of 32 (no tail warp waste)

RSG thread block is 16×16 = 256 threads (8 warps). Grid is ⌈n_cells/16⌉ × ⌈n_UEs/16⌉. With 90 cells and 600 UEs: 6×38 = 228 blocks × 256 threads = 58 k threads → solid.

⑥ Quick Go / No-Go Reference

KPINo-Go thresholdGo thresholdRSG value

Independent work items < 1,000 > 32,000 27.6 M ✓

Transfer time / CPU time saved > 50% < 10% 0.004% ✓

Arithmetic intensity (FLOP/byte) < 0.5 > 2.0 2–8 ✓

Data layout Strided / object list Contiguous float32 array NumPy/CuPy arrays ✓

Tensor size in VRAM > 80% VRAM < 30% VRAM 110 MB / 96 GB ✓

Parallelism type Strict serial dependency Independent or axis-separable UE-parallel ✓

Section 05

Batch Programming — Rewriting for Parallelism

The first step is always: replace per-item scalar loops with operations over entire arrays. This works even before you touch GPU code.

🎬 CPU Sequential vs GPU Parallel — computing RSRP for 48 UE-cell pairs

🐌 CPU — one at a time

UE-0 × Cell-0

UE-1 × Cell-0

UE-2 × Cell-0

UE-3 × Cell-0

UE-4 × Cell-0

UE-5 × Cell-0 … ×48

Sequential · ~726 ms/cycle

⚡ GPU — all at once (16×16 grid)

Parallel · ~0.65 ms/cycle

⚠ Before — NumPy vectorized (still CPU)

# propagation.py — runs on CPU each cycle
def path_loss_uma(d3D, d2D, freq_ghz,
                   h_bs, h_ut, is_los, xp=np):
    d3D  = xp.asarray(d3D, dtype=xp.float64)
    d2D  = xp.asarray(d2D, dtype=xp.float64)
    log_f = xp.log10(freq_ghz)

    d_bp = _breakpoint_uma(h_bs, h_ut, freq_ghz, ...)
    near = d3D <= d_bp

    pl_los = xp.where(near,
        28 + 22 * xp.log10(d3D) + 20 * log_f,
        28 + 40 * xp.log10(d3D) + 20 * log_f
        - 9 * xp.log10(d_bp**2 + (h_bs-h_ut)**2))

    pl_nlos = (13.54 + 39.08 * xp.log10(d3D)
               + 20 * log_f - 0.6 * (h_ut - 1.5))

    # Called 600×90=54,000 times per cycle
    return xp.where(is_los, pl_los,
                   xp.maximum(pl_los, pl_nlos))

✓ After — Numba CUDA device function

# cuda_kernels.py — runs inside GPU kernel
# One thread per (UE, cell) — all in parallel
@cuda.jit(device=True)
def path_loss_uma(d3D, d2D, freq_ghz,
                   h_bs, h_ut, is_los):
    # Scalar: each thread does one pair
    log_f = math.log10(freq_ghz)
    h_e   = 1.0
    d_bp  = (4.0 * (h_bs - h_e) * (h_ut - h_e)
             * freq_ghz * 1e9 / C_LIGHT)

    if d3D <= d_bp:
        pl_los = 28.0 + 22.0*math.log10(d3D) + 20.0*log_f
    else:
        pl_los = (28.0 + 40.0*math.log10(d3D) + 20.0*log_f
                  - 9.0*math.log10(d_bp**2+(h_bs-h_ut)**2))
    if is_los:
        return pl_los
    pl_nlos = (13.54 + 39.08*math.log10(d3D)
               + 20.0*log_f - 0.6*(h_ut - 1.5))
    return max(pl_los, pl_nlos)

⚠ Before — Per-UE sequential loop

# radio_model.py — called once per cycle
for ue in active_ues:
    ue_loc = XYZ(*ue["state"]["xyz"])

    for nb_report in ue.measurements:
        cell_name = nb_report['cell']
        cell = self.cells[cell_name]

        distance = ue_loc.distance(cell['xyz'])
        # ↑ Called ~36,000 times per datagen batch
        rx_power = self.received_power[cell_name](
            cell, distance)
        ue_rf_report = nb_report['value']
        ue_rf_report.update(rx_power)

# Total: 600 UEs × 90 cells = 54,000 iters
# Plus: called per simulation cycle
# Result: 726 ms/cycle on 1 CPU core

✓ After — Batched GPU kernel (all cycles)

# cuda_kernels.py — one kernel for ALL cycles
@cuda.jit
def rsrp_3gpp_kernel(positions, cell_xyz,
    cell_power, ..., n_cycles, n_ue, n_cells):
    # 2D thread grid: (cell, UE)
    ci = (cuda.blockIdx.x * cuda.blockDim.x
          + cuda.threadIdx.x)
    ui = (cuda.blockIdx.y * cuda.blockDim.y
          + cuda.threadIdx.y)
    if ui >= n_ue or ci >= n_cells:
        return
    cx, cy, cz = cell_xyz[ci]
    pwr = cell_power[ci]

    for c in range(n_cycles):  # loop over time
        ux, uy, uz = positions[c, ui]
        dx, dy = ux-cx, uy-cy
        d2D = math.sqrt(dx*dx + dy*dy)
        d3D = math.sqrt(d2D*d2D + (uz-cz)**2)
        pl  = path_loss_uma(d3D, d2D, ...)
        sf  = sos_eval(ux, uy, ci, ...)
        ag  = beam_pattern_gain(dx, dy, dz, ...)
        rsrp_out[c, ui, ci] = max(min(
            pwr - pl - sf + ag, -30.0), -157.0)
# Grid: (⌈n_cells/16⌉ × ⌈n_UEs/16⌉) × 16×16
# All 54,000 pairs in parallel, all 512 cycles

⚠ Before — CuPy einsum (92 MB temp)

# tensor_sim.py — cell aggregation
# Builds large intermediate tensor

# serving_onehot: (cycles × UEs × cells)
# thput:          (cycles × UEs)
cell_thput = cp.einsum(
    'cui,cu->ci',
    serving_onehot,  # 92 MB intermediate!
    thput            # must be materialized
)

# Problem:
# - 92 MB one-hot tensor (bool) allocated
# - cuBLAS dispatch overhead
# - Cannot fuse with other reductions
cell_rank = cp.einsum(
    'cui,cu->ci', serving_onehot, rank)
# ↑ Another 92 MB pass

✓ After — Warp-shuffle reduction kernel

# cuda_kernels.py — zero intermediate tensor
@cuda.jit
def cell_aggregate_kernel(serving, thput, prb,
        rank, cell_thput, cell_prb,
        cell_rank_sum, n_ue):
    cycle = cuda.blockIdx.x
    cell  = cuda.blockIdx.y
    tid   = cuda.threadIdx.x  # 128 threads

    # Each thread accumulates its stripe
    t_sum = 0.0
    for ui in range(tid, n_ue, cuda.blockDim.x):
        if serving[cycle, ui] == cell:
            t_sum += thput[cycle, ui]

    # Warp shuffle: no shared memory needed
    mask = 0xffffffff
    for off in (16, 8, 4, 2, 1):
        t_sum += cuda.shfl_xor_sync(mask,t_sum,off)

    # Write once per warp
    if cuda.threadIdx.x & 31 == 0:
        shared_mem[cuda.threadIdx.x >> 5] = t_sum
    cuda.syncthreads()
    if cuda.threadIdx.x == 0:
        cell_thput[cycle, cell] = (
            shared_mem[0]+shared_mem[1]+
            shared_mem[2]+shared_mem[3])

Section 06

Kernel Anatomy — What Goes Where

A CUDA kernel has four distinct regions. Understanding each lets you optimize systematically.

📐 Handover A3 Kernel Anatomy — handover_kernel

① Thread Index → Work Item

# Each thread owns one UE's entire timeline
ui = (cuda.blockIdx.x * cuda.blockDim.x
      + cuda.threadIdx.x)
if ui >= n_ue:
    return  # guard — always needed

# Key insight: dependency is across time
# (cycle t+1 needs result of cycle t)
# Solution: each thread processes ALL cycles
# for ITS OWN UE — serial in time, parallel in UE

② Load Constants → Registers

# Read cell-level config into registers
# (accessed ~512× per thread, so cache pays)
for nb_idx in range(max_nb):
    nb = nb_table[srv, nb_idx]  # L2 cache
    if nb < 0:
        break

# Stack-resident state (fast):
# srv, cooldown, ho_count all in registers
srv      = serving[0, ui]
cooldown = 0
ho       = 0

③ Main Computation Loop

for c in range(n_cycles):
    if cooldown > 0:
        cooldown -= 1
        serving[c+1, ui] = srv
        continue
    srv_rsrp = rsrp[c, ui, srv]  # global mem
    for nb_idx in range(max_nb):
        nb = nb_table[srv, nb_idx]
        nb_rsrp = rsrp[c, ui, nb]  # global mem
        if nb_rsrp - hyst > srv_rsrp + a3_off:
            srv = nb; ho += 1
            cooldown = cooldown_cyc; break
    serving[c+1, ui] = srv

④ Write Output Once

# Writes happen every cycle (serving cell)
# and once at end (HO count)
# Minimize global writes — they're expensive
serving[c+1, ui] = srv  # per cycle
ho_count[ui]       = ho  # once at end

# Grid: ⌈n_UEs / 256⌉ blocks × 256 threads
# 600 UEs → 3 blocks (tiny launch overhead)
# RTX 6000: 188 SMs → all 3 blocks
# fit on <1 SM, but still parallelizes 600 UEs

⚠ Before — Sequential per-cycle state machine

# gnodeb_model.py — runs per cycle, per UE
for c in range(n_cycles):
    ue_signal = measurements[c]
    serving = ue.serving_cell

    for nb in serving_neighbors:
        nb_rsrp = rsrp_cache[(nb['cell'], ue)]
        if nb_rsrp > serving_rsrp + a3_offset:
            # State update (serial dependency!)
            serving = nb['cell']
            ue.serving_cell = serving
            ho_count += 1

# Problem: cycle t+1 depends on cycle t
# → Cannot parallelize across cycles
# → 1,415 cycles × 600 UEs = 849,000 updates

✓ After — Parallel across UEs, serial in time

# cuda_kernels.py — 600 threads, each owns 1 UE
@cuda.jit
def handover_kernel(rsrp, serving, nb_table,
                     a3_off, hyst, cooldown_cyc,
                     ho_count, n_cycles, n_ue):
    ui = (cuda.blockIdx.x * cuda.blockDim.x
          + cuda.threadIdx.x)
    if ui >= n_ue: return

    cooldown = 0; ho = 0
    srv = serving[0, ui]

    for c in range(n_cycles):  # serial per UE
        srv_rsrp = rsrp[c, ui, srv]
        for nb_idx in range(max_nb):
            nb = nb_table[srv, nb_idx]
            if rsrp[c,ui,nb]-hyst > srv_rsrp+a3_off:
                srv=nb; ho+=1; break
        serving[c+1, ui] = srv

    ho_count[ui] = ho
# All 600 UEs run simultaneously on the GPU
# Each thread is its own state machine
# RTX 6000: 26s total for 43,200 cycles!

Section 07

Sizing the Grid — How Many Threads & Blocks?

A GPU launch is a grid of blocks, and each block is a group of threads. Sizing the grid is not guesswork — it falls straight out of your scenario dimensions. The rule: launch one thread per independent work item, then round up into blocks.

📏

The 3-Step Grid Recipe

Works for every kernel — derive the launch from the workload, never hard-code it.

① Count work items — N

Find the independent axes of your scenario and multiply them. These are the dimensions that can run at the same time with no data dependency. This product is N.

② Pick threads / block — TPB

A fixed, hardware-friendly constant. Must be a multiple of 32 (the warp size). 256 is the safe default; 128–512 are all reasonable. This is the one number you choose by hand.

③ Derive blocks

Round up so every work item is covered: blocks = ⌈N / TPB⌉. The launch covers blocks × TPB threads — always ≥ N, so the kernel needs a guard to drop the few extra threads.

blocks = ⌈ N / TPB ⌉

Threads launched = blocks × TPB ≥ N — the kernel's first line must be if idx >= N: return

N = product of independent scenario dimensions TPB = threads per block (32-multiple, default 256) blocks = how many SMs-worth of work to schedule

Map your scenario to `N`

N is not a tuning knob — it is dictated by which axes of the scenario are independent. The two RSG kernels show the two extremes.

RSRP — fully independent (3 axes)

# Every (cycle, UE, cell) RSRP value is a pure
# function of position + cell params → no deps.
# All three axes parallelize.
N = n_cycles * n_ue * n_cells

# City scenario: 512 × 600 × 90
N = 27_648_000   # 27.6 M threads — saturates GPU

Handover — one axis only (UE)

# serving[c+1] depends on serving[c] → the cycle
# axis is SERIAL. Only the UE axis is independent.
# Each thread walks all cycles for its own UE.
N = n_ue

# City scenario: 600 UEs
N = 600          # only 600 threads — under-fills GPU

Worked scenarios (TPB = 256)

Same formula, three scenario scales. Notice how the RSRP grid grows with the scenario while the handover grid barely moves — that asymmetry is why handover is the bottleneck.

Lab — 50 UE · 12 cells · 512 cyc

307 k

RSRP: N = 307 k → 1200 blocks.
Handover: N = 50 → 1 block (GPU mostly idle here).

City — 600 UE · 90 cells · 512 cyc

27.6 M

RSRP: N = 27.6 M → 107 813 blocks.
Handover: N = 600 → 3 blocks.

Dense — 2000 UE · 200 cells · 512 cyc

204.8 M

RSRP: N = 204.8 M → 800 000 blocks.
Handover: N = 2000 → 8 blocks.

How to implement it

Compute the grid from the live scenario sizes at launch time. Below: a 1-D launch (handover) and a 2-D launch (RSRP), in Numba.

1-D grid — handover_kernel

# N = n_ue (the only independent axis)
tpb    = 256
blocks = (n_ue + tpb - 1) // tpb   # ⌈N/TPB⌉

handover_kernel[blocks, tpb](
    d_rsrp, d_serving, d_nb_table,
    a3_off, hyst, cooldown, d_ho_count,
    n_cycles, n_ue)

# inside the kernel — drop the extra threads:
def handover_kernel(...):
    ui = cuda.grid(1)
    if ui >= n_ue: return     # the guard

2-D grid — rsrp_3gpp_kernel

# Map axes to thread dims: x=cell, y=UE,
# and tile the cycle axis across grid.z
tpb    = (16, 16)            # 256 threads/block
blocks = (
    (n_cells   + 15) // 16,    # ⌈cells/16⌉
    (n_ue      + 15) // 16,    # ⌈UE/16⌉
    n_cycles)                  # one block-layer per cycle

rsrp_3gpp_kernel[blocks, tpb](d_pos, d_cell, d_rsrp, ...)

# inside the kernel:
ci, ui = cuda.grid(2)
c      = cuda.blockIdx.z
if ci >= n_cells or ui >= n_ue: return

CuPy shortcut: high-level CuPy ops (cp.matmul, broadcasting, cp.sum) pick the grid for you — you only size grids by hand for custom @cuda.jit / RawKernel kernels. Write the explicit grid only where you wrote the explicit kernel.

Three sizing traps: (1) TPB must be a multiple of 32 — a partial warp wastes lanes. (2) Each grid dimension caps at 2³¹−1 blocks; if N is huge, use a grid-stride loop instead of one thread per item. (3) Too few blocks under-fills the SMs — the city handover (3 blocks on 188 SMs) is the textbook case, and exactly why it dominates the profile despite the GPU being mostly idle.

Section 08

Benchmarking — Measuring What Matters

A benchmark isn't just timing. It validates correctness, quantifies improvement, and finds your next bottleneck.

📊 Nsight Systems — GPU vs CPU Phase Comparison (normalized per cycle)

phase:rf (RF datagen)

CPU 726 ms/cycle

726 ms

GPU 0.65 ms/cycle

0.65 ms

phase:kpi

CPU 3.9 ms/cycle

3.9 ms

GPU 0.07 ms/cycle

0.07 ms

phase:output

CPU 2.4 ms/cycle

2.4 ms

GPU 0.07 ms/cycle

0.07 ms

⚠ Wrong benchmark approach

# Don't do this — includes JIT warmup
import time

t0 = time.perf_counter()
result = my_gpu_kernel(data)  # includes JIT!
t1 = time.perf_counter()
print(f"Time: {t1-t0:.3f}s")

# Problems:
# 1. First call triggers Numba JIT compilation
#    (can be 10-30 seconds!)
# 2. No warmup = GPU clock not at boost freq
# 3. Single measurement = high variance
# 4. No parity check = maybe wrong answer

✓ Correct benchmark approach

# benchmark.py — from AI-RSG
WARMUP_ROUNDS = 2   # trigger JIT compilation
MEASURE_ROUNDS = 3  # actual measurement

# 1. Parity check first
cpu_result = reference_cpu(data)
gpu_result = gpu_kernel(data)
np.testing.assert_allclose(
    gpu_result, cpu_result, atol=1e-4,
    err_msg="GPU parity failed!")

# 2. Warmup (JIT + GPU clock boost)
for _ in range(WARMUP_ROUNDS):
    _ = gpu_kernel(data)
    cp.cuda.Stream.null.synchronize()

# 3. Timed measurement
times = []
for _ in range(MEASURE_ROUNDS):
    t0 = time.perf_counter()
    _ = gpu_kernel(data)
    cp.cuda.Stream.null.synchronize() # wait!
    times.append(time.perf_counter() - t0)
print(f"Median: {np.median(times)*1000:.1f}ms")
print(f"Speedup: {cpu_time/np.median(times):.0f}×")

Critical: Always call cp.cuda.Stream.null.synchronize() (CuPy) or cuda.synchronize() (Numba) before stopping the timer. GPU kernel launches are asynchronous — timing without sync measures kernel launch, not execution.

GPU Acceleration Checklist

✓ Profile first — identify the real hotspot (use nsys/nvtx)
✓ Verify parallelism — confirm computations are independent
✓ Batch across time/space — at least 1k elements for GPU to win
✓ Minimize H2D/D2H transfers — upload once, download once per batch
✓ Use float32 where precision allows — 2× throughput vs float64
✓ Validate parity against CPU reference before benchmarking
✓ Warm up the GPU (2 rounds) before measuring
✓ Synchronize before stopping the timer
→ Profile again after GPU — find the new bottleneck (Amdahl's Law)
→ RSG next target: handover kernel (26s of 28s phase:rf)

🏁 AI-RSG GPU Pipeline — Final Results (RTX PRO 6000 Blackwell, 188 SMs)

📡

RF

CPU: 726 ms

GPU: 0.65 ms

🔀

Handover

CPU: —

GPU: 0.60 ms

📊

KPI

CPU: 3.9 ms

GPU: 0.07 ms

💾

Output

CPU: 2.4 ms

GPU: 0.07 ms

From CPU Bottleneckto GPU Superpower

The Engineering Mindset

Find the Real Hotspot

Think in Batches

Expose Parallelism

Minimize Data Movement

Measure, Not Assume

Match Precision to Need

Finding the Hotspot — Profiling

Add NVTX Phase Annotations

Capture with Nsight Systems

Identify the Dominant Phase

Quantify the Inner Loop

Planning — What to Offload and How

Memory Transfer Strategy

GPU Acceleration KPIs — When Does It Actually Win?

① The Break-Even Equation

② Required Degree of Parallelism

③ Transfer Budget — The PCIe Tax

④ Arithmetic Intensity — Compute vs Memory Bound

⑤ Minimum Data Dimensions & Shape

Minimum element count

Memory layout (critical)

Data type

Tensor shape for 2D kernels

⑥ Quick Go / No-Go Reference

Batch Programming — Rewriting for Parallelism

🐌 CPU — one at a time

⚡ GPU — all at once (16×16 grid)

Kernel Anatomy — What Goes Where

① Thread Index → Work Item

② Load Constants → Registers

③ Main Computation Loop

④ Write Output Once

Sizing the Grid — How Many Threads & Blocks?

Map your scenario to N

Worked scenarios (TPB = 256)

How to implement it

Benchmarking — Measuring What Matters

GPU Acceleration Checklist

From CPU Bottleneck
to GPU Superpower

Map your scenario to `N`