A step-by-step guide to thinking, profiling, planning, and implementing GPU acceleration — using real examples from the AI-RSG 5G simulator.
GPU acceleration isn't just about writing CUDA. It's a way of thinking. Before touching any code, every GPU engineer asks the same questions.
Don't guess. 90% of your wall-time lives in 10% of the code. Use a profiler to find it before writing a single kernel.
GPUs hate one-by-one work. Convert sequential per-item loops into operations over entire arrays at once.
Ask: "Are these computations independent?" If UE-1's RSRP doesn't affect UE-2's, you can compute them simultaneously.
PCIe transfers are expensive. The ideal pattern is: upload once → run many kernels → download once.
After every change, profile again. GPU wins aren't always where you expect. Let numbers guide decisions.
Float32 is 2× faster than float64 on most GPU cores. Only use float64 where physics demands it.
The AI-RSG simulator ran at ~1,400 cycles total before hitting wall-time limits. Profiling revealed 99% of time lived in a single function.
Instrument your code with named ranges so the profiler can distinguish phases. Use consistent names so CPU and GPU profiles align.
NSYS_PROFILE_SIM=1 nsys profile --trace=cuda,nvtx,osrt python sim.py
Enable --trace-fork-before-exec=true to capture NVTX from forked worker processes. Output lands in /tmp/sim-nsys-{pid}.nsys-rep.
Sort phases by wall time. The top entry is your GPU target. In RSG, phase:rf was 726 ms/cycle × 1,415 cycles = 1,026 seconds.
Understand why it's slow. RSG's RF phase called _compute_rsrp_batch() once per (cycle × UE × cell) = ~36,000 scalar calls per datagen batch.
# simulation.py — no annotations # Profiler sees one opaque block def run_cycle(self, cycle): # RF: black box self.radio_model.update() # KPI: also black box self.kpi_engine.compute() # Output: invisible self.write_influx()
# simulation.py — phases visible in Nsight from Tera5G.util import nvtx_range as _phase def run_cycle(self, cycle): with _phase("phase:rf", "red"): self.radio_model.update() with _phase("phase:kpi", "green"): self.kpi_engine.compute() with _phase("phase:output", "blue"): self.write_influx()
Not everything benefits from GPU. The decision tree below mirrors how AI-RSG chose what to offload.
The #1 mistake beginners make: transferring data every cycle. Upload once, download once per batch.
for cycle in range(n_cycles): loop body.These are the quantitative thresholds you need to cross before GPU acceleration pays off. Each one is a hard engineering constraint, not a preference.
GPU wins only when total GPU time beats CPU time. Total GPU time has three unavoidable terms:
You need enough independent work items to keep the GPU busy. An underoccupied GPU is slower than CPU.
Every byte you move between CPU and GPU costs time. This is your most important constraint to calculate upfront.
Use this formula to decide if transfer is affordable:
The roofline model tells you whether your kernel is limited by compute throughput or memory bandwidth. This determines which axis to optimize.
These are practical minimums derived from GPU hardware constraints, not arbitrary rules.
Why: one CUDA warp = 32 threads. One SM = up to 64 warps = 2,048 threads. RTX 6000 has 188 SMs → need ~386 k threads to saturate all SMs at 100% occupancy.
Why: GPU memory controllers load 128-byte cache lines. 32 adjacent threads reading adjacent floats = 1 transaction (coalesced). Non-contiguous = 32 separate transactions = 32× bandwidth waste.
RTX 6000: 82.6 TFLOPS FP32 vs 41.3 TFLOPS FP64. RSG uses float32 for RSRP (range −157 to −30 dB, 3 decimal places → float32 is sufficient).
RSG thread block is 16×16 = 256 threads (8 warps). Grid is ⌈n_cells/16⌉ × ⌈n_UEs/16⌉. With 90 cells and 600 UEs: 6×38 = 228 blocks × 256 threads = 58 k threads → solid.
The first step is always: replace per-item scalar loops with operations over entire arrays. This works even before you touch GPU code.
Sequential · ~726 ms/cycle
Parallel · ~0.65 ms/cycle
# propagation.py — runs on CPU each cycle def path_loss_uma(d3D, d2D, freq_ghz, h_bs, h_ut, is_los, xp=np): d3D = xp.asarray(d3D, dtype=xp.float64) d2D = xp.asarray(d2D, dtype=xp.float64) log_f = xp.log10(freq_ghz) d_bp = _breakpoint_uma(h_bs, h_ut, freq_ghz, ...) near = d3D <= d_bp pl_los = xp.where(near, 28 + 22 * xp.log10(d3D) + 20 * log_f, 28 + 40 * xp.log10(d3D) + 20 * log_f - 9 * xp.log10(d_bp**2 + (h_bs-h_ut)**2)) pl_nlos = (13.54 + 39.08 * xp.log10(d3D) + 20 * log_f - 0.6 * (h_ut - 1.5)) # Called 600×90=54,000 times per cycle return xp.where(is_los, pl_los, xp.maximum(pl_los, pl_nlos))
# cuda_kernels.py — runs inside GPU kernel # One thread per (UE, cell) — all in parallel @cuda.jit(device=True) def path_loss_uma(d3D, d2D, freq_ghz, h_bs, h_ut, is_los): # Scalar: each thread does one pair log_f = math.log10(freq_ghz) h_e = 1.0 d_bp = (4.0 * (h_bs - h_e) * (h_ut - h_e) * freq_ghz * 1e9 / C_LIGHT) if d3D <= d_bp: pl_los = 28.0 + 22.0*math.log10(d3D) + 20.0*log_f else: pl_los = (28.0 + 40.0*math.log10(d3D) + 20.0*log_f - 9.0*math.log10(d_bp**2+(h_bs-h_ut)**2)) if is_los: return pl_los pl_nlos = (13.54 + 39.08*math.log10(d3D) + 20.0*log_f - 0.6*(h_ut - 1.5)) return max(pl_los, pl_nlos)
# radio_model.py — called once per cycle for ue in active_ues: ue_loc = XYZ(*ue["state"]["xyz"]) for nb_report in ue.measurements: cell_name = nb_report['cell'] cell = self.cells[cell_name] distance = ue_loc.distance(cell['xyz']) # ↑ Called ~36,000 times per datagen batch rx_power = self.received_power[cell_name]( cell, distance) ue_rf_report = nb_report['value'] ue_rf_report.update(rx_power) # Total: 600 UEs × 90 cells = 54,000 iters # Plus: called per simulation cycle # Result: 726 ms/cycle on 1 CPU core
# cuda_kernels.py — one kernel for ALL cycles @cuda.jit def rsrp_3gpp_kernel(positions, cell_xyz, cell_power, ..., n_cycles, n_ue, n_cells): # 2D thread grid: (cell, UE) ci = (cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x) ui = (cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y) if ui >= n_ue or ci >= n_cells: return cx, cy, cz = cell_xyz[ci] pwr = cell_power[ci] for c in range(n_cycles): # loop over time ux, uy, uz = positions[c, ui] dx, dy = ux-cx, uy-cy d2D = math.sqrt(dx*dx + dy*dy) d3D = math.sqrt(d2D*d2D + (uz-cz)**2) pl = path_loss_uma(d3D, d2D, ...) sf = sos_eval(ux, uy, ci, ...) ag = beam_pattern_gain(dx, dy, dz, ...) rsrp_out[c, ui, ci] = max(min( pwr - pl - sf + ag, -30.0), -157.0) # Grid: (⌈n_cells/16⌉ × ⌈n_UEs/16⌉) × 16×16 # All 54,000 pairs in parallel, all 512 cycles
# tensor_sim.py — cell aggregation # Builds large intermediate tensor # serving_onehot: (cycles × UEs × cells) # thput: (cycles × UEs) cell_thput = cp.einsum( 'cui,cu->ci', serving_onehot, # 92 MB intermediate! thput # must be materialized ) # Problem: # - 92 MB one-hot tensor (bool) allocated # - cuBLAS dispatch overhead # - Cannot fuse with other reductions cell_rank = cp.einsum( 'cui,cu->ci', serving_onehot, rank) # ↑ Another 92 MB pass
# cuda_kernels.py — zero intermediate tensor @cuda.jit def cell_aggregate_kernel(serving, thput, prb, rank, cell_thput, cell_prb, cell_rank_sum, n_ue): cycle = cuda.blockIdx.x cell = cuda.blockIdx.y tid = cuda.threadIdx.x # 128 threads # Each thread accumulates its stripe t_sum = 0.0 for ui in range(tid, n_ue, cuda.blockDim.x): if serving[cycle, ui] == cell: t_sum += thput[cycle, ui] # Warp shuffle: no shared memory needed mask = 0xffffffff for off in (16, 8, 4, 2, 1): t_sum += cuda.shfl_xor_sync(mask,t_sum,off) # Write once per warp if cuda.threadIdx.x & 31 == 0: shared_mem[cuda.threadIdx.x >> 5] = t_sum cuda.syncthreads() if cuda.threadIdx.x == 0: cell_thput[cycle, cell] = ( shared_mem[0]+shared_mem[1]+ shared_mem[2]+shared_mem[3])
A CUDA kernel has four distinct regions. Understanding each lets you optimize systematically.
handover_kernel
# Each thread owns one UE's entire timeline ui = (cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x) if ui >= n_ue: return # guard — always needed # Key insight: dependency is across time # (cycle t+1 needs result of cycle t) # Solution: each thread processes ALL cycles # for ITS OWN UE — serial in time, parallel in UE
# Read cell-level config into registers # (accessed ~512× per thread, so cache pays) for nb_idx in range(max_nb): nb = nb_table[srv, nb_idx] # L2 cache if nb < 0: break # Stack-resident state (fast): # srv, cooldown, ho_count all in registers srv = serving[0, ui] cooldown = 0 ho = 0
for c in range(n_cycles): if cooldown > 0: cooldown -= 1 serving[c+1, ui] = srv continue srv_rsrp = rsrp[c, ui, srv] # global mem for nb_idx in range(max_nb): nb = nb_table[srv, nb_idx] nb_rsrp = rsrp[c, ui, nb] # global mem if nb_rsrp - hyst > srv_rsrp + a3_off: srv = nb; ho += 1 cooldown = cooldown_cyc; break serving[c+1, ui] = srv
# Writes happen every cycle (serving cell) # and once at end (HO count) # Minimize global writes — they're expensive serving[c+1, ui] = srv # per cycle ho_count[ui] = ho # once at end # Grid: ⌈n_UEs / 256⌉ blocks × 256 threads # 600 UEs → 3 blocks (tiny launch overhead) # RTX 6000: 188 SMs → all 3 blocks # fit on <1 SM, but still parallelizes 600 UEs
# gnodeb_model.py — runs per cycle, per UE for c in range(n_cycles): ue_signal = measurements[c] serving = ue.serving_cell for nb in serving_neighbors: nb_rsrp = rsrp_cache[(nb['cell'], ue)] if nb_rsrp > serving_rsrp + a3_offset: # State update (serial dependency!) serving = nb['cell'] ue.serving_cell = serving ho_count += 1 # Problem: cycle t+1 depends on cycle t # → Cannot parallelize across cycles # → 1,415 cycles × 600 UEs = 849,000 updates
# cuda_kernels.py — 600 threads, each owns 1 UE @cuda.jit def handover_kernel(rsrp, serving, nb_table, a3_off, hyst, cooldown_cyc, ho_count, n_cycles, n_ue): ui = (cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x) if ui >= n_ue: return cooldown = 0; ho = 0 srv = serving[0, ui] for c in range(n_cycles): # serial per UE srv_rsrp = rsrp[c, ui, srv] for nb_idx in range(max_nb): nb = nb_table[srv, nb_idx] if rsrp[c,ui,nb]-hyst > srv_rsrp+a3_off: srv=nb; ho+=1; break serving[c+1, ui] = srv ho_count[ui] = ho # All 600 UEs run simultaneously on the GPU # Each thread is its own state machine # RTX 6000: 26s total for 43,200 cycles!
A benchmark isn't just timing. It validates correctness, quantifies improvement, and finds your next bottleneck.
# Don't do this — includes JIT warmup import time t0 = time.perf_counter() result = my_gpu_kernel(data) # includes JIT! t1 = time.perf_counter() print(f"Time: {t1-t0:.3f}s") # Problems: # 1. First call triggers Numba JIT compilation # (can be 10-30 seconds!) # 2. No warmup = GPU clock not at boost freq # 3. Single measurement = high variance # 4. No parity check = maybe wrong answer
# benchmark.py — from AI-RSG WARMUP_ROUNDS = 2 # trigger JIT compilation MEASURE_ROUNDS = 3 # actual measurement # 1. Parity check first cpu_result = reference_cpu(data) gpu_result = gpu_kernel(data) np.testing.assert_allclose( gpu_result, cpu_result, atol=1e-4, err_msg="GPU parity failed!") # 2. Warmup (JIT + GPU clock boost) for _ in range(WARMUP_ROUNDS): _ = gpu_kernel(data) cp.cuda.Stream.null.synchronize() # 3. Timed measurement times = [] for _ in range(MEASURE_ROUNDS): t0 = time.perf_counter() _ = gpu_kernel(data) cp.cuda.Stream.null.synchronize() # wait! times.append(time.perf_counter() - t0) print(f"Median: {np.median(times)*1000:.1f}ms") print(f"Speedup: {cpu_time/np.median(times):.0f}×")
cp.cuda.Stream.null.synchronize() (CuPy) or cuda.synchronize() (Numba) before stopping the timer. GPU kernel launches are asynchronous — timing without sync measures kernel launch, not execution.