Electronic Thesis and Dissertation Repository

Thesis Format: Integrated Article

Master of Science

Computer Science

Moreno Maza, Marc


In this thesis, we present KLARAPTOR (Kernel LAunch parameters RAtional Program estimaTOR), a freely available tool for dynamically determining the launch parameters of a CUDA kernel. We describe a technique for building, at the compile time of a CUDA program, a helper program that is used at run time to determine near-optimal launch parameters for the kernels of that program. For each kernel invocation, the technique combines the MWP-CWP performance prediction model with run-time data parameters and run-time hardware parameters. It is implemented in KLARAPTOR using the LLVM Pass Framework and the NVIDIA Nsight Compute CLI profiler. We demonstrate the effectiveness of our approach through experiments on the PolyBench benchmark suite of CUDA kernels.
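
To make the notion of kernel launch parameters concrete, the following minimal sketch enumerates candidate two-dimensional thread block shapes under two hardware limits. It is an illustration only and is not taken from KLARAPTOR: the names `HardwareLimits` and `candidate_configs` are hypothetical, the restriction to power-of-two dimensions is an assumption made for brevity, and the limits used (a warp size of 32 and at most 1024 threads per block) are values typical of current NVIDIA GPUs that a real tool would query from the device.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HardwareLimits:
    """Hypothetical container for hardware limits (assumed, not queried)."""
    warp_size: int
    max_threads_per_block: int


def candidate_configs(hw: HardwareLimits) -> List[Tuple[int, int]]:
    """Enumerate block shapes (x, y) with power-of-two dimensions.

    A shape is kept when its thread count x * y does not exceed the
    per-block limit and is a multiple of the warp size.
    """
    configs = []
    x = 1
    while x <= hw.max_threads_per_block:
        y = 1
        while x * y <= hw.max_threads_per_block:
            if (x * y) % hw.warp_size == 0:
                configs.append((x, y))
            y *= 2
        x *= 2
    return configs


if __name__ == "__main__":
    # Assumed limits for illustration only.
    hw = HardwareLimits(warp_size=32, max_threads_per_block=1024)
    for x, y in candidate_configs(hw):
        print(f"block = ({x}, {y}), threads = {x * y}")
```

Each shape in the resulting list is one possible choice of launch parameters; the technique described above is concerned with picking a near-optimal one for a given kernel invocation.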

Summary for Lay Audience

KLARAPTOR is a tool that improves the performance of GPU programs by dynamically determining the best kernel launch parameters for each kernel invocation. A kernel is a small piece of code that runs on a GPU and performs calculations in parallel, which shortens processing time. The kernel launch parameters, in particular the thread block configuration that determines how the kernel’s parallel threads are grouped, greatly affect the running time of a GPU program, and the best choice depends on factors such as the input data, the hardware resources, and the program parameters. KLARAPTOR addresses this with a two-step approach: (1) at compile time, it determines formulas describing low-level performance metrics for each kernel and inserts them into the host code of the CUDA program; (2) at run time, a helper program evaluates these formulas using the actual data and hardware parameters to determine the thread block configuration that minimizes the kernel’s execution time. Experiments on a set of CUDA benchmark kernels show that KLARAPTOR can accurately predict near-optimal thread block configurations.
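
The two-step approach can be sketched as follows. This is a simplified illustration under stated assumptions, not KLARAPTOR's implementation: `build_estimator` and `choose_launch_parameters` are hypothetical names, the hardware dictionary keys and values are assumed, and the cost formula is a made-up placeholder that stands in for the performance formulas derived at compile time (it is not the MWP-CWP model).

```python
from typing import Callable, Dict, Iterable, Tuple

BlockConfig = Tuple[int, int]
Estimator = Callable[[int, Dict[str, int], BlockConfig], float]


def build_estimator() -> Estimator:
    """Stand-in for step (1): produce a formula for a kernel's predicted cost.

    The formula below is a made-up placeholder, NOT the MWP-CWP model.
    """
    def estimate(n: int, hw: Dict[str, int], block: BlockConfig) -> float:
        threads = block[0] * block[1]
        # Number of blocks needed to cover an n x n problem, one thread per element.
        blocks = (n * n + threads - 1) // threads
        # Blocks that can be resident on the whole device at once (toy rule).
        resident = max(1, hw["max_threads_per_sm"] // threads) * hw["num_sms"]
        # Rounds of resident blocks needed to process all blocks.
        waves = (blocks + resident - 1) // resident
        return float(waves * threads)

    return estimate


def choose_launch_parameters(
    estimate: Estimator,
    n: int,
    hw: Dict[str, int],
    candidates: Iterable[BlockConfig],
) -> BlockConfig:
    """Stand-in for step (2): evaluate the formula with the actual data size
    and hardware parameters, and return the candidate with the lowest cost."""
    return min(candidates, key=lambda block: estimate(n, hw, block))


if __name__ == "__main__":
    hw = {"num_sms": 20, "max_threads_per_sm": 2048}  # assumed values
    candidates = [(32, 1), (32, 4), (16, 16), (32, 32)]
    estimate = build_estimator()
    for n in (64, 1024):
        print(n, choose_launch_parameters(estimate, n, hw, candidates))
```

Separating the formula from its evaluation mirrors the split described above: the expensive analysis happens once, while the per-invocation work reduces to evaluating a formula for each candidate configuration.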