> K-Search is an automated kernel engineering system that uses large language models (GPT-5, Gemini, etc.) to iteratively generate and optimize GPU kernels. Unlike one-shot code generation, K-Search maintains a co-evolving world model: a structured search tree that encodes hypotheses about kernel bottlenecks, design alternatives, and optimization strategies, and that guides an efficient, multi-round, evidence-driven search over the kernel design space.
Abstract: This paper introduces Heliostat, which enhances page translation bandwidth on GPUs by harnessing underutilized ray tracing accelerators (RTAs). While most existing studies have focused on better utilizing the available translation bandwidth, this paper identifies an opportunity to fundamentally increase it. Instead of overprovisioning the GPU memory management unit (GMMU), Heliostat repurposes the existing RTAs by leveraging the operational similarities between ray tracing and page table walks. Unlike earlier studies that applied RTAs to specific workloads, Heliostat democratizes RTAs, benefiting any workload by improving virtual memory performance. Heliostat+ extends Heliostat by proactively handling predicted future address translations. Heliostat outperforms the baseline and two state-of-the-art designs by 1.93×, 1.92×, and 1.66×, respectively, and Heliostat+ further speeds up Heliostat by 1.23×. Compared to a comparably overprovisioned solution, Heliostat occupies only 1.53% of the area and consumes only 5.8% of the power.
> To manage exceptions, software relies on a key architectural guarantee, precision: that exceptions appear to execute between instructions. However, this definition, dating back over 60 years, fundamentally assumes a sequential programmer's model. Modern architectures such as Arm-A, with programmer-observable relaxed behaviour, make such a naive definition inadequate, and it is unclear exactly what guarantees programmers have on exception entry and exit.
> In this paper, we clarify the concepts needed to discuss exceptions in the relaxed-memory setting – a key aspect of precisely specifying the architectural interface between hardware and software. We explore the basic relaxed behaviour across exception boundaries, and the semantics of external aborts, using Arm-A as a representative modern architecture. We identify an important problem that has been present yet unexplored for decades: pinning down what it means for exceptions to be precise in a relaxed setting. We describe key phenomena that any such definition should account for. We develop an axiomatic model of Arm-A precise exceptions, tooling for axiomatic model execution, and a library of tests. Finally, we explore the relaxed semantics of software-generated interrupts, as used in sophisticated programming patterns, and sketch how they too could be modelled.
> Tensor core and memory pipelining: it turns out some tensor core instructions are implicitly pipelined, without proper documentation. Identifying these implicit semantics and the resulting pipelining tactics can boost your throughput by up to 10%.
> Hinting the PTX assembler properly: even logically identical PTX code can compile into meaningfully different SASS instructions, depending on how you write it. Signaling the assembler with the right instruction patterns is key to minimizing latency.
> Occupancy: with all the modern GPU features, it gets tricky, and it is (again) poorly documented. Distributed shared memory doesn’t behave identically across all SMs, and 5th-generation tensor core instructions silently cap occupancy.
> Our experiments show that Intel’s port assignment policies can diverge significantly from the well-documented "least-loaded eligible port" model illustrated in Figure 1. Using carefully crafted two-instruction microbenchmarks preceded by an LFENCE, we consistently observed dynamic scheduling policies. Instead of a fixed distribution across eligible ports, the port assignment changes as the unroll factor increases, producing distinct regions separated by cutoffs. As illustrated in Figure 2 for the “LFENCE; CBW; CBW” snippet, the port scheduler employs three different strategies depending on the number of loop iterations. At low unroll factors, a single port, the sparsest one, is strongly preferred. After a first cutoff, the allocation becomes approximately uniform across all eligible ports, albeit noisy. At a second cutoff, the scheduler shifts again, favoring a different subset of ports; the second cutoff occurs at twice the unroll factor of the first. These dynamics are not isolated: we observed similar cutoff-based transitions across multiple instructions and instruction pairs, and in some cases the behavior also depends on the order of instructions in the block or on the immediate values used in operands. We believe this might constitute a new microarchitectural attack surface that could be harnessed to implement, for example, covert channels or fingerprinting. Importantly, the observed cutoffs are consistent and reproducible across runs, but differ between CPU generations. These findings show that static eligibility sets cannot fully describe port assignment: the allocator follows multiple hidden policies, switching between them in ways not accounted for by existing models.
> We present a calculational approach to the design of type checkers, showing how they can be derived from behavioural specifications using equational reasoning. We focus on languages whose semantics can be expressed as a fold, and show how the calculations can be simplified using fold fusion. This approach enables the compositional derivation of correct-by-construction type checkers based on solving and composing fusion preconditions. We introduce our approach using a simple expression language, to which we then add support for exception handling and checked exceptions.
> This week I gave a lecture series at the School on Logical Frameworks and Proof Systems Interoperability. I spoke about programming language techniques for proof assistants. The lecture slides and the reference implementations of a minimalist type theory are available at:
K-Search: LLM-Driven GPU Kernel Optimization with Co-Evolving Intrinsic World Model, https://github.com/caoshiyi/K-Search