> K-Search is an automated kernel engineering system that uses large language models (GPT-5, Gemini, etc.) to iteratively generate and optimize GPU kernels. Unlike one-shot code generation, K-Search maintains a co-evolving world model: a structured search tree that encodes hypotheses about kernel bottlenecks, design alternatives, and optimization strategies, and that guides an efficient, multi-round, evidence-driven search over the kernel design space.
Abstract: This paper introduces Heliostat, which enhances page translation bandwidth on GPUs by harnessing underutilized ray tracing accelerators (RTAs). While most existing studies have focused on better utilizing the available translation bandwidth, this paper identifies an opportunity to fundamentally increase it. Instead of overprovisioning the GPU memory management unit (GMMU), Heliostat repurposes the existing RTAs by leveraging the operational similarities between ray tracing and page table walks. Unlike earlier studies that applied RTAs to specific workloads, Heliostat democratizes RTAs, benefiting any workload by improving virtual memory performance. Heliostat+ extends Heliostat by proactively handling predicted future address translations. Heliostat outperforms the baseline and two state-of-the-art designs by 1.93×, 1.92×, and 1.66×, respectively, and Heliostat+ further speeds up Heliostat by 1.23×. Compared to a comparably overprovisioned solution, Heliostat occupies only 1.53% of the area and consumes only 5.8% of the power.
> To manage exceptions, software relies on a key architectural guarantee, precision: that exceptions appear to execute between instructions. However, this definition, dating back over 60 years, fundamentally assumes a sequential programmer's model. Modern architectures such as Arm-A, with programmer-observable relaxed behaviour, make such a naive definition inadequate, and it is unclear exactly what guarantees programmers have on exception entry and exit.
> In this paper, we clarify the concepts needed to discuss exceptions in the relaxed-memory setting – a key aspect of precisely specifying the architectural interface between hardware and software. We explore the basic relaxed behaviour across exception boundaries, and the semantics of external aborts, using Arm-A as a representative modern architecture. We identify an important problem that has been present yet unexplored for decades: pinning down what it means for exceptions to be precise in a relaxed setting. We describe key phenomena that any such definition should account for. We develop an axiomatic model of Arm-A precise exceptions, tooling for axiomatic model execution, and a library of tests. Finally, we explore the relaxed semantics of software-generated interrupts, as used in sophisticated programming patterns, and sketch how they too could be modelled.
> Tensor core and memory pipelining: it turns out some tensor core instructions are implicitly pipelined, without proper documentation. Identifying these implicit semantics and the resulting pipelining tactics can boost your throughput by up to 10%.
> Hinting the PTX assembler properly: even logically identical PTX code can compile into meaningfully different SASS instructions, depending on how you write it. Signaling the assembler with the right instruction patterns is key to minimizing latency.
> Occupancy: with all the modern GPU features, it gets tricky, and it is (again) poorly documented. Distributed shared memory doesn’t behave identically across all SMs, and 5th-generation tensor core instructions silently cap occupancy.
> Our experiments show that Intel’s port assignment policies can diverge significantly from the well-documented "least-loaded eligible port" model illustrated in Figure 1. Using carefully crafted two-instruction microbenchmarks preceded by an LFENCE, we consistently observed dynamic scheduling policies. Instead of a fixed distribution across eligible ports, the port assignment changes as the unroll factor increases, producing distinct regions separated by cutoffs. As illustrated in Figure 2 for the “LFENCE; CBW; CBW” snippet, the port scheduler employs three different strategies depending on the number of loop iterations. At low unroll factors, a single port, the sparsest one, is strongly preferred. After a first cutoff, the allocation becomes approximately uniform across all eligible ports, albeit noisy. At a second cutoff, the scheduler shifts again, favoring a different subset of ports; the second cutoff occurs at twice the unroll factor of the first. These dynamics are not isolated: we observed similar cutoff-based transitions across multiple instructions and instruction pairs, and in some cases the behavior also depends on the order of instructions in the block or on the immediate values used in operands. We believe this might constitute a new microarchitectural attack surface that could be harnessed to implement, for example, covert channels or fingerprinting. Importantly, the observed cutoffs are consistent and reproducible across runs, but differ between CPU generations. These findings show that static eligibility sets cannot fully describe port assignment: the allocator follows multiple hidden policies, switching between them in ways not accounted for by existing models.
> We present a calculational approach to the design of type checkers, showing how they can be derived from behavioural specifications using equational reasoning. We focus on languages whose semantics can be expressed as a fold, and show how the calculations can be simplified using fold fusion. This approach enables the compositional derivation of correct-by-construction type checkers based on solving and composing fusion preconditions. We introduce our approach using a simple expression language, to which we then add support for exception handling and checked exceptions.
> This week I gave a lecture series at the School on Logical Frameworks and Proof Systems Interoperability. I spoke about programming language techniques for proof assistants. The lecture slides and the reference implementations of a minimalist type theory are available at:
K-Search: LLM-Driven GPU Kernel Optimization with Co-Evolving Intrinsic World Model, https://github.com/caoshiyi/K-Search