More

frabonacci · 2026-04-20T23:29:54 1776727794

what's even more alarming is how exploitable GitHub Trending itself is these days. you can get the star count to fork ratio right and you land on the front page, which then pulls in real organic stars

frabonacci · 2026-04-20T23:22:25 1776727345

it reminds me of https://github.com/dockur/windows with its compose-style YAML over QEMU/KVM. The difference i'm seeing is scope: dockur ships curated OS images (Windows/macOS), while holos looks more like a generic single-host VM runner. Is that a fair read? also curious any plans to support running unattended processes for OS installs?

zeroecco · 2026-04-20T23:48:52 1776728932

True and decent read. Holos is a generic runner underneath, and I'm not trying to compete on the curated-OS-image front; the assumption is you bring a cloud image and cloud-init does the rest. Unattended installs for Linux work today through cloud-init (that's most of what the YAML drives). Windows/macOS would need a different path.. autounattend.xml injection for Windows, custom payload for macOS; I haven't built that. If there's interest, a provisioner: type block that picks the right unattend mechanism based on OS isn't crazy, but it's not on the near-term list.

frabonacci · 2026-01-28T20:06:22 1769630782

Thanks - trajectory export was key for us since most teams want both eval and training data.

On non-determinism: we actually handle this in two ways. For our simulated environments (HTML/JS apps like the Slack/CRM clones), we control the full render state so there's no variance from animations or loading states. For native OS environments, we use explicit state verification before scoring - the reward function waits for expected elements rather than racing against UI timing. Still not perfect, but it filters out most flaky failures.

Windows Arena specifically - we're focusing on common productivity flows (file management, browser tasks, Office workflows) rather than the edge cases you mentioned. UAC prompts and driver dialogs are exactly the hard mode scenarios that break most agents today. We're not claiming to solve those yet, but that's part of why we're open-sourcing this - want to build out more adversarial tasks with the community.

frabonacci · 2026-01-28T20:01:11 1769630471

Fair point - we just open-sourced this so benchmark results are coming. We're already working with labs on evals, focusing on tasks that are more realistic than OSWorld/Windows Agent Arena and curated with actual workers. If you want to run your agent on it we'd love to include your results.

frabonacci · 2026-01-28T16:39:07 1769618347

Hey visarga - I'm the founder of Cua, we might have met at the CUA ICML workshop? The OS-agnostic VNC approach of your benchmark is smart and would make integration easy. We're open to collaborating - want to shoot me an email at f@trycua.com?

frabonacci · 2026-01-27T00:49:44 1769474984

The author's differential testing (2.3M random battles) is great as final validation, but the real lesson here is that modular testing should happen during the port, not after.

1. Port tests first - they become your contract 2. Run unit tests per module before moving on - catches issues like the "two different move structures" early 3. Integration tests at boundaries before proceeding 4. E2e/differential testing as final validation

When you can't read the target language, your test suite is your only reliable feedback. The debugging time spent on integration issues would've been caught earlier with progressive testing.

cryptonector · 2026-01-27T04:58:24 1769489904

The real lesson... I mean, if all of this took 1 month, the TFA already did amazingly well. Next time they'll do even better, no doubt.

frabonacci · 2026-01-20T01:10:37 1768871437

Thanks! On API call visibility - Lume's MCP interface doesn't expose outbound network traffic directly. It's focused on VM lifecycle (create, run, stop) and command execution, not network inspection.

For agent observability, we handle this at the Cua framework level rather than the VM level:

- Agent actions and tool calls are logged via our tracing integration (Laminar, OpenTelemetry) - You can see the full decision trace - what the agent saw, what it decided, what tools it invoked - For the "what HTTP requests actually went out" question, proxying is still the right approach. You could configure the VM's network to route through a transparent proxy, or set up mitmproxy inside the VM. We haven't built that into Lume itself since network inspection feels orthogonal to VM management.

That said, it's an interesting idea - exposing a proxy config option in Lume that automatically routes VM traffic through a capture layer. Would that be useful for your workflow?

frabonacci · 2026-01-19T07:23:26 1768807406

MDM platforms can skip Setup Assistant, but they require the device to be pre-enrolled in Apple Business Manager before first boot - VMs can't be enrolled in ABM, so those hooks aren't available.

defaults write only works after you have shell access, which means Setup Assistant is already done.

There are tools that modify marker files like .AppleSetupDone via Recovery Mode, but that's mainly for bypassing MDM enrollment on physical Macs - you'd still need to create a valid user account with proper Directory Services entries, keychain, etc.

The VNC + OCR approach is less elegant but works reliably without needing to reverse-engineer macOS internals or rely on undocumented behaviors that might break between versions.

saagarjha · 2026-01-19T10:42:22 1768819342

Surely your VNC script is guaranteed to break between versions

frabonacci · 2026-01-19T04:24:04 1768796644

Thanks! On graphics - currently it's paravirtualized via Apple's Virtualization Framework, so basic 2D acceleration but no GPU passthrough. Fine for desktop use, web browsing, coding, productivity apps. Wouldn't recommend it for anything GPU-intensive though.

Good news is there are hints of GPU passthrough coming (_VZPCIDeviceConfiguration symbol appeared in Tahoe's Virtualization framework), so that might land in a future macOS release. We're keeping an eye on it.

frabonacci · 2026-01-19T04:22:16 1768796536

Nice, thanks for sharing! It'd be interesting to integrate MIST into lume's ipsw command - right now Apple's native features in Apple Vz only provides download links for the latest supported version of the host, so grabbing older versions requires workarounds like this.