Hacker News

Though you certainly know more than I do about the subject, my understanding is that differently privileged environments can enqueue messages to each other without pipeline flushes, and general forms of that mechanism have performed better than subroutine calls on high-end systems since the early 01990s: Thinking Machines, MasPar, Tera, even RCU on modern amd64.

And specialized versions of this principle predate computers: a walkie-talkie has the privilege to listen to sounds in its environment, a privilege it only exercises when its talk button is pressed and which it does not delegate to other walkie-talkies, and the communication latency between two such walkie-talkies may be tens of nanoseconds, though audio communication doesn't really benefit from such short latencies. The latency across a SATA link is submicrosecond, which is useful, and neither end trusts the other.



oh totally, but then you aren't "making a procedure call"; you're doing something different.

In this case your data is likely traversing the memory hierarchy far enough that the message data gets shared (more likely the sent data lands in the sending CPU's data cache and the receiver uses the cache-coherency protocol to pull it from there); that's likely to take on the order of a pipeline flush to happen.

You could also have bespoke pipe-like hardware, but that's going to be a fixed resource that will require management, flow control, etc. if it's going to be a general facility.


Agreed, but even in the cache-line-stealing case, those are latency costs, while a pipeline flush is also a throughput cost, no? Unless one of the CPUs has to wait for the cache line ownership to be transferred.


well if you're making a synchronous call you have to wait for the response, which is likely as bad as a pipeline flush (or worse, because you likely flood the pipeline with a tight loop polling for the response, or take a context switch to do something else while you wait)

Also note that stealing a cache line can be very expensive: if the CPUs are SMT siblings it's in the same L1, almost zero cost; if they are on the same die it will be a few (4-5?) clocks across the L2/cache-coherency fabric; but if they are on separate chiplets connected via a memory controller with L3/L4 in it, then it's 4 chip-boundary crossings, one or two orders of magnitude more in cost.


All that makes sense to me. So for high performance, collaboration across security boundaries needs to be either very rare or nonblocking?

Multithreading within a security boundary is one way to "synchronously wait" without incurring a giant context-switch cost (SMT or Tera-style or Padauk FPPA-style; do GPUs do this too, at larger-than-warp granularity?). Event loops are a variant on this, and io_uring seems to think that's the future. But the GreenArrays approach is to decide that the limiting resource is nanojoules dissipated, not transistors, so just idle some transistors in a synchronous wait. Not sure if that'll ever go mainstream, but it'd fit well with the trend to greater heterogeneity.



