Hacker News

Though you certainly know more than I do about the subject, my understanding is that differently privileged environments can enqueue messages to each other without pipeline flushes, and general forms of that mechanism have performed better than subroutine calls on high-end systems since the early 01990s: Thinking Machines, MasPar, Tera, even RCU on modern amd64.

And specialized versions of this principle predate computers: a walkie-talkie has the privilege to listen to sounds in its environment, a privilege it only exercises when its talk button is pressed and which it does not delegate to other walkie-talkies, and the communication latency between two such walkie-talkies may be tens of nanoseconds, though audio communication doesn't really benefit from such short latencies. The latency across a SATA link is submicrosecond, which is useful, and neither end trusts the other.



oh totally, but then you aren't "making a procedure call"; you're doing something different.

In this case your data is likely traversing the memory hierarchy far enough that the message data gets shared (more likely the sent data lands in the sending CPU's data cache and the receiver uses the cache-coherency protocol to pull it from there); that's likely to take on the order of a pipeline flush to happen.

You could also have bespoke pipe-like hardware, but that's going to be a fixed resource that will require management, flow control, etc. if it's going to be a general facility.


Agreed, but even in the cache-line-stealing case, those are latency costs, while a pipeline flush is also a throughput cost, no? Unless one of the CPUs has to wait for the cache line ownership to be transferred.


well if you're making a synchronous call you have to wait for the response, which is likely as bad as a pipeline flush (or worse, because you likely flood the pipeline with a tight loop polling for the response, or take a context switch to do something else while you wait)

Also note that stealing a cache line can be very expensive: if the CPUs are SMT siblings it's in the same L1, almost zero cost; if they are on the same die it will be a few (4-5?) clocks across the L2/cache-coherency fabric; but if they are on separate chiplets connected via a memory controller with L3/L4 in it, then it's 4 chip-boundary crossings, one or two orders of magnitude more in cost.


All that makes sense to me. So for high performance, collaboration across security boundaries needs to be either very rare or nonblocking?

Multithreading within a security boundary is one way to "synchronously wait" without incurring a giant context-switch cost (SMT or Tera-style or Padauk FPPA-style; do GPUs do this too, at larger-than-warp granularity?). Event loops are a variant on this, and io_uring seems to think that's the future. But the GreenArrays approach is to decide that the limiting resource is nanojoules dissipated, not transistors, so just idle some transistors in a synchronous wait. Not sure if that'll ever go mainstream, but it'd fit well with the trend to greater heterogeneity.



