
Generating C/C++ bindings in Rust is unfortunately still quite a hassle in my experience. C libraries often define important functions as static inline in header files, and those don't get translated automatically by bindgen. If the library uses the macro system to rename functions, those names won't show up as symbols in your library either, so you have to define them manually. On top of that, bindgen often pulls in far too many dependencies, and I end up manually allowlisting/blocklisting the symbols I want (and tracking those lists across multiple versions). That said, it's a hard problem, and Rust still has some of the best FFI tooling I've seen compared to other languages.
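As a rough sketch of the allowlisting workflow described above (this assumes the external bindgen crate; `libfoo`, the header path, and the regex patterns are all hypothetical):

```rust
// build.rs sketch — not a definitive recipe; API names as in recent
// bindgen versions (older releases spelled these whitelist_*/blacklist_*).
fn main() {
    let bindings = bindgen::Builder::default()
        .header("vendor/include/wrapper.h")
        // Only generate what we actually use, instead of everything
        // the transitive #includes drag in:
        .allowlist_function("foo_.*")
        .allowlist_type("foo_.*")
        .blocklist_item("FOO_INTERNAL_.*")
        .generate()
        .expect("bindgen failed");

    bindings
        .write_to_file(
            std::path::Path::new(&std::env::var("OUT_DIR").unwrap())
                .join("bindings.rs"),
        )
        .expect("couldn't write bindings");
}
```

For the static inline problem, the usual workaround is a small shim .c file that wraps each inline function in a real exported symbol (compiled via the cc crate); I believe recent bindgen also has an experimental wrap_static_fns option for this.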


>This suggests a long-term compromise solution where threads within a process can use hyperthreading to share a core, but threads in different processes can't. Given that hyperthreads share L1 cache, this might also be better for performance.

Intuitively this may sound logical, but in practice it's often not the case. For many workloads, putting two threads of the same program on a core ends up being worse than co-locating them with threads from different programs. The reason is that two threads of the same program will often end up executing similar instruction streams; a really good example is when both are using vector instructions (these registers are shared between the two hyperthreads).


In practice it sometimes is the case, though.

SMT/hyperthreading is complicated. If you have a workload dominated by non-local DRAM fetches, it's a huge win because when the CPU pipeline is stalled on one thread it can still issue instructions from the other.

If you have a workload dominated by L1 cache bandwidth, the opposite is true because the threads compete for the same resource.

On balance, on typical workloads, it's a win. But there are real-world problems for which turning it off is a legitimate performance choice.


> workload dominated by non-local DRAM fetches,

How often is that a polite way of saying "software that is inefficient"?


Often. But software is what software does, and a CPU that only worked well on "efficient" code will always fail when compared with one that runs the junk faster than the competition.

Also, to be fair: sometimes a DRAM fetch is just inherent in the problem. Big RAM-resident databases are all about DRAM latency because while sure, it's a lot slower than L1 cache, it's still faster than flash. I mean, memcached is a giant monument in praise of the pipeline stall, and it's hugely successful.


> But software is what software does, and a CPU that only

Indeed. It is arguably rational for Intel to take on the burden in a centralised place rather than expecting every two-bit software shop to do it.

But then the existence of this kind of security issue shows that the added complexity is not always worthwhile. We might be forced to accept that computers which actually behave well are a little bit slower than we thought. But in return they will be simpler and more amenable to software optimisation.


I'm not sure there is a correlation. I can think of many situations in which non-local DRAM fetches are more efficient and I can think of many other situations where the opposite is true.

Trees or hashmaps which use non-local DRAM fetches can be more efficient than a brute-force linear search through a contiguous array, given a sufficiently large number of elements.

At the same time, contiguous arrays can be significantly more efficient than linked lists, which use non-local DRAM fetches.
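A toy Rust sketch of the two access patterns, using the standard collections (the crossover point between them is machine-dependent and not shown here; the sizes are illustrative):

```rust
use std::collections::HashSet;

// Linear search through a contiguous array: great cache behavior
// and prefetcher-friendly, but O(n) comparisons.
fn contains_linear(haystack: &[u64], needle: u64) -> bool {
    haystack.iter().any(|&x| x == needle)
}

// Hash lookup: O(1) expected comparisons, but each probe is
// effectively a non-local fetch into the table's buckets.
fn contains_hashed(set: &HashSet<u64>, needle: u64) -> bool {
    set.contains(&needle)
}

fn main() {
    let data: Vec<u64> = (0..10_000).collect();
    let set: HashSet<u64> = data.iter().copied().collect();
    assert!(contains_linear(&data, 9_999));
    assert!(contains_hashed(&set, 9_999));
    assert!(!contains_linear(&data, 10_001));
    assert!(!contains_hashed(&set, 10_001));
    // For small n the linear scan often wins in wall time despite
    // doing more work; for large n the hash lookup wins.
}
```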


As for most software: well, most software is pretty inefficient and profits from HT. There are many reasons for that; one is writing in an interpreted language because it's faster to develop in and the software doesn't need to be very efficient in the first place. (Not to say that all Python/JS/etc. is inefficient, just that software which needs to be efficient is precisely the kind where one would consider an unmanaged language.) Additionally, things like web servers or databases often simply don't know which piece of data they will need next, because they don't know the next query, and so tend to profit from HT, even though such software is hardly known for being inefficient.


FWIW, you mention databases, but even some database workloads can have better performance with HT turned off. I first learned this from a DBA at a former job when I was curious as to why they turned HT off. A member of the SQL Server team back in 2005 ran some experiments and found that you can get a 10% performance improvement in some workloads with HT off [1]. I don't know how much of that is still true today, however, as nearly all of my recent experience is PaaS in the cloud.

[1]: https://blogs.msdn.microsoft.com/slavao/2005/11/12/be-aware-...


> How often is that a polite way of saying "software that is inefficient"?

One could also say "software written with strong OOP patterns" because those are almost always written to benefit the developer later, rather than the CPU and RAM at runtime.


There are plenty of problems with poor mechanical sympathy.

To take an extreme example, traversing graphs is notorious. IIRC Cray and Sun have some fascinating processors with many, many hyperthreads, because all the programs do is wait on DRAM; luckily there are lots of searches that can be done in parallel.


Typical workloads? What's that? People run hugely diverse workloads on cpus, and they change over time.


Building software, serving web pages, executing database queries, running a DOM layout, managing game logic... I mean, come on. You knew what I meant. Those are all tasks with "medium" cache residency and "occasional" stalls on DRAM. Anything that does a bunch of different things with a big-ish world of data.

Conversely: finding a task that is L1-cache-bound but does not frequently have to stall for memory is much harder. The only ones off the top of my head are streaming tasks like software video decode.


Oh, you meant typical for you.

One task that is L1 cache bound and does not frequently stall for memory (if you code it up well) is matrix multiply.
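A hedged sketch of that case: with the classic i-k-j loop order, the inner loop streams through contiguous rows, so (especially once tiled) the working set stays hot in L1 and the core is bound on cache bandwidth rather than DRAM latency. Sizes here are illustrative:

```rust
// Multiply two n×n matrices stored row-major in flat slices.
// The i-k-j order keeps the inner loop walking contiguous rows of
// `b` and `c`, which is what keeps the workload cache-resident.
fn matmul(a: &[f64], b: &[f64], n: usize) -> Vec<f64> {
    let mut c = vec![0.0; n * n];
    for i in 0..n {
        for k in 0..n {
            let aik = a[i * n + k];
            for j in 0..n {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
    c
}

fn main() {
    // 2×2 check: [[1,2],[3,4]] * [[5,6],[7,8]] = [[19,22],[43,50]]
    let a = vec![1.0, 2.0, 3.0, 4.0];
    let b = vec![5.0, 6.0, 7.0, 8.0];
    assert_eq!(matmul(&a, &b, 2), vec![19.0, 22.0, 43.0, 50.0]);
}
```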


> Oh, you meant typical for you.

I'm pretty sure those are meant to be, and I think are, "typical" for the general purpose CPU in use, and thus the general case.

Both mobile and desktop CPUs will be doing DOM layout, DB queries (whether to SQLite or the registry or just the filesystem), and possibly computing game logic on a regular basis.


It's becoming popular to push machine learning tasks onto edge devices like mobile and desktop CPUs, for example apps that ship with on-device models. Some of these machine learning algorithms do a lot of matrix multiplies.

"Typical" is highly varied, and it changes.

Edit: here's an example: Google brings on-device machine learning to mobile with TensorFlow Lite

https://thenextweb.com/artificial-intelligence/2017/11/15/go...


Would they be using mostly CPU for that, or would they offload it to the GPU or a dedicated chip? I would assume you would use your general purpose CPU only if all else wasn't available (and generally there's a GPU available on most end user devices these days).


If possible the GPU, but not all GPUs have either a library or enough documentation to write one. I’ve seen complaints about this issue on mobile GPUs for years, no idea how widespread it is now.

BTW, this is just one example algorithm that I picked because it does (on the cpu) what the person I replied to said was rare.


Running the model is much easier than training it. On power-constrained environments, DSPs can do it.


I don't know whether it's still true, but a couple of years ago a majority of the world's CPU cycles were spent sorting things.


That's an interesting claim. Do you remember the source?


> The reason is that two threads of the same program will often end up executing similar instruction streams

Why is that bad?


Your processor has a certain number of execution units which can actually execute individual instructions: maybe 4 floating point units, maybe 8 arithmetic ones, and maybe 1 that can do vector processing (these numbers are not real, but are good enough for the sake of the message).

So the idea with SMT is that most of the time, lots of the execution units are unused, because the thread a) isn't using them at all (e.g. a process doing encryption won't use the floating point units) and/or b) can't use them all because of how the program's written (for example, if I say 'load a random memory address, then add it to a register, then load another random memory address, then add it, etc.', I'm going to be spending most of my time waiting for memory to be loaded).

SMT basically means that you run another program at the same time, so even if the encryption process can't use the floating point units, maybe there's another process that we can schedule that will.

However, imagine my encryption process can use 6 of the 8 arithmetic units. If I have 2 encryption processes scheduled on the same core, I have demand for 12 when there are only 8. So now I have contention for resources, and I won't see a speedup from using SMT.

Other comments mention registers rather than execution units: I'm suspicious of this, since modern processors have many physical registers (250+ on Skylake) which they remap aggressively as part of pipelining. Maybe this is different for the SIMD units.

That said, I haven't looked at this stuff since university so could well be wrong on the execution unit vs register comparison.
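The latency-bound pattern in (b) can be sketched as a pointer chase, where each load's address depends on the previous load so out-of-order execution can't hide the stalls. This toy uses a tiny cycle; a real benchmark would use a large shuffled permutation to defeat the prefetcher:

```rust
// Walk a linked structure encoded as an index array: load i,
// then load next[i], then next[next[i]], ... — a serial chain
// of dependent loads, so the pipeline stalls on every miss.
fn chase(next: &[usize], start: usize, steps: usize) -> usize {
    let mut i = start;
    for _ in 0..steps {
        i = next[i]; // the address of the next load depends on this one
    }
    i
}

fn main() {
    // A cycle 0 -> 1 -> 2 -> ... -> n-1 -> 0.
    let n = 8;
    let next: Vec<usize> = (0..n).map(|i| (i + 1) % n).collect();
    assert_eq!(chase(&next, 0, n), 0); // a full lap returns home
    assert_eq!(chase(&next, 0, 3), 3);
}
```

On a core with SMT, a second thread can issue its own work during those stalls, which is exactly the case where hyperthreading helps.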


The contention would actually not be on a per-EU level, but one level higher up. The reservation station has a bunch (~5-8) of ports and typically multiple EUs are connected to one port. Can't use one port for two different things at the same time.

Here's a simplified block diagram of a Skylake core: https://en.wikichip.org/wiki/intel/microarchitectures/skylak...


Thanks! Yeah figured I'd be wrong somewhere in there!


There's also some funkiness around the CPU cache, at least at one stage. If your two HT threads are working on the same data, there's a chance you'll get some great cache performance out of it. However, when one hyperthread hits a cache miss, it can evict the other thread's data to make room for what it needs. Under those circumstances performance takes quite a nosedive, as both threads stomp over each other somewhat.

Hyperthreading can be a real mixed bag for performance, though it's generally good, and a lot of engineering effort has gone into making it shine. As ever, it's strongly advisable to benchmark real-world conditions on a server, and worth trying with hyperthreading both on and off.


You are pretty much right. A couple of things to add: even hand-optimized asm code won't be able to use all ports all the time with a single instruction stream; the biggest win for hyperthreading is filling the pipeline bubbles caused by memory loads that miss L1 (there is only so much that OoO scheduling can do on your average load).


So on a RISC architecture, this would happen even for non-similar programs, because the number of instructions is smaller? Or would they just duplicate the processing units?


The units aren't dedicated to esoteric instructions, you have functions like "small alu", "big alu", "fp adder", "address generator and memory load".

RISC will perform about the same and you can hyperthread one fine.


Different instruction streams use different registers. It’s like sharing a bathroom. I can shower while you brush your teeth. There’s more contention when we’re both trying to shower.


"both are using vector instructions (these registers are shared between the two hyperthreads)"

So I'm guessing GP meant there's going to be contention for those registers, and thus no speedup?


Taking a guess, but since they are running similar streams, they have similar loads at any given time. Competition between the main thread and hyperthread could then hurt performance instead?

