Mostly I agree, but there is one factor that makes C++ (or any natively compiled language) better than Java for HFT: the ability to lie to the optimizer about what the hot path is. In HFT you have thousands of no-trades for every trade, so the Java optimizer will treat the no-trade code path as the likely one; then, when a trade does happen, Java pays a branch-misprediction penalty at the only moment low latency matters.
You have to have good algorithms optimized to the max for this to matter though.
If you have a low-latency trading component written in Java, a common trick is to continuously bombard it with ‘fake’ inputs to keep the desired code paths nice and hot.
The fake inputs should be virtually indistinguishable from real ones that you would normally act on. The more subtle the difference, the better, e.g., just flip the sign on the timestamp field.
You can use that subtle difference to pick whether the order goes out to the real exchange or a fake exchange. The decision needs to avoid actual branching instructions, though, or the JVM will likely optimize out the ‘real’ hot path, and you’ll fall back to interpreted mode when an actionable ‘real’ event comes in. I usually use a branchless selector to index into an array ([0] goes to a real socket, [1] goes to a black hole socket).
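A minimal sketch of that branchless selector, assuming the flipped-sign-timestamp convention from above (the class and interface names here are made up for illustration, not from a real system):

```java
// Hypothetical sketch: route orders without a branch so the JIT sees a
// single always-taken path through route(), whether the event is real or fake.
final class OrderRouter {
    interface Destination { void send(byte[] order); }

    private final Destination[] destinations;

    OrderRouter(Destination real, Destination blackHole) {
        // [0] -> real exchange socket, [1] -> black hole socket
        this.destinations = new Destination[] { real, blackHole };
    }

    void route(long timestamp, byte[] order) {
        // Fake events carry a negated timestamp, so the sign bit is the
        // selector: (timestamp >>> 63) is 0 for real events, 1 for fakes.
        // An array index instead of an if-statement keeps this branchless.
        int idx = (int) (timestamp >>> 63);
        destinations[idx].send(order);
    }
}
```

The point of the array indirection is that both real and fake orders execute byte-for-byte identical code right up to the socket write, so the JVM cannot profile the real path as cold.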
You can also use this technique to make sure you can respond to very rare events quickly. For example, you may want to respond to news signals from Bloomberg. Actionable news is rare, so if you want to keep your news parsing/analysis code warmed up and in the cache, it needs to constantly be reacting to warmup data.
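As a rough sketch of that warmup idea (all names hypothetical): the synthetic messages must flow through exactly the same handler as real ones, so the JIT keeps it compiled and its working set stays in cache.

```java
import java.util.function.Consumer;

// Hypothetical warmup driver: pushes synthetic news messages through the
// same handler that real feed messages use, keeping that code path hot.
final class NewsWarmer {
    private long pumped = 0;

    // In production this would run continuously between real events;
    // here it just pushes n synthetic messages through the handler.
    void pump(Consumer<String> newsHandler, int n) {
        for (int i = 0; i < n; i++) {
            // Payload shaped like a real feed message; any downstream
            // reaction is routed to a sink by the branchless selector.
            newsHandler.accept("WARMUP|seq=" + i + "|headline=earnings beat");
            pumped++;
        }
    }

    long pumpedCount() { return pumped; }
}
```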
Thankfully, no (knocks on wood). But you have to base your design around the idea that real and fake trades are indistinguishable until the last possible moment. That critical requirement needs to always be in your mind.
I would never try to bolt those kinds of optimizations onto an existing system. It’d be too easy to miss something.
I was with you until you mentioned branch prediction... isn't branch prediction a hardware feature? How do you trick the HW branch predictor into predicting the unlikely case?
The CPU still needs to load code via instruction-cache line fetches, and while a fetch is outstanding that core isn't doing much.
The compiler alleviates this somewhat by putting the hot path right after the branch instruction, so that the fetch that grabs the branch also grabs the start of the hot path as part of the same cache line.
It sounds minimal, but if that line has been evicted from the L2 cache after a long period of inactivity, the fetch can take upwards of 100 ns, which starts to add up in HFT.
Yes, it is a hardware feature. However, the hardware can be given hints as to which branch is more likely. This is generally documented by the manufacturer in one of those technical documents aimed at compiler writers.
With profile-guided optimization it is possible for the compiler to have much better information about branches than the CPU can guess. Java effectively applies profile-guided optimization in real time through the JIT; with C++ it is much more complex to apply, since it requires a separate instrumented training run (e.g. gcc's -fprofile-generate followed by -fprofile-use).
> However the hardware can be given hints as to which branch is more likely.
I don't think that is the case for modern (last 8 years or so) Intel processors. For example, I'm under the impression that gcc's __builtin_expect only affects the layout of the generated code. However I'd love to learn something new here; do you have a source or any additional info you could share?
The hints are purely for the compiler. When branch probabilities are available (via heuristics, annotations, or profile data), it will optimize hot paths differently from cold paths. For example, it might be more aggressive with inlining on hot paths, or conversely optimize cold paths for size. It will also attempt to put cold code in separate pages so that it doesn't pollute the cache. And non-taken branches are marginally "faster" than taken ones, so when possible it is worth putting the hot code on the non-taken (fall-through) side of the branch.
I'm not a compiler writer, I'm sure there is more.
To me, that section says that you can emit machine code that will allow the branch predictor to do a better job, not that you can control what the branch predictor does.
My understanding is that the layout is the hint. Modern CPUs would not want to spend cache space on explicit hints, not to mention all the silicon to decode them.
Compiler writers are smart people and can lay out code to give the CPU the right hints, as long as they know what the CPU does. CPU manufacturers want compiler writers to do this, since the compiler is a significant factor in making one CPU faster than the competition in benchmarks.
It seems to me that it's misleading to call that a hint. Rather, certain code paths create hard-to-predict branches, and others create predictable branches. A hint would be some kind of metadata that says "you'll want to predict this branch this way based on your algorithm, but please don't!"
I think you're discussing two different phases of optimization. PGO and Java's JIT use branch information to emit different machine code. Hardware branch prediction takes machine code and determines which branches in the machine code are taken, and speculates based on that information. There's an underlying pattern that both follow, but they're very different.
PGO and JIT both use their information to change the machine code to suggest which branch is more likely.
PGO and JIT also do a lot of other things that are unrelated to this discussion. Some of those things can have a much larger gain than branch prediction.
Unless you write HFT code, or follow talks by those who do, you probably wouldn't. When you write HFT code you have to look at profiles and think about cache misses until branch misses become something to consider. If you don't come up with that idea, someone else will, and they will beat you to every trade and put you out of business.
So, if I want to manipulate the market I need to work out (or poison) the hot path and have a system that's tailored to being faster on profitable - if colder - paths?