
c7a servers have dedicated CPUs. Personally I like seeing benchmarks on AWS instances because anyone can reproduce them without needing to buy the same hardware. The virtualization overhead is basically nil; what’s not to like?


That gp3 volume is extremely slow compared to a $100 NVMe drive. If each txn does a heap update, an index update, a WAL write, and a heap read, that's 4 IOs per txn right there (well, not for sequential IDs, since you don't need to flush the heap/index on every update). The volume tops out at 16k IOPS, so that 2600-3400 txn/s is fairly close to its ceiling, assuming multiple IOs per txn. It's a little hard to find hard numbers, but gp3 latency is approximately 1 ms? That's going to limit you on WAL writes, since they're synchronous. An NVMe drive that does, say, 20k read and 50k write IOPS at qd1 has 50 us read / 20 us write latency, and a database is more of a qd32 workload, where the same drive does hundreds of thousands to millions of IOPS.
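To make that arithmetic explicit, here's the back-of-the-envelope version (these are the estimates above, not measurements):

    # Rough throughput ceilings implied by the numbers above (all estimates)
    GP3_MAX_IOPS = 16_000    # gp3 volume cap
    GP3_LATENCY_S = 0.001    # ~1 ms per IO (rough guess)
    IOS_PER_TXN = 4          # heap update + index update + WAL write + heap read

    print(GP3_MAX_IOPS / IOS_PER_TXN)   # 4000 txn/s ceiling from IOPS alone
    print(1 / GP3_LATENCY_S)            # 1000 commits/s per connection, since each
                                        # synchronous WAL flush waits ~1 ms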

It's a single core, so there's no parallelism in the db itself. And with a fraction of the RAM my phone has, the slow IO is even more pronounced.

The basic implications of different keys and the detailed look at the cache internals are valid and interesting, but the hardware is nothing like a server you'd want to run a database on, so the benchmark numbers aren't very interesting. An iPhone is probably beefier in every way.


agree - network storage is slower than local NVMe - the choice was intentional for two reasons:

1) the percentages would be different, but the basic implications should hold true even with NVMe and 96 cores, as long as we scaled up the data size and workload

2) in addition to making it a bit easier to demonstrate what we'd expect to see (there's not really anything surprising here), i chose this setup because it's cheap, so anyone else can replicate the exact results or play around with the scripts and try variations without having to spend much money. for example, someone on twitter was curious about uuidv7 in a text field - that would be easy & cheap to try out and see what happens - you could also easily go to bigger hardware and local NVMe, changing client and row counts
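for anyone who wants to try that text-field variation, here's a minimal uuidv7 generator - a hand-rolled sketch of the RFC 9562 bit layout, not code from the benchmark scripts - insert its output into a uuid column vs a text column and compare table/index sizes:

    import os, time, uuid

    def uuid7():
        # 48-bit unix-ms timestamp, then 4 version bits, 12+62 random bits, 2 variant bits
        ms = time.time_ns() // 1_000_000
        value = (ms << 80) | int.from_bytes(os.urandom(10), "big")
        value = (value & ~(0xF << 76)) | (0x7 << 76)   # version 7
        value = (value & ~(0x3 << 62)) | (0x2 << 62)   # RFC 4122 variant
        return uuid.UUID(int=value)

    u = uuid7()
    print(u)         # a uuid column stores this in 16 bytes
    print(str(u))    # a text column stores 36 chars plus header, with slower comparisons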

20 years ago when i wanted to benchmark oracle RAC, i had to go out and buy dual-attach firewire drives, and that was a hack because who wants to spend their personal vacation money on an old EMC CLARiiON storage array from eBay [i might have personally bought an old sun server or two though!]

size results should be independent of hardware setup, but the perf results are specific to this setup, which is why the post includes detailed specs and scripts for transparency

also, FWIW, most production use cases for databases these days include some kind of high availability, which means network involvement in the persistence path - so even when the database is on local NVMe, it's not uncommon to have a hot standby or patroni or something with sync replication
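for reference, the usual way to put a standby in the synchronous persistence path in postgres is a pair of settings like these (illustrative values - 'sb1' is a made-up standby name):

    # postgresql.conf on the primary (illustrative)
    synchronous_commit = on                     # commit waits for the standby to flush
    synchronous_standby_names = 'ANY 1 (sb1)'   # quorum of 1 from the listed standbys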


Yeah, the high-level information is good, and the buffer cache analysis is super neat. I haven't seen that kind of thing elsewhere. It's a great article to explain why performance differences exist. My list of gripes is probably more about Amazon marketing suggesting that something is big or high-performance or scalable when it's... not.

If you're something like a bank, you need synchronous replication, but a lot of use cases would probably be fine with async and a couple ms of RPO. Then again, most people probably don't need more than a few thousand writes/second anyway. Speaking of banks: I worked on storage arrays at IBM ~10 years ago, and I think our synchronous replication was sub-100 us, but I can't remember anymore.
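To put that RPO in numbers, here's the rough arithmetic using the figures above (illustrative, not measured):

    # Worst-case transactions lost on failover with async replication
    writes_per_s = 3_000    # "a few thousand writes/second"
    lag_s = 0.002           # "a couple ms" of replication lag
    print(f"~{writes_per_s * lag_s:.0f} txns at risk")   # ~6 transactions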


>c7a servers have dedicated CPUs

I can't see any proof of that, sources?

They talk about vCPU here:

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/compute-...

Do you mean, for example, c6g.metal? For c7a there are also "metal" instances available, but they don't mention that:

https://docs.aws.amazon.com/ec2/latest/instancetypes/co.html...


They're referencing this:

> One of the major differences between C7a instances and the previous generations of instances, such as C6a instances, is their vCPU to physical processor core mapping. Every vCPU on a C7a instance is a physical CPU core. This means there is no Simultaneous Multi-Threading (SMT). By contrast, every vCPU on prior generations such as C6a instances is a thread of a CPU core.

So your "vCPU" is not an SMT thread sharing a core with some other "vCPU". Of course you're still sharing the rest of the CPU package, so things like memory bandwidth and presumably the upper-level cache, and you might be affected by thermal load from other cores? idk whether that kind of thing applies in server contexts.

https://aws.amazon.com/ec2/instance-types/c7a/


It is, however, sharing the memory bandwidth and L3 cache, which is often a big factor for things like inserting millions of table rows. Benchmarks should be run on an idle bare-metal machine, or you can't really compare results.


FWIW, this was why the benchmark was executed on two different servers, and repeated three times on each server. The blog includes a graph of TPS at 5-second intervals for all 24 runs and the results are very tight and consistent. I think that specific graph gives reasonable confidence in the reliability of these numbers.


Yeah, agree. I had ninja-edited that in there, but not in time I guess. And as I posted elsewhere, the disk they're using is very slow. And ideally your access patterns are more like what the other commenter had with COPY, but that's a whole different issue.

AWS will happily sell you bargain-bin performance at enterprise prices if you don't understand what you're buying. There's a reason they're almost a $2T company.


Only c7a.48xlarge or (ideally) its c7a.metal-48xl equivalent. Otherwise you've got a CPU-stealing neighbor.


My understanding is that vCPUs are mapped to hyperthreads or (in this case) physical cores (note: not whole CPU packages). As others noted, there's shared L3 cache, but AFAIK contention there would not show up as "steal" as the kernel presents it.
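If anyone wants to check empirically, steal time is what the kernel reports in /proc/stat (the %st column in top); here's a quick Linux-only sketch:

    # Print cumulative steal time from the aggregate "cpu" line of /proc/stat.
    # Fields: cpu user nice system idle iowait irq softirq steal guest guest_nice
    with open("/proc/stat") as f:
        fields = f.readline().split()
    print(f"steal: {int(fields[8]) / 100:.2f} s")   # USER_HZ is typically 100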



