Hacker News | vitalyd's comments

Lest someone get the wrong idea, Rust makes _mutable_ globals painful to work with; read-only globals are fine.

As for hardware DMA’able memory, it’s true that it adds friction to work with in Rust. But C or C++ would fall into the same boat - one would need to sprinkle volatile or atomics, as appropriate, to keep the optimizer from interfering. In Rust, you’d need to do the same (ptr::{read,write}_volatile or the atomics).
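To make that concrete, here's a minimal sketch of what volatile access looks like in Rust. The `Descriptor` layout, field names, and the "flags last" convention are all illustrative, not taken from any real device:

```rust
use std::ptr;

// Hypothetical descriptor-ring entry shared with a DMA engine; the layout
// is illustrative, not from any real NIC.
#[repr(C)]
struct Descriptor {
    addr: u64,
    len: u32,
    flags: u32,
}

// Write the descriptor through volatile stores so the optimizer cannot
// elide them or assume the memory is unobserved by hardware.
unsafe fn publish_descriptor(slot: *mut Descriptor, addr: u64, len: u32) {
    ptr::write_volatile(&mut (*slot).addr, addr);
    ptr::write_volatile(&mut (*slot).len, len);
    // Flags written last; real hardware would also need an ordering fence
    // before flipping an "owned by device" bit.
    ptr::write_volatile(&mut (*slot).flags, 1);
}
```

Note volatile only prevents the compiler from removing/merging the accesses; it is not a substitute for the fencing a real device protocol would need.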

I’m having a slightly hard time imagining a db where “most” of the address space is DMA’able. I’ve some experience with kernel bypass networking, which has its own NIC interactions wrt memory buffers, but applications built on top have plenty of heap that’s unshared with hardware. What’s an example db where most of the VA is accessible to hardware “arbitrarily”?

Also, regardless of how much VA is shared, there’s going to be some protocol that the software and hardware use to coordinate. The interesting bit here is whether Rust and its type system can allow for expressing these protocols such that violations are compile-time detectable (if not all, perhaps some). Any sane C++ code would similarly try to build some abstractions around this, but how well things can be encoded is up for debate.

When a typical Rust discussion ensues, it’s commonly implied (or occasionally made explicit) that “write X in Rust” == “write X in safe Rust”. And this is the right default. But I think any non-trivial system hyper optimized for performance will have a healthy amount of unsafe code. The more interesting question, to me at least, is how well can that unsafety (and the “hidden”-from-rustc protocol) be contained.

As for a db scheduler obviating the need for a compiler to arbitrate ownership, that’s certainly true to a degree. But this comes back to the protocol I mentioned - the scheduler is what provides the protocol, and so long as other components work via it, it can provide safety barriers (and allow for optimizations). But again, I don’t immediately see why Rust (with careful use of unsafe) couldn’t do the same. And after all, everything is safe so long as things play by the (often unwritten, or poorly written) rules. Once systems get big and hairy, it gets tougher to stay within the guardrails, and that’s where assistance from your language/compiler can be very helpful.

Some of the Rust libs/frameworks for embedded/microcontrollers deal with hardware-accessible memory and otherwise “unsafe” protocols, and I’ve seen some clever ways folks encode this using Rust’s type system.
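The usual trick is "typestate": encode who currently owns a buffer in a type parameter, so handing it to the device consumes the CPU-owned value and misuse becomes a compile error. A toy sketch (all names here are made up for illustration):

```rust
use std::marker::PhantomData;

// Ownership states of a hypothetical DMA buffer, encoded as types.
struct CpuOwned;
struct DeviceOwned;

struct DmaBuffer<State> {
    data: Vec<u8>,
    _state: PhantomData<State>,
}

impl DmaBuffer<CpuOwned> {
    fn new(len: usize) -> Self {
        DmaBuffer { data: vec![0; len], _state: PhantomData }
    }

    // Mutation is only available while the CPU owns the buffer.
    fn fill(&mut self, byte: u8) {
        for b in &mut self.data {
            *b = byte;
        }
    }

    // Handing the buffer to the device consumes the CPU-owned value, so the
    // compiler rejects any further CPU-side access to it.
    fn submit(self) -> DmaBuffer<DeviceOwned> {
        DmaBuffer { data: self.data, _state: PhantomData }
    }
}

impl DmaBuffer<DeviceOwned> {
    // A real driver would poll a completion flag here; this just transitions.
    fn complete(self) -> DmaBuffer<CpuOwned> {
        DmaBuffer { data: self.data, _state: PhantomData }
    }
}
```

Calling `fill` after `submit` is a use-after-move and fails to compile, which is exactly the protocol violation you'd want caught.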


You have some misconceptions about how this all works in real databases. Rust experts who have looked into porting these kinds of C++ database kernels have not been sanguine in my experience. This isn't a theoretical exercise, we need to minimize defects and maximize performance.

- All pointers are ordinary, the fact that the same memory can also be DMA-ed by the hardware is immaterial. You do need accounting mechanisms that let the code infer which objects in memory are at risk of being read/written by a DMA operation. No atomics or volatiles required in userspace. Modern database code is effectively single-threaded.

- Most of the address space in a database is DMA-able because most runtime structures in a database engine must be adaptively persistable to storage. There are various workloads that will force different parts of your runtime state to be pageable because they can overflow RAM while operating within design constraints. Unless you are assuming small databases, that complexity is inconvenient but necessary for robust systems.

- C++ is more expressive than Rust when it comes to making hackery like this transparent, most of which is resolved at compile-time in C++. Much of the mechanics can be taken care of with a pointer wrapper that heavily overlaps the semantics of std::unique_ptr, making the code quite clean and natural. Most code never needs to know how that magic happens. C++ compile-time facilities are currently far beyond Rust.

- You can formally verify the scheduler design, and we sometimes do, but actually implementing it efficiently in real code without the borrow-checker losing its mind is a separate concern.

As I originally stated, you can write such systems in Rust while managing the amount of unsafe code. You just wouldn't want to and there would be little to recommend it compared to the C++ equivalent since it would be objectively worse by most metrics.


It seems you’ve already made up your mind and nothing anyone says will change that :).

The volatile I mentioned isn’t due to concurrency of userspace threads, but to keep the optimizer from eliminating read/write operations. If the src/dst of those memory ops is DMA’d memory touched by hardware, you’d need to do that. It has nothing to do with concurrency.

Capability to spill to disk is certainly needed, no argument. But “most” of the address space and “most” of the runtime structures? Can you elaborate? Is there an OSS example or some paper or any discussion of such a thing in the open?

You can have custom smart pointers in Rust just as well, and back them with your own mem allocator. While there are features in C++ not currently available in Rust, C++ facilities “far beyond” is hyperbole. How well do you know Rust? Genuine question.


It sounds like GP is using "DMA" to mean a memory-mapped file. I recall there was discussion about how to safely handle them in Rust. See https://users.rust-lang.org/t/how-unsafe-is-mmap/19635


G1 has read barriers for SATB marking. Whether a read barrier is used or not is a function of which GC is used, so you can’t really say “Hotspot doesn’t use read barriers”.


Do you have a reference on G1 read barriers? I believe you, but I can't find any information about that from a Google search.

I don't immediately see why you'd need a read barrier for snapshot-at-the-beginning. You only need to trace references from gray objects to white objects that were deleted, which is a write.


Sorry, that was my mistake - it has pre-write barriers for SATB but not actual read barriers.

However, ZGC will have read barriers and if Shenandoah ever gets integrated into Hotspot, it has them too.


No, it’s sadly true. EA may scalarize the allocation but this optimization falls apart very easily in Hotspot.


.NET doesn’t have interior pointers. Any `ref` must be on the stack, and that’s tracked in the stackmap. You cannot have a ref as a field.


.NET doesn't have heap interior pointers (today), but that doesn't matter for this argument. You still need to be able to mark objects as live even if they're only referenced by interior pointers.


> .NET doesn't have heap interior pointers (today), but that doesn't matter for this argument.

I think it matters for this argument.

You are arguing that designing a garbage collector which is concurrent, low latency (pauses < 5 ms), compacting, generational, and supports interior pointers, is easy.

You mentioned C#/.NET as an example, but C# doesn't have interior pointers (from heap to heap, not from stack to heap which is possible).

As far as I know, none of Java, .NET, Go, Haskell, OCaml, D, V8 or Erlang satisfies all these requirements at the same time.

I'm not saying it's impossible, and Ian Lance Taylor in the thread I linked earlier is not saying either. I'm just saying it's certainly hard.

If it were so easy, then most languages would already have interior pointers and a concurrent, low-latency, compacting, generational GC. That's the whole difference between "today" (as in your comment) and "tomorrow".


I think it's fair to question a type being auto-derived to be Sync if only its individual fields are Sync. It may lead to improper sharing of a reference to this value where thread safety across fields is needed. That would be a bug, yes, but the compiler never made the author pause to consider implementing Sync explicitly (and presumably thinking this through). So that "linting" aspect that's applied to, e.g., *mut T is not present.


I'm not entirely sure what you mean.

The compiler does make the author pause: dangerous types like `*mut T` and `UnsafeCell<T>` are not Sync, so types containing them are also not automatically Sync.
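A tiny sketch of that structural propagation (the type names are made up for illustration):

```rust
// Auto traits propagate structurally: a struct of plain integers is Sync
// for free, while one containing a raw pointer is not, forcing an explicit
// (and `unsafe`) opt-in if sharing really is sound.
struct Plain {
    a: i32,
    b: i32,
}

struct Raw {
    p: *mut i32, // *mut i32 is !Sync, so Raw is !Sync too
}

// Compile-time probe: only callable with Sync types.
fn assert_sync<T: Sync>() {}

fn check() {
    assert_sync::<Plain>();
    // assert_sync::<Raw>(); // compile error: `*mut i32` cannot be shared
}
```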

In any case, Rust only guarantees no memory unsafety (requiring no data races). It tries to help with other things, like cleaning up resources via destructors, but these are "best-effort" rather than guarantees. The only way to get a data race and/or memory unsafety with an automatically implemented Sync is with `unsafe` code, and any `unsafe` code near a data type "infects" it and so means the whole type require great care.

Tooling like asan and tsan and, hopefully, Rust-specific sanitizers and static analyzers will make this easier to get right, but fundamentally, as soon as `unsafe` comes into the equation, the programmer has to be paranoid. Of course, as the MutexGuard problem indicates, humans are error-prone at getting this right, which is why the aforementioned tooling and formal proofs---like the one that found that problem---are important, as is building and using appropriate abstractions (e.g. MutexGuard is semantically designed to be a &mut T, so maybe it could indicate this by using PhantomData, or even just storing that directly: this does require manual work, but pushing conventions like that might help bridge the gap to having great `unsafe` tooling in the future).


I suppose my point was that auto-deriving Sync may not have been such a great idea :). I understand the rationale for it but it does open up traps for people to fall into.


Somewhat tangential, but what ensures memory visibility in Rust? Say I allocate a struct (heap or stack), and then pass an immutable reference to a function that takes T: Sync. Assume the struct itself is Sync (e.g. bunch of integer fields). What ensures that the other thread sees all writes to this struct prior to the handoff?


It is the responsibility of cross-thread communication abstractions to use the right fencing (if it is touting itself as safe), probably with the various things in std::sync (especially ...::atomics) if it is pure Rust. For instance, spawning a thread, using a channel (std::sync::mpsc) or a mutex all do such things.

Just calling a function taking T: Sync doesn't need to do any of this, since that call happens all on a single thread. The function might do it internally if it needs to, but that is its own explicit implementation decision.
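A small sketch of the spawn case (the `Point` type and function are hypothetical): all writes made before `thread::spawn` happen-before the closure runs, so the spawned thread is guaranteed to see the fully initialized value.

```rust
use std::thread;

// A plain struct with no internal synchronization.
struct Point {
    x: i32,
    y: i32,
}

fn cross_thread_sum(p: Point) -> i32 {
    // The spawn/join machinery does the fencing; the user writes none.
    thread::spawn(move || p.x + p.y).join().unwrap()
}
```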


Ok, that's what I figured - thanks.

That does bring up the question, though, whether it's correct to say that a Sync type doesn't permit data races. In the example I gave above, publishing a Sync struct incorrectly can exhibit data-race-like symptoms on the receiving thread. So even though the type itself is Sync, that's not enough of a guarantee in the face of "unsafe" publication.


In safe Rust, sharing a value of any Sync type between threads can't result in data races. Send and Sync provide thread safety guarantees about types that other safe abstractions can rely upon, and fencing correctly is one of the things those abstractions have to do to be safe.

I guess "Sync types don't have data races" is the abbreviated version of "Sync types don't have data races in any safe code, no matter how weird and wonderful". That said, this qualification doesn't seem very interesting to me: something equivalent is required about pretty much any statement about any guarantee in any language with unsafe code or FFI (e.g. in Python, something along the lines of "pointers don't dangle in any code that doesn't use `ctypes`"), and thus is elided in a lot of discussions about programming languages.

If you consider `unsafe` Rust, then failing to fence correctly is just one way to get a data race.


It is a guarantee---or else it's a bug (which is the same as every other safe foundation). A type T gets to be `Sync` in one of two ways:

  1. It is "auto derived" when all of its constituent types are Sync.

  2. It is explicitly implemented using `unsafe impl Sync for T {}`. Note the use of the `unsafe` keyword.
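A sketch of route 2 (the `SpinCell` type is made up for illustration, not a real library type): `UnsafeCell` blocks auto-derivation, so the author must write the `unsafe impl` themselves, taking on the proof obligation that interior access is properly guarded.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};

// A hand-rolled spinlock-protected cell.
struct SpinCell<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// UnsafeCell is not Sync, so route 1 refuses to derive it. This line is the
// author's promise that the lock below makes sharing sound.
unsafe impl<T: Send> Sync for SpinCell<T> {}

impl<T> SpinCell<T> {
    fn new(value: T) -> Self {
        SpinCell {
            locked: AtomicBool::new(false),
            value: UnsafeCell::new(value),
        }
    }

    // All interior access is funneled through the spinlock.
    fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        while self.locked.swap(true, Ordering::Acquire) {}
        let r = f(unsafe { &mut *self.value.get() });
        self.locked.store(false, Ordering::Release);
        r
    }
}
```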


Right, but my question isn't about T itself, but rather how it's published to another thread. The example I gave is of a plain struct with no atomics or any other synchronization types internally. A &T is auto-derived to be Sync. But, if a publisher incorrectly publishes this reference, the other thread may see a partially initialized value.


It's the responsibility of the code that transfers the reference to another thread to ensure that. Deep down, past the abstractions, you can't transfer stuff between threads without using unsafe code. It's the responsibility of this unsafe code to ensure that if the value is visible from another thread, all the writes in the current thread have completed before it's visible. One can do this using memory fences.


There are three ways of sharing data across threads.

One is by sharing the data with the thread when it is spawned via a closure. Spawning will fence. No problem there.

The second is to use a good 'ol Sender/Receiver channel pair. These are effectively a shared ring buffer that you can push to and pop from. They also have a fence somewhere.

Finally, you can stick your data into a mutex shared between threads (and let the other thread wait and read it). This will IIRC fence, or do something equivalent.

You can of course build your own ways to do this, but they will need unsafe code to be built (the three APIs above are also built with unsafe code). It is up to you to ensure you handle the fences right when doing this.

The responsibility here is on the publishing mechanism. Most folks use one of the three ways above using primitives from the stdlib depending on the use case.


Yeah, I understand and what I expected to be the answer. My point is that when people talk about Sync not allowing data races, there's the asterisk attached to that statement. That footnote is that publishing code, which is completely separate from the type itself, needs to uphold its responsibility. Unsafe code is usually discussed in light of raw pointers and more generally raw memory ops, but I rarely see this aspect mentioned.


My point above was subtle but important: the asterisk you're mentioning here is not specific to Sync. This is true of all safety guarantees in Rust. unsafe code must uphold the invariants that safe code will rely on, otherwise it's buggy. For example, if the implementation of `Vec<T>` accidentally got its internal `length` out-of-sync with the data on the heap, then nothing bad in and of itself necessarily happens immediately. The bad thing only happens the next time you try to do something with the `Vec<T>`, which will be in safe code.

The safety of things in Rust is built on abstraction. If an abstraction gets something wrong in its `unsafe` details, then there is a bug there. In other words, the asterisk you're mentioning is "You can trust that safe Rust is free of memory unsafety, unless there are bugs." I feel like that's discussed and acknowledged quite a bit.

> Unsafe code is usually discussed in light of raw pointers and more generally raw memory ops, but I rarely see this aspect mentioned.

I guess I don't see a difference. The safeness of Rust code depends on the correct use of `unsafe`, and this applies to everything, not just Sync.

This idea of `unsafe` being "trusted code" is one of the first things that the Rustonomicon covers: https://doc.rust-lang.org/nomicon/safe-unsafe-meaning.html


> I guess I don't see the difference

At a high and general level, yeah, it's all "unsafe". But, most conversations about unsafe don't talk about this aspect. So while what you say is true, I'm merely pointing out that this concurrency aspect doesn't seem to be mentioned much. And while it's implied at a high level, I think it's worth mentioning.

Basically, there's no issue - I just think this should be called out more when concurrency is discussed.


You need raw pointer ops (or, well, dealing with UnsafeCell, which does raw pointer stuff), or syscalls for these too.

Concurrency is not special here. There are all kinds of invariants unsafe code might be required to uphold. So yeah, we could mention concurrency, but then we could also mention UTF8, noalias, initialization, the vector length invariant, the HashMap robin hood invariant, various BTreeMap invariants, etc etc. "Make sure you have fences" is just another semi-specific invariant.

I disagree that "most conversations about unsafe don't talk about this aspect", compartmentalizing unsafe invariants is a major part of these discussions (it's like the first chapter of the nomicon, even)


> Concurrency is not special here

I beg to differ. Concurrency comes with its own bag of hazards, as I mentioned in my reply to burntsushi. Comparing its invariants with Vec's length invariant, HashMap's RH invariant, or any other single-threaded/internal invariants misses the point.

Unsafe discussions that I've seen rarely talk about fences - they tend to focus on raw pointer ops, ffi, transmutes, unbound lifetimes, and in general, are single thread focused.


Right, but I can make the same point about other invariants. Each comes with its own bag of hazards. You can write pages about the robin hood invariants.

Concurrency is particularly complex, perhaps. I think one of the reasons you don't see that much discussion of this is that in general folks in Rust don't write that many internally-unsafe concurrent abstractions. There are a bunch of great safe building blocks out there (stdlib ones, rayon, crossbeam) which folks use for concurrency; it's very rare to build your own. So that might be it.

At least with the stuff I work on like 50% of the unsafe Rust discussions have been around thread safety and ordering and fences, but we're in that relatively rare situation where we need to build those abstractions, so perhaps it's just me who sees these discussions happening.

-----

It's also probably just that discussions introducing unsafe will deal with problems people are used to -- and memory safety is a far more "normal" problem than thread safety.


Could you say more about why you see concurrency as a special thing to call out? I'm genuinely curious so that I can adapt how I explain this stuff in the future. It would help to understand the distinction you are seeing that I am not. :-)


Only because I've had several people ask me how Rust handles memory ordering after they've learned of Sync. Sync documentation focuses very narrowly on there being no data races against the type that implements it. There's also obviously a difference between a struct consisting of a single AtomicUsize (for example) and a struct consisting of a plain usize. The former can be published unsafely because it internally already provides the appropriate fences by virtue of the atomic. In the latter, it requires the publishing to provide the necessary fences. Again, I don't mean to say that doesn't make sense or is unexpected (after one thinks about it), but I'd urge a bit more focus on that part as well.

Given there's no official memory model, similar to say Java Memory Model, there's not much to go by (correct me if I'm wrong). The JMM, for instance, spells out what needs to happen for happens-before edges to be created. It also talks about racy publication of values. Granted there's no close analog to Rust's unsafe in Java. But saying "T: Sync means it's free of data races in safe code", while correct, is a slightly vacuous statement since that T interacts with other components. And yes, those components will likely involve unsafe code, and unsafe code has its own caveats, but still, I don't think it hurts to make this more pronounced. Concurrency is a special beast, rife with its own hard-to-debug hazards. Being a bit more verbose and possibly repetitive about the hazards won't hurt :).


I don't think it's vacuous. It's very important for compartmentalizing unsafe. It interacts with other components, and that statement tells you the responsibilities of both T implementing Sync, and abstractions accepting Sync types.

It is vacuous at a big picture level where you're trying to understand the complete thread safety story, but that was never what that statement was trying to convey.

----

I think documentation here can be improved, though. When I get more time one of the things I want to do is do a major revamp of the concurrency docs, including paragraphs on how the memory ordering stuff works. I'll include filling out the Sync docs in this, thanks for the feedback!


The atomic type doesn't provide the necessary fences automatically. Operations on it have various sorts of fencing, but these do not necessarily connect in the way you think they do. The partial initialisation problem of creating a struct and publishing a pointer happens for the atomic types too, as they don't (and should not!) do everything with sequential consistency.


Note that I didn't mention anything about it being automatic. It's merely a case of being more explicit about what may happen to the value.

Sequential consistency is irrelevant to the example I gave of a single valued struct. I wasn't talking about any ordering with respect other loads/stores.


Being explicit isn't relevant: a relaxed store to an atomic followed by publishing a pointer to it won't have the expected happens-before relationship (program order won't be respected).

In any case, the only reason to think about publishing is how loads and stores are ordered, so it isn't irrelevant. The single valued struct can still have the partial initialisation problem, it just appears as the single field being completely uninitialised.
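To sketch the distinction being drawn here (function names illustrative): publishing a pointer with a Release store, paired with an Acquire load on the consumer side, is what orders the preceding non-atomic initialization before the pointer becomes visible. With `Ordering::Relaxed` on both sides, the consumer could legally observe the pointer while the pointee's initialization is not yet visible.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

static SLOT: AtomicPtr<u64> = AtomicPtr::new(ptr::null_mut());

fn publish(value: u64) {
    // Non-atomic initialization of the payload...
    let boxed = Box::into_raw(Box::new(value));
    // ...ordered before the pointer becomes visible, via Release.
    SLOT.store(boxed, Ordering::Release);
}

fn consume() -> Option<u64> {
    // Acquire pairs with the Release above, making the payload visible.
    let p = SLOT.load(Ordering::Acquire);
    if p.is_null() {
        None
    } else {
        Some(unsafe { *p })
    }
}
```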


That's all true, but I think we're talking past each other a bit. My point isn't about what mechanics would be used or the type of memory order specified (that's the irrelevant part), but merely to highlight that this could use more attention in the docs. The atomic used may be viewed as a "lint" to a reader/user that something interesting may be going on with how this type is used - similar to how *mut isn't automatically Sync; it's an opportunity for people to pause and think things through. If you're looking at a type absent of any atomics or concurrency primitives, it's not apparent that it may be used like that (particularly since Sync is auto-derived and doesn't appear in the source code of the type).

I know one can look at this as being unimportant because only unsafe code that works with Sync types carries the burden of upholding the invariants. But, I still contend that memory ordering ought to be discussed, somewhere, while there's no official memory model documentation (not even sure that's in the plans, although I feel like I may have come across a github issue for that).

Anyway, I don't think I'm saying anything new in this reply over my others. If you feel existing docs are adequate in this regard, that's fine - we can agree to disagree :).


> I may have misunderstood Ralf's bug. Is it really the case that MutexGuard<T> was seen as Sync if T was Send, rather that Sync? Wouldn't that be a bigger problem than just the case of MutexGuard?

So T: Sync if &T: Send. MutexGuard internally contains a &Mutex<T> (and Poison, but that's irrelevant here). T was Cell<i32>. If you follow the rabbit hole, you'll net out that T was Send, and therefore MutexGuard was Sync.


My confusion (and I suspect others') is about what it means for &T to be Sync. Cell<T> isn't safe to be shared across threads (so isn't Sync), but it is Send if T: Send. But that means &Cell<T> is Sync? You can share a reference to something across threads but not the thing itself? What does that even mean?

You could imagine an alternate world where MutexGuard is Send, to allow transfer of ownership of a lock to a different thread while keeping the mutex locked. But that would mean &MutexGuard is Sync, WTF?


The syntax was a bit confusing: T: Sync means (&T): Send and also (&T): Sync. T being Send or not doesn't affect the threadsafety of &T (Send is about transferring ownership which cannot happen with a &T).

You are correct to be confused about (&Cell<i32>) possibly being Sync, because the assumptions that were implied were wrong: when Sync talks about sharing a T, that can be entirely thought of as transferring a &T to another thread (aka Sending the &T). In this sense, sharing a &T between threads (as in, (&T): Sync) is the same as transferring a &&T to another thread, but the inner &T can be copied out so the original &T was also transferred (not just shared) between threads; that is to say, (&T): Sync is 100% equivalent to T: Sync.

Anyway, back to the example here, Cell<i32> is not Sync, so neither is &Cell<i32>, but M = Mutex<Cell<i32>> is Sync (this is a major reason Mutex exists in that form: allowing threadsafe shared mutation/manipulation of types that do not automatically support it), and thus &M is Sync too. Since MutexGuard<Cell<i32>> contains &M, it was thus automatically, incorrectly Sync.

For your second confusion, it is okay for &MutexGuard to be Sync, if MutexGuard itself is. The problem here was MutexGuard was Sync incorrectly in some cases. (MutexGuard is semantically just a fancy wrapper around a &mut T, and so should behave the same as that for traits like Send and Sync.)
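A small sketch of the Mutex point above (the function is hypothetical): `Cell<i32>` alone can't be shared across threads, but `Mutex<Cell<i32>>` is Sync because the lock serializes all access to the interior.

```rust
use std::cell::Cell;
use std::sync::{Arc, Mutex};
use std::thread;

// Cell<i32> is !Sync, yet Mutex<Cell<i32>> is Sync, so it can be shared
// freely via Arc; the lock mediates every touch of the Cell.
fn shared_increment(n: usize) -> i32 {
    let shared = Arc::new(Mutex::new(Cell::new(0)));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let s = Arc::clone(&shared);
            thread::spawn(move || {
                let guard = s.lock().unwrap();
                // Cell mutation through &Cell, made thread-safe by the lock.
                guard.set(guard.get() + 1);
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let v = shared.lock().unwrap().get();
    v
}
```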


Ah! I wasn't following the references close enough.


Tiered JITs are meant to allow slower and more aggressive optimizations to be done on truly hot code. However, you're right in that they still cannot spend as much time or resources as an AOT compiler.


The JIT must be rerun every time the process is loaded. In an image-based language like Smalltalk, the JIT state can be saved in the image along with its performance optimizations. So the next time the image is reloaded, the JIT state is already hot.


Yes, Azul has a similar feature (ReadyNow).

This is nontrivial because lots of optimizations depend on class load ordering and runtime profile information.


Devirtualization is mostly an issue for Java since everything is virtual by default and the language doesn't have support for compile time monomorphization.

While C++ code does use virtuals, it's nowhere near the amount as Java - there are language constructs to avoid that and move the dispatch selection to compile time.


The fantastic standard library mostly goes away because it allocates. It's possible to write Java code that doesn't allocate in steady state, but the coding style becomes terrible (e.g. overuse of primitives, mutable wrappers, manually flattened data structures, etc).

There's also the issue that even without GC running you pay the cost of card marking (GC store barriers) on every reference write. There's unpredictability due to deoptimizations occurring due to type/branch profile changes, safepoints occurring due to housekeeping, etc.

It's unclear whether that style of Java coding is actually a net win over using languages with a better performance model.


Sometimes you just need to allocate, whether due to necessity or expediency.

If you make sure that "almost all" allocations are short-lived, GC is very fast. Allocation is bumping a pointer and cleanup is O(number of new, live objects). It's considerably faster than malloc/free for general-case allocation.
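The "allocation is bumping a pointer" claim can be sketched in a few lines (a toy arena in Rust, standing in for a GC nursery; a real copying collector would evacuate survivors before resetting):

```rust
// Toy bump arena: allocation is a bounds check plus a cursor bump, and
// reclaiming the whole region is O(1) - the nursery analogy for why
// short-lived garbage is nearly free.
struct BumpArena {
    buf: Vec<u8>,
    cursor: usize,
}

impl BumpArena {
    fn new(capacity: usize) -> Self {
        BumpArena { buf: vec![0; capacity], cursor: 0 }
    }

    // Returns the offset of the new allocation, or None if full.
    fn alloc(&mut self, size: usize) -> Option<usize> {
        if self.cursor + size > self.buf.len() {
            return None;
        }
        let offset = self.cursor;
        self.cursor += size;
        Some(offset)
    }

    // "Collecting" everything at once: just rewind the cursor.
    fn reset(&mut self) {
        self.cursor = 0;
    }
}
```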


It all depends on what time scale we're talking about since "very fast" is relative. High performance native systems don't use (in any meaningful manner) naive malloc/free, so that comparison is somewhat moot. I hear this argument quite often when Java vs C++/C is discussed, but it's not comparing idioms/techniques in actual use.

Also don't forget that when GC runs it trashes your d/i-caches; temporaries/garbage allocs reduce your d-cache efficacy; GC must suspend and resume the java threads, which is trips to the kernel scheduler; there are some pathologies with Java threads reaching/detecting safepoints.

GC store barriers (aka card marking) don't have anything to do with thread contention (apart from one thing, which I'll note later). This is a commonly used technique to record old->young gen references, and serves as a way to reduce the set of roots when doing young GC only (i.e. you don't need to scan the entire heap). So this isn't about thread contention, per se -- with the exception that you can get false sharing due to an implementation detail, such as in Oracle's Hotspot.

The card table is an array of bytes. Each heap object's address can be mapped into a byte in this array. Whenever a reference is assigned, Hotspot needs to mark the byte covering the referrer as dirty. The false sharing comes about when different threads end up executing stores where the objects requiring a mark end up mapping to bytes that are on the same cacheline - fairly nasty if you hit this problem as it's completely opaque. So Hotspot has a -XX:+UseCondCardMark flag that tries to neutralize this by first checking if the card is already dirty, and if so, skips the mark; as you can imagine, this inserts an extra compare and branch into the existing card marking code - no free lunch.
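The barrier itself is just address arithmetic. A sketch (in Rust for consistency with the rest of the thread; the 512-byte card size matches Hotspot, everything else is illustrative):

```rust
// Each 512-byte "card" of the heap maps to one byte in the card table.
const CARD_SHIFT: usize = 9; // log2(512), as in Hotspot

fn card_index(heap_base: usize, obj_addr: usize) -> usize {
    (obj_addr - heap_base) >> CARD_SHIFT
}

// Unconditional barrier: always store, which is what can induce false
// sharing when many threads dirty neighboring cards.
fn mark_card(table: &mut [u8], heap_base: usize, obj_addr: usize) {
    table[card_index(heap_base, obj_addr)] = 1;
}

// Conditional barrier (the -XX:+UseCondCardMark idea): an extra load and
// branch, but the store is skipped when the card is already dirty.
fn cond_mark_card(table: &mut [u8], heap_base: usize, obj_addr: usize) {
    let i = card_index(heap_base, obj_addr);
    if table[i] == 0 {
        table[i] = 1;
    }
}
```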


The idea is, there's a space between "performance doesn't matter" and "so fast it can't use malloc" in the trade-offs of software development. It turns out that space is very large.

"Performance-critical code" can even go in that space in an environment where developer cycles and program safety are things that matter, which is definitely the case in HFT.


Sure, but that space isn't just Java anymore anyway.

Also, what's an (non-toy) environment where developer productivity and safety/correctness don't matter? I always find that statement bizarre when talking about production systems.


No, the GC is a net win from the perspective of code development. The JIT is just one of the things that makes Java not as slow as you'd expect.

As I said, the JVM is an acceptable platform for the slower HFT. That's the kind where a clever predictive strategy matters (maybe with lead time of seconds) and you'll get more money from accurately predicting the future than from shaving off 250us.

Make no mistake - you'll still make money shaving off 250us, but not so much that you want to be bogged down structuring your code the C++ "if we structure it right we won't leak things" way.


You should've made it explicit then that you're referring to slow HFT -- the post I was replying to drew no such distinction apart from saying the "extreme end" uses FPGAs. Obviously if young gen GC pauses aren't an issue, then there's nothing to talk about here but then I'd argue that's not really HFT, although I know the term is quite vague, and is no different than other types of systems. There are other issues with GC and garbage allocations, such as d-cache pollution, but I suppose no need to really discuss them given the type of system you're discussing.

I know you were throwing 250us out there as a pseudo example, but that's actually a very long time even outside of UHFT/MM.

Also don't forget that your trading daemons will be under a fire hose consuming marketdata, so beyond being able to tick-to-trade quickly, you need to be able to consume that stream without building up a substantial backlog (or worse, OOM or enter permanent gapping).

