The Nagle algorithm was created back in the day of multi-point networking. Multiple hosts were all tied to the same communications (Ethernet) channel, so they would use CSMA (https://en.wikipedia.org/wiki/Carrier-sense_multiple_access_...) to avoid collisions. CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel. (Each host can have any number of "users.") In fact, most modern (copper) (Gigabit+) Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES. A hybrid is used on the PHY at each end to subtract what is being transmitted from what is being received. Older (10/100 Base-T) can do the same thing because each end has dedicated TX/RX pairs. Fiber optic Ethernet can use either the same fiber with different wavelengths, or separate TX/RX fibers. I haven't seen a 10Base-2 Ethernet/DECnet interface for more than 25 years. If any are still operating somewhere, they are still using CSMA. CSMA is also still used for digital radio systems (WiFi and others). CSMA includes a "random exponential backoff timer" which does the (poor) job of managing congestion. (More modern congestion control methods exist today.) Back in the day, disabling the random backoff timer was somewhat equivalent to setting TCP_NODELAY.
Dumping the Nagle algorithm (by setting TCP_NODELAY) almost always makes sense and should be enabled by default.
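For anyone who has not done this before, here is a minimal sketch of what "setting TCP_NODELAY" looks like from an application, using Python's standard `socket` module. The helper name `make_nodelay_socket` is made up for illustration; the option is per socket, so each application has to ask for it.

```python
import socket

def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 asks the kernel to send small segments right away
    # instead of holding them while earlier data is still unacknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

sock = make_nodelay_socket()
# getsockopt returns a nonzero value when the option is set.
nodelay_enabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
sock.close()
```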
False. It really was just intended to coalesce packets.
I’ll be nice and not attack the feature. But making that the default is one of the biggest mistakes in the history of networking (second only to TCP’s boneheaded congestion control that was designed imagining 56kbit links)
TCP uses the worst congestion control algorithm for general networks except for all of the others that have been tried. The biggest change I can think of is adjusting the window based on RTT instead of packet loss to avoid bufferbloat (Vegas).
Unless you have some kind of special circumstance you can leverage it's hard to beat TCP. You would not be the first to try.
For serving web pages, TCP is only used by legacy servers.
The fundamental congestion control issue is that after you drop to half, the window is increased by /one packet/ per round trip, which for all sorts of artificial reasons is about 1500 bytes. Which means the performance gets worse and worse the greater the bandwidth-delay product (which has increased by several orders of magnitude). Not to mention head-of-line blocking etc.
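To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes the textbook additive-increase model (window halves on loss, then grows by one packet per RTT) and a 1500-byte packet; the function name and the example link parameters are my own, and newer algorithms such as CUBIC or BBR do not behave this way.

```python
def recovery_time_seconds(bandwidth_bps: float, rtt_s: float,
                          packet_bytes: int = 1500) -> float:
    """Time to grow the window from half the BDP back to the full BDP,
    adding one packet per round trip (classic additive increase)."""
    # Bandwidth-delay product expressed in packets.
    bdp_packets = bandwidth_bps * rtt_s / 8 / packet_bytes
    # After halving, half the BDP must be regained, one packet per RTT.
    rtts_needed = bdp_packets / 2
    return rtts_needed * rtt_s

# Assumed example links: a 56 kbit/s modem and a 10 Gbit/s path, both 100 ms RTT.
slow_link = recovery_time_seconds(56_000, 0.1)
fast_link = recovery_time_seconds(10e9, 0.1)
```

Under these assumptions the fast link needs on the order of an hour to recover from a single loss, while the modem recovers in a fraction of a second, which is the scaling problem being described.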
The reason for QUIC's silent success was the brilliant move of sidestepping the political quagmire around TCP congestion control, so they could solve the problems in peace
TCP Reno fixed that problem. QUIC is more about sending more parts of the page in parallel. It does do its own flow control, but that's not where it gets the majority of the improvement.
TCP Reno, Vegas, etc. all addressed congestion control with various ideas, but were all doomed by the academic downward spiral pissing contest.
QUIC is real and works great; they sidestepped all of that, just built it and tuned it, and it has basically won. As for QUIC "sending more parts of the page in parallel": yes, that's what I referred to re head-of-line blocking in TCP.
There is nothing magic about the congestion control in QUIC. It shares a lot with TCP BBR.
Unlike TLS over TCP, QUIC is still not able to be offloaded to NICs. And most stacks are in userspace. So it is horrifically expensive in terms of watts/byte or cycles/byte sent for a CDN workload (something like 8x as expensive the last time I looked), and it's primarily used and advocated for by people who have metrics for latency, but not server-side costs.
> Unlike TLS over TCP, QUIC is still not able to be offloaded to NICs.
That's not quite true. You can offload QUIC connection steering just fine, as long as your NICs can do hardware encryption. It's actually _easier_ because you can never get a QUIC datagram split across multiple physical packets (barring the IP-level fragmentation).
The only real difference from TCP is the encryption for ACKs.
From a CDN perspective, what's missing is that there is no kernel stack on FreeBSD / Linux, no support for sendfile/sendpage, and no support for segmentation offload entirely in hardware. So you can't just send an entire file (or a large range) and forget about it, like you can with TCP.
Some NICs, like Broadcom's newer ones, support crypto offloads, but this is not enough to be competitive with TCP / TLS. Especially since support for those offloads is not in any mainline kernel in Linux or BSD.
What did you still need to connect with 10mbit half duplex in 2014? I had gigabit to the desktop for a relatively small company in 2007, by 2014 10mb was pretty dead unless you had something Really Interesting connected....
If you worked in an industrial setting, legacy tech abounds due to the capital costs of replacing the equipment it supports (includes manufacturing, older hospitals, power plants, and etc). Many of these even still use token ring, coax, etc.
One co-op job at a manufacturing plant I worked at ~20 years ago involved replacing the backend core networking equipment with more modern ethernet kit, but we had to setup media converters (in that case token ring to ethernet) as close as possible to the manufacturing equipment (so that token ring only ran between the equipment and the media converter for a few meters at most).
They were "lucky" in that:
1) the networking protocol that was supported by the manufacturing equipment was IPX/SPX, so at least that worked cleanly on ethernet and newer upstream control software running on an OS (HP-UX at the time)
2) there were no lives at stake (eg nuclear safety/hospital), so they had minimal regulatory issues.
There is always some legacy device which does weird/old connections. I distinctly remember the debit card terminals in the late '00s required a 10mbit capable ethernet connection which allowed x25 to be transmitted over the network. It is not a stretch to add 5 to 10 more years to those kinds of devices.
Technical debt goes hard, I had a discussion with a facilities guy why they never got around to ditch the last remnants of token ring in an office park. Fortunately in 2020 they had plenty of time to rip that stuff out without disturbing facility operation. Building automation, security and so on often lives way longer than you'd dare planning.
Everyone is forgetting the no delay is per application and not a system configuration. Yep, old things will still be old and that’s ok. That new fangled packet farter will need to set no delay which is a default in many scenarios. This article reminds us it is a thing and especially true for home grown applications.
This hasn't mattered in 20 years for me personally, but in 2003 I killed connectivity to a bunch of Siemens 505-CP2572 PLC ethernet cards by switching a hub from 10Mbps to 100Mbps mode. The button was right there, and even back then I assumed there wouldn't be anything requiring 10Mbps any more. The computers were fine but the PLCs were not. These things are still in use in production manufacturing facilities out there.
There's plenty of use cases for small things which don't need any sorts of speeds, where you might as well have used a 115200 baud serial connection but ethernet is more useful. Designing electronics for 10Mbit/s is infinitely easier and cheaper than designing electronics for 100Mbit/s, so if you don't need 100Mbit/s, why would you spend the extra effort and expense?
There is also power consumption and reliability. I have part of my home network on 100Mbps. It eats about 60% less energy compared to Gb Ethernet. Less prone to interference from PoE.
Some old DEC devices used to connect console ports of servers. Didn't need it per se but also didn't need to spend $3k on multiple new console routers.
Was an old isp/mobile carrier so could find all kinds of old stuff. Even the first SMSC from the 80s (also DEC, 386 or similar cpu?) was still in its racks because they didn't need the rack space as 2 modern racks used up all the power for that room, was also far down in a mountain so was annoying to remove equipment.
Thanks for the clarification. They're so close to being the same thing that I always call it CSMA/CD. Avoiding a collision is far more preferable than just detecting one.
Yeah, many enterprise switches don't even support 100Base-T or 10Base-T anymore. I've had to daisy chain an old switch that supports 100Base-T onto a modern one a few times myself. If you drop 10/100 support, you can also drop HD (half-duplex) support. In my junk drawer, I still have a few old 10/100 hubs (not switches), which are by definition always HD.
Is avoiding a collision always preferable? CSMA/CA has significant overhead (backoff period) for every single frame sent, on a less congested line CSMA/CD has less overhead.
CSMA/CD only requires that you back off if there actually is a collision. CSMA/CA additionally requires that for every frame sent, after sensing the medium as clear, that you wait for a random amount of time before sending it to avoid collisions. If the medium is frequently clear, CA will still have the overhead of this initial wait where CD will not.
Depending upon how it's actually implemented, CSMA/CA may have the same (unintended?) behavior as CSMA/CD in the sense that setting TCP_NODELAY will also set the backoff timer to zero. It would be interesting to test.
Nagle is quite sensible when your application isn't taking any care to create sensibly-sized packets, and isn't so sensitive to latency. It avoids creating stupidly small packets unless your network is fast enough to handle them.
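For reference, the core decision Nagle makes can be sketched in a few lines. This is a simplified model, not kernel code: it ignores the receive window, delayed ACKs and timers, and the function name `nagle_should_send_now` is hypothetical.

```python
def nagle_should_send_now(queued_bytes: int, mss: int, unacked_bytes: int) -> bool:
    """Simplified Nagle rule: decide whether queued data goes out immediately."""
    # A full-sized segment is always worth sending.
    if queued_bytes >= mss:
        return True
    # A small segment is sent only if nothing is in flight; otherwise it
    # waits (and keeps accumulating) until the outstanding data is ACKed.
    return unacked_bytes == 0

# A 10-byte write with 500 bytes still unacknowledged is held back...
held = not nagle_should_send_now(queued_bytes=10, mss=1460, unacked_bytes=500)
# ...but the same write on an idle connection goes out at once.
sent_when_idle = nagle_should_send_now(queued_bytes=10, mss=1460, unacked_bytes=0)
```

That is why a fast network (quick ACKs) sees little delay: the wait for a small segment is bounded by how long the previous data takes to be acknowledged.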
At this point, this is an application level problem and not something the kernel should be silently doing for you IMO. An option for legacy systems or known problematic hosts fine, but off by default and probably not a per SOCKOPT.
Every modern language has buffers in their stdlib. Anyone writing character at a time to the wire lazily or unintentionally should fix their application.
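As a rough illustration of fixing it in the application, the sketch below joins small pieces and hands them to the kernel in one call instead of many tiny ones. The helper name `send_coalesced` is made up, and the demo uses a local `socket.socketpair()` (not a real TCP connection) only to keep the example self-contained.

```python
import socket

def send_coalesced(sock: socket.socket, pieces: list[bytes]) -> int:
    """Join many small pieces and pass them to the kernel in a single call."""
    payload = b"".join(pieces)
    sock.sendall(payload)
    return len(payload)

# Local demonstration: four small writes become one send.
a, b = socket.socketpair()
sent = send_coalesced(a, [b"GET ", b"/index.html ", b"HTTP/1.1\r\n", b"\r\n"])
received = b.recv(4096)
a.close()
b.close()
```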
>> TCP_NODELAY can also make fingerprinting easier in various ways which is a reason to make it something you have to ask for
> Only because it's on by default for no real reason. I'm saying the default should be off.
This is wrong.
I'm assuming here that you mean that Nagle's algorithm is on by default, i.e TCP_NODELAY is off by default. It seems you think the only extra fingerprinting info TCP_NODELAY gives you is the single bit "TCP_NODELAY is on vs off". But it's more than that.
In a world where every application's traffic goes through Nagle's algorithm, lots of applications will just be seen to transmit a packet every 300ms or whatever as their transmissions are buffered up by the kernel to be sent in large packets. In a world where Nagle's algorithm is off by default, those applications could have very different packet sizes and timings.
With something like Telnet or SSH, you might even be able to detect who exactly is typing at the keyboard by analyzing their key press rhythm!
To be clear, this is not an argument in favor of Nagle's algorithm being on by default. I'm relatively neutral on that matter.
> I'm assuming here that you mean that Nagle's algorithm is on by default, i.e TCP_NODELAY is off by default.
Correct, I wrote that backwards, good callout.
RE: fingerprinting, I'd concede the point in a sufficiently lazy implementation. I'd fully expect the application layer to handle this, especially in cases where this matters.
Nagle's algorithm does really well when you're on shitty wifi.
Applications also don't know the MTU (the size of packets) on the interface they're using. Hell, they probably don't even know which interface they're using! This is all abstracted away. So, if you're on a network with a 14xx MTU (such as a VPN), assuming an MTU of 1500 means you'll send one full packet and then a tiny little packet after that. For every one packet you think you're sending!
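The arithmetic behind that, as a small sketch. It assumes 40 bytes of IPv4 + TCP headers with no options and picks 1400 as a stand-in for the "14xx" MTU; the function name is hypothetical.

```python
def split_into_segments(write_bytes: int, mtu: int, header_bytes: int = 40) -> list[int]:
    """Payload sizes of the TCP segments needed to carry one application write."""
    mss = mtu - header_bytes          # largest payload that fits in one packet
    full, rest = divmod(write_bytes, mss)
    return [mss] * full + ([rest] if rest else [])

# The application sizes its writes for a 1500-byte MTU (1460 bytes of payload).
on_ethernet = split_into_segments(1460, mtu=1500)   # fits in one packet
on_vpn = split_into_segments(1460, mtu=1400)        # one full packet plus a runt
```

With these assumed numbers, every 1460-byte write on the VPN path becomes a 1360-byte segment followed by a 100-byte one.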
Nagle's algorithm lets you just send data; no problem. Let the kernel batch up packets. If you control the protocol, just use a design that prevents Delayed ACK from causing the latency. IE, the "OK" from Redis.
If nobody is maintaining them, do we really need them? In which case, does it really matter?
If we need them, and they’re not being maintained, then maybe that’s the kind of “scream test” wake up we need for them to either be properly deprecated, or updated.
> If nobody is maintaining them, do we really need them?
Given how often issues can be traced back to open source projects barely scraping along? Yes and they are probably doing something important. Hell, if you create enough pointless busywork you can probably get a few more "helpful" hackers into projects like xz.
A gzip encoder has no business deciding whether a socket should wait to fill up packets, however. The list of relevant applications and libraries gets a lot shorter with that restriction.
So to be clear, you believe every program that outputs a bulk stream to stdout should be written to check if stdout is a socket and enable Nagle's algorithm if so? That's not just busywork - it's also an abstraction violation. By explicitly turning off Nagle's, you specify that you understand TCP performance and don't need the abstraction, and this is a reasonable way to do things. Imagine if the kernel pinned threads to cores by default and you had to ask to unpin them...
No, the program should take care to enable TCP_NODELAY when creating the socket. If the program gets passed a FD from outside it's on the outside program to ensure this. If somehow the program very often gets outside FDs from an oblivious source that could be a TCP socket, then it might indeed have to manually check if it really wants Nagle's algorithm.
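A sketch of what the "manually check" case could look like for a file descriptor handed in from outside. This assumes a POSIX system and Python 3.7+ (where `socket.socket(fileno=...)` detects the family and type); the function name `enable_nodelay_if_tcp` is hypothetical.

```python
import os
import socket
import stat

def enable_nodelay_if_tcp(fd: int) -> bool:
    """Set TCP_NODELAY on fd if it is a TCP socket; return whether it was set."""
    if not stat.S_ISSOCK(os.fstat(fd).st_mode):
        return False                      # pipe, file, tty, ...
    # Wrap a duplicate so closing the wrapper leaves the caller's fd open.
    sock = socket.socket(fileno=os.dup(fd))
    try:
        is_inet = sock.family in (socket.AF_INET, socket.AF_INET6)
        is_stream = sock.type == socket.SOCK_STREAM
        # proto may be reported as 0 on platforms that cannot query it.
        is_tcp = sock.proto in (0, socket.IPPROTO_TCP)
        if not (is_inet and is_stream and is_tcp):
            return False                  # e.g. a Unix-domain or UDP socket
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        return True
    finally:
        sock.close()

# A pipe is left alone; a TCP socket gets the option.
read_end, write_end = os.pipe()
pipe_result = enable_nodelay_if_tcp(write_end)
tcp = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcp_result = enable_nodelay_if_tcp(tcp.fileno())
tcp_has_nodelay = tcp.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
os.close(read_end)
os.close(write_end)
tcp.close()
```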
No, I did not. Am I now forbidden from using sentence structures that AI has also used? That's not just stupid - it's insane. You know that's not even an em-dash, right?
If by "latency" you mean a hundred milliseconds or so, that's one thing, but I've seen Nagle delay packets by several seconds. Which is just goofy, and should never have been enabled by default, given the lack of an explicit flush function.
A smarter implementation would have been to call it TCP_MAX_DELAY_MS, and have it take an integer value with a well-documented (and reasonably low) default.
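No such socket option exists as far as I know, but the idea can be approximated in the application. Below is a hypothetical sketch of a write buffer that flushes when it is either full enough or has held data longer than a configured maximum; the class name, defaults and the explicit `now_ms` parameter (used to keep the example deterministic) are all my own assumptions.

```python
class DelayBoundedBuffer:
    """Coalesce small writes, but never hold data longer than max_delay_ms."""

    def __init__(self, max_delay_ms: int = 5, flush_bytes: int = 1460):
        self.max_delay_ms = max_delay_ms
        self.flush_bytes = flush_bytes
        self._chunks: list[bytes] = []
        self._size = 0
        self._oldest_ms = None            # arrival time of the oldest held byte

    def write(self, data: bytes, now_ms: int) -> None:
        if not data:
            return
        if self._oldest_ms is None:
            self._oldest_ms = now_ms
        self._chunks.append(data)
        self._size += len(data)

    def poll(self, now_ms: int) -> bytes:
        """Return the bytes to send now, or b"" if it is fine to keep waiting."""
        if self._oldest_ms is None:
            return b""
        too_old = now_ms - self._oldest_ms >= self.max_delay_ms
        big_enough = self._size >= self.flush_bytes
        if not (too_old or big_enough):
            return b""
        payload = b"".join(self._chunks)
        self._chunks, self._size, self._oldest_ms = [], 0, None
        return payload

buf = DelayBoundedBuffer(max_delay_ms=5, flush_bytes=100)
buf.write(b"he", now_ms=0)
buf.write(b"llo", now_ms=2)
early = buf.poll(now_ms=3)   # still within the delay budget, nothing to send
late = buf.poll(now_ms=5)    # the oldest byte is 5 ms old, so everything flushes
```

The caller would pass whatever `poll` returns to `send` on a socket with TCP_NODELAY set, so the only delay is the one the application chose.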
It delays one RTT, so if you have seen seconds of delays that means your TCP ACK packets were received seconds later for whatever reason (high load?). Decreasing latency in that situation would WORSEN the situation.
I was testing some low-bandwidth voice chat code using two unloaded PCs sitting on the same desk. I nearly jumped out of my skin when "HELLO, HELLO?" came through a few seconds late, at high volume, after I had already concluded it wasn't working. After ruling out latency on the audio side, TCP_NODELAY solved the problem.
All respect to Animats, but whoever thought this should be the default behavior of TCP/IP had rocks in their head, and/or were solving a problem that had a better solution that they just didn't think of at the time.
Hard to say without looking at the complete setup - and probably just a side-effect of the underlying issue. The question is, why did you have such high RTTs? That already points to a different cause.
I would even argue that NODELAY for a VoIP solution makes no sense - why are you even using TCP instead of UDP in the first place?
Reminds me of trying to do IoT stuff in hospitals before IoT was a thing.
Send exactly one 205 byte packet. How do you really know? I can see it go out on a scope. And the other end receives a packet with bytes 0-56. Then another packet with bytes 142-204. Finally a packet 200ms later with bytes 57-141.
At the application layer you would not see the reordered bytes. However on the network you have IP beneath both UDP and TCP and network hardware is normally free to slice and reorder those IP packets however it wants.
It's not. Routers are expected to be allowed to slice IPv4 packets above 576 bytes. They can't slice IPv6 and they can't slice TCP.
However, malicious middleboxes insert themselves into your TCP connections, terminating a separate TCP connection on each side of the spyware and therefore completely rewriting TCP segment boundaries.
In less common scenarios, the same may be done by non malicious middleboxes - but it's almost always malicious ones. The party that attacked xmpp.is/jabber.ru terminated not only TCP but also TLS and issued itself a Let's Encrypt certificate.
CSMA further limits the throughput of the network in cases where you're sending lots of small transmissions by making sure that you're always contending for the carrier.
I think you are confusing network layers and their functionality.
"CSMA is no longer necessary on Ethernet today because all modern connections are point-to-point with only two "hosts" per channel."
Ethernet really isn't ptp. You will have a switch at home (perhaps in your router) with more than two ports on it. At layer 1 or 2 how do you mediate your traffic, without CSMA? Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - it's how the actual electrical signals are mediated?
"Ethernet connections have both ends both transmitting and receiving AT THE SAME TIME ON THE SAME WIRES."
That's full duplex as opposed to half duplex.
Nagle's algo has nothing to do with all that messy layer 1/2 stuff but is at the TCP layer and is an attempt to batch small packets into fewer larger ones for a small gain in efficiency. It is one of many optimisations at the TCP layer, such as Jumbo Frames and mini Jumbo Frames and much more.
> You will have a switch at home (perhaps in your router) with more than two ports on it. At layer 1 or 2 how do you mediate your traffic, without CSMA? Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?
CSMA/CD is specifically for a shared medium (shared collision domain in Ethernet terminology), putting a switch in it makes every port its own collision domain that are (in practice these days) always point-to-point. Especially for gigabit Ethernet, there was some info in the spec allowing for half-duplex operation with hubs but it was basically abandoned.
As others have said, different mechanisms are used to manage trying to send more data than a switch port can handle but not CSMA (because it's not doing any of it using Carrier Sense, and it's technically not Multiple Access on the individual segment, so CSMA isn't the mechanism being used).
> That's full duplex as opposed to half duplex.
No actually they're talking about something more complex, 100Mbps Ethernet had full duplex with separate transmit and receive pairs, but with 1000Base-T (and 10GBase-T etc.) the four pairs all simultaneously transmit and receive 250 Mbps (to add up to 1Gbps in each direction). Not that it's really relevant to the discussion but it is really cool and much more interesting than just being full duplex.
It's P2P as far as the physical layer (L1) is concerned.
Usually, full duplex requires two separate channels. The introduction of a hybrid on each end allows the use of the same channel at the same time.
Some progress has been made in doing the same thing with radio links, but it's harder.
Nagle's algorithm is somewhat intertwined with the backoff timer in the sense that it prevents transmitting a packet until some condition is met. IIRC, setting the TCP_NODELAY flag will also disable the backoff timer, at least this is true in the case of TCP/IP over AX25.
> It's P2P as far as the physical layer (L1) is concerned.
Only in the sense that the L1 "peer" is the switch. As soon as the switch goes to forward the packet, if ports 2 and 3 are both sending to port 1 at 1Gbps and port 1 is a 1Gbps port, 2Gbps won't fit and something's got to give.
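To make "something's got to give" concrete, here is a rough sketch of how long a switch buffer lasts in that scenario. The 1 MB buffer size is a made-up figure for illustration (real switches vary widely), and the function name is hypothetical.

```python
def seconds_until_buffer_full(buffer_bytes: int, ingress_bps: float,
                              egress_bps: float) -> float:
    """How long a queue can absorb traffic arriving faster than it drains."""
    excess_bps = ingress_bps - egress_bps
    if excess_bps <= 0:
        return float("inf")               # the port keeps up, the queue never fills
    return buffer_bytes * 8 / excess_bps

# Ports 2 and 3 each send 1 Gbps toward port 1, which drains at 1 Gbps.
fill_time = seconds_until_buffer_full(1_000_000, ingress_bps=2e9, egress_bps=1e9)
```

With those assumed numbers the buffer absorbs the overload for about 8 ms, after which frames are dropped (or PAUSE kicks in) and TCP sees it as loss.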
Right but the switch has internal buffers and ability to queue those packets or apply backpressure. Resolving at that level is a very different matter from an electrical collision at L1.
Not as far as TCP is concerned it isn't. You sent the network a packet and it had to throw it away because something else sent packets at the same time. It doesn't care whether the reason was an electrical collision or not. A buffer is just a funny looking wire.
Ethernet has had the concept of full duplex for several decades and I have no idea what you mean by: "hybrid on each end allows the use of the same channel at the same time."
The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.
No idea why you are mentioning radios. That's another medium.
My understanding is that no one uses hubs anymore, so your collision domain goes from a number of machines on a hub to a dedicated channel between the switch and the machine. There obviously won’t be collisions if you’re the only one talking and you’re able to do full duplex communications without issue.
Hubs still exist(ed), but nobody implemented half-duplex or CSMA from gigabit ethernet on up (I can't remember if it was technically part of the gig-e spec or not)
"No one" and "no new installations" are not the same. There are many many many millions of hubs out there in the world. The statement, as written, is just ludicrously naive, entirely disconnected from reality.
> Ethernet has had the concept of full duplex for several decades and I have no idea what you mean by: "hybrid on each end allows the use of the same channel at the same time."
Gigabit (and faster) is able to do full duplex without needing separate wires in each direction. That's the distinction they're making.
> The physical electrical connections between a series of ethernet network ports (switch or end point - it doesn't matter) are mediated by CSMA.
Not in a modern network, where there's no such thing as a wired collision.
> Take a single switch with n ports on it, where n>2. How do you mediate ethernet traffic without CSMA - its how the actual electrical signals are mediated?
Switches are not hubs. Switches have a separate receiver for each port, and each receiver is attached to one sender.
In modern ethernet, there is also flow-control via the PAUSE frame. This is not for collisions at the media level, but you might think of it as preventing collisions at the buffer level. It allows the receiver to inform the sender to slow down, rather than just dropping frames when its buffers are full.
At least in networks I've used, it's better for buffers to overflow than to use PAUSE.
Too many switches will get a PAUSE frame from port X and send it to all the ports that send packets destined for port X. Then those ports stop sending all traffic for a while.
About the only useful thing is if you can see PAUSE counters from your switch, you can tell a host is unhealthy from the switch whereas inbound packet overflows on the host might not be monitored... or whatever is making the host slow to handle packets might also delay monitoring.
Sadly, I'm not too surprised to hear that. I wish we had more rapid iteration to improve such capabilities for real world use cases.
Things like back pressure and flow control are very powerful systems concepts, but intrinsically need there to be an identifiable flow to control! Our systems abstractions that multiplex and obfuscate flows are going to be unable to differentiate which application flow is the one that needs back pressure, and paint too-wide brush.
In my view, the fundamental problem is we're all trying to "have our cake and eat it". We expect our network core to be unaware of the edge device and application goals. We expect to be able to saturate an imaginary channel between two edge devices without any prearrangement, as if we're the only network users. We also expect our sparse and async background traffic to somehow get through promptly. We expect fault tolerance and graceful degradation. We expect fairness.
We don't really define or agree what is saturation, what is prompt, what is graceful, or what is fair... I think we often have selfish answers to these questions, and this yields a tragedy of the commons.
At the same time, we have so many layers of abstraction where useful flow information is effectively hidden from the layers beneath. That is even before you consider adversarial situations where the application is trying to confuse the issue.