Hacker News
The EC2 firewall is broken (daemonology.net)
286 points by cperciva on Nov 28, 2012 | hide | past | favorite | 53 comments


PMTU discovery on the Internet is generally unreliable. Very few people understand that it exists, and even fewer understand how it actually works. Most ADSL (PPPoE) providers rewrite the TCP MSS on TCP SYN packets traveling over their network to account for the PMTU discovery brokenness. [1] You see the same thing happen with VPN connections where the PMTU is effectively reduced by the size of the overhead for the encapsulation protocol.

1. http://www.cisco.com/en/US/docs/ios/12_2t/12_2t4/feature/gui...
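For anyone unclear on what MSS clamping actually does, the arithmetic is simple: the MSS advertised in the SYN is rewritten so that a full segment plus headers fits the smaller link. A quick sketch (the function name is mine; assumes option-less IPv4 and TCP headers):

```python
# MSS clamping arithmetic: MSS = link MTU minus IPv4 (20) and TCP (20) headers.
IP_HEADER = 20   # IPv4 header, no options
TCP_HEADER = 20  # TCP header, no options

def clamped_mss(link_mtu: int) -> int:
    """Largest TCP payload that fits in one packet on this link."""
    return link_mtu - IP_HEADER - TCP_HEADER

print(clamped_mss(1500))  # plain Ethernet -> 1460
print(clamped_mss(1492))  # PPPoE, 8 bytes of encapsulation overhead -> 1452
```

The 1452 result is exactly what the Cisco document above clamps PPPoE subscribers to, which is why those users never notice that PMTU discovery is broken on their path.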


Just because people don't understand doesn't give them latitude to break the RFCs.


I agree, and when I worked in that world it was a huge pain in the ass. Users don't care about RFCs though, they just want the Internet to work, so you end up doing something that's pragmatic but kludgey. As I pointed out in another comment, Google is responding with a TCP MSS of 1430 (so assuming a PMTU of 1470). It's just what you do so you can get on with your business.


From the RFC quoted :

"A packet-filtering router acting as a firewall which permits outgoing IP packets with the Don't Fragment (DF) bit set MUST NOT block incoming ICMP Destination Unreachable / Fragmentation Needed errors sent in response to the outbound packets from reaching hosts inside the firewall, as this would break the standards-compliant usage of Path MTU discovery by hosts generating legitimate traffic. "

That would be great, next tell the folks at SBCGlobal to fix their damn network as well. I don't know how many folks we've had to 'patch' by manually walking the MTU down on the local router until packets actually get through. It really really sucks and leads to sending way more small packets than needed.


I'm not sure I completely understand your situation, but turning down the MTU on an intermediate router is rarely a good fix for PMTU issues. If you have web servers, you may consider turning down the MTU on the Internet facing interfaces. Even if you turned it down to say 1400, you'd only have around 5% more packets ... probably not enough load increase to melt your routers.

In fact, I just ran some test TCP connections against Google, and their web servers respond with a TCP MSS of only 1430, suggesting they assume a PMTU of 1470. That sounds pretty pragmatic to me and I wouldn't be surprised if other large Internet companies did a similar thing.
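A quick back-of-envelope check of the packet-count claim above (a sketch; assumes 40 bytes of IPv4+TCP headers per segment and ignores options like timestamps):

```python
import math

def segments_needed(payload_bytes: int, mtu: int) -> int:
    """Full-size TCP segments needed for a payload at a given MTU
    (assuming 20-byte IPv4 + 20-byte TCP headers, no options)."""
    mss = mtu - 40
    return math.ceil(payload_bytes / mss)

one_mb = 1_000_000
at_1500 = segments_needed(one_mb, 1500)   # 685 segments
at_1400 = segments_needed(one_mb, 1400)   # 736 segments
print(f"{(at_1400 / at_1500 - 1) * 100:.1f}% more packets")
```

By this arithmetic, dropping from 1500 to 1400 costs roughly 7% more packets; same ballpark as the estimate above, and either way not enough extra load to melt a router.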


The way this expresses itself is that some web sites work and some just hang. Looking at the traffic with Wireshark, you see packets go out but nothing comes back; other web sites work just fine. So you start ratcheting down the MTU size on the outbound router until the non-functioning web site starts returning your calls. Doing pings with various-sized packets can (if the server responds to pings) also identify the longest packet you can send before someone between you and them decides they want a different fragment size.
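That walk-down is really a binary search. A sketch of it (the `probe` callable here is a stand-in for something like `ping -M do -s <size>`; I'm simulating the path with a lambda rather than sending real packets):

```python
def find_path_mtu(probe, lo: int = 68, hi: int = 1500) -> int:
    """Binary search for the largest DF packet size that gets through.

    `probe(size)` returns True if a don't-fragment packet of `size`
    bytes survives the path -- in real life, shell out to `ping -M do`.
    The lower bound of 68 is the minimum every router must forward
    per RFC 791.
    """
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if probe(mid):
            lo = mid          # mid fits; search upward
        else:
            hi = mid - 1      # mid was dropped; search downward
    return lo

# Simulated path with a 1400-byte bottleneck somewhere in the middle:
print(find_path_mtu(lambda size: size <= 1400))  # -> 1400
```

About eleven probes instead of walking down one byte at a time.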


There should be some sort of open-source testing environment for common breakages. For example, maybe a bunch of VirtualBox VM's running Linux. Maybe a unit test for the problem mentioned in the article sets up machines A, B, FW, and C connected A <-> B <-> FW <-> C, has B fragment packets, and has A perform path MTU discovery to C. The firewall configuration under test goes on FW and running the test will catch this problem.

With a good enough framework of this type, all the testing could be done "out of the box" so all you have to do is set up a disk image or IP address of a firewall box to test, and the testing is fully automatic. The firewall under test can run any OS or firewall that will run as a VirtualBox client -- or even be its own box connected to the testing machine's Ethernet port. Heck, if you had machines on both sides of some third-party you don't control, like your ISP, you could even use it to probe their network configuration for issues without any special cooperation from them.

If the test suite gets good enough, maybe eventually pressure will build on vendors to make their products pass and we'll see firewall brokenness start to disappear.

As well as cloud services like AWS, such tests could be used by Linux distros, operating system vendors, and network equipment manufacturers.

I'd build it myself, but I'm not a networking expert and I'm not particularly enthusiastic about becoming one.


http://www.emulab.net/ is along those lines.


yeah, that would work, a bunch of virtualbox VMs running linux. Gotcha.


"While at Amazon re:invent I had the opportunity to complain to some Amazonians..."

So what was their response? Did they just give you an `ec2-authorize` command to run?


Their response was generally along the lines of "yes, that's something we really ought to fix some day...".

One person commented that "public blog posts tend to hurry things along". I imagine that getting to the top of HN might help too...


It got me a refund from a certain blog host.


This sort of shenanigans will be over with IPv6. Blocking ICMP is not an option there.


Then it is likely that they will never be over.

Much like Google did with SPDY, I suspect some big player can (and at some point will, be it Google or someone else) release a TCP replacement that, unlike IPv6, is actually useful if you only have an IPv4 link between endpoints; and it will be this protocol that eventually replaces IPv4, rather than IPv6.

There's even a candidate: CCN; read apenwarr on Van Jacobson for more http://apenwarr.ca/log/?m=201211#11

I just hope they combine it with something like NaCl, so that we have secure-by-default, end-to-end unsniffable communication.


OK, no.

First, don't read apenwarr for the info, read the paper.[1] Unfortunately, that still won't carry you that far because there's a whole lot of "community knowledge" that is carried in DS papers that you don't get unless you read a lot of them or are in the community.

CCN is not an actual proposal. It's a pie-in-the-sky networking redesign.

Do you know what would be required to implement CCN? Go read the paper. The hardware cost alone is insane. This is not coming any time soon. Be prepared for IPv6 for the next 20 years at least.

[1] http://sites.google.com/site/zhongrenshomepage/Networking%20...


I'll take your word for how pie-in-the-sky CCN really is; I only skimmed the paper, without a critical eye. However ...

> Be prepared for IPv6 for the next 20 years at least.

I have, for the last 10 years (or more?). I think it was 1999 when I was thinking "I should probably implement IPv6 in this project, because next year it's going to be everywhere".

I think you mean "be prepared for IPv6 to be the up-and-coming protocol for the next 20 years at least".

Maybe not CCN, but I suspect a killer app with its own tunnelled-in-IPv4 protocol is more likely to take over than IPv6 is. BitTorrent had a chance to do that, but didn't. I can't point to the next one myself.


Can you elaborate on why that is?


Because IPv6 breaks hard if ICMP is filtered. It depends on a bunch of newly allocated message types and replaces things like ARP and (some functions of) DHCP with multicast ICMPv6.

There is tons of little crap like this in the low level details of IPv6. "We don't like that people do X, so we will force them to stop"


IPv6 no longer supports fragmenting on routers. That means if you don't want to be stuck with the default minimum MTU of 1280 (which you really don't want to for low-latency applications) you need to support Path MTU Discovery, which in turn requires ICMP to go unhindered across a large number of different networks between you and the receiver.
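Worth noting that the header overhead works out differently under IPv6; the fixed header is 40 bytes instead of IPv4's 20, so even at the 1280-byte minimum the usable segment size shrinks further. A quick sketch of the arithmetic (assumes option-less TCP headers):

```python
# TCP payload per packet = MTU - IP header - TCP header (no options assumed).
def max_segment(mtu: int, ipv6: bool) -> int:
    ip_header = 40 if ipv6 else 20  # IPv6 fixed header is twice IPv4's
    return mtu - ip_header - 20

print(max_segment(1280, ipv6=True))   # IPv6 minimum MTU -> 1220-byte segments
print(max_segment(1500, ipv6=True))   # full Ethernet    -> 1440-byte segments
```

So getting stuck at 1280 costs you about 15% of each packet's capacity versus a working 1500-byte path; hence the pressure to make PMTUD actually function.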


IPv6 enforces a minimum MTU of 1280? I've personally run into many VPNs with much smaller MTUs.


And so the networks involved in those VPNs need to support path MTU discovery, and they won't have any problems.


OK, so I can see how it violates standards. But how many of the millions of users that send traffic through EC2 does this actually affect? I can see why they would be reluctant to mess with firewall rulesets. Even if they only applied the fix to new users, that would mean fragmentation ... Keep it simple, stupid. Again, it depends on how many users this affects, and from the sounds of the blog post: vanishingly few.


In my experience this is the rule rather than the exception, most firewall configs are broken in some way and there are often several firewalls on the path. I turn them off where circumstances allow.


> most firewall configs are broken

If we just accept "networks are unreliable and sometimes broken" as a fact of life, things will never get better. I applaud the unsung heroes who are finding and fixing the actual root causes of lower layers of our networks. Other important networking issues that come to mind are bufferbloat and IPv6 brokenness.

I'm sure it will be fixed if this stays on the front page for a while, so be sure to upvote the article.


>If we just accept "networks are unreliable and sometimes broken" as a fact of life, things will never get better. I applaud the unsung heroes who are finding and fixing the actual root causes of lower layers of our networks.

As developers, this should always be a fact of life. True, we should strive for making the networks perfect, but at the end of the day, these things still need to be accounted for.

Because you don't have control over a client deciding that 50-cent network cards are "good enough" for their deployment, even though they've demanded five-nines uptime from your software.

Because you can't know when someone's going to spill beer on the switch.

Because someone's just going to pull the wrong cable.

Because the water company accidentally cut the line into the building.


Right, but if PMTUD is broken, large transfers to clients with a low MTU don't work and never will, and there's not much you can do about it.

It's not like it's an "unreliable connection" which can be worked around with error correction. It's a "broken connection", which needs to be fixed.


You're mostly talking about reliability at the physical layer, which as you say is never going to be perfect. Csense is talking about reliability at the link layer and above, which is infinitely more attainable.


That's kind of my point, though.

If you can't guarantee that every layer below you is absolutely reliable, then you need to assume that everything below you might be broken, and that you need to handle it. You can't start with the mindset that everything works below you, and have everything above you also work fine.

The fact that we have people with the mindset that everything below them is broken is the entire reason that these kinds of issues get detected and fixed.


Yeah, if you try to work around broken PMTUD you're going to mess up your application layer protocol SO bad...

EDIT: I suppose you could rewrite TCP MSS on your own firewall or drop the MTU on all web servers... but of course if you're going to reconfigure your firewall / interfaces, you may as well just fix the problem - which was caused by poor device configuration in the first place.
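For what it's worth, you can also cap the MSS per socket from the endpoint itself on Linux, without touching the firewall. A sketch (assumes Linux; `TCP_MAXSEG` must be set before `connect()`, and the kernel is free to choose a smaller value than you ask for):

```python
import socket

# Advertise a smaller MSS so no segment outgrows a conservative path MTU.
# 1360 fits a 1400-byte path (1400 - 20 IPv4 - 20 TCP), leaving ~100 bytes
# of headroom below 1500 for PPPoE/VPN encapsulation along the way.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, 1360)
# s.connect((host, 80))  # the MSS option in our SYN is now capped at 1360
mss = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG)
print(mss)
s.close()
```

What `getsockopt` reports before the connection is established varies by kernel, so don't read too much into the printed value; the cap takes effect at connect time.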


You also can't know when someone between you and the other endpoint has configured something stupidly. :)

As a sysadmin, I do my best to make the network as damn near perfect as possible; but as a developer, I still generally assume the network is unreliable and plan accordingly.


I think it SHOULD be accepted as a fact of life. That way we can then learn to improve our designs of applications and processes to be more tolerant to network faults and unreliability, which is a good thing.


IPv6 brokenness?


Here's an example that affected me just a few days ago:

http://www.datacenterknowledge.com/archives/2009/10/22/peeri...

Basically, Hurricane Electric and Cogent (two major transit providers) are refusing to exchange IPv6 traffic. That breaks stuff. For me the problem manifested as a user facing a 20 second delay whenever he ssh'd in (over IPv4). (I'm on Hurricane Electric. The reverse DNS for his IPv4 address is served by Cogent DNS servers. Reverse DNS lookups were trying to access the Cogent name server over IPv6 and had to time out before IPv4 was tried.)

Other than that though, I actually haven't encountered very much IPv6 breakage.


Peering disputes are nothing new though. I was a Level 3 and Cogent customer, and suddenly my two servers couldn't talk anymore ...

Peering disputes are problematic for both IPv4 and IPv6. This has nothing at all to do with the IP version!

(p.s. You do realise that the article you linked to is from 2009 ...)


Has any IPv4 peering dispute lasted this long? (Yes, the article is from 2009. They're still locked in a dispute in 2012.)


No, mainly because it would be business suicide for whoever is involved, especially if neither party is willing to pay for transit to the other... IPv6 is still small enough that it doesn't yet force one provider's hand over the other.

That being said, if you are single-homed on HE only, then I would suggest you purchase transit elsewhere as well, or find a datacenter that has multiple transit providers. Being single-homed is dangerous.

The real question is how long Cogent can hold out. HE is willing to provide them free IPv6 peering; why not accept it so that their customers can get a full IPv6 table?


Presumably the failure of some to jump onto the IPv6 bandwagon? I guess we can expect IPv4 to stick around like IE6 for a while.


The shit is always broken, and always was. I understand some idealistic network engineers may disagree, but this is a fact of life. Deal with it.

And this 'news' definitely does not deserve the front page of HN.


People often bungle their firewall rules because they don't know any better, but Amazon continuing to willfully fuck up TCP for all of AWS is a pretty large issue for the functioning of the net at large.


What's with the comments on the article:

"johndurbinn • 29 minutes ago I'm bouncing on my toes wah me soopsoak dat hoe"

"Tony Stender • 35 minutes ago Fix this it needs word wrap and zoom capabilities"

I'd downmod them but I'd have to create an account to do so.


It doesn't stop ssh or my web traffic. Why should I care?


It absolutely does stop SSH or Web traffic, if the network path goes through a link with MTU < 1500 and the connection comes in with MSS > PMTU - 40. But only, as the post says, once you start sending a lot of data in one TCP segment.
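The failure condition above can be checked mechanically. A sketch (the function name is mine; the 40 bytes is option-less IPv4 + TCP header overhead):

```python
def pmtud_blackhole(mss: int, path_mtu: int, icmp_blocked: bool) -> bool:
    """True if full-size segments silently vanish: the segment plus 40
    bytes of headers doesn't fit the path MTU, and the 'fragmentation
    needed' ICMP error can't get back through the firewall."""
    return icmp_blocked and mss > path_mtu - 40

# SSH through a 1450-MTU tunnel, EC2-style firewall eating the ICMP errors:
print(pmtud_blackhole(mss=1460, path_mtu=1450, icmp_blocked=True))   # True: hangs on bulk data
print(pmtud_blackhole(mss=1410, path_mtu=1450, icmp_blocked=True))   # False: everything fits
```

Which is why the connection sets up fine and small exchanges work; only the first full-size segment disappears.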


I thought MTU of ~1500 was (realistically) the minimum nowadays?


You thought wrong. RFC 791, p. 24: "Every internet module must be able to forward a datagram of 68 octets without further fragmentation."


I've had more than a few cases where I couldn't SSH into a system at a hospital or clinic because the VPN/firewall/whatever that their connection went over rejected packets with too high of an MTU. Generally you'll get your SSH connection and it'll hang in the middle of the MOTD.


Because people like AGL, CPercival, and me do care, i.e., we develop network applications, care about users, and prefer to avoid rather than troubleshoot random brokenness.

From today:

http://news.ycombinator.com/item?id=4840330
http://news.ycombinator.com/item?id=4844121
http://stackoverflow.com/questions/13596019/openssl-1-0-1-ha...


Because it stops traffic for some of your users as mentioned in TFA?


Realistically it's probably <0.01% of users.


My advice:

Publicly expound on how you don't care about those users and they are SOL because they are in the minority. It won't matter because it doesn't affect the rest of your userbase, right?

</sarcasm>


You could say the same about IE6 users.


When working on a packet analysis system for a VoIP company, I found that around 1% of incoming UDP traffic was fragmented, the size being around 576 bytes.

I don't know how representative that was (10s of thousands of users, but large users generating more traffic than smaller ones), but < 0.01% is probably on the low side.


If you have a million users that's 100 pissed off people.


And this is EC2, so yeah, millions of users is not unrealistic!



