Over on the Strongarm blog I’ve got an in-depth article about testing the performance of the Linux firewall. We knew it was fast, but how fast? The answer, perhaps surprisingly, is “significantly faster than the kernel’s own packet handling”: blocking packets to an unbound UDP port was roughly 2.5× faster than accepting them, and in that case a single-core EC2 server managed to process almost 300k PPS. We also tested the performance of blocking DDoS attacks using ipset and/or string matching.
I’ve archived the article below:
As we mentioned in a previous post, at Strongarm, we love the Linux firewall. While a lot of effort has gone into making the firewall highly flexible and suitable for nearly every use case, there is very little information out there about the actual performance of the firewall. There are many studies on the performance of the Linux network stack, but they don’t get to the heart of the performance of the firewall component itself.
With the core part of the Strongarm product focused on making highly reliable DNS infrastructure, we thought we’d take a look at the Linux firewall, especially as it relates to UDP performance.
Preparing to Test
To begin, we wrote a small C program that sends 128-byte UDP packets very efficiently to the loopback device on a single source/destination port. We tested this with one of the latest Ubuntu 16.04 4.4-series kernels on a small single-core EC2 instance. We ran each test multiple times, for 30 seconds each, and confirmed that the CPU was pegged at 100% throughout, i.e. that we were CPU-bound. We then averaged the results to produce the values shown here, although, because everything ran in software on the same box, there was very little deviation between test runs.
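The generator itself isn’t included in the article, but a minimal sketch of the approach might look like the following C program (the destination port 5001 is a placeholder of ours, and a real tool would add counters and rate reporting):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
    char payload[128];                  /* fixed 128-byte UDP payload */
    memset(payload, 'x', sizeof(payload));

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port = htons(5001);         /* single destination port (placeholder) */
    dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return 1;

    /* connect() pins the source/destination pair so each send() reuses
       the cached route instead of resolving it for every packet */
    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0)
        return 1;

    for (;;)                            /* flood until interrupted */
        send(fd, payload, sizeof(payload), 0);
}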
Here are the results we found and some potential conclusions you can draw from them to optimize your own firewall.
NOTRACK Performance
First, we ran the packet generator with no firewall rules installed but with the connection tracking modules loaded. We saw a rate of 87k PPS (Packets Per Second). Then, we added the following rules to disable connection tracking on the UDP packets (i.e. making our firewall stateless):
iptables -t raw -I PREROUTING -j NOTRACK
iptables -t raw -I OUTPUT -j NOTRACK
With this, we saw the firewall perform at 106k PPS, a roughly 20% speedup, as I first reported in this blog post. We left connection tracking disabled for all subsequent tests.
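(A quick way to check that the NOTRACK rules are taking effect is to watch the kernel’s connection tracking entry count, which should stop growing under load while the rules are in place:

cat /proc/sys/net/netfilter/nf_conntrack_count

This file is only present while the conntrack modules are loaded.)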
Conclusion: Where possible, use a stateless firewall.
Maximum Possible Performance
We are primarily interested in the performance of blocking UDP-based attacks, so we focused on testing various DROP rules with connection state tracking disabled.
First, to get a baseline for packet dropping, we wrote a simple rule to drop packets and inserted it as the first entry in the earliest part of the iptables processing stack, the PREROUTING chain of the ‘raw’ table:
iptables -t raw -I PREROUTING -p udp -i lo -j DROP
This produced a baseline of 280k PPS for the firewall when dropping packets! That is roughly 2.6 times the 106k PPS we saw with a blank firewall accepting the same traffic. It shows that by placing blocks earlier in the process, you can bypass kernel work and gain a significant speedup.
We then did the same drop, but on the default ‘filter’ table:
iptables -I INPUT -p udp -i lo -j DROP
This produced 250k PPS, roughly a 12% drop compared to dropping in the ‘raw’ table.
Conclusion: Drop unexpected packets in your firewall rather than letting them travel further down the network stack. It doesn’t matter much whether you put the rule in the ‘raw’ table or the default ‘filter’ table, but ‘raw’ does give slightly better performance.
When we redid the ‘raw’ table test with 5 non-matching rules inserted ahead of the DROP, performance fell to 250k PPS, again 12% below the baseline.
Additionally, we concluded from this that you should try to put your most common rules towards the top of your firewall, as position does make a difference. However, it is not too significant.
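For reference, one easy way to build filler rules that never match the UDP test traffic is to match on TCP ports; the protocol and ports below are our own placeholders, not necessarily the rules used in the test:

for port in 1001 1002 1003 1004 1005; do
    iptables -t raw -I PREROUTING -p tcp --dport "$port" -j DROP
done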
ipset Performance
If you’re under attack and you insert a separate firewall rule for every IP address you wish to block, those rules are evaluated one at a time for each packet. Instead, you can use the ipset command to create a lookup table of IPs. You can specify how these are stored, and if you use the most efficient option (a hash), operations on sets of IPs scale very well. At least, that is the theory, so let’s see how it performs in practice:
ipset create blacklist hash:ip counters family inet maxelem 2000000
iptables -t raw -A PREROUTING -m set --match-set blacklist src -j DROP
First, we added just the loopback’s IP address to the set, so the rule would behave the same as the baseline. This produced 255k PPS, 10% slower than the baseline of blocking all traffic on the interface. Then, we added another 65k entries to the blacklist using the following small script:
perl -E 'for $a (0..255){ for $b (0..255) { say "add blacklist 1.1.$a.$b" } }' | ipset restore
The result was again 255k PPS. We added another 200k entries and the performance stayed the same.
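Since the set was created with the counters option, it’s easy to sanity-check the experiment: the header printed by ipset list includes the number of entries, and each member carries its own hit counters:

ipset list blacklist | head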
Conclusion: Using ipset to perform operations on groups of IPs is very efficient and scales well, making it especially useful for blacklisting large groups of IP addresses.
String Matching Performance
One of the more advanced abilities of the Linux firewall is filtering packets based on their contents (packet inspection). I have often seen DNS DDoS attacks performed against a particular domain. In that case, matching the string and blocking all packets containing it is one of the easiest ways to stop an attack in its tracks. I assumed, however, that performing a string search on every packet would be pretty inefficient:
iptables -t raw -I PREROUTING -p udp -m string --algo kmp --hex-string 'foo.com' -j DROP
(Note that this is just fake traffic; real DNS traffic doesn’t encode domains with dot separators.)
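The bm variant is the same rule with only the algorithm flag changed:

iptables -t raw -I PREROUTING -p udp -m string --algo bm --hex-string 'foo.com' -j DROP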
The string matching module offers two algorithms, kmp and bm. With the 128-byte packets from our UDP generator, bm handled 255k PPS (10% below baseline) and kmp handled 235k PPS (17% below).
Conclusion: String-based packet filtering was much more efficient than we expected and a great way to stop DDoS attacks where a common string can be found. You’ll want to use the ‘bm’ algorithm for best performance.
Hashlimit Performance
Finally, we tested the performance of hashlimit, which makes a great first layer of defense against DDoS attacks. The rule below says: “limit each /16 block (roughly 65k addresses) on the internet to sending at most 1000 PPS of UDP traffic”:
iptables -t raw -A PREROUTING -p udp -m hashlimit --hashlimit-mode srcip --hashlimit-srcmask 16 --hashlimit-above 1000/sec --hashlimit-name drop-udp-bursts -j DROP
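As an aside, the hashlimit module exposes its tracked buckets through procfs (for IPv4, a file named after --hashlimit-name), which is handy for seeing which source /16s are currently being limited:

cat /proc/net/ipt_hashlimit/drop-udp-bursts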
Bearing in mind that our server only handles 106k PPS of traffic the firewall accepts, simply letting 1k PPS through should cost roughly 1% of our capacity on its own, so we’ll use an adjusted baseline of, say, 275k PPS. With this rule in place we saw the server process 225k PPS (1k PPS accepted, the rest dropped): about 20% lower performance, in exchange for quite a bit of protection against a DDoS attack.
Conclusion: Consider using the hashlimit module to do basic automatic anti-DDoS protection on your servers. The 20% overhead is generally worth it.
Summary of Results
| Method | Performance (PPS) | Cost (%) |
| --- | --- | --- |
| Accepting all packets | | |
| Connection tracking disabled | 106k | – |
| Connection tracking enabled | 87k | 20% |
| Rejecting all packets | | |
| Dropping at very start of raw table | 280k | – |
| Dropping after 5 non-matching rules | 250k | 12% |
| Dropping in normal (filter) table | 250k | 12% |
| ipset (1–250k entries) | 255k | 10% |
| String matching: bm | 255k | 10% |
| String matching: kmp | 235k | 17% |
| Hashlimit (1k PPS accepted) | 225k | 20% |
Drawing Final Conclusions
Linux iptables is incredibly flexible and, for the most part, very scalable. Even on one of the smallest single-core virtual machines in EC2, the Linux firewall can filter over 250k PPS. In fact, if you put rules right at the entry of the netfilter stack, you can reject packets more than twice as fast as the kernel itself can process them. Even some of the modules with greater algorithmic complexity, such as hashlimit and string matching, performed much better than we expected.