Category Archives: PowerDNS

Tracking down Lua JSON decoding issues

I’ve recently been doing quite a bit of Lua scripting for a client wanting some PowerDNS customizations. I’ve actually grown to quite like Lua: even though it’s a very simple, lightweight language, you can do some quite complex programming with it reasonably straightforwardly. It could perhaps be compared to a stripped-down version of Perl, another language I very much like because of its incredible flexibility.

Anyway, as part of this work we wanted to look up incoming IP addresses in a table of non-overlapping IP address ranges. For high performance I recommended LMDB, as I’ve used it extensively before and I know that, for all its quirks and its tendency to crash if you mishandle any aspect of its API, it is very high performance and low overhead, scales very well to multiple cores, and can do pretty much anything you ask of it.

So basically the problem was “how do we store an IP range as an indexed key in LMDB” (which is just a key-value database where all keys are b-tree indexed). In the future we may want to support IPv6, and we may also want to support IP ranges which cannot be expressed in subnet-mask representation. The solution I came up with is to store the first IP of the range in raw binary format (i.e. 4 bytes for IPv4, or 16 bytes for IPv6) as the key, and then to store the end IP address as part of the value. To see whether a given IP falls within a range, you open a cursor on the table and seek to the position of the IP you are testing. If you get a direct hit, you have found the first IP of a range and you know it is valid. If you do not get a direct hit, you seek back to the previous entry (this is a great feature of LMDB and is found in surprisingly few indexed key-value data store APIs, even though it should be very simple to implement). You then take the value of that entry, read the end IP of the range, and check whether the requested IP lies between the start and end of the range.
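To make the lookup logic concrete, here is a minimal, self-contained Lua sketch of the scheme. It uses a plain sorted Lua table in place of the LMDB b-tree, and the helper names (ip4_to_key, seek_le, ip_in_ranges) are purely illustrative rather than taken from any particular LMDB binding; the real code would open an LMDB cursor and use its seek and previous-entry operations instead:

-- Convert a dotted-quad IPv4 address into its 4-byte binary key form
local function ip4_to_key(ip)
        local a, b, c, d = ip:match("^(%d+)%.(%d+)%.(%d+)%.(%d+)$")
        return string.char(tonumber(a), tonumber(b), tonumber(c), tonumber(d))
end

-- Stand-in for the LMDB table: entries sorted by their binary start key,
-- each value holding the (binary) end IP of the range
local ranges = {
        { first = ip4_to_key("1.0.0.0"),  last = ip4_to_key("1.0.129.0") },
        { first = ip4_to_key("10.0.0.0"), last = ip4_to_key("10.255.255.255") },
}
table.sort(ranges, function(x, y) return x.first < y.first end)

-- Find the entry whose start key is <= the lookup key
-- (with LMDB this is a cursor seek, stepping back one entry on a miss)
local function seek_le(key)
        local found
        for _, r in ipairs(ranges) do
                if r.first <= key then found = r else break end
        end
        return found
end

local function ip_in_ranges(ip)
        local key = ip4_to_key(ip)
        local r = seek_le(key)
        -- Binary keys compare bytewise, so <= gives correct IP ordering
        return r ~= nil and key <= r.last
end

print(ip_in_ranges("1.0.128.5"))  -- true
print(ip_in_ranges("2.0.0.1"))    -- false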

Because we wanted a very flexible and easily extensible data storage format for the values in this table, we decided to encode them as JSON. Lua has a number of JSON decoders; lua-cjson seemed pretty quick and easy, and was also available as a pre-built Ubuntu package, so we went with that. As we were storing the key’s IP address in raw binary form, we figured it would make the code path simplest if we stored the end IP address in the same manner. So we did this, wrote a test suite with some non-public IPv4 addresses (10.xxx and 127.xxx), verified that it was all working correctly, and then launched the code.

A few days later we started getting complaints from customers that some IP addresses in their network ranges were not being identified correctly. But when we reproduced the exact same setup in the test suite using our private IP ranges, it worked fine.

Finally I started using the exact IP addresses that the customers were reporting issues with in the test scripts, and discovered that there was indeed a problem. Whenever a component of the address was greater than 127 and the code did not go down the direct-hit code path (i.e. the address was part of a range larger than a /32 and not the first entry), the decoded end IP address would be incorrect. Very strange! So our test code, which was using ranges like 127.0.0.1-127.1.2.3, worked fine, but an IP range like 1.0.0.0-1.0.129.0 would fail!

Looking more closely at the lua-cjson documentation I saw the line “cjson.decode will deserialise any UTF-8 JSON string into a Lua value or table”. Reading through the C code, I saw that the routines are hard-coded to treat any JSON-escaped \uXXXX value greater than 127 as a Unicode code point to be UTF-8 encoded. This is because Lua uses the platform’s underlying char[] to store strings, which means each element of a string is normally only 8 bits; to store wider characters, they need to be encoded across multiple bytes, which is exactly what UTF-8 is for. With our encoding we knew that every part of the string would fit into 8 bits, but there was no way to tell the decoder this. Because cjson aims to be a fast module, this is all hard-coded and there was no way that I could see to easily bypass the UTF-8 decoding. We tried some other Lua JSON modules, but they either had the same problem or were orders of magnitude slower than cjson.
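A quick way to see the behaviour for yourself (a minimal sketch, assuming lua-cjson is installed and loadable as the cjson module):

local cjson = require "cjson"

-- "\u0041" (below 128) decodes to the single byte "A", but "\u0081" is treated as a
-- Unicode code point and comes back UTF-8 encoded as the two bytes 0xC2 0x81
local s = cjson.decode('"\\u0041\\u0081"')
print(#s)  -- prints 3, not the 2 bytes we actually stored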

Eventually a colleague suggested simply hex-encoding the end IP address before including it in the JSON data, which was the simplest solution we could find. It should also reduce the storage required: assuming around 50% of the bytes would otherwise need a \uXXXX escape sequence in the JSON, an average IPv4 address would take about 14 bytes in the database, whereas hex encoding is a fixed 8 bytes per IPv4 address.
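For illustration, a hex encoding along these lines is only a couple of lines of Lua (the function names here are our own):

-- Encode a raw binary IP (4 or 16 bytes) as a hex string that is safe to embed in JSON
local function ip_to_hex(raw)
        return (raw:gsub(".", function(c) return string.format("%02x", c:byte()) end))
end

-- Decode the hex string back into the raw binary form used for the LMDB keys
local function hex_to_ip(hex)
        return (hex:gsub("..", function(h) return string.char(tonumber(h, 16)) end))
end

print(ip_to_hex("\1\0\129\0"))  -- 01008100
print(#hex_to_ip("01008100"))   -- 4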

If the encoding program had been written in Perl we could probably have used some of the features of the JSON::XS module (specifically, the utf8 flag) to write characters directly as bytes into the string. Although that is perhaps not technically valid JSON, from my reading of the Lua module it should have bypassed the UTF-8 encoding of escaped values. However, we weren’t using Perl in our encoding routines, so this wasn’t possible.

Protecting an Open DNS Resolver

As another piece of work for the excellent Strongarm anti-malware team, we recently converted the service so that it can be used to get instant protection wherever you are. Part of this involved converting the core (customized) DNS server into an open resolver. This is usually strongly advised against, as you can unwittingly become part of some very serious denial-of-service attacks. In this blog post I show how to implement some pretty simple restrictions and rate limits that prevent this from happening, so you can run an open DNS resolver without taking on that risk.

Here’s a copy of the article:

One of the challenges of running an open DNS resolver is that it can be used in a number of different attacks, compared to a server that only allows access from a known set of IPs. One of the most well known is the DNS amplification attack. As this article explains, “The fact that a DNS reply may be many times larger than a DNS query allows the attacker to achieve amplification by spoofing a relatively small query that is known to generate a large answer in response”. That means that if I can send a DNS question that takes 50 bytes, and I send it pretending to be the computer that I want to attack, and the answer to that question is 1000 bytes, then I have effectively multiplied the traffic that I can attack with by 20 times. Especially as DNSSEC (Domain Name System Security Extensions) becomes more common, the RRSIG and DNSKEY record types can contain a lot of data that can be used in this type of attack.

In this post, I’d like to present a couple of ways to easily protect your open DNS resolver from being involved in malware attacks like the DNS amplification attack.

Configuring a DNS Resolver

Many DNS servers, such as PowerDNS, and frontends, such as dnsdist, have a built-in or user-configurable ability to limit some types of attacks. In the case of dnsdist, the loadbalancer sits in front of the DNS servers and monitors the traffic going to and from them in order to blacklist hosts that are abusing the platform.

However, when configuring this within Strongarm’s servers, we wanted the ultimate scalability and flexibility on our DNS infrastructure, so we decided not to use dnsdist but instead use a pure networking approach. Here are a few steps that you can take to protect your DNS infrastructure no matter whether you use a DNS loadbalancer or servers interfacing directly to the internet.

The first step you can take in protecting your server is to ensure that ANY queries cannot be used in an attack. An ANY query returns all the records for a particular domain, so it naturally returns more data than a standard query. This is usually easy to configure with an option like ‘any-to-tcp’ in PowerDNS. This setting says that if the recursive server receives an ANY query over UDP, it will automatically send back a small truncated reply that effectively tells the client “retry this over TCP”.
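For example, with PowerDNS Recursor this is a one-line configuration entry (a sketch – check your version’s documentation for the exact option name and default):

# recursor.conf: answer ANY queries received over UDP with a truncated reply,
# forcing legitimate clients to retry over TCP
any-to-tcp=yes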

To see why this helps prevent attacks, we need to understand the following three things.

  1. An ANY query will usually return larger responses as it asks for all records under a particular domain.
  2. 99% of the time, an ANY query is not legitimate traffic. Usually, a host will only want a specific type of record such as A or MX.
  3. Whereas it’s easy to spoof UDP traffic, it’s virtually impossible to spoof TCP. This is because establishing a TCP connection requires a 3-way handshake. For example, if the client says “I’d like to open a connection”, and the server says “Okay, you’d like to open a connection, it’s now open”, then the client says, “Thanks, the connection is now open”. While you can spoof the initiation of the connection, when the server says “Okay, you’d like to open a connection, it’s now open,” the host that has been spoofed will reply “What?! I didn’t ask to open a connection!” and it won’t go any further.

Putting this all together, we can see that this can be a very effective preventative measure for abusing an open DNS resolver. Legitimate clients will fall back to using TCP and attackers will simply give up. We can’t use this for all connections because having to do every DNS lookup over TCP would noticeably slow down internet browsing speed, but we can do this easily enough on connections that have a high probability of being attack traffic.

In a similar vein, another useful option for many DNS servers is the ability to limit the size of a return packet over UDP. Typically, you would configure this to say, “If the return packet is more than X bytes, send a truncated reply and only serve the full answer over TCP.”
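Again taking PowerDNS Recursor as an example, the udp-truncation-threshold setting does roughly this (the value below is illustrative; defaults vary between versions):

# recursor.conf: truncate UDP answers larger than this many bytes so clients retry over TCP
udp-truncation-threshold=1232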

Firewall Limiting of Potential Attack Traffic

In addition to doing the above, we implemented a pure firewall-based approach to throttling attack traffic. To do this, we needed to configure our firewall to be stateless, as we described how to do in a previous post.

As opposed to dnsdist or other frontend servers, this allows you to deploy either on a single server or on a frontend router that covers multiple resolvers. It should also be much more efficient, as all processing occurs in-kernel via netfilter rather than going through a userspace program which may crash or be limited in the speed at which it can process data. As we showed in a previous post, this is very efficient at packet processing.

We start by creating an ‘ipset’ of IPs that we have currently blacklisted. We’ll use the ‘timeout’ option to specify that after we have added an IP into this blacklist, it will automatically expire after a certain time. We’ll also limit it to a maximum of 100,000 IPs so that an attacker cannot use this to take our server offline:

ipset create throttled-ips hash:ip timeout 600 family inet maxelem 100000

Then, if an IP is on this list, we’ll block it from doing any UDP traffic to our server:

iptables -t raw -A PREROUTING -p udp -m set --match-set throttled-ips src -j DROP

Now for the clever part: we’ll look for DNS responses that are over a certain threshold packet size (700 bytes) and start monitoring them to see the rate at which someone is sending them:

iptables -N LARGE_DNS_PACKET_TRACKING # Create the destination chain
iptables -A OUTPUT -p udp --sport 53 \
        -m length --length 700:0xffff \
        -j LARGE_DNS_PACKET_TRACKING

This points to a new iptables chain called “LARGE_DNS_PACKET_TRACKING” which we’ll set up as follows:

iptables -A LARGE_DNS_PACKET_TRACKING -m hashlimit --hashlimit-mode dstip --hashlimit-dstmask 32 \
   --hashlimit-upto 50kb/min --hashlimit-burst 10 --hashlimit-name large-dns-packets --hashlimit-htable-max 100000 \
   -j ACCEPT

This first rule allows up to 50kb of large DNS responses per minute to a single IP (the 32 means a /32, i.e. a single IP address), and always allows the first 10 large response packets through. Again, it tracks, at most, 100,000 IPs in order to avoid an attack vector against our server.

After a host goes over this threshold, we’ll pass the traffic through to the next stage of the chain:

iptables -A LARGE_DNS_PACKET_TRACKING -j SET --add-set throttled-ips dst --timeout 600 --exist

This is where the magic happens. If the client breaches the threshold set above, its IP will be added to the ipset we created earlier, meaning that it will be blocked for 10 minutes. Finally, let’s note this in the system log and then drop the packet:

iptables -A LARGE_DNS_PACKET_TRACKING -j LOG --log-prefix "DNS-amplification protection: "
iptables -A LARGE_DNS_PACKET_TRACKING -j DROP

Conclusions

With the right protection in place, it’s not such a bad thing to run an open DNS resolver on the internet. If you look in your server’s configuration manual, you should find a few options that can also help in preventing attacks. Additionally, we recommend setting up a firewall-based system like I detailed above so that you can limit the amount of traffic you send out. Otherwise, you may easily find your server being disconnected by your ISP for being part of an attack.

Linux Stateless Firewalling for High Performance

I’m currently doing a fun bit of consulting on high performance Linux with a great company called Strongarm. I’ve written a post on their blog about how we went about adapting a standard Linux firewall to make it much more efficient and more resilient to DDoS attacks. In short: remove the connection tracking modules and easily do the tracking yourself – but watch out for hidden traps, especially on the AWS EC2 platform because it uses jumbo frames!

I’ve archived the content below:

When we were building Strongarm, we came across an interesting challenge that we hadn’t seen addressed before: how to make a Linux stateless firewall that guarantees performance and resilience. Below, I’ll explain how we went about exploring and eventually solving this problem and offer some specific tips you can apply if you are trying to achieve something similar.

Stateful Connection Tracking and Its Issues

I love Linux, and in particular its firewalling capability. The excellent iptables utility has so many extensions and features it’s hard to keep track of everything that it can do. However, one of the issues I’ve seen many times on high performance systems is that, while the Linux firewall behaves excellently in many common situations, there are some loads where it can severely degrade performance and even cripple your server.

By default, as soon as you load iptables, stateful connection tracking is enabled. This allows you to build a firewall that will totally protect your computer as simply as:

iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -P INPUT DROP

The first command allows any connections that are already established to be able to continue when the responses come in, and the second one says to drop anything else by default. This will let you talk to anyone else on the network, but not let anyone else talk to you. There are obviously much more complex firewalling setups, but unless you explicitly disable connection tracking, it will always be there keeping a list of which connections have been established to or from your computer and what their current states are.

Usually this is not a problem, but under high load or in the event of an attack, you can sometimes see the dreaded “ip_conntrack: table full, dropping packet” error in the kernel logs. This probably means that someone tried to connect to your server but couldn’t. In other words, to at least some part of the internet it looks like your server has gone offline! The limit exists to prevent DoS attacks: Linux caps the number of connections that it tracks so that it doesn’t use all of the system’s memory. You can see what your connection limit is on newer kernels by running:

$ cat /proc/sys/net/nf_conntrack_max
262144

(If you’re having this issue right now, as a temporary workaround, you can write a larger value into that file to raise the limit, as shown just below. For a longer-term solution, please read on.)
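A minimal version of that workaround (the value is only an illustration and should be sized to your available memory):

echo 5000000 > /proc/sys/net/nf_conntrack_max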

Being able to track 250,000 connections simultaneously might seem like quite a lot, but what exactly is a connection? The answer is relatively easy to define for a stateful protocol like TCP; however, for a stateless protocol such as UDP or ICMP, all you can do is say, “We didn’t see a packet in the past X seconds – I guess we don’t have a connection anymore.” How many seconds by default?

$ cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout
30

Now, protocols such as UDP or ICMP are easily spoofed (i.e. the source address can be easily faked). Since UDP has around 65,000 ports, this basically means that if you can send roughly 50,000 UDP packets (one per port) from 5 servers or spoofed addresses in a period of less than 30 seconds, you will cause the Linux connection tracking table to overflow and effectively block any other connections for that time. This is pretty easy to do. A UDP packet only needs to be 28 bytes, and 28 * 250,000 = 7MB, or roughly 56Mbit of data. In other words, on a 100Mbps connection, or from 10 people’s 10Mbps connections, you can send enough data in about 0.5 seconds to take a server offline for 29 seconds. Oops!

It’s not just deliberate attacks that can cause this, either. Many services that simply get lots of connections, especially DNS servers (because they work over UDP), can easily hit this limit because they’ve been left with stateful connection tracking enabled by default. Fortunately, this is quite straightforward to disable, and as long as you do it the correct way, there should be no downside. Moreover, as you might imagine, tracking that number of connections can take quite a lot of processing power, which is another reason to disable stateful connection tracking. In a simple test of UDP traffic we were able to achieve a 20% performance increase by disabling stateful connection tracking on our firewall.

There are, of course, some situations where you can’t use a stateless firewall. The main scenario in which a stateless firewall won’t work arises when you need to NAT traffic. This can be the case, for example, when you are configuring the server as a router. In this case, the kernel needs to keep track of all the connections flowing through the router. However, for many situations, you can convert your firewall to be stateless with very little hassle. Below, we’ll explain how to do this.

How to Convert a Stateful Firewall to Stateless

Let’s take a slightly more complex example than above for a web server:

# Allow any established connections, dropping everything else
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -P INPUT DROP
iptables -P OUTPUT DROP

# Allow remote ssh and http access
iptables -A INPUT -p tcp --dport 22 -j ACCEPT # ssh
iptables -A INPUT -p tcp --dport 80 -j ACCEPT # http

# Allow DNS lookups to be initiated from this server
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT # dns
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT # dns

Simple enough to understand (hopefully).

The first thing to do is to allow TCP (stateful) connections to keep working as they always have, but without tracking their state. We can do this by changing our “-m state” lines to look something like:

iptables -A INPUT -p tcp \! --syn -j ACCEPT
iptables -A OUTPUT -p tcp \! --syn -j ACCEPT

This means, “If the TCP packet does not have the SYN flag set (i.e. it is not trying to establish a new connection), let it through.” This will mean that all TCP connections work exactly as before.

However, because we don’t have a state for UDP connections, we have to flip the rules around. For example, in the above code, we are saying, “Allow outbound connections to port 53,” so we now need to add a rule that also states, “Allow inbound connections from port 53.” This means you will need to add the following rule:

iptables -A INPUT -p udp --sport 53 -j ACCEPT # dns responses

One other thing that recently caught us by surprise is that you need to allow certain types of ICMP signalling traffic:

iptables -A INPUT -p icmp --icmp-type destination-unreachable -j ACCEPT

(Your stateless firewall will work fine without this 99% of the time. However, at Strongarm we hit an issue on EC2 when our servers, which use jumbo frames (packet size 9000), were trying to communicate with the internet (packet size 1500) over https. EC2 tried to tell us via ICMP to make our packets smaller (path MTU discovery), but because we were automatically dropping all ICMP, we never received those messages, and so we couldn’t speak https with the internet. D’oh!)

Finally, we add in the magic that tells conntrack to not run, using the special “raw” table:

iptables -t raw -I PREROUTING -j NOTRACK
iptables -t raw -I OUTPUT -j NOTRACK

And you’re done. For an extra few lines of firewall code, you can achieve a 20% improvement in packet processing speed, lower memory usage, greater resistance to DoS attacks, and much better scalability.

Now let’s put this all together to give you a final script that you can use to build an awesome firewall:

# Stateless firewall!
iptables -t raw -I PREROUTING -j NOTRACK
iptables -t raw -I OUTPUT -j NOTRACK

# Allow any established connections, dropping everything else
iptables -A INPUT -p tcp \! --syn -j ACCEPT
iptables -A OUTPUT -p tcp \! --syn -j ACCEPT
iptables -P INPUT DROP
iptables -P OUTPUT DROP

iptables -A INPUT -p icmp --icmp-type destination-unreachable -j ACCEPT # icmp routing messages

# Allow remote ssh and http access
iptables -A INPUT -p tcp --dport 22 -j ACCEPT # ssh
iptables -A INPUT -p tcp --dport 80 -j ACCEPT # http

# Allow DNS lookups to be initiated from this server
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT # dns
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT # dns
iptables -A INPUT -p udp --sport 53 -j ACCEPT # dns responses

Blacklisting domains using PowerDNS Recursor

I recently had a client who wanted to provide a recursive DNS service within his company, but wanted to blacklist a lot of domains to redirect internally. And I mean a lot – over 1 million porn/spam/… domains. It’s one thing to use the excellent unbound recursive DNS software and set up the blocks using the local-data argument, but that required over 6GB of memory to load the list and the process crashed as a result.

As it turns out, yet again PowerDNS to the rescue. I love PowerDNS for its flexibility (so much so that I created a very high performance DNS backend for it and also run a company consulting on DNS deployments).

As the full list of domains would not fit in memory, we had to use a database. I took inspiration from a previously posted Lua script which used tinycdb. Unfortunately tinycdb requires manual compilation and so wasn’t an option, and as the client already had the list of domains in a MySQL deployment we ended up using that. Both the PowerDNS authoritative server and the recursor support Lua scripting to do pretty much anything you need, and there are a number of database options available to Lua. I started off using luadbi as it seemed to have a nicer interface, but unfortunately luadbi only supports Lua 5.1, whereas the Debian build of PowerDNS uses Lua 5.2. This meant switching to luasql, which is a bit lower-level.

So, the following Lua script will redirect any blacklisted domains or their subdomains (stored in the MySQL table ‘domains’ with a ‘name’ column) to 127.0.0.1:

-- Open the luasql MySQL environment once at script load time
driver = require "luasql.mysql"
env = assert( driver.mysql() )

-- Called by the PowerDNS recursor for every incoming query before normal resolution
function preresolve ( remoteip, domain, qtype )
        con = assert(env:connect("database_name", 'username', 'password'))

        -- Strip the trailing dot from the query name
        domain = domain:gsub("%.$", "")

        -- Walk up the name one label at a time, so a blacklisted parent
        -- domain also catches all of its subdomains
        while domain ~= "" do
                local sth = assert (con:execute( string.format("SELECT 1 FROM domains WHERE name = '%s'", con:escape( domain )) ) )
                if sth:fetch() then
                        -- Blacklisted: answer immediately with an A record pointing at 127.0.0.1
                        return 0, { { qtype=pdns.A, content="127.0.0.1" } }
                end

                -- Remove the leftmost label and try again
                domain = domain:gsub("^[^.]*%.?", "")
        end

        -- Not blacklisted: tell PowerDNS to resolve the query as normal
        return -1, {}
end

As establishing a MySQL connection on each request is quite a high overhead, it might well be worth switching to SQLite in the future; this should be very simple to do by just changing the driver name and connection string.
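As a rough sketch of that change (the database file path here is hypothetical; luasql provides an sqlite3 driver with the same interface, and the connection could then be opened once at load time rather than per query):

-- Sketch: the same lookup against SQLite via luasql; only the driver and connect call change
driver = require "luasql.sqlite3"
env = assert( driver.sqlite3() )
con = assert( env:connect("/var/lib/powerdns/domains.sqlite") )  -- hypothetical file path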

Ultra-high performance with PowerDNS

I love PowerDNS; it’s so flexible, through its multiple backends, that you can do pretty much whatever you want with it. When we had a nightmare weekend at 123-reg several years ago, I designed and built the next generation of the DNS infrastructure on PowerDNS, because it could connect straight into our main DNS databases and we could easily change the schema etc. without issues.

However, because of this flexibility, PowerDNS sometimes has performance issues. When designing the second generation of high-performance DNS servers for the HostEurope group, I tested a number of other open-source servers. The highest performing by far was nsd; however, it used a lot of memory and wasn’t able to give us the flexibility that we required.

Looking at PowerDNS, it had a number of scaling issues – up to 4 or 6 cores it was fine, but after that it just couldn’t scale any further. So, working with tools such as valgrind and perf, I identified a number of scaling issues in the distributor code and created a series of patches which were rolled into PowerDNS Authoritative 3.3 and 3.4 and enable PowerDNS to scale easily to 32 or 64 cores. I also came across a Linux kernel locking issue (actually a side effect of an improvement in the packet handling code) which meant that the boxes couldn’t handle more than about 250k PPS before thread contention on the socket lock hit a limit. Fortunately, with Linux 3.9 the SO_REUSEPORT patches that the Google guys had produced a few years earlier were finally incorporated, and so SO_REUSEPORT support was also added to PowerDNS.

Finally, I found a very high performance, very reliable key/value database (LMDB) – SQLite was limited to about 3 or 4 cores, BDB always corrupts itself, Kyoto Cabinet is nice but updates cause a lot of contention and I’m not sure it’s still being actively developed. The result is a high-performance backend for PowerDNS (lmdbbackend) which can scale to over 1m QPS per server with instant updates and low response times. The best way to avoid an embarrassing DNS DDoS is to have enough capacity to respond to all queries that are thrown at the server; after that you can try to filter the big hitters. Unfortunately, many anti-DDoS appliances that we tested turned out to fail quite badly on UDP-based attacks, or attacks at more than 1m PPS, so it’s often best not to put them in front of infrastructure that may sustain such an attack.

If you want professional consulting and assistance with setting up a high-performance resilient DNS infrastructure please contact me through my consultancy company dns-consultants.com