Category Archives: Linux

Easily switch between KVM and VirtualBox virtual machines

Wednesday, December 14, 2016 Mark 3 Comments

I’ve recently done quite a bit of Android development, and I’ve also been working with a client whose local virtual environment runs on Oracle’s VirtualBox. As a result, I found myself switching between KVM and VirtualBox quite frequently, which unfortunately means unloading one set of kernel modules and loading the other. So I wrote the shell script below to switch between the two platforms. Simply put it in a directory in $PATH (I always have ~/bin there for my user-local scripts) and call it something like switch_vm. Use it like:

switch_vm virtualbox
switch_vm kvm

Here’s the script:

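#!/bin/bash
# Switch the loaded hypervisor kernel modules between KVM and VirtualBox.
# Note: kvm_intel assumes an Intel CPU; AMD systems use kvm_amd instead.
VM=$1

case "$VM" in
    kvm)
        # Unload the VirtualBox modules (vboxdrv last, the others depend on it)
        sudo rmmod vboxpci vboxnetflt vboxnetadp vboxdrv
        sudo modprobe kvm_intel
        ;;
    virtualbox)
        # Unload KVM (kvm_intel first, as it depends on kvm)
        sudo rmmod kvm_intel kvm
        for i in vboxdrv vboxpci vboxnetflt vboxnetadp; do
            sudo modprobe "$i"
        done
        ;;
    *)
        echo "Usage: $(basename "$0") kvm|virtualbox" >&2
        exit 1
        ;;
esac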

KDE, Linux, Ubuntu

KUbuntu double login screen

Monday, December 12, 2016 Mark Leave a comment

More a note to myself, but hopefully it will be helpful to others too! After updating to KUbuntu 16.04 a few months back, I had an issue where closing the lid to suspend the laptop and then opening it again presented a GNOME login/password screen followed by a KDE 5 password screen. When I suspended using the button, and at some other times, it only showed the KDE 5 Plasma unlock prompt. After some googling around I found an SO thread which pointed me in the right direction.

In the org.gnome.desktop.screensaver section of dconf-editor, I deselected the idle-activation-enabled, lock-enabled and ubuntu-lock-on-suspend options, and I’ve not had this issue since.
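The same change can probably be made from a terminal with gsettings rather than by clicking through dconf-editor. This is a sketch and not what I originally did: it assumes the three options are boolean keys in the org.gnome.desktop.screensaver schema, and the GSETTINGS variable is just a hypothetical hook so that the commands can be previewed with echo before running them for real.

```shell
# Sketch: switch off the three GNOME screensaver lock options.
# Assumption: all three are boolean keys in org.gnome.desktop.screensaver.
# GSETTINGS is a hypothetical override, e.g. GSETTINGS=echo to preview.
disable_gnome_lock() {
    local key
    for key in idle-activation-enabled lock-enabled ubuntu-lock-on-suspend; do
        "${GSETTINGS:-gsettings}" set org.gnome.desktop.screensaver "$key" false
    done
}
```

Running GSETTINGS=echo disable_gnome_lock prints the three commands; running disable_gnome_lock on its own applies them.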

AWS, High Performance Linux, IPTables, Linux, Networking

Testing the Performance of the Linux Firewall

Monday, August 8, 2016 Mark Leave a comment

Over on the Strongarm blog I’ve got an in-depth article about testing the performance of the Linux firewall. We knew it was fast, but how fast? The answer, perhaps surprisingly, is “significantly faster than the kernel’s own packet handling”: blocking packets to an unbound UDP port was 2.5 times faster than accepting them, and in that case a single-core EC2 server managed to process almost 300kpps. We also tested the performance of blocking DDoS attacks using ipset and/or string matching.

I’ve archived the article below:

As we mentioned in a previous post, at Strongarm, we love the Linux firewall. While a lot of effort has gone into making the firewall highly flexible and suitable for nearly every use case, there is very little information out there about the actual performance of the firewall. There are many studies on the performance of the Linux network stack, but they don’t get to the heart of the performance of the firewall component itself.

With the core part of the Strongarm product focused on making highly reliable DNS infrastructure, we thought we’d take a look at the Linux firewall, especially as it relates to UDP performance.

Preparing to Test

To begin, we wrote a small C program that sends 128-byte UDP packets very efficiently to the loopback device on a single source/destination port. We tested this with one of the latest Ubuntu 16.04 4.4-series kernels on a small single-core EC2 instance. We ran each test multiple times over a period of 30 seconds to ensure that the CPU was pegged at 100% throughout, verifying that we were CPU-bound. We then averaged the performance results to present the values shown here, although, because this was all software-based on the same box, there was very low deviation from the average between test runs.

Here are the results we found and some potential conclusions you can draw from them to optimize your own firewall.

NOTRACK Performance

First, we ran the packet generator with no firewall installed but with the connection tracking modules loaded. We saw a rate of 87k PPS (Packets Per Second). Then, we added the following rules to disable connection tracking on the UDP packets (i.e. converting our firewall to be stateless):

iptables -t raw -I PREROUTING -j NOTRACK
iptables -t raw -I OUTPUT -j NOTRACK

With this, we saw the firewall perform at 106k PPS, or a 20% speedup, as I first reported in this blog post. We then disabled connection state tracking for all future tests.

Conclusion: Where possible, use a stateless firewall.

Maximum Possible Performance

We are primarily interested in the performance of blocking UDP based attacks, so we focused on testing various DROP rules with connection state tracking disabled.

First, to get a baseline performance with regard to packet dropping, we wrote a simple rule to drop packets and put it as the first entry in the very first part of the iptables processing stack, the PREROUTING chain of the “raw” table:

iptables -t raw -I PREROUTING -p udp -i lo -j DROP

This produced a baseline performance of 280k PPS for the firewall when dropping packets! That is roughly 2.6 times the 106k PPS we saw with a blank firewall in place. This shows that by putting blocks earlier on in the process, you can bypass kernel work, resulting in a significant speedup.

We then did the same drop, but on the default ‘filter’ table:

iptables -I INPUT -p udp -i lo -j DROP

This produced a performance of 250k PPS, or roughly a 12% performance drop.

Conclusion: Drop unexpected packets in your firewall, rather than letting them go further down the network stack. It doesn’t matter as much if you put them in the ‘raw’ table or the default ‘filter’ table, but raw does give slightly better performance.

When we redid this with the ‘raw’ table and inserted 5 non-matching rules, the performance dropped to 250k PPS, or 12% less compared to the baseline.

Additionally, we concluded from this that you should try to put your most common rules towards the top of your firewall, as position does make a difference. However, it is not too significant.

ipset Performance

If you’re under attack and you insert a rule in the firewall for every single IP address you wish to block, these rules are processed one-at-a-time for each packet. So instead of doing this, you can use the ipset command to create a lookup table of IPs. You can specify how these are to be stored, but if you use the most efficient option (a hash) you can make operations on sets of IPs scale very well. At least, this is the theory, so let’s see how it performs in practice:

ipset create blacklist hash:ip counters family inet maxelem 2000000
iptables -t raw -A PREROUTING -m set --match-set blacklist src -j DROP

First, we just added in the loopback’s IP address so it would behave the same as the baseline. This produced a result of 255k PPS, or 10% slower than the baseline of blocking all traffic on this interface. Then, we added another 65k entries to the blacklist using the following small script:

perl -E 'for $a (0..255){ for $b (0..255) { say "add blacklist 1.1.$a.$b" } }' |ipset restore

The result was again 255k PPS. We added another 200k entries and the performance stayed the same.
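For reference, the same restore input can be generated in plain shell, which makes it easy to vary the address range when adding further batches. This is only a sketch: the function name gen_blacklist and its two arguments are made up for illustration, and it simply mirrors the Perl one-liner above.

```shell
# Sketch: emit "add <set> <prefix>.a.b" for every a and b in 0..255
# (65536 lines), in the format that `ipset restore` reads on stdin.
# gen_blacklist and its arguments are illustrative names, not from ipset.
gen_blacklist() {
    local prefix="${1:-1.1}" set_name="${2:-blacklist}" a=0 b
    while [ "$a" -le 255 ]; do
        b=0
        while [ "$b" -le 255 ]; do
            printf 'add %s %s.%d.%d\n' "$set_name" "$prefix" "$a" "$b"
            b=$((b + 1))
        done
        a=$((a + 1))
    done
}
```

Piping gen_blacklist 1.1 into ipset restore (as root) would recreate the 65k entries; calling it with other prefixes such as 1.2 or 1.3 would add further batches.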

Conclusion: Using ipsets to do operations on groups of IPs is very efficient and scales well, especially for blacklisting groups of IP addresses.

String Matching Performance

One of the more advanced abilities of the Linux firewall is filtering packets based on their contents (packet inspection). I have often seen DNS DDoS attacks performed against a particular domain. In this case, matching the string and blocking all packets containing this string is one of the easiest ways to stop an attack in its tracks. But I assumed that performing a string search in every packet would be pretty inefficient:

iptables -t raw -I PREROUTING -p udp -m string --algo kmp --hex-string 'foo.com' -j DROP

(Note that this is just fake traffic. Real DNS traffic doesn’t encode domains with dot separators)
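To match a real query name, the domain has to be written the way DNS puts it on the wire: each label is preceded by a one-byte length, and the name ends with a zero byte. As a rough sketch (the helper name dns_hex_string is made up for illustration), the following converts a domain into the notation that --hex-string accepts, where bytes between | characters are given in hex:

```shell
# Sketch: convert a domain name to DNS wire format, written in the
# |hex|text notation used by the iptables string match's --hex-string.
# dns_hex_string is a hypothetical helper, not part of iptables.
dns_hex_string() {
    local name="$1" out="" label old_ifs="$IFS"
    IFS=.
    # split the name into its labels on the dots
    set -- $name
    IFS="$old_ifs"
    for label in "$@"; do
        # one length byte (two hex digits), then the label text itself
        out="$out$(printf '|%02x|%s' "${#label}" "$label")"
    done
    # a zero-length label terminates the name
    printf '%s|00|\n' "$out"
}
```

For example, dns_hex_string foo.com prints |03|foo|03|com|00|, which could then be passed to the rule as --hex-string "$(dns_hex_string foo.com)". Bear in mind that the match is on raw bytes, so this assumes the query uses the same letter case.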

There are two possible algorithms (kmp and bm) in the string matching module. The UDP generator was sending 128-byte packets. The results were that bm handled 255k PPS (10% lower than baseline) and kmp handled 235k PPS (17% lower).

Conclusion: String-based packet filtering was much more efficient than we expected and a great way to stop DDoS attacks where a common string can be found. You’ll want to use the ‘bm’ algorithm for best performance.

Hashlimit Performance

Finally, we tested the performance of hashlimit, which is a great first layer of defense against DDoS attacks. We will say “Limit each /16 (65,000 server) block on the internet to only be allowed to send 1000 PPS of UDP traffic”:

iptables -t raw -A PREROUTING -p udp -m hashlimit --hashlimit-mode srcip --hashlimit-srcmask 16 --hashlimit-above 1000/sec --hashlimit-name drop-udp-bursts -j DROP

Bearing in mind that our server only handles 106k PPS of traffic not blocked by the firewall, we would expect a ~1% performance drop just because we’re now allowing 1k PPS through, so we’ll use a baseline of, say, 275k PPS. With this rule, we saw 225k PPS processed by the server (1k PPS accepted, the rest rejected), so 20% lower performance, but it will give quite a bit of protection against a DDoS attack.

Conclusion: Consider using the hashlimit module to do basic automatic anti-DDoS protection on your servers. The 20% overhead is generally worth it.

Summary of Results

Method	Performance (PPS)	Cost (%)
Accepting all packets
Connection tracking disabled	106k	–
Connection tracking enabled	87k	20%

Rejecting all packets
Dropping at very start of raw table	280k	–
Dropping after 5 rules	250k	12%
Dropping in normal (filter) table	250k	12%
Ipset (1-250k entries)	255k	10%
String matching: bm	255k	10%
String matching: kmp	235k	17%
Hashlimit (1k PPS accepted)	225k	20%
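
The cost column can be recomputed from the throughput figures. The sketch below assumes that cost means the drop relative to the baseline, i.e. (baseline - measured) / baseline. The percentages in the table look rounded, so the output may differ from them by a point or two. The function name pps_cost is hypothetical.

```shell
# Sketch: percentage drop of a measured rate relative to a baseline rate.
# Assumption: cost = (baseline - measured) / baseline * 100.
pps_cost() {
    awk -v base="$1" -v pps="$2" \
        'BEGIN { printf "%.1f\n", (base - pps) / base * 100 }'
}

# Rates in thousands of packets per second, taken from the table above
pps_cost 280 250   # dropping in the filter table vs. the raw-table baseline
pps_cost 280 235   # kmp string matching vs. the raw-table baseline
```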

Drawing Final Conclusions

Linux iptables are incredibly flexible, and for the most part, very scalable. Even on one of the smallest single-core virtual machines in EC2, the Linux firewall can handle filtering over 250k PPS. In fact, if you put rules right at the entry of the netfilter stack, you can reject packets over twice as fast as the kernel itself can process them. Even some of the modules which have greater algorithmic complexity, for example hashlimit and string matching, performed much better than we expected.

AWS, High Performance Linux, Linux, PowerDNS

Linux Stateless Firewalling for High Performance

Thursday, July 14, 2016 Mark 2 Comments

I’m currently doing a fun bit of consulting on high performance Linux with a great company called Strongarm. I’ve written a post on their blog about how we went about adapting a standard Linux firewall to make it much more efficient and more resilient to DDoS attacks. In short, remove the connection tracking modules and handle the state yourself, which is easy to do, but watch out for hidden traps, especially on the AWS EC2 platform because it uses jumbo frames!

I’ve archived the content below:

When we were building Strongarm, we came across an interesting challenge that we hadn’t seen addressed before: how to make a Linux stateless firewall that guarantees performance and resilience. Below, I’ll explain how we went about exploring and eventually solving this problem and offer some specific tips you can apply if you are trying to achieve something similar.

Stateful Connection Tracking and Its Issues

I love Linux, and in particular its firewalling capability. The excellent iptables utility has so many extensions and features it’s hard to keep track of everything that it can do. However, one of the issues I’ve seen many times on high performance systems is that, while the Linux firewall behaves excellently in many common situations, there are some loads where it can severely degrade performance and even cripple your server.

By default, as soon as you load iptables, stateful connection tracking is enabled. This allows you to build a firewall that will totally protect your computer as simply as:

iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -P INPUT DROP

The first command allows any connections that are already established to be able to continue when the responses come in, and the second one says to drop anything else by default. This will let you talk to anyone else on the network, but not let anyone else talk to you. There are obviously much more complex firewalling setups, but unless you explicitly disable connection tracking, it will always be there keeping a list of which connections have been established to or from your computer and what their current states are.

Usually this is not a problem, but under high performance situations or in the event of an attack, you can sometimes see the dreaded “ip_conntrack: table full, dropping packet” error in the kernel logs. This probably means that someone tried to connect to your server but couldn’t. In other words, it seems to at least some part of the internet that your server has gone offline! The purpose of this is to prevent DoS attacks. Linux limits the number of connections that it tracks so that it doesn’t use all of the system’s memory. You can see what your connection limit is on newer kernels by running:

$ cat /proc/sys/net/nf_conntrack_max
262144

(If you’re having this issue right now, as a temporary workaround, you can write a larger value, perhaps echoing 5000000 into that file to raise the limit. For a longer-term solution, please read on.)

Being able to track 250,000 connections simultaneously might seem like quite a lot, but what exactly is a connection? The answer is relatively easy to define for a stateful protocol like TCP, however in a stateless protocol such as UDP or ICMP, all you can do is say, “We didn’t see a packet in the past X seconds – I guess we don’t have a connection anymore.” How many seconds by default?

$ cat /proc/sys/net/netfilter/nf_conntrack_udp_timeout
30

Now, protocols such as UDP or ICMP are easily spoofed (i.e. the source address can be easily faked). Since UDP has roughly 65,000 ports, this basically means that if you can send roughly 50,000 UDP packets (one per port) from each of 5 servers or spoofed addresses in a period of less than 30 seconds, you will cause the Linux connection tracking table to overflow and effectively block any other connections for that time. This is pretty easy to do. A UDP packet only needs to be 28 bytes, and 28 * 250,000 = 7MB, or about 56Mbit of data. In other words, on a 100Mbps connection, or from 10 people’s 10Mbps connections, you can send enough data in about 0.5 seconds to take a server offline for 29 seconds. Oops!
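
The arithmetic can be checked with a few lines of shell. This is just a worked version of the figures used here (28-byte packets, a table of roughly 250,000 entries, a 100Mbps link); the function and variable names are mine.

```shell
# Worked example: how much traffic is needed to fill the conntrack table?
# Arguments: packet size in bytes, number of table entries, link speed in Mbps.
conntrack_flood() {
    local packet_bytes="$1" entries="$2" link_mbps="$3"
    local total_bytes=$(( packet_bytes * entries ))
    local total_mbit=$(( total_bytes * 8 / 1000000 ))
    # 1 Mbps is 1000 bits per millisecond, so this is whole milliseconds
    local send_ms=$(( total_bytes * 8 / (link_mbps * 1000) ))
    echo "$total_bytes bytes, $total_mbit Mbit, $send_ms ms"
}

conntrack_flood 28 250000 100
```

With those inputs it works out at 7,000,000 bytes, 56 Mbit and 560 ms, i.e. a little over half a second of traffic.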

It’s not just deliberate attacks that can cause this, either. Many services that simply get lots of connections, especially DNS servers (because they work over UDP), can easily hit this limit because they’ve been left with stateful connection tracking enabled by default. Fortunately, this is quite straightforward to disable, and as long as you do it the correct way, there should be no downside to it. Moreover, as you might imagine, tracking a number of connections can take quite a lot of processing power, which is another reason to disable stateful connection tracking. In a simple test of UDP traffic we were able to achieve a 20% performance increase by disabling stateful connection tracking on our firewall.

There are, of course, some situations where you can’t use a stateless firewall. The main scenario in which a stateless firewall won’t work arises when you need to NAT traffic. This can be the case, for example, when you are configuring the server as a router. In this case, the kernel needs to keep track of all the connections flowing through the router. However, for many situations, you can convert your firewall to be stateless with very little hassle. Below, we’ll explain how to do this.

How to Convert a Stateful Firewall to Stateless

Let’s take a slightly more complex example than above for a web server:

# Allow any established connections, dropping everything else
iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -A OUTPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -P INPUT DROP
iptables -P OUTPUT DROP

# Allow remote ssh and http access
iptables -A INPUT -p tcp --dport 22 -j ACCEPT # ssh
iptables -A INPUT -p tcp --dport 80 -j ACCEPT # http

# Allow DNS lookups to be initiated from this server
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT # dns
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT # dns

Simple enough to understand (hopefully).

The first thing to do is to allow TCP (stateful) connections to keep working as they always have, but without tracking their state. We can do this by changing our “-m state” lines to look something like:

iptables -A INPUT -p tcp \! --syn -j ACCEPT
iptables -A OUTPUT -p tcp \! --syn -j ACCEPT

This means, “If the TCP connection is already established, let it through,” (i.e. it doesn’t have the SYN flag set). This will mean that all TCP connections work exactly as before.

However, because we don’t have a state for UDP connections, we have to flip the rules around. For example, in the above code, we are saying, “Allow outbound connections to port 53,” so we now need to add a rule that also states, “Allow inbound connections from port 53.” This means you will need to add the following rule:

iptables -A INPUT -p udp --sport 53 -j ACCEPT # dns responses

One other thing that recently caught us by surprise is that you need to allow certain types of ICMP signalling traffic:

iptables -A INPUT -p icmp --icmp-type destination-unreachable -j ACCEPT

(Your stateless firewall will work fine without this 99% of the time. However, at Strongarm, we hit an issue on EC2 where our servers, which use jumbo frames (packet size 9000), were trying to communicate with the internet (packet size 1500) over https. EC2 tried to tell us via ICMP to make our packets smaller (path MTU discovery), but because we were dropping all ICMP automatically, we didn’t receive those messages, and so we couldn’t speak https with the internet. D’oh!)

Finally, we add in the magic that tells conntrack to not run, using the special “raw” table:

iptables -t raw -I PREROUTING -j NOTRACK
iptables -t raw -I OUTPUT -j NOTRACK

And you’re done. For an extra few lines of firewall code, you can achieve a 20% performance improvement in packet processing speed, lower memory usage, greater resistance to DoS attacks, and much better scalability.

Now let’s put this all together to give you a final script that you can use to build an awesome firewall:

# Stateless firewall!
iptables -t raw -I PREROUTING -j NOTRACK
iptables -t raw -I OUTPUT -j NOTRACK

# Allow any established connections, dropping everything else
iptables -A INPUT -p tcp \! --syn -j ACCEPT
iptables -A OUTPUT -p tcp \! --syn -j ACCEPT
iptables -P INPUT DROP
iptables -P OUTPUT DROP

iptables -A INPUT -p icmp --icmp-type destination-unreachable -j ACCEPT # icmp routing messages

# Allow remote ssh and http access
iptables -A INPUT -p tcp --dport 22 -j ACCEPT # ssh
iptables -A INPUT -p tcp --dport 80 -j ACCEPT # http

# Allow DNS lookups to be initiated from this server
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT # dns
iptables -A OUTPUT -p tcp --dport 53 -j ACCEPT # dns
iptables -A INPUT -p udp --sport 53 -j ACCEPT # dns responses

asus, Linux

Manually changing ASUS wireless router firmware versions

Saturday, June 25, 2016 Mark Leave a comment

I recently got an Asus RT-AC3200 wireless router for a project I’m experimenting with. Naturally, the first thing I did was to flash it with some custom firmware, in this case DD-WRT, which worked fine but was missing a feature that I needed. I then wanted to change back to the stock firmware, however trying to use the standard upgrade method didn’t work for some reason. Asus offers a recovery tool via their website, but only for Windows. However, looking in the forums it turns out that it’s pretty simple to do this manually (and on Linux).

First, turn the router off. Then turn it on while holding the reset button for a few seconds. It will then boot up into a recovery mode where you can ping it and send files to it but that’s about all. So, plug your computer in, manually configure your (wired) interface to 192.168.1.2 (I don’t know if this is actually required or not) and run:

$ tftp 192.168.1.1
tftp> mode octet
tftp> put RT-AC3200_380.60_beta2.trx
Sent 43192320 bytes in 125.8 seconds

Then reboot your router and the firmware has been forcibly updated! It’s probably worth doing a ‘factory reset’ after installing any new firmware as the nvram differs between versions.

High Performance Linux, Linux

Maximizing rsync throughput in 2 easy commands

Friday, May 27, 2016 Mark 4 Comments

rsync is an excellent tool for copying files between Linux systems. However, it doesn’t yet have the ability to run multiple copy processes in parallel. This means that if you are limited by the speed at which you can read filesystem metadata (i.e. list the files), or you have a very fast network connection and lots of cores on both servers, you can significantly speed up copying by running several rsync processes in parallel yourself. For example, one process can copy files at perhaps 50MB/sec, however with a 16-core server, 1gbps network connection and a fast SSD array you can copy data at 1GB/sec (gigabytes). Here’s how:

Firstly, you need to get ssh set up so you can connect between the machines without using a password. Even if you are copying between two remote systems and you use ssh-agent key forwarding (which I highly recommend), this can become a significant bottleneck so it’s best to do the following and generate a new key on the source system:

ssh-keygen -f rsync

Hit enter when it prompts for a passphrase so that the key is generated without needing a password to open it. This will create two files, rsync which is your private key and rsync.pub which you want to add to your authorized keys on the remote host using something like:

ssh-copy-id -i rsync.pub user@remote_host

You should then be able to ssh without needing a password by doing:

ssh -i rsync user@remote_host

Next, we need to go to the remote host and allow lots of ssh sessions to be opened at once; open up /etc/ssh/sshd_config on remote_host and append or change these lines:

MaxSessions 100
MaxStartups 100

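Now, on your source host, run the following command in the directory containing the key so that rsync uses the ssh key you just created. The variable needs to be exported so that the rsync processes started later can see it, and the key path needs to be absolute because we change directory before copying:

export RSYNC_RSH="ssh -i $PWD/rsync"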

Now for the good stuff – first we need to mirror the directory structure but not the files:

rsync -za --include='*/' --exclude='*' /local/path/ user@remote_host:/remote/path/

And now we can do the copy in parallel (you might need to install the parallel command using something like apt-get install parallel):

cd /local/path/; find -L . -type f | parallel -j 30 rsync -za {} user@remote_host:/remote/path/{}

# To exclude some files append as many of these as you want to the find command: \! -name file_to_exclude

This will copy 30 files in parallel, using compression. Play with the exact number, but 1.5 times the number of cores in your boxes should be enough. You can monitor the disk bandwidth with iostat -mx or the network throughput with a tool like iptraf. One of those, or the CPU usage should now be saturated, and your copy should be going as fast as is physically possible. You can re-run this afterwards to synchronise even quicker than a normal rsync, however you won’t be able to use it to delete files.
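
Rather than hard-coding 30, the job count could be derived from the core count using the 1.5 times rule of thumb. A small sketch (the function name rsync_jobs is hypothetical; nproc is part of GNU coreutils):

```shell
# Sketch: 1.5 jobs per CPU core, rounded down, with a minimum of 1.
# With no argument the core count of the local machine is used.
rsync_jobs() {
    local cores="${1:-$(nproc)}"
    local jobs=$(( cores * 3 / 2 ))
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    echo "$jobs"
}
```

You could then pass -j "$(rsync_jobs)" to parallel in place of -j 30, and still adjust it by hand if neither the disks, the network nor the CPU end up saturated.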

Linux, PostgreSQL

Running lots of postgres commands in parallel

Thursday, May 26, 2016 Mark 1 Comment

Postgres is great, however one limitation is that you can only run one command at a time in the shell. Sometimes however when you are doing administrative functions over multiple tables, for example (re)creating indexes or vacuuming and you have a nice powerful box, you can run many of these commands in parallel for easy speedup. Here’s an easy way to run lots of commands in parallel.

Firstly, create a text file with one command per line. For example

vacuum full a;
vacuum full b;
vacuum full c;
vacuum full d;

Then, ensure that you have your .pgpass file set up correctly so that you can just run psql [database] [user] without being prompted for a password.

Finally, run the following command:

xargs -d "\n" -n 1 -P 20 psql database_name username -c < list_of_commands.txt

-P 20 specifies the number of jobs to run in parallel so change this to what your server can cope with.
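
The command file itself can be generated instead of typed by hand. A minimal sketch, assuming a hypothetical helper called vacuum_commands and table names that don’t need quoting (such as the example tables a to d above):

```shell
# Sketch: print one "vacuum full <table>;" statement per argument,
# one per line, in the format expected by the xargs command above.
# Assumption: the table names need no quoting or schema prefix.
vacuum_commands() {
    local table
    for table in "$@"; do
        printf 'vacuum full %s;\n' "$table"
    done
}
```

For example, vacuum_commands a b c d > list_of_commands.txt recreates the file shown earlier.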

Databases, High Performance Linux, Linux, PostgreSQL

Finding the amount of space wasted in postgres tables

Wednesday, May 25, 2016 Mark Leave a comment

Because of the way that postgres handles transaction isolation (i.e. using MVCC), when you modify or delete a row in a table it marks it as deleted, and then frees the space at a later point in time using (auto)vacuum. However, unless you use the heavy-weight VACUUM FULL command (which exclusively locks the table and totally rewrites it, causing anything trying to access it to block until the command is finished) the space is generally not returned to the operating system. Normally this is not a problem: if you have a heavily used table with 20MB of data in it, it probably has 5-10MB of overhead from dead rows, reclaimed free space etc., which is acceptable. However there are a few situations where it is useful to know what exactly the overhead is:

1. Sometimes if your table changes very quickly, is large, and your disks or autovacuum parameters are unable to keep up, it can end up growing massive. For example we had a table that contained 3GB of data but was taking up 45GB because autovacuum couldn’t keep up with the frequency of changes in the table.
2. If you are using table partitioning to store historic data then to make the most use of space you want to see whether a VACUUM FULL would be advantageous to run or not. For example if you have a table that is recording data collected from each day, some days it may be mostly just inserts so doesn’t need vacuuming; other days it may have a number of changes made and so have quite a lot of free space that can be reclaimed. Additionally, VACUUM FULL rewrites the table compactly and rebuilds its indexes, which can make it more performant.

In the first case, looking at the output of a command like

SELECT
    psut.relname,
    to_char(psut.last_vacuum, 'YYYY-MM-DD HH24:MI') as last_vacuum,
    to_char(psut.last_autovacuum, 'YYYY-MM-DD HH24:MI') as last_autovacuum,
    pg_class.reltuples::bigint AS n_tup,
    psut.n_dead_tup::bigint AS dead_tup,
    CASE WHEN pg_class.reltuples > 0 THEN
        (psut.n_dead_tup / pg_class.reltuples * 100)::int
    ELSE 0
    END AS perc_dead,
    CAST(current_setting('autovacuum_vacuum_threshold') AS bigint) + (CAST(current_setting('autovacuum_vacuum_scale_factor') AS numeric) * pg_class.reltuples) AS av_threshold,
    CASE WHEN CAST(current_setting('autovacuum_vacuum_threshold') AS bigint) + (CAST(current_setting('autovacuum_vacuum_scale_factor') AS numeric) * pg_class.reltuples) < psut.n_dead_tup THEN
        '*'
    ELSE ''
    END AS expect_av
FROM pg_stat_user_tables psut
    JOIN pg_class on psut.relid = pg_class.oid
ORDER BY 5 desc, 4 desc;

(sorry I can’t remember where I found this) should show you that there are a very large number of dead tuples waiting to be reclaimed (ie turned in to free space) in the table.

However, if your disks were struggling at one point, but then you tweaked autovacuum so it reclaimed the dead tuples correctly (as in case 1 above), your table could now be 90% free space but there is no easy way to find this out within postgres.

Fortunately, there is an excellent extension called pgstattuple (installed with CREATE EXTENSION pgstattuple) which allows you to find out the amount of free space within a table file that has been reclaimed but not released to the operating system. The following query lists all tables which are over 100MB in size, have more than 10MB of free space and have more than 20% free space (you can tweak these numbers; I just did it for our platform where our typical table size is 1GB+):

select
    table_schema,
    table_name,
    free_percent,
    pg_size_pretty( free_space ) AS space_free,
    pg_size_pretty( pg_relation_size( quoted_name ) ) AS total_size
from (
    select
        table_schema,
        table_name,
        quoted_name,
        space_stats.approx_free_percent AS free_percent,
        space_stats.approx_free_space AS free_space
    from
        ( select *,
            quote_ident( table_schema ) || '.' || quote_ident( table_name ) AS quoted_name
            from information_schema.tables
            where
                table_type = 'BASE TABLE' and table_schema not in ('information_schema', 'pg_catalog')
                and pg_relation_size( quote_ident( table_schema ) || '.' || quote_ident( table_name ) ) > 100000000
        ) t, pgstattuple_approx( quoted_name ) AS space_stats
) t
where
    free_percent > 20
    AND free_space > 10000000
ORDER BY free_space DESC;

This only uses an approximate count, however even so it can be a bit slow (it just took 10 minutes here) on a system with many tables and heavy IO. You can use this to find the tables that would most benefit from a VACUUM FULL command being run.

Databases, High Performance Linux, Linux, Mongo DB, NoSQL, PostgreSQL

How we tweaked Postgres upsert performance to be 2-3* faster than MongoDB

Friday, January 8, 2016 Mark 8 Comments

As we all know, relational databases are fine if you’re dealing with small amounts of data, but for web-scale high performance, high insert speeds and masses of queries per second, NoSQL is what you need. At least, that’s the conventional wisdom/hype surrounding NoSQL databases such as MongoDB. However, as we’ve recently discovered, this is very wrong indeed.

Following that conventional wisdom, we went with MongoDB for our first attempt at a system that needed to support a very high number of upserts: we were aiming for 50-100k upserted rows per second of unstructured data, to be exact. Initial attempts with MongoDB on some very nice hardware (2TB SSD SAN) produced about 10k upserts per second, which is good but not great. This number increased when we started using the batched upsert operations that were added in MongoDB 2.6, and increased even further with the advent of MongoDB 3.0 and the WiredTiger storage backend to 30-50k upserts/second. Very nice!

However, there are a number of issues with MongoDB in particular, and NoSQL databases in general have limitations in areas that relational (SQL) databases make easy. Want to do aggregation, joins, distinct row selection from an index, or complex data manipulations before transmitting data over the network? Easy with SQL, virtually impossible with NoSQL. So, we started to think that if we could find a relational database with even vaguely good insert performance it would make life and development tasks so much easier that perhaps it would be worth the effort to change. As postgres generally seems to be able to handle anything you can throw at it, including support for a binary JSON data type, we started looking there.

In order to avoid changing our database code very much, so that we could compare as closely with MongoDB as possible, I created a really simple schema for the tables: the column(s) required for a primary key, and a jsonb column to store the record. One great feature of the recently released Postgres 9.5 is the single-statement upsert ability, which reduces latency and code complexity, so all testing was done on this version. As Postgres is designed from the ground up to be reliable and not lose data (unlike MongoDB which has fsync disabled by default), it is very slow if you are just doing single inserts. When you start batching these into transactions of say 100 or 1000 entries you can achieve a higher throughput. If you want to compare it with MongoDB though, you can choose to disable the fsync at the end of every transaction by using the command

SET LOCAL synchronous_commit TO OFF

This means that in the event of a power failure you may lose some transactions that returned successfully; however, you will never risk getting database corruption (unlike with MongoDB). By setting this one command you get performance in line with MongoDB – 50k upserts per second. However, this was pushing 500Mb/sec of disk bandwidth, which was saturating even the SSD array.
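As a rough sketch of what such a batched upsert can look like (the table name `records`, its `id` column and the sample rows are made up for illustration, not our real schema):

```sql
-- Hypothetical minimal schema: a primary key plus a jsonb column holding the record
CREATE TABLE IF NOT EXISTS records (
    id   bigint PRIMARY KEY,
    data jsonb NOT NULL
);

BEGIN;
-- SET LOCAL only lasts until the end of this transaction:
-- COMMIT returns without waiting for the WAL to be flushed to disk
SET LOCAL synchronous_commit TO OFF;

-- Postgres 9.5+ single-statement upsert, several rows per statement
INSERT INTO records (id, data) VALUES
    (1, '{"host": "a.example", "hits": 10}'),
    (2, '{"host": "b.example", "hits": 3}')
ON CONFLICT (id) DO UPDATE
    SET data = EXCLUDED.data;
COMMIT;
```

Note that SET LOCAL has no effect outside a transaction block, and that a single INSERT ... ON CONFLICT DO UPDATE statement will raise an error if the same key appears twice in its VALUES list, so each batch needs to be de-duplicated first.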

But Postgres has a few more tricks up its sleeve. One of the downsides of Postgres being incredibly protective of data is that it has a lot of writes – first it writes all commands to the ‘Write-Ahead Log’ (WAL) and then it updates the underlying data and indexes, and the WAL is where much of this additional disk IO was coming from. Starting in Postgres 9.1 however there is a feature called ‘unlogged tables’ which avoids writing to the WAL, at the cost of losing the table data after an unclean shutdown. Enabling this is as simple as changing your ‘CREATE TABLE’ statement to ‘CREATE UNLOGGED TABLE’. When I switched to using unlogged tables, disk IO went to virtually nil for most of the run (only at the end, when it flushed the tables to disk, did it dump several hundred MB of data – the database and indexes) and we saw a throughput of 85k upserts per second – roughly twice that of Mongo. The downside of this is that after an unclean shutdown, unlogged tables are automatically truncated as they may be in an inconsistent state. My proposal would be to allow a method to mark an unlogged table as read-only in order to ‘freeze’ its state, hence negating the need to truncate after a crash; however, apparently this would be a bit tricky to implement.
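For illustration, a minimal sketch of an unlogged table (the table and column names here are hypothetical). The ALTER TABLE forms are available from Postgres 9.5 onwards and rewrite the whole table:

```sql
-- Hypothetical table: same shape as a normal table, but its changes skip the WAL
CREATE UNLOGGED TABLE records_unlogged (
    id   bigint PRIMARY KEY,
    data jsonb NOT NULL
);

-- Check how a table is stored: 'u' = unlogged, 'p' = permanent (logged)
SELECT relname, relpersistence
FROM pg_class
WHERE relname = 'records_unlogged';

-- Postgres 9.5+: switch an existing table between the two modes
ALTER TABLE records_unlogged SET LOGGED;
ALTER TABLE records_unlogged SET UNLOGGED;
```

Switching a table to LOGGED once a bulk load has finished is one way of keeping the data safe from truncation afterwards, at the price of writing the table out to the WAL at that point.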

Because we want to do lots of searches with our data we rely heavily on indexes. On Postgres these appear to entail quite a bit more overhead than on MongoDB. When I added 8 additional indexes to the unlogged table, performance suddenly dropped from 85k upserts/sec to 35k upserts/sec. Fortunately there’s an easy solution to this in our case – create a table with just a single index for the primary key. Then create a ‘Materialized view’ which is a clone of the table, and add the indexes to that. When your updates have finished, simply run “REFRESH MATERIALIZED VIEW”. This takes perhaps a minute to run on our data of 2m rows (as it is a single query, it is presently bound to a single CPU), but it means that we get full upsert performance and the indexes then get updated inside a single transaction. That should hopefully lead to less fragmentation, better overall performance and less overall CPU usage, at the cost of the materialized view lagging behind updated data by about 1 minute – an acceptable tradeoff in our case. By default “REFRESH MATERIALIZED VIEW” exclusively locks the materialized view while it is being updated, but you can fix this at the cost of some additional CPU cycles by changing that to “REFRESH MATERIALIZED VIEW CONCURRENTLY”. Your commands then look like:

-- Run these once:
CREATE MATERIALIZED VIEW real_table AS select * from unlogged_pk_only_table;
CREATE INDEX ON real_table (…);

-- Run this when real_table updates complete
REFRESH MATERIALIZED VIEW CONCURRENTLY real_table;
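One thing to watch for: the CONCURRENTLY variant only works if the materialized view has at least one unique index that covers all rows and uses only plain column names. A minimal sketch, assuming the primary key column of unlogged_pk_only_table is called `id` (a made-up name, as is the index name):

```sql
-- Hypothetical column name: substitute the real primary key column(s)
CREATE UNIQUE INDEX real_table_id_idx ON real_table (id);

-- Fails with an error if no suitable unique index exists on real_table
REFRESH MATERIALIZED VIEW CONCURRENTLY real_table;
```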

So we’ve now got performance twice the speed of Mongo without tweaking any Postgres performance parameters – can we do better? It turns out that some basic settings improve performance pretty significantly:

shared_buffers = 3GB # Give postgres lots more memory to work with
effective_io_concurrency = 8 # We are on a nice RAID SAN – we can do multiple reads/writes at once

The following setting is key as well – we don’t want to commit at the end of every single transaction because we frequently have a massive number of transactions on the same table. Instead, wait a short amount of time and then commit along with any other transactions on the same table in a single batch:

commit_delay = 100000 # 100ms

With these settings, an unlogged table with a single primary key and a jsonb column can run at 100k upserts/second, roughly three times the insert performance of MongoDB.
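If you would rather not edit postgresql.conf by hand, the same values can be set from a superuser session with ALTER SYSTEM (available since Postgres 9.4). This is just a sketch of one way to apply them, not how we necessarily deployed them:

```sql
-- ALTER SYSTEM cannot be run inside a transaction block
ALTER SYSTEM SET shared_buffers = '3GB';        -- only takes effect after a server restart
ALTER SYSTEM SET effective_io_concurrency = 8;
ALTER SYSTEM SET commit_delay = 100000;         -- in microseconds, i.e. 100ms

-- Reload the configuration for the settings that don't need a restart
SELECT pg_reload_conf();

-- Check what the server is actually using
SHOW commit_delay;
```

Bear in mind that commit_delay only kicks in when at least commit_siblings other transactions are active at commit time (5 by default), so it does nothing for a single client working on its own.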

As I mentioned above, one of the issues with upserting lots of data into a single table was that we were seeing massive IO spikes. It turns out that these can be significantly reduced by changing the checkpointing settings too. There’s a small sentence in the manual that is key here: “To ensure data page consistency, the first modification of a data page after each checkpoint results in logging the entire page content” – so if you checkpoint too frequently (and the default is after only 80MB of commands) while you are still updating lots of entries in the table, your entire table is basically rewritten to the WAL after each checkpoint. To work around this (at the cost of a longer recovery time after an unclean shutdown) we want to reduce the frequency of checkpoints with regard to the amount of data written (it will still write a checkpoint every 5 minutes by default anyway):

max_wal_size = 10GB
min_wal_size = 1GB

With these settings, even a standard logged table now runs at 80-90k upserts/sec and only outputs around 50Mb/sec of data.
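To see whether checkpoints are being triggered by the timer or by the volume of WAL written, you can look at the checkpoint counters. This sketch assumes the pg_stat_bgwriter view as it exists in the 9.x releases discussed here; newer major versions have moved these counters elsewhere:

```sql
-- checkpoints_timed: started because checkpoint_timeout expired
-- checkpoints_req:   started early, e.g. because max_wal_size was reached
SELECT checkpoints_timed, checkpoints_req, buffers_checkpoint
FROM pg_stat_bgwriter;

-- The settings currently in effect
SHOW max_wal_size;
SHOW checkpoint_timeout;
```

If checkpoints_req keeps climbing during a bulk load then checkpoints are still being forced by WAL volume, and raising max_wal_size further may help.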

TL;DR, one of the main areas of hype about NoSQL has been to do with the performance of inserting data. However, not only does Postgres 9.5 provide all the functionality of a mature and flexible SQL database, with a few tweaks it also handles upserts at two to three times the speed of MongoDB with better data consistency and lower memory and CPU usage.

KDE, Linux

Scrolling back in kde console history with shift plus page-up via the number pad

Tuesday, December 29, 2015 Mark Leave a comment

I’ve just got a nice new laptop which is a bit wider than my previous one and so has a number pad on the right-hand side. Because I don’t need another set of numbers I have ended up using the number pad’s end/home/page up/page down keys quite a bit, as using them elsewhere on the keyboard requires combining them with the function key. However, when trying to scroll back in the KDE terminal (Konsole) you need to press shift-page up or shift-page down, and trying this combination on the number pad only typed the numbers. Fortunately I found an option buried in the KDE keyboard setup that fixes this – go to Keyboard Settings -> Advanced and then ‘Miscellaneous compatibility options’. Tick the ‘Shift with numeric keypad keys works as in MS Windows’ option and you’re done. While you’re at it you can also disable caps lock or change it to be something like escape by looking under the ‘Caps Lock key behaviour’ option in the menu.

Mark's blog