All posts by Mark

I'm a full-stack Linux consultant from the UK specializing in high-performance systems, DNS and databases. I have also written, and led teams producing, a number of web/mobile apps. I'm fluent in English and Turkish.

How we improved database performance 1000% with just one command

In my current contract we have a nice database server with 4 datacentre-quality SSDs arranged in RAID 5, a battery-backed (Dell PERC / megaraid) controller – and very poor disk performance. While doing a dump of the database (with parallel processes, obviously), iostat was showing 100% disk utilization at 200-300 iops and 40-60MB/sec. Really not what it should be. After poking the RAID card a lot with the MegaCli tool, I discovered that changing the Linux IO scheduler to noop or deadline suddenly gave us great performance – 50% disk utilization at 1300 iops and 200MB/sec read. So, an issue with the CFQ scheduler then, as everywhere seems to suggest? In fact not – it turns out the root of the problem is that, because of the RAID controller, the kernel did not detect that the underlying disks were SSDs rather than HDDs. After tweaking this one setting, CFQ provides the same performance as the noop or deadline schedulers (in fact better for our postgres workload). So, here's the command we used to get a 1000% improvement in database performance:

echo 0 > /sys/block/sdb/queue/rotational
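To make this persist across reboots, one option is a udev rule that sets the attribute whenever the device appears – a minimal sketch, assuming the device really is sdb (the rule filename below is just an example):

# /etc/udev/rules.d/60-ssd-rotational.rules
ACTION=="add|change", KERNEL=="sdb", ATTR{queue/rotational}="0"

You can check the current value at any time with cat /sys/block/sdb/queue/rotational.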

Getting value/count grouping in a single row with postgres

In one of my contracts we’re doing some very complex data storage and querying using jsonb in postgres. One common display in the system we’re replacing is basically something like:

SELECT
  sum( IF( val between 91 and 92, 1, 0 ) ) val_91,
  sum( IF( val between 92 and 93, 1, 0 ) ) val_92,
  sum( IF( val between 93 and 94, 1, 0 ) ) val_93,
  ...
  GROUP BY somecol

Which is horrible, inefficient and doesn’t let you know if anything occurred outside the range. As we’re already doing a lot with JSON, I was wondering if rather than creating new columns, we could return these counts in a JSON format. It turns out that with postgres this is really easy – here’s how:

First, we want to create a function that, given the existing aggregated value and a new key to increment, returns a new jsonb object:

CREATE FUNCTION agg_count_values( state jsonb, key text ) RETURNS jsonb STRICT AS $$
    SELECT state || jsonb_build_object(
        key,
        -- extract the current value, or 0 if the key doesn't exist yet
        COALESCE(( jsonb_extract_path_text( state, key ) )::int, 0) + 1
    );
$$ language SQL;

Then, you define the aggregate:

CREATE AGGREGATE count_values( text ) (
    sfunc = agg_count_values,
    stype = jsonb,
    initcond = '{}'
);

Note the STRICT on the function (meaning NULL arguments are not allowed): the aggregate we created detects this and will never pass NULL values into the function – NULL inputs are simply skipped – so we don't need to special-case them.
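To illustrate the NULL handling, a quick hypothetical check (the values here are made up):

SELECT count_values( v )
  FROM ( VALUES ('93'), (NULL), ('93') ) AS t(v);

-- returns {"93": 2} – the NULL row is simply skipped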

If you then run the actual query, first casting the floating-point value to an integer and then to text for use as the JSON key:

SELECT count_values( val::int::text )
  GROUP BY somecol;

You get a JSON result back with the counts:

{"93": 1, "94": 2, "95": 1, "96": 5, "97": 2, "98": 4, "99": 6, "100": 6, "101": 7, "102": 5, "103": 5, ...}
...

Obviously your first port of call when faced with something like this is 'can I group by the value?', but in this case we wanted each row to be based on something different, so we needed the sub-aggregation of the values.
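For completeness, here is a minimal end-to-end sketch of the pattern – the table and its contents are made up purely for illustration:

CREATE TABLE readings ( somecol text, val double precision );

INSERT INTO readings (somecol, val)
VALUES ('sensor_a', 93.2), ('sensor_a', 94.1), ('sensor_a', 94.9), ('sensor_b', 100.4);

SELECT somecol, count_values( val::int::text ) AS counts
  FROM readings
 GROUP BY somecol;

-- sensor_a | {"93": 1, "94": 1, "95": 1}   (note that ::int rounds, so 94.9 counts as 95)
-- sensor_b | {"100": 1}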

Happy days

How we tweaked Postgres upsert performance to be 2-3× faster than MongoDB

As we all know, relational databases are fine if you're dealing with small amounts of data, but for web-scale high performance, high insert speeds and masses of queries per second, NoSQL is what you need. At least, that's the conventional wisdom/hype surrounding NoSQL databases such as MongoDB. However, as we've recently discovered, this is very wrong indeed.

Following that conventional wisdom, we went with MongoDB for our first attempt at a system that needed to support a very high number of upserts – we were aiming for 50-100k upserted rows per second of unstructured data, to be exact. Initial attempts with MongoDB on some very nice hardware (2TB SSD SAN) produced about 10k upserts per second – good but not great. That number increased when we started using the batched upsert operations added in MongoDB 2.6, and increased even further with the advent of MongoDB 3.0 and the WiredTiger storage backend, to 30-50k upserts/second. Very nice!

However, there are a number of issues with MongoDB in particular, and limitations of NoSQL databases in general, for things that relational (SQL) databases make easy – want to do aggregation, joins, distinct row selection from an index, or complex data manipulation before transmitting data over the network? Easy with SQL, virtually impossible with NoSQL. So, we started to think that if we could find a relational database with even vaguely good insert performance, it would make life and development so much easier that it might be worth the effort to change. As Postgres generally seems to be able to handle anything you can throw at it, including support for a binary JSON data type, we started looking there.

In order to avoid changing our database code very much, so that we could compare as closely with MongoDB as possible, I created a really simple schema for the tables – the column(s) required for a primary key, and a jsonb column to store the record. One great feature of the recently released Postgres 9.5 is the single-statement upsert, which reduces latency and code complexity, so all testing was done using it. As Postgres is designed from the ground up to be reliable and not lose data (unlike MongoDB, which has fsync disabled by default), it is very slow if you are just doing single inserts. When you start batching these into transactions of, say, 100 or 1000 entries you can achieve much higher throughput. If you want to compare it with MongoDB though, you can disable the fsync at the end of every transaction using the command

SET LOCAL synchronous_commit TO OFF

This means that in the event of a power failure you may lose some transactions that returned successfully, but you will never risk database corruption (unlike with MongoDB). By setting this one command you get performance in line with MongoDB – 50k upserts per second. However, this was pushing 500MB/sec of disk bandwidth, which was saturating even the SSD array.
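To make that concrete, here is a rough sketch of the kind of schema and batched upsert transaction described above – the table and column names are hypothetical, not our actual ones:

-- Primary key column(s) plus a jsonb blob for the record
CREATE TABLE records ( id bigint PRIMARY KEY, data jsonb );

-- One batch of upserts using the Postgres 9.5 INSERT ... ON CONFLICT syntax
BEGIN;
SET LOCAL synchronous_commit TO OFF;  -- don't wait for the WAL flush at commit
INSERT INTO records (id, data) VALUES
    (1, '{"foo": "bar"}'),
    (2, '{"baz": 42}')
    -- ... hundreds or thousands of rows per batch ...
ON CONFLICT (id) DO UPDATE SET data = EXCLUDED.data;
COMMIT;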

But Postgres has a few more tricks up its sleeve. One of the downsides of Postgres being incredibly protective of data is that it does a lot of writes – first it writes all commands to the Write-Ahead Log (WAL), then it updates the underlying data and indexes, and the WAL was where much of this additional disk IO was coming from. Starting with Postgres 9.1, however, there is a feature called 'unlogged tables' which avoids writing to the WAL, at the cost of losing the table data after an unclean shutdown. Enabling this is as simple as changing your 'CREATE TABLE' statement to 'CREATE UNLOGGED TABLE'. When I switched to using unlogged tables, disk IO went to virtually nil for most of the run (only at the end, when it flushed the tables to disk, did it dump several hundred MB of data – the database and indexes) and we saw a throughput of 85k upserts per second – roughly twice that of Mongo. The downside is that after an unclean shutdown, unlogged tables are automatically truncated as they may be in an inconsistent state. My proposal would be to allow a method to mark an unlogged table as read-only in order to 'freeze' its state, hence negating the need to truncate after a crash; however, apparently this would be a bit tricky to implement.
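Using the hypothetical table from the earlier sketch, that change is just:

CREATE UNLOGGED TABLE records ( id bigint PRIMARY KEY, data jsonb );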

Because we want to do lots of searches on our data we rely heavily on indexes. On Postgres these appear to entail quite a bit more overhead than on Mongo: when I added 8 additional indexes to the unlogged table, performance suddenly dropped from 85k upserts/sec to 35k upserts/sec. Fortunately there's an easy solution in our case – create a table with just a single index for the primary key. Then create a 'materialized view', which is a clone of the table, and add the indexes to that. When your updates have finished, simply run "REFRESH MATERIALIZED VIEW". This takes perhaps a minute to run on our data of 2m rows (as it is a single query, it is presently bound to a single CPU), but it means that we get full upsert performance, and the indexes are then updated inside a single transaction, which should hopefully lead to less fragmentation, better overall performance and lower overall CPU usage – at the cost of the materialized table lagging behind the updated data by about a minute, an acceptable tradeoff in our case. By default "REFRESH MATERIALIZED VIEW" exclusively locks the materialized view while it is being updated, but you can fix this, at the cost of some additional CPU cycles, by changing that to "REFRESH MATERIALIZED VIEW CONCURRENTLY" (note that the CONCURRENTLY form requires at least one unique index on the materialized view). Your commands then look like:

-- Run these once:
CREATE MATERIALIZED VIEW real_table AS select * from unlogged_pk_only_table;
CREATE INDEX ON real_table (…);

-- Run this when real_table updates complete
REFRESH MATERIALIZED VIEW CONCURRENTLY real_table;

So we’ve now got performance twice the speed of Mongo without tweaking any postgres performance parameters – can we do better? It turns out that some basic settings help improve performance pretty significantly:

shared_buffers = 3GB # Give postgres lots more memory to work with
effective_io_concurrency = 8 # We are on a nice RAID SAN – we can do multiple reads/writes at once

The following setting is key as well – we don't want to flush to disk at the end of every transaction, because we frequently have masses of transactions hitting the same table. Instead, wait a short amount of time and then commit together with any other transactions on the same table in a single batch:

commit_delay = 100000 # 100ms

With these settings, an unlogged table with a single primary key and a jsonb column can run at 100k upserts/second, roughly three times the insert performance of MongoDB.

As I mentioned above, one of the issues with upserting lots of data into a single table was that we were seeing massive IO spikes. It turns out that this can be significantly reduced by changing the checkpointing settings too. There's a small sentence in the manual that is key here: "To ensure data page consistency, the first modification of a data page after each checkpoint results in logging the entire page content" – basically, if you checkpoint too frequently (and the default is after only 80mb of commands) and you are still updating lots of entries in the table after each checkpoint, your entire table effectively gets rewritten to the WAL. To work around this (at the cost of longer recovery times after an unclean shutdown) we want to reduce the frequency of checkpoints with regard to the amount of data written (a checkpoint will still be written every 5 minutes by default anyway):

max_wal_size = 10GB
min_wal_size = 1GB

With these settings, even a standard logged table now runs at 80-90k upserts/sec and only writes around 50MB/sec of data.

TL;DR: one of the main areas of hype around NoSQL has been the performance of inserting data. However, not only does Postgres 9.5 provide all the functionality of a mature and flexible SQL database, with a few tweaks it also handles upserts at two to three times the speed of MongoDB, with better data consistency and lower memory and CPU usage.

Scrolling back in KDE console history with shift plus page-up via the number pad

I've just got a nice new laptop which is a bit wider than my previous one and so has a number pad on the right-hand side. Because I don't need another set of numbers, I have ended up using the end/home/page up/page down keys on it quite a bit, as to use them elsewhere you have to combine them with the function key. However, to scroll back in the KDE terminal (konsole) you need to press shift-page up or shift-page down, and trying this combination of keys only displayed numbers. Fortunately I found an option buried in the KDE keyboard setup that fixes this – go to Keyboard Settings -> Advanced and then 'Miscellaneous compatibility options'. Tick 'Shift with numeric keypad keys works as in MS Windows' and you're done. While you're at it you can also disable caps lock, or change it to something like escape, by looking under the 'Caps Lock key behaviour' option in the same menu.

Debugging perl memory leaks

In perl you hardly ever have to think about memory management. It has a great garbage collector and, aside from the issue of it not usually passing freed memory back to the OS (it reuses it for new objects instead), it doesn't have much in the way of memory issues. One exception, however, is if you create objects with circular references and forget to weaken one of them, meaning there is a reference cycle and the memory can never be garbage collected. Fortunately weakening is very easy with Moo or Moose:

has back_ref => ( is => 'rw', isa => Object, weak_ref => 1 );

If you start seeing memory leaks, it’s really simple to debug; just use the excellent Devel::Leak::Object at the top of your perl code:

use Devel::Leak::Object qw{ GLOBAL_bless };

Then, when your script exits, it will print a list of all non-garbage-collected objects still in existence. Look for classes with a suspiciously high count and that's where your memory leak is. Easy!

Hiding the list of sites on your server

Following on from switching my server to HTTPS/SSL with the excellent Let's Encrypt free SSL certificate authority, I noticed that after I enabled SSL on the first domain, connecting via HTTPS to any of the other domains made the browser come up with an error like "You tried to get to site xxx.com but the certificate was issued for yyy.com". I'm not an expert on the specifics of the HTTPS protocol and certificates, but I don't really want people easily being able to get a list of all the virtual hosts on my server. If you use the default Let's Encrypt client to get one certificate for all domains on your server, then when the certificate is passed to the client they will be able to see all the domains anyway; however, if you issue a certificate for each virtual host as per the script in my other post, you can at least restrict what people see.

To make it even more secure and prevent even one valid certificate from being shown by default, you can create a new default vhost which presents a dummy certificate. To do this, first create a random self-signed certificate:

openssl req -x509 -nodes -days 2000 -newkey rsa:2048 -keyout /etc/apache2/default.key -out /etc/apache2/default.crt

Just hit enter for all of the questions. Then, create a file in /etc/apache2/sites-enabled called 00default-ssl.conf and place the following directives in it:

# Default self-signed cert to mask what certificates are on the server
<VirtualHost *:443>
    SSLEngine on
    SSLCertificateFile /etc/apache2/default.crt
    SSLCertificateKeyFile /etc/apache2/default.key
</VirtualHost>

Job done!

A source of random perl crashes

Recently we rolled out a new version of code to production and started observing a lot of segfaults related to freeing memory:

Dec  5 14:11:06 XXX kernel: [268032.637417] perl[62948]: segfault at 7ffc3e26dff8 ip 00007f296a21f6ef sp 00007ffc3e26e000 error 6 in libc-2.19.so[7f296a1a3000+19f000]

Researching a bit further, I found that this was caused by some code which was meant to automatically save an object to the database when it was destroyed, if it had not already been saved:

# Catch objects being destroyed that have not been correctly flushed
sub DESTROY {
  shift->flush_to_database
}

Researching further, I discovered that the issue is to do with the way perl handles global destruction. During a normal object destroy phase this code works just fine; however, when the process is exiting and objects are being destroyed, you don't know in which order they will be destroyed. In this case I assume that the database object we tried sending the object to had already been destroyed/garbage collected, which is what was causing the crash. Fortunately there's quite a simple work-around:

# Catch objects being destroyed that have not been correctly flushed.
sub DESTROY {
  if( ${^GLOBAL_PHASE} eq 'DESTRUCT' ) {
    warn "Unflushed objects cannot be flushed when process is exiting";
  } else {
    shift->flush_to_database
  }
}

If you want to support perl < 5.14 (which does not have ${^GLOBAL_PHASE}) you should use the Devel::GlobalDestruction module. Moo provides this more easily via the DEMOLISH subroutine:

# Catch objects being destroyed that have not been correctly flushed.
sub DEMOLISH {
  my ($self, $in_destruction) = @_;
  if( $in_destruction ) {
    warn "Unflushed objects cannot be flushed when process is exiting";
  } else {
    $self->flush_to_database
  }
}

We had another similar issue when using Log4perl – if you try to access it from within DESTROY, the logger object may already have been destroyed and the process balloons in memory until it is OOM'd. Again, a check for global destruction inside your logging function handles this just fine.
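A minimal sketch of such a guard (the method and attribute names here are made up, not our real code):

# Hypothetical logging helper that refuses to touch the logger during global destruction
sub log_info {
  my ($self, $message) = @_;
  return if ${^GLOBAL_PHASE} eq 'DESTRUCT';   # the logger object may already be gone
  $self->{logger}->info($message);
}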

Mongo DB Pain

In my latest contract we're gathering a lot of data very quickly (100k inserts/sec would not be unreasonable) and trying to save it into a database. Before I started the contract the decision had been made to use MongoDB for this very reason, so when I joined I was glad to finally get a chance to play with some NoSQL software – up until now I've just used MySQL and PostgreSQL.

There is some software, and some programming languages, where the more you use them the more they pleasantly surprise you. AngularJS is one of those – you start with basic templating and form-element binding but quickly discover there's a mass of great plugins available. PostgreSQL is another case in point. I had been using MySQL quite happily for years, but since starting to use PostgreSQL for some geolocation projects (with the excellent PostGIS extension), I've started using it by default for any new projects or designs. At the start it seems like just another database server with some geospatial additions, but then you think "perhaps I can get the database to validate my input data before inserting". No problem for Postgres: you can specify a regexp constraint, for example CHECK (mac ~ '^[0-9a-f]{12}$'::text), to validate a MAC address (though you probably want to use the built-in macaddr type). Good luck doing this in MySQL – by default it doesn't even care about inserting the wrong data type into a column, it just silently mangles it… Or perhaps you want to import data from another data source into Postgres – again, Foreign Data Wrappers (FDW) come to the rescue, connecting to pretty much any other database or data source you could want.
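For example, a table using that MAC address constraint might look something like this (the table and columns are purely illustrative):

CREATE TABLE devices (
    -- reject anything that is not exactly 12 lower-case hex digits
    mac text CHECK ( mac ~ '^[0-9a-f]{12}$' ),
    seen_at timestamptz DEFAULT now()
);

-- This insert is rejected by the CHECK constraint:
-- INSERT INTO devices (mac) VALUES ('not-a-mac');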

With Mongo, however, the more I've learnt of it the less impressed I've become. Sure, you can do some nice searches without needing schemas, and it's pretty high performance, but once you go past simple store/retrieve operations there are some pretty basic issues that have not been addressed. Here are two I've come across in the past week:

Firstly, we have a big table (500m records) with data in a composite primary key (id, timestamp). We have the index on that, and we want to get a list of all the ids in the table. No problem, you think – it's a simple btree; just walk it and you'll pull out the ids. And yes, it should be like that; however there's no easy way to do this in Mongo. The first thing you see in the documentation is that there is a .distinct() function or a .group() function. Great – to get the distinct ids just run db.collection.distinct('id'). However, rather than doing something sensible and returning the distinct fields via a cursor, it tries to return them all in a single BSON object, which is capped at 16MB in size. So, if you have fewer than perhaps 1 million distinct ids in your table it will work; once you go over some magic threshold, your entire process needs redesigning.

Then you notice that in the latest versions of MongoDB, the .aggregate function can return a cursor. So, something like:

db.collection.aggregate( [ { $group: { _id: "$id" } } ], { cursor: {} } )

should do it. Technically speaking, yes; however an open bug report from 4 years ago explains that the $group stage can't actually use indexes. If you add explain: true to the aggregate you can see that it does a full collection scan. But perhaps, if you are clever and force the sort, like:

db.collection.aggregate( [ { $sort: { id: 1 } }, { $group: { _id: "$id" } } ], { cursor: {} } )

it will use the index, and indeed the explain output shows that it does. However, even then it has to scan through the whole table before it returns the first item to the cursor! So, if your distinct result could be more than 16MB of data in a large table, there doesn't seem to be any way to get the data out of the database in a timely manner.

Another issue that I ran into and reported upstream is to do with very poor index selection in the aggregate pipeline. See the ticket for more details, but if you run an aggregate query on a collection with the foo column as a unique key:

db.col.aggregate([ { "$match" : { "foo" : "abcd" } }, { "$sort" : { "_id" : 1 } } ])

for some reason Mongo doesn't do the obvious thing and use the first stage in the aggregate pipeline to choose the index. Instead, it chooses to use the _id index, meaning that the $match does a full table scan to find the one result that could have been instantly looked up if it had used the correct index. The equivalent .find().sort() query doesn't have this issue.

There was also a fun issue I discovered whereby when you get a hot row in the database (for example lots of processes trying to use $inc to update the same counter), performance gets absolutely killed by the spinlocks on the data structure. Rather than 50k upserts/sec on the cluster, with this hot row it went down to about 3k upserts/sec.

This is not to say that Mongo is a bad product – certainly, looking at posts from a few years ago, it seems to have come on leaps and bounds since then. However, it's not the be-all and end-all that many NoSQL champions would claim. And I'm not convinced that the other NoSQL-type databases would be all that much better – they are fine for basic store/fetch operations, but as you get closer to needing the functionality of a traditional RDBMS they are just too immature. Other issues for us recently have been the lack of JOINs, and the lack of schema which means a simple typo in a query can cause it to hang doing a full table scan for a key that doesn't exist, etc. These are not problems with Mongo per se, but rather an issue with the incredible flexibility that NoSQL products offer.

Would I use Mongo again in other projects? Yes probably, but only if we had a requirement for very high performance or schema-less flexibility that something like PostgreSQL simply couldn’t provide.

Easy guide to free SSL with Let's Encrypt

I was excited to hear the other day that the beta of the Let's Encrypt project had gone live. For those of you that don't know, this is a great project that provides free SSL certificates to any website, with the aim of finally moving much of the internet to SSL. However it's not quite as easy to use as I had hoped, so I've documented the process and the issues I ran into below, with the aim of helping out other people wanting to do the same. My setup is Apache on Ubuntu Linux, but it should hold for most other flavours of Linux.

Firstly, I downloaded and installed letsencrypt on my server as per the instructions:

git clone https://github.com/letsencrypt/letsencrypt
cd letsencrypt

I had thought that simply running their advised command of:

./letsencrypt-auto --apache

would convert my apache setup to use HTTPS and provide a certificate for each domain on my server. It doesn't. Whilst it's a great tool, it is still in beta, and even after that I'm not sure whether the auto-configuration functionality will be included. My first issue was that it presented a long list of all the domains in my apache config, some of which were live and some weren't, and you have to select which you want to get a certificate for. However, if you select any that are not accessible or not supported (i18n domains, i.e. those beginning xn--, or wildcard domains) then the process dies with an error and you have to restart selecting which domains you want. This is clearly an issue they could fix pretty simply in the code, and hopefully that will happen before long. Also, you have to renew the SSL certs every three months, and I believe you have to go through this same process again each time rather than just hitting a renew button.

Another issue that I ran up against, which is probably more of a design decision, is that it issues you one certificate per server. This is fine if you are just hosting one set of domains; however, if you have multiple virtual host entries or host multiple sites that are logically separate, you probably don't want to go down this route.

So, because of these issues I gave up on the automatic apache client and wrote a little script that gets a certificate for each virtual host configuration file, excluding i18n domain names, wildcards, localhost and optionally any others. You should be able to just run this script once every few months and it will automatically regenerate all your SSL certificates without you having to change any of your config:

for conf in /etc/apache2/sites-enabled/*; do
    # Build a "-d domain" argument for every ServerName/ServerAlias in this vhost,
    # skipping i18n (xn--), wildcard, localhost and any other excluded domains
    cmd=$(perl -nE 'next unless /Server(Name|Alias)/i; next if /xn--|\*|localhost|127|my-other-nonregistered-domain.com.../i; $d=(split " ", $_)[1]; print "-d $d "' $conf)
    if [ ! -z "$cmd" ]; then
        echo $conf
        /root/letsencrypt/letsencrypt-auto certonly --apache $cmd
    fi
done

Just save that to a file like /etc/apache2/regen_ssl.sh and run it with bash, or paste it directly into your shell, and you will then get a load of directories under /etc/letsencrypt/live/ with your certificates in them. Each directory is named after the first ServerName configuration option in the particular virtual host configuration file.

To get the certificates auto-renewing you want to edit the crontab to execute this every few months:

# Refetch all certificates every 2 months
4 4 1 */2 * bash /etc/apache2/regen_ssl.sh

BUT, the process doesn't end there – you have to set up apache for each domain you want to enable SSL on. First, enable SSL globally:

a2enmod ssl
service apache2 restart

Also make sure to update your firewall to allow HTTPS (port 443) traffic. Then for each virtual host you should change the header line from:

<VirtualHost *:80>

to

<VirtualHost *:80 *:443>

Then you need to enable SSL and specify the certificates:

    SSLEngine on
    SSLCertificateFile    /etc/letsencrypt/live/mark.zealey.org/cert.pem
    SSLCertificateKeyFile /etc/letsencrypt/live/mark.zealey.org/privkey.pem
    SSLCertificateChainFile /etc/letsencrypt/live/mark.zealey.org/fullchain.pem

A service apache2 graceful should then mean you can access your site via https.

If you want to force traffic to use HTTPS there are a number of different routes; the correct way is to use HSTS. However, if for some reason you discover that your site doesn't work correctly over HTTPS (e.g. you have iframes or javascript that are not on SSL-enabled servers), you want to be able to quickly stop forcing clients to use SSL. So for the initial deployment I used apache's mod_rewrite to handle the redirect:

    RewriteEngine On
    RewriteCond %{HTTPS} !=on
    RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R,L]
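As an aside, the HSTS approach mentioned above is just a response header sent from your SSL-enabled vhost; with mod_headers enabled (a2enmod headers) it would look something like the following – but be aware that browsers will then refuse plain HTTP for the whole max-age period, so start with a small value while testing:

    Header always set Strict-Transport-Security "max-age=31536000"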

If you are running a blog and you just want to force areas where secure information is transferred (e.g. the user login page) to use SSL, so people can't snoop on your password, you could use something like this (for drupal):

    RewriteEngine On
    RewriteCond %{HTTPS} !=on
    RewriteCond %{REQUEST_URI} ^/(user|admin)/
    RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R,L]

If you’re running a site where you have some mobile or legacy clients accessing your API, but you want new web clients forced to use SSL, you could have some config like this:

    RewriteEngine On
    RewriteCond %{HTTPS} !=on
    RewriteCond %{REQUEST_URI} !^/api/.*
    RewriteRule ^/?(.*) https://%{SERVER_NAME}/$1 [R,L]

Blacklisting domains using PowerDNS Recursor

I recently had a client who wanted to provide a recursive DNS service within his company, but wanted to blacklist a lot of domains and redirect them internally. And I mean a lot – over 1 million porn/spam/… domains. We first tried the excellent Unbound recursive DNS software with the blocks set up as local-data entries, but it required over 6GB of memory to load the list and that ended up crashing the process.

As it turns out, yet again PowerDNS to the rescue. I love PowerDNS for its flexibility (so much so that I created a very high performance DNS backend for it and also run a company consulting on DNS deployments).

As the full list of domains would not fit in memory we had to use a database. I took inspiration from a previously posted Lua script which used tinycdb. Unfortunately tinycdb requires manual compilation and so wasn't an option, and as the client already had the list of domains in a MySQL deployment we ended up using that. Both the PowerDNS authoritative server and the recursor support Lua scripting to do pretty much anything you need, and there are a number of database libraries that Lua can use. I started off using luadbi as it seemed to have a nicer interface; however, luadbi only supports Lua 5.1 whereas the Debian build of PowerDNS uses Lua 5.2. This meant switching to luasql, which is a bit lower-level.

So, the following Lua script will redirect any blacklisted domains (and their subdomains – the blacklist is stored in the MySQL table domains, in the name field) to 127.0.0.1:

driver = require "luasql.mysql"
env = assert( driver.mysql() )

function preresolve ( remoteip, domain, qtype )
        con = assert(env:connect("database_name", 'username', 'password'))

        -- strip the trailing dot that PowerDNS appends to the query name
        domain = domain:gsub("%.$", "")

        -- check the domain itself, then each parent domain, against the blacklist
        while domain ~= "" do
                local sth = assert (con:execute( string.format("SELECT 1 FROM domains WHERE name = '%s'", con:escape( domain )) ) )
                if sth:fetch() then
                        -- blacklisted: answer the query ourselves with 127.0.0.1
                        return 0, { { qtype=pdns.A, content="127.0.0.1" } }
                end

                -- strip the leftmost label and try again
                domain = domain:gsub("^[^.]*%.?", "")
        end

        -- not blacklisted: fall through to normal resolution
        return -1, {}
end

As establishing a MySQL connection on every request is quite a high overhead, it might well be worth switching to SQLite in the future; this should be very simple to do, just changing the driver name and connection string.
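Untested, but presumably the SQLite version would just need the driver and connection swapped for something like this (the database path is hypothetical), with the lookup code staying much the same:

driver = require "luasql.sqlite3"
env = assert( driver.sqlite3() )

-- and inside preresolve(), connect to a local file rather than a MySQL server:
con = assert( env:connect("/etc/powerdns/blacklist.db") )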