Category Archives: Databases

Finding the amount of space wasted in postgres tables

Because of way that postgres handles transaction isolation (ie using MVCC), when you modify or delete a row in a table it marks it as deleted, and then frees the space at a later point in time using (auto)vacuum. However, unless you use the heavy-weight VACUUM FULL command (which exclusive locks the table and totally rewrites it, causing anything trying to access it to block until the command is finished) the space is never reclaimed by the operating system. Normally this is not a problem – if you have a heavily used table with 20mb of data in it it probably has 5-10mb of overhead with the dead rows, reclaimed free space etc which is acceptable. However there are a few situations where it is useful to know what exactly the overhead is:

  1. Sometimes if your table changes very quickly, is large, and your disks or autovacuum parameters are unable to keep up, it can end up growing massive. For example we had a table that contains 3Gb of data but was taking up 45Gb due to the fact that autovacuum couldn’t keep up with the frequency of changes in the table
  2. If you are using table partitioning to store historic data then to make the most use of space you want to see whether a VACUUM FULL would be advantageous to run or not. For example if you have a table that is recording data collected from each day, some days it may be mostly just inserts so doesn’t need vacuuming; other days it may have a number of changes made and so have quite a lot of free space that can be reclaimed. Additionally, VACUUM FULL optimizes the order of data in the table and the indexes making it more performant.

In the first case, looking at the output of a command like

(sorry I can’t remember where I found this) should show you that there are a very large number of dead tuples waiting to be reclaimed (ie turned in to free space) in the table.

However, if your disks were struggling at one point, but then you tweaked autovacuum so it reclaimed the dead tuples correctly (as in case 1 above), your table could now be 90% free space but there is no easy way to find this out within postgres.

Fortunately, there is an excellent extension called pgstattuple which allows you to find out the amount of free space within a table file that has been reclaimed but not released to the operating system. The following query lists all tables which are over 100Mb in size, and have more than 10Mb of free space and have more than 20% free space (you can tweak these numbers – I just did it for our platform where our typical table size is 1Gb+):

This only uses an approximate count, however even so it can be a bit slow (it just took 10 minutes here) on a system with many tables and heavy IO. You can use this to find the tables that would most benefit from a VACUUM FULL command being run.

row_to_json() equivalent for jsonb types

With the advent of the awesome jsonb data type in Postgres 9.4, pretty much all of the json-related functions have a json_* and jsonb_* varient. All those, except row_to_json. I spent a bit of time but couldn’t find an answer online, in the end after a bit of experimentation, I discovered that in Postgres 9.5 at least the to_jsonb function does exactly the same, like:

How we improved database performance 1000% with just one command

In my current contract we have a nice database server with 4 datacentre quality SSD’s arranged in raid5, a battery-backed (Dell PERC / megaraid) controller and very poor disk performance. As I was doing a dump of the database (with parallel process, obviously), iostat was showing 100% disk utilization at 200-300 iops and 40-60mb/sec. Really not what it should be. After poking the raid card a lot with the MegaCli tool, I discovered that by changing the Linux IO scheduler to noop or deadline we suddenly got great performance – 50% disk utilization at 1300 iops and 200mb/sec read. So, an issue with the CFQ scheduler then as everywhere seems to suggest? In fact not – it turns out the root of the problem is that because of the raid controller the kernel did not detect that the underlying disks were SSD rather than HDD. Just by tweaking this setting CFQ provides the same performance as noop or deadline schedulers (in fact better for our postgres workload). So, here’s the command we used to get a 1000% improvement in database performance:

Getting value/count grouping in a single row with postgres

In one of my contracts we’re doing some very complex data storage and querying using jsonb in postgres. One common display in the system we’re replacing is basically something like:

Which is horrible, inefficient and doesn’t let you know if anything occurred outside the range. As we’re already doing a lot with JSON, I was wondering if rather than creating new columns, we could return these counts in a JSON format. It turns out that with postgres this is really easy – here’s how:

First, we want to create a function that given the existing aggregated value and a new key to inclement, returns a jsonb object

Then, you define the aggregate:

If you noticed the STRICT in the function (which means that NULL’s are not allowed), the aggregate we created notices that and will never pass NULL values in to the function meaning we don’t need to special-case them.

If you then run the query first casting the floating-point value into an integer and then turning that into a text datatype for the json key:

You get a JSON result back with the counts:

Obviously your first port of call when faced with something like this is ‘can I group by the value’, but in this case we want the row to be based on something different so we needed the sub-aggregation of the values.

Happy days

How we tweaked Postgres upsert performance to be 2-3* faster than MongoDB

As we all know, relational databases are fine if you’re dealing with small amounts of data but for web-scale high performance high inserts speed and masses of queries per second, NoSQL is what you need. At least, that’s the conventional wisdom/hype surrounding NoSQL databases such as MongoDB. However as we’ve recently discovered this is very wrong indeed.

As the conventional wisdom said that, we went with MongoDB for our first attempt at a system that needed to support a very high number of upserts – we were aiming for 50-100k upserted rows per second of unstructured data to be exact. Initial attempts with MongoDB on some very nice hardware (2Tb SSD SAN) produced about 10k upserts per second – good but not great. Then, this number increased when we started using batched upsert operations that was added in MongoDB 2.6, and increased even further with the advert of MongoDB 3.0 and the WiredTiger storage backend to 30-50k upserts/second. Very nice!

However, there are a number of issues with MongoDB in particular and limitations on NoSQL databases in general that relational (SQL) databases make easy – want to do aggregation, joins, distinct row selection from an index, complex data manipulations before transmitting data over the nework? Easy with SQL, virtually impossible with NoSQL. So, we started to think that if we could find a relational database with even vaguely good insert performance it would make life and development tasks so much easier that perhaps it would be worth the effort to change. As postgres generally seems to be able to handle anything you can throw at it, including support for a binary JSON data type we started looking there.

In order to avoid changing our database code very much so that we could compare as closely with MongoDB as possible, I created a really simple schema for the tables – the column(s) required for a primary key, and a jsonb data type column to store the record. One great feature of the recently released Postgres 9.5 release is the single statement upsert ability that reduces latency and code complexity, so all testing was done on this. As Postgres is designed from the ground up to be reliable and not lose data (unlike MongoDB which has fsync disabled by default), it is very slow if you are just doing single inserts. When you start batching these in to transactions of say 100 or 1000 entries you can achieve a higher throughput. If you want to compare it with MongoDB though, you can choose to disable the fsync at the end of every transaction by using the command

This means that in the event of a power failure you may lose some transactions that returned successfully, however you will never risk getting database corruption (unlike with MongoDB). By setting this one command you get performance in line with MongoDB – 50k upserts per second. However this was pushing 500Mb/sec of disk bandwidth which was saturating even the SSD array.

But Postgres has a few more tricks up its’ sleeve. One of the downsides of Postgres being incredibly protective of data is that it has a lot of writes – first it writes all commands to the ‘Write-Ahead Log’ (WAL) and then it updates the underlying data and indexes, and the WAL is where much of this additional disk IO was coming from. Starting in Postgres 9.1 however there is a feature called ‘unlogged tables‘ which avoids writing to the WAL, at the cost of loosing the table data after an unclean shutdown. Enabling this is as simple as changing your ‘CREATE TABLE’ statement to ‘CREATE UNLOGGED TABLE’. When I switched to using unlogged tables, disk IO went to virtually nil for most of the run (just at the end when it flushed the tables to disk did it dump several hundred MB of data – the database and indexes) and we saw a throughput of 85k upserts per second – roughly twice that of Mongo. The downside of this is that at an unclean shutdown, unlogged tables are automatically truncated as they may be in an inconsistent state. My proposal would be to allow a method to mark an unlogged table as read-only in order to ‘freeze’ its state hence negating the need to truncate after a crash; however apparently this would be a bit tricky to implement.

Because we want to do lots of searches with our data we heavily rely on indexes. On Postgres these appear to entail quite a bit more overhead than mongo. When I added 8 additional indexes to the unlogged table performance suddenly dropped from 85k upserts/sec to 35k upserts/sec. Fortunately there’s an easy solution to this in our case– create a table with just a single index for the primary key. Then create a ‘Materialized view‘ which is a clone of the table, and add the indexes to that. When your updates have finished, simply run “REFRESH MATERIALIZED VIEW”. This takes perhaps a minute to run on our data of 2m rows (as it is a single query, it is presently bounded to a single CPU), but it means that we get full upsert performance and the indexes then get updated inside a single transaction which should hopefully lead to less fragmentation and means better overall performance and less overall CPU usage at the cost of the materialized table lagging behind updated data by about 1 minute – an acceptable tradeoff in our case. By default “REFRESH MATERIALIZED VIEW” exclusively locks the materialized view while it is being update but you can fix this at the cost of some additional CPU cycles by changing that to “REFRESH MATERIALIZED VIEW CONCURRENTLY”. Your commands then look like:

So we’ve now got performance twice the speed of Mongo without tweaking any postgres performance parameters – can we do better? It turns out that some basic settings help improve performance pretty significantly:

The following setting is key as well – we don’t want to commit at the end of a transaction because frequently we have massive of transactions on the same table. Instead, wait a short amount of time and then commit with any other transactions on the same table in a single batch:

With these settings, an unlogged table with a single primary key and a jsonb column can run at 100k upserts/second, roughly three times the insert performance of MongoDB.

As I mentioned above, one of the issues with upserting lots of data into a single table was that we were seeing massive IO spikes on that. It turns out that this can be significantly reduced by changing the checkpointing settings too. There’s a small sentence in the manual that is key here: “To ensure data page consistency, the first modification of a data page after each checkpoint results in logging the entire page content” – basically if you checkpoint too frequently (and the default is after only 80mb of commands) after each checkpoint, if you are still updating lots of entries in the table, basically you entire table is rewritten to the WAL. To work around this (at the cost of causing longer recovery time after an unclean shutdown) we want to reduce the frequency of checkpoints with regard to the amount of data written (it will still write a checkpoint by default every 5 minutes anyway):

With these settings, even a standard logged table now runs at 80-90k upserts/sec and only outputs around 50Mb/sec of data.

TL;DR, one of the main areas of hype about NoSQL has been to do with the performance of inserting data. However, not only does Postgres 9.5 provide all the functionality of a mature and flexible SQL database, with a few tweaks it also handles upserts at two to three times the speed of MongoDB with better data consistency and lower memory and CPU usage.