Scrolling back in KDE Konsole history with Shift plus Page Up via the number pad

I’ve just got a nice new laptop which is a bit wider than my previous one and so has a number pad on the right-hand side. Because I don’t need another set of number keys, I’ve ended up using the End/Home/Page Up/Page Down keys on the pad quite a bit – to use them elsewhere on the keyboard you have to combine them with the function key. However, to scroll back in the KDE terminal (Konsole) you need to press Shift-Page Up or Shift-Page Down, and trying that combination on the number pad only typed numbers. Fortunately there is an option buried in the KDE keyboard setup that fixes this – go to Keyboard Settings -> Advanced and then ‘Miscellaneous compatibility options’. Tick ‘Shift with numeric keypad keys works as in MS Windows’ and you’re done. While you’re at it you can also disable Caps Lock, or change it to something like Escape, under the ‘Caps Lock key behaviour’ option in the same menu.
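If you'd rather do this outside the KDE GUI, the same behaviour should be available as standard XKB options (the option names below are taken from xkeyboard-config – double-check them against your version), something like:

```sh
# 'Shift with numeric keypad keys works as in MS Windows' plus Caps Lock as Escape
setxkbmap -option numpad:microsoft -option caps:escape
```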

Debugging perl memory leaks

In perl you hardly ever have to think about memory management. Its reference-counting garbage collector works well, and aside from the fact that it doesn’t usually hand free memory back to the OS (it reuses it for new objects instead) it doesn’t have much in the way of memory issues. One exception, however, is if you create objects with circular references and forget to weaken one of them: the reference count never drops to zero, so the memory can never be garbage collected. Fortunately weakening is very easy with Moo or Moose.
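For example, with Moo (or Moose) a back-reference from a child to its parent can be declared with weak_ref => 1 so it doesn’t keep the cycle alive – a minimal sketch with made-up class names:

```perl
package Child;
use Moo;

# The back-reference to the parent is weakened, so parent <-> child
# cycles can still be garbage collected.
has parent => (
    is       => 'ro',
    weak_ref => 1,
);

1;
```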

If you do start seeing memory leaks, they're really simple to debug: just load the excellent Devel::Leak::Object module at the top of your perl code:
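Something like this (the GLOBAL_bless option, as described in the module's docs, tracks every object blessed anywhere in the program):

```perl
# Track all objects and print a per-class summary of whatever is still
# alive when the program exits.
use Devel::Leak::Object qw{ GLOBAL_bless };
```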

Then, when your script exits, it will print a list of all the objects that were never garbage collected, grouped by class. Look for the classes with a suspiciously high count and that’s where your memory leak is. Easy!

Hiding the list of sites on your server

Following on from switching my server to HTTPS/SSL with the excellent Let’s Encrypt free certificate authority, I noticed that after enabling SSL on the first domain, connecting via HTTPS to any of the other domains made the browser come up with an error like “You tried to get to site xxx.com but the certificate was issued for yyy.com”. I’m not an expert on the specifics of the HTTPS protocol and certificates, but I don’t really want people to be able to get a list of all the virtual hosts on my server that easily. If you use the default Let’s Encrypt client to get one certificate covering every domain on your server, then anyone the certificate is presented to can see all of the domains anyway; however if you issue a certificate for each virtual host, as per the script in my other post, at least you can restrict what people see.

To make it even more secure and stop even one valid certificate from being shown by default, you can create a new default vhost that serves a dummy certificate. To do this, you first need to generate a random self-signed certificate:
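Something along these lines will do (the output paths are just examples – put the files wherever suits you):

```sh
# Generate a throwaway self-signed certificate, valid for ~10 years
openssl req -x509 -nodes -days 3650 -newkey rsa:2048 \
    -keyout /etc/apache2/ssl/default.key \
    -out /etc/apache2/ssl/default.crt
```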

Just hit Enter in response to all of the questions it asks. Then, create a file in /etc/apache2/sites-enabled called 00default-ssl.conf containing a catch-all SSL vhost that serves the dummy certificate.
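A minimal sketch of what that could look like (ServerName and DocumentRoot here are placeholders; point the certificate directives at whatever files you created above):

```apache
<VirtualHost _default_:443>
    # Catch-all vhost: served for any HTTPS request that doesn't match
    # a real site, so only the dummy certificate is ever presented.
    ServerName default.example.com
    DocumentRoot /var/www/empty

    SSLEngine on
    SSLCertificateFile    /etc/apache2/ssl/default.crt
    SSLCertificateKeyFile /etc/apache2/ssl/default.key

    <Directory /var/www/empty>
        Require all denied
    </Directory>
</VirtualHost>
```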

Job done!

A source of random perl crashes

Recently we rolled out a new version of our code to production and started observing a lot of segfaults being logged, all of them to do with freeing memory.

Researching a bit further, I found that this was caused by some code which was meant to automatically save an object to the database when it was destroyed, if it had not already been saved:
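The pattern was essentially a DESTROY method along these lines (class and method names here are illustrative, not the real code):

```perl
# Illustrative sketch of the pattern, not the original code.
sub DESTROY {
    my ($self) = @_;

    # Flush any unsaved changes to the database as a last resort.
    $self->save_to_db unless $self->is_saved;
}
```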

Researching further, I discovered that the issue is down to the way perl handles global destruction. During a normal object destroy this code works just fine; however when the process is exiting and everything is being torn down, you have no guarantee about the order in which objects are destroyed. In this case I assume that the database handle we tried to save the object through had itself already been destroyed/garbage collected, which is what was causing the crash. Fortunately there’s quite a simple work-around:
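On perl 5.14 and later you can check the ${^GLOBAL_PHASE} variable and simply skip the save during global destruction – a sketch, reusing the illustrative names from above:

```perl
sub DESTROY {
    my ($self) = @_;

    # Don't touch other objects during global destruction - they may
    # already have been torn down (requires perl >= 5.14).
    return if ${^GLOBAL_PHASE} eq 'DESTRUCT';

    $self->save_to_db unless $self->is_saved;
}
```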

If you need to support perl < 5.14 you can use the Devel::GlobalDestruction module instead. Moo makes this easier still: define a DEMOLISH method rather than DESTROY and it gets passed a flag telling you whether global destruction is in progress:
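Along these lines (again with illustrative method names):

```perl
sub DEMOLISH {
    my ($self, $in_global_destruction) = @_;

    # Moo/Moose pass a true value here during global destruction.
    return if $in_global_destruction;

    $self->save_to_db unless $self->is_saved;
}
```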

We had a similar issue when using Log4perl – if you try to use it from within DESTROY during global destruction, the logger object may already have been destroyed and the process balloons in memory until it is OOM’d. Again, a simple check for global destruction inside your logging function handles this just fine.

MongoDB Pain

In my latest contract we’re gathering a lot of data very quickly (100k inserts/sec would not be unreasonable) and trying to save it into a database. Before I started the contract the decision had been made to use MongoDB for this very reason, so when I joined I was glad to finally get a chance to play with some NoSQL software – up until now I’ve just used MySQL and PostgreSQL.

There’s some software, and some programming languages, where the more you use them the more they pleasantly surprise you. AngularJS is one example: you start with basic templating and form-element binding and quickly discover there’s a mass of great plugins available. PostgreSQL is another case in point – I had been using MySQL quite happily for years, but since starting to use PostgreSQL for some geolocation projects (with the excellent PostGIS extension) I’ve started using it by default for any new project or design. At first it seems like just another database server with some geospatial additions, but then you think “perhaps I can get the database to validate my input data before inserting”. No problem for Postgres: you can specify a regexp constraint, for example CHECK (mac ~ '^[0-9a-f]{12}$'::text) to validate a MAC address (though you probably want to use the built-in macaddr type). Good luck doing that in MySQL – by default it doesn’t even care if you insert the wrong data type into a column; it just silently mangles it… Or perhaps you want to import data from another data source into Postgres – again, Foreign Data Wrappers (FDW) come to the rescue, connecting to pretty much any other database or data source you could want.
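For example, a constraint like the one above in context (table and column names are just for illustration):

```sql
-- Reject anything that doesn't look like a 12-hex-digit MAC address
CREATE TABLE devices (
    mac      text        NOT NULL CHECK (mac ~ '^[0-9a-f]{12}$'),
    seen_at  timestamptz NOT NULL DEFAULT now()
);
```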

With Mongo, however, the more I’ve learnt of it the less impressed I’ve become. Sure, you can do some nice searches without needing schemas and it’s pretty high performance, but once you go past simple store/retrieve operations there are some pretty basic issues that have not been addressed. Here are two I’ve come across in the past week:

Firstly, we have a big table (500m records) with a composite primary key (id, timestamp). We have an index on that, and we want a list of all the distinct ids in the table. No problem, you think – it’s a simple btree; just walk it and pull out the ids. And yes, that’s how it should work, but there’s no easy way to do it in Mongo. The first thing you see in the documentation is that there is a .distinct() function and a .group() function. Great – to get the distinct ids just run db.collection.distinct('id'). However, rather than doing something sensible and returning the distinct values via a cursor, it tries to return them all in a single BSON object, which is capped at 16mb. So, if you have fewer than perhaps a million distinct ids in your table it works; once you go over some magic threshold your entire process needs redesigning.

Then you notice that in the latest versions of MongoDB, the .aggregate function can return a cursor. So, something like:
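```javascript
// Sketch - assumes the field is simply called 'id'; the cursor option
// asks the server to return results in batches rather than one big document.
db.collection.aggregate(
    [ { $group: { _id: '$id' } } ],
    { cursor: { batchSize: 1000 } }
)
```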

should do it. Technically speaking, yes; however an open bug report from four years ago explains that the $group stage can’t use indexes. If you add explain: true to the aggregate you can see that it does a full collection scan. But perhaps, if you are clever, you can force the sort, like:
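```javascript
// Sketch - sorting on the indexed field first, hoping the $group can
// then walk the (id, timestamp) index in order.
db.collection.aggregate(
    [
        { $sort:  { id: 1 } },
        { $group: { _id: '$id' } }
    ],
    { cursor: { batchSize: 1000 } }
)
```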

it will use the index, and indeed the explain output shows that it does. However, even then it has to scan through the whole table before it returns the first item to the cursor! So, if your distinct result could be more than 16mb of data in a large table, there doesn’t seem to be any way to get it out of the database in a timely manner.

Another issue that I ran into and reported upstream is to do with very poor index selection on the aggregate. See the ticket for more details, but if you run an aggregate query on a collection with the foo column as a unique key:
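```javascript
// Sketch of the shape of the query (field names and values are illustrative)
db.collection.aggregate([
    { $match: { foo: 'some-unique-value' } },
    { $sort:  { _id: 1 } }
])
```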

for some reason Mongo doesn’t do the obvious thing and pick the index matching the first stage of the pipeline. Instead, it chooses the _id index, meaning the $match does a full table scan to find the one document that could have been looked up instantly had it used the correct index. The equivalent .find().sort() query doesn’t have this problem.

There was also a fun issue I discovered whereby, when you get a hot row in the database (for example, a lot of processes all using $inc to update the same counter), performance gets absolutely killed by the spinlocks on the data structure. Rather than the 50k upserts/sec the cluster normally managed, with this hot row it dropped to about 3k upserts/sec.
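The sort of update in question looks something like this (collection and field names are illustrative):

```javascript
// Many processes hammering the same document with an atomic increment
db.counters.update(
    { _id: 'global-event-count' },
    { $inc: { count: 1 } },
    { upsert: true }
)
```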

This is not to say that Mongo is a bad product; judging by posts from a few years ago it has come on leaps and bounds since then. However, it’s not the be-all and end-all that many NoSQL champions would claim. And I’m not convinced the other NoSQL-type databases would be much better – they’re fine for basic store/fetch operations, but as you get closer to needing the functionality of a traditional RDBMS they are just too immature. Other issues for us recently have been the lack of JOINs and the lack of a schema, which means a simple typo in a query can leave it hanging on a full table scan for a key that doesn’t exist, etc. etc. These are not problems with Mongo per se, but rather the flip side of the incredible flexibility that NoSQL products offer.

Would I use Mongo again in other projects? Yes probably, but only if we had a requirement for very high performance or schema-less flexibility that something like PostgreSQL simply couldn’t provide.