All posts by Mark

I'm a full-stack Linux consultant from the UK specializing in high-performance systems, DNS and databases. I have also written, and led teams producing, a number of web/mobile apps. I'm fluent in English and Turkish.

Automounting swap on local SSD’s on Amazon EC2

Many instances on EC2 (AWS) now have local SSDs attached. The excellent Ubuntu 14.04 image boots brilliantly on these and automatically formats and mounts any of the local SSD storage. However, when the instance shuts down, reboots or gets migrated these SSDs go away, so you still need to use the persistent EBS storage for most operations.

If you want to enable swap on the box, add the following to /etc/rc.local – it will create and enable a 2GB swap file on the local SSD at each boot:

dd if=/dev/zero of=/mnt/swapfile bs=1M count=2048   # create a 2GB file on the local SSD mounted at /mnt
chmod 600 /mnt/swapfile                             # swap files must only be readable by root
mkswap /mnt/swapfile                                # format it as swap
swapon /mnt/swapfile                                # and enable it

I’ve not yet figured out what process formats and mounts these local disks at boot; it may well be easier to hook the swap creation into that instead.

Better Database Search Functionality in 4 Simple Steps

As Google and other search engines are so good at predictive search these days, users (ie me) get very frustrated at poor search results on websites or in input boxes. Something I try to provide across my apps is a decent auto-complete search interface; however so many websites are very poor at this, either matching only the first part of the string or matching any part of the substring. Additionally they sometimes don’t handle differences in case properly, and they certainly don’t usually cope with different accent marks or word order particularly well (eg searching for beyoglu doesn’t usually return results like Beyoğlu). So, here follows a simple suggestion and code design pattern for how to implement this properly in PostgreSQL (it also works in MySQL, although the regex matching code is slightly different). You can then have great instant typeahead functionality, for example using the excellent AngularJS Bootstrap Typeahead input. I’ve implemented this in Perl/DBIC but it is a pattern that can easily be applied to any language such as Ruby/Rails or NodeJS.

Whilst there are a number of different search options out there that can plug into existing databases, such as ElasticSearch, Sphinx or MySQL/Postgres fulltext search, these are often fiddly to set up and are more intended for natural-language fulltext than for the simple phrases or keywords I generally aim for. The method below is pretty quick and easy to set up and gives you full control over rebuilds, stemming, common-word removal etc, which is especially important for multi-lingual sites. It also lets you switch between database servers without having to totally redo your search functionality.

Step 1: Add Column to Database Tables

Firstly, for any table you wish to search, create a searchdata column – probably a varchar, sized to hold the combined data you’ll want to search (eg article title, author etc). For example:

alter table article add searchdata varchar(255) not null default '';

Step 2: Create Search Query Normalization Code

Then in your code create two routines to normalize any search text. Here is a (perl) example from my code:

package OradanOraya::Search;
use strict;
use warnings;
use utf8;
    
use Text::Unidecode 'unidecode';

sub to_search_no_strip {
    my ($self, $txt) = @_;
    $txt = lc unidecode($txt);
    $txt =~ s/[^a-zA-Z0-9 ]/ /g;  # replace anything that isn't alphanumeric with a space
    $txt =~ s/\s+/ /g;            # collapse runs of whitespace (including newlines) into a single space
    $txt =~ s/^\s+|\s+$//g;       # trim leading and trailing whitespace
    return $txt;
}
    
sub to_search {
    my ($self, $txt) = @_;
    $txt = $self->to_search_no_strip($txt);

    # common words to strip out
    $txt =~ s/\b(?: bolum | hastane | doktor | doctor | doc | dr )\S*//xg;

    return $txt;
}       
            
1;

The first function purely normalizes the search terms: it strips accents using the excellent Text::Unidecode module, then removes any non-alphanumeric characters, ensures there is only a single space between words and trims spaces from the beginning and end of the text. The second function does the same but also strips out any common words you don’t want indexed.
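
As a quick hypothetical sanity check, here is roughly what the two routines produce for the beyoglu example from the introduction (assuming the package above is in your lib path):

use strict;
use warnings;
use utf8;
use OradanOraya::Search;

# to_search_no_strip only normalizes: accents stripped, lower-cased, whitespace collapsed
print OradanOraya::Search->to_search_no_strip('Beyoğlu Devlet Hastanesi'), "\n";
# => "beyoglu devlet hastanesi"

# to_search additionally drops the common words listed in its regex ("hastane..." here)
print OradanOraya::Search->to_search('Beyoğlu Devlet Hastanesi'), "\n";
# => roughly "beyoglu devlet" (the whitespace left by the stripped word isn't re-trimmed)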

Step 3: Set Columns to Auto Update in Your ORM

In your ORM base class (you are using an Object-Relational Mapper rather than plain SQL, right?) create some functions to handle the auto-population of these fields when the rows get updated by your code. For Perl’s DBIx::Class users, here’s the code to inject into your DBIC Result base class. The first function, _get_searchdata, is the key one: it takes a specified list of columns, normalizes their contents and returns the new searchdata value. The other functions handle manually refreshing the search data in a row, and automatically updating it on column update and row creation respectively:

sub _get_searchdata {
    my ($self) = @_;

    return OradanOraya::Search->to_search( join ' ', map { $self->$_ || '' } $self->searchdata_columns );
}

sub refresh_searchdata {
    my ($self) = @_;
    $self->update({
        searchdata => $self->_get_searchdata
    });
}

sub set_column {
    my $self = shift;

    my $ret = $self->next::method( @_ );

    if( $self->can('searchdata') ) {
        # Note that we call the super-class set_column update method rather than ourselves otherwise we'd have an infinite loop
        $self->next::method( 'searchdata', $self->_get_searchdata );
    }

    return $ret;
}

sub insert {
    my $self = shift;

    if( $self->can('searchdata') ) {
        $self->searchdata( $self->_get_searchdata );
    }

    return $self->next::method( @_ );
}

In any of your tables where you have added a searchdata column, create a method that simply returns the columns you want included in searchdata:
sub searchdata_columns { qw< title name > }
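
With that in place the column looks after itself. A hypothetical example (assuming an Article result class with the title and name columns above, and $dbic being the connected schema as elsewhere in this post):

# searchdata is filled in automatically on create...
my $article = $dbic->resultset('Article')->create({
    title => 'Beyoğlu Devlet Hastanesi',
    name  => 'Dr Mehmet',
});
print $article->searchdata;   # roughly "beyoglu devlet mehmet"

# ...and kept up to date whenever a column changes
$article->update({ title => 'Kadıköy Şifa Hastanesi' });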

Step 4: Search Queries and Ordering

Whenever a row is added or updated you’ll now have the normalized search text stored (see below for a script to populate it if you have existing data). To do nice searches you can now execute the following SQL (for MySQL replace ~ with the REGEXP operator):

select * from article where searchdata ~ 'foo'

This will match the text anywhere in the column. If you only want to match words that begin with the search term you can use PostgreSQL’s zero-width start-of-word assertion \m (in normal regexp language this would be roughly equivalent to \b, although that matches both the beginning and end of words):
select * from article where searchdata ~ '\mfoo'

If you want to order the results so that beginning-of-string matches come first and the rest follow alphabetically, you can do something like this (note the !~, as false orders before true):
SELECT *
FROM article
WHERE searchdata ~ '\mfoo'
ORDER BY searchdata !~ '^foo', searchdata

Well, that’s a job well done! You could look at using some sort of index in the database to speed this up, but to be honest for tables with fewer than 10k rows it’s probably not worthwhile. You’d need to look at the trigram (pg_trgm) indexes that Postgres has for this kind of pattern matching; I don’t believe MySQL is able to index these sorts of searches.

The DBIC code for this last one:

my $search_str = quotemeta($fn->to_search( $p->{search} ));
$dbic->resultset('Article')->search({
  searchdata => { '~' => '\m' . $search_str }
}, {
  order_by => [
         \[ 'searchdata !~ ?', [ str => '^' . $search_str ] ],
         'searchdata'
  ]
});
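
To feed a typeahead box you then just need to return a few of those rows. A minimal sketch (assuming the resultset above has been assigned to $rs and that Article has an id column; the JSON encoding depends on your web framework):

# return the first ten suggestions for the AngularJS Bootstrap Typeahead to consume
my @suggestions = map { { id => $_->id, title => $_->title } }
                  $rs->search( undef, { rows => 10 } )->all;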

Extra Step: Create a Reindex Script

You’ll also want to write some code to find any tables with a searchdata column and update them for the initial population. Here’s the Perl/DBIC solution for this again, but it should be simple enough with any ORM (note the transaction commit every 100 updates, which significantly improves performance on all database servers):

$|++;                # autoflush so the progress dots appear immediately
my $count = 1;
$dbic->txn_begin;    # $dbic is the connected DBIx::Class schema object
for my $table ($dbic->sources) {
    my $rs = $dbic->resultset($table);

    # only reindex tables that actually have a searchdata column
    next unless $rs->result_source->has_column('searchdata');

    while( my $d = $rs->next ) {
        if( $count++ % 100 == 0 ) {
            $dbic->txn_commit;
            $dbic->txn_begin;
            print ".";
        }
        
        $d->refresh_searchdata;
    }           
}           
$dbic->txn_commit;

How to use mitmproxy to capture https connections

Based on the excellent in-depth guide found here I’ve written a few quick startup notes to myself below:

sudo ufw disable
sudo iptables -t nat -A PREROUTING -p tcp --dport 443 -j REDIRECT --to-port 8000 # remember not to use -i...
mitmproxy -T --host

Philip’s instructions have -i with the nat PREROUTING rule, and because I’m on wireless this was a source of frustration until I noticed it. Forwarding is enabled by default on my box as I run some VMs from time to time, so the box will automatically forward the packets and just pull out the ones on port 443, which are the ones I’m interested in.

Slow http requests with chrome webview

I was developing an app using the excellent Chrome WebView remote debugging (enabled using the following code in the app)

        if(Build.VERSION.SDK_INT >= Build.VERSION_CODES.KITKAT)
            WebView.setWebContentsDebuggingEnabled(true);

and noticed that when testing against the live server, what should have been a 100ms request was suddenly taking 2–5 seconds to return. I looked at the server and thought the load was a bit high, so I started up more server processes, but that didn’t do anything to solve the problem. Finally, when I closed the inspector window everything went back to normal. I guess the inspector window (the network monitor panel at least) was blocking the process while waiting for all the data to be transferred over USB. Annoying, but closing the inspector window is an easy fix.

Android launch intents from commandline

I’ve been trying to integrate Android intents with BuradanOraya, specifically to test different scheme-parsing bits of code. Open up a shell connection (adb shell) to your Android device and then trigger an intent aimed at a specific application using a command like:

am start -a android.intent.action.VIEW -d geo:29,41 com.app.path

If you don’t specify a package ie
am start -a android.intent.action.VIEW -d geo:29,41

then on my Android 5 device at least it will come up with a chooser of all the different apps that could open the intent.

Massive battery drain on Android 5 with gmail

I’ve seen this reported quite vaguely in various forums, but having experienced the problem a number of times myself I’ve now come up with the easiest solution. Basically, once or twice a month my Nexus 4 running stock Android 5.0.1 eats through its battery in about 3 or 4 hours in spite of hardly being used. The battery usage page shows the culprit is the Gmail application (I don’t actually use Gmail, but in Android 5 they merged the email application into Gmail).

Previously I would go to the applications page, click on Gmail, clear all data and cache, stop the app and then have to reconfigure my accounts. I’ve since noticed that simply force-stopping the app several times and then clearing the cache (which is usually very inflated – presently 100MB of data but 200MB of cache, and I’ve seen up to 800MB of cache usage before) seems to do the trick and brings battery usage back down to normal.

Some observations of the GoogleBot

Recently, rather than using cookies to gather data on people using some of my sites, I’ve started using the newish HTML5 localStorage and generating a unique code on the first visit. This has a few advantages: it’s a bit more persistent than cookies, I already use localStorage to store customization data for the user client-side, it works seamlessly with PhoneGap/Cordova mobile apps, and I don’t have to worry about anything on the server side (ie setting, sending and tracking cookies). I use roughly the following code (assuming localStorage is available in the browser, which in 99% of cases it is):

    if( localStorage['uuid'] == null) {
        first_use = 1;
        // 64-bit random number persistent between sessions
        localStorage['uuid'] = EDITION + Math.floor( Math.random() * 0x10000 ).toString(16)
                                + Math.floor( Math.random() * 0x10000 ).toString(16)
                                + Math.floor( Math.random() * 0x10000 ).toString(16)
                                + Math.floor( Math.random() * 0x10000 ).toString(16);
    }

I noticed that across several of my sites the GoogleBot was generating exactly the same uuid (over multiple access IPs), but that other localStorage preferences etc were not being saved. From this it seems the GoogleBot doesn’t support saving anything in localStorage (not a surprise given there are probably 10k computers running the GoogleBot scraper and it’s easier for them not to share site state). However, it also appears that they run the random number generator with a fixed seed, so that any random numbers generated by the site are the same across all their scraper servers.

Conclusions? Don’t expect bots (or even some clients eg incognito mode) to actually save localStorage between sessions even if they support it as an interface (the modernizr test for localStorage is as follows:

try {
    var mod = 'modernizr';
    localStorage.setItem(mod, mod);
    localStorage.removeItem(mod);
    has_localstorage = 1;
} catch(e) {
}

which basically tests that the interface works, not that it is persistent between sessions). Also, if you want truly random output when run in a bot, it looks like you’ll have to write your own pseudo-random number generator function with a changing seed, perhaps based on Date.now() output. It doesn’t look like JavaScript’s Math object supports seeding the .random() function which, whilst I can understand the design decision, means you basically have to code your own random generator if you want truly random output for bots.

Getting WordPress posting to Twitter with hashtags

In trying to publicise some of the articles on this blog a bit wider afield, I recently opened a Twitter account. However, in order to reach a wider audience on Twitter you need to use hashtags. Unfortunately WordPress’s excellent Jetpack extension, whilst allowing you to post to Twitter and other social networks, doesn’t automatically include hashtags in your posts. There have been a few attempts to add this functionality as an extension in this thread, however they are not very well coded and don’t work properly. Here is what I am now using on this blog:

// Create our custom Publicize message
function jeherve_cust_pub_message() {
	$post = get_post();

	if ( !$post )
		return;

	$categories = get_the_category( $post->ID );

	if ( !$categories )
		return;

	$msg = $post->post_title;
	// No need to add a URL as that is a separate part of the message.
	foreach($categories as $category)
		$msg .= " #" . str_replace(' ', '', $category->cat_name);

	update_post_meta( $post->ID, '_wpas_mess', $msg );
}

// Save that message
function jeherve_cust_pub_message_save() {
	add_action( 'save_post', 'jeherve_cust_pub_message', 21 );
}
add_action( 'publish_post', 'jeherve_cust_pub_message_save' );

Facebook Graph API Page post changes

So about a month back it looks like Facebook changed their Graph API to prevent posting links to pages using the method we had always used, which was simply a POST to <page_id>/feed with my access token and message and link parameters. Posting just a message was still working fine, but when I tried to add a link in I just got access denied.

After spending an hour or two bashing my head against the wall, I discovered that you first have to fetch a list of all your pages with your user access token, figure out the page’s special access token from that, and only then can you post.

So the resulting (somewhat messy) Perl code looks like this:

use WWW::Mechanize;
use JSON 'decode_json';

my $FB_GRAPH_BASE = 'https://graph.facebook.com/v2.2';
my $m = WWW::Mechanize->new;

# Fetch the list of pages this user administers and pull out the page-specific access token
my $res = $m->get( "$FB_GRAPH_BASE/me/accounts?access_token=$token" );
my $d = decode_json( $res->decoded_content )->{data};
my $page_token = (grep { $_->{id} eq $PAGE_ID } @$d)[0]->{access_token};

# Post to the page's feed using the page token rather than the user token
$res = $m->post( "$FB_GRAPH_BASE/$PAGE_ID/feed", {
    access_token => $page_token,
    message => $msg,
    link => $url,
});