Some observations of the GoogleBot

Recently, rather than using cookies to gather data on people using some of my sites, I’ve started using the newish HTML5 localStorage and generating a unique code on the first visit. This has a few advantages: it’s a bit more persistent than cookies, I already use localStorage to store customization data for the user client-side, it works seamlessly with PhoneGap/Cordova mobile apps, and I don’t have to worry about anything on the server side (i.e. setting, sending and tracking cookies). I use roughly the following code (assuming localStorage is available in the browser, which in 99% of cases it is):

    if( localStorage['uuid'] == null) {
        first_use = 1;
        // 64-bit random number persistent between sessions
        localStorage['uuid'] = EDITION + Math.floor( Math.random() * 0x10000 ).toString(16)
                                + Math.floor( Math.random() * 0x10000 ).toString(16)
                                + Math.floor( Math.random() * 0x10000 ).toString(16)
                                + Math.floor( Math.random() * 0x10000 ).toString(16);
    }

I noticed that across several of my sites the GoogleBot was generating exactly the same uuid (from multiple IP addresses), but it seemed that other localStorage preferences and so on were not being saved. From this it seems that the GoogleBot doesn’t persist anything in localStorage (not a surprise, given there are probably 10k computers running the GoogleBot scraper and it’s easier for them not to share site state). It also appears that they run the random number generator with a fixed seed, so any random numbers generated by the site are the same across all their scraper servers.
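
To illustrate why a fixed seed would produce exactly this behaviour, here is a minimal sketch. `fixedSeedRandom` is a hypothetical stand-in for whatever generator the bot really uses (its internals aren’t known to me); it is just a simple linear congruential generator. `makeId` mirrors the four-chunk uuid code above, minus the EDITION prefix, and the seed value 12345 is arbitrary.

```javascript
// Hypothetical stand-in for a Math.random() that always starts from the same seed.
// A basic linear congruential generator; NOT the GoogleBot's actual implementation.
function fixedSeedRandom(seed) {
    var state = seed >>> 0;
    return function () {
        // Well-known LCG constants, arithmetic modulo 2^32
        state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
        return state / 4294967296; // float in [0, 1)
    };
}

// Same four 16-bit hex chunks as the uuid code above, with the generator passed in.
function makeId(random) {
    var id = '';
    for (var i = 0; i < 4; i++) {
        id += Math.floor(random() * 0x10000).toString(16);
    }
    return id;
}

// Two "scraper servers" that start from the same (assumed) seed...
var serverA = makeId(fixedSeedRandom(12345));
var serverB = makeId(fixedSeedRandom(12345));
console.log(serverA === serverB); // true: same seed, same sequence, same id
```

Since nothing is read back from localStorage on the next visit, every visit takes the "first use" branch and regenerates the same value.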

Conclusions? Don’t expect bots (or even some clients, e.g. in incognito mode) to actually save localStorage between sessions, even if they support it as an interface. The Modernizr test for localStorage is as follows:

    try {
        var mod = 'modernizr';
        localStorage.setItem(mod, mod);
        localStorage.removeItem(mod);
        has_localstorage = 1;
    } catch(e) {
        // localStorage is missing or unusable; has_localstorage is left unset
    }

This basically tests that the interface works, not that it is persistent between sessions. Also, if you want output that actually varies when run in a bot, it looks like you’ll have to write your own pseudo-random number generator function with some changing seed, perhaps based on Date.now() output. Javascript’s Math object doesn’t support a seed for the .random() function. Whilst I can understand this design, it means that you basically have to code your own random generator stack if you want to get varying output for bots.
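
A minimal sketch of that idea follows. It uses mulberry32, a small publicly available 32-bit generator, seeded from Date.now(). The function names `clockSeededRandom` and `makeUuid` are illustrative, and the sketch assumes the bot’s clock really does differ between visits, which is untested here. It is not cryptographically secure, so treat it as an identifier generator only.

```javascript
// mulberry32: a small 32-bit PRNG; returns a function yielding floats in [0, 1).
function mulberry32(seed) {
    var a = seed >>> 0; // keep only the low 32 bits of the seed
    return function () {
        a = (a + 0x6D2B79F5) | 0;
        var t = Math.imul(a ^ (a >>> 15), 1 | a);
        t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

// Seed from the clock (assumption: Date.now() varies between bot visits).
function clockSeededRandom() {
    return mulberry32(Date.now());
}

// Build a 16-hex-digit id from four 16-bit chunks, each zero-padded to 4 digits.
function makeUuid(random) {
    var id = '';
    for (var i = 0; i < 4; i++) {
        var chunk = Math.floor(random() * 0x10000).toString(16);
        id += ('0000' + chunk).slice(-4);
    }
    return id;
}

var uuid = makeUuid(clockSeededRandom());
```

The generator returned by `clockSeededRandom()` can be passed anywhere Math.random would otherwise have been called, so the rest of the uuid code stays the same.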
