Diagnosing faulty memory in Linux…

For the past year I’ve had very occasional chrome crashes (segfaults in rendering process) and an occasional bit of btrfs corruption. As it was always easily repairable with btrfs check --repair I never thought much about it, although I suspected it may be an issue with the memory. I ran memtest86 overnight one time but it didn’t show up any issues. There were never any read or SMART issues logged on the disk either, and it happened to another disk within the machine as well.

Recently though I was seeing btrfs corruption on a weekly basis, especially after upgrading to ubuntu 18.04 (from ubuntu 16.04). I thought it may be a kernel issue so I got one of the latest kernels. It seemed to happen especially when I was doing something quite file-system intense, for example browsing some cache-heavy pages while running a vm with a long build process going on.

Then, earlier in the week the hard drive got corrupted again, much more seriously and after spending some time fixing, running `btrfs check –repair` a few times it suddenly started deleting a load of inodes. Force rebooting the machine I discovered that the disk was un-mountable, although later I was able to recover quite a lot of key data from btrfs restore as documented in this post.

memtest86 was still not showing any issues, and so my first thought was that assuming the hard disk was not at fault it may be something to do only when the memory had a lot of contention (memtest86 was only able to run on a single core on my box). I booted a minimal version of linux and ran a multi-process test over a large amount (not all) of the memory:

apt -y install memtester
seq $(nproc) | xargs -P1000 -n 1 bash -c 'memtester $0 10; E=$?; [[ $E != 0 ]] && { echo "FAILURE: EXIT status: $E"; exit 255; }' "$((($(grep MemAvailable /proc/meminfo | awk '{print $2}') / 1024 - 100) / $(nproc)))"

and check for FAILURE in the log messages, it likely also shows in dmesg, and may only show there if you have ECC RAM.

This will run a process per CPU aiming to consume pretty much all of your available memory. 10 is the number of test cycles to run through. In my case 8 cores and 16gb memory = 1400mb per memtester process. It took about 45 min to run once over the 16gb, or about 25 min to run over 8gb (each of the individual sodimms in my laptop).

Within about 10 minutes it started showing issues on one of the chips. I’ve done a bit of research since this and seen that if a memory chip is going to fail then it would usually do it within the first 6 months of being used. However this is a kingston chip that has been in my laptop since I bought it 2 or 3 years back. I added another 8gb samsung chip a year ago and it seemed to be after that that the issues started, however that chip works out as fine. Perhaps adding another chip in broke something, or perhaps it just wore out or overheated somehow…