All posts by Mark

I'm a full-stack Linux consultant from the UK specializing in high-performance systems, DNS and databases. I have also written, and led teams producing, a number of web/mobile apps. I'm fluent in English and Turkish.

Adding a primary key to a partitioned table in postgres with zero downtime or locks

In Postgres (14), we have a table pubsub_node_option which looks like:

  Partitioned table "public.pubsub_node_option"
 Column │  Type  │ Collation │ Nullable │ Default 
────────┼────────┼───────────┼──────────┼─────────
 nodeid │ bigint │           │          │ 
 name   │ text   │           │ not null │ 
 val    │ text   │           │ not null │ 
Partition key: HASH (nodeid)
Indexes:
    "partitioner" btree (nodeid)
Number of partitions: 64 (Use \d+ to list them.)

The 64 partitions are named pubsub_node_option_p0 through _p63. The table holds a few hundred GB of data, and we want to add a primary key to it with no downtime or locks (perhaps 1s is acceptable). The standard ADD PRIMARY KEY command locks the table for the duration of the index build, which means we can’t use it.

In Postgres a primary key is just a unique index over a set of non-null columns. The documentation says that while the SET NOT NULL command requires a full table scan (with an exclusive lock), ADD PRIMARY KEY has another form which takes an already-existing unique index, and if there is also a NOT NULL or similar CHECK constraint on all of the columns, it can complete without needing any long-held locks.

So the basic process looks like:

  1. Add a CHECK constraint on nodeid
  2. Create a unique index over (nodeid, name)
  3. Use ALTER TABLE ... ADD PRIMARY KEY USING INDEX ... to set the PK.

Let’s start with (1). If we just add a CHECK constraint directly, Postgres will take out an exclusive lock on the table, which will freeze all updates for the duration. However, there is a NOT VALID option which adds the constraint instantly and allows us to run the validation later in the background. So we can do:

ALTER TABLE pubsub_node_option
  ADD CONSTRAINT pubsub_node_option_nodeid_not_null
  CHECK (nodeid is not null)
  NOT VALID;

Then we can try to run the validator, which shouldn’t take any locks:

ALTER TABLE pubsub_node_option
  VALIDATE CONSTRAINT pubsub_node_option_nodeid_not_null;

Unfortunately, at this point the database locks up, presumably because this is a partitioned table.

So, what I figured out (I’m not sure it’s actually documented anywhere) is that you can run the validator on each partition individually, after which it runs instantly without locking on the parent:

ALTER TABLE only pubsub_node_option_p0
  VALIDATE CONSTRAINT pubsub_node_option_nodeid_not_null;
...
ALTER TABLE only pubsub_node_option_p63
  VALIDATE CONSTRAINT pubsub_node_option_nodeid_not_null;
ALTER TABLE pubsub_node_option
  VALIDATE CONSTRAINT pubsub_node_option_nodeid_not_null;
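
Rather than typing out the 64 per-partition statements by hand, a small bash sketch can run them (assuming psql connects to the right database by default):

for i in $(seq 0 63); do
    psql -c "ALTER TABLE only pubsub_node_option_p$i VALIDATE CONSTRAINT pubsub_node_option_nodeid_not_null"
done
psql -c "ALTER TABLE pubsub_node_option VALIDATE CONSTRAINT pubsub_node_option_nodeid_not_null"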

So (1) is solved – we’ve shown Postgres that we don’t have any NULLs in that column.

Time for (2), the unique index. An index can be created CONCURRENTLY; however, this doesn’t work on partitioned tables. So we have to create it on each partition and then hook it into the main table.

Firstly, we create the index on the main table:

create unique index
  pubsub_node_option_pkey
  on only pubsub_node_option
  (nodeid, name);

The ON ONLY means the index isn’t created on the child tables.

Then we create a new index on each of the children, using CONCURRENTLY so there’s no locking:

create unique index concurrently
  pubsub_node_option_p0_pkey
  on pubsub_node_option_p0 (nodeid, name);
...
create unique index concurrently
  pubsub_node_option_p63_pkey
  on pubsub_node_option_p63 (nodeid, name);

Now this is done, we can attach each of them to the parent index:

alter index pubsub_node_option_pkey
  attach partition pubsub_node_option_p0_pkey;
...
alter index pubsub_node_option_pkey
  attach partition pubsub_node_option_p63_pkey;
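
Again, this is easy to script – a sketch under the same psql assumptions, creating and then attaching each partition’s index in turn (CREATE INDEX CONCURRENTLY can’t run inside a transaction block, so each statement gets its own psql call):

for i in $(seq 0 63); do
    psql -c "create unique index concurrently pubsub_node_option_p${i}_pkey on pubsub_node_option_p$i (nodeid, name)"
    psql -c "alter index pubsub_node_option_pkey attach partition pubsub_node_option_p${i}_pkey"
done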

Perfect.
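
Before moving on it’s worth confirming the parent index is now valid – it starts out invalid and only becomes valid once an index from every partition has been attached:

psql -c "SELECT indisvalid FROM pg_index WHERE indexrelid = 'pubsub_node_option_pkey'::regclass"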

Then, we should be able to do the final step of the process to convert the index + check constraint to a primary key (which is really only a bit of semantics for Postgres):

> alter table pubsub_node_option add primary key using index pubsub_node_option_pkey;
ERROR:  ALTER TABLE / ADD CONSTRAINT USING INDEX is not supported on partitioned tables

D’oh. But at least we can add it to each of the individual partitions:

alter table pubsub_node_option_p0
  add primary key
  using index pubsub_node_option_p0_pkey;
...
alter table pubsub_node_option_p63
  add primary key
  using index pubsub_node_option_p63_pkey;
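
Scripted, following the same pattern as the earlier loops:

for i in $(seq 0 63); do
    psql -c "alter table pubsub_node_option_p$i add primary key using index pubsub_node_option_p${i}_pkey"
done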

I still can’t see a way to add it to the parent table though; even ALTER TABLE ONLY pubsub_node_option comes up with the same error.

I’m not really sure where to go from here. However, the reason for wanting a PRIMARY KEY is that we want to use pglogical to replicate data, and this requires “a PRIMARY KEY or other valid replica identity such as using an index, which must be unique, not partial, not deferrable, and include only columns marked NOT NULL”. Since every partition now has a real primary key, I think we should be OK with how far we have gotten for now.

Please leave a comment if you know how to complete the process!

Easily running unison on different Ubuntu versions

Unison is a great file-synchronization tool, but it’s highly dependent on both the unison version and the OCaml version being the same between client and server to enable syncing. I’ve wasted much time over the years trying to backport versions to Ubuntu as I upgrade a laptop but not the server it backs up to, or vice-versa.

This seems like a great problem to solve with containers, and fortunately it seems like it’s quite easy.

Create a new unison container on the server – the Dockerfile should look like:

FROM ubuntu:22.04
RUN apt update && apt -y install unison

Then, build it:

docker build --network host -t unison .

And create /usr/bin/unison looking like:

#!/bin/bash
# Pass through all of the caller's supplementary groups
EXTRA=""
for g in $(id -G); do
    EXTRA="$EXTRA --group-add $g"
done
exec docker run --rm \
    -v /home/:/home/ \
    -e USER="$USER" -e HOME="$HOME" \
    -u "$(id -u):$(id -g)" \
    $EXTRA \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    --network host \
    -i \
    unison \
    unison "$@"

It’s necessary to run on the host network so the container picks up the server’s hostname, as unison is highly dependent on this and on the environment variables that are passed in.
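
With this wrapper on both sides (or a matching native unison on the client), invocation is unchanged – for example, syncing a home directory to a hypothetical server called backupserver:

unison /home/mark ssh://backupserver//home/mark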

Then it seems to work just fine. Happy days!

Easily dumping all AWS SSM details

We have a lot of credentials and config mixed up in AWS SSM/Parameter Store across many different regions and profiles. We want to export all of these to a spreadsheet to allow some more junior team members to convert the non-secret ones into standard parameters which can be stored in our Kubernetes git config (on a per-environment basis). The following script does this:

#!/bin/bash
# Dump the full parameter store for each profile (x y z are placeholders)
for profile in x y z; do
    awsv2 --profile "$profile" ssm get-parameters-by-path --path / --recursive --with-decryption > "paramstore-$profile.json"
done
# Filter out anything that looks secret and emit name,value as CSV
cat *.json | jq -r '.Parameters[] | select(.Name+.Value |test("pass|private|secret|token|://.*:|certificate|pwd|cookie|key"; "i")|not) | [.Name, .Value] | @csv' > out.csv

Using Postgres NOTIFY with placeholders

When writing SQL you always want to use placeholders rather than escaping text yourself and risking an SQL injection attack.

Postgres provides great functionality for this such as:

SELECT * FROM table WHERE username = $1

Today I was trying to send arbitrary text to a channel via the very powerful NOTIFY command. However, every time I tried to use placeholders I was getting errors (from Python’s asyncpg driver, which passes through Postgres error code 42601, syntax_error).

Eventually, looking through the docs I found this quote:

To send a notification you can also use the function pg_notify(text, text). The function takes the channel name as the first argument and the payload as the second. The function is much easier to use than the NOTIFY command if you need to work with non-constant channel names and payloads.

So, after wasting an hour trying all sorts of different quoting strategies, I was able to change NOTIFY $1, $2 into SELECT pg_notify($1, $2) and resolve the issue. This makes sense in hindsight: NOTIFY is a utility command rather than a regular query, and utility commands don’t accept bind parameters.
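
You can see the distinction in plain psql too – NOTIFY only accepts constants, while pg_notify takes arbitrary expressions (a quick demo with a made-up channel name):

psql -c "NOTIFY mychannel, 'constants only here'"
psql -c "SELECT pg_notify('mychannel', 'generated at ' || now()::text)"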

I’m using SQLAlchemy by the way so it looks something like:

from sqlalchemy import text 
await session.execute(
  text("SELECT pg_notify(:channel, :data)")
    .bindparams(channel="channel", data="my text")
)

Perfect!

Hacking ElasticSearch python client to work with AWS OpenSearch

Because of various disagreements between AWS and Elastic, AWS released a fork of ElasticSearch called OpenSearch, and Elastic updated their clients to throw errors if you try to use them with this product.

This is obviously really annoying if you are using third-party software which uses the Elastic libraries but you are trying to run it against the AWS managed service. The following hack fixes this for the Python elasticsearch library v7.17 at least, by disabling the unnecessary version check:

# Hack elasticsearch to work with AWS
from elasticsearch import Transport
Transport._do_verify_elasticsearch = lambda self, headers, timeout: None

Screen corruption on KDE with Ubuntu 20.04

At some point in the past couple of weeks I guess my laptop updated some libraries, because when I had to reboot it yesterday I started seeing massive screen corruption in some applications. It manifested as horizontal white/black lines remaining on screen, particularly when selecting text, and was especially visible in Konsole. I couldn’t see any package in particular which had updated recently, and I tried a few different Xorg and kernel options, but to no avail.

Eventually I looked at the KDE compositor settings. I noticed that the “Rendering Backend” was set to XRender, which as far as I understand is very old. Updating it to ‘OpenGL 3.1’ instantly fixed the issue.

I’m leaving this mostly as a note to myself in case it happens again in the future, but perhaps it is a wider regression in the Ubuntu 20.04 KDE packages, in which case hopefully it helps someone else.

Best wordle starter words

I recently, like pretty much everyone else, got into Wordle. One of the most important things in getting the correct answer is finding the best first word or two to guide you towards it. The ideal first words should each use five different letters from among the most common ones, so that for example in the first 2 guesses you can test the top 10 characters.

My first (relatively uneducated) guesses, based on what I vaguely remembered about letter frequency in English, were ‘spear’ and ‘mount’ – 4 vowels and some of the most common consonants. However, that’s pretty much a random guess, so I was wondering if we could figure out a better approach.

It’s pretty straightforward to look at the source code of Wordle, which contains two word lists. The first one contains the 2315 5-letter words which can be the answer; the second contains a further 10,000 or so 5-letter words which are accepted as guesses.

So, I wrote a small script to analyse the frequency of letters in the list of possible answers, and then based on that filter the possible words to find the best starting (and subsequent) guesses which would work.

I’ve put the simple python script I used at the bottom of the article, but the output is:

Matching 5 new letters (39%) are: ['arose']
Matching 5 new letters (66%) are: ['unlit', 'until']
Matching 4 new letters (81%) are: ['duchy']
Matching 3 new letters (89%) are: ['pygmy']

What this means is that if you start with the word ‘arose’ and then ‘until’ (or ‘unlit’), even though that’s only 10 unique letters (38% of the alphabet), because they are the most frequent ones they cover two-thirds (66%) of all the letter occurrences in the possible answers.

In terms of overall letter frequency we get the following ordered list:

[('e', 1233), ('a', 979), ('r', 899), ('o', 754), ('t', 729), ('l', 719), ('i', 671), ('s', 669), ('n', 575), ('c', 477), ('u', 467), ('y', 425), ('d', 393), ('h', 389),
('p', 367), ('m', 316), ('g', 311), ('b', 281), ('f', 230), ('k', 210), ('w', 195), ('v', 153), ('z', 40), ('x', 37), ('q', 29), ('j', 27)]

The script I wrote is not perfect, but it’s at least a start at finding some optimal words:

import sys

# Load the word list (one word per line)
with open(sys.argv[1]) as fh:
    words = [l.strip() for l in fh]

# Count how often each letter appears across all words
chars = {}
for char in ''.join(words):
    chars[char] = chars.get(char, 0) + 1
frequency = sorted(chars.keys(), key=lambda c: -chars[c])
print(sorted(chars.items(), key=lambda c: -c[1]))

total_freq = 0
while len(frequency) > 5:
    # Greedily pick the most frequent unused letters that still
    # appear together in at least one word, up to 5 of them
    matching = words
    letters = []
    for char in frequency:
        new_matching = [w for w in matching if char in w]
        if new_matching:
            matching = new_matching
            letters.append(char)
        if len(letters) == 5:
            break

    # Report the cumulative share of letter occurrences covered so far
    total_freq += sum(chars[c] for c in letters)
    print("Matching %d new letters (%d%%) are: %r" % (len(letters), total_freq / sum(chars.values()) * 100, matching))
    frequency = [c for c in frequency if c not in letters]

Simple mitigation for the new DNS cache poisoning attack

As reported in many places, a new attack has been presented which can allow an attacker to poison entries in caching and forwarding DNS servers. The PDF is an interesting read and contains many different ideas which, chained together, lead to this attack. I believe the following firewall rule should defend against the attack on caching servers, with very little side effect, by preventing the sending of ICMP messages saying that a given UDP port was unreachable:

iptables -I OUTPUT -p icmp --icmp-type port-unreachable -m u32 --u32 '34 & 0xFF = 17' -j DROP

The u32 match inspects byte 37 of the packet – the protocol field of the original IP header quoted back inside the ICMP payload – so only port-unreachable messages generated in response to UDP (protocol 17) packets are dropped.

Running systemd inside a Centos 8 Docker container

Support for systemd in Docker has improved a lot since this 2016 article, but it’s still not obvious quite how to make it work. Why would you want this? Mostly for testing full-server deploys (for example, we test ansible deployments against various docker containers to ensure there are no bugs). Here’s a systemd-based CentOS 8 Dockerfile that also includes an ssh server:

FROM centos:8

# Set up base packages that are expected
RUN dnf -y install openssh-server crontabs NetworkManager firewalld selinux-policy

RUN systemctl mask dev-mqueue.mount dev-hugepages.mount \
     systemd-remount-fs.service sys-kernel-config.mount \
     sys-kernel-debug.mount sys-fs-fuse-connections.mount \
     graphical.target systemd-logind.service \
     NetworkManager.service systemd-hostnamed.service

STOPSIGNAL SIGRTMIN+3

# SSHd setup
EXPOSE 22
COPY docker.pub /root/.ssh/authorized_keys
RUN chmod 600 /etc/ssh/ssh_host* /root/.ssh/authorized_keys

CMD ["/sbin/init"]

You can then launch this like:

docker run -v /sys/fs/cgroup:/sys/fs/cgroup:ro --tmpfs /run container-name
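
To check everything is working you can log in over ssh – a quick sketch, where the container name systemd-test and the IP address are hypothetical, and the key must match docker.pub:

docker run -d --name systemd-test -v /sys/fs/cgroup:/sys/fs/cgroup:ro --tmpfs /run container-name
docker inspect -f '{{.NetworkSettings.IPAddress}}' systemd-test   # prints e.g. 172.17.0.2
ssh root@172.17.0.2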

For the latest CentOS 7 you can use the following Dockerfile:

FROM centos:7

RUN yum -y install openssh-server NetworkManager firewalld && \
    systemctl disable NetworkManager && systemctl enable sshd

EXPOSE 22
COPY docker.pub /root/.ssh/authorized_keys
RUN chmod 600 /etc/ssh/ssh_host* /root/.ssh/authorized_keys

STOPSIGNAL SIGRTMIN+3

CMD ["/sbin/init"]

Complex network setup on Centos/Redhat 8 via Network Manager

Here are some notes, mostly for my future self, about how to set up bond + vlan + bridge networks in CentOS 8 in order to create a highly resilient virtual machine host server.

Firstly, create the bond interface and set the options:

nmcli con add type bond ifname bond0 bond.options "mode=802.3ad,miimon=100" ipv4.method disabled ipv6.method ignore connection.zone trusted

Note the 802.3ad option – this says we are going to be using the LACP protocol, which requires support from the switch or router; you can look at the different mode options in the kernel documentation.

Then, we add the required interfaces into the bond – in this case enp131s0f0 and enp131s0f1:

nmcli con add type ethernet ifname enp131s0f0 master bond0 slave-type bond connection.zone trusted
nmcli con add type ethernet ifname enp131s0f1 master bond0 slave-type bond connection.zone trusted
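
At this point you can verify that the bond has come up and LACP has negotiated with the peer – a quick check (not part of the setup itself):

cat /proc/net/bonding/bond0   # shows the bonding mode, LACP details and slave status
nmcli con show                # lists the connections we just created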

If you want to create a standard VLAN interface on top of the bond for VLAN tag 123, we can do this with just one more command:

nmcli con add type vlan ifname bond0.123 dev bond0 id 123 ipv4.method manual ipv4.addresses a.b.c.d/24 ipv4.gateway a.b.c.d ipv4.dns 8.8.8.8 connection.zone public

Alternatively if you want to set up a bridge on this interface:

nmcli con add ifname br.123 type bridge ipv4.method manual ipv4.addresses a.b.c.d/24 ipv4.gateway a.b.c.d ipv4.dns 8.8.8.8 \
        bridge.stp no bridge.forward-delay 0 connection.zone public
nmcli con add type vlan ifname bond0.123 dev bond0 id 123 master br.123 slave-type bridge connection.zone trusted

If you don’t want the default route or DNS, simply remove the ipv4.gateway and ipv4.dns options above.

If you want to create a bridge which doesn’t have any IP address on the host (it just bridges through to the VMs on the host), then replace all the ipv4 settings with: ipv4.method disabled ipv6.method ignore.
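
For example, an IP-less version of the bridge setup above would look like this (same hypothetical interface names as before):

nmcli con add ifname br.123 type bridge ipv4.method disabled ipv6.method ignore \
        bridge.stp no bridge.forward-delay 0 connection.zone trusted
nmcli con add type vlan ifname bond0.123 dev bond0 id 123 master br.123 slave-type bridge connection.zone trusted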