Category Archives: Linux

Multi-line commands with comments in bash

As part of the last post I initially used a bash script to generate the commands to output the individual videos. As per usual, when I finally got fed up with the limitations and syntax issues in bash I switched to a proper programming language, perl. However this time I learnt a neat trick for doing multi-line commands in bash with comments embedded, using the array feature of bash. A multi-line command typically looks like:

        melt \
            color:black \
                out=$audiolen \
            ...

However what if you want to add comments into the command? You can’t – anything after a # is ignored, including the line-continuation backslash, which breaks the command.

To solve this, create an array instead:

    cmd=(
        # Take black background track for same number of seconds as the MP3, then add 10 seconds of another image
        melt
            color:black
                out=$audiolen
        ...
    )

and then use the following magic to execute it (each array element becomes a single argument, and the comments were discarded when the array was assigned):

"${cmd[@]}"

Using this you can also conditionally add in extra statements if you’re using a pipeline-type program such as imagemagick (convert) or melt:

    cmd+=(
        # Output to the file
        -consumer avformat
            target="$target"
            mlt_profile="hdv_720_25p"
            f=mpeg acodec=mp2 ab=96k vcodec=mpeg2video vb=1000k
    )
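For example, to make the append actually conditional – a minimal sketch, assuming a hypothetical $endimg variable holding an optional closing image:

    if [ -n "$endimg" ]; then
        cmd+=(
            # Show the closing image for 10 seconds (250 frames at 25fps)
            "$endimg"
                out=250
        )
    fi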

Automatically creating videos from pictures, music and subtitles

So for one of my projects we have a number of albums and individual songs which we want to upload to youtube, as many people use it to listen to music these days. We also want to create a separate collection of videos that have the song words on them (think hard-burning subtitles into a video). Obviously you could do this in video editing software, but it would be nice to be able to tweak all the videos afterwards without having to do much work.

Initially I tried using avconv/mencoder to generate videos based on the pictures, using the following approach: generate the picture/music as a video, apply subtitles, and then finally apply the audio again without re-encoding it.

    avconv -loop 1 -y \
            -i bgimg.jpg \
            -i "$mp3" \
            -shortest \
            -c:v libx264 -tune stillimage -pix_fmt yuv420p \
            -c:a mp3 \
            "$t"

    # Apply subtitles
    mencoder -utf8 -ovc lavc -oac copy -o "$out" "$t" -sub "$sub"

    # Add in end track and overlay with mp3
    mencoder -audiofile "$mp3" -idx -ovc lavc -oac copy -o "final.avi" "$out" "$append"

Whilst this kind of works, it has a number of downsides, the big ones being 1) it isn’t flexible enough to eg add another picture/slide at the end, and 2) it re-encodes the video/audio a number of times.

Then I remembered that the great kdenlive video editing software is actually just a frontend to the brilliant mlt framework. This is basically a library plus commandline programs to do all sorts of video mixing with live or rendered output.

Using the melt commandline program you can test and generate tracks without having to worry about the XML format that it typically uses for the more advanced options. The final commands:

melt color:black out=5614 \
  t.jpg out=250 \
  -track \
    cdimage.jpg out=5614 \
  -transition composite geometry=0,0:100%x70% halign=1 \
  -consumer xml:basic.mlt

melt basic.mlt \
  -filter watermark:subtitles.mpl \
    composite.valign=b composite.halign=c producer.align=centre \
  -audio-track audio.mp3

If you want to render the video output you can add the following onto the last command:

-consumer avformat \
  target=out.mpg \
  mlt_profile=hdv_720_25p f=mpeg acodec=mp2 ab=96k vcodec=mpeg2video vb=1000k

Let’s go through this a line at a time:

melt color:black out=5614

Generate black background for 5614 frames

  t.jpg out=250

Followed by t.jpg for 250 frames

  -track
    cdimage.jpg out=5614

Generate a new track which is the cd image for the same length as the black track

  -transition composite geometry=0,0:100%x70% halign=1

Mix the two tracks so that the second one (ie the cd image) takes up 70% of the screen height and is centred horizontally at the top.

  -consumer xml:basic.mlt

Output to an xml file (in order to apply subtitles to the whole thing we need this intermediate stage)

melt basic.mlt

Start with the mixed video sequence defined in the xml file (the xml is just instructions, not a pre-rendered file)

  -filter watermark:subtitles.mpl
    composite.valign=b composite.halign=c producer.align=centre

Apply the watermark filter with a subtitle mpl file, aligned to the bottom centre (it will auto-scale extra-wide lines to the width of the video). An MPL file looks like this:

1=blah
10=
15=foo
20=

Where the first part is the frame number and the second part is the text to display from that frame (a blank entry clears it). New lines are demarcated with a tilde (~) character. Here is a simple perl script to convert an srt format subtitle file into this mpl format:

#!/usr/bin/perl
use strict;
use warnings;
use Path::Tiny 'path';

my ($fps, $in) = @ARGV or die "usage: $0 <fps> <subtitles.srt>\n";
$in = (path $in)->slurp;
$in =~ s/\r//g;

# srt entries are separated by blank lines
my @parts = split /\n\n/, $in;
for my $part (@parts) {
    # Strip the index line and capture the start/end timestamps
    $part =~ s/^ \D* \d+ \n
        ([\d:,]+) \s --> \s ([\d:,]+) \n
        //x;
    my ($start, $end) = ($1, $2);

    # Convert each hh:mm:ss,ms timestamp into a frame number
    for( $start, $end ) {
        my ($h,$m,$s,$part_s) = split /[:.,]/;
        $_ = int( ( ( $h * 60 + $m ) * 60 + $s + $part_s / 1000 ) * $fps );
    }

    # mpl marks line breaks within a subtitle with ~
    $part =~ s/\n/~/g;
    print "$start=$part\n",
        "$end=\n";    # blank entry to clear the subtitle again
}
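Usage is along these lines, assuming the script is saved as srt2mpl.pl (my name for it), with the target fps as the first argument:

perl srt2mpl.pl 25 subtitles.srt > subtitles.mpl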

Back to the melt commandline:

  -audio-track audio.mp3

Overlay the audio track

For the final (non-test) output command-line parts:

-consumer avformat target=out.mpg

Output using libav

  mlt_profile=hdv_720_25p f=mpeg acodec=mp2 ab=96k vcodec=mpeg2video vb=1000k

Set the profile to 25fps 720p hd video using mpeg, with the audio bitrate at 96kbps and the video bitrate at 1000kbps
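Incidentally, the hard-coded out=5614 above is just the length of the MP3 expressed in frames. If you’re scripting this, something like the following should compute it – a sketch assuming ffprobe (or avprobe on older systems) is available and the 25fps profile above:

fps=25
secs=$(ffprobe -v error -show_entries format=duration -of csv=p=0 audio.mp3)
audiolen=$(echo "($secs * $fps)/1" | bc)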

rsync with remote filenames with spaces in from bash

Something that always annoys me with rsync is that, because it executes a remote shell, any special characters in the remote path name require double-escaping (once for the local shell, once for the remote one). For example

rsync -av 'my holiday photos/' server:'my holiday photos/'

creates a remote folder called ‘my’ and puts the directory into that. The solution is to do something like:

rsync -av 'my holiday photos/' server:'my\ holiday\ photos/'

But how do you do this when you’re running from the shell, eg iterating over directories? One way would be to use a command substitution like $(sed …) to handle the escaping, however you can do it purely in shell using two different types of quote. For example today I had to do:

for i in */; do
    rsync -av "$i/img/" server:"backup/'$i'/"
done
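Alternatively, newer versions of rsync (3.x) support -s / --protect-args, which passes the remote path through without remote-shell word splitting, so no double-escaping is needed at all:

for i in */; do
    rsync -sav "$i/img/" server:"backup/$i/"
done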

Automounting swap on local SSDs on Amazon EC2

Many instances on EC2 (AWS) now have local SSDs attached. The excellent ubuntu 14.04 image boots brilliantly on these and automatically formats and mounts any of the local SSD storage. However, when the instance shuts down, reboots or gets migrated these SSDs go away, so you still need to use the persistent EBS storage for most things.

If you want to enable swap on the box, add the following to /etc/rc.local – it will create a 2GB swap file on the local SSD on each boot and enable it:

dd if=/dev/zero of=/mnt/swapfile bs=1M count=2048
chmod 600 /mnt/swapfile
mkswap /mnt/swapfile
swapon /mnt/swapfile
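Since rc.local runs on every boot, a small guard makes this a bit more robust – a sketch, assuming /mnt is where the image mounts the ephemeral disk (mountpoint is part of util-linux):

if mountpoint -q /mnt && [ ! -f /mnt/swapfile ]; then
    dd if=/dev/zero of=/mnt/swapfile bs=1M count=2048
    chmod 600 /mnt/swapfile
    mkswap /mnt/swapfile
    swapon /mnt/swapfile
fi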

I’ve not yet figured out what process formats/mounts these local disks on bootup; it may well be easier to hook the swap setup into that.

Better Database Search Functionality in 4 Simple Steps

As Google and other search engines are so good at predictive search these days, users (ie me) get very frustrated at poor search results on websites or in input boxes. Something I try to provide across my apps is a decent auto-complete search interface, however so many websites are very poor at this, either matching only the first part of the string or matching any part of the substring. Additionally they often don’t handle differences in case properly, and they almost never handle different accent marks or word ordering well (eg a search for beyoglu usually won’t return results like Beyoğlu). So, here follows a simple suggestion and code design pattern for how to implement this properly in PostgreSQL (it also works in MySQL, although the regex matching syntax is slightly different). You can then have great instant typeahead functionality, for example using the excellent AngularJS Bootstrap Typeahead input. I’ve implemented this in Perl/DBIC but it is a pattern that can easily be applied to any language such as Ruby/Rails or NodeJS.

Whilst there are a number of search options out there that can plug into existing databases, such as ElasticSearch, Sphinx or MySQL/Postgres fulltext search, these are often fiddly to set up and are more intended for natural-language fulltext than for the simple phrases or keywords I’m generally aiming for. The method below is pretty quick and easy to set up and gives you full control over rebuilds, stemming, common-word removal etc, which is especially important for multi-lingual sites. You can also easily switch between database servers without having to totally redo your search functionality.

Step 1: Add Column to Database Tables

Firstly, for any table you wish to search, create a searchdata column – probably a varchar, with the maximum length of the data you’ll want to search (eg article title, author etc combined). For example:

alter table article add searchdata varchar(255) not null default '';

Step 2: Create Search Query Normalization Code

Then in your code create two routines to normalize any search text. Here is a (perl) example from my code:

package OradanOraya::Search;
use strict;
use warnings;
use utf8;

use Text::Unidecode 'unidecode';

sub to_search_no_strip {
    my ($self, $txt) = @_;
    $txt = lc unidecode($txt);    # strip accents and lower-case
    $txt =~ s/[^a-z0-9 ]/ /g;     # replace non-alphanumeric chars with spaces
    $txt =~ s/\s+/ /g;            # collapse newlines and space runs
    $txt =~ s/^\s+|\s+$//g;       # trim leading/trailing whitespace
    return $txt;
}

sub to_search {
    my ($self, $txt) = @_;
    $txt = $self->to_search_no_strip($txt);

    # common words to strip out
    $txt =~ s/\b(?: bolum | hastane | doktor | doctor | doc | dr )\S*//xg;

    return $txt;
}

1;

The first function purely normalizes the search terms (first stripping accents using the excellent Text::Unidecode module, then removing any non-alphanumeric characters, collapsing runs of whitespace to a single space and trimming the ends); the second does the same but also removes any common words you don’t want indexed.
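As a quick sanity check of the normalization from the shell (assuming Text::Unidecode is installed):

$ perl -Mutf8 -MText::Unidecode -le 'print lc unidecode("Beyoğlu")'
beyoglu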

Step 3: Set Columns to Auto Update in Your ORM

In your ORM base class (you are using an Object-Relational Mapper rather than plain SQL, right?) create some functions to handle auto-population of these fields when rows get updated by your code. For Perl’s DBIx::Class users, here’s the code to inject into your DBIC Result base class. The first function, _get_searchdata, is the key one: it takes a specified list of columns, normalizes their content and returns the searchdata value. The other functions handle manual refresh of the search data in a row, and automatically updating the search data on update and create respectively:

sub _get_searchdata {
    my ($self) = @_;

    return OradanOraya::Search->to_search( join ' ', map { $self->$_ || '' } $self->searchdata_columns )
}

sub refresh_searchdata {
    my ($self) = @_;
    $self->update({
        searchdata => $self->_get_searchdata
    });
}

sub set_column {
    my $self = shift;

    my $ret = $self->next::method( @_ );

    if( $self->can('searchdata') ) {
        # Note that we call the super-class set_column update method rather than ourselves otherwise we'd have an infinite loop
        $self->next::method( 'searchdata', $self->_get_searchdata );
    }

    return $ret;
}

sub insert {
    my $self = shift;

    if( $self->can('searchdata') ) {
        $self->searchdata( $self->_get_searchdata );
    }

    return $self->next::method( @_ );
}

In any of your tables where you have added a searchdata column, create a method that simply returns the columns you want indexed:

    sub searchdata_columns { qw< title name > }

Step 4: Search Queries and Ordering

Whenever a row is added or updated the normalized search text will now be populated (see below for a script to populate it if you have existing data). To do nice searches you can now execute the following SQL (for MySQL replace ~ with the REGEXP operator):

select * from article where searchdata ~ 'foo'

This will match the text anywhere. If you only want to match words beginning with the search text you can use PostgreSQL’s zero-width start-of-word \m operator (in normal regexp language this is roughly equivalent to \b, although that matches both the beginning and end of words):

select * from article where searchdata ~ '\mfoo'

If you want to order results so that beginning-of-string matches come first and the rest follow alphabetically, you can do something like the following (note the !~, as false orders before true):
SELECT *
FROM article
WHERE searchdata ~ '\mfoo'
ORDER BY searchdata !~ '^foo', searchdata

Well that’s a job well done! You can look at using some sort of index in the database to speed this up, but to be honest for tables with less than 10k rows it’s probably not worthwhile. You’ll need to look at the trigram (pg_trgm) indexes that Postgres has; I don’t believe MySQL is able to index these sorts of regex searches.
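For reference, creating such an index would look something like this – a sketch, assuming the pg_trgm extension is available and using made-up database/index names (regex support on trigram indexes needs PostgreSQL 9.3+):

psql mydb -c "CREATE EXTENSION IF NOT EXISTS pg_trgm"
psql mydb -c "CREATE INDEX article_searchdata_trgm_idx ON article USING gin (searchdata gin_trgm_ops)"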

The DBIC code for this last one:

my $search_str = quotemeta($fn->to_search( $p->{search} ));
$dbic->resultset('Article')->search({
  searchdata => { '~' => '\m' . $search_str }
}, {
  order_by => [
         \[ 'searchdata !~ ?', [ str => '^' . $search_str ] ],
         'searchdata'
  ]
});

Extra Step: Create a Reindex Script

You’ll also want to write some code to find any tables with a searchdata column and update them, for the initial population. Here’s the perl/dbic solution for this again, but it should be simple enough with any ORM (note the transaction commit every 100 updates, which significantly improves performance on all database servers):

$|++;
my $count = 1;
$dbic->txn_begin;
for my $table ($dbic->sources) {
    my $rs = $dbic->resultset($table);

    next unless $rs->result_source->has_column('searchdata');

    while( my $d = $rs->next ) {
        if( $count++ % 100 == 0 ) {
            $dbic->txn_commit;
            $dbic->txn_begin;
            print ".";
        }

        $d->refresh_searchdata;
    }
}
$dbic->txn_commit;

How to use mitmproxy to capture https connections

Based on the excellent in-depth guide found here, I’ve written a few quick startup notes to myself below:

sudo ufw disable
sudo iptables -t nat -A PREROUTING -p tcp --dport 443 -j REDIRECT --to-port 8000 # remember not to use -i...
mitmproxy -T --host

Philip’s instructions have -i with the nat PREROUTING rule, and because I’m on wireless this was a source of frustration until I noticed. Forwarding is enabled by default on my box as I run some vm’s from time to time, so the box will automatically forward the packets and the rule just pulls out the ones on port 443, which are the ones I’m interested in.
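If forwarding isn’t already enabled on your box, the standard sysctl knob turns it on:

sudo sysctl -w net.ipv4.ip_forward=1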

crypt() function potential insecurities with invalid salts

So, yesterday I discovered quite a serious vulnerability in an application using the crypt() function (the application was in perl, but perl just calls through to the system library, so this design flaw may be found in anything using the crypt() function for authentication).

Firstly, why use crypt() at all? It’s well known that the DES-style crypt() is very weak – for example the salt is only 2 characters, and it only takes the first 8 characters of the password and ignores everything after that. However the modern glibc implementation of crypt() includes a number of very secure hashing functions prefixed with $, particularly $6$ which is salted SHA-512 and which I’d advise everyone to use.

Anyway, back to the issue at hand. Your standard crypt() password check looks like this (taken from perl’s Catalyst::Authentication::Credential::Password module):

return $storedpassword eq crypt( $password, $storedpassword );

ie you use the already-crypted password as the salt when crypting the user-specified password, and then check the result against the crypted password itself. If they match it means the password was correct.

However in this application, certain pre-created accounts, and accounts that had only ever been logged in to using an oauth mechanism (eg facebook login), had an empty string in the user’s password field. This seems reasonable enough – crypt() of a blank password should always return something non-blank (unless you’re using an older version of mysql which has a crypt() inconsistency that I reported 2 years ago). eg

$ perl -le 'print crypt("", "aa")'
aaQSqAReePlq6

Unless, that is, you specify an invalid (or blank) salt:
$ perl -le 'print crypt("any password", "a")'

$ perl -le 'print crypt("any password", "")'


D’oh. This basically means that if you are using a blank password column to mean “no password login allowed”, in reality someone can log in with ANY password! So in the case of the app in question, if you knew the registered email address of someone with a pre-created (but supposedly locked-out) account, or the address of someone who had only ever logged in using oauth, you could log in with any password. I’m guessing that because this is not particularly widely known amongst developers there are probably a number of apps where this is possible today, but no-one has tested it.

Some workarounds/mitigations:

  • Set your database password field to default to something that is non-blank (eg ‘a’) – even if crypt() classes the salt as invalid and returns blank, it will never match this field (demonstrated below).
  • Use something that overrides crypt() to auto-generate a salt if none is specified (eg Crypt::Password::Util’s crypt function). This won’t cover you in the case where the specified salt is present but invalid, so crypt() still returns a blank.
  • Assert that the user’s password field is not blank before allowing login (but you need to make sure you do this everywhere in your application).
  • For language designers: either make sure that your crypt() function never returns blank (throw an error on an invalid salt, for example), OR have it automatically generate a valid salt whenever the supplied salt is missing or invalid.
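To demonstrate the first workaround: with a non-blank default such as ‘a’ stored in the password column, the blank string that crypt() returns for the invalid salt can never match it, so the login is refused:

$ perl -le 'print "matched" if "a" eq crypt("any password", "a")'
$

(Nothing is printed, ie no match.)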

Extracting old weird format audio files

So, I have a friend who has a load of recordings from about 10 years ago which were made on a weird dictaphone. The files had the extension .FC4 which, according to the internet, is a legacy Amiga audio format with no remaining support. Great.

First thing was to run file on it:

$ file t.FC4 
t.FC4: data

Great. Let’s see if we can do a better job by looking at a hex dump (with xxd):

0000000: 4649 4c45 0103 0101 0333 0fff ffff ffff  FILE.....3......
0000010: ffff ffff ffff ffff ffff ffff ffff ffff  ................
0000020: aa10 1f40 01ff ffff ffff ffff ffff ffff  ...@............
0000030: ffff ffff ffff ffff ffff ffff ffff ffff  ................
0000040: ffff ffff ffff ffff ffff ffff ffff ffff  ................
0000050: 4d49 2d53 4334 ffff ffff ffff ffff ffff  MI-SC4..........
0000060: 4456 522d 3030 37ff ffff ffff ffff ffff  DVR-007.........
0000070: 4130 322d 3033 3031 3031 3033 3531 3135  A02-030101035115
0000080: ffff ffff ffff ffff ffff ffff ffff ffff  ................
0000090: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000a0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000b0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000c0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000d0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000e0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00000f0: 4643 34ff ffff ffff ffff ffff ffff ffff  FC4.............
0000100: 5249 4646 501f 0000 5741 5645 666d 7420  RIFFP...WAVEfmt 
0000110: 1400 0000 5003 0100 401f 0000 2910 0000  ....P...@...)...
0000120: 1e00 0000 0200 3a00 6461 7461 0000 0000  ......:.data....
0000130: 00fe ffff feff fffe ffff feff ffef 55ff  ..............U.
0000140: feff feff feff feff feff effe feff efef  ................
0000150: efef efef efef efef efef efef 55ef efef  ............U...
0000160: effe feff efef feff effe ffff ffff ffff  ................
0000170: ffff ffff ffff ffff ffff 55ff ffff ffff  ..........U.....
0000180: ffff ffff ffff ffff ffff ffff ffff ffff  ................
0000190: ffff ffff ffff ffff 55ff ffff ffff ffff  ........U.......
00001a0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00001b0: ffff ffff ffff 55ff ffff ffff ffff ffff  ......U.........
00001c0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00001d0: ffff ffff 55ff ffff ffff ffff ffff ffff  ....U...........
00001e0: ffff ffff ffff ffff ffff ffff ffff ffff  ................
00001f0: ffff 55ff ffff ffff ffff ffff ffff ffff  ..U.............
0000200: ffff ffff ffff ffff ffff ffff ffff ffff  ................
0000210: 55ff ffff ffff ffff ffff ffff ffff ffff  U...............
...
00006e0: ffff ffff fd88 3fe1 e1e1 e1ef d21e 1e1e  ......?.........
00006f0: 1fff ffff c821 e1e1 bb32 f2d1 55f1 e1e1  .....!...2..U...
0000700: dbac f61f ffff ac15 fe2e 1e1d 8f2f 1e1e  ............./..
0000710: dc2d 4fe1 d9d4 f3ef ed31 55b2 e219 b4fb  .-O......1U.....

So, it looks like at offset 0x100 (256) we have something that is a RIFF/WAV file, and the stuff that shows as U is probably a chunk-size block or somesuch. Given the blocks of data afterwards it could plausibly be 16-bit single channel, at a guess. Perhaps something can read it if we cut the initial header off and re-save:

$ xxd -s -256 -r t out.wav
$ file out.wav 
out.wav: RIFF (little-endian) data, WAVE audio, mono 8000 Hz

Ah-ha looks like file has a clue now. Let’s try to play it:

$ mplayer out.wav
...
Requested audio codec family [sc4] (afm=acm) not available.
Enable it at compilation.
Cannot find codec for audio format 0x350.
...

D’oh. Opening it as a raw file in audacity shows pretty much white noise (whereas you’d have expected something vaguely like speech, with blips every so often, if it were any sort of valid PCM or wave-type encoding).

After searching around for a long time I discovered this post, which talked about a very similar looking header and especially WAV encoding 0x350. This linked to an mplayer plugin with an acm and inf file; however the ubuntu version of mplayer doesn’t support w32codecs. I tried installing this in several different ways in a windows 7 vm but couldn’t get it to work.

I then tried compiling mplayer from source only to be greeted with:

cc -MMD -MP -Wundef -Wall -Wno-switch -Wno-parentheses -Wpointer-arith -Wredundant-decls -Werror=format-security -Wstrict-prototypes -Wmissing-prototypes -Wdisabled-optimization -Wno-pointer-sign -Wdeclaration-after-statement -std=gnu99 -Werror-implicit-function-declaration -D_POSIX_C_SOURCE=200112 -D_XOPEN_SOURCE=600 -D_ISOC99_SOURCE -I. -Iffmpeg -O4 -march=native -mtune=native -pipe -ffast-math -fomit-frame-pointer -fno-tree-vectorize -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D_LARGEFILE64_SOURCE  -fpie -DPIC -D_REENTRANT  -I/usr/include/freetype2 -DZLIB_CONST -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -c -o loader/wrapper.o loader/wrapper.S
loader/wrapper.S: Assembler messages:
loader/wrapper.S:31: Error: `pusha' is not supported in 64-bit mode
loader/wrapper.S:34: Error: operand type mismatch for `push'
loader/wrapper.S:38: Error: operand type mismatch for `push'
loader/wrapper.S:40: Error: operand type mismatch for `push'
loader/wrapper.S:45: Error: operand type mismatch for `push'
loader/wrapper.S:46: Error: operand type mismatch for `push'

D’oh. Rather than mess around with trying a 32-bit compile or hacking the assembly, I remembered I had a 10-year-old laptop lying around with a very old 32-bit install of gentoo. Power it up, install the codec files and it plays them!

I then tried to extract a proper PCM WAV file from the FC4 file using mencoder, but mencoder doesn’t support audio-only output. I also tried the -dumpstream option in mplayer, but that just dumps the still-encoded audio. Finally I came across the -ao pcm option, which puts out a nice plain wav file that I can encode into mp3 or any other format.
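From memory the command was along these lines (a sketch – the exact suboptions may vary between mplayer versions, and decoded.wav is just my choice of output name):

mplayer -vo null -vc null -ao pcm:file=decoded.wav out.wav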

Cutting a video into segments on the command-line

So, today I needed to cut a video into several segments, and figured that as I may need to reprocess them in the future, the best thing to do would be to write a small script on the command-line. Fortunately it turned out to be pretty easy… First create a file called cut_points with the cut points (in seconds):

5
10
20
100

(That last line of 100 is just some value greater than the length of the video.) Then run the following bash one-liner:

i=1
prev=0
for new in $(cat cut_points); do
  # cut the segment between $prev and $new into $i.mp4
  avconv -y -i out.mp4 -ss $prev -t $(echo "$new-$prev" | bc) -async 1 -strict experimental $i.mp4
  i=$(($i+1))
  prev=$new
done

Unfortunately this does re-encode the video. (I guess if you just use stream copy for the a/v streams the cuts will snap to the nearest keyframe, so your cut points would only be accurate to the nearest few seconds, which wouldn’t work in my case.)
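For reference, the stream-copy variant of the avconv line would be the following – no re-encode, but with the keyframe-snapping caveat above:

avconv -y -i out.mp4 -ss $prev -t $(echo "$new-$prev" | bc) -c copy $i.mp4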