jueves, 19 de enero de 2017

gnu parallel as a queuing system

This post is about a gnu parallel, a tool I recently discovered, and I'm starting to use a bit more every day.

At its core, it's a command to multiple commands in parallel, but it has many many different options to customize how the paralelization is done, notifications, and other configs. Take a look at the official tutorial or the man page, which contain a wealth of info and examples.

Let's get SICP videos

The use case I have today is to use it as a simple queuing system. I just want processes to start when I have a new job for them.  The task at hand is to download all sicp lectures, at one download at a time (don't want to hog the network).

First of all, we notice the pattern of the links:

  1. http://www.archive.org/download/MIT_Structure_of_Computer_Programs_1986/lec1a.mp4
  2. http://www.archive.org/download/MIT_Structure_of_Computer_Programs_1986/lec1b.mp4
  3. http://www.archive.org/download/MIT_Structure_of_Computer_Programs_1986/lec2a.mp4 
  4. http://www.archive.org/download/MIT_Structure_of_Computer_Programs_1986/lec2b.mp4 

We notice the pattern, right? Let's craft some generator for the filenames.


 perl -e 'for(1..15){for $i (('a','b')){print "wget http://www.archive.org/download/MIT_Structure_of_Computer_Programs_1986/lec$_$i.mp4\n"}}' >sicp.list


After we generated the file, we're going to run the command in the following way:

cat sicp.lisp | parallel -j1 --tmux

this makes parallel run one after the other, and just putting the outputs of each job in a tmux tab.

B, b, but.... what's the point of all this?

Ok, we didn't use parallel for anything useful, we could have run the list as a shellscript, and be happy.   The idea is that we can use this simple mechanism to treat the file as a job queue, that waits for new incomming jobs and then processes them.  Beware, it has very little logging, and you can't do very sophisticated error recovery (see --resume-failed), so it's NOT a replacement for resque/sidekiq/etc...  In fact, I'd love to see something like a suckless queuing system based on parallel and bash.

So, here's the command to use this tool as a queuing system.

touch joblist; tail -f joblist | parallel --tmux 

Then, add lines to the joblist for them to be executed. It's a very easy plumbing task:

echo "sleep 10" >joblist

tail -f works as our event loop, waiting for new tasks to come, and it will pass the tasks to parallel, that will apply the job contention depending on the number of jobs you configure it to run in parallel.

I've just scratched the surface of what parallel is able to do. Do some searching around, and take a look at the man and tutorial to get a grasp of what this amazing gnu tool is able to do.  Taco Bell programming at its best!

References

Here I'm pasting some useful refs (appart from the ones already mentioned). 

4 comentarios:

José Galisteo dijo...

Muy guapo, hace tiempo buscaba algo así para bajar unos ficheros en paralelo por consola

Raimon Grau dijo...

Me debes un euro

Ole Tange dijo...

No generator needed:

parallel wget http://www.archive.org/download/MIT_Structure_of_Computer_Programs_1986/lec{1}{2}.mp4 ::: {1..15} ::: a b

Raimon Grau dijo...

Ole,
very nice. It's true that one can replace most loops with a single parallel command :)
Thanks for passing by!