GNU parallel is underrated

I’m always surprised to learn that a friend who has used Linux for a long time, in both server and desktop contexts, has never heard of GNU parallel.

If you use GNU parallel together with pv (pipe viewer), UNIX shell pipelines, and Python’s fileinput module, you get a pretty powerful parallel job-running framework with testable, independent pieces. It’s one of my favorite tricks when writing tools for my own benefit.

Let’s break it down.

Python has Perl-style, well-behaved stdin/stdout/file command-line tool support built into the standard library; it’s just underpromoted. It’s called fileinput. You can read about it here.
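For example, a filter built on fileinput reads lines from stdin, or from whatever filenames are passed as arguments, with no extra plumbing. Here’s a minimal sketch (the script name and its upper-casing behavior are made up for illustration):

    #!/usr/bin/env python3
    # upper.py -- hypothetical filter: fileinput.input() iterates over lines
    # from the files named on the command line, or from stdin if none are given.
    import fileinput

    for line in fileinput.input():
        print(line.rstrip("\n").upper())

Invoked either way, as cat words.txt | ./upper.py or as ./upper.py words.txt, it behaves like any other well-mannered UNIX filter, which is exactly what makes it easy to drop into the pipelines below.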

UNIX/Linux shells, like bash and zsh, have built-in support for parallel job management. This is described in detail here. It’s just rarely used.
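For instance, bash can already run a handful of commands concurrently with & and wait for all of them to finish. A minimal sketch (the log file names are made up):

    # Start three compression jobs in the background, one per file.
    gzip access.log &
    gzip error.log &
    gzip debug.log &
    wait    # block until every background job has finished
    echo "all done"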

To inspect shell pipelines, you might want to take a look at tools like pv, the (p)ipe (v)iewer, or up, the (u)ltimate (p)lumber.
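For example, dropping pv into a pipeline reports throughput, progress, and an ETA as data flows through it (the file name is made up):

    # Watch progress while compressing a large file.
    pv big-dump.sql | gzip > big-dump.sql.gz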

Using the & syntax correctly for spawning jobs and spawning “shell functions” as jobs is perhaps not for the faint of heart. That’s where GNU parallel fits in.

As described on Wikipedia, GNU parallel “allows the user to execute commands in parallel,” simply by taking single-core jobs and automatically scaling them to run “as parallel as there are CPU cores” on the local machine. It accomplishes this through a number of easy-to-understand tricks: automatically spawning multiple copies of the job; partitioning the lines over standard input (stdin) or the filenames passed as arguments; and then finally, multiplexing standard output (stdout) and standard error (stderr).
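In its simplest form, that looks something like the sketch below (the file names are made up): feed parallel a list of inputs, and it runs one command per input, one job per CPU core by default, while keeping each job’s output from interleaving.

    # Compress every .log file, running one gzip per CPU core.
    ls *.log | parallel gzip {}

    # The same thing, passing the inputs as arguments instead of via stdin.
    parallel gzip {} ::: *.log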

There are even fancier ways to use GNU parallel, such as splitting a single large file into multi-megabyte blocks as a partitioning scheme (to do, e.g., a “parallel grep” of one file), or running local map/reduce-style jobs via a 3-stage UNIX pipeline, in which the first stage is the data collection step, GNU parallel runs a given program as a “mapper” over the inputs, and a final command in the pipeline runs a “reducer” to gather results.
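A rough sketch of both ideas (the file name, pattern, and block size are made up); --pipe splits stdin into blocks of roughly --block bytes and hands each block to its own copy of the command, and a trailing pipeline stage can act as the reducer:

    # "Parallel grep": split one big file into ~10 MB blocks, grep each block on its own core.
    cat big.log | parallel --pipe --block 10M grep ERROR

    # Map/reduce in a 3-stage pipeline: collect, map (count lines per block), reduce (sum the counts).
    cat big.log | parallel --pipe wc -l | awk '{ total += $1 } END { print total }'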

The creator of GNU parallel wrote a USENIX article in 2011, which you can find here (PDF), that explains how it can be used in a variety of sysadmin contexts, including: multi-core speedups of file read/write operations; managing remote clusters via ssh; transferring files between remote machines; introducing parallelism into shell scripts; and using plain text files as parallel job queue managers.
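To give a flavor of the ssh use cases, here is a rough sketch (the host names are made up): -S names the remote machines to spread jobs across, --nonall runs the same command once on every listed host, and --transfer, --return, and --cleanup ship each input file to the remote side and bring the result back.

    # Run one command on each remote host, in parallel over ssh.
    parallel --nonall -S server1,server2,server3 uptime

    # Ship each .log file to a remote host, gzip it there, and pull the .gz back.
    parallel -S server1,server2 --transfer --return {}.gz --cleanup gzip {} ::: *.log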

I bet you didn’t realize you could do all of that without a “framework” of any kind! Just Linux, a rock-solid (but ancient) Perl 5 program, shell pipelines, and SSH. Very nice!

That’s timeless design. Happy hacking!


Note: I put this post up on my blog with an old date of November 10, 2021, because I hate how Twitter/X no longer renders threads properly for anonymous visitors. The source material that I elaborated into this blog post is available here on Twitter/X.
