Multi-threaded xargs with examples

If you are new to xargs, or do not know what xargs is yet, please read our xargs for beginners with examples first. If you are already somewhat used to xargs, and can write basic xargs command line statements without looking at the manual, then this article will help you to become more advanced with xargs on the command line, especially by making it multi-threaded.

In this tutorial you will learn:

  • How to use xargs -P (multi-threaded mode) from the command line in Bash
  • Advanced usage examples using multi-threaded xargs from the command line in Bash
  • A deeper understanding of how to apply xargs multi-threaded to your existing Bash code

Software requirements and conventions used

Software Requirements and Linux Command Line Conventions
Category Requirements, Conventions or Software Version Used
System Linux Distribution-independent
Software Bash command line, Linux based system
Other The xargs utility is part of the GNU findutils package, which most Linux distributions install by default
Conventions # – requires the given Linux commands to be executed with root privileges, either directly as the root user or by use of the sudo command
$ – requires the given Linux commands to be executed as a regular non-privileged user

Example 1: Calling another Bash shell with xargs compiled input



Once you have learned the basics of xargs, you will soon find that, although xargs can do many powerful things by itself, its power seems to be limited by its inability to execute multiple commands in sequence.

For example, let’s say we have a directory which has subdirectories named 00 to 10 (11 in total). And, for each of these subdirectories, we want to traverse into it, and check if a file named file.txt exists, and if so cat (and merge using >>) the contents of this file to a file total_file.txt in the directory where the 00 to 10 directories are. Let’s try and do this with xargs in various steps:

$ mkdir 00 01 02 03 04 05 06 07 08 09 10
$ ls
00  01  02  03  04  05  06  07  08  09  10
$ echo 'a' > 03/file.txt
$ echo 'b' > 07/file.txt
$ echo 'c' > 10/file.txt

Here we first create 11 directories, 00 to 10 and next create 3 sample file.txt files in the subdirectories 03, 07 and 10.

$ find . -maxdepth 2 -type f -name file.txt
./10/file.txt
./07/file.txt
./03/file.txt

We then write a find command to locate all file.txt files, starting at the current directory (.) and descending at most one level into subdirectories:

$ find . -maxdepth 2 -type f -name file.txt | xargs -I{} cat {} > ./total_file.txt
$ cat total_file.txt
c
b
a

The -maxdepth 2 indicates the current directory (1) and all subdirectories of this directory (hence the maxdepth of 2).
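As a small sketch of how -maxdepth limits the search, the following creates a scratch directory (the directory names are made up for this illustration) and counts the matches at three different depth limits:

```shell
# Illustration only: the directory layout below is hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir -p 03/deeper
echo 'a' > 03/file.txt          # depth 2: ./03/file.txt
echo 'x' > 03/deeper/file.txt   # depth 3: ./03/deeper/file.txt

# grep -c '' counts the lines (one per file found) printed by find.
depth1=$(find . -maxdepth 1 -type f -name file.txt | grep -c '')
depth2=$(find . -maxdepth 2 -type f -name file.txt | grep -c '')
depth3=$(find . -maxdepth 3 -type f -name file.txt | grep -c '')

echo "maxdepth 1: $depth1, maxdepth 2: $depth2, maxdepth 3: $depth3"
# prints: maxdepth 1: 0, maxdepth 2: 1, maxdepth 3: 2
```

With -maxdepth 1 only the current directory itself is searched, so nothing is found; each extra level brings one more file.txt into view.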

Finally we use xargs, with the commonly used {} replacement string passed to the -I (replace string) option, to cat the contents of every file located by the find command into a file in the current directory named total_file.txt.

Something nice to note here is that, even though xargs executes multiple cat commands one after another, all writing to the same file, one can use > (output to a new file, creating it if it does not exist yet, and overwriting any existing file with the same name) instead of >> (append to a file, creating it if it does not exist yet). This works because the shell applies the redirection only once, to the xargs command as a whole, before any cat is started; every cat that xargs launches then inherits the same already-open output file.
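The following is a minimal sketch of this behaviour, run in a scratch directory (the directory and the sort step are added here for illustration, the latter only to make the order predictable). A single > is enough, even though cat is started once per file:

```shell
# Illustration only: a scratch directory with two sample files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir 03 07
echo 'a' > 03/file.txt
echo 'b' > 07/file.txt

# The shell opens total_file.txt once, for the xargs process itself.
# Each cat started by xargs inherits that open file, so nothing is overwritten.
find . -maxdepth 2 -type f -name file.txt | sort | xargs -I{} cat {} > ./total_file.txt

line_count=$(grep -c '' total_file.txt)
echo "lines in total_file.txt: $line_count"
# prints: lines in total_file.txt: 2
```

Had the > been placed inside a command that is executed once per file (for example within a bash -c string), each run would truncate the file again, and >> would be required.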



The exercise so far sort of fulfilled our requirements, but it did not match the requirement exactly – namely, it does not traverse into the subdirectories. It also did not use the >> redirection as specified, though using that in this case would still have worked.

The challenge with running multiple commands (like the cd command required to change into each subdirectory) from within xargs is that such command lines are very hard to write, and in some cases it may not be possible to write them at all.

There is, however, a different and easy to understand way to code this, and once you know how to do it, you will likely use it often. Let’s dive in.

$ rm total_file.txt

We first cleaned up our previous output.

$ ls -d --color=never [0-9][0-9] | xargs -I{} echo 'cd {}; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi'
cd 00; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi
cd 01; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi
cd 02; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi
cd 03; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi
cd 04; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi
cd 05; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi
cd 06; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi
cd 07; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi
cd 08; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi
cd 09; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi
cd 10; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi

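Next, we formulated a command, this time using ls, which lists all entries whose names match the [0-9][0-9] pattern. Strictly speaking this is a shell glob pattern, expanded by Bash before ls even runs, rather than a regular expression, although the bracket syntax is similar (read our Advanced Bash regex with examples article for more information on regular expressions).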
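To see what the [0-9][0-9] pattern expands to, here is a small sketch in a scratch directory (the extra names abc, 123 and 42 are made up for this illustration). It also shows that appending a slash to the pattern restricts the match to directories:

```shell
# Illustration only: a scratch directory with a mix of names.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir 00 07 10 abc 123
touch 42                      # a regular file whose name is also two digits

# The shell expands the pattern before the command runs.
two_digit_names=$(echo [0-9][0-9])
two_digit_dirs=$(echo [0-9][0-9]/)

echo "$two_digit_names"   # prints: 00 07 10 42
echo "$two_digit_dirs"    # prints: 00/ 07/ 10/
```

The pattern matches any name consisting of exactly two digits, whether it is a file or a directory, so 123 and abc are skipped while the file 42 is included unless the trailing slash is used.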

We also used xargs, but this time (in comparison with previous examples) with an echo command which outputs exactly what we would like to do, even if that requires more than one command. Think of this as a mini-script.

We use cd {} to change into each directory listed by ls -d. The -d option makes ls list the directory names themselves instead of their contents, and the --color=never option prevents any color codes in the ls output from skewing our results. We then check whether file.txt exists and is readable in the subdirectory by using an if [ -r ... test. If it is, we cat file.txt into ../total_file.txt. Note the .. as the cd {} in the command has placed us into the subdirectory!

We run this to see how it works (after all, only the echo is executed; nothing will actually happen). The code generated looks great. Let’s take it one step further now and actually execute the same:

$ ls -d --color=never [0-9][0-9] | xargs -I{} echo 'cd {}; if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi' | xargs -I{} bash -c "{}"
$ cat total_file.txt
a
b
c


We now executed the complete script by appending one specific command, xargs -I{} bash -c "{}", which executes whatever was generated by the echo preceding it. This command is always the same, and you will find yourself writing | xargs -I{} bash -c "{}" with some regularity. Basically it tells xargs to start a new Bash interpreter for each line of input and have it execute that line as code, whatever the generated code may be. Very powerful!
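A possible variation, shown here only as a sketch and not as the method used in this article, is to keep the script text fixed and hand each directory name to bash -c as a positional parameter ($1). This avoids pasting input text into the code itself, which can matter when names contain spaces or quotes. The scratch directory and the use of printf instead of ls are assumptions made for this illustration:

```shell
# Illustration only: a scratch directory with a few sample subdirectories.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir 00 01 02 03
echo 'a' > 01/file.txt
echo 'b' > 03/file.txt

# The single-quoted script never changes; each directory name arrives as $1.
# The underscore fills $0, the script-name slot of bash -c.
printf '%s\n' [0-9][0-9] |
  xargs -I{} bash -c 'cd "$1" && if [ -r ./file.txt ]; then cat file.txt >> ../total_file.txt; fi' _ {}

cat total_file.txt
# prints:
# a
# b
```

Because every bash -c call runs in its own process, the cd inside it does not affect the next call, just as in the echo-based version.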

Example 2: Multi-threaded xargs

Here we will have a look at two different xargs commands, one executed without parallel (multi-threaded) execution, the other with. Consider the difference between the following two examples:

$ time for i in $(seq 1 5); do echo $((RANDOM % 5 + 1)); done | xargs -I{} echo "sleep {}; echo 'Done! {}'" | xargs -I{} bash -c "{}"
Done! 5
Done! 5
Done! 2
Done! 4
Done! 1

real    0m17.016s
user    0m0.017s
sys     0m0.003s
$ time for i in $(seq 1 5); do echo $((RANDOM % 5 + 1)); done | xargs -I{} echo "sleep {}; echo 'Done! {}'" | xargs -P5 -I{} bash -c "{}"
Done! 1
Done! 3
Done! 3
Done! 3
Done! 5

real    0m5.019s
user    0m0.036s
sys     0m0.015s

The difference between the two command lines is small; we only added -P5 in the second one. The difference in runtime (as measured by the time command prefix), however, is significant. Let’s find out why (and why the output differs!).



In the first example, we create a for loop which runs 5 times (due to the command substitution $(seq 1 5) generating the numbers 1 to 5), and in it we echo a random number between 1 and 5. Next, much in line with our last example, we send this output into the sleep command, and also output the duration slept as part of the Done! echo. Finally we send this to be run by a Bash subshell command, again in a similar fashion to our last example.

The first command works like this: execute a sleep, output the result, execute the next sleep, and so on. The total runtime is therefore the sum of all sleeps (5 + 5 + 2 + 4 + 1 = 17 seconds).

The second command however completely changes this. Here we added -P5, which basically starts 5 parallel threads all at once! (Strictly speaking, xargs starts separate processes rather than threads, but the effect is the same: the commands run in parallel.)

The way that this command works is: start up to x threads (as defined by the -P option) and process them simultaneously. When a thread is complete, grab new input immediately; do not wait for other threads to finish first. The latter part of that description is not applicable here. It would only apply if fewer threads were specified by -P than there are lines of input.
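To illustrate the case where there are more lines of input than parallel slots, here is a minimal sketch that feeds four fixed 2-second sleeps to xargs -P2. The fixed sleep times and the positional-parameter style of bash -c are choices made for this illustration:

```shell
# Four jobs of 2 seconds each, but only 2 may run at the same time.
start=$(date +%s)
printf '%s\n' 2 2 2 2 |
  xargs -P2 -I{} bash -c 'sleep "$1"; echo "Done! $1"' _ {}
end=$(date +%s)

elapsed=$((end - start))
echo "elapsed: ${elapsed}s"
```

The first two sleeps start together; as soon as one finishes, xargs starts the next line of input. The whole run should therefore take roughly 4 seconds, rather than the 8 seconds a sequential run would need.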

The result is that the threads which finish first, those with a short random sleep time, come back first and output their ‘Done!’ statement. The total runtime also comes down from about 17 seconds to almost exactly 5 seconds in real clock time, which is the duration of the longest single sleep. Cool!
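When choosing a value for -P, one common approach is to match the number of CPU cores. The sketch below assumes the nproc utility (part of GNU coreutils) is available and falls back to 2 otherwise; both the fallback value and the sample workload are assumptions made for this illustration:

```shell
# Use the number of available CPU cores as the parallel limit (fallback: 2).
jobs=$(nproc 2>/dev/null || echo 2)

# Run 8 small jobs; each reports the PID of the bash process that handled it.
results=$(seq 1 8 |
  xargs -P "$jobs" -I{} bash -c 'echo "item $1 handled by PID $$"' _ {})

printf '%s\n' "$results"
result_count=$(printf '%s\n' "$results" | grep -c '^item ')
echo "jobs finished: $result_count"
```

The order of the item lines may differ from run to run, since jobs running in parallel finish whenever they are done.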

Conclusion

Using xargs is one of the most advanced, and also one of the most powerful, ways to code in Bash. But it doesn’t stop at just using xargs! In this article we explored multi-threaded parallel execution via the -P option to xargs. We also looked at command substitution using $(), and finally we introduced a method to pass multi-command statements to xargs by using a bash -c subshell call.

Powerful? We think so! Leave us your thoughts.


