Learning Linux Commands: join

Introduction

Join command is yet another example of text processing utility under GNU/Linux. Join command combines two files based on the matching content lines found in each file. Using join command is quite straight forward and if used currently and in the right situation it can save lots of time and effort. This article requires very basic command line experience.

Frequently used options

  • -1 FIELD
    Join on specified field found in file 1
  • -2 FIELD
    Join on specified field found in file 2
  • -t CHAR
    Use CHAR as an input and output separator

Basics

Basic usage of join command is usage without any options. All what is required is to specify 2 files as an arguments. Let’s say we have two files A.txt and B.txt with a following content:

$ cat A.txt
1 A
2 B
3 C
$ cat B.txt
1 John
2 Linda
3 Rares

Here we can see that first field is a perfect candidate to perform a join operation upon. By default join command will perform join operation on a first FIELD where field separator is single space character or TAB. Therefore, by executing a following linux command our two files are joined based on FIELD 1:

$ join A.txt B.txt
1 A John
2 B Linda
3 C Rares

Choosing FIELD

In our previous example you could see a default function of a join command. In this section we will see how we can specify different field upon which join command should perform a joining operation. Lets consider a following A.txt and B.txt files:

$ cat A.txt
John A 1
Linda B 2
Rares C 3
$ cat B.txt
1 A
2 B
3 C

As we can see, in this case we the first default join FIELD is not longer matching. Therefore, executing a join command on these two files will not produce any output. However, we can also see that second field in first file is matching with second field in the second file. As a result we can modify default behavior of join command and join both files based on the second FIELD in each file:

$ join -1 2 -2 2 A.txt B.txt
A John 1 1
B Linda 2 2
C Rares 3 3

where -1 is a first file and 2 is a second FIELD and -2 is a second file and 2 is a second column. In this case just because both matching fields are second we can use -j as a shortcut.

$ join -j 2 A.txt B.txt
A John 1 1
B Linda 2 2
C Rares 3 3

As you have already noticed a third FIELD and first FIELD in both files can also be used for a join operation.

$ join -1 2 -2 2 A.txt B.txt
A John 1 1
B Linda 2 2
C Rares 3 3
$ join -1 3 -2 1 A.txt B.txt
1 John A A
2 Linda B B
3 Rares C C

It is important to note from the last to examples that join command will always print matching FIELD as first.

Choosing FIELD delimiter

Well not all the files are nicely formatted for a join command operation. As mentioned earlier the default FIELD delimiter is space. But what if we have “,” ( comma ) or even TAB as a delimiter? Let’s consider a following files:

$ cat A.txt
John,A,1
Linda,B,2
Rares,C,3
$ cat B.txt
1,A
2,B
3,C

In this case the only difference from previous examples is that the FIELD separator is “,”. To tell join command which FIELD separator to use we can take an advantage of -t option othewise the syntax is same:

$ join -t, -1 3 -2 1 A.txt B.txt
1,John,A,A
2,Linda,B,B
3,Rares,C,C

Multiple spaces

If your files have multiple spaces as FIELD delimiter the safest option is to convert them to a single space delimiter. For example joining these files may be quite hard:

$ cat A.txt
John A     1
Linda   B 2
Rares   C       3
$ cat B.txt
1   A
2 B
3                      C

Therefore, if the the situation allows you, you can convert it to a single space delimited file using sed command:

$ sed -i 's/\s\+/ /g' A.txt
$ sed -i 's/\s\+/ /g' B.txt
$ cat A.txt B.txt
John A 1
Linda B 2
Rares C 3
1 A
2 B
3 C

Tab a FIELD delimiter

Tab delimiter is also accepted by join command as a default. Therefore, the following example will perform valid join operation:

$ cat A.txt B.txt
John    A       1
Linda   B       2
Rares   C       3
1       A
2       B
3       C
$ join -1 3 -2 1 A.txt B.txt
1 John A A
2 Linda B B
3 Rares C C

Ignoring case

If your matching FIELD is in perfect order but FIELDS from both files have upper and lower case differences the join will not perform a valid join operation and ignore unmatched lines unless -i ( ignore case ) is used. Note the following lower-case b in the following example:

$ cat A.txt B.txt
John    A       1
Linda   b       2
Rares   C       3
1       A
2       B
3       C
$ join -1 2 -2 2 A.txt B.txt
A John 1 1
C Rares 3 3
$ join -i -1 2 -2 2 A.txt B.txt
A John 1 1
b Linda 2 2
C Rares 3 3

Overriding default join format

In the following example we will see how we can override a default join command format behavior with -o option. Consider a join output of of files A.txt and B.txt:

$ cat A.txt B.txt
John    A       1
Linda   B       2
Rares   C       3
1       A
2       B
3       C
$ join -1 2 -2 2 A.txt B.txt
A John 1 1
B Linda 2 2
C Rares 3 3

Joining field is printed as first. What if we do not want to print joining FIELD at all or what if we want to print joined field in different order. In this case the -o FORMAT option becomes very handy. In this example we will print only matching fields:

  • first field in first file: 1.1
  • first field in second file: 2.1
$ join -o 1.1 2.1 -1 2 -2 2 A.txt B.txt
John 1
Linda 2
Rares 3

Non-pairable lines

By default only pairable lines are printed by join command. -a option instructs join command to also include unmatched lines. In next example we will print all unmatched lines in file 1.

$ cat A.txt B.txt
John    A       1
Linda   B       2
Rares   C       3
Lubos   D       4
1       A
2       B
3       C
5       E
$ join -a 1 -o 1.1 2.1 -1 2 -2 2 A.txt B.txt
John 1
Linda 2
Rares 3
Lubos

or we cat print all non-pairable lines from both files by:

$ join -a 1 -a 2 -o 1.1 2.1 -1 2 -2 2 A.txt B.txt
John 1
Linda 2
Rares 3
Lubos
5


Comments and Discussions
Linux Forum