Parallel data analysis

Got some sequencing data? Many powerful tools for analysing them run on the command line, and this post is part of a series of short but essential posts that help you get started. I assume that you are working on a UNIX-based operating system (a ‘Mac’ or ‘Linux’ computer).

What if your data analysis on a remote server takes several hours, days, or even weeks, to finish? No worries, you don't need to be connected to the remote server while the data are being analysed. Here, I introduce you to the tools that allow you to start an analysis, disconnect from the server, and then look at the progress or the results at a later time point.

nohup

The nohup tool allows you to keep a process running after you log off: nohup makes the command ignore the hangup signal that is sent when your connection closes, and a trailing & places the command in the background. While the analysis is running, you can do other tasks in parallel or log off from the remote server.

Imagine the nohup tool as a bracket that encloses the command that you want to run in the background:

nohup … &

nohup always precedes, and & always follows, the command that you want to run in the background (shown here as …). Let's say you want to run the command ls -lhcrt (which lists all files and subdirectories in your current directory) in the background:

nohup ls -lhcrt &

When you hit ENTER, the terminal prints out some information:

[1] 21118
nohup: ignoring input and appending output to ‘nohup.out’

The number 21118 (which will differ in your case) in the first line is the process-ID of your background process. The second line informs you that all output that would normally be printed in the terminal window is now redirected to the file nohup.out.
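Because the process-ID is printed only once, it can be convenient to store it right away. The following is a minimal sketch, assuming a Bourne-style shell such as bash, in which the special variable $! holds the process-ID of the most recently started background job. The sleep 2 command is only a stand-in for a real analysis, and the file names job.log and job.pid are made up for this example.

```shell
# Work in a scratch directory so the example leaves no files in your project.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 'sleep 2' is a placeholder for a long-running analysis command.
nohup sleep 2 > job.log 2>&1 &

# $! expands to the process-ID of the most recent background job.
job_pid=$!

# Keep the process-ID in a file so that it is still available after logging off.
echo "$job_pid" > job.pid

echo "Started background job with process-ID $job_pid"
```

After logging in again, cat job.pid (in the directory where the file was written) returns the number that ps and kill need.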

Process-ID

Let's first have a closer look at the process-ID. What's the use of this number?

Process status

If you have started a process that takes several hours - or longer - to finish, then you can use the process-ID to see if the process is still running. For this, you can use the ps command with the -p option, which reports the status of a process with a certain process ID. To see the status of the process I have started above, I would use:

ps -p 21118

The output is

PID TTY          TIME CMD

Since this is only the header line of the process specifications, the process must have finished. Here:

  • PID indicates the process-ID
  • TTY indicates the controlling terminal
  • TIME shows the cumulative CPU time that the process has used so far
  • CMD shows the command name

If the process were still running, you would get a line similar to:

  PID TTY          TIME CMD
21118 ?        00:00:04 ls
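The same check can be used inside a script, because ps -p also reports its result through its exit status: 0 if the process exists and non-zero if it does not. Here is a small sketch under that assumption. The helper name is_running is my own invention, and sleep 2 again stands in for a real analysis.

```shell
# Report whether a process-ID belongs to a live process.
# ps -p exits with status 0 if the process exists and non-zero otherwise.
is_running() {
    ps -p "$1" > /dev/null 2>&1
}

# 'sleep 2' is a placeholder for a long-running analysis.
sleep 2 &
pid=$!

if is_running "$pid"; then before="running"; else before="finished"; fi
echo "Process $pid is $before"

# wait blocks until the background job has ended.
wait "$pid"

if is_running "$pid"; then after="running"; else after="finished"; fi
echo "Process $pid is $after"
```

Note that wait only works for jobs started from the same shell. After logging off and back in, checking with ps -p and the process-ID you noted down is the way to go.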

Cancel the process

The process-ID also allows you to cancel the process before it finishes. This comes in handy when you realise that you started it with the wrong parameters or input files and want to restart it with different settings. The kill command allows you to cancel a specific process:

kill 21118

This would cancel the process that we started before in the background. If you can't remember the process-ID but want to cancel all ls processes, then you could use the pkill command in the following way:

pkill ls

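Unlike the kill command, the pkill command takes the command name instead of the process-ID of the running process that you want to cancel. Be aware that pkill treats the name as a pattern, so pkill ls also cancels any other of your processes whose name contains ls; the -x option restricts the match to the exact name.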
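To see what cancelling looks like from start to finish, here is a short sketch, assuming a Bourne-style shell. The sleep 60 command stands in for an analysis that was started with the wrong settings. By default, kill sends the TERM signal (signal number 15), and the shell reports a process ended by a signal with the exit status 128 plus the signal number.

```shell
# 'sleep 60' is a placeholder for an analysis started with wrong parameters.
sleep 60 &
pid=$!

# kill sends the TERM signal by default, which asks the process to terminate.
kill "$pid"

# wait returns the exit status of the job: 128 + 15 = 143 means "ended by TERM".
status=0
wait "$pid" || status=$?

echo "Process $pid ended with status $status"
```

If a process does not react to the default signal, kill -9 forces it to terminate, but the process then gets no chance to clean up temporary files.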

Redirecting output

By default, the nohup command redirects all output that would otherwise appear in the terminal window to the nohup.out file. If the file already exists, it is not overwritten; new output is appended to the end of the file. With the > operator, you can redirect the output to a different file. For example, to redirect the output of the ls command to the file Directory-Listing.txt, I use the command:

nohup ls -lhcrt > Directory-Listing.txt &

So, the redirection operator (>) is followed by the name of the target file and precedes the closing & of the command. If you want to save the output to a file in a different directory, specify the full path of the target file, like:

nohup ls -lhcrt > /home/alj/Documents/DirectoryListing.txt &
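Many analysis tools print their progress and error messages on a separate channel (standard error). As a sketch of how the two channels can be kept apart, the example below uses the standard shell operator 2> in addition to >. The file names listing.txt and errors.log and the path no-such-file are made up for this demonstration; the missing path is only there to provoke an error message.

```shell
# Work in a scratch directory so the example leaves no files in your project.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Regular output goes to listing.txt, error messages go to errors.log.
# 'no-such-file' does not exist, so ls prints one error message about it.
nohup ls -lhcrt . ./no-such-file > listing.txt 2> errors.log &

# Wait for the job; ls exits with a non-zero status because of the missing path.
wait $! || true

echo "Regular output:"
cat listing.txt
echo "Error messages:"
cat errors.log
```

Keep in mind that > overwrites the target file on every run. To append instead, as nohup does with nohup.out, use >> in its place.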

screen

The screen tool provides an alternative to nohup for keeping a process running on a remote server after you log off, or for running different processes in parallel.

You can imagine the screen tool as a command-line way to open several terminals as ‘sessions’ in parallel and to run a different process in each of them. To start a new session, you can use:

screen -S Testscreen

The option -S allows you to assign a name (here Testscreen) to the session. Once you hit ENTER, you are presented with a new (clean) terminal window. This is your Testscreen-session. You can execute any commands in it, and while a process is running, you can detach from the session by first pressing ‘CTRL+A’ on your keyboard and then hitting the letter ‘d’ (for ‘detach’). After detaching, I get the message:

[detached from 970.Testscreen]

This means that I detached from the Testscreen-session, which has the process-ID 970. The processes in this session, however, continue to run. You can log off from the remote server and get back to the session later.

To get an overview of all sessions that are running in parallel, use:

screen -ls

In my case, I get:

There are screens on:
    970.Testscreen  (14. feb. 2015 kl. 19.53 +0100)  (Detached)
    31995.pts-9.alj-Inspiron-5537  (14. feb. 2015 kl. 19.47 +0100)  (Detached)
2 Sockets in /var/run/screen/S-alj.

You see that I have two sessions running, both of them detached. To re-attach to our Testscreen-session, just enter:

screen -r Testscreen

The option -r (for re-attach) is followed by the name of the session that you would like to re-attach to.

When I now list the running sessions again, the Testscreen-session is marked as attached:

screen -ls

There are screens on:
    970.Testscreen  (14. feb. 2015 kl. 19.53 +0100)  (Attached)
    31995.pts-9.alj-Inspiron-5537  (14. feb. 2015 kl. 19.47 +0100)  (Detached)
2 Sockets in /var/run/screen/S-alj.

To stop a session, you have two options: either attach to the session and enter exit in the terminal window, or use the kill command with the process-ID of the session. To stop the Testscreen-session, for example, I would use:

kill 970

When using the screen tool, be aware that, unlike with nohup, all output is printed to the session's terminal and not to a file.
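If you want a copy of the output in a file as well, one possible approach is to start the command in a detached session and pipe it through tee. This is only a sketch: it assumes that screen is installed, the session name Demo and the log file location are made up, and ls -lhcrt stands in for a real analysis. The options -d -m (written together as -dm) tell screen to start the session without attaching to it.

```shell
# Hypothetical location for the copy of the output.
log_file="${TMPDIR:-/tmp}/screen-demo.log"

if command -v screen > /dev/null 2>&1; then
    # -dm starts the session detached; -S gives it the (made-up) name Demo.
    # tee shows the output in the session and also copies it to the log file.
    if screen -dmS Demo sh -c "ls -lhcrt 2>&1 | tee '$log_file'"; then
        demo_state="started"
    else
        demo_state="could not start"
    fi
else
    demo_state="screen not installed"
fi

echo "Detached session: $demo_state"
```

A session started this way ends on its own as soon as the command has finished, so for a quick command like ls it will usually be gone by the time you run screen -ls; the log file remains.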