Parallel data analysis
Got some sequencing data? Many powerful tools to analyse them are based on the command line. This is part of a series of short but essential posts that help you get started. I assume that you are working on a UNIX-based operating system (‘Mac’ or ‘Linux’ computer).
What if your data analysis on a remote server takes several hours, days, or even weeks, to finish? No worries, you don't need to be connected to the remote server while the data are being analysed. Here, I introduce you to the tools that allow you to start an analysis, disconnect from the server, and then look at the progress or the results at a later time point.
nohup
The nohup tool, used together with a trailing &, allows you to run a process in the background. This means that, while the analysis is running, you can do other tasks in parallel or log off from the remote server.
Imagine the nohup tool as a bracket which encloses the command that you want to run in the background:
nohup … &
nohup always precedes and & always follows the command that you want to run in the background (shown here as …). Let's say you want to run the command ls -lhcrt (which lists all files and subdirectories in your current directory in long format, sorted by change time with the oldest first) in the background:
nohup ls -lhcrt &
When you hit ENTER, the terminal prints out some information:
[1] 21118
nohup: ignoring input and appending output to ‘nohup.out’
The number 21118 (which will differ in your case) in the first line is the process-ID of your background process. The second line informs you that all ‘results’ that would normally be printed in the terminal window are now redirected to the file nohup.out.
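In a script you cannot read the process-ID off the terminal, but the shell keeps the process-ID of the most recently started background command in the special variable $!. The following is a minimal sketch of how you might capture it. The command sleep 2 is only a stand-in for a real analysis, and the file name analysis.pid is a hypothetical choice, not something nohup creates itself.

```shell
# sleep 2 is a placeholder for a long-running analysis command.
# Output and input are redirected explicitly so the sketch behaves the
# same way whether or not it is run from an interactive terminal.
nohup sleep 2 > /dev/null 2>&1 < /dev/null &

# $! holds the process-ID of the most recently started background command.
pid=$!
echo "Started background process with process-ID $pid"

# Store the ID in a file (hypothetical name) so that you can look it up
# again after logging back in to the server.
pidfile="${TMPDIR:-/tmp}/analysis.pid"
echo "$pid" > "$pidfile"
```

After logging back in, cat on the saved file gives you the number to use with the commands in the next sections.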
Process-ID
Let's first have a closer look at the process-ID. What's the use of this number?
Process status
If you have started a process that takes several hours - or longer - to finish, then you can use the process-ID to see if the process is still running. For this, you can use the ps command with the -p option, which reports the status of a process with a certain process-ID. To see the status of the process I have started above, I would use:
ps -p 21118
The output is
PID TTY TIME CMD
Since this is only the header line of the process listing, the process must have finished. The columns are:
PID indicates the process-ID
TTY indicates the controlling terminal
TIME shows the cumulative CPU time that the process has used so far
CMD shows the command name
If the process were still running, you would get a line similar to:
PID TTY TIME CMD
21118 ? 00:00:04 ls
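The same check can be scripted, because ps -p exits with status 0 when the process exists and with a non-zero status when it does not. The sketch below assumes a ps that supports the -o option with the etime (elapsed wall-clock time) and time (CPU time) columns, as the common Linux and macOS versions do. sleep 2 again stands in for a real analysis.

```shell
# sleep 2 is a placeholder for a real analysis; $! is its process-ID.
sleep 2 &
pid=$!

# ps -p exits with status 0 if the process exists, non-zero otherwise.
if ps -p "$pid" > /dev/null; then
    before="running"
else
    before="finished"
fi

# Show elapsed wall-clock time (etime) next to the CPU time (time).
ps -p "$pid" -o pid,etime,time,comm

# Block until the background process ends, then check again.
wait "$pid"
if ps -p "$pid" > /dev/null; then
    after="running"
else
    after="finished"
fi

echo "before wait: $before, after wait: $after"
```

Note that wait only works for processes started from the current shell. For a process started in an earlier login, you would simply repeat the ps -p check from time to time.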
Cancel the process
The process-ID allows you to cancel the process before it finishes. This comes in handy when you realise that you started it with the wrong parameters or input files and you want to restart it with different settings. The kill command allows you to cancel a specific process.
kill 21118
This would cancel the process that we started before in the background.
If you can't remember the process-ID but want to cancel all ls processes, then you could use the pkill command in the following way:
pkill ls
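Unlike the kill command, the pkill command allows you to specify the command name instead of the process-ID of the running process that you want to cancel. Be aware that pkill treats the name as a pattern, so pkill ls also cancels any of your processes whose name merely contains ‘ls’; use pkill -x ls to match the name exactly.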
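To see what kill does to a process, the following sketch starts a placeholder job (sleep 60), cancels it, and inspects the exit status. It assumes a bash-like or POSIX sh shell, where a process ended by a signal is reported with an exit status of 128 plus the signal number.

```shell
# sleep 60 is a placeholder for an analysis started with wrong settings.
sleep 60 &
pid=$!

# Without further options, kill sends the TERM signal (signal number 15),
# which asks the process to terminate.
kill "$pid"

# Collect the exit status of the cancelled process.
# For a process ended by TERM this is 128 + 15 = 143 in bash and sh.
status=0
wait "$pid" || status=$?
echo "exit status: $status"
```

If a process does not react to the default signal, kill -9 followed by the process-ID forces it to stop, but the process then gets no chance to clean up temporary files.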
Redirecting output
By default, the nohup command redirects all information from the terminal window to the nohup.out file. If the file exists already, it will not be overwritten; all new information is appended to the end of the file. With the > operator, you can redirect the output to a different file. Unlike nohup.out, a file written with > is overwritten if it already exists. For example, to redirect the output of the ls command to the file Directory-Listing.txt, I use the command
nohup ls -lhcrt > Directory-Listing.txt &
So, the redirection operator (>) is followed by the name of the target file and precedes the closing & operator of the nohup command. If you want to save the output to a file in a different directory, just specify the entire file path that precedes your target file, like:
nohup ls -lhcrt > /home/alj/Documents/DirectoryListing.txt &
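Many tools print their results and their warnings on two separate channels (standard output and standard error). The sketch below shows one common way to collect both in a single log file with 2>&1, and to append to that file in a later run with >> instead of overwriting it with >. The small sh -c commands and the file name analysis.log are hypothetical stand-ins for a real analysis.

```shell
# Create a scratch directory for the log file (hypothetical name).
logdir=$(mktemp -d)
log="$logdir/analysis.log"

# First run: > creates (or overwrites) the log file, and 2>&1 sends
# warnings and errors to the same file as the regular output.
# < /dev/null detaches the command from the keyboard input.
nohup sh -c 'echo "result line"; echo "warning line" >&2' > "$log" 2>&1 < /dev/null &
wait $!

# Second run: >> appends to the existing log file instead of replacing it.
nohup sh -c 'echo "second run"' >> "$log" 2>&1 < /dev/null &
wait $!

# The log now holds the output of both runs.
cat "$log"
```

The wait $! lines are only there so that the sketch finishes before the log is printed; in real use you would log off and look at the file later.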
screen
The screen tool provides another way (besides nohup) to keep a process running on a remote server when you log off, or to run different processes in parallel.
You can imagine the screen tool as a command-line way to open different terminals as ‘sessions’ in parallel and to run different processes in each of them. To start or open a new session, you can use:
screen -S Testscreen
The option -S allows you to give a name (here Testscreen) to the session. Once you hit ENTER, you will be faced with a new (clean) terminal window. This is your Testscreen session. You can execute any commands in it and, while a process is running, you can detach from the session by first pressing ‘CTRL+A’ on your keyboard and then hitting the letter ‘d’ (for ‘detach’). I then get the message
[detached from 970.Testscreen]
This means that I detached from the Testscreen session that has the process-ID 970. The processes in this session, however, still continue to run. You can log off from the remote server and get back to the session later.
To get an overview of all sessions that are running in parallel, use:
screen -ls
In my case, I get:
There are screens on:
    970.Testscreen (14. feb. 2015 kl. 19.53 +0100) (Detached)
    31995.pts-9.alj-Inspiron-5537 (14. feb. 2015 kl. 19.47 +0100) (Detached)
2 Sockets in /var/run/screen/S-alj.
You see that I have two sessions running; both are detached. To re-attach to our Testscreen session, just enter:
screen -r Testscreen
The option -r (for re-attach) is followed by the name of the session that you would like to re-attach to.
When I now check the running sessions, I get:
screen -ls
There are screens on:
    970.Testscreen (14. feb. 2015 kl. 19.53 +0100) (Attached)
    31995.pts-9.alj-Inspiron-5537 (14. feb. 2015 kl. 19.47 +0100) (Detached)
2 Sockets in /var/run/screen/S-alj.
To stop a session, you have two options. Either attach to the session and enter exit in the terminal window (this closes the shell and, with it, the session), or use the kill command with the process-ID of the session. To stop the Testscreen session, for example, I would use
kill 970
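If you prefer not to look up the process-ID, screen can also be controlled by session name from outside the session. The following is a rough sketch, assuming that screen is installed and allowed to create sessions on your machine: -dmS starts a named session that is detached from the beginning, and -X quit sends the quit command to it. The session name Testscreen-demo and the sleep 60 job are placeholders.

```shell
# Hypothetical session name for this sketch.
name="Testscreen-demo"

if command -v screen > /dev/null 2>&1; then
    # -d -m starts the session already detached; -S gives it a name.
    # sleep 60 is a placeholder for a real analysis command.
    screen -dmS "$name" sleep 60

    # Give screen a moment to set up the session.
    sleep 1

    # List the sessions; the new one should show up as (Detached).
    # Some screen versions return a non-zero status here, hence "|| true".
    screen -ls || true

    # Send the quit command to the named session, which ends the
    # session together with the processes running in it.
    screen -S "$name" -X quit
    sleep 1
else
    echo "screen is not installed on this machine"
fi
```

Starting a session with -dmS is also convenient in scripts, because no interactive terminal window has to be opened and detached by hand.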
When using the screen tool, be aware that, unlike with the nohup tool, all results are printed to the session's terminal - not to a file.
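If you want the results in a file anyway, one option is to put the redirection inside the command that the session runs. The sketch below is one possible way to do this, assuming that screen is installed and can create sessions. The session name Listing and the output file name are hypothetical.

```shell
# Hypothetical output file for this sketch.
outfile="${TMPDIR:-/tmp}/Directory-Listing-screen.txt"
rm -f "$outfile"
started=no

# The redirection sits inside the quoted sh -c command, so it applies
# to ls running in the session and not to the screen command itself.
if command -v screen > /dev/null 2>&1 &&
   screen -dmS Listing sh -c "ls -lhcrt > '$outfile' 2>&1"; then
    started=yes

    # The session closes by itself when ls has finished.
    # Wait up to 10 seconds for the output file to appear.
    i=0
    while [ ! -s "$outfile" ] && [ "$i" -lt 10 ]; do
        sleep 1
        i=$((i + 1))
    done
    cat "$outfile"
else
    echo "screen is not available on this machine"
fi
```

Inside an interactive session, the same effect is achieved by simply typing the command with its > redirection, exactly as in the nohup examples.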