SBU cluster: SARMAD

Electronic structure calculations are computationally highly demanding and a good example of calculations that are routinely performed on supercomputers. When a problem requires too much processing power and/or memory for a single machine to handle the huge amount of data, parallel computing (or high performance computing, HPC) is required.

A computer cluster is a set of nodes (each node contains several cores) linked together to do parallel computations.

You can perform your (heavy) calculations on the SBU cluster: SARMAD.

Note

The operating system of SARMAD is CentOS, which is a Linux distribution. You need to know Linux to work on SARMAD. A basic tutorial is available on Picking the Right Tools.

The following aims to help you start working on SARMAD from your Linux-based local machine. If you are a Windows user, you may use a terminal emulator like PuTTY. Useful instructions (for PuTTY and WinSCP) are available on the SARMAD documentation webpage.

Connecting to SARMAD

To connect to a remote computer one may use the secure shell (SSH) service. Open a Linux terminal and type in the command:

ssh username@hostname

Use 192.168.220.100 as the hostname of SARMAD. If you do not have your own username, use ali_sadeghi. Type at the shell command line:

ssh ali_sadeghi@192.168.220.100

You will need the password (which is given to you in the class).
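
If you connect frequently, you can optionally define a shortcut on your local machine in the OpenSSH configuration file ~/.ssh/config; the alias sarmad used below is just an example name, not something defined by the cluster:

    # ~/.ssh/config on your local machine (create the file if it does not exist)
    Host sarmad
        HostName 192.168.220.100
        User ali_sadeghi

Afterwards, ssh sarmad is equivalent to the full command above.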

Once you have successfully logged in, you will see a welcome message:

Profile built 10:44 20-May-2014
Kickstarted 16:44 20-May-2014
###############################################################
#                Welcome to SBU Cluster                       #
#  Disconnect IMMEDIATELY if you are not an authorized user!  #
###############################################################

From now on, you are working on the remote machine.

To log out and return to your local machine, use the exit command:

cluster:~ >exit

Let us start by reading the information provided by the system administrator in the file GettingStarted in the home directory:

cluster:~ >ls GettingStarted
GettingStarted
cluster:~ >less GettingStarted

    ----------------------
    Getting started:
    ----------------------
    We have provided samples for the scientific applications which are supported on our cluster.
    In you home directory, you'll find a directory per application containing the sample.
    Follow the README of the application and submit your job.

   .
   .
   .

    Remainder of this document is organized as follows:
    * Using Job Scheduler (TORQUE)
    * Writing your own TORQUE script
    * Modifying your text files
    * Cluster's load and queue time
    * Contact

In particular, note that you are asked to put your computational jobs in the queue rather than run them on the master node. It is the master node’s task to distribute the jobs over the computing nodes. See Running a job on the cluster.

To return to the shell prompt, just press the q key (which quits less).

Note

You can connect to SARMAD only if your IP belongs to the university network.

From outside the university, first turn on the SBU VPN service on your machine and then connect to SARMAD.

File Transfer

To transfer files between local and remote machines, the scp command is used.

  • To copy lfile from your local machine to the home directory of SARMAD:

    scp lfile ali_sadeghi@192.168.220.100:.
    
  • To copy rfile from the home directory of SARMAD to your local machine:

    scp ali_sadeghi@192.168.220.100:~/rfile .
    
  • To copy the whole directory rdir from the home directory of SARMAD to your local machine (note the -r switch):

    scp -r ali_sadeghi@192.168.220.100:~/rdir .
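
The same syntax works in the opposite direction. For instance, to copy a whole local directory ldir (a hypothetical name) into your home directory on SARMAD:

    scp -r ldir ali_sadeghi@192.168.220.100:~/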

SARMAD specifications

SARMAD has, in addition to the master node, ten computing nodes. Each computing node consists of 64 computing cores (AMD, 2.4 GHz) and has 128 GB of memory in total. Two additional 24-core nodes are equipped with Tesla K20X GPU cards for GPU computing.

These specifications can be obtained with the command pbsnodes. For instance:

cluster:~ >pbsnodes |  grep properties  | nl
     1      properties = cpu
     2      properties = cpu
     3      properties = cpu
     4      properties = cpu
     5      properties = cpu
     6      properties = cpu
     7      properties = cpu
     8      properties = cpu
     9      properties = cpu
    10      properties = cpu
    11      properties = gpuen
    12      properties = gpuen
cluster:~ >

which shows that there are 12 nodes, two of which are GPU-enabled.

If we grep for np (the number of processors per node), we get:

cluster:~ >pbsnodes |  grep np  | nl
     1       np = 64
     2       np = 64
     3       np = 64
     4       np = 64
     5       np = 64
     6       np = 64
     7       np = 64
     8       np = 64
     9       np = 64
    10       np = 64
    11       np = 12
    12       np = 24
    13       np = 24
cluster:~ >

showing that

  • Each of the first ten nodes has 64 cores.
  • Node number 11 has 12 cores. Note that it is the master node and should not be used for calculations!
  • The last two are those equipped with GPU and each has 24 cores.
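
In the same spirit, grepping the pbsnodes output for state shows whether each node is currently free, busy, or down; the exact state strings (e.g. free or job-exclusive) are those reported by TORQUE:

    pbsnodes | grep state | nl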


Making your own workspace

After you have logged in to SARMAD, go to the directory students and make a directory for yourself. This is a simple task:

cluster:~ >ls
GettingStarted  admin_samples  students
cluster:~ >cd students
cluster:~/students >mkdir yourname
cluster:~/students >cd yourname
cluster:~/students/yourname >

(replace yourname with your own name!)

Warning

Please work only in your directory. Never remove other files or directories.

Running a job on the cluster

When you are connected to a cluster (e.g. the SBU cluster), you interact with the master node. The master node responds to the users’ commands, but it is not supposed to become busy with calculations. Instead, the master node distributes the computational tasks among the available computing nodes using a job scheduler (queuing system). The job scheduler that SARMAD uses is called TORQUE, which is a version of PBS (Portable Batch System). To assign tasks to nodes efficiently, SARMAD needs to know how many cores your job needs and how long it will take. This information is given by the users when submitting their jobs via job scripts. Here you learn how to write a job script and how to submit and monitor your jobs on SARMAD.

Job scripts

A job script consists of two main parts: it starts with a set of #PBS commands, followed by the execution command(s). The lines starting with #PBS determine the properties of your job on the cluster. The execution command is usually an mpirun command that runs the compiled program using MPI. This is clarified in the following example.

An example script

A job script looks like a normal shell script, and thus we name it submit.sh. Here is an example of such a script for running PWSCF (see PWSCF) calculations on SARMAD. (download)

#PBS -l nodes=1:ppn=8 
#PBS -l walltime=01:00:00

cd $PBS_O_WORKDIR

bin=/share/Application/espresso-5.0.2/bin/pw.x # the executable i.e. pw.x
in=co.in                                       # input file for pw.x
out=co.out                                     # output file by pw.x

mpirun -n 8 $bin   < $in > $out

Description:

  • #PBS -l nodes=1:ppn=8 : reserve 1 node for this job. Use 8 processors on each node.
  • #PBS -l walltime=01:00:00 : the expected maximum runtime is 1 hour (the job will be killed if it has not finished within that time)
  • cd $PBS_O_WORKDIR : change the working directory (which is the home directory ~ by default) to the directory from which you submitted the job, where your input/output files are.
  • mpirun -n 8 $bin < $in > $out : execute the binary $bin using 8 MPI processes. If the executable reads input and/or writes output (as the PWSCF code does), simply use < and > followed by the input and output file names.

The above script is a minimal one, and more options can be added to it. Use it as a template and modify the relevant parts for your own jobs.
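
For example, a few commonly used PBS directives that may be added near the top of the script (the job name and file name below are arbitrary examples):

#PBS -N co_relax     # a short name for the job, shown by qstat
#PBS -j oe           # merge standard output and standard error into a single file
#PBS -o co.log       # write that file to co.log instead of the default <jobname>.o<jobid>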

Note

At the moment, the maximum number of nodes that one can use at a time is two. If you have already reserved two nodes, your further jobs have to wait in the queue.

Submitting a job script

Although the script looks like a shell script, the #PBS lines start with the # sign and are therefore treated as comments and ignored by the shell. Instead, one should submit the script using the PBS command qsub. To do that, type in the shell command line (just like a shell command):

qsub submit.sh

There are also commands to stop jobs or monitor their state. The basic commands are listed and described below. (Type them in the shell command line.)

command    description                         example          effect
qsub       submit a job                        qsub submit.sh   submits the script submit.sh
qstat      show the status of submitted jobs   qstat            shows the jobs' status and their assigned IDs
qdel       delete a submitted job              qdel 1031        deletes the job whose ID is 1031

Note

Another useful command is showq which shows queue status of all jobs.
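
When many users are active, it is convenient to list only your own jobs; qstat accepts a -u switch for that (replace the username with your own):

qstat -u ali_sadeghi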

MPI example

Consider the following very simple MPI program: (download)

program hello
  implicit none
  include 'mpif.h'

  integer :: ierror, ntask, itask

  call MPI_INIT(ierror)                              ! initialize the MPI environment
  call MPI_COMM_SIZE(MPI_COMM_WORLD, ntask, ierror)  ! total number of MPI processes
  call MPI_COMM_RANK(MPI_COMM_WORLD, itask, ierror)  ! rank (ID) of this process
  print*, 'My ID is: ', itask, '     Hello world'
  if(itask == 0) print*, 'Number of tasks:', ntask   ! only the master prints the total
  call MPI_FINALIZE(ierror)
end program hello

This program prints the ID (rank) of every MPI process. The total number of processes is printed only by the master (ID = 0). Assuming the file is named hello.f90, first compile it (on SARMAD) with:

mpif90 hello.f90
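
By default the compiler wrapper produces an executable named a.out, which is what the runs below use. If you prefer a more descriptive name, the usual -o option works as well:

mpif90 -o hello.x hello.f90

(in that case, set bin=./hello.x in the job script below).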

* serial run: if you simply run it serially, the result would be:

   cluster:~ >./a.out
    My ID is:            0      Hello world
    Number of tasks:           1

* parallel run: now run it in parallel. Make a job script submit.sh as follows. (Note that our simple program needs no input file and therefore no < in redirection.)

#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:05:00

cd $PBS_O_WORKDIR

bin=./a.out  # the executable
out=out      # output file

mpirun -n 8 $bin > $out

Submit this job, and look at the output file out after the run has finished. As mentioned before, use qstat to monitor the job status. Since it is a simple task, the job will finish very quickly (in about 1 second, as seen below).

cluster:~ >qsub submit.sh
10813.cluster.sbu.ac.ir
cluster:~ >qstat 10813
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
10813.cluster             submit.sh        ali_sadeghi     00:00:01 C batch
cluster:~ >cat out
My ID is:            3      Hello world
My ID is:            4      Hello world
My ID is:            7      Hello world
My ID is:            0      Hello world
Number of tasks:           8
My ID is:            1      Hello world
My ID is:            2      Hello world
My ID is:            6      Hello world
My ID is:            5      Hello world

Tip

On SARMAD, one cannot use more than one node per job. In other words, nodes=1 is the only possibility and e.g.

#PBS -l nodes=2:ppn=80

would not be accepted.

Therefore, the current configuration of SARMAD is better suited to shared-memory parallelization (such as OpenMP) than to MPI runs spanning several nodes.
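
As a minimal sketch (not an official SARMAD sample), a single-node job script for an OpenMP-parallelized executable, here assumed to be called omp_hello.x, could look like this:

#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:05:00

cd $PBS_O_WORKDIR

export OMP_NUM_THREADS=8   # as many OpenMP threads as cores reserved via ppn above

./omp_hello.x > out        # no mpirun needed: the threads share the memory of a single node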