2. General Information¶
2.1. How to log into Svante¶
To log into the Svante cluster, from a terminal window on your local computer, type:
ssh -Y «username»@svante-login.mit.edu
A password will be requested; use your Athena password (your Svante «username» above is the same as your Athena «username»). After you log in, you should receive a terminal prompt with your working directory set to /home/«username»; you will be logged into the Svante cluster login (or “head”) node svante-login. See Section 6 for discussion of proper uses of the head node, file server nodes, and compute nodes.
Note
If you want to access any of the HDR nodes or our GPU node (i.e., see Using SLURM to Submit Jobs), you must use a different head node: svante.mit.edu. This node can be accessed from within Svante by typing ssh svante, or from outside Svante via ssh -Y «username»@svante.mit.edu. At this time, use of HDR compute nodes is restricted to approved users; please contact the Executive Director with details of your planned usage. Also note that the HDR and GPU nodes use a different dialect of Linux than the EDR/FDR nodes (Red Hat 8 vs. Fedora Core 24, respectively), which typically will require recompilation using different modules.
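For reference, a minimal sketch of both login paths described above (using the «username» placeholder; which head node you need depends on the nodes you plan to use):

# standard (FDR/EDR) head node, from your local machine
ssh -Y «username»@svante-login.mit.edu

# HDR/GPU head node, from outside Svante
ssh -Y «username»@svante.mit.edu

# or hop to the HDR/GPU head node from within the cluster
ssh svante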
2.2. /home spaces¶
/home/«username»/: we have about 100 terabytes (TB) of total ‘home space’ for general-purpose usage: source code, plots and figures, model builds, etc. Every Svante user is given disk space here; the quota on home space is 500 gigabytes (GB) per user. /home is backed up daily (offsite) and protected from disk failure via a RAID array. Svante /home space is mounted on all nodes in the cluster. Home space is not intended as a repository for large (or large numbers of) data files for analyses, or to be used as disk space for large model runs.
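To check how much of the 500 GB quota you are currently using, a generic sketch with standard tools (not a Svante-specific quota utility) is:

du -sh /home/«username»        # total size of your home space
du -sh /home/«username»/*      # per-subdirectory breakdown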
2.2.1. Public space subdirectory¶
In each user’s /home space we have created a subdirectory public_html which can be used to share files with outside collaborators through a URL: https://svante.mit.edu/~«username» will pull up a web browser page listing all files in this subdirectory. For outside users to be able to see (or download) files, you do need to make sure any files in this subdirectory have open read permissions (you can change this via chmod -R 755 «filename»). For small files, it is easy enough to make a copy into this subdirectory, but given the /home space quota, this is not practical for large files or data sets. Instead, in public_html make a symbolic link to a file or directory (e.g., ln -s «file to make public» /home/«username»/public_html/). Note that if you are symlinking a file or directory, for example from a file server, these files or directories require open read permissions as well.
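For example, to publish a large directory that lives on a file server without copying it into /home (the fs02 path here is purely illustrative; use wherever your group’s space resides):

# link the data into your web-visible directory
ln -s /net/fs02/d0/«username»/«run_output» /home/«username»/public_html/

# ensure the target directory and its contents are world-readable
chmod -R 755 /net/fs02/d0/«username»/«run_output»

# collaborators can then browse to
#   https://svante.mit.edu/~«username»/«run_output»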
2.3. File servers¶
Svante file servers are named fsxx (see Table 2.1), currently fs01-fs12, with a total capacity of roughly 5 PB. To get a shell on a file server node, for example fs02, type ssh fs02 (from svante-login or any other node in the Svante cluster). Once you ssh to a file server, local disks should be accessed as /d0, /d1, /d2, /d3, and /d4; note that not all partitions are present on all file servers. From all other nodes, this space can be reached through ‘remote’ mounts, in which your access path would be, for example, /net/fs02/d0 to reach the /d0 partition on fs02. Storage on these file servers is for runs, experiments, downloaded data, etc. that will be accessed and kept for longer-term periods, and these spaces are backed up (weekly offsite backup, although in periods of heavy use, it may take longer for backups to complete). These machines can also be accessed externally (i.e., from outside Svante): fs02 is svante2.mit.edu, fs03 is svante3.mit.edu, etc. (see Table 2.1). User directories of various sizes, organized by research group, will be created and allocated by the Executive Director (Jeff) on a project-based, as-needed basis. These disk spaces are reserved for research-related work; they are not intended as “personal” storage spaces, and any such use will not be tolerated.
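A short sketch of the two access patterns described above (fs02 and /d0 are just examples; substitute the server and partition holding your group’s space):

# interactive shell on a file server (from svante-login or any Svante node)
ssh fs02
ls /d0/«username»

# the same partition reached from any other node via its remote mount
ls /net/fs02/d0/«username»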
2.4. Compute nodes¶
Svante includes a large pool of compute nodes, which comprise the main computational engine of the cluster, interconnected through a fast InfiniBand network. Users cannot directly ssh to log into compute nodes, but instead must go through the SLURM scheduler; see Section 4 (the exception is that if a user has a running job on a compute node, ssh to that specific node is permitted). Users can also access local disk space (/scratch) on any given compute node, as limited by the size of its disk (see Table 2.1). Every user has unlimited storage in these spaces, the caveat being that there is no safeguard against you or another user filling up a local compute node disk. These spaces are not backed up, and any files left longer than six months will be deleted without notice. Local scratch spaces can also be accessed (say, from svante-login or a file server node) through remote mounts /net/«cxxx»/scratch, where «cxxx» denotes the compute node name (see Table 2.1), although not all remote scratch disks may be available to other compute nodes (i.e., on a compute node, use the local scratch disk, not another compute node’s scratch space via a /net mount).
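For instance, to pull results off a compute node’s local scratch disk after a job finishes (the node name c061 and the destination path are placeholders):

# from svante-login or a file server, via the remote mount
cp -r /net/c061/scratch/«username»/«run_output» /net/fs02/d0/«username»/

# from within a job running on that node, use the local path instead
ls /scratch/«username»/«run_output»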
2.5. Archive space¶
Server fs01 contains archive space /data, which holds downloaded external data, e.g. reanalysis data, CMIP data, etc. Let us know if you require external data sets to be stored on Svante; if we think such data might benefit general users, we would be happy to store them here. Note that Svante users do not have individual spaces on fs01, nor can users ssh to fs01 to use this server for computational analysis purposes.
A second, separate data space /archive (currently on server fs09) holds file server spaces of users who are no longer actively working at MIT; these spaces may be compressed or uncompressed, depending on both the data type and the likelihood that access will be required. Our general policy is to maintain users’ data in perpetuity.
2.6. Running background screen sessions¶
Most typically, users log into Svante, submit or monitor compute node jobs, edit files, perform analyses or generate plots, and/or perform other interactive tasks, all while using the terminal window command line. At the end of the work day, users will typically log off and close the terminal window, particularly on a laptop, where putting it to sleep would break the network connection. However, what if you are performing some time-consuming task, and closing your laptop would effectively kill it mid-progress? There are a few possible solutions.

The most obvious approach would be to write a SLURM script (see Section 4) to do this task using a compute node (rather than executing it while logged into a file server or the login node).

If using a SLURM script is not easily workable, you can run in the background on the terminal command line by adding a & to the end of the command. Even if you close the terminal window or log off, the command will continue to run. This is particularly helpful when copying large directories from one file server to another, for example, where the completion time might be on the order of several hours.

Finally, you might encounter a case where you need to run something time-consuming in the terminal window and monitor its output (and/or get frustrated if you are using a shaky network connection, prone to disconnects). There is a nifty unix command to help you out by allowing you to “detach” a terminal window session that persists even if you log off. To create the screen session, from the command line type screen -e ^Tt -S «myscreenname»; this will put you in a fresh terminal session. To detach the session, type ctrl-t d (ctrl = the control key); it is now safely tucked away in the background, continuing to run. To list any running screen sessions, from another terminal window (logged into the node where the screen is running), type screen -ls. To reattach to a screen session, type screen -x «myscreenname» (again, while logged into the same node). To kill a screen session, from within the screen session, type exit on the command line. A short sketch of this workflow follows.
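For example, to let a long file-server-to-file-server copy survive a closed laptop (the session name and paths are placeholders):

# start a detachable session on the node where the work will run
screen -e ^Tt -S «myscreenname»

# inside the screen session, launch the long-running copy in the background
cp -r /net/fs02/d0/«username»/«big_dir» /net/fs03/d0/«username»/ &

# detach with ctrl-t d, then log off; later, from the same node:
screen -ls                    # list running screen sessions
screen -x «myscreenname»      # reattach
exit                          # (from within the session) end it when done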
2.7. Svante node specs¶
Table 2.1: Svante node specifications

| Node(s) | External Name | # Cores | RAM (GB) | CPU Arch. | CPU | Speed (GHz) | IB | Partition | Size (TB) | Usage |
|---|---|---|---|---|---|---|---|---|---|---|
| svante-login | svante-login | 16 | 64 | broadwell | E5-2609 v4 | 1.70 | EDR | | | FDR/EDR login node |
| svante | svante | 16 | 64 | cascade lake | | 2.10 | EDR | | | HDR login node |
| c001 - c036 | | 48 | 256 | xeon scalable (3rd gen) | Gold 6336Y | 2.40 | HDR | /scratch | 4 | compute nodes |
| c041 - c060 | | 16 | 64 | sandy bridge | E5-2670 | 2.60 | FDR | /scratch | 2 | compute nodes |
| abba | | 24 | 128 | haswell | E5-2680 v3 | 2.50 | FDR | /scratch | 4 | OOD-only nodes |
| c061 - c096 | | 32 | 128 | broadwell | E5-2697A v4 | 2.60 | EDR | /scratch | 4 | compute nodes |
| c097 - c136 | | 48 | 256 | xeon scalable (4th gen) | Gold 5418Y | 2.00 | HDR200 | /scratch | 4 | compute nodes |
| g001 | | 28 | 128 | broadwell | E5-2680 v4 | 2.40 | GPU | /scratch | | gpu node |
| fs01 | | 20 | 512 | xeon scalable | Silver 4210 | 2.20 | EDR | /data | 570 | external datasets |
| fs02 | svante2 | 16 | 96 | sandy bridge | E5-2660 | 2.20 | FDR | /d0 | 80 | Marshall Group |
| fs03 | svante3 | 24 | 512 | xeon scalable | Silver 4214R | 2.40 | EDR | /d0 | 250 | Selin Group |
| | | | | | | | | /d1 | 120 | Selin Group |
| fs04 | svante4 | 16 | 128 | broadwell | E5-2620 v4 | 2.10 | EDR | /d0 | 100 | Land Group |
| | | | | | | | | /d1 | 100 | Land Group |
| | | | | | | | | /d2 | 255 | CS3 EPPA Group |
| fs05 | svante5 | 4 | 24 | nehalem | W3530 | 2.80 | FDR | /d0 | 40 | Land Group |
| | | | | | | | | /d1 | 120 | Land Group |
| fs06 | svante6 | 32 | 512 | xeon scalable | Silver 4314 | 2.40 | EDR | /d0 | 648 | Kang Group |
| | | | | | | | | /d1 | 648 | Heald Group |
| | | | | | | | | /d2 | 648 | CGCS/Prinn Group |
| | | | | | | | | /d3 | 648 | BC3 Group |
| | | | | | | | | /d4 | 648 | Solomon Group |
| fs07 | svante7 | 16 | 96 | sandy bridge | E5-2660 | 2.20 | FDR | /d0 | 80 | Land Group |
| | | | | | | | | /d1 | 70 | Land Group |
| fs08 | svante8 | 16 | 128 | ivy bridge | E5-2640 v2 | 2.00 | FDR | /d0 | 150 | Marshall Group |
| | | | | | | | | /d1 | 65 | Marshall Group |
| | | | | | | | | /d2 | 70 | Marshall Group |
| fs09 | svante9 | 32 | 512 | xeon scalable | Silver 4314 | 2.40 | EDR | /d0 | 1050 | Fiore Group |
| | | | | | | | | /archive | 1050 | old user archive |
| fs10 | svante10 | 16 | 128 | ivy bridge | E5-2650 v2 | 2.60 | FDR | /d0 | 110 | Marshall Group |
| | | | | | | | | /d1 | 110 | CS3 Climate Group |
| | | | | | | | | /d2 | 110 | CS3 Climate Group |
| fs11 | svante11 | 24 | 512 | broadwell | E5-2650 v4 | 2.20 | EDR | /d0 | 150 | Selin Group |
| | | | | | | | | /d1 | 150 | Selin Group |
| fs12 | svante12 | 32 | 1024 | xeon scalable | Silver 4314 | 2.40 | EDR | /d0 | 375 | CS3 Climate Group |
| | | | | | | | | /d1 | 375 | LAE Group |
| | | | | | | | | /d2 | 375 | Land Group |
| geoschem | geoschem | 12 | 32 | ivy bridge | E5-2620 v2 | 2.10 | FDR | /data | 166 | geoschem data |
Notes:

- At present, all Svante nodes’ CPUs use Intel architecture.
- Svante’s operating system is Linux, and by default terminal windows use the Bash shell, although we maintain limited legacy support of C shell.
- Nodes with external names can be accessed from outside the Svante cluster, e.g. ssh -Y «username»@svante2.mit.edu would log you into fs02.
- From inside Svante, it is possible to ssh to compute nodes as well as file servers (as discussed above), but for compute nodes only if the user has a SLURM job running concurrently on that specific node.
- At present, the four abba nodes are only available for Svante Open OnDemand (OOD) usage (see Section 5.2), through https://svante-ood.mit.edu.
- Although the clock speeds are fairly similar across all nodes, in terms of general speed, nehalem < sandy bridge < ivy bridge < haswell < broadwell < xeon scalable. Your job might run 40% faster on c070 than on c045, for example. There have been improvements in CPU efficiency and RAM speed over the years, which translate to faster calculations and typically will improve your job’s efficiency.
- InfiniBand (IB) is a super-fast network connection, running in parallel with the standard ethernet connection. Note, however, that although the FDR, EDR and HDR switches are interconnected, MPI jobs cannot span FDR and EDR, EDR and HDR, or FDR and HDR compute nodes. As presently configured, jobs also cannot span the HDR and HDR200 partitions. The interconnecting network is twice as fast for the HDR200 nodes, which use a newer, improved CPU (albeit running at a slightly lower clock speed), but otherwise HDR200 nodes will run similarly compiled code as HDR and give identical results.
- To gain access to our GPU node, in the SLURM (#SBATCH/srun) request you must specify the partition, -p gpu, AND also how many GPUs you require: --gres=gpu:1 would ask for a single GPU (the GPU node g001 contains four GPU chips in total). Note: the GPU node only supports up to CUDA 11.4.
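As an illustration, a minimal #SBATCH sketch for the GPU node (job size and time limit are placeholders; only -p gpu and --gres=gpu:1 come from the note above):

#!/bin/bash
#SBATCH -p gpu                # GPU partition, as required above
#SBATCH --gres=gpu:1          # request one of g001's four GPUs
#SBATCH -n 1                  # single task (placeholder)
#SBATCH -t 01:00:00           # time limit (placeholder)

nvidia-smi                    # confirm the allocated GPU is visible
# ...followed by your GPU-enabled program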