2. General Information

2.1. How to log into Svante

To log into the Svante cluster, from a terminal window on your local computer, type:

ssh -Y «username»@svante-login.mit.edu

A password will be requested; use your Athena password (your Svante «username» above is the same as your Athena «username»). After you log in, you should receive a terminal prompt with your working directory set to /home/«username»; you will be logged into the Svante cluster login (or “head”) node svante-login. See Section 6 for discussion of proper uses of the head node, file server nodes, and compute nodes.

Note

If you want to access any of the HDR nodes or our GPU node (i.e., see Using SLURM to Submit Jobs), you must use a different head node: svante.mit.edu. This node can be accessed from within Svante by typing ssh svante, or from outside Svante via ssh -Y «username»@svante.mit.edu. At this time, use of HDR compute nodes is restricted to approved users; please contact the Executive Director with details of your planned usage. Also note that the HDR and GPU nodes run a different Linux distribution than the EDR/FDR nodes (Red Hat 8 vs. Fedora Core 24, respectively), which typically requires recompilation using different modules.

2.2. /home spaces

/home/«username»/: we have about 100 terabytes (TB) of total ‘home space’ for general-purpose usage: source code, plots and figures, model builds, etc. Every Svante user is given disk space here; the quota on home space is 500 gigabytes (GB) per user. /home is backed up daily (offsite) and protected from disk failure via a RAID array. Svante /home space is mounted on all nodes in the cluster. Home space is not intended as a repository for large (or large numbers of) data files for analyses, or as disk space for large model runs.
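
To check how much of your home quota you are currently using, standard Linux tools suffice (this is generic Linux, not Svante-specific); for example:

  du -sh /home/«username»    # total size of everything under your home directory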

2.2.1. Public space subdirectory

In each user’s /home space we have created a subdirectory public_html which can be used to share files with outside collaborators through a URL: https://svante.mit.edu/~«username» will pull up a web browser page listing all files in this subdirectory. For outside users to be able to see (or download) files, you need to make sure any files in this subdirectory have open read permissions (you can change this via chmod -R 755 «filename»). For small files, it is easy enough to make a copy into this subdirectory, but given the /home space quota, this is not practical for large files or data sets. Instead, in public_html make a symbolic link to a file or directory (e.g., ln -s «file to make public» /home/«username»/public_html/). Note that if you are symlinking a file or directory, for example from a file server, these files or directories require open read permissions as well.
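
For example, to share a large file that lives on a file server without copying it into your home space (the path and file name below are hypothetical placeholders):

  ln -s /net/fs02/d0/«username»/results.nc /home/«username»/public_html/
  chmod 755 /net/fs02/d0/«username»/results.nc    # make sure the linked file has open read permissions
  # the file is then available at https://svante.mit.edu/~«username»/results.nc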

2.3. File servers

Svante file servers are named fsxx (see Table 2.1), currently fs01-fs12, with a total capacity of roughly 5 PB. To get onto a file server node, for example, typing

ssh fs02

(from svante-login or any other node in the Svante cluster) will give you a shell on file server fs02. Once you ssh to a file server, local disks should be accessed as /d0, /d1, /d2, /d3, and /d4; note that not all partitions are present on every file server. From all other nodes, this space can be reached through ‘remote’ mounts, in which your access path would be, for example, /net/fs02/d0 to reach the /d0 partition on fs02. Storage on these file servers is for runs, experiments, downloaded data, etc. that will be accessed and kept for longer-term periods, and these spaces are backed up (weekly offsite backup, although in periods of heavy use it may take longer for backups to complete). These machines can also be accessed externally (i.e., from outside Svante): fs02 is svante2.mit.edu, fs03 is svante3.mit.edu, etc. (see Table 2.1). User directories of various sizes, organized by research group, will be created and allocated by the Executive Director (Jeff) on a project-based, as-needed basis. These disk spaces are reserved for research-related work; they are not intended as "personal" storage space, and any such use will not be tolerated.
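
For example, the same directory can be reached locally on the file server or through the remote mount from any other node (paths below are illustrative):

  ssh fs02                       # from svante-login or any other Svante node
  ls /d0/«username»              # local path while logged into fs02
  exit
  ls /net/fs02/d0/«username»     # same directory via the remote mount, from any other node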

2.4. Compute nodes

Svante includes a large pool of compute nodes, which form the main computational engine of the cluster and are interconnected through a fast InfiniBand network. Users cannot ssh directly into compute nodes, but instead must go through the SLURM scheduler; see Section 4. (The exception is that if you have a job running on a compute node, ssh to that specific node is permitted.)
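
For example, assuming you already have a job running (squeue is the standard SLURM command for listing jobs; the node name below is hypothetical):

  squeue -u «username»    # the NODELIST column shows where your job is running
  ssh c065                # permitted only while your job is running on that node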

Users can also access local disk space (/scratch) on any given compute node, limited only by the size of its disk (see Table 2.1). Every user has unlimited storage in these spaces, the caveat being that there is no safeguard against you or another user filling up a local compute node disk. These spaces are not backed up, and any files left longer than six months will be deleted without notice. Local scratch spaces can also be accessed (say, from svante-login or a file server node) through remote mounts /net/«cxxx»/scratch, where «cxxx» denotes the compute node name (see Table 2.1), although not all remote scratch disks may be available to other compute nodes (i.e., on a compute node, use the local scratch disk, not another compute node’s scratch space via a /net mount).
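
For example, to retrieve output left in a compute node’s scratch space from svante-login or a file server after a run (node name and paths are placeholders):

  cp -r /net/c065/scratch/«username»/run01 /net/fs02/d0/«username»/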

2.5. Archive space

Server fs01 contains archive space /data, which holds downloaded external data, e.g., reanalysis data, CMIP data, etc. Let us know if you require external data sets to be stored on Svante; if we think such data might benefit general users, we would be happy to store them here. Note that Svante users do not have individual spaces on fs01, nor can users ssh to fs01 to use this server for computational analysis purposes. A second, separate data space, /archive (currently on server fs09), holds file server spaces of users who are no longer actively working at MIT; these spaces may be compressed or uncompressed, depending on the data format and the likelihood that access will be required. Our general policy is to maintain users’ data in perpetuity.

2.6. Running background screen sessions

Most typically, users log into Svante, submit or monitor compute node jobs, edit files, perform analyses, generate plots, and carry out other interactive tasks, all from the terminal window command line. At the end of the work day, users typically log off and close the terminal window, particularly on a laptop, where putting it to sleep would break the network connection. But what if you are performing some time-consuming task, and closing your laptop would effectively kill it mid-progress? There are a few possible solutions:

  • The most obvious approach is to write a SLURM script (see Section 4) so the task runs on a compute node (rather than executing it while logged into a file server or the login node).

  • If using a SLURM script is not easily workable, you can run a command in the background by adding a & to the end of the command line; prefixing it with nohup ensures it keeps running even if you close the terminal window or log off. This is particularly helpful when copying large directories from one file server to another, for example, where the completion time might be on the order of several hours (see the example following this list).

  • Finally, you might encounter a case where you need to run something time-consuming in the terminal window and monitor its output (and/or get frustrated if you are using a shaky network connection prone to disconnects). There is a nifty Unix command, screen, that helps by allowing you to “detach” a terminal session so that it persists even if you log off. To create the screen session, from the command line type screen -e ^T -S «myscreenname»; this will put you in a fresh terminal session. To detach the session, type ctrl-t d (ctrl = the control key); it is now safely tucked away in the background, continuing to run. To list any running screen sessions, from another terminal window (logged into the node where the screen is running), type screen -ls. To reattach to a screen session, type screen -x «myscreenname» (again, while logged into the same node). To kill a screen session, type exit on the command line from within the session.
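
For example, a minimal sketch of the background-copy case mentioned above (the paths are hypothetical placeholders); nohup keeps the copy running after you log off, with its output collected in nohup.out:

  nohup cp -r /net/fs02/d0/«username»/run01 /net/fs08/d0/«username»/ &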

2.7. Svante node specs

Table 2.1 Svante Node Specifications
Node(s)      | External Name | Cores | RAM (GB) | CPU Arch.               | CPU Model    | GHz  | IB     | Partition | Size (TB) | Usage
-------------+---------------+-------+----------+-------------------------+--------------+------+--------+-----------+-----------+-------------------
svante-login | svante-login  | 16    | 64       | broadwell               | E5-2609 v4   | 1.70 | EDR    |           |           | FDR/EDR login node
svante       | svante        | 16    | 64       | cascade lake            |              | 2.10 | EDR    |           |           | HDR login node
c001 - c036  |               | 48    | 256      | xeon scalable (3rd gen) | Gold 6336Y   | 2.40 | HDR    | /scratch  | 4         | compute nodes
c041 - c060  |               | 16    | 64       | sandy bridge            | E5-2670      | 2.60 | FDR    | /scratch  | 2         | compute nodes
abba         |               | 24    | 128      | haswell                 | E5-2680 v3   | 2.50 | FDR    | /scratch  | 4         | offline
c061 - c096  |               | 32    | 128      | broadwell               | E5-2697A v4  | 2.60 | EDR    | /scratch  | 4         | compute nodes
c097 - c132  |               | 48    | 256      | xeon scalable (4th gen) | Gold 5418Y   | 2.00 | HDR200 | /scratch  | 4         | compute nodes
g001         |               | 28    | 128      | broadwell               | E5-2680 v4   | 2.40 | GPU    | /scratch  |           | gpu node
fs01         |               | 20    | 512      | xeon scalable           | Silver 4210  | 2.20 | EDR    | /data     | 570       | external datasets
fs02         | svante2       | 16    | 96       | sandy bridge            | E5-2660      | 2.20 | FDR    | /d0       | 80        | Marshall Group
fs03         | svante3       | 24    | 512      | xeon scalable           | Silver 4214R | 2.40 | EDR    | /d0       | 250       | Selin Group
             |               |       |          |                         |              |      |        | /d1       | 120       | Selin Group
fs04         | svante4       | 16    | 128      | broadwell               | E5-2620 v4   | 2.10 | EDR    | /d0       | 100       | Land Group
             |               |       |          |                         |              |      |        | /d1       | 100       | Land Group
             |               |       |          |                         |              |      |        | /d2       | 255       | JP EPPA Group
fs05         | svante5       | 4     | 24       | nehalem                 | W3530        | 2.80 | FDR    | /d0       | 40        | Land Group
             |               |       |          |                         |              |      |        | /d1       | 120       | Land Group
fs06         | svante6       | 32    | 512      | xeon scalable           | Silver 4314  | 2.40 | EDR    | /d0       | 648       | Kang Group
             |               |       |          |                         |              |      |        | /d1       | 648       | Heald Group
             |               |       |          |                         |              |      |        | /d2       | 648       | CGCS/Prinn Group
             |               |       |          |                         |              |      |        | /d3       | 648       | BC3 Group
             |               |       |          |                         |              |      |        | /d4       | 648       | Solomon Group
fs07         | svante7       | 16    | 96       | sandy bridge            | E5-2660      | 2.20 | FDR    | /d0       | 80        | Land Group
             |               |       |          |                         |              |      |        | /d1       | 70        | Land Group
fs08         | svante8       | 16    | 128      | ivy bridge              | E5-2640 v2   | 2.00 | FDR    | /d0       | 150       | Marshall Group
             |               |       |          |                         |              |      |        | /d1       | 65        | Marshall Group
             |               |       |          |                         |              |      |        | /d2       | 70        | Marshall Group
fs09         | svante9       | 32    | 512      | xeon scalable           | Silver 4314  | 2.40 | EDR    | /d0       | 1050      | Fiore Group
             |               |       |          |                         |              |      |        | /archive  | 1050      | old user archive
fs10         | svante10      | 16    | 128      | ivy bridge              | E5-2650 v2   | 2.60 | FDR    | /d0       | 110       | Marshall Group
             |               |       |          |                         |              |      |        | /d1       | 110       | JP Climate Group
             |               |       |          |                         |              |      |        | /d2       | 110       | JP Climate Group
fs11         | svante11      | 24    | 512      | broadwell               | E5-2650 v4   | 2.20 | EDR    | /d0       | 150       | Selin Group
             |               |       |          |                         |              |      |        | /d1       | 150       | Selin Group
fs12         | svante12      | 32    | 512      | xeon scalable           | Silver 4314  | 2.40 | EDR    | /d0       | 375       | JP Climate Group
             |               |       |          |                         |              |      |        | /d1       | 375       | LAE Group
             |               |       |          |                         |              |      |        | /d2       | 375       | Land Group
geoschem     | geoschem      | 12    | 32       | ivy bridge              | E5-2620 v2   | 2.10 | FDR    | /data     | 166       | geoschem data

Notes:

  • At present, all Svante nodes’ CPUs use Intel architecture.

  • Svante’s operating system is Linux, and by default terminal sessions use the Bash shell, although we maintain limited legacy support for C shell.

  • Nodes with external names can be accessed from outside the Svante cluster, e.g. ssh -Y «username»@svante2.mit.edu would log you into fs02.

  • From inside Svante, it is possible to ssh to compute nodes as well as file servers (as discussed above), but for compute nodes only if the user has a SLURM job running concurrently on that specific node.

  • The abba nodes – benny, frida, bjorn, and agnetha – do not follow the cxxx compute node naming convention, but are in the same partition as the other FDR nodes c041 - c060. Until this is changed, requesting abba nodes to the exclusion of other FDR nodes is a bit awkward; however, one can always request a specific node by name using the SLURM -w option (see Section 4.2).

  • Although clock speeds are fairly similar across all nodes, in terms of general speed, nehalem < sandy bridge < ivy bridge < haswell < broadwell < xeon scalable; your job might run 40% faster on c070 than on c045, for example. Improvements in CPU efficiency and RAM speed over the years translate into faster calculations and, typically, better job efficiency.

  • InfiniBand (IB) is a super-fast network connection, running in parallel with the standard ethernet connection. Note, however, that although the FDR, EDR, and HDR switches are interconnected, MPI jobs cannot span compute nodes on different IB fabrics (i.e., across FDR and EDR, EDR and HDR, or FDR and HDR).

  • As presently configured, jobs cannot span the HDR and HDR200 partitions. The HDR200 nodes’ interconnect is twice as fast, and they use a newer, improved CPU (albeit running at a slightly lower clock speed), but otherwise HDR200 nodes will run similarly compiled code as the HDR nodes and give identical results.

  • To gain access to our GPU node, in the SLURM (#SBATCH/srun) request you must specify the partition, -p gpu, AND also how many GPUs you require: --gres=gpu:1 would ask for a single GPU (the GPU node g001 contains four GPU chips in total). Note: the GPU node only supports up to CUDA 11.4. A minimal example script is sketched below.
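
A minimal batch-script sketch for requesting the GPU node (the task count, time limit, and final command are placeholders; only -p gpu and --gres are required, as described above):

  #!/bin/bash
  #SBATCH -p gpu           # GPU partition (required)
  #SBATCH --gres=gpu:1     # request one of g001's four GPUs (required)
  #SBATCH -n 1             # single task (placeholder)
  #SBATCH -t 01:00:00      # placeholder time limit

  nvidia-smi               # placeholder; replace with your GPU application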