6. Best Practices for Svante Use

This section aims to address some of the most common questions about Svante usage.

6.1. Use of login node (aka “head node”) vs. file servers vs. compute nodes

6.1.1. Login node usage

svante-login’s main purpose is precisely that: it allows one to log in and “look around”, submit or monitor jobs in SLURM, and it acts as a gateway to interactive sessions on file servers or compute nodes. Proper uses of the login node include listing directories and looking at files, copying small files, editing, submitting jobs, and running simple scripts (requiring trivial processing).
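
For example, typical acceptable activity on the login node might look like the following (the job script name and job ID are illustrative):

  $ sbatch my_job.slurm      # submit a batch job to the SLURM scheduler
  $ squeue -u $USER          # check the status of your queued and running jobs
  $ scancel 1234567          # cancel one of your jobs by its job ID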

Do not run RAM-intensive jobs on the login node, do not do massive copying or deleting of files, and do not run tasks that use many cores or access many files. DO NOT RUN PYTHON, MATLAB, ETC. ON THE LOGIN NODE. Such use will incur the wrath of the Systems Administrator and Executive Director, as well as other cluster users frustrated by how slowly the login node responds to simple typed commands. Most work on Svante should occur while logged into a file server or a compute node; see below. There is no local disk space on the login node (/home is presently housed on fs01).

Improper use of the login node will not be tolerated.

6.1.2. File server usage

The main purpose of file servers is to store model output, large data sets, and/or any large output related to research projects. As such, the following computational activities are appropriate while logged into file servers:

  1. Copying and deleting files and directories - Ideally, massive copying or deleting of files should be done on the local machine where the disks are attached. If copying a large directory from file server A to B, ssh first to A, where you can access the disks locally as /d0, for example (as discussed in Section 2.3), so that only the target specification requires a remote /net/ pathway; see the sketch after this list. If you are downloading external data onto a file server, please scp (or use your file transfer protocol of choice) the data directly to these machines (svante2-svante11), rather than through the login node. Alternatively, for external data, contact us about storing the data on fs01.

  2. Data analysis using Python, MATLAB, GrADS, fortran analysis codes, etc. - Any interactive data analysis (i.e. opening, reading, processing, and/or writing new data files) should typically occur while logged into one of the file servers where the data are located, particularly if the analysis involves heavy file input/output. Accessing data and files on local disks is MUCH faster than going through a remote /net/ pathway. Moreover, the bandwidth for remote /net/ pathway data transfer is limited, and remote file access during periods of heavy cluster usage can result in your data analysis routines running very slowly. Note that if you must load a massive data set into memory for data analysis, the RAM capacity of fs03 or fs11 makes them your best choice (see Table 2.1); otherwise, for routine computational analysis, use a file server where your supervisor or PI has data stored.

  3. Running small single-core or multi-core executables - File servers typically have significant computational horsepower, with 8-24 cores, making them a simple way to interactively run a single-core executable without having to bother with SLURM. This can even be done with small MPI or shared-memory jobs using a few cores (see the sketch after this list). The only caveat is that you are effectively sharing the file server with your fellow group researchers, who may suffer if you monopolize the server’s cores or RAM; if in doubt, give them a heads-up.
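
As a rough sketch of items 1 and 3 above (the server names, paths, file names, and executable are placeholders; substitute your own):

  # Item 1: copy a large directory from file server A to B; log into A first,
  # so the source disks are local and only the target needs a /net/ pathway
  $ ssh svante5
  $ rsync -av /d0/$USER/run_output/ /net/fs09/d0/$USER/run_output/

  # Item 1: pull external data directly onto the file server (run from the
  # external machine; the hostname shown is illustrative)
  $ scp bigfile.nc username@svante5.mit.edu:/d0/username/data/

  # Item 3: a small interactive MPI run using a few of the file server's cores
  $ mpirun -np 4 ./my_analysis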

6.1.3. Compute node usage

The main purpose of the compute nodes is to run large model simulations, ensembles of model simulations, ensembles of data analysis scripts, or to compile code. Although it is possible to run small (or short) jobs interactively on file servers, we typically do not want to see people running production jobs on the file server nodes. The SLURM scheduler distributes submitted jobs among our pool of compute nodes; see Section 4 for detailed instructions. It is certainly acceptable, however, to use a compute node interactively instead of a file server for data analysis (especially if the task involves significant number crunching, as the compute nodes are generally fastest at this); see Section 4.5 on how to get on a compute node interactively for this purpose. Interactive sessions on compute nodes are also recommended for compiling code.
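
For reference, requesting an interactive session on a compute node through SLURM looks something like the following (Section 4.5 has the authoritative instructions; the partition name, core count, and walltime here are placeholders):

  # Request an interactive shell on one FDR node with 4 cores for 2 hours
  $ srun -p fdr -N 1 -n 4 --time=2:00:00 --pty bash
  # ... run your analysis or compilation, then exit to release the allocation
  $ exit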

6.2. Where should I compile code for running on Svante?

Users should compile on the node type where they intend to run. Most compilers are smart enough to query the local machine type and will tailor the compilation toward that specific architecture (note that one can usually also force machine-specific compilation using compiler options). What we’ve found is that executables compiled on the oldest machines will often run on anything newer, but compilations on the newest machines only execute there, giving ‘illegal instruction’ errors on older machines. There are potential speed boosts from compiling on the node type where you plan to run, although how much, if any, is unclear and likely depends on the code. Our advice: if you have a big MPI job that you’d only run on the EDR nodes, compile it on an EDR node. Otherwise, compile jobs intended to run on EDR or FDR on an FDR node. For something to run on HDR, it most typically must be compiled on HDR.
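
As a sketch, compilation from an interactive session on the target node type might look like the following (the source file and executable names are illustrative, and the flags shown simply ask the compiler to tune for the machine doing the compiling):

  # From an interactive shell on a node of the type you plan to run on:
  $ gfortran -O2 -march=native mymodel.f90 -o mymodel    # GNU: tune to the local CPU
  $ ifort -O2 -xHost mymodel.f90 -o mymodel              # Intel: analogous option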

Do not compile on the svante-login node. We’ve even purposely broken some compilers on svante-login; the login node requires extra security and runs a different OS from the compute nodes, and is thus problematic for compilation. As a technical note, we very rarely upgrade the OS on compute nodes (say, once or twice a decade); this makes it much easier to maintain working codes over long time periods. In contrast, the login node requires frequent OS upgrades, each of which has the possibility of breaking working code.

What about compiling for file servers? If you intend to run code on a file server, our first suggestion would be to compile on an FDR node and see if it runs on the file server. If it doesn’t, then try to compile and run on the file server where you intend to run the code. You may need assistance from jp-admin@techsquare.com to get this working properly (the computing environment and operating system vary across file server nodes).

Many of the models we run on Svante are fundamentally chaotic, and as such, obtaining reproducible output is usually only possible if one uses the exact same executable on identical hardware. Almost any variation will result in different output; this may just be in the insignificant digits of a double precision variable, or it may be more significant. Because our compute nodes are not directly accessible through the internet (and thus don’t require frequent security updates), to the extent possible we “freeze” the operating system and computing environment so that machine-precision reproduction is possible. However, at some point upgrades are unavoidable. If obtaining machine-precision reproducible results over long time periods is critical to your work, you should discuss your requirements with us, as we cannot guarantee how long we can maintain a frozen compute node environment.

6.3. To which compute node partition should I submit my job?

EDR is faster than FDR, and HDR is faster than EDR, not just in terms of the nodes’ general number crunching capabilities, but also in InfiniBand network speed. If you are running a large MPI job with significant node-to-node intercommunication, your best choice is HDR, second choice EDR. This, however, depends on the application; some models spend most of their time doing numerical calculations, in which case the IB speed doesn’t matter so much, putting EDR and FDR nodes at only a slight disadvantage. If you are running a shared-memory code such as GEOS-Chem, you are limited by the number of cores on a given node: HDR nodes allow 48-core shared-memory jobs, EDR nodes allow 32-core jobs, whereas FDR nodes permit 24-core (abba nodes) or 16-core (c041-c060) jobs, which may impact your decision on where to submit. If you are running an ensemble of single-core jobs, we strongly suggest (and in fact insist) that these jobs be run on EDR or FDR nodes, as large MPI jobs get priority on HDR nodes.
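
For illustration, the partition is selected in the batch script header; a minimal sketch is below (the partition name, resource counts, and executable are placeholders, and Section 4 covers the real options):

  #!/bin/bash
  #SBATCH --partition=edr       # or fdr / hdr, per the guidance above
  #SBATCH --ntasks=64           # number of MPI ranks
  #SBATCH --time=2-00:00:00     # requested walltime: 2 days
  mpirun ./mymodel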

6.4. Where should I write model output?

If you know in advance that you need to keep run output for an extended period, it may make sense to write run output directly to your file server space, although beware of possible bottlenecks when too many people attempt to access disks on a single file server at one time. We’ve also noticed that a single user copying huge amounts of data to and from a file server can effectively grind all other access on that server to a near standstill, so be forewarned. Using local /scratch space on the compute nodes for run output is strongly encouraged, especially if run output need not be saved permanently. If you are writing large amounts of data, e.g. 1 TB or more, give jp-admin@techsquare.com a heads-up and we’ll clean out old files on /scratch to ensure sufficient space.
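
One common pattern, sketched below with placeholder paths and names, is to write output to the compute node’s local /scratch during the run and copy anything worth keeping to file server space at the end of the job:

  #!/bin/bash
  #SBATCH --partition=fdr
  # Run out of node-local scratch space (directory name is illustrative)
  RUNDIR=/scratch/$USER/myrun
  mkdir -p $RUNDIR
  cd $RUNDIR
  mpirun $HOME/models/mymodel          # executable location is illustrative
  # Keep only what you need, copied onto your group's file server space
  rsync -av $RUNDIR/ /net/fs09/d0/$USER/myrun/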

Do not write model output to /home, or write other data there for temporary storage. Besides the obvious consideration that /home is rather limited in size (500GB per user), the offsite backup of /home runs daily, and if folks are constantly deleting and repopulating their home spaces, the backup is unlikely to complete its daily cycle and thus will fail to properly back up many users.

6.5. Are there limits on my usage (or computational “footprint”) on Svante?

At present, users are limited to 400 running cores on EDR nodes, and the maximum EDR walltime is 4 days. There are no restrictions on core counts or walltimes on FDR nodes (nor does FDR usage count toward the 400 EDR core limit). Limits on HDR core usage remain TBD at this time; contact Jeff if you want to use these nodes.
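
To see how many cores your running jobs currently occupy, something along these lines should work (the format string simply prints the job ID, partition, and allocated CPU count for each running job):

  $ squeue -u $USER -t RUNNING -o "%i %P %C"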

There is no limit to storage on file server spaces other than the disk partition size itself. That being said, individual users and research groups are responsible for monitoring their file server space usage. Ideally, disks should not go above 95-97% capacity; in fact, disk speed performance starts to decline somewhere above 90% full. Filling a file server disk to 100% is painful, i.e. costly in terms of tech support hours. To check free disk space, run df -h /d0, for example, to get information on the /d0 partition when logged into a file server.
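
A couple of standard commands for keeping an eye on usage while logged into a file server (the partition and directory names are placeholders):

  $ df -h /d0              # free space and percent used on the /d0 partition
  $ du -sh /d0/$USER/*     # which of your directories are taking up the space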

In general, an individual’s accountability for using the cluster “responsibly” is proportional to the magnitude of their computational task. In other words, for small jobs on few cores using few files, the repercussions of running in a clumsy manner are usually minimal, whereas a poorly conceived setup for huge jobs with copious output can effectively make the cluster a miserable environment for all users. If you are unsure about any significant undertaking, please consult with the Executive Director beforehand; if necessary, they will consult with our tech support on the best way to set up your work.