
NREL Computing Conversations

Rubel Updates
2024-10-15

Rubel becoming a lot more Slurmy!

Rubel has been a shared compute cluster at NREL since around 2005. From a brief initial experiment with 3 Sun Ultra 5 systems, to a first system with 9 nodes and 18 processing threads, it has grown to over 1800 threads today.

In that time, we've used many mechanisms for working with the resource (see below for the ways we have parallelized code!), but the tool that has emerged as the one we want to use is Slurm. Slurm is a cluster workload manager with a huge number of features, but for us the strong points are:

Structure

Well Behaved Rubel Jobs (Ideal)

If your job is not well behaved, you will either have problems or cause problems. Let's discuss things to see what solution is best.

Slurm will now enforce job resource limits

Jobs will now be terminated if they exceed their allocation of:

This is an important change to help keep Rubel stable and to allow jobs to co-exist, but it requires everyone to know their jobs better and request appropriate resources. We know this is hard, especially with complex code or code that uses libraries from other people. To solve this, we must do some actual testing of the code.

It may be tempting to simply request a lot of resources, but that will both slow your job down and harm others.

Connecting to Rubel

From within the WCNR/NREL network, ssh to rubel.nrel.colostate.edu. This will land you on one of the login nodes.

If off campus, you can ssh to trailridge.nrel.colostate.edu, and then from there ssh to rubel.nrel.colostate.edu.
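A quick sketch of both routes (replace "username" with your own account name; it is just a placeholder here):

# On campus: connect directly to a login node
ssh username@rubel.nrel.colostate.edu

# Off campus: hop through trailridge first, then on to rubel
ssh username@trailridge.nrel.colostate.edu
ssh username@rubel.nrel.colostate.edu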

Partitions: Uses and Restrictions

Basic batch scheduling with sbatch

Log into Rubel, and from one of the login nodes run 'sbatch yourscript'. In its simplest form, an sbatch script has several parameters set at the top as lines that start with "#SBATCH". For example, here is a script that just runs the sleep command:

#!/usr/bin/env bash
#SBATCH --array=1-3600        # Run these array ids
#SBATCH --job-name=sleeper    # A single job name for the array
#SBATCH --ntasks-per-node=1   # Each job uses one core...
#SBATCH --nodes=1             # ...on one node
#SBATCH --output=/dev/null    # Standard output
#SBATCH --error=/dev/null     # Standard error
#SBATCH --mem-per-cpu=800M
#SBATCH --mail-user=Your.Email@colostate.edu
#SBATCH --mail-type=ALL

sleep $(( $RANDOM % 20 + 10 ))   # (The command does not matter much; this one just sleeps between 10 and 30 seconds)

This is just like having a bash script containing only the sleep line, but calling it as:

sbatch --array=1-3600 --job-name=sleeper --ntasks-per-node=1 --nodes=1 --output=/dev/null --error=/dev/null --time=3:00 --mem-per-cpu=800M --mail-user=Ty.Boyack@colostate.edu --mail-type=ALL script.sh

Clearly, keeping all the options in the script makes the job easier to remember and repeat.

A breakdown of these common options:

If you don't want to use all of the cores at once, you can limit the number of simultaneous runs of your array like this:

#SBATCH --array=1-1000%500    # Run these array ids -- limit to 500 tasks at a time

This limit can be adjusted after submission with scontrol by updating ArrayTaskThrottle (where 0 means no limit):

scontrol update job=jobid ArrayTaskThrottle=x

This mimics what we did with sharea/shareb/sharec, but is more flexible.
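For example, to loosen the throttle on an array that is already in the queue (the job id below is just a placeholder):

# Allow up to 200 tasks of hypothetical array job 123456 to run at once
scontrol update job=123456 ArrayTaskThrottle=200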

Recommendation: Use all cores when possible, but if others need the system, you can adjust this limit to share it.

Workflow Scheduling

You can start a job, get its jobid, then load the next job into the queue with a specification like:

#SBATCH --dependency=afterany:jobidnumber

The "afterany" option says to wait until the job specified by jobidnumber is done, then run this job. The "after" part makes sense to everyone, and the "any" part refers to the exit status of the job you are waiting on: "afterany" will run this job regardless of whether the job we are waiting on exits cleanly or with an error, while "afterok" will only run if the job we are waiting on exits cleanly. More complex dependencies can be built; see the sbatch documentation for all the details.
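As a sketch of how this can be chained from the command line (the script names here are hypothetical), sbatch's --parsable option prints just the job id, which can then be fed to the next submission:

# Submit the first step and capture its job id (--parsable prints only the id)
jobid=$(sbatch --parsable step1.sh)
# Queue the second step to start only if the first one exits cleanly
sbatch --dependency=afterok:$jobid step2.sh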

Interactive Scheduling

We used to be able to jump from node to node with 'rsh' commands. That no longer works. If you need similar functionality, you can use 'srun' instead of 'sbatch' to run a job through slurm. While 'sbatch' puts jobs into the queue to be run, 'srun' tries to run a job right now, but srun jobs are still subject to the scheduling rules.

One common use of this is to get a shell on a test node to test your script. This can be done with a command like:

srun --partition=test --pty $SHELL

This will give you a shell on one of the test nodes with the defaults of 1 CPU on one node and 800MB of RAM, with time limited to the maximum time in the test partition.
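If the defaults are too small for what you want to try, the same command takes the usual resource options; for example (the values here are just placeholders to adjust to your job):

# Interactive shell on a test node with 2 CPUs and 2GB per CPU instead of the defaults
srun --partition=test --cpus-per-task=2 --mem-per-cpu=2G --pty $SHELL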

Memory Use -- how to test and measure

This is a big change -- we now get much better control over memory use, so it is best to know how much RAM our jobs will use. In most of our work, people run many iterations of a similar piece of code. Often each run will have the same memory footprint; sometimes it will differ based on the data (for example, even with the same code it takes more RAM to process Texas than Connecticut). It makes sense to pick a large job and use that as a test. Maybe you know that you are running an array from 1-5000, but array id #150 is a large-ish one that would be good to test with. Set up a script so that you run just that one step. That might be as simple as using your normal driver script and changing --array=1-5000 to --array=150.

Once you have set up your single-run job, you can run it in the test partition if possible, but if it is large you might need to test it in the main rubel partition. I will assume it will fit in the test partition for these examples.
When you use sbatch to submit a job, any options given on the command line override the sbatch options set in the script. So you could likely take your main sbatch script and run:

sbatch --partition=test --array=150 your_sbatch_script

By default, this will get 800MB of RAM. Maybe this runs to completion, in which case you are fine with the default amount of memory and don't need to do anything else. This is the case for many of our jobs, like DayCent. But for larger jobs, it may fail with an out-of-memory error.

If you get an out of memory error, you will need to increase the memory by adding --mem-per-cpu=2G or similar to your command. Since the test nodes only have around 7000MB available, if you go larger than that, you will need to test in the main rubel partition.
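A sketch of that kind of test run, assuming the main partition is named "rubel" and starting from an arbitrary, generous 16G request that you would trim down afterward:

# Test the large array step with a generous memory request in the main partition
sbatch --partition=rubel --array=150 --mem-per-cpu=16G your_sbatch_script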

When you run the sbatch command, you will get a job id. After the job has completed, you can get information about it by running:

seff jobid

seff will tell you how much memory and how much CPU time the job used. Naturally, if you requested 50GB for testing and your job only needed 12GB, lower your request to a little more than 12GB; otherwise your job will be limited and you will consume resources unnecessarily.

Another option, also for after the job has run, is the Slurm accounting report tool:

sacct -l

which will list information about all the jobs you have run since midnight. The output is kind of messy (and can be cleaned up with format options), but it shows the memory use in MaxVMSize.

A third option can be helpful if you want to test outside of the sbatch framework, for example if you just want to run an R command and see how much memory it uses. This method does not need Slurm, so you can use it on Calypso or other Linux systems. To use this technique on Rubel, you still need to work within Slurm, so you would get an interactive shell on a test node using the method shown above (though you might alter that srun command with a larger --mem-per-cpu option, or else you get the default 800MB). Then, once you are on a compute node, you run your job. Let's say you want to run:

Rscript myscript.R

Instead, you would run:

/bin/time -v Rscript myscript.R

This would show you information about memory use that you could use to build your sbatch scripts.
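GNU time's verbose output includes a "Maximum resident set size" line, which is the number most directly comparable to --mem-per-cpu. A small sketch of pulling it out (the script and file names are just placeholders):

# GNU time writes its report to stderr, so redirect that to a file...
/bin/time -v Rscript myscript.R 2> time_report.txt
# ...then pull out the peak memory line (reported in kilobytes)
grep "Maximum resident set size" time_report.txt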

Helpful Commands for things that have changed

rsh and ssh to nodes no longer work

See the section above on interactive scheduling

ruptime is gone

Some things that may be similar:

sinfo -Nlp rubel

The State column shows idle/alloc/comp/down/etc., and the listing also shows the CPU and memory of each node.

squeue | awk '$5=="R"{print $8 " " $4 " " $3}' | sort | uniq -c

This shows a count of how many jobs are running on each node -- sort of like the load average that ruptime used to show.

Checking how your jobs went

To see how your jobs went, sacct can give clearer output, like:

sacct --format=JobID%16,Jobname%10,partition,state,elapsed,ReqMem,MaxRss,MaxVMSize,nodelist --starttime=today --endtime=now

Or you can limit it to only jobs that failed (state=failed), ran out of memory (state=oom, which is Out Of Memory), or any other state:

sacct --format=JobID%16,Jobname%10,partition,state,elapsed,ReqMem,MaxRss,MaxVMSize,nodelist --starttime=today --endtime=now --state=oom
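That format string is a lot to retype, so one option (the alias name is just a suggestion, pick whatever you like) is to wrap it in a shell alias in your ~/.bashrc:

# Hypothetical alias: "myjobs" lists today's jobs; "myjobs --state=oom" filters it further
alias myjobs='sacct --format=JobID%16,Jobname%10,partition,state,elapsed,ReqMem,MaxRss,MaxVMSize,nodelist --starttime=today --endtime=now'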

There are a thousand ways to build useful slurm commands. Please keep sharing them!

May 1 Meeting
2018-05-01

Meeting led by Chris Dorich and Ernie Marx

Meeting Notes from Chris
April 3 Meeting
2018-04-03

Topics for Discussion

Going Parallel

Methods for parallelizing within a single computer/node

Methods for parallelizing across nodes

Access Security

Visibility

Upgrades

Second Conversation Topics
2017-11-21

The next Computing Conversation will be on December 5th, 2017 at 11am in NESB B224

Rubel will be the primary topic

Rubel Resources

Rubel Topology

Accessing Rubel - User point of view

Accessing Rubel - System point of view

First Conversation Notes
2017-11-07

What Was Discussed?

The topics in the last post were presented to give an overview of what resources are available. After that, discussion followed on how people are using computing resources and what the pressing needs are.

Current Computing Uses

Pressing Needs for Discussion

First Conversation Topics
2017-11-07

Why have these computing conversations?

NREL has had many computing "islands" form, where individuals or small groups have worked on computing tasks and advanced their own projects. Sometimes people in that situation feel isolated, or that they must solve every computational challenge on their own. These islands form out of isolation when people are unaware of others who may be doing similar work, and form over time as people get busy and focus on the work. We all face this -- sometimes it feels necessary to keep our heads down and get the job done rather than put the time and effort (some of which is seen as lost productivity) into forging relationships with other people who are doing similar, but separate, work. These meetings are an effort to help facilitate the conversation, so that we can all learn from one another.

This is not an attempt to homogenize -- that is the wrong goal. Instead, the intent is to leverage good ideas and techniques when and where it is best for the science and developer, and to help foster a community where we can all bounce ideas around to find the best solutions.

We are here to build a community

(Introductions)

This group is being formed to share:

What do we mean by "Computing" in this context?

Trying to squeeze more direct sciency work out of a computer.

Also characterized by having the thought, "Gee, I wish I had a bigger computer."

Performance computing is a mentality as much as a tool. ★

History of CSU/NREL Computing

NREL Computing

CSU Computing

Storage:

State of CSU/NREL Computing

Computing

Storage:

Computing Options

(sidebar -- Server/workstation vs Desktop/Laptop)

Choosing a platform -- Funding

Choosing a platform -- OS

Choosing a platform -- Hardware

This is very specific to each task - we are here to discuss this anytime

Future Topics