Rubel has been a shared compute cluster at NREL since around 2005. From a brief early experiment with 3 Sun Ultra 5 systems to an initial system of 9 nodes and 18 processing threads, it has grown to over 1800 threads today.
In that time, we have used a lot of mechanisms for sharing the resource (see below for ways we have parallelized code!), but the tool that has emerged as the one we want to use is Slurm. Slurm is a cluster workload manager with a huge number of features, but for us the strong points are:
If your job is not well behaved you will either have problems or cause problems. Let's discuss things to see what solution is best.
Jobs will now be terminated if they exceed their allocation of:
This is an important change to help keep rubel stable and to allow jobs to co-exist, but it requires everyone to know their jobs better and request appropriate resources. We know this is hard, especially with complex code or code that uses libraries from other people. To solve this we must do some actual testing of the code.
It may be tempting to simply request a lot of resources, but that will both slow your job down and harm others.
From within the WCNR/NREL network, ssh to rubel.nrel.colostate.edu. This will land you on one of the login nodes.
If off campus, you can ssh to trailridge.nrel.colostate.edu, and then from there ssh to rubel.nrel.colostate.edu.
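For example, assuming your local ssh client supports the -J (ProxyJump) option and your username is the same on both systems, the two hops can be combined into one command (replace yourname with your actual username):
ssh -J yourname@trailridge.nrel.colostate.edu yourname@rubel.nrel.colostate.edu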
Log into rubel, and from one of the login nodes run 'sbatch yourscript'. In the simplest form, an sbatch script
has several parameters set at the top in lines that start with "#SBATCH". For example, here is a script that just runs
the sleep command:
#!/usr/bin/env bash
#SBATCH --array=1-3600 # Run these array ids
#SBATCH --job-name=sleeper # A single job name for the array
#SBATCH --ntasks-per-node=1 # Each job uses one core...
#SBATCH --nodes=1 # ...on one node
#SBATCH --output=/dev/null # Standard output
#SBATCH --error=/dev/null # Standard error
#SBATCH --mem-per-cpu=800M # Limit each task to 800MB of RAM
#SBATCH --time=3:00 # Time limit of 3 minutes per task
#SBATCH --mail-user=Your.Email@colostate.edu
#SBATCH --mail-type=ALL
sleep $(( $RANDOM % 20 + 10 ))
# (This command does not matter much, just an example of sleeping between 10 and 30 seconds)
This is just like having a bash script with just the sleep line, but calling it as:
sbatch --array=1-3600 --job-name=sleeper --ntasks-per-node=1 --nodes=1 --output=/dev/null --error=/dev/null --time=3:00 --mem-per-cpu=800M --mail-user=Your.Email@colostate.edu --mail-type=ALL script.sh
Clearly, keeping all the options in the script makes the job easier to remember and to repeat.
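In real jobs, each array task usually needs to know which index it is; Slurm provides this in the SLURM_ARRAY_TASK_ID environment variable. A sketch of a more realistic last line for the script above (the script and file names here are made up for illustration) might be:
Rscript process_chunk.R input_${SLURM_ARRAY_TASK_ID}.csv   # each array task works on its own input file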
A breakdown of these common options:
If you don't want to use all of the cores at once, you can set a limit to the number of simultaneous runs of your array like this:
#SBATCH --array=1-1000%500 # Run these array ids -- limit number of tasks to 500 at a time
This limit can also be adjusted after submission with scontrol by updating ArrayTaskThrottle, where 0 means no limit:
scontrol update jobid=yourjobid ArrayTaskThrottle=x
This mimics what we did with sharea/shareb/sharec, but is more flexible.
Recommendation: Use all cores when possible. But if sharing is needed you can adjust this to help share the system.
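For example, to raise the throttle on a hypothetical running array job 123456 to 800 simultaneous tasks, or to remove the limit entirely:
scontrol update jobid=123456 ArrayTaskThrottle=800
scontrol update jobid=123456 ArrayTaskThrottle=0   # 0 removes the limit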
You can start a job, get its jobid, then load the next job into the queue with a specification like:
#SBATCH --dependency=afterany:jobidnumber
The "afterany" option says to wait until the jobid specified by jobidnumber is done, then run this job. The "after"
part makes sense to everyone, and the "any" part refers to the status of the job you are waiting on. "afterany" will
run this job regardless of whether the job we are waiting on exits cleanly or with an error. "afterok" will only run
if the job we are waiting on exits cleanly. There are more complex dependencies that can be built. See the sbatch documentation for all the info.
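If you want to script this kind of chaining, sbatch's --parsable option prints just the job id, which makes it easy to capture. A minimal sketch (the script names are hypothetical):
jobid=$(sbatch --parsable first_step.sh)            # submit the first job and save its id
sbatch --dependency=afterok:$jobid second_step.sh   # runs only if the first job exits cleanly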
We used to be able to jump from node to node with 'rsh' commands. That no longer works. If you need similar functionality, you can use 'srun' instead of 'sbatch' to run a job through slurm. While 'sbatch' puts jobs into the queue to be run, 'srun' tries to run a job right now, but srun jobs are still subject to the scheduling rules.
One common use of this would be to get a shell on a test node to test your script. This can be done with a command like this:
srun --partition=test --pty $SHELL
This will give you a shell on one of the test nodes, with the defaults of 1 cpu on one node, 800MB of RAM, and a
time limit equal to the maximum time allowed in the test partition.
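If the defaults are not enough for your interactive testing, you can add the usual resource options to srun; for example (these particular values are just placeholders):
srun --partition=test --nodes=1 --ntasks=1 --mem-per-cpu=4G --pty $SHELL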
This is a big change -- we get much better control over memory use now, so it is best to know how much RAM our jobs will use. In most of our work, people run many iterations of a similar piece of code. Often each run will have the same memory footprint; sometimes it will differ based on the data (for example, even with the same code it takes more RAM to process Texas than Connecticut). It makes sense to pick a large job and use that as a test. Maybe you know that you are running an array from 1-5000, but array id #150 is a large-ish one that would be good to test with. Set up a script so that you run just that one step. That might be as simple as using your normal driver script and changing --array=1-5000 to --array=150.
Once you have set up your single-run job, you can run it in the test partition if possible, but if it is large you might
need to test it in the main rubel partition. I will assume it will fit in the test partition for these examples.
When you use sbatch to submit a job, any options given on the command line override the sbatch options set in the script.
So you could likely take your main sbatch script and run:
sbatch --partition=test --array=150 your_sbatch_script
By default, this will get 800MB of RAM. Maybe this runs to completion, in which case you are fine with the default amount of
memory and don't need to do anything else. This is the case for many of our jobs like DayCent. But for larger jobs, it may
fail with an out of memory error.
If you get an out of memory error, you will need to increase the memory by adding --mem-per-cpu=2G or similar to your command. Since the test nodes only have around 7000MB available, if you go larger than that, you will need to test in the main rubel partition.
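For example, a retest of that same array task with a larger memory request might look like this (2G is just an illustration; use whatever your testing shows the job needs):
sbatch --partition=test --array=150 --mem-per-cpu=2G your_sbatch_script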
When you run the sbatch command, you will get a job id. After the job has completed, you can get information about the job by running:
seff jobid
Seff will tell you how much memory and how much CPU time the job used. Naturally, if you requested 50GB for testing
and your job only needed 12GB, lower your request to a little more than 12GB; otherwise fewer of your tasks can run at
once and you will consume resources unnecessarily.
Another option, also for after the job has run, is the slurm accounting report tool:
sacct -l
which will list information about all the jobs you have run since midnight. The output is kind of messy (it can be
cleaned up with format options), but it shows the memory use in the MaxVMSize column.
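You can also ask sacct about one specific job and pick just the fields you care about; a sketch with a hypothetical job id:
sacct -j 123456 --format=JobID,JobName,Elapsed,MaxRSS,MaxVMSize,State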
A third option can be helpful if you want to test outside of the sbatch framework, like if you just want to run an
R command and see how much memory that uses. This method does not need slurm, so you can use it on Calypso or other
linux systems. To use this technique on Rubel, you still need to work within slurm, so you would get an interactive
shell on a test node using the method shown above (though you might alter that srun command with a larger --mem-per-cpu
option, or else you will get the default 800MB). Then once you are on a compute node, you run your job. Let's say you want to run:
Rscript myscript.R
Instead, you would run:
/bin/time -v Rscript myscript.R
This would show you information about memory use that you could use to build your sbatch scripts.
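Since /bin/time -v writes its report to standard error, a handy way to see just the peak memory is something like this (the value is reported in kilobytes):
/bin/time -v Rscript myscript.R 2>&1 | grep "Maximum resident set size"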
See the section above on interactive scheduling
Some things that may be similar:
sinfo -Nlp rubel
The State column shows idle/alloc/comp/down/etc. It also shows the CPU count and memory of each node.
squeue | awk '$5=="R"{print $8 " " $4 " " $3}' | sort | uniq -c
Shows a count of how many jobs are running on each node, sort of like the load averages that ruptime used to show.
To see how your jobs went, sacct can give clearer output like:
sacct --format=JobID%16,Jobname%10,partition,state,elapsed,ReqMem,MaxRss,MaxVMSize,nodelist --starttime=today --endtime=now
Or you can limit it to only jobs that failed (state=failed), ran out of memory (state=oom, which is Out Of Memory), or any other state:
sacct --format=JobID%16,Jobname%10,partition,state,elapsed,ReqMem,MaxRss,MaxVMSize,nodelist --starttime=today --endtime=now --state=oom
There are a thousand ways to build useful slurm commands. Please keep sharing them!
The topics in the last post were presented to give an overview of what resources are available. After that, discussion followed about how people are using computing resources and what the pressing needs are.
NREL has had many computing "islands" form, where individuals or small groups have worked on computing tasks and advanced their own projects. Sometimes people in that situation feel isolated, or feel that they must solve every computational challenge on their own. These islands form out of isolation, when people are unaware of others who may be doing similar work, and they solidify over time as people get busy and focus on the work. We all face this -- sometimes it feels necessary to keep our heads down and get the job done rather than put the time and effort (some of which is seen as lost productivity) into forging relationships with other people who are doing similar, but separate, work. These meetings are an effort to help facilitate the conversation, so that we can all learn from one another.
This is not an attempt to homogenize -- that is the wrong goal. Instead, the intent is to leverage good ideas and techniques when and where it is best for the science and developer, and to help foster a community where we can all bounce ideas around to find the best solutions.
Trying to squeeze more direct sciency work out of a computer.
Also characterized by having the thought, "Gee, I wish I had a bigger computer."