Rubel has been a shared compute cluster at NREL since around 2005. From a brief early experiment with 3 Sun Ultra 5 systems to an initial system of 9 nodes and 18 processing threads, it has grown to over 1800 threads today.
In that time, we have used a lot of mechanisms for sharing the resource (see below for ways we have parallelized code!), but the tool that has emerged as the one we want to use is Slurm. Slurm is a cluster workload manager with a huge number of features, but for us the strong points are:
If your job is not well behaved you will either have problems or cause problems. Let's discuss things to see what solution is best.
Jobs will now be terminated if they exceed their allocation of:
This is an important change to help keep rubel stable and to allow jobs to co-exist, but it requires everyone to know their jobs better and request appropriate resources. We know this is hard, especially with complex code or code that uses libraries from other people. To solve this we must do some actual testing of the code.
It may be tempting to simply request a lot of resources, but that will both slow your job down and harm others.
From within the WCNR/NREL network, ssh to rubel.nrel.colostate.edu. This will land you on one of the login nodes.
If off campus, you can ssh to trailridge.nrel.colostate.edu, and then from there ssh to rubel.nrel.colostate.edu.
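For example, assuming your local ssh client supports the -J (ProxyJump) option and your username is the same on both systems, the two hops can be combined into one command (replace yourname with your actual username):
ssh -J yourname@trailridge.nrel.colostate.edu yourname@rubel.nrel.colostate.edu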
Log into rubel, and from one of the login nodes run 'sbatch yourscript'. In the simplest form, an sbatch script
has several parameters set at the top in lines that start with "#SBATCH". For example, here is a script that just runs
the sleep command:
#!/usr/bin/env bash
#SBATCH --array=1-3600 # Run these array ids
#SBATCH --job-name=sleeper # A single job name for the array
#SBATCH --ntasks-per-node=1 # Each job uses one core...
#SBATCH --nodes=1 # ...on one node
#SBATCH --output=/dev/null # Standard output
#SBATCH --error=/dev/null # Standard error
#SBATCH --mem-per-cpu=800M # Limit each task to 800MB of RAM
#SBATCH --time=3:00 # Time limit of 3 minutes per task
#SBATCH --mail-user=Your.Email@colostate.edu
#SBATCH --mail-type=ALL
sleep $(( $RANDOM % 20 + 10 ))
# (This command does not matter much, just an example of sleeping between 10 and 30 seconds)
This is just like having a bash script with just the sleep line, but calling it as:
sbatch --array=1-3600 --job-name=sleeper --ntasks-per-node=1 --nodes=1 --output=/dev/null --error=/dev/null --time=3:00 --mem-per-cpu=800M --mail-user=Your.Email@colostate.edu --mail-type=ALL script.sh
Clearly, keeping all the options in the script makes the job easier to remember and to repeat.
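In real jobs, each array task usually needs to know which index it is; Slurm provides this in the SLURM_ARRAY_TASK_ID environment variable. A sketch of a more realistic last line for the script above (the script and file names here are made up for illustration) might be:
Rscript process_chunk.R input_${SLURM_ARRAY_TASK_ID}.csv   # each array task works on its own input file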
A breakdown of these common options:
If you don't want to use all of the cores at once, you can set a limit to the number of simultaneous runs of your array like this:
#SBATCH --array=1-1000%500 # Run these array ids -- limit number of tasks to 500 at a time
This limit can also be adjusted after submission with scontrol by updating ArrayTaskThrottle, where 0 means no limit:
scontrol update jobid=yourjobid ArrayTaskThrottle=x
This mimics what we did with sharea/shareb/sharec, but is more flexible.
Recommendation: Use all cores when possible. But if sharing is needed you can adjust this to help share the system.
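For example, to raise the throttle on a hypothetical running array job 123456 to 800 simultaneous tasks, or to remove the limit entirely:
scontrol update jobid=123456 ArrayTaskThrottle=800
scontrol update jobid=123456 ArrayTaskThrottle=0   # 0 removes the limit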
You can start a job, get its jobid, then load the next job into the queue with a specification like:
#SBATCH --dependency=afterany:jobidnumber
The "afterany" option says to wait until the jobid specified by jobidnumber is done, then run this job. The "after"
part makes sense to everyone, and the "any" part refers to the status of the job you are waiting on. "afterany" will
run this job regardless of whether the job we are waiting on exits cleanly or with an error. "afterok" will only run
if the job we are waiting on exits cleanly. There are more complex dependencies that can be built. See the sbatch documentation for all the info.
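If you want to script this kind of chaining, sbatch's --parsable option prints just the job id, which makes it easy to capture. A minimal sketch (the script names are hypothetical):
jobid=$(sbatch --parsable first_step.sh)            # submit the first job and save its id
sbatch --dependency=afterok:$jobid second_step.sh   # runs only if the first job exits cleanly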
We used to be able to jump from node to node with 'rsh' commands. That no longer works. If you need similar functionality, you can use 'srun' instead of 'sbatch' to run a job through slurm. While 'sbatch' puts jobs into the queue to be run, 'srun' tries to run a job right now, but srun jobs are still subject to the scheduling rules.
One common use of this would be to get a shell on a test node to test your script. This can be done with a command like this:
srun --partition=test --pty $SHELL
This will give you a shell on one of the test nodes, with the defaults of 1 cpu on one node, 800MB of RAM, and a
time limit equal to the maximum time allowed in the test partition.
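If the defaults are not enough for your interactive testing, you can add the usual resource options to srun; for example (these particular values are just placeholders):
srun --partition=test --nodes=1 --ntasks=1 --mem-per-cpu=4G --pty $SHELL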
This is a big change -- we get much better control over memory use now, so it is best to know how much RAM our jobs will use. In most of our work, people run many iterations of a similar piece of code. Often each run will have the same memory footprint; sometimes it will differ based on the data (for example, even with the same code it takes more RAM to process Texas than Connecticut). It makes sense to pick a large job and use that as a test. Maybe you know that you are running an array from 1-5000, but array id #150 is a large-ish one that would be good to test with. Set up a script so that you run just that one step. That might be as simple as using your normal driver script and changing --array=1-5000 to --array=150.
Once you have set up your single-run job, you can run it in the test partition if possible, but if it is large you might
need to test it in the main rubel partition. I will assume it will fit in the test partition for these examples.
When you use sbatch to submit a job, any options given on the command line override the sbatch options set in the script.
So you could likely take your main sbatch script and run:
sbatch --partition=test --array=150 your_sbatch_script
By default, this will get 800MB of RAM. Maybe this runs to completion, in which case you are fine with the default amount of
memory and don't need to do anything else. This is the case for many of our jobs like DayCent. But for larger jobs, it may
fail with an out of memory error.
If you get an out of memory error, you will need to increase the memory by adding --mem-per-cpu=2G or similar to your command. Since the test nodes only have around 7000MB available, if you go larger than that, you will need to test in the main rubel partition.
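For example, a retest of that same array task with a larger memory request might look like this (2G is just an illustration; use whatever your testing shows the job needs):
sbatch --partition=test --array=150 --mem-per-cpu=2G your_sbatch_script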
When you run the sbatch command, you will get a job id. After the job has completed, you can get information about the job by running:
seff jobid
Seff will tell you how much memory and how much CPU time the job used. Naturally, if you requested 50GB for testing
and your job only needed 12GB, lower your request to a little more than 12GB; otherwise fewer of your tasks can run at
once and you will consume resources unnecessarily.
Another option, also for after the job has run, is the slurm accounting report tool:
sacct -l
which will list information about all the jobs you have run since midnight. The output is kind of messy (it can be
cleaned up with format options), but it shows the memory use in the MaxVMSize column.
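You can also ask sacct about one specific job and pick just the fields you care about; a sketch with a hypothetical job id:
sacct -j 123456 --format=JobID,JobName,Elapsed,MaxRSS,MaxVMSize,State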
A third option can be helpful if you want to test outside of the sbatch framework, like if you just want to run an
R command and see how much memory that uses. This method does not need slurm, so you can use it on Calypso or other
linux systems. To use this technique on Rubel, you still need to work within slurm, so you would get an interactive
shell on a test node using the method shown above (though you might alter that srun command with a larger --mem-per-cpu
option, or else you will get the default 800MB). Then once you are on a compute node, you run your job. Let's say you want to run:
Rscript myscript.R
Instead, you would run:
/bin/time -v Rscript myscript.R
This would show you information about memory use that you could use to build your sbatch scripts.
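Since /bin/time -v writes its report to standard error, a handy way to see just the peak memory is something like this (the value is reported in kilobytes):
/bin/time -v Rscript myscript.R 2>&1 | grep "Maximum resident set size"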
See the section above on interactive scheduling
Some things that may be similar:
sinfo -Nlp rubel
The State column shows idle/alloc/comp/down/etc. It also shows the CPU count and memory of each node.
squeue | awk '$5=="R"{print $8 " " $4 " " $3}' | sort | uniq -c
Shows a count of how many jobs are running on each node, sort of like the load averages that ruptime used to show.
To see how your jobs went, sacct can give clearer output like:
sacct --format=JobID%16,Jobname%10,partition,state,elapsed,ReqMem,MaxRss,MaxVMSize,nodelist --starttime=today --endtime=now
Or you can limit it to only jobs that failed (state=failed), ran out of memory (state=oom, which is Out Of Memory), or any other state:
sacct --format=JobID%16,Jobname%10,partition,state,elapsed,ReqMem,MaxRss,MaxVMSize,nodelist --starttime=today --endtime=now --state=oom
There are a thousand ways to build useful slurm commands. Please keep sharing them!
The topics in the last post were presented to give an overview of what resources are available. After that, discussion followed about how people are using computing resources and what the pressing needs are.
NREL has had many computing "islands" form, where individuals or small groups have worked on computing tasks and advanced their own projects. Sometimes people in that situation feel isolated, or feel that they must solve every computational challenge on their own. These islands form out of isolation, when people are unaware of others who may be doing similar work, and they solidify over time as people get busy and focus on the work. We all face this -- sometimes it feels necessary to keep our heads down and get the job done rather than put the time and effort (some of which is seen as lost productivity) into forging relationships with other people who are doing similar, but separate, work. These meetings are an effort to help facilitate the conversation, so that we can all learn from one another.
This is not an attempt to homogenize -- that is the wrong goal. Instead, the intent is to leverage good ideas and techniques when and where it is best for the science and developer, and to help foster a community where we can all bounce ideas around to find the best solutions.
Trying to squeeze more direct sciency work out of a computer.
Also characterized by having the thought, "Gee, I wish I had a bigger computer."