cookbook 'slurm', '~> 1.5.1'
slurm
Installs/Configures slurm workload manager
cookbook 'slurm', '~> 1.5.1', :supermarket
knife supermarket install slurm
knife supermarket download slurm
slurm
Wrapper cookbook that can prepare a full slurm cluster: controller, compute and accounting nodes
Requirements
Requires the following cookbooks:
- mariadb ~> 2.0
- shifter ~> 1.0
Platforms
The following platforms are supported:
- Ubuntu 18.04
- Debian 9
Other Debian family distributions are assumed to work, as long as the slurm version in the distribution's package tree
is at least 17.02, because of the hostname behaviour of slurmdbd.
Chef
- Chef 14.0+
TODO
- Support for RHEL family
- Make cgroup.conf file dynamic
- Add recipe to set up a dynamic resource allocation cluster
- Install slurm from static stable sources, e.g. 17.11-latest, 18.08-latest
- Refactor and remove code that can be used as a resource instead of a recipe
- Remove static types of nodes and partitions and support dynamic generation, maybe by passing the Hash directly
- Complete spec files
Usage
Check the .kitchen.yml file for the run_list, which can be applied with:
$ kitchen converge [debian|ubuntu|all]
The use case for this run_list is to set up a monolith that contains all of the slurm components.
Recipes
slurm::_disable_ipv6
- Disable ipv6 on a Linux system.
slurm::_systemd_daemon_reload
- Makes it possible to force a daemon-reload on systemd, in order to refresh service unit files.
slurm::accounting
- Installs and configures slurmdbd, Slurm's accounting service.
slurm::cluster
- TODO: sets up a dynamic resource allocation cluster.
slurm::compute
- Installs and configures slurmd, Slurm's compute service.
slurm::database
- Installs and configures a MariaDB service.
slurm::default
- Sets up slurm user and group
- Installs packages common to all of Slurm's services.
slurm::munge
- Sets up munge user and group
- Installs and configures munge authentication service.
slurm::plugin_shifter
- Sets up shifter plugin for slurm.
slurm::server
- Installs and configures slurmctld, Slurm's controller service.
This is where the common configuration file shared between the slurmctld and slurmd services is generated.
Take a close look at the attributes below.
Attributes
The attributes are presented here in order of importance for assembling a whole infrastructure.
Common
# ========================= Data bag configuration =========================
default['slurm']['secret']['secrets_data_bag'] # The name of the encrypted data bag that stores Slurm secrets, such as the munge key
default['slurm']['secret']['service_passwords_data_bag'] # The name of the encrypted data bag that stores service user passwords, with
# each key in the data bag corresponding to a named Slurm service, like
# "slurmdbd", "slurmctld", "slurmd" (this may not be needed for slurm).
default['slurm']['secret']['db_passwords_data_bag'] # The name of the encrypted data bag that stores database passwords, with
# each key in the data bag corresponding to a named Slurm database, like
# "slurmdbd", "slurmctld", "slurmd"
default['slurm']['secret']['user_passwords_data_bag'] # The name of the encrypted data bag that stores general user passwords, with
# each key in the data bag corresponding to a user (this may not be needed for slurm).
# ========================= Slurm specific configuration =========================
default['slurm']['common']['conf_dir'] # slurm configuration directory, usually '/etc/slurm-llnl'
default['slurm']['custom_template_banner'] # String that is prepended to each slurm configuration file
default['slurm']['user'] # username to configure slurm as, usually 'slurm'
default['slurm']['group'] # group to configure slurm as, usually 'slurm'
default['slurm']['uid'] # Slurm user ID, common to all nodes; our default is 999, just below the regular user ID range
default['slurm']['gid'] # Slurm group ID, common to all nodes; our default is 999, just below the regular group ID range
default['proxy']['http'] # proxy address for use with apt, mariadb, and system environment
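As a rough illustration, a wrapper cookbook's attributes file might set the common attributes as sketched below. The values are simply the "usual" ones quoted in the comments above, and the banner text is made up. The first line is a plain-Ruby stand-in for Chef's attribute DSL (an assumption, so that the snippet runs outside Chef); leave it out of a real attributes file.

```ruby
# Stand-in for Chef's `default` attribute object (an assumption, so this runs as plain Ruby).
# Chef provides `default` in a real attributes file; remove this line there.
default = Hash.new { |hash, key| hash[key] = Hash.new(&hash.default_proc) }

# Common Slurm settings, using the "usual" values quoted above.
default['slurm']['common']['conf_dir'] = '/etc/slurm-llnl'
default['slurm']['user']  = 'slurm'
default['slurm']['group'] = 'slurm'
default['slurm']['uid']   = 999
default['slurm']['gid']   = 999

# Hypothetical banner text; any string works here.
default['slurm']['custom_template_banner'] = '# Managed by Chef, local changes will be overwritten'

puts default['slurm']['common']['conf_dir']
```

Keeping the uid and gid identical on every node matters because, as noted above, they are meant to be common to all nodes.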
Munge
default['slurm']['munge']['key'] # munge key location
default['slurm']['munge']['env_file'] # munge environment file, to be used by systemd
default['slurm']['munge']['auth_socket'] # munge communication socket location
default['slurm']['munge']['user'] # username to configure munge as, usually 'munge'
default['slurm']['munge']['group'] # group name to configure munge as, usually 'munge'
default['slurm']['munge']['uid'] # MUNGE user ID, common to all nodes, our default is 998, just before Slurm's
default['slurm']['munge']['gid'] # MUNGE group ID, common to all nodes, our default is 998, just before Slurm's
Monolith
default['slurm']['control_machine'] # fqdn of the machine where slurmctld is running
default['slurm']['nfs_apps_server'] # fqdn of the machine where the apps directory is made available through nfs
default['slurm']['nfs_homes_server'] # fqdn of the machine where the home directory is made available through nfs
default['slurm']['apps_dir'] # path to the apps directory
default['slurm']['homes_dir'] # path to the home directory
default['slurm']['monolith_testing'] # tells the cookbook if the setup should be that of a monolith or not, usually for testing, either true or false
Database
default['mysql']['bind_address'] # IP address on which the MariaDB server listens for connections, defaults to '0.0.0.0'
default['mysql']['port'] # port on which the MariaDB server listens for connections, defaults to '3306'
default['mysql']['version'] # MariaDB version lock, defaults to '10.1'
default['mysql']['character-set-server'] # database character set, defaults to 'utf8'
default['mysql']['collation-server'] # database collation, defaults to 'utf8_general_ci'
default['mysql']['user']['slurm'] # user which slurm accounting service uses to connect to the database
Accounting
default['slurm']['accounting']['conf_file'] # path to the slurmdbd configuration file, defaults to '/etc/slurm-llnl/slurmdbd.conf'
default['slurm']['accounting']['env_file'] # path to the slurmdbd environment file location, defaults to '/etc/default/slurmdbd'
default['slurm']['accounting']['bin_file'] # path to the slurmdbd binary, defaults to '/usr/sbin/slurmdbd'
default['slurm']['accounting']['pid_file'] # path to the slurmdbd pid file, defaults to '/var/run/slurm-llnl/slurmdbd.pid'
default['slurm']['accounting']['systemd_file'] # path to the slurmdbd systemd service unit file, defaults to '/lib/systemd/system/slurmdbd.service'
default['slurm']['accounting']['debug'] # debug level, valid values from 0-7, defaults to '3'
default['slurm']['accounting']['conf'] # Hash representing the slurmdbd configuration options
The default for ['slurm']['accounting']['conf'] is:
{
AuthType: 'auth/munge',
AuthInfo: node['slurm']['munge']['auth_socket'],
DbdHost: node['hostname'],
DebugLevel: node['slurm']['accounting']['debug'],
LogFile: '/var/log/slurm-llnl/slurmdbd.log', # default is syslog
MessageTimeout: '10',
PidFile: node['slurm']['accounting']['pid_file'],
SlurmUser: node['mysql']['user']['slurm'],
StorageHost: node['hostname'],
StorageLoc: 'slurm_acct_db',
StoragePort: node['mysql']['port'],
StorageType: 'accounting_storage/mysql',
StorageUser: node['mysql']['user']['slurm'],
}
Take into account that when you override ['slurm']['accounting']['conf'], you override all of its options.
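One way to change a single option without losing the rest is to merge your changes into a copy of the defaults. The sketch below uses plain Ruby hashes with an abbreviated, illustrative set of values (it is not the cookbook's own code) to contrast replacing the hash with merging into it.

```ruby
# Sample defaults, abbreviated from the hash above; values are illustrative only.
defaults = {
  AuthType: 'auth/munge',
  DebugLevel: '3',
  MessageTimeout: '10',
  StorageLoc: 'slurm_acct_db',
}

# Replacing the hash outright drops every option that is not restated.
replaced = { DebugLevel: '5' }

# Merging keeps the untouched defaults and changes only the listed keys.
merged = defaults.merge(DebugLevel: '5')

puts replaced.keys.inspect # => [:DebugLevel]
puts merged.inspect
```

`Hash#merge` returns a new hash, so `defaults` itself is left unchanged.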
Server
default['slurm']['cluster']['name'] # Name for the cluster, defaults to 'slurm-test'
default['slurm']['server']['conf_file'] # path to the slurmctld and slurmd configuration file, defaults to '/etc/slurm-llnl/slurm.conf'
default['slurm']['server']['env_file'] # path to the slurmctld environment file, defaults to '/etc/default/slurmctld'
default['slurm']['server']['bin_file'] # path to the slurmctld binary file, defaults to '/usr/sbin/slurmctld'
default['slurm']['server']['pid_file'] # path to the slurmctld pid file, defaults to '/var/run/slurm-llnl/slurmctld.pid'
default['slurm']['server']['systemd_file'] # path to the slurmctld systemd service unit file, defaults to '/lib/systemd/system/slurmctld.service'
default['slurm']['server']['service_req'] # name of the storage service(s) that the slurm service should depend on to start
# this should be either empty or the name of the storage service client(s) that slurm might depend on (ceph, beegfs, lustre)
default['slurm']['server']['cgroup_dir'] # path to the cgroup plugin directory, defaults to '/etc/slurm-llnl/cgroup'
default['slurm']['server']['cgroup_conf_file'] # path to the cgroup configuration file, defaults to '/etc/slurm-llnl/cgroup.conf'
default['slurm']['server']['plugstack_dir'] # path to the slurm plugin directory, defaults to '/etc/slurm-llnl/plugstack.conf.d'
default['slurm']['server']['plugstack_conf_file'] # path to the slurm plugin configuration file, defaults to '/etc/slurm-llnl/plugstack.conf'
default['slurm']['shifter'] # Boolean, if true shifter will be installed
default['shifter']['imagegw'] # Boolean, if true the shifter image gateway will be installed and configured (assumes default['slurm']['shifter'] == true)
default['shifter']['imagegw_fqdn'] # String, Image Gateway FQDN, accessible hostname or ip address, defaults to node['slurm']['control_machine']
default['shifter']['siteenv_append'] # String, Environment Variable Append control, defaults to 'PATH=/opt/udiImage/bin'
Compute nodes
In the computes.rb attribute file you can see an example for the various slurm cluster settings.
For now we assume three types of partitions (and nodes):
- small
- medium
- large
representing the capacity (memory) for each group. The nodes in each group are assumed to be homogeneous.
Each group's properties can be passed via the following attributes:
default['slurm']['conf']['nodes'][type]['count']
default['slurm']['conf']['nodes'][type]['properties']['cpus'] # amount of CPUs available in the node group, Integer
default['slurm']['conf']['nodes'][type]['properties']['mem'] # amount of RAM available in the node group, Megabytes
default['slurm']['conf']['nodes'][type]['properties']['sockets'] # number of sockets in node group, on private cloud systems it is usually the number of cpus
default['slurm']['conf']['nodes'][type]['properties']['cores_per_socket'] # number of cores per socket, on private cloud systems it is usually one
default['slurm']['conf']['nodes'][type]['properties']['threads_per_core'] # number of threads per core, on private cloud systems it is usually one
default['slurm']['conf']['nodes'][type]['properties']['weight'] # scheduling preference for the node group; the lower the weight, the higher the preference
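To make the shape of these attributes concrete, the sketch below builds a nodes hash for the three groups and renders one slurm.conf-style NodeName line per group. The counts and sizes are made-up sample values, the `<type>[1-N]` node naming is an assumption, and the rendering is only an illustration of how such a hash could map onto slurm.conf node parameters, not the cookbook's actual template.

```ruby
# Hypothetical node groups; counts and sizes are made-up sample values.
nodes = {
  'small'  => { 'count' => 2, 'properties' => { 'cpus' => 2, 'mem' => 4096,  'sockets' => 2, 'cores_per_socket' => 1, 'threads_per_core' => 1, 'weight' => 10 } },
  'medium' => { 'count' => 2, 'properties' => { 'cpus' => 4, 'mem' => 8192,  'sockets' => 4, 'cores_per_socket' => 1, 'threads_per_core' => 1, 'weight' => 20 } },
  'large'  => { 'count' => 1, 'properties' => { 'cpus' => 8, 'mem' => 16384, 'sockets' => 8, 'cores_per_socket' => 1, 'threads_per_core' => 1, 'weight' => 30 } },
}

# Render one slurm.conf-style NodeName line for a group.
# The "<type>[1-N]" naming scheme is an assumption, not taken from the cookbook.
def node_line(type, group)
  props = group['properties']
  names = group['count'] == 1 ? "#{type}1" : "#{type}[1-#{group['count']}]"
  "NodeName=#{names} CPUs=#{props['cpus']} RealMemory=#{props['mem']} " \
    "Sockets=#{props['sockets']} CoresPerSocket=#{props['cores_per_socket']} " \
    "ThreadsPerCore=#{props['threads_per_core']} Weight=#{props['weight']}"
end

lines = nodes.map { |type, group| node_line(type, group) }
puts lines
```

In this sample the small group has the lowest weight, so it would be preferred for allocation, and each group's CPU count equals sockets × cores per socket × threads per core.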
At this time, this cookbook is designed to work either as a monolith (PoC) or to be deployed in a private cloud environment.
Data Bags
From the previous section we can see which data bags are required to exist. Each of the items must have a key with the same name as the data bag, where the secret value should be stored.
Within those data bags we have to create the following items:
DataBag | Item | Keys |
---|---|---|
slurm_db_passwords | mysqlroot | --- |
slurm_db_passwords | node['mysql']['user']['slurm'] | --- |
slurm_secrets | munge | --- |
Any of the slurm_db_passwords items should be text passwords, generated with your favorite tool.
The munge key should be a base64-encoded key, based on binary data generated by running any of the following:
- $ create-munge-key -r
  on a system with munge installed (note that it will try to overwrite any existing key in /etc/munge/munge.key)
- $ dd if=/dev/random bs=1 count=1024 > munge.key
- $ dd if=/dev/urandom bs=1 count=1024 > munge.key
For more information on generating a munge key see the munge documentation.
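If no shell tooling is at hand, an equivalent key can be sketched in plain Ruby: generate 1024 random bytes (the same size the dd commands produce) and base64-encode them. Whether the cookbook expects exactly this encoding, with no line breaks, is an assumption; check it against your setup before storing the value in the data bag.

```ruby
require 'securerandom'

# 1024 random bytes, the same size as the dd commands above produce.
raw_key = SecureRandom.random_bytes(1024)

# Base64-encode without line breaks so the value fits in a single string.
encoded_key = [raw_key].pack('m0')

# Decoding must give back the original bytes.
decoded_key = encoded_key.unpack1('m0')

puts "decoded bytes: #{decoded_key.bytesize}"
puts "round trip ok: #{decoded_key == raw_key}"
```

Unlike create-munge-key, this does not touch /etc/munge/munge.key; it only produces the encoded string.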
Authors
- Manuel Torrinha manuel.torrinha@tecnico.ulisboa.pt
Dependent cookbooks
- mariadb ~> 2.0
- shifter ~> 1.0
Contingent cookbooks
There are no cookbooks that are contingent upon this one.
Collaborator Number Metric
1.5.1 passed this metric
Contributing File Metric
1.5.1 failed this metric
Failure: To pass this metric, your cookbook metadata must include a source url, the source url must be in the form of https://github.com/user/repo, and your repo must contain a CONTRIBUTING.md file
Foodcritic Metric
1.5.1 passed this metric
No Binaries Metric
1.5.1 passed this metric
Testing File Metric
1.5.1 failed this metric
Failure: To pass this metric, your cookbook metadata must include a source url, the source url must be in the form of https://github.com/user/repo, and your repo must contain a TESTING.md file
Version Tag Metric
1.5.1 failed this metric
Failure: To pass this metric, your cookbook metadata must include a source url, the source url must be in the form of https://github.com/user/repo, and your repo must include a tag that matches this cookbook version number