Author: | Capit Nicolas |
---|---|
Address: | Laboratoire Informatique et Distribution (ID)-IMAG ENSIMAG - antenne de Montbonnot ZIRST 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN |
Contact: | |
Authors: | ID laboratory |
Organization: | ID laboratory |
Status: | This is a "work in progress" |
License: | GNU GENERAL PUBLIC LICENSE |
Dedication: | For users, administrators and developers. |
Abstract: | OAR is a resource manager (or batch scheduler) for large clusters. Its functionalities are close to those of PBS, LSF, CCS and Condor. It is suitable both for production platforms and research experiments. |
BE CAREFUL: THIS DOCUMENTATION IS FOR OAR 2.0
OAR is an open-source batch scheduler which provides simple and flexible exploitation of a cluster.
It manages cluster resources like a traditional batch scheduler (such as PBS / Torque / LSF / SGE).
It is flexible enough to be suitable for production clusters and research experiments. It currently manages more than 5000 nodes and has executed more than 5 million jobs.
There are three kinds of nodes, each requiring a specific software configuration.
These are :
- the server node, which will hold all of OAR "smartness" ;
- the login nodes, on which you will be allowed to login, then reserve some computational nodes ;
- the computational nodes (a.k.a. the nodes), on which the jobs will run.
On every node, the "sentinelle" binary can be installed (it is not necessary). This tool is used to launch commands on several computers, at a same time. It ships with the current OAR package (but it is also available separately in Taktuk in an Open Source License).
On every node (server, login, computational), the following packages must be installed:
- sudo
- Perl
- Perl-base
- openssh (server and client)
On the OAR server and on the login nodes, the following packages must be installed:
- Perl-Mysql
- Perl-DBI
- MySQL
- MySQL-shared
- libmysql
From now on, we will suppose that all the Perl/MySQL packages are correctly installed and configured and that the MySQL database is started.
The following steps have to be done, prior to installing OAR:
add a user named "oar" in the group "oar" on every node
let the user "oar" connect through ssh from any node to any node WITHOUT password. To achieve this, here is some standard procedure for OpenSSH:
create a set of ssh keys for the user "oar" with ssh-keygen (for instance 'id_dsa.pub' and 'id_dsa')
copy these keys on each node of the cluster in the ".ssh" folder of the user "oar"
append the contents of 'id_dsa.pub' to the file "~/.ssh/authorized_keys"
in "~/.ssh/config" add the lines:
    Host *
        ForwardX11 no
        StrictHostKeyChecking no
        PasswordAuthentication no
        AddressFamily inet

test the ssh connection between (every) two nodes: there should not be any prompt.
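A minimal sketch of this key setup, run as the oar user (the node list file "nodes.txt" and the host name "node1" below are only placeholders):

    # generate a passwordless key pair for the oar user
    ssh-keygen -t dsa -N "" -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

    # copy the whole ~/.ssh directory to every node (nodes.txt is a placeholder)
    for host in $(cat nodes.txt); do
        scp -r ~/.ssh oar@$host:
    done

    # check: no password and no prompt should appear
    ssh oar@node1 hostname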
grant the user "oar" the permission to execute commands with root privileges. To achieve that, OAR makes use of sudo. As a consequence, /etc/sudoers must be configured. Please use visudo to add the following lines:
    Defaults>oar env_reset,env_keep = "OARLIB OARUSER OARDIR PWD PERL5LIB DISPLAY OARCONFFILE"
    Cmnd_Alias OARCMD = /usr/lib/oar/oarnodes, /usr/lib/oar/oarstat,\
        /usr/lib/oar/oarsub, /usr/lib/oar/oardel, /usr/lib/oar/oarhold,\
        /usr/lib/oar/oarnotify, /usr/lib/oar/oarresume
    %oar ALL=(oar) NOPASSWD: OARCMD
    oar ALL=(ALL) NOPASSWD:ALL
    Defaults:www-data env_keep += "SCRIPT_NAME SERVER_NAME SERVER_ADMIN HTTP_CONNECTION REQUEST_METHOD CONTENT_LENGTH SCRIPT_FILENAME SERVER_SOFTWARE HTTP_TE QUERY_STRING REMOTE_PORT HTTP_USER_AGENT SERVER_PORT SERVER_SIGNATURE REMOTE_ADDR CONTENT_TYPE SERVER_PROTOCOL PATH REQUEST_URI GATEWAY_INTERFACE SERVER_ADDR DOCUMENT_ROOT HTTP_HOST"
    www-data ALL=(oar) NOPASSWD: /usr/lib/oar/oar-cgi
There are three different flavors of installation:
- server: install the daemon which must be running on the server
- user: install all the tools needed to submit and manage jobs for the users (oarsub, oarstat, oarnodes, ...)
- node: install the tools for a computing node
The installation is straightforward:
become root
go to OAR source repository
You can set Makefile variables in the command line to suit your configuration (change "OARHOMEDIR" to the home of your user oar and "PREFIX" where you want to copy all OAR files).
- run make <module> [module] ...
- where module := { server-install | user-install | node-install | doc-install | debian-package }
- and OPTIONS := { OARHOMEDIR | OARCONFDIR | OARUSER | PREFIX | MANDIR | OARDIR | BINDIR | SBINDIR | DOCDIR }
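For example, a typical server installation might look like this (the paths are only an illustration, adapt them to your site):

    # as root, from the OAR source directory
    make server-install user-install OARHOMEDIR=/var/lib/oar PREFIX=/usr/local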
Edit /etc/oar.conf file to match your cluster configuration.
Make sure that the PATH environment variable contains $PREFIX/$BINDIR of your installation (default is /usr/local/bin).
Initialization of OAR database (MySQL) is achieved using oar_mysql_db_init script provided with the server module installation and located in $PREFIX/sbin (/usr/local/sbin in default Makefile).
If you want to use a PostgreSQL server then there is currently no automatic installation script. You have to add a new user which can connect to a new oar database (use the commands createdb and createuser). After that, you have to authorize network connections to the PostgreSQL server in postgresql.conf (uncomment tcpip_socket = true). Moreover, a line like

    host oar oar X.X.X.X/Y md5

must be added to the pg_hba.conf file to enable the oar user to connect to the database.
Then you can import the database scheme stored in oar_postgres.sql (use the SQL command "\i").
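A hedged sketch of these PostgreSQL steps (run on the database host; keep your own network mask, and the schema file path may differ on your installation):

    # as the postgres superuser
    createuser --no-superuser --no-createdb --no-createrole --pwprompt oar
    createdb -O oar oar

    # in pg_hba.conf, allow the oar user from your network (keep your own X.X.X.X/Y):
    #   host oar oar X.X.X.X/Y md5

    # import the OAR schema (equivalent to "\i oar_postgres.sql" inside psql)
    psql -U oar -d oar -f oar_postgres.sql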
For more information about PostgreSQL, go to http://www.postgresql.org/.
Note: The same machine may host several or even all modules.
Note about X11: The easiest and most scalable way to use X11 applications on cluster nodes is to open the X11 ports and set the right DISPLAY environment variable by hand. Otherwise users can use X11 forwarding via ssh to access the cluster frontend. In that case you must configure the ssh server on this frontend with:
    X11Forwarding yes
    X11UseLocalhost no
With this configuration, users can launch X11 applications after an 'oarsub -I' on the given node.
"oarsh" and "oarsh_shell" are two scripts that can restrict user processes to stay in the same cpuset on all nodes.
This feature is very usefull to restrict processor consumption on multiprocessors computers and to kill all processes of a same OAR job on several nodes.
CPUSET is a module integrated in the Linux kernel since 2.6.x. In the kernel documentation, you can read:
Cpusets provide a mechanism for assigning a set of CPUs and Memory Nodes to a set of tasks. Cpusets constrain the CPU and Memory placement of tasks to only the resources within a tasks current cpuset. They form a nested hierarchy visible in a virtual file system. These are the essential hooks, beyond what is already present, required to manage dynamic job placement on large systems.

Each task has a pointer to a cpuset. Multiple tasks may reference the same cpuset. Requests by a task, using the sched_setaffinity(2) system call to include CPUs in its CPU affinity mask, and using the mbind(2) and set_mempolicy(2) system calls to include Memory Nodes in its memory policy, are both filtered through that tasks cpuset, filtering out any CPUs or Memory Nodes not in that cpuset. The scheduler will not schedule a task on a CPU that is not allowed in its cpus_allowed vector, and the kernel page allocator will not allocate a page on a node that is not allowed in the requesting tasks mems_allowed vector.

If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct ancestor or descendent, may share any of the same CPUs or Memory Nodes. A cpuset that is cpu exclusive has a sched domain associated with it. The sched domain consists of all cpus in the current cpuset that are not part of any exclusive child cpusets. This ensures that the scheduler load balancing code only balances against the cpus that are in the sched domain as defined above and not all of the cpus in the system. This removes any overhead due to load balancing code trying to pull tasks outside of the cpu exclusive cpuset only to be prevented by the tasks' cpus_allowed mask.

A cpuset that is mem_exclusive restricts kernel allocations for page, buffer and other data commonly shared by the kernel across multiple users. All cpusets, whether mem_exclusive or not, restrict allocations of memory for user space. This enables configuring a system so that several independent jobs can share common kernel data, such as file system pages, while isolating each jobs user allocation in its own cpuset. To do this, construct a large mem_exclusive cpuset to hold all the jobs, and construct child, non-mem_exclusive cpusets for each individual job. Only a small amount of typical kernel memory, such as requests from interrupt handlers, is allowed to be taken outside even a mem_exclusive cpuset.

User level code may create and destroy cpusets by name in the cpuset virtual file system, manage the attributes and permissions of these cpusets and which CPUs and Memory Nodes are assigned to each cpuset, specify and query to which cpuset a task is assigned, and list the task pids assigned to a cpuset.
"oarsh" is a wrapper around the "ssh" command (tested with openSSH). Its goal is to propagate two environment variables:
- OAR_CPUSET : The name of the OAR job cpuset
- SUDO_USER : The name of the user who has launched oarsh command
So "oarsh" must be run by oar and a simple user must run it via the "sudowrapper" script to become oar. In this way each cluster user who can execute "oarsh" via "sudowrapper" can connect himself on each cluster nodes (if oarsh is installed everywhere).
"oarsh_shell" must be the shell of the oar user on each nodes where you want oarsh to worked. This script takes "OAR_CPUSET" and "SUDO_USER" environment variables and adds its PID in OAR_CPUSET cpuset. Then it searchs user shell and home and executes the right command (like ssh).
On each node you must add in the SSH server configuration file:
    AcceptEnv OAR_CPUSET SUDO_USER

In Debian the file is "/etc/ssh/sshd_config".
You can use scp with oarsh. The syntax is:
    scp -S /path/to/oarsh ...

You can restrict the use of oarsh with the sudo configuration:

    %oarsh ALL=(oar) NOPASSWD: /path/to/oarsh

Here only users from the oarsh group can execute oarsh.
There are two different tools. One, named Monika, displays the current cluster state with all active and waiting jobs. The other, named drawgantt, displays node occupation over a period of time. These tools are CGI scripts and generate HTML pages.
drawgantt:
- Make sure you installed "ruby", "libdbd-mysql-ruby" or "libdbd-pg-ruby" and "libgd-ruby1.8" packages.
- Copy "drawgantt.cgi" and "drawgantt.conf" in the CGI folder of your web server (ex: /usr/lib/cgi-bin/ for Debian).
- Copy all icons and javascript files into a folder where the web server can find them (ex: /var/www/oar/Icons).
- Make sure that these files can be read by the web server user.
- Edit "drawgantt.conf" and change tags to fit your configuration.
Monika:
- The package "perl-AppConfig" is required.
- Read INSTALL file in the monika repository.
OAR is also released as Debian (or Ubuntu) packages. You can find them at http://oar.imag.fr/download.html.
If you want to add it as a new source in your /etc/apt/sources.list then add the line:
deb http://oar.imag.fr/download ./
The installation will ask you if you want to initialize the nodes. It will copy the oar SSH key on each specified node. You can skip this step but you will have to do it manually.
After installing the packages, you have to edit the configuration file on the server, submission nodes and computing nodes to fit your needs.
First, you must start OAR daemon on the server (its name is "Almighty").
- if you have installed OAR from sources, become the oar user and launch the command "Almighty" (it is located in $PREFIX/sbin).
- if you have installed OAR from Debian packages, use the script "/etc/init.d/oar-server" to start the daemon.
Then you have to insert new resources into the database via the command oarnodesetting. To get an idea of how it works, launch $PREFIX/oar/detect_new_resources.sh. It will print the right commands to execute, with appropriate values for the memory and cpuset properties.
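If you prefer to register resources by hand, the commands printed by detect_new_resources.sh look roughly like the following (the host name and property values here are purely illustrative, the script output is authoritative):

    # one resource per cpu of the node "node1" (illustrative values)
    oarnodesetting -a -h node1 -p cpu=1 -p cpuset=0
    oarnodesetting -a -h node1 -p cpu=2 -p cpuset=1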
If you want to initialize your whole cluster in one command you can use the following one (tune it to fit your cluster). You must be the oar user to run this command because oarnodesetting will be called and sentinelle.pl will log onto all nodes stored in the "node_list.txt" file without a password:
    export PREFIX=/var/lib
    $PREFIX/oar/sentinelle.pl -f node_list.txt \
        -p "$PREFIX/oar/detect_new_resources.sh" | sh
Then you can launch the oarnodes command and see all new resources inserted.
For further information, please check http://oar.imag.fr/.
All user commands are installed on cluster login nodes. So you must connect to one of these computers first.
This command prints the jobs currently in execution on the terminal.
Options
- -f : prints each job with full details
- -j job_id : prints information about the specified job_id (even if it is finished)
- --sql "sql where" : restricts the display with the SQL WHERE clause on the table jobs
- -g "d1,d2" : prints the history of jobs and the state of resources between the two dates
- -D : formats the output with Perl Dumper
- -X : formats the output in XML
- -Y : formats the output in YAML
Examples
    # oarstat
    # oarstat -j 42 -f
    # oarstat --sql "project = 'p1'"
This command prints information about the cluster resources (state, which jobs run on which resources, resource properties, ...).
Options
- -a : shows all resources with their properties
- -r : shows only the properties of one resource
- -s : shows only resource states
- -l : shows only the resource list
- --sql "sql where" : displays the resources which match this SQL WHERE clause
- -D : formats the output with Perl Dumper
- -X : formats the output in XML
- -Y : formats the output in YAML
Examples
    # oarnodes
    # oarnodes -s
    # oarnodes --sql "state = 'Suspected'"
The user can submit a job with this command. So, what is a job in our context?
A job is defined by the needed resources and a script/program to run. The user must specify how many resources, and of what kind, are needed by his application. OAR will then give him these resources (or not) and will control the execution. When a job is launched, OAR executes the user program only on the first reserved node. This program can access several environment variables describing its environment:
- $OAR_NODEFILE : contains the name of a file which lists all reserved nodes for this job
- $OAR_JOB_ID : contains the OAR job identifier
- $OAR_RESOURCE_PROPERTIES_FILE : contains the name of a file which lists all the resources of the job and their properties
- $OAR_JOB_NAME : name of the job given with the "-n" option
- $OAR_PROJECT_NAME : job project name
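As an illustration only, a submitted script could use these variables as follows (mpirun and the program name are placeholders, not something OAR imposes):

    #!/bin/sh
    # this script runs on the first reserved node
    echo "Job $OAR_JOB_ID running on the following nodes:"
    cat $OAR_NODEFILE
    # start a parallel program on every reserved node (illustrative)
    mpirun -machinefile $OAR_NODEFILE ./my_program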
Options:
-q "queuename" : specify the queue for this job -I : turn on INTERACTIVE mode (OAR gives you a shell instead of executing a script) -l "resource description" : defines resource list requested for this job; the different parameters are resource properties registered in OAR database; see examples below. (walltime : Request maximun time. Format is [hour:mn:sec|hour:mn|hour]; after this elapsed time, the job will be killed) -p "properties" : adds constraints for the job (format is a WHERE clause from the SQL syntax) -S, --Scanscript : in batch mode, asks oarsub to scan the given script for OAR directives (#OAR -l ...) -r "2007-05-11 23:32:03" : asks for a reservation job to begin at the date in argument -C job_id : connects to a reservation in Running state -k "duration" : asks OAR to send the checkpoint signal to the first processus of the job "number_of_seconds" before the walltime --signal "signal name" : specify the signal to use when checkpointing -t "type name" : specify a specific type (deploy, besteffort, cosystem, checkpoint) -d "directory path" : specify the directory where to launch the command (default is current directory) -n "job name" : specify an arbitrary name for the job -a job_id : anterior job that must be terminated to start this new one --project : Specify the name of the project corresponding to the job. --notify "method" : specify a notification method(mail or command); ex: --notify "mail:name@domain.com" --notify "exec:/path/to/script args" --stdout "file name" : specify the name of the standard output file --stderr "file name" : specify the name of the error output file --resubmit job_id : resubmit the given job to a new one --force_cpuset_name "cpuset name" : Instead of using job_id for the cpuset name you can specify one (WARNING: if several jobs have the same cpuset name then processes of a job could be killed when another finished on the same computer) --hold : Set the job state into Hold instead of Waiting; so it is not scheduled (you must run "oarresume" to turn it into the Waiting state). -D : Print result in DUMPER format. -Y : Print result in XML format. -X : Print result in YAML format.
Wanted resources have to be described in a hierarchical manner (this is the "-l" syntax option).
Moreover it is possible to give a property that they must match.
So the long and complete syntax is of the form:
"{ sql1 }/prop1=1/prop2=3+{sql2}/prop3=2/prop4=1/prop5=1+...,walltime=1:00:00"
Here we want to reserve 3 resources with the same value of property prop2 and the same property prop1, and these resources must match sql1. To these resources we want to add 2 others which match sql2 and the hierarchy /prop3=2/prop4=1/prop5=1.
Examples
# oarsub -l /node=4 test.sh
(the "test.sh" script will be run on 4 entire nodes in the default queue with the default walltime)
    # oarsub -q default -l walltime=50:30:00,/node=10/cpu=3,walltime=2:15:00 \
        -p "switch = 'sw1'" /home/users/toto/prog
(the "/home/users/toto/prog" script will be run on 10 nodes with 3 cpus (so a total of 30 cpus) in the default queue with a walltime of 2:15:00. Moreover "-p" option restricts resources only on the switch 'sw1')
# oarsub -r "2004-04-27 11:00:00" -l /node=12/cpu=2
(a reservation will begin at "2004-04-27 11:00:00" on 12 nodes with 2 cpus on each one)
# oarsub -C 42
(connects to the job 42 on the first node and sets all OAR environment variables)
# oarsub -I
(gives a shell on a resource)
This command is used to delete or checkpoint job(s). Jobs are designated by their identifiers.
Option
- --sql : deletes/checkpoints the jobs which match the SQL WHERE clause on the table jobs (ex: "project = 'p1'")
- -c job_id : sends the checkpoint signal to the job (the signal was defined with the "--signal" option of oarsub)
Examples
# oardel 14 42
(delete jobs 14 and 42)
# oardel -c 42
(send checkpoint signal to the job 42)
This command is used to remove a job from the scheduling queue if it is in the "Waiting" state.
Moreover if its state is "Running" oarhold can suspend the execution and enable other jobs to use its resources. In that way, a SIGINT signal is sent to every processes.
Options
- --sql : holds the jobs which match the SQL WHERE clause on the table jobs (ex: "project = 'p1'")
- -r : manages not only Waiting jobs but also Running ones (can suspend the job)
This command resumes jobs in the Hold or Suspended states.
Option
- --sql : resumes the jobs which match the SQL WHERE clause on the table jobs (ex: "project = 'p1'")
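Examples (the job id 42 is illustrative; the last command puts the job back into the Waiting state or resumes it):

    # oarhold 42
    # oarhold -r 42
    # oarresume 42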
This is a web CGI normally installed on the cluster frontend. This tool executes oarnodes and oarstat and then formats the data into an HTML page.
Thus you can have a global view of cluster state and where your jobs are running.
This is also a web CGI. It creates a Gantt chart which shows the job distribution on the nodes over time. It is very useful to see the cluster occupation in the past and to know when a job will be launched in the future.
This command manages OAR resource properties stored in the database.
Options are:
- -l : lists the properties
- -a NAME : adds a property
- -c : creates the new SQL field with type VARCHAR(255) (default is integer)
- -d NAME : deletes a property
- -r "OLD_NAME,NEW_NAME" : renames property OLD_NAME into NEW_NAME
Examples:
    # oarproperty -a cpu_freq
    # oarproperty -a type
    # oarproperty -r "cpu_freq,freq"
This command permits changing the state or a property of a node or of several resources.
By default the node name used by oarnodesetting is the result of the command hostname.
Options are:
- -a : adds a new resource
- -s : state to assign to the node:
  * "Alive" : a job can be run on the node.
  * "Absent" : the administrator wants to remove the node from the pool for a while.
  * "Dead" : the node will not be used anymore and will be deleted.
- -h : specifies the node name (overrides hostname).
- -r : specifies the resource number
- --sql : gets the resource identifiers which match the SQL WHERE clause on the table resources (ex: "type = 'default'")
- -p : changes the value of a property on the specified resources.
- -n : specify this option if you do not want to wait for the end of the jobs running on this node when you change its state to "Absent" or "Dead".
This command permits removing a resource from the database.
The resource must be in the "Dead" state (use oarnodesetting to achieve this); then you can use this command to delete it.
This command updates the accounting table for the jobs that ended since the last run.
This command sends commands to the Almighty module and manages scheduling queues.
Options are:
- Almighty_tag : sends this tag to the Almighty (default is TERM)
- -e : activates an existing queue
- -d : deactivates an existing queue
- -E : activates all queues
- -D : deactivates all queues
- --add_queue : adds a new queue; the syntax is name,priority,scheduler (ex: "name,3,oar_sched_gantt_with_timesharing")
- --remove_queue : removes an existing queue
- -l : lists all queues and their status
- -h : shows the help screen
- -v : prints the OAR version number
Note: all dates and durations are stored as integers (number of seconds since the Epoch).
Fields | Types | Descriptions |
---|---|---|
window_start | INT UNSIGNED | start date of the accounting interval |
window_stop | INT UNSIGNED | stop date of the accounting interval |
accounting_user | VARCHAR(20) | user name |
accounting_project | VARCHAR(255) | name of the related project |
queue_name | VARCHAR(100) | queue name |
consumption_type | ENUM("ASKED", "USED") | "ASKED" corresponds to the walltimes specified by the user. "USED" corresponds to the effective time used by the user. |
consumption | INT UNSIGNED | number of seconds used |
Primary key: | window_start, window_stop, accounting_user, queue_name, accounting_project, consumption_type |
---|---|
Index fields: | window_start, window_stop, accounting_user, queue_name, accounting_project, consumption_type |
This table is a summary of the consumption for each user on each queue. This increases the speed of queries about user consumptions and statistic generation.
Data are inserted through the command oaraccounting (when a job is processed, the field accounted in the table jobs is set to "YES"). So it is possible to regenerate this table completely in the following way:
Delete all data of the table:

    DELETE FROM accounting;

Set the field accounted in the table jobs to "NO" for each row:

    UPDATE jobs SET accounted = "NO";

Then run the oaraccounting command.
You can change the duration of each window: edit the OAR configuration file and change the value of the ACCOUNTING_WINDOW tag.
Fields | Types | Descriptions |
---|---|---|
id | INT UNSIGNED | id number |
rule | VARCHAR(255) | rule written in Perl applied when a job is going to be registered |
Primary key: | id |
---|---|
Index fields: | None |
You can use these rules to change the values of some properties when a job is submitted. Each admission rule is executed in the order of the id field and can set several variables. If one of them dies then the others are not evaluated and oarsub returns an error.
Some examples are better than a long description :
Specify the default value for the queue parameter:

    INSERT INTO admission_rules (rule) VALUES('
        if (not defined($queue_name)) {
            $queue_name="default";
        }
    ');

Prevent users other than oar from going into the admin queue:

    INSERT INTO admission_rules (rule) VALUES ('
        if (($queue_name eq "admin") && ($user ne "oar")) {
            die("[ADMISSION RULE] Only oar user can submit jobs in the admin queue\\n");
        }
    ');

Restrict the maximum walltime for interactive jobs:

    INSERT INTO admission_rules (rule) VALUES ('
        my $max_walltime = iolib::sql_to_duration("12:00:00");
        if ($jobType eq "INTERACTIVE"){
            foreach my $mold (@{$ref_resource_list}){
                if ( (defined($mold->[1])) and ($max_walltime < $mold->[1]) ){
                    print("[ADMISSION RULE] Walltime too big for an INTERACTIVE job so it is set to $max_walltime.\\n");
                    $mold->[1] = $max_walltime;
                }
            }
        }
    ');

Specify the default walltime:

    INSERT INTO admission_rules (rule) VALUES ('
        my $default_wall = iolib::sql_to_duration("2:00:00");
        foreach my $mold (@{$ref_resource_list}){
            if (!defined($mold->[1])){
                print("[ADMISSION RULE] Set default walltime to $default_wall.\\n");
                $mold->[1] = $default_wall;
            }
        }
    ');

How to perform actions if the user name is in a file:

    INSERT INTO admission_rules (rule) VALUES ('
        open(FILE, "/tmp/users.txt");
        while (($queue_name ne "admin") and ($_ = <FILE>)){
            if ($_ =~ m/^\\s*$user\\s*$/m){
                print("[ADMISSION RULE] Change assigned queue into admin\\n");
                $queue_name = "admin";
            }
        }
        close(FILE);
    ');
Fields | Types | Descriptions |
---|---|---|
event_id | INT UNSIGNED | event identifier |
type | VARCHAR(50) | event type |
job_id | INT UNSIGNED | job related to the event |
date | INT UNSIGNED | event date |
description | VARCHAR(255) | textual description of the event |
to_check | ENUM('YES', 'NO') | specifies whether the NodeChangeState module must check this event in order to suspect some nodes |
Primary key: | event_id |
---|---|
Index fields: | type, to_check |
The different event types are:
- "PING_CHECKER_NODE_SUSPECTED" : the system detected via the module "finaud" that a node is not responding.
- "PROLOGUE_ERROR" : an error occurred during the execution of the job prologue (exit code != 0).
- "EPILOGUE_ERROR" : an error occurred during the execution of the job epilogue (exit code != 0).
- "CANNOT_CREATE_TMP_DIRECTORY" : OAR cannot create the directory where all information files will be stored.
- "CAN_NOT_WRITE_NODE_FILE" : the system was not able to write file which had to contain the node list on the first node (/tmp/OAR_job_id).
- "CAN_NOT_WRITE_PID_FILE" : the system was not able to write the file which had to contain the pid of oarexec process on the first node (/tmp/pid_of_oarexec_for_job_id).
- "USER_SHELL" : the system was not able to get informations about the user shell on the first node.
- "EXIT_VALUE_OAREXEC" : the oarexec process terminated with an unknown exit code.
- "SEND_KILL_JOB" : signal that OAR has transmitted a kill signal to the oarexec of the specified job.
- "LEON_KILL_BIPBIP_TIMEOUT" : Leon module has detected that something wrong occurred during the kill of a job and so kill the local bipbip process.
- "EXTERMINATE_JOB" : Leon module has detected that something wrong occurred during the kill of a job and so clean the database and terminate the job artificially.
- "WORKING_DIRECTORY" : the directory from which the job was submitted does not exist on the node assigned by the system.
- "OUTPUT_FILES" : OAR cannot write the output files (stdout and stderr) in the working directory.
- "CANNOT_NOTIFY_OARSUB" : OAR cannot notify the oarsub process for an interactive job (maybe the user has killed this process).
- "WALLTIME" : the job has reached its walltime.
- "SCHEDULER_REDUCE_NB_NODES_FOR_RESERVATION" : this means that there is not enough nodes for the reservation and so the scheduler do the best and gives less nodes than the user wanted (this occurres when nodes become Suspected or Absent).
- "BESTEFFORT_KILL" : the job is of the type besteffort and was killed because a normal job wanted the nodes.
- "FRAG_JOB_REQUEST" : someone wants to delete a job.
- "CHECKPOINT" : the checkpoint signal was sent to the job.
- "CHECKPOINT_ERROR" : OAR cannot send the signal to the job.
- "CHECKPOINT_SUCCESS" : system has sent the signal correctly.
- "SERVER_EPILOGUE_TIMEOUT" : epilogue server script has time outed.
- "SERVER_EPILOGUE_EXIT_CODE_ERROR" : epilogue server script did not return 0.
- "SERVER_EPILOGUE_ERROR" : cannot find epilogue server script file.
- "SERVER_PROLOGUE_TIMEOUT" : prologue server script has time outed.
- "SERVER_PROLOGUE_EXIT_CODE_ERROR" : prologue server script did not return 0.
- "SERVER_PROLOGUE_ERROR" : cannot find prologue server script file.
- "CPUSET_CLEAN_ERROR" : OAR cannot clean correctly cpuset files for a job on the remote node.
- "MAIL_NOTIFICATION_ERROR" : a mail cannot be sent.
- "USER_MAIL_NOTIFICATION" : user mail notification cannot be performed.
- "USER_EXEC_NOTIFICATION_ERROR" : user script execution notification cannot be performed.
- "BIPBIP_BAD_JOBID" : error when retrieving informations about a running job.
- "BIPBIP_CHALLENGE" : OAR is configured to detach jobs when they are launched on compute nodes and the job return a bad challenge number.
- "RESUBMIT_JOB_AUTOMATICALLY" : the job was automatically resubmitted.
- "WALLTIME" : the job reached its walltime.
- "REDUCE_RESERVATION_WALLTIME" : the reservation job was shrunk.
- "SSH_TRANSFER_TIMEOUT" : node OAR part script was too long to transfer.
- "BAD_HASHTABLE_DUMP" : OAR transfered a bad hashtable.
- "LAUNCHING_OAREXEC_TIMEOUT" : oarexec was too long to initialize itself.
- "RESERVATION_NO_NODE" : All nodes were detected as bad for the reservation job.
Fields | Types | Descriptions |
---|---|---|
event_id | INT UNSIGNED | event identifier |
hostname | VARCHAR(255) | name of the node where the event occurred |
Primary key: | event_id |
---|---|
Index fields: | hostname |
This table stores hostnames related to events like "PING_CHECKER_NODE_SUSPECTED".
Fields | Types | Descriptions |
---|---|---|
idFile | INT UNSIGNED | |
md5sum | VARCHAR(255) | |
location | VARCHAR(255) | |
method | VARCHAR(255) | |
compression | VARCHAR(255) | |
size | INT UNSIGNED |
Primary key: | idFile |
---|---|
Index fields: | md5sum |
Fields | Types | Descriptions |
---|---|---|
frag_id_job | INT UNSIGNED | job id |
frag_date | INT UNSIGNED | kill job decision date |
frag_state | ENUM('LEON', 'TIMER_ARMED' , 'LEON_EXTERMINATE', 'FRAGGED') DEFAULT 'LEON' | state to tell Leon what to do |
Primary key: | frag_id_job |
---|---|
Index fields: | frag_state |
What do these states mean:
- "LEON" : the Leon module must try to kill the job and change the state into "TIMER_ARMED".
- "TIMER_ARMED" : the Sarko module must wait a response from the job during a timeout (default is 60s)
- "LEON_EXTERMINATE" : the Sarko module has decided that the job time outed and asked Leon to clean up the database.
- "FRAGGED" : job is fragged.
Fields | Types | Descriptions |
---|---|---|
moldable_job_id | INT UNSIGNED | moldable job id |
resource_id | INT UNSIGNED | resource assigned to the job |
Primary key: | moldable_job_id, resource_id |
---|---|
Index fields: | None |
This table specifies which resources are attributed to which jobs.
Fields | Types | Descriptions |
---|---|---|
moldable_job_id | INT UNSIGNED | moldable job id |
resource_id | INT UNSIGNED | resource assigned to the job |
Primary key: | moldable_job_id, resource_id |
---|---|
Index fields: | None |
This table is the same as gantt_jobs_resources and is used by visualisation tools. It is updated atomically (a lock is used).
Fields | Types | Descriptions |
---|---|---|
moldable_job_id | INT UNSIGNED | job id |
start_time | INT UNSIGNED | date when the job is scheduled to start |
Primary key: | moldable_job_id |
---|---|
Index fields: | None |
With this table and gantt_jobs_resources you can know exactly which decisions were taken by the scheduler for each waiting job.
note: | The special job id "0" is used to store the scheduling reference date. |
---|
Fields | Types | Descriptions |
---|---|---|
moldable_job_id | INT UNSIGNED | job id |
start_time | INT UNSIGNED | date when the job is scheduled to start |
Primary key: | job_id |
---|---|
Index fields: | None |
This table is the same as gantt_jobs_predictions and is used by visualisation tools. It is updated atomically (a lock is used).
Fields | Types | Descriptions |
---|---|---|
job_id | INT UNSIGNED | job identifier |
job_name | VARCHAR(100) | name given by the user |
cpuset_name | VARCHAR(255) | name of the cpuset directory used for this job on each node |
job_type | ENUM('INTERACTIVE', 'PASSIVE') DEFAULT 'PASSIVE' | specify if the user wants to launch a program or get an interactive shell |
info_type | VARCHAR(255) | some information about the oarsub command |
state | ENUM('Waiting','Hold', 'toLaunch', 'toError', 'toAckReservation', 'Launching', 'Running' , 'Finishing', 'Terminated', 'Error') | job state |
reservation | ENUM('None', 'toSchedule', 'Scheduled') DEFAULT 'None' | specify if the job is a reservation and the state of this one |
message | VARCHAR(255) | readable information message for the user |
job_user | VARCHAR(20) | user name |
command | TEXT | program to run |
queue_name | VARCHAR(100) | queue name |
properties | TEXT | properties that assigned nodes must match |
launching_directory | VARCHAR(255) | path of the directory where to launch the user process |
submission_time | INT UNSIGNED | date when the job was submitted |
start_time | INT UNSIGNED | date when the job was launched |
stop_time | INT UNSIGNED | date when the job was stopped |
file_id | INT UNSIGNED | |
accounted | ENUM("YES", "NO") DEFAULT "NO" | specify if the job was considered by the accounting mechanism or not |
notify | VARCHAR(255) | gives the way to notify the user about the job (mail or script ) |
assigned_moldable_job | INT UNSIGNED | moldable job chosen by the scheduler |
checkpoint | INT UNSIGNED | number of seconds before the walltime to send the checkpoint signal to the job |
checkpoint_signal | INT UNSIGNED | signal to use when checkpointing the job |
stdout_file | TEXT | file name where to redirect program STDOUT |
stderr_file | TEXT | file name where to redirect program STDERR |
resubmit_job_id | INT UNSIGNED | if the job is a resubmission, stores the id of the previous job |
project | VARCHAR(255) | arbitrary name given by the user or an admission rule |
suspended | ENUM("YES","NO") | specify if the job was suspended (oarhold) |
job_env | TEXT | environment variables to set for the job |
exit_code | INT DEFAULT 0 | exit code for passive jobs |
job_group | VARCHAR(255) | not used |
Primary key: | job_id |
---|---|
Index fields: | state, reservation, queue_name, accounted, suspended |
Explications about the "state" field:
- "Waiting" : the job is waiting OAR scheduler decision.
- "Hold" : user or administrator wants to hold the job (oarhold command). So it will not be scheduled by the system.
- "toLaunch" : the OAR scheduler has attributed some nodes to the job. So it will be launched.
- "toError" : something wrong occurred and the job is going into the error state.
- "toAckReservation" : the OAR scheduler must say "YES" or "NO" to the waiting oarsub command because it requested a reservation.
- "Launching" : OAR has launched the job and will execute the user command on the first node.
- "Running" : the user command is executing on the first node.
- "Finishing" : the user command has terminated and OAR is doing work internally
- "Terminated" : the job has terminated normally.
- "Error" : a problem has occurred.
Explications about the "reservation" field:
- "None" : the job is not a reservation.
- "toSchedule" : the job is a reservation and must be approved by the scheduler.
- "Scheduled" : the job is a reservation and is scheduled by OAR.
Fields | Types | Descriptions |
---|---|---|
job_id | INT UNSIGNED | job identifier |
job_id_required | INT UNSIGNED | job needed to be completed before launching job_id |
Primary key: | job_id, job_id_required |
---|---|
Index fields: | job_id, job_id_required |
This table is fed by the oarsub command with the "-a" option.
Fields | Types | Descriptions |
---|---|---|
moldable_id | INT UNSIGNED | job identifier |
moldable_job_id | INT UNSIGNED | corresponding job identifier |
moldable_walltime | INT UNSIGNED | instance duration |
Primary key: | moldable_id |
---|---|
Index fields: | moldable_job_id |
A job can be described by several instances, so that the OAR scheduler can choose one of them. For example, it can compute which instance will finish first. This table stores all instances of all jobs.
Fields | Types | Descriptions |
---|---|---|
res_group_id | INT UNSIGNED | group identifier |
res_group_moldable_id | INT UNSIGNED | corresponding moldable job identifier |
res_group_property | TEXT | SQL constraint properties |
Primary key: | res_group_id |
---|---|
Index fields: | res_group_moldable_id |
Just as you can specify global job properties with oarsub and the "-p" option, you can do the same for each resource group defined with the "-l" option.
Fields | Types | Descriptions |
---|---|---|
res_job_group_id | INT UNSIGNED | corresponding group identifier |
res_job_resource_type | VARCHAR(255) | resource type (name of a field in resources) |
res_job_value | INT | wanted resource number |
res_job_order | INT UNSIGNED | order of the request |
Primary key: | res_job_group_id, res_job_resource_type, res_job_order |
---|---|
Index fields: | res_job_group_id |
This table stores the hierarchical resource description given with oarsub and the "-l" option.
Fields | Types | Descriptions |
---|---|---|
job_id | INT UNSIGNED | corresponding job identifier |
job_state | ENUM('Waiting', 'Hold', 'toLaunch', 'toError', 'toAckReservation', 'Launching', 'Finishing', 'Terminated', 'Error') | job state during the interval |
date_start | INT UNSIGNED | start date of the interval |
date_stop | INT UNSIGNED | end date of the interval |
Primary key: | None |
---|---|
Index fields: | job_id, job_state |
This table keeps information about job state changes.
Fields | Types | Descriptions |
---|---|---|
job_id | INT UNSIGNED | corresponding job identifier |
type | VARCHAR(255) | job type like "deploy", "timesharing", ... |
Primary key: | None |
---|---|
Index fields: | job_id, type |
This table stores the job types given with the oarsub command and the "-t" option.
Fields | Types | Descriptions |
---|---|---|
resource_id | INT UNSIGNED | resource identifier |
type | VARCHAR(100) DEFAULT "default" | resource type (used for licence resources for example) |
network_address | VARCHAR(100) | node name (used to connect via SSH) |
state | ENUM('Alive', 'Dead' , 'Suspected', 'Absent') | resource state |
next_state | ENUM('UnChanged', 'Alive', 'Dead', 'Absent', 'Suspected') DEFAULT 'UnChanged' | state into which the resource will switch |
finaud_decision | ENUM('YES', 'NO') DEFAULT 'NO' | tells whether the current state results from a decision of the "finaud" module |
next_finaud_decision | ENUM('YES', 'NO') DEFAULT 'NO' | tells whether the next state results from a decision of the "finaud" module |
state_num | INT | corresponding state number (useful with the SQL "ORDER" query) |
suspended_jobs | ENUM('YES','NO') | specify if there is at least one suspended job on the resource |
switch | VARCHAR(50) | name of the switch |
cpu | INT UNSIGNED | global cluster cpu number |
cpuset | INT UNSIGNED | field used with the CPUSET_RESOURCE_PROPERTY_DB_FIELD |
besteffort | ENUM('YES','NO') | accept or not besteffort jobs |
deploy | ENUM('YES','NO') | specify if the resource is deployable |
expiry_date | INT UNSIGNED | field used for the desktop computing feature |
desktop_computing | ENUM('YES','NO') | tell if it is a desktop computing resource (with an agent) |
last_job_date | INT UNSIGNED | store the date when the resource was used for the last time |
cm_availability | INT UNSIGNED | used with the compute mode features to know whether an Absent resource can be switched on |
Primary key: | resource_id |
---|---|
Index fields: | state, next_state, type, suspended_jobs |
Explanation of the states:
- "Alive" : the resource is ready to accept a job.
- "Absent" : the oar administrator has decided to pull out the resource. This computer can come back.
- "Suspected" : OAR system has detected a problem on this resource and so has suspected it (you can look in the event_logs table to know what has happened). This computer can come back (automatically if this is a "finaud" module decision).
- "Dead" : The oar administrator considers that the resource will not come back and will be removed from the pool.
This table permits specifying different properties for each resource. These can be used with the oarsub command ("-p" and "-l" options).
You can add your own properties with oarproperty command.
These properties can be updated with the oarnodesetting command ("-p" option).
Several properties are added by default:
- switch : you have to register the name of the switch where the node is plugged.
- cpu : this is a unique name given to each cpu. This enables the OAR scheduler to distinguish all cpus.
- cpuset : this is the name of the cpu on the node. The Linux kernel sets this to an integer beginning at 0. This field is linked to the configuration tag CPUSET_RESOURCE_PROPERTY_DB_FIELD.
Fields | Types | Descriptions |
---|---|---|
resource_log_id | INT UNSIGNED | unique id |
resource_id | INT UNSIGNED | resource identifier |
attribute | VARCHAR(255) | name of corresponding field in resources |
value | VARCHAR(255) | value of the field |
date_start | INT UNSIGNED | interval start date |
date_stop | INT UNSIGNED | interval stop date |
finaud_decision | ENUM('YES','NO') | store if this is a system change or a human one |
Primary key: | None |
---|---|
Index fields: | resource_id, attribute |
This table keeps a trace of every property change (a consequence of the oarnodesetting command with the "-p" option).
Fields | Types | Descriptions |
---|---|---|
moldable_job_id | INT UNSIGNED | job id |
resource_id | INT UNSIGNED | resource assigned to the job |
Primary key: | moldable_job_id, resource_id |
---|---|
Index fields: | moldable_job_id |
This table keeps, for each job, the information about which resources it was scheduled on.
Fields | Types | Descriptions |
---|---|---|
queue_name | VARCHAR(100) | queue name |
priority | INT UNSIGNED | the scheduling priority |
scheduler_policy | VARCHAR(100) | path of the associated scheduler |
state | ENUM('Active', 'notActive') DEFAULT 'Active' | allows stopping the scheduling for a queue |
Primary key: | queue_name |
---|---|
Index fields: | None |
This table contains the schedulers executed by the oar_meta_scheduler module. The executables are launched one after another in the order of the specified priority.
Fields | Types | Descriptions |
---|---|---|
job_id | INT UNSIGNED | job identifier |
challenge | VARCHAR(255) | challenge string |
Primary key: | job_id |
---|---|
Index fields: | None |
This table is used to share a secret between the OAR server and the oarexec process on the computing nodes (to prevent a job id from being stolen or forged by a malicious user).
For security reasons, this table must not be readable by a database account given to users who want to access OAR internal information (like statistics).
Each configuration tag found in /etc/oar.conf is now described:
Database type: you can use a MySQL or a PostgreSQL database (tags are "mysql" or "Pg"):

    DB_TYPE=mysql

Database hostname:

    DB_HOSTNAME=localhost

Database base name:

    DB_BASE_NAME=oar

Database user name:

    DB_BASE_LOGIN=oar

Database user password:

    DB_BASE_PASSWD=oar

OAR server hostname:

    SERVER_HOSTNAME=localhost
OAR server port:
    SERVER_PORT=6666

When the user does not specify a -l option then OAR uses this default:

    OARSUB_DEFAULT_RESOURCES = /resource_id=1
Specify the node to connect to when a job is in the deploy queue:

    DEPLOY_HOSTNAME = 127.0.0.1

Specify the node to connect to for a job of the cosystem type:

    COSYSTEM_HOSTNAME = 127.0.0.1
Set DETACH_JOB_FROM_SERVER to 1 if you do not want to keep an ssh connection between the node and the server. Otherwise set this tag to 0:

    DETACH_JOB_FROM_SERVER=1

By default OAR uses the ping command to detect whether nodes are down or not. To enhance this diagnostic you can specify one of these other methods (give the complete command path):

OAR sentinelle:

    SENTINELLE_COMMAND=/usr/bin/sentinelle -cconnect=ssh,timeout=3000

If you use sentinelle.pl then you must use this tag:

    SENTINELLE_SCRIPT_COMMAND=/var/lib/oar/sentinelle.pl -t 5 -w 20

OAR fping:

    FPING_COMMAND=/usr/bin/fping -q

OAR nmap: it will test whether it can connect to the ssh port (22):

    NMAP_COMMAND=/usr/bin/nmap -p 22 -n -T5

OAR generic: a specific script may be used instead of ping to check the aliveness of nodes. The script must print bad nodes on STDERR (1 line per bad node, and it must have exactly the same name that OAR gave as an argument of the command):

    GENERIC_COMMAND=/path/to/command arg1 arg2

OAR log level: 3 (debug+warnings+errors), 2 (warnings+errors), 1 (errors):

    LOG_LEVEL=2

OAR log file:

    LOG_FILE=/var/log/oar.log

If you want to debug oarexec on the nodes then set this to 1 (only effective if DETACH_JOB_FROM_SERVER = 1):

    OAREXEC_DEBUG_MODE=0

OAR allowed networks. Networks or hosts allowed to submit jobs to OAR and compute nodes may be specified here (0.0.0.0/0 means all IPs are allowed, 127.0.0.1/32 means only the IP 127.0.0.1 is allowed):

    ALLOWED_NETWORKS= 127.0.0.1/32 0.0.0.0/0
Set the granularity of the OAR accounting feature (in seconds). Default is 1 day (86400s):
ACCOUNTING_WINDOW= 86400
OAR information may be sent by email to the administrator. Set the next lines according to your configuration to activate this feature:

    MAIL_SMTP_SERVER = smtp.serveur.com
    MAIL_RECIPIENT = user@domain.com
    MAIL_SENDER = oar@domain.com

Set the timeout for the prologue and epilogue execution on computing nodes:

    PROLOGUE_EPILOGUE_TIMEOUT = 60

Files to execute before and after each job on the first computing node (default is ~oar/oar_prologue and ~oar/oar_epilogue):

    PROLOGUE_EXEC_FILE = /path/to/prog
    EPILOGUE_EXEC_FILE = /path/to/prog

Set the timeout for the prologue and epilogue execution on the OAR server:

    SERVER_PROLOGUE_EPILOGUE_TIMEOUT = 60

Files to execute before and after each job on the OAR server:

    SERVER_PROLOGUE_EXEC_FILE = /path/to/prog
    SERVER_EPILOGUE_EXEC_FILE = /path/to/prog

Set the frequency for checking Alive and Suspected resources:

    FINAUD_FREQUENCY = 300
FINAUD_FREQUENCY = 300
Set the time after which Suspected resources become Dead (default is 0, which means never):
DEAD_SWITCH_TIME = 600
Maximum number of seconds used by a scheduler:

    SCHEDULER_TIMEOUT = 10

Time to wait, when a reservation has not got all the resources that it reserved (some resources could have become Suspected or Absent since the job submission), before launching the job on the remaining resources:
RESERVATION_WAITING_RESOURCES_TIMEOUT = 300
Time to add between each job (time for administration tasks or to let the computers reboot):
SCHEDULER_JOB_SECURITY_TIME = 1
Minimum duration in seconds that can be considered as a hole in which a job could be scheduled:
SCHEDULER_GANTT_HOLE_MINIMUM_TIME = 300
You can add an order preference on the resources assigned by the system (SQL ORDER syntax):

    SCHEDULER_RESOURCE_ORDER = switch ASC, network_address DESC, resource_id ASC

This tells the scheduler to treat resources of these types, on which a job is suspended, as free ones, so that other jobs can be scheduled on them. (List the resource types separated by spaces; the default value is empty, so no other job can be scheduled on the resources of a suspended job):

    SCHEDULER_AVAILABLE_SUSPENDED_RESOURCE_TYPE = default licence vlan

Name of the Perl script that manages suspend/resume. You have to install your script in $OARDIR and give only the name of the file without the full path (default is suspend_resume_manager.pl):

    SUSPEND_RESUME_FILE = suspend_resume_manager.pl
Files to execute just after a job is suspended and just before a job is resumed:

    JUST_AFTER_SUSPEND_EXEC_FILE = /path/to/prog
    JUST_BEFORE_RESUME_EXEC_FILE = /path/to/prog

Timeout for these two scripts:

    SUSPEND_RESUME_SCRIPT_TIMEOUT = 60
Name of the database field that contains the cpu number of the node. If this option is set then users must use oarsh instead of ssh to move between the nodes that they have reserved via oarsub.
CPUSET_RESOURCE_PROPERTY_DB_FIELD = cpuset
Name of the Perl script that manages cpusets. You have to install your script in $OARDIR and give only the name of the file without the full path (default is cpuset_manager.pl, which handles the Linux kernel cpuset):
CPUSET_FILE = cpuset_manager.pl
If you have installed taktuk and want to use it to manage cpusets then give the full command path (with your options, except "-m", "-o" and "-c"). You do not have to provide any taktuk command here; this tag is optional:

    TAKTUK_CMD = /usr/bin/taktuk -s

If you want nodes to be started and stopped on demand, OAR gives you this API:
When the OAR scheduler wants some nodes to wake up, it launches this command with the node list as arguments (the scheduler looks at the cm_availability field in the resources table to know whether the node will be started for long enough):
SCHEDULER_NODE_MANAGER_WAKE_UP_CMD = /path/to/the/command with your args
When OAR considers that some nodes can be shut down, it launches this command with the node list in arguments:
SCHEDULER_NODE_MANAGER_SLEEP_CMD = /path/to/the/command args
Parameter for the scheduler to decide when a node is idle (number of seconds since the last job terminated on the node):
SCHEDULER_NODE_MANAGER_IDLE_TIME = 600
Parameter for the scheduler to decide whether a node will have enough time to sleep (number of seconds before the next job):
SCHEDULER_NODE_MANAGER_SLEEP_TIME = 600
Command used to connect to other nodes (default is "ssh" in the PATH):

    OPENSSH_CMD = /usr/bin/ssh

These are the configuration tags for OAR in the desktop-computing mode:

    DESKTOP_COMPUTING_ALLOW_CREATE_NODE=0
    DESKTOP_COMPUTING_EXPIRY=10
    STAGEOUT_DIR=/var/lib/oar/stageouts/
    STAGEIN_DIR=/var/lib/oar/stageins
    STAGEIN_CACHE_EXPIRY=144
OAR can be decomposed into several modules which perform different tasks.
This module is the OAR server. It decides which actions must be performed. It is divided into 2 processes:
- One listens on a TCP/IP socket. It waits for information or commands from the OAR user programs or from the other modules.
- The other one handles commands thanks to an automaton and launches the right modules one after another.
This module is executed periodically by the Almighty (default is every 30 seconds).
The jobs of Sarko are :
- Look at the walltime of running jobs and ask to frag them if it has expired.
- Detect if fragged jobs are really fragged, otherwise ask to exterminate them.
- In "Desktop Computing" mode, detect if a node expiry date has passed and ask to change its state to "Suspected".
- Change "Suspected" resources into "Dead" after DEAD_SWITCH_TIME seconds.
This is the module dedicated to print and log every debugging, warning and error messages.
This module is in charge of deleting jobs. Other OAR modules or commands can ask to kill a job, and Leon is the one that performs it.
There are 2 frag types :
- normal : Leon tries to connect to the first node allocated to the job and asks oarexec to terminate the job itself.
- exterminate : if, after a timeout, the normal method did not succeed, then Leon notes it and cleans up the database for these jobs. In that case OAR does not know what happened on the node and so suspects it.
This module is in charge of changing resource states and checking whether there are still jobs running on them.
It also checks all pending events in the table event_logs.
This module checks, for each reservation job, whether it is valid and launches it at the right time.
The Scheduler launches all gantt schedulers in the order of the priorities specified in the database and updates the visualization tables (gantt_jobs_predictions_visu and gantt_jobs_resources_visu).
This is the default OAR scheduler. It implements all functionalities like timesharing, moldable jobs, besteffort jobs, ...
By default, this scheduler is used by all default queues.
We have implemented the FIFO with backfilling algorithm. Some parameters can be changed in the configuration file (see SCHEDULER_TIMEOUT, SCHEDULER_JOB_SECURITY_TIME, SCHEDULER_GANTT_HOLE_MINIMUM_TIME, SCHEDULER_RESOURCE_ORDER).
This module launches OAR effective jobs. These processes are run asynchronously with all modules.
For each job, the Runner uses OPENSSH_CMD to connect to the first node of the reservation and propagate a Perl script which handles the execution of the user command.
For PASSIVE jobs, the mechanism is similar to the INTERACTIVE one, except for the shell launched from the frontal node.
The job is finished when the user command ends. Then oarexec returns its exit value (indicating what errors occurred) to the Almighty via the SERVER_PORT if DETACH_JOB_FROM_SERVER was set to 1, otherwise it returns it directly.
If the "--force_cpuset_name" option of the oarsub command is not defined then OAR will use job identifier. The CPUSET name is effectively created on each nodes and is composed as "user_cpusetname".
So if a user specifies "--force_cpuset_name" option, he will not be able to disturb other users.
OAR system steps:
- Before each job, the Runner initializes the CPUSET (see the CPUSET definition) with OPENSSH_CMD and an efficient launching tool: Taktuk. If Taktuk is not installed and configured (TAKTUK_CMD) then OAR uses a less optimized internal launching tool.
- After each job, OAR deletes all processes stored in the associated CPUSET. Thus all nodes are clean after an OAR job.
If you do not want to use this feature you can, but then nothing guarantees that every user process will be killed after the end of a job.
If you want you can implement your own cpuset management. This is done by editing 3 files (see also CPUSET installation):
- cpuset_manager.pl : this script creates the cpuset on each node and deletes it at the end of the job. For more information, look at this script (it contains several comments).
- oarsh : (OARSH) this script is used to replace the standard "ssh" command. It gets the cpuset name in which it is running and transfers this information via "ssh" and the "SendEnv" option. In this file, you have to change the "get_current_cpuset" function.
- oarsh_shell : (OARSH_SHELL) this script is the shell of the oar user on each node. It gets the environment variables and checks whether there is a cpuset name. If there is one, it assigns the current process and its parent to this cpuset, so all further user processes will remain in the cpuset. In this file you just have to change the "add_process_to_cpuset" function.
Jobs can be suspended with the command oarhold (a "SIGSTOP" signal is sent to every process on every node) to allow other jobs to be executed.
"Suspended" jobs can be resumed with the command oarresume (a "SIGCONT" signal is sent to every suspended process on every node). They will go back to "Running" when their assigned resources are free.
IMPORTANT: This feature is available only if CPUSET is configured.
You can specify 2 scripts if you have to perform any actions just after suspend (JUST_AFTER_SUSPEND_EXEC_FILE) and just before resume (JUST_BEFORE_RESUME_EXEC_FILE).
Moreover you can perform other actions (than sending signals to processes) if you want: just edit the "suspend_resume_manager.pl" file.
Leon tries to connect to the OAR Perl script running on the first job node (it finds it thanks to the file /tmp/oar/pid_of_oarexec_for_jobId_id) and sends a "SIGTERM" signal. The script catches it and ends the job normally (it kills the processes that it launched).
If this method does not succeed then Leon will flush the OAR database for the job and the nodes will be set to "Suspected" by NodeChangeState.
If your job is checkpointed, is of the type idempotent (oarsub "-t" option) and its exit code is equal to 0, then another job is automatically created and scheduled with the same behaviour.
The checkpoint is just a signal sent to the program specified with the oarsub command.
If the user uses "-k" option then Sarko will ask the OAR Perl script running on the first node to send the signal to the process (SIGUSR2 or the one specified with "--signal").
You can also use the oardel command to send the signal.
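For example, the following submission (the script name, durations and job id are illustrative) asks OAR to send the checkpoint signal 600 seconds before the walltime, and lets the job be resubmitted automatically because it is idempotent; the second command sends the checkpoint signal immediately:

    # oarsub -t idempotent -k 600 -l /node=2,walltime=4:00:00 ./my_script.sh
    # oardel -c 42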
General steps used to schedule a job:
- All previous scheduled jobs are stored in a Gantt data structure.
- All resources that match the property constraints of the job (the "-p" option and the indications in the "{...}" parts of the "-l" option of oarsub) are stored in a tree data structure according to the hierarchy given with the "-l" option.
- Then this tree is given to the Gantt library to find the first hole where the job can be launched.
- The scheduler stores its decision into the database in the gantt_jobs_predictions and gantt_jobs_resources tables.
See User section from the FAQ for more examples and features.
This section explains how the "--notify" oarsub option is handled by OAR:
The user wants to receive an email:
The syntax is "mail:name@domain.com". Mail section in the Configuration file must be present otherwise the mail cannot be sent.
The user wants to launch a script:
The syntax is "exec:/path/to/script args". OAR server will connect (using OPENSSH_CMD) on the node where the oarsub command was invoked and then launches the script with in argument : job_id, job_name, tag, comments.
(tag is a value in : "START", "END", "ERROR")
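For example (the script path and its content are only an illustration):

  # be notified by mail
  oarsub --notify "mail:name@domain.com" -l /node=2 /path/to/prog

  # or run a script on the submission node; OAR appends
  # job_id, job_name, tag and comments as arguments
  oarsub --notify "exec:/home/user/notify.sh" -l /node=2 /path/to/prog

A matching notify.sh could simply log what OAR reports:

  #!/bin/sh
  echo "$(date) job=$1 name=$2 tag=$3 comments=$4" >> ~/oar_notifications.log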
In the configuration file you can set the ACCOUNTING_WINDOW parameter. The command oaraccounting will then split time into windows of this size and feed the accounting table.
So it is very easy and fast to get usage statistics for the cluster. You can see this as a "data warehousing" information extraction method.
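For example (the window value is only an illustration; oar.conf uses a shell-like KEY="value" syntax):

  # in oar.conf: split accounting into one-day windows
  ACCOUNTING_WINDOW="86400"

  # then feed (or refresh) the accounting table
  oaraccounting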
We are working with the Icatis company on clusters composed of Intranet computers. These nodes can be switched into computing mode only at specific times. So we have implemented a feature that can request powering on some hardware if it can join the cluster.
We use the cm_availability field of the resources table to know when a node will become inaccessible in cluster mode (easily settable with the oarnodesetting command). So when the OAR scheduler wants some potentially available computers to launch jobs, it executes the command SCHEDULER_NODE_MANAGER_WAKE_UP_CMD.
Moreover, if a node did not execute a job for SCHEDULER_NODE_MANAGER_IDLE_TIME seconds and no job is scheduled on it before SCHEDULER_NODE_MANAGER_SLEEP_TIME seconds, then OAR will launch the command SCHEDULER_NODE_MANAGER_SLEEP_CMD.
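A possible oar.conf sketch for this mechanism (the script paths and durations are only examples):

  # command used to power on nodes declared available via cm_availability
  SCHEDULER_NODE_MANAGER_WAKE_UP_CMD="/usr/local/sbin/wake_up_nodes.sh"
  # switch a node off if it has been idle for 10 minutes and
  # no job is scheduled on it within the next hour
  SCHEDULER_NODE_MANAGER_IDLE_TIME="600"
  SCHEDULER_NODE_MANAGER_SLEEP_TIME="3600"
  SCHEDULER_NODE_MANAGER_SLEEP_CMD="/usr/local/sbin/sleep_nodes.sh"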
It is possible to share the time slot of a job with other jobs. To use this feature you have to specify the type timesharing when you use oarsub.
You have 4 different ways to share your slot:
- timesharing=*,* : This is the default behaviour if nothing but timesharing is specified. It indicates that the job can be shared with all users and any job name.
- timesharing=user,* : This indicates that the job can be shared only with the same user, whatever the job name.
- timesharing=*,job_name : This indicates that the job can be shared with all users, but only with jobs having the same name.
- timesharing=user,job_name : This indicates that the job can be shared only with the same user and with jobs having the same name.
See User section from the FAQ for more examples and features.
Besteffort jobs are scheduled in the besteffort queue. Their particularity is that they are deleted if another, non-besteffort job wants the resources where they are running.
For example you can use this feature to maximize the use of your cluster with multiparametric jobs. This is what is done by the CIGRI project.
When you submit a job you have to use "-t besteffort" option of oarsub to specify that this is a besteffort job.
Note : a besteffort job cannot be a reservation.
This feature makes it possible to reserve some resources without launching any program on the corresponding nodes. Thus nothing is done by OAR when the job starts (no prologue, no epilogue, neither on the server nor on the nodes).
This is useful with another launching system that will declare its time slot in OAR. So you can have two different batch schedulers.
When you submit a job you have to use the "-t cosystem" option of oarsub to specify that this is a cosystem job.
These jobs are stopped by the oardel command or when they reach their walltime.
This feature is useful when you want to allow users to reinstall their reserved nodes. In this case the OAR job will not log in on the first computer of the reservation but on DEPLOY_HOSTNAME.
So the prologue and epilogue scripts are executed on DEPLOY_HOSTNAME, and if the user wants to launch a script it is also executed on DEPLOY_HOSTNAME.
OAR does nothing on the computing nodes because they will normally be rebooted to install a new system image.
This feature is heavily used in the Grid5000 project with the Kadeploy tools.
When you submit a job you have to use "-t deploy" option of oarsub to specify that this is a deploy job.
If you cannot contact the computers via SSH you can install the "desktop computing" OAR mode. This kind of installation is based on two programs:
- oar-cgi : this is a web CGI used by the nodes to communicate with the OAR server.
- oar-agent.pl : this program periodically asks the server web CGI what it has to do.
This method replaces the SSH command. Computers that want to register themselves into OAR just have to be able to contact the OAR HTTP server.
In this situation we do not have an NFS file system to share the same directories across all nodes, so we have to use a stagein/stageout solution. In this case you can use the oarsub "stagein" option to migrate your data.
You just have to use several "-l" oarsub options (one for each moldable description). By default the OAR scheduler will launch the moldable job which will end first.
So you may see some free resources, but the scheduler can decide to start your job later, because more resources will then be free and the job walltime will be smaller. An example is given below.
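Example (the resource amounts and walltimes are only illustrative):

  oarsub -l /node=4,walltime=2:00:00 -l /node=8,walltime=1:00:00 /path/to/prog

Here OAR will pick whichever of the two descriptions lets the job finish first.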
Example:
oarsub -I -l '{switch = "sw1" or switch = "sw5"}/switch=1+/node=1'
This example asks OAR to reserve all resources from the switch sw1 or the switch sw5, and a node on another switch.
You can see the "+" syntax as a sub-reservation directive.
Yes. You have to use the OAR scheduler "timesharing" feature. To use it, the reservation and your further jobs must be of the type timesharing (only for you).
Example:
Make your reservation:
oarsub -r "2006-09-12 8:00:00" -l /switch=1 -t 'timesharing=user,*'This command asks all resources from one switch at the given date for the default walltime. It alsa specifies that this job can be shared with himself and without a constraint on the job name.
Once your reservation has begun then you can launch:
oarsub -I -l /node=2,walltime=0:50:00 -p 'switch = "nom_du_switch_schedule"' -t 'timesharing=user,*'
So this job will be scheduled on nodes assigned to the previous reservation.
The "timesharing" oarsub command possibilities are enumerated in Timesharing.
You have to specify that your job is idempotent. Then, after a successful checkpoint, if the job is resubmitted, everything will go right and there will be no problem (with file creation, deletion, ...).
Example:
oarsub -k 600 --signal 2 -t idempotent /path/to/prog
So this job will receive the signal SIGINT (see man kill to learn signal numbers) 10 minutes before the walltime ends. Then if everything goes well it will be resubmitted.
You can use the besteffort job type. Thus your job will be launched only if there is a hole and will be deleted if another job wants its resources.
Example:
oarsub -t besteffort /path/to/prog
OAR performs no node management on resources with an empty "network_address", so you can define resources that are not linked to a real node.
Here are the steps to configure OAR with the possibility to reserve licences (or any other such abstract resources):
Add a new field in the resources table to specify the licence name:
oarproperty -a licence -c
Add your licence name resources with oarnodesetting:
oarnodesetting -a -h "" -p type=mathlab -p licence=l1
oarnodesetting -a -h "" -p type=mathlab -p licence=l2
oarnodesetting -a -h "" -p type=fluent -p licence=l1
...
Now you have to write an admission rule to force the oarsub "-l" option onto resources of the type "default" (node resources) if nothing else is specified.
ADD ADMISSION RULE HERE
After this configuration, users can perform submissions like:
oarsub -I -l "/switch=2/nodes=10+{type = 'mathlab'}/licence=20"
So users can ask OAR for these other resource types, but nothing prevents their programs from taking more licences than they asked for. You can resolve this problem with the SERVER_SCRIPT_EXEC_FILE configuration. In these scripts you have to bind the OAR-allocated resources to the licence servers, so as to restrict user consumption to what was asked. This is very dependent on the licence management system.
- you must manipulate only jobs in your queue whose state is "Waiting" and whose "Reservation" field is "None";
- you can read all the information stored in the database;
- you have to load the previous decisions of the other schedulers (load the information from the gantt_jobs_predictions and gantt_jobs_resources tables), otherwise your decisions can conflict with previous ones;
- you must store your decisions in the gantt_jobs_predictions and gantt_jobs_resources tables;
- you can set the state of jobs to "toError" and OAR will delete them. Afterwards you must exit from your program with exit code 1, otherwise 0.
You can look at the default OAR scheduler "oar_sched_gantt_with_timesharing". It uses the gantt and resource tree libraries, which are essential for taking decisions.
No.
If you change the server time while OAR is executing jobs then their stop dates will be wrong. So users have to be warned about this, and the database logs are not exact for these jobs.
We are using the RST format from the Docutils project. This syntax is easily readable and can be converted into HTML, LaTeX or XML.
You can find basic information at http://docutils.sourceforge.net/docs/user/rst/quickref.html
Now, with the ability to declare any type of resources (like licences, VLANs, IP ranges), computing resources must have the type default and a non-null network_address.
Possibility to declare associated resources like licences, IP ranges, ... and to reserve them like others.
Now you can connect to your jobs (not only for reservations).
Add "cosystem" job type (execute and do nothing for these jobs).
New scheduler : "oar_sched_gantt_with_timesharing". You can submit jobs with the type "timesharing", which indicates that this scheduler can launch more than one job on a resource at a given time. It is possible to restrict this feature with the words "user" and "name". For example, '-t timesharing=user,name' indicates that only a job from the same user with the same name can be launched at the same time.
Add PostgreSQL support. So there is a choice to make between MySQL and PostgreSQL.
New approach for the scheduling : administrators have to insert into the database descriptions of resources, not nodes. Resources have a network address (physical node) and properties. For example, if you have dual-processor nodes, then you can create 2 different resources with the same network address but 2 different processor names.
The scheduler can now handle resource properties in a hierarchical manner. Thus, for example, you can run "oarsub -l /switch=1/cpu=5", which submits a job on 5 processors on the same switch.
Add a signal handler in oarexec and propagate this signal to the user process.
Support '#OAR -p ...' options in user script.
- Add in oar.conf:
- DB_BASE_PASSWD_RO : for security reasons, it is possible to execute requests whose parts are specified by users (like the "-p" option) with a read-only account.
- OARSUB_DEFAULT_RESOURCES : when nothing is specified with the oarsub command then OAR takes this default resource description.
- OAREXEC_DEBUG_MODE : turn on or off debug mode in oarexec (create /tmp/oar/oar.log on nodes).
- FINAUD_FREQUENCY : indicates how frequently OAR launches Finaud (which searches for dead nodes).
- SCHEDULER_TIMEOUT : indicates to the scheduler the amount of time after which it must end itself.
- SCHEDULER_JOB_SECURITY_TIME : time between each job.
- DEAD_SWITCH_TIME : after this time, Absent and Suspected resources are switched to the Dead state.
- PROLOGUE_EPILOGUE_TIMEOUT : the possibility to specify a different timeout for the prologue and epilogue.
- PROLOGUE_EXEC_FILE : you can specify the path of the prologue script executed on nodes.
- EPILOGUE_EXEC_FILE : you can specify the path of the epilogue script executed on nodes.
- GENERIC_COMMAND : a specific script may be used instead of ping to check the aliveness of nodes. The script must return bad nodes on STDERR (1 line per bad node, with exactly the same name that OAR gave as an argument of the command).
- JOBDEL_SOFTWALLTIME : time after a normal frag during which the system waits before retrying to frag the job.
- JOBDEL_WALLTIME : time after a normal frag after which the system deletes the job arbitrarily and suspects the nodes.
- LOG_FILE : specify the path of OAR log file (default : /var/log/oar.log).
Add wait() in pingchecker to avoid zombies.
Better code modularization.
Remove the node installation part needed to launch jobs. So it is easier to upgrade from one version to another (oarnodesetting must already be installed on each node if we want to use it).
Users can specify a method to be notified (mail or script).
Add cpuset support
Add prologue and epilogue script to be executed on the OAR server before and after launching a job.
Add dependency support between jobs ("-a" option in oarsub).
In oarsub you can specify the launching directory ("-d" option).
In oarsub you can specify a job name ("-n" option).
In oarsub you can specify stdout and stderr file names.
User can resubmit a job (option "--resubmit" in oarsub).
It is possible to specify a read-only database account; it will be used to evaluate SQL properties given by the user with the oarsub command (more secure).
Add the possibility for the scheduler to order assigned resources by their properties. So you can privilege some resources over others (SCHEDULER_RESOURCE_ORDER tag in the oar.conf file).
a command can be specified to switch off idle nodes (SCHEDULER_NODE_MANAGER_SLEEP_CMD, SCHEDULER_NODE_MANAGER_IDLE_TIME, SCHEDULER_NODE_MANAGER_SLEEP_TIME in oar.conf)
a command can be specified to switch on nodes in the Absent state according to the resource property cm_availability in the table resources (SCHEDULER_NODE_MANAGER_WAKE_UP_CMD in oar.conf).
if a job goes into the Error state and it is not its fault, then OAR will resubmit it.
initialise the "ganttJobsPrediction" table with a right reference date (1970-01-01 00:00:01)
oarsub has the "-k, --checkpoint" option. It specifies the number of seconds before job walltime to send a SIGUSR2 on it.
You can see the list of events for jobs with -f option in oarstat
oardel can now send SIGUSR2 to the oarexec of a job (-c option)
oardel can now delete several jobs
Add a signal handler in oarexec and propagate this signal to the user process
Support '#OAR -p ...' options in user script
Add in oar.conf the possibility to specify a different timeout for prologue and epilogue (PROLOGUE_EPILOGUE_TIMEOUT)
when a connection to the DB fails, OAR retries 5 times
handle the EXTERMINATE_JOB event and suspect all nodes if possible
add job state log table
add node properties log table
add possibility to use sentinelle.rb script in the ping_checker module
add a GRID5000 specific scheduler which implements specific policy (oar_sched_gant_g5k)
add -s option to oarnodes command --> show only state of nodes
add -l option to oarnodes command --> show only the node list
root can now delete any jobs in the same way the oar user could
change a few oardel messages
limit the commands (with arguments) specified by users:
regular expression : [\w\s\/\.\-]*
change oaremovenode command to oarremovenode
enhance job error management
enhance suspicious nodes detection
fix bugs about accounting
debug reservation jobs (if the job does not get the right number of requested nodes then it will wait for a delay; once expired, the job will be launched on the available nodes).
add a cache for the visualization of the gantt diagram (add two gantt tables for the visualization)
add an event log mechanism. It makes it possible to know all decisions and events occurring in OAR with regard to jobs.
- detection of errors (they can be traced via event_log table):
- job working directory does not exist on the node
- output files cannot be created
add deploy scheduler awareness (schedule on non-deploy nodes first)
possibility to change property value via oarnodesetting command (-p option)
add the command "oaremovenode" that allows to delete a node of the database definitely
bug fix : now oarsub can use the user's script even if the oar user cannot
now, you can use special value "all" in oarsub ("nodes" resource). It gives all free nodes corresponding to the weight and properties specified.
debug Gantt visualization
add the possibility to test nodes via nmap
add accounting features (accounting table)
change the nodeState_log table schema to increase speed