Resource Management System for High Performance Computing

OAR Documentation

OAR logo
Author: Capit Nicolas
Address: Laboratoire Informatique et Distribution (ID)-IMAG ENSIMAG - antenne de Montbonnot ZIRST 51, avenue Jean Kuntzmann 38330 MONTBONNOT SAINT MARTIN
Contact:
Authors: ID laboratory
Organization: ID laboratory
Status: This is a "work in progress"
License: GNU GENERAL PUBLIC LICENSE
Dedication: For users, administrators and developers.
Abstract: OAR is a resource manager (or batch scheduler) for large clusters. In terms of functionality, it is close to PBS, LSF, CCS and Condor. It is suitable both for production platforms and for research experiments.

BE CAREFUL: THIS DOCUMENTATION IS FOR OAR 2.0


1   OAR capabilities

OAR is an open source batch scheduler which provides simple and flexible exploitation of a cluster.

It manages cluster resources like a traditional batch scheduler (such as PBS / Torque / LSF / SGE).

Its design is based on high level tools:
  • the MySQL relational database engine,
  • the Perl scripting language,
  • the cpuset confinement mechanism,
  • the Taktuk scalable remote execution tool.

It is flexible enough to be suitable for production clusters and research experiments. It currently manages more than 5000 nodes and has executed more than 5 million jobs.

OAR advantages:
  • No specific daemon on nodes.
  • Upgrades are made on the servers; nothing to do on computing nodes.
  • CPUSET (2.6 Linux kernel) integration which restricts jobs to their assigned resources (also useful to completely clean up after a job, even a parallel one).
  • All administration tasks are performed with the taktuk command (a tool for large scale remote execution): http://taktuk.gforge.inria.fr/.
  • Hierarchical resource requests (handle heterogeneous clusters).
  • Gantt scheduling (so you can visualize the internal scheduler decisions).
  • Full or partial time-sharing.
  • Checkpoint/resubmit.
  • License server management support.
  • Best effort jobs: if another job wants the same resources then the best effort job is deleted automatically (useful to execute programs like SETI@home).
  • Environment deployment support (Kadeploy): http://kadeploy.imag.fr/.
Other more common features:
  • Batch and Interactive jobs.
  • Admission rules.
  • Walltime.
  • Multi-schedulers support.
  • Multi-queues with priority.
  • Backfilling.
  • First-Fit Scheduler.
  • Reservation.
  • Support of moldable tasks.
  • Check compute nodes.
  • Epilogue/Prologue scripts.
  • Support of dynamic nodes.
  • Logging/Accounting.
  • Suspend/resume jobs.

2   Installing the OAR batch system

What do you need?
  • a cluster
  • to be an admin of this cluster
  • to get the install package of OAR (normally you have already done that)

2.1   Requirements

There are three kinds of nodes, each requiring a specific software configuration.

These are :

  • the server node, which will hold all of OAR's "smartness";
  • the login nodes, on which you will be allowed to log in, then reserve some computational nodes;
  • the computational nodes (a.k.a. the nodes), on which the jobs will run.

On every node, the "sentinelle" binary can be installed (it is not mandatory). This tool is used to launch commands on several computers at the same time. It ships with the current OAR package (it is also available separately as part of Taktuk, under an open source license).

On every node (server, login, computational), the following packages must be installed:

  • sudo
  • Perl
  • Perl-base
  • openssh (server and client)

On the OAR server and on the login nodes, the following packages must be installed:

  • Perl-Mysql
  • Perl-DBI
  • MySQL
  • MySQL-shared
  • libmysql

From now on, we will suppose all the Perl/MySQL packages are correctly installed and configured, and that the MySQL database is started.

2.2   Configuration of the cluster

The following steps have to be done, prior to installing OAR:

  • add a user named "oar" in the group "oar" on every node

  • let the user "oar" connect through ssh from any node to any node WITHOUT password. To achieve this, here is a standard procedure for OpenSSH (a command sketch is given after this list):

    • create a set of ssh keys for the user "oar" with ssh-keygen (for instance 'id_dsa.pub' and 'id_dsa')

    • copy these keys on each node of the cluster in the ".ssh" folder of the user "oar"

    • append the contents of 'id_dsa.pub' to the file "~/.ssh/authorized_keys"

    • in "~/.ssh/config" add the lines:

      Host *
          ForwardX11 no
          StrictHostKeyChecking no
          PasswordAuthentication no
          AddressFamily inet
      
    • test the ssh connection between (every) two nodes: there should not be any prompt.

  • grant the user "oar" the permission to execute commands with root privileges. To achieve that, OAR makes use of sudo. As a consequence, /etc/sudoers must be configured. Please use visudo to add the following lines:

    Defaults>oar    env_reset,env_keep = "OARLIB OARUSER OARDIR PWD \
    PERL5LIB DISPLAY OARCONFFILE"
    
    Cmnd_Alias OARCMD = /usr/lib/oar/oarnodes, /usr/lib/oar/oarstat,\
    /usr/lib/oar/oarsub, /usr/lib/oar/oardel, /usr/lib/oar/oarhold,\
    /usr/lib/oar/oarnotify, /usr/lib/oar/oarresume
    %oar ALL=(oar) NOPASSWD: OARCMD
    
    oar ALL=(ALL) NOPASSWD:ALL
    
    Defaults:www-data   env_keep += "SCRIPT_NAME SERVER_NAME \
    SERVER_ADMIN HTTP_CONNECTION REQUEST_METHOD CONTENT_LENGTH \
    SCRIPT_FILENAME SERVER_SOFTWARE HTTP_TE QUERY_STRING REMOTE_PORT \
    HTTP_USER_AGENT SERVER_PORT SERVER_SIGNATURE REMOTE_ADDR \
    CONTENT_TYPE SERVER_PROTOCOL PATH REQUEST_URI GATEWAY_INTERFACE \
    SERVER_ADDR DOCUMENT_ROOT HTTP_HOST"
    
    www-data ALL=(oar) NOPASSWD: /usr/lib/oar/oar-cgi
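
As an illustration, here is a minimal sketch of the SSH key setup described above, assuming the node names are stored one per line in an illustrative "node_list.txt" file and that you can already copy files to the nodes as the user "oar":

    # as user oar on the server: create a password-less DSA key pair
    ssh-keygen -t dsa -N "" -f ~/.ssh/id_dsa
    cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

    # propagate the whole ~/.ssh directory to every node
    for node in $(cat node_list.txt); do
        scp -r ~/.ssh $node:
    done

    # test: this must print the remote hostname without any prompt
    ssh node1 hostname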
    

There are three different flavors of installation:

  • server: install the daemon which must be running on the server
  • user: install all the tools needed to submit and manage jobs for the users (oarsub, oarstat, oarnodes, ...)
  • node: install the tools for a computing node

The installation is straightforward:

  • become root

  • go to OAR source repository

  • You can set Makefile variables on the command line to suit your configuration (set "OARHOMEDIR" to the home directory of your oar user, and "PREFIX" to where you want to copy all OAR files).

  • run make <module> [module] ...
    where module := { server-install | user-install | node-install | doc-install | debian-package }

    OPTIONS := { OARHOMEDIR | OARCONFDIR | OARUSER | PREFIX | MANDIR | OARDIR | BINDIR | SBINDIR | DOCDIR }

  • Edit /etc/oar.conf file to match your cluster configuration.

  • Make sure that the PATH environment variable contains $PREFIX/$BINDIR of your installation (default is /usr/local/bin).
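
For example, a combined server and user installation might be run like this (the paths are only illustrative):

cd /path/to/oar/sources
make OARHOMEDIR=/var/lib/oar PREFIX=/usr/local server-install user-install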

Initialization of the OAR database (MySQL) is achieved using the oar_mysql_db_init script provided with the server module installation and located in $PREFIX/sbin (/usr/local/sbin with the default Makefile).

If you want to use a PostgreSQL server then there is currently no automatic installation script. You have to add a new user which can connect to a new oar database (use the commands createdb and createuser). After that, you have to authorize network connections in the postgresql.conf file of the PostgreSQL server (uncomment tcpip_socket = true). Moreover, a line like

host    oar         oar            X.X.X.X/Y    md5

must be added to the pg_hba.conf file to enable the oar user to connect to the database.

Then you can import the database scheme stored in oar_postgres.sql (use the SQL command "\i").
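
Put together, the PostgreSQL setup could look like the following session (run as the PostgreSQL superuser; the user and database names are the ones suggested above):

createuser oar
createdb -O oar oar
psql -U oar oar
oar=> \i oar_postgres.sql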

For more information about PostgreSQL, go to http://www.postgresql.org/.

Note: The same machine may host several or even all modules.

Note about X11: The easiest and most scalable way to use X11 applications on cluster nodes is to open the X11 ports and set the right DISPLAY environment variable by hand. Otherwise users can use X11 forwarding via ssh to access the cluster frontend. In that case you must configure the ssh server on this frontend with:

X11Forwarding yes
X11UseLocalhost no

With this configuration, users can launch X11 applications after an 'oarsub -I' on the given node.

2.2.1   CPUSET installation

2.2.1.1   What are "oarsh" and "oarsh_shell" scripts?

"oarsh" and "oarsh_shell" are two scripts that can restrict user processes to stay in the same cpuset on all nodes.

This feature is very useful to restrict processor consumption on multiprocessor computers and to kill all processes of the same OAR job on several nodes.

2.2.1.2   CPUSET definition

CPUSET is a module integrated in the Linux kernel since version 2.6. In the kernel documentation, you can read:

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a tasks current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Each task has a pointer to a cpuset.  Multiple tasks may reference
the same cpuset.  Requests by a task, using the sched_setaffinity(2)
system call to include CPUs in its CPU affinity mask, and using the
mbind(2) and set_mempolicy(2) system calls to include Memory Nodes
in its memory policy, are both filtered through that tasks cpuset,
filtering out any CPUs or Memory Nodes not in that cpuset.  The
scheduler will not schedule a task on a CPU that is not allowed in
its cpus_allowed vector, and the kernel page allocator will not
allocate a page on a node that is not allowed in the requesting tasks
mems_allowed vector.

If a cpuset is cpu or mem exclusive, no other cpuset, other than a direct
ancestor or descendent, may share any of the same CPUs or Memory Nodes.
A cpuset that is cpu exclusive has a sched domain associated with it.
The sched domain consists of all cpus in the current cpuset that are not
part of any exclusive child cpusets.
This ensures that the scheduler load balancing code only balances
against the cpus that are in the sched domain as defined above and not
all of the cpus in the system. This removes any overhead due to
load balancing code trying to pull tasks outside of the cpu exclusive
cpuset only to be prevented by the tasks' cpus_allowed mask.

A cpuset that is mem_exclusive restricts kernel allocations for
page, buffer and other data commonly shared by the kernel across
multiple users.  All cpusets, whether mem_exclusive or not, restrict
allocations of memory for user space.  This enables configuring a
system so that several independent jobs can share common kernel
data, such as file system pages, while isolating each jobs user
allocation in its own cpuset.  To do this, construct a large
mem_exclusive cpuset to hold all the jobs, and construct child,
non-mem_exclusive cpusets for each individual job.  Only a small
amount of typical kernel memory, such as requests from interrupt
handlers, is allowed to be taken outside even a mem_exclusive cpuset.

User level code may create and destroy cpusets by name in the cpuset
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.

2.2.1.3   OARSH

"oarsh" is a wrapper around the "ssh" command (tested with openSSH). Its goal is to propagate two environment variables:

  • OAR_CPUSET : the name of the OAR job cpuset
  • SUDO_USER : the name of the user who launched the oarsh command

So "oarsh" must be run by the oar user, and a regular user must run it via the "sudowrapper" script to become oar. In this way, each cluster user who can execute "oarsh" via "sudowrapper" can connect to any cluster node (if oarsh is installed everywhere).

2.2.1.4   OARSH_SHELL

"oarsh_shell" must be the shell of the oar user on each node where you want oarsh to work. This script takes the "OAR_CPUSET" and "SUDO_USER" environment variables and adds its PID to the OAR_CPUSET cpuset. Then it looks up the user's shell and home directory and executes the appropriate command (as ssh would).

2.2.1.5   NOTES

  • On each node you must add, in the SSH server configuration file:

    AcceptEnv OAR_CPUSET SUDO_USER
    

    In Debian the file is "/etc/ssh/sshd_config"

  • You can use scp with oarsh. The syntax is:

    scp -S /path/to/oarsh ...
    
  • You can restrict the use of oarsh with the sudo configuration:

    %oarsh ALL=(oar) NOPASSWD: /path/to/oarsh
    

    Here, only users in the oarsh group can execute oarsh.

2.3   Visualization tools installation

There are two different tools. One, named Monika, displays the current cluster state with all active and waiting jobs. The other, named drawgantt, displays node occupation over time. These tools are CGI scripts and generate HTML pages.

You can install them in this way:

drawgantt:

  • Make sure you installed the "ruby", "libdbd-mysql-ruby" or "libdbd-pg-ruby" and "libgd-ruby1.8" packages.
  • Copy "drawgantt.cgi" and "drawgantt.conf" into the CGI folder of your web server (ex: /usr/lib/cgi-bin/ for Debian).
  • Copy all icon and javascript files into a folder where the web server can find them (ex: /var/www/oar/Icons).
  • Make sure that these files can be read by the web server user.
  • Edit "drawgantt.conf" and change the tags to fit your configuration.

Monika:

  • The package "perl-AppConfig" is required.
  • Read INSTALL file in the monika repository.

2.4   Debian packages

OAR is also released as Debian (or Ubuntu) packages. You can find them at http://oar.imag.fr/download.html.

If you want to add it as a new source in your /etc/apt/sources.list then add the line:

deb http://oar.imag.fr/download ./
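
You can then install the packages with APT as usual; the package names below are an assumption based on the init script name used later in this document:

apt-get update
apt-get install oar-server    # on the server
apt-get install oar-user      # on the login nodes
apt-get install oar-node      # on the computing nodes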

The installation will ask you if you want to initialize the nodes. It will copy the oar SSH key on each specified node. You can skip this step but then you will have to do it manually.

After installing the packages, you have to edit the configuration file on the server, submission nodes and computing nodes to fit your needs.

2.5   Starting

First, you must start the OAR daemon on the server (its name is "Almighty").

  • if you have installed OAR from sources, become the oar user and launch the command "Almighty" (it is located in $PREFIX/sbin).
  • if you have installed OAR from Debian packages, use the script "/etc/init.d/oar-server" to start the daemon.

Then you have to insert new resources in the database via the command oarnodesetting. If you want to get an idea of how it works, launch $PREFIX/oar/detect_new_resources.sh. It will print the right commands to execute, with appropriate values for the memory and cpuset properties.
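
For instance, a single resource could be declared by hand like this (the host name and the property values are only illustrative):

oarnodesetting -a -h node1 -p "cpu=1" -p "cpuset=0"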

If you want to initialize your whole cluster in one command you can use the following (tune it to fit your cluster). You must be the oar user to run this command, because oarnodesetting will be called and sentinelle.pl will log onto all nodes stored in the "node_list.txt" file without a password:

export PREFIX=/var/lib
$PREFIX/oar/sentinelle.pl -f node_list.txt \
-p "$PREFIX/oar/detect_new_resources.sh" | sh

Then you can launch the oarnodes command and see all the new resources inserted.

2.6   Further information

For further information, please check http://oar.imag.fr/.

3   User guide

3.1   Description of the different commands

All user commands are installed on cluster login nodes. So you must connect to one of these computers first.

3.1.1   oarstat

This command prints information about the current jobs (running or waiting) on the terminal.

Options

-f                    : prints each job in full detail
-j job_id             : prints information about the specified job_id (even if it is finished)
--sql "sql where"     : restricts the display with the SQL where clause on the table jobs
-g "d1,d2"            : prints the history of jobs and the state of resources between two dates
-D                    : formats output in Perl Dumper
-X                    : formats output in XML
-Y                    : formats output in YAML

Examples

# oarstat
# oarstat -j 42 -f
# oarstat --sql "project = 'p1'"

3.1.2   oarnodes

This command prints information about cluster resources (state, which jobs run on which resources, resource properties, ...).

Options

-a                : shows all resources with their properties
-r                : shows only the properties of a resource
-s                : shows only resource states
-l                : shows only the resource list
--sql "sql where" : displays resources which match this SQL where clause
-D                : formats output in Perl Dumper
-X                : formats output in XML
-Y                : formats output in YAML

Examples

# oarnodes
# oarnodes -s
# oarnodes --sql "state = 'Suspected'"

3.1.3   oarsub

The user can submit a job with this command. So, what is a job in our context?

A job is defined by the needed resources and a script/program to run. So, the user must specify how many resources, and of what kind, his application needs. Thus, the OAR system will decide whether to grant the request, and will control the execution. When a job is launched, OAR executes the user program only on the first node of the reservation. This program can read some environment variables to learn about its environment:

$OAR_NODEFILE                 contains the name of a file which lists
                              all reserved nodes for this job
$OAR_JOB_ID                   contains the OAR job identifier
$OAR_RESOURCE_PROPERTIES_FILE contains the name of a file which lists
                              all resources used by the job and their
                              properties
$OAR_JOB_NAME                 name of the job given by the "-n" option
$OAR_PROJECT_NAME             job project name
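
For instance, a minimal job script can use these variables as follows:

#!/bin/sh
# print the job id and the list of nodes OAR reserved for this job
echo "job $OAR_JOB_ID running on:"
cat $OAR_NODEFILE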

Options:

-q "queuename" : specify the queue for this job
-I : turn on INTERACTIVE mode (OAR gives you a shell instead of executing a
     script)
-l "resource description" : defines the resource list requested for this job;
                            the different parameters are resource properties
                            registered in the OAR database; see examples below.
                            (walltime : requested maximum duration, of the form
                            [hour:mn:sec|hour:mn|hour]; after this elapsed
                            time, the job will be killed)
-p "properties" : adds constraints for the job
                  (format is a WHERE clause from the SQL syntax)
-S, --Scanscript : in batch mode, asks oarsub to scan the given script for
                   OAR directives (#OAR -l ...)
-r "2007-05-11 23:32:03" : asks for a reservation job to begin at the
                           date in argument
-C job_id : connects to a reservation in Running state
-k "duration" : asks OAR to send the checkpoint signal to the first process
                of the job "duration" seconds before the walltime
--signal "signal name" : specify the signal to use when checkpointing
-t "type name" : specify a specific type (deploy, besteffort, cosystem,
                 checkpoint)
-d "directory path" : specify the directory where to launch the command
                      (default is current directory)
-n "job name" :  specify an arbitrary name for the job
-a job_id : anterior job that must be terminated to start this new one
--project : Specify the name of the project corresponding to the job.
--notify "method" : specify a notification method (mail or command); ex:
                    --notify "mail:name@domain.com"
                    --notify "exec:/path/to/script args"
--stdout "file name" : specify the name of the standard output file
--stderr "file name" : specify the name of the error output file
--resubmit job_id : resubmit the given job as a new one
--force_cpuset_name "cpuset name" : instead of using the job id for the cpuset
                                    name, you can specify one (WARNING: if
                                    several jobs have the same cpuset name
                                    then the processes of one job could be
                                    killed when another job finishes on the
                                    same computer)
--hold : Set the job state into Hold instead of Waiting; so it is not scheduled (you must run "oarresume" to turn it into the Waiting state).
-D : prints the result in Perl Dumper format.
-X : prints the result in XML format.
-Y : prints the result in YAML format.

Wanted resources have to be described in a hierarchical manner (this is the syntax of the "-l" option).

Moreover it is possible to give a property that they must match.

So the long and complete syntax is of the form:

"{ sql1 }/prop1=1/prop2=3+{sql2}/prop3=2/prop4=1/prop5=1+...,walltime=1:00:00"
where:
  • sql1 : WHERE sql clause on the table resources that filters resource names used in the hierarchical description
  • prop1 : first type of resources
  • prop2 : second type of resources
  • + : add another resource hierarchy to the previous one
  • sql2 : WHERE sql clause to apply on the second hierarchy request
  • ...

So we want to reserve 3 resources with the same value of the type prop2 and with the same property prop1, and these resources must fit sql1. To those resources we want to add 2 others which fit sql2 and the hierarchy /prop3=2/prop4=1/prop5=1.

Hierarchical resource example

(figure hierarchical_resources.svg: example of a resource hierarchy and 2 different oarsub commands)

Examples

# oarsub -l /node=4 test.sh

(the "test.sh" script will be run on 4 entire nodes in the default queue with the default walltime)

# oarsub -q default -l walltime=50:30:00,/node=10/cpu=3,walltime=2:15:00 \
  -p "switch = 'sw1'" /home/users/toto/prog

(the "/home/users/toto/prog" script will be run on 10 nodes with 3 cpus on each (so a total of 30 cpus) in the default queue, with a walltime of 2:15:00. Moreover the "-p" option restricts the resources to those on the switch 'sw1')

# oarsub -r "2004-04-27 11:00:00" -l /node=12/cpu=2

(a reservation will begin at "2004-04-27 11:00:00" on 12 nodes with 2 cpus on each one)

#  oarsub -C 42

(connects to the job 42 on the first node and sets all OAR environment variables)

# oarsub -I

(gives a shell on a resource)

3.1.4   oardel

This command is used to delete or checkpoint job(s). Jobs are designated by their identifiers.

Option

--sql     : delete/checkpoint jobs which match the SQL where clause
            on the table jobs (ex: "project = 'p1'")
-c job_id : send the checkpoint signal to the job (the signal that was
            defined with the "--signal" option in oarsub)

Examples

# oardel 14 42

(delete jobs 14 and 42)

# oardel -c 42

(send checkpoint signal to the job 42)

3.1.5   oarhold

This command is used to remove a job from the scheduling queue if it is in the "Waiting" state.

Moreover, if its state is "Running", oarhold can suspend the execution and enable other jobs to use its resources. In that case, a SIGSTOP signal is sent to every process of the job.

Options

--sql : hold jobs which match the SQL where clause on the table
        jobs (ex: "project = 'p1'")
-r    : manage not only Waiting jobs but also Running ones
        (can suspend the job)
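
Examples (the job identifier 42 is only illustrative):

# oarhold -r 42

(suspend the running job 42; it can later be resumed with "oarresume 42")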

3.1.6   oarresume

This command resumes jobs in the Hold or Suspended states.

Option

--sql : resume jobs which match the SQL where clause on the table
        jobs (ex: "project = 'p1'")

3.2   Visualisation tools

3.2.1   Monika

This is a web CGI normally installed on the cluster frontend. This tool executes oarnodes and oarstat, then formats the data in an HTML page.

Thus you can get a global view of the cluster state and see where your jobs are running.

3.2.2   DrawOARGantt

This is also a web CGI. It creates a Gantt chart which shows the distribution of jobs on nodes over time. It is very useful to see cluster occupation in the past and to know when a job will be launched in the future.

4   Administrator guide

4.1   Administrator commands

4.1.1   oarproperty

This command manages OAR resource properties stored in the database.

Options are:

-l : list properties
-a NAME : add a property
  -c : the new SQL field will be of type VARCHAR(255) (default is integer)
-d NAME : delete a property
-r "OLD_NAME,NEW_NAME" : rename property OLD_NAME into NEW_NAME

Examples:

# oarproperty -a cpu_freq
# oarproperty -a type
# oarproperty -r "cpu_freq,freq"

4.1.2   oarnodesetting

This command permits changing the state or a property of a node or of several resources.

By default the node name used by oarnodesetting is the result of the command hostname.

Options are:

-a    : add a new resource
-s    : state to assign to the node:
        * "Alive" : a job can be run on the node.
        * "Absent" : administrator wants to remove the node from the pool
           for a moment.
        * "Dead" : the node will not be used and will be deleted.
-h    : specify the node name (override hostname).
-r    : specify the resource number
--sql : get the resource identifiers which match the SQL where clause
        on the table resources
        (ex: "type = 'default'")
-p    : change the value of a property on the specified resources.
-n    : specify this option if you do not want to wait for the end of the jobs
        running on this node when you change its state into "Absent" or "Dead".
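
Examples (the node name is only illustrative):

# oarnodesetting -s Absent -h node12 -n

(pull node12 out of the pool immediately, without waiting for the end of its running jobs)

# oarnodesetting -s Alive -h node12

(give node12 back to the pool)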

4.1.3   oarremoveresource

This command permits removing a resource from the database.

The resource must be in the "Dead" state (use oarnodesetting to achieve this), then you can use this command to delete it.

4.1.4   oaraccounting

This command updates the accounting table for jobs that have ended since the last run.

4.1.5   oarnotify

This command sends commands to the Almighty module and manages scheduling queues.

Options are:

    Almighty_tag    send this tag to the Almighty (default is TERM)
-e                  activate an existing queue
-d                  deactivate an existing queue
-E                  activate all queues
-D                  deactivate all queues
--add_queue         add a new queue; the syntax is name,priority,scheduler
                    (ex: "name,3,oar_sched_gantt_with_timesharing")
--remove_queue      remove an existing queue
-l                  list all queues and their status
-h                  show this help screen
-v                  print the OAR version number

4.2   Database scheme

(figure db_scheme.svg: database scheme; red lines denote PRIMARY KEYs, blue lines denote INDEXes)

Note: all dates and durations are stored as integers (number of seconds since the Epoch).

4.2.1   accounting

Fields Types Descriptions
window_start INT UNSIGNED start date of the accounting interval
window_stop INT UNSIGNED stop date of the accounting interval
accounting_user VARCHAR(20) user name
accounting_project VARCHAR(255) name of the related project
queue_name VARCHAR(100) queue name
consumption_type ENUM("ASKED", "USED") "ASKED" corresponds to the walltimes specified by the user. "USED" corresponds to the effective time used by the user.
consumption INT UNSIGNED number of seconds used
Primary key: window_start, window_stop, accounting_user, queue_name, accounting_project, consumption_type
Index fields: window_start, window_stop, accounting_user, queue_name, accounting_project, consumption_type

This table is a summary of the consumption of each user on each queue. It speeds up queries about user consumption and statistics generation.

Data are inserted through the oaraccounting command (when a job is processed, the accounted field in the jobs table is set to "YES"). So it is possible to regenerate this table completely in this way:

  • Delete all data of the table:

    DELETE FROM accounting;
    
  • Set the field accounted in the table jobs to "NO" for each row:

    UPDATE jobs SET accounted = "NO";
    
  • Run the oaraccounting command.
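
Put together, and assuming a MySQL database named "oar" with illustrative credentials, the whole regeneration can be scripted like this:

mysql -u oar -p oar <<'EOF'
DELETE FROM accounting;
UPDATE jobs SET accounted = "NO";
EOF
oaraccounting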

You can change the amount of time covered by each window: edit the OAR configuration file and change the value of the ACCOUNTING_WINDOW tag.

4.2.2   admission_rules

Fields Types Descriptions
id INT UNSIGNED id number
rule VARCHAR(255) rule written in Perl applied when a job is going to be registered
Primary key: id
Index fields: None

You can use these rules to change the values of some properties when a job is submitted. Each admission rule is executed in the order of the id field and can set several variables. If one of them dies then the following ones are not evaluated and oarsub returns an error.

Some examples are better than a long description:

  • Specify the default value for queue parameter

    INSERT INTO admission_rules (rule) VALUES('
      if (not defined($queue_name)) {
          $queue_name="default";
      }
    ');
    
  • Avoid users except oar to go in the admin queue

    INSERT INTO admission_rules (rule) VALUES ('
      if (($queue_name eq "admin") && ($user ne "oar")) {
        die("[ADMISSION RULE] Only oar user can submit jobs in the admin queue\\n");
      }
    ');
    
  • Restrict the maximum of the walltime for interactive jobs

    INSERT INTO admission_rules (rule) VALUES ('
      my $max_walltime = iolib::sql_to_duration("12:00:00");
      if ($jobType eq "INTERACTIVE"){
        foreach my $mold (@{$ref_resource_list}){
          if (
            (defined($mold->[1])) and
            ($max_walltime < $mold->[1])
          ){
            print("[ADMISSION RULE] Walltime too big for an INTERACTIVE job so it is set to $max_walltime.\\n");
            $mold->[1] = $max_walltime;
          }
        }
      }
    ');
    
  • Specify the default walltime

    INSERT INTO admission_rules (rule) VALUES ('
      my $default_wall = iolib::sql_to_duration("2:00:00");
      foreach my $mold (@{$ref_resource_list}){
        if (!defined($mold->[1])){
          print("[ADMISSION RULE] Set default walltime to $default_wall.\\n");
          $mold->[1] = $default_wall;
        }
      }
    ');
    
  • How to perform actions if the user name is in a file

    INSERT INTO admission_rules (rule) VALUES ('
      open(FILE, "/tmp/users.txt");
      while (($queue_name ne "admin") and ($_ = <FILE>)){
        if ($_ =~ m/^\\s*$user\\s*$/m){
          print("[ADMISSION RULE] Change assigned queue into admin\\n");
          $queue_name = "admin";
        }
      }
      close(FILE);
    ');
    

4.2.3   event_logs

Fields Types Descriptions
event_id INT UNSIGNED event identifier
type VARCHAR(50) event type
job_id INT UNSIGNED job related to the event
date INT UNSIGNED event date
description VARCHAR(255) textual description of the event
to_check ENUM('YES', 'NO') specify if the module NodeChangeState must check this event to Suspect or not some nodes
Primary key: event_id
Index fields: type, to_check

The different event types are:

  • "PING_CHECKER_NODE_SUSPECTED" : the system detected, via the "finaud" module, that a node is not responding.
  • "PROLOGUE_ERROR" : an error occurred during the execution of the job prologue (exit code != 0).
  • "EPILOGUE_ERROR" : an error occurred during the execution of the job epilogue (exit code != 0).
  • "CANNOT_CREATE_TMP_DIRECTORY" : OAR cannot create the directory where all information files will be stored.
  • "CAN_NOT_WRITE_NODE_FILE" : the system was not able to write the file which had to contain the node list on the first node (/tmp/OAR_job_id).
  • "CAN_NOT_WRITE_PID_FILE" : the system was not able to write the file which had to contain the pid of the oarexec process on the first node (/tmp/pid_of_oarexec_for_job_id).
  • "USER_SHELL" : the system was not able to get information about the user shell on the first node.
  • "EXIT_VALUE_OAREXEC" : the oarexec process terminated with an unknown exit code.
  • "SEND_KILL_JOB" : signals that OAR has transmitted a kill signal to the oarexec of the specified job.
  • "LEON_KILL_BIPBIP_TIMEOUT" : the Leon module has detected that something went wrong during the kill of a job, and so killed the local bipbip process.
  • "EXTERMINATE_JOB" : the Leon module has detected that something went wrong during the kill of a job, and so cleaned the database and terminated the job artificially.
  • "WORKING_DIRECTORY" : the directory from which the job was submitted does not exist on the node assigned by the system.
  • "OUTPUT_FILES" : OAR cannot write the output files (stdout and stderr) in the working directory.
  • "CANNOT_NOTIFY_OARSUB" : OAR cannot notify the oarsub process for an interactive job (maybe the user killed this process).
  • "WALLTIME" : the job has reached its walltime.
  • "SCHEDULER_REDUCE_NB_NODES_FOR_RESERVATION" : there are not enough nodes for the reservation, so the scheduler does its best and gives fewer nodes than the user wanted (this occurs when nodes become Suspected or Absent).
  • "BESTEFFORT_KILL" : the job is of the besteffort type and was killed because a normal job wanted its nodes.
  • "FRAG_JOB_REQUEST" : someone wants to delete a job.
  • "CHECKPOINT" : the checkpoint signal was sent to the job.
  • "CHECKPOINT_ERROR" : OAR cannot send the signal to the job.
  • "CHECKPOINT_SUCCESS" : the system has sent the signal correctly.
  • "SERVER_EPILOGUE_TIMEOUT" : the server epilogue script timed out.
  • "SERVER_EPILOGUE_EXIT_CODE_ERROR" : the server epilogue script did not return 0.
  • "SERVER_EPILOGUE_ERROR" : cannot find the server epilogue script file.
  • "SERVER_PROLOGUE_TIMEOUT" : the server prologue script timed out.
  • "SERVER_PROLOGUE_EXIT_CODE_ERROR" : the server prologue script did not return 0.
  • "SERVER_PROLOGUE_ERROR" : cannot find the server prologue script file.
  • "CPUSET_CLEAN_ERROR" : OAR cannot correctly clean the cpuset files for a job on the remote node.
  • "MAIL_NOTIFICATION_ERROR" : a mail cannot be sent.
  • "USER_MAIL_NOTIFICATION" : user mail notification cannot be performed.
  • "USER_EXEC_NOTIFICATION_ERROR" : user script execution notification cannot be performed.
  • "BIPBIP_BAD_JOBID" : error when retrieving information about a running job.
  • "BIPBIP_CHALLENGE" : OAR is configured to detach jobs when they are launched on compute nodes and the job returned a bad challenge number.
  • "RESUBMIT_JOB_AUTOMATICALLY" : the job was automatically resubmitted.
  • "REDUCE_RESERVATION_WALLTIME" : the reservation job was shrunk.
  • "SSH_TRANSFER_TIMEOUT" : the OAR node part script took too long to transfer.
  • "BAD_HASHTABLE_DUMP" : OAR transferred a bad hashtable.
  • "LAUNCHING_OAREXEC_TIMEOUT" : oarexec took too long to initialize itself.
  • "RESERVATION_NO_NODE" : all nodes were detected as bad for the reservation job.

4.2.4   event_log_hostnames

Fields Types Descriptions
event_id INT UNSIGNED event identifier
hostname VARCHAR(255) name of the node where the event occurred
Primary key: event_id
Index fields: hostname

This table stores hostnames related to events like "PING_CHECKER_NODE_SUSPECTED".

4.2.5   files

Fields Types Descriptions
idFile INT UNSIGNED  
md5sum VARCHAR(255)  
location VARCHAR(255)  
method VARCHAR(255)  
compression VARCHAR(255)  
size INT UNSIGNED  
Primary key: idFile
Index fields: md5sum

4.2.6   frag_jobs

Fields Types Descriptions
frag_id_job INT UNSIGNED job id
frag_date INT UNSIGNED kill job decision date
frag_state ENUM('LEON', 'TIMER_ARMED' , 'LEON_EXTERMINATE', 'FRAGGED') DEFAULT 'LEON' state to tell Leon what to do
Primary key: frag_id_job
Index fields: frag_state

What do these states mean:

  • "LEON" : the Leon module must try to kill the job and change the state into "TIMER_ARMED".
  • "TIMER_ARMED" : the Sarko module must wait for a response from the job during a timeout (default is 60s).
  • "LEON_EXTERMINATE" : the Sarko module has decided that the job timed out and asks Leon to clean up the database.
  • "FRAGGED" : the job is fragged.

4.2.7   gantt_jobs_resources

Fields Types Descriptions
moldable_job_id INT UNSIGNED moldable job id
resource_id INT UNSIGNED resource assigned to the job
Primary key: moldable_job_id, resource_id
Index fields: None

This table specifies which resources are attributed to which jobs.

4.2.8   gantt_jobs_resources_visu

Fields Types Descriptions
moldable_job_id INT UNSIGNED moldable job id
resource_id INT UNSIGNED resource assigned to the job
Primary key: moldable_job_id, resource_id
Index fields: None

This table is the same as gantt_jobs_resources and is used by visualisation tools. It is updated atomically (a lock is used).

4.2.9   gantt_jobs_predictions

Fields Types Descriptions
moldable_job_id INT UNSIGNED job id
start_time INT UNSIGNED date when the job is scheduled to start
Primary key: moldable_job_id
Index fields: None

With this table and gantt_jobs_resources you can know exactly which decisions the schedulers took for each waiting job.

note: The special job id "0" is used to store the scheduling reference date.

4.2.10   gantt_jobs_predictions_visu

Fields Types Descriptions
moldable_job_id INT UNSIGNED job id
start_time INT UNSIGNED date when the job is scheduled to start
Primary key: moldable_job_id
Index fields: None

This table is the same as gantt_jobs_predictions and is used by visualisation tools. It is updated atomically (a lock is used).

4.2.11   jobs

Fields Types Descriptions
job_id INT UNSIGNED job identifier
job_name VARCHAR(100) name given by the user
cpuset_name VARCHAR(255) name of the cpuset directory used for this job on each node
job_type ENUM('INTERACTIVE', 'PASSIVE') DEFAULT 'PASSIVE' specify if the user wants to launch a program or get an interactive shell
info_type VARCHAR(255) some information about the oarsub command
state ENUM('Waiting','Hold', 'toLaunch', 'toError', 'toAckReservation', 'Launching', 'Running' , 'Finishing', 'Terminated', 'Error') job state
reservation ENUM('None', 'toSchedule', 'Scheduled') DEFAULT 'None' specify if the job is a reservation and the state of this one
message VARCHAR(255) readable information message for the user
job_user VARCHAR(20) user name
command TEXT program to run
queue_name VARCHAR(100) queue name
properties TEXT properties that assigned nodes must match
launching_directory VARCHAR(255) path of the directory where to launch the user process
submission_time INT UNSIGNED date when the job was submitted
start_time INT UNSIGNED date when the job was launched
stop_time INT UNSIGNED date when the job was stopped
file_id INT UNSIGNED  
accounted ENUM("YES", "NO") DEFAULT "NO" specify if the job was considered by the accounting mechanism or not
notify VARCHAR(255) gives the way to notify the user about the job (mail or script )
assigned_moldable_job INT UNSIGNED moldable job chosen by the scheduler
checkpoint INT UNSIGNED number of seconds before the walltime to send the checkpoint signal to the job
checkpoint_signal INT UNSIGNED signal to use when checkpointing the job
stdout_file TEXT file name where to redirect program STDOUT
stderr_file TEXT file name where to redirect program STDERR
resubmit_job_id INT UNSIGNED if a job is resubmitted then the new one stores the id of the previous one
project VARCHAR(255) arbitrary name given by the user or an admission rule
suspended ENUM("YES","NO") specify if the job was suspended (oarhold)
job_env TEXT environment variables to set for the job
exit_code INT DEFAULT 0 exit code for passive jobs
job_group VARCHAR(255) not used
Primary key: job_id
Index fields: state, reservation, queue_name, accounted, suspended

Explanation of the "state" field:

  • "Waiting" : the job is waiting for an OAR scheduler decision.
  • "Hold" : the user or administrator wants to hold the job (oarhold command), so it will not be scheduled by the system.
  • "toLaunch" : the OAR scheduler has attributed some nodes to the job, so it will be launched.
  • "toError" : something went wrong and the job is going into the error state.
  • "toAckReservation" : the OAR scheduler must say "YES" or "NO" to the waiting oarsub command because it requested a reservation.
  • "Launching" : OAR has launched the job and will execute the user command on the first node.
  • "Running" : the user command is executing on the first node.
  • "Finishing" : the user command has terminated and OAR is doing internal work.
  • "Terminated" : the job has terminated normally.
  • "Error" : a problem has occurred.

Explanation of the "reservation" field:

  • "None" : the job is not a reservation.
  • "toSchedule" : the job is a reservation and must be approved by the scheduler.
  • "Scheduled" : the job is a reservation and is scheduled by OAR.

4.2.12   job_dependencies

Fields Types Descriptions
job_id INT UNSIGNED job identifier
job_id_required INT UNSIGNED job needed to be completed before launching job_id
Primary key: job_id, job_id_required
Index fields: job_id, job_id_required

This table is fed by the oarsub command with the "-a" option.

4.2.13   moldable_job_descriptions

Fields Types Descriptions
moldable_id INT UNSIGNED job identifier
moldable_job_id INT UNSIGNED corresponding job identifier
moldable_walltime INT UNSIGNED instance duration
Primary key: moldable_id
Index fields: moldable_job_id

A job can be described by several instances, so the OAR scheduler can choose one of them; for example it can calculate which instance will finish first. This table stores all the instances of all jobs.

4.2.14   job_resource_groups

Fields Types Descriptions
res_group_id INT UNSIGNED group identifier
res_group_moldable_id INT UNSIGNED corresponding moldable job identifier
res_group_property TEXT SQL constraint properties
Primary key: res_group_id
Index fields: res_group_moldable_id

As you can specify job-global properties with oarsub and the "-p" option, you can do the same thing for each resource group that you define with the "-l" option.

4.2.15   job_resource_descriptions

Fields Types Descriptions
res_job_group_id INT UNSIGNED corresponding group identifier
res_job_resource_type VARCHAR(255) resource type (name of a field in resources)
res_job_value INT wanted resource number
res_job_order INT UNSIGNED order of the request
Primary key: res_job_group_id, res_job_resource_type, res_job_order
Index fields: res_job_group_id

This table stores the hierarchical resource description given with oarsub and the "-l" option.

4.2.16   job_state_logs

Fields Types Descriptions
job_id INT UNSIGNED corresponding job identifier
job_state ENUM('Waiting', 'Hold', 'toLaunch', 'toError', 'toAckReservation', 'Launching', 'Finishing', 'Terminated', 'Error') job state during the interval
date_start INT UNSIGNED start date of the interval
date_stop INT UNSIGNED end date of the interval
Primary key: None
Index fields: job_id, job_state

This table keeps information about the state changes of jobs.

4.2.17   job_types

Fields Types Descriptions
job_id INT UNSIGNED corresponding job identifier
type VARCHAR(255) job type like "deploy", "timesharing", ...
Primary key: None
Index fields: job_id, type

This table stores the job types given with the oarsub command and the "-t" option.

4.2.18   resources

Fields Types Descriptions
resource_id INT UNSIGNED resource identifier
type VARCHAR(100) DEFAULT "default" resource type (used for licence resources for example)
network_address VARCHAR(100) node name (used to connect via SSH)
state ENUM('Alive', 'Dead' , 'Suspected', 'Absent') resource state
next_state ENUM('UnChanged', 'Alive', 'Dead', 'Absent', 'Suspected') DEFAULT 'UnChanged' state for the resource to switch
finaud_decision ENUM('YES', 'NO') DEFAULT 'NO' tells whether the current state results from a decision of the "finaud" module
next_finaud_decision ENUM('YES', 'NO') DEFAULT 'NO' tells whether the next state results from a decision of the "finaud" module
state_num INT corresponding state number (useful with the SQL "ORDER" query)
suspended_jobs ENUM('YES','NO') specify if there is at least one suspended job on the resource
switch VARCHAR(50) name of the switch
cpu INT UNSIGNED global cluster cpu number
cpuset INT UNSIGNED field used with the CPUSET_RESOURCE_PROPERTY_DB_FIELD
besteffort ENUM('YES','NO') accept or not besteffort jobs
deploy ENUM('YES','NO') specify if the resource is deployable
expiry_date INT UNSIGNED field used for the desktop computing feature
desktop_computing ENUM('YES','NO') tell if it is a desktop computing resource (with an agent)
last_job_date INT UNSIGNED store the date when the resource was used for the last time
cm_availability INT UNSIGNED used by the compute mode features to know if an Absent resource can be switched on
Primary key: resource_id
Index fields: state, next_state, type, suspended_jobs

State explanations:

  • "Alive" : the resource is ready to accept a job.
  • "Absent" : the oar administrator has decided to pull out the resource. This computer can come back.
  • "Suspected" : OAR system has detected a problem on this resource and so has suspected it (you can look in the event_logs table to know what has happened). This computer can come back (automatically if this is a "finaud" module decision).
  • "Dead" : the OAR administrator considers that the resource will not come back and will remove it from the pool.

This table permits specifying different properties for each resource. These can be used with the oarsub command ("-p" and "-l" options).

You can add your own properties with oarproperty command.

These properties can be updated with the oarnodesetting command ("-p" option).

Several properties are added by default:

  • switch : you have to register the name of the switch where the node is plugged.
  • cpu : a unique number given to each cpu of the cluster. This enables the OAR scheduler to distinguish all cpus.
  • cpuset : the number of the cpu on the node. The Linux kernel sets this to an integer beginning at 0. This field is linked to the configuration tag CPUSET_RESOURCE_PROPERTY_DB_FIELD.
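
For instance, registering the switch of a node and the numbering of one of its resources could be done like this (the names and numbers are only illustrative):

oarnodesetting -h node1 -p "switch=sw1"
oarnodesetting -r 12 -p "cpu=5" -p "cpuset=0"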

4.2.19   resource_logs

Fields Types Descriptions
resource_log_id INT UNSIGNED unique id
resource_id INT UNSIGNED resource identifier
attribute VARCHAR(255) name of corresponding field in resources
value VARCHAR(255) value of the field
date_start INT UNSIGNED interval start date
date_stop INT UNSIGNED interval stop date
finaud_decision ENUM('YES','NO') store if this is a system change or a human one
Primary key: None
Index fields: resource_id, attribute

This table keeps a trace of every property change (a consequence of using the oarnodesetting command with the "-p" option).

4.2.20   assigned_resources

Fields Types Descriptions
moldable_job_id INT UNSIGNED job id
resource_id INT UNSIGNED resource assigned to the job
Primary key: moldable_job_id, resource_id
Index fields: moldable_job_id

This table keeps track of which resources each job was scheduled on.

4.2.21   queues

Fields Types Descriptions
queue_name VARCHAR(100) queue name
priority INT UNSIGNED the scheduling priority
scheduler_policy VARCHAR(100) path of the associated scheduler
state ENUM('Active', 'notActive') DEFAULT 'Active' permits to stop the scheduling for a queue
Primary key: queue_name
Index fields: None

This table contains the schedulers executed by the oar_meta_scheduler module. The executables are launched one after another, in the specified priority order.

4.2.22   challenges

Fields Types Descriptions
job_id INT UNSIGNED job identifier
challenge VARCHAR(255) challenge string
Primary key: job_id
Index fields: None

This table is used to share a secret between the OAR server and the oarexec process on computing nodes (this avoids a job id being stolen/forged by a malicious user).

For security reasons, this table must not be readable by a database account given to users who want to access OAR internal information (like statistics).
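
For example, with MySQL you could grant a read-only statistics account table by table and simply never grant the challenges table (the account name and password are only illustrative):

mysql -u root -p <<'EOF'
GRANT SELECT ON oar.jobs TO 'oar_ro'@'%' IDENTIFIED BY 'secret';
GRANT SELECT ON oar.accounting TO 'oar_ro'@'%';
-- grant the remaining tables the same way, but never oar.challenges
EOF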

5   Configuration file

Each configuration tag found in /etc/oar.conf is now described:

  • Database type : you can use a MySQL or a PostgreSQL database (tags are "mysql" or "Pg"):

    DB_TYPE = mysql
    
  • Database hostname:

    DB_HOSTNAME=localhost
    
  • Database base name:

    DB_BASE_NAME=oar
    
  • DataBase user name:

    DB_BASE_LOGIN=oar
    
  • DataBase user password:

    DB_BASE_PASSWD=oar
    
  • OAR server hostname:

    SERVER_HOSTNAME=localhost
    
  • OAR server port:

    SERVER_PORT=6666
    
  • When the user does not specify a -l option then OAR uses this:

    OARSUB_DEFAULT_RESOURCES = /resource_id=1
    
  • Specify where we are connected in the deploy queue (the node to connect to when the job is in the deploy queue):

    DEPLOY_HOSTNAME = 127.0.0.1
    
  • Specify where we are connected with a job of the cosystem type:

    COSYSTEM_HOSTNAME = 127.0.0.1
    
  • Set DETACH_JOB_FROM_SERVER to 1 if you do not want to keep a ssh connection between the node and the server. Otherwise set this tag to 0:

    DETACH_JOB_FROM_SERVER=1
    
  • By default OAR uses the ping command to detect whether nodes are down or not. To enhance this diagnostic you can specify one of these other methods (give the complete command path):

    • OAR sentinelle:

      SENTINELLE_COMMAND=/usr/bin/sentinelle -cconnect=ssh,timeout=3000
      

      If you use sentinelle.pl then you must use this tag:

      SENTINELLE_SCRIPT_COMMAND=/var/lib/oar/sentinelle.pl -t 5 -w 20
      
    • OAR fping:

      FPING_COMMAND=/usr/bin/fping -q
      
    • OAR nmap : it will test to connect on the ssh port (22):

      NMAP_COMMAND=/usr/bin/nmap -p 22 -n -T5
      
    • OAR generic : a specific script may be used instead of ping to check aliveness of nodes. The script must return bad nodes on STDERR (1 line for a bad node and it must have exactly the same name that OAR has given in argument of the command):

      GENERIC_COMMAND=/path/to/command arg1 arg2
      
  • OAR log level: 3 (debug+warnings+errors), 2 (warnings+errors), 1 (errors):

    LOG_LEVEL=2
    
  • OAR log file:

    LOG_FILE=/var/log/oar.log
    
  • If you want to debug oarexec on nodes then set this to 1 (only effective if DETACH_JOB_FROM_SERVER = 1):

    OAREXEC_DEBUG_MODE=0
    
  • OAR allowed networks: networks or hosts allowed to submit jobs to OAR and compute nodes may be specified here (0.0.0.0/0 means all IPs are allowed, 127.0.0.1/32 means only the IP 127.0.0.1 is allowed):

    ALLOWED_NETWORKS= 127.0.0.1/32 0.0.0.0/0
    
  • Set the granularity of the OAR accounting feature (in seconds). Default is 1 day (86400s):

    ACCOUNTING_WINDOW= 86400
    
  • OAR information may be notified by email to the administrator. Set the next lines according to your configuration to activate this feature:

    MAIL_SMTP_SERVER = smtp.serveur.com
    MAIL_RECIPIENT = user@domain.com
    MAIL_SENDER = oar@domain.com
    
  • Set the timeout for the prologue and epilogue execution on computing nodes:

    PROLOGUE_EPILOGUE_TIMEOUT = 60
    
  • Files to execute before and after each job on the first computing node (default is ~oar/oar_prologue and ~oar/oar_epilogue):

    PROLOGUE_EXEC_FILE = /path/to/prog
    EPILOGUE_EXEC_FILE = /path/to/prog
    
  • Set the timeout for the prologue and epilogue execution on the OAR server:

    SERVER_PROLOGUE_EPILOGUE_TIMEOUT = 60
    
  • Files to execute before and after each job on the OAR server:

    SERVER_PROLOGUE_EXEC_FILE = /path/to/prog
    SERVER_EPILOGUE_EXEC_FILE = /path/to/prog
    
  • Set the frequency for checking Alive and Suspected resources:

    FINAUD_FREQUENCY = 300
    
  • Set the time after which resources become Dead (default is 0, which means never):

    DEAD_SWITCH_TIME = 600
    
  • Maximum of seconds used by a scheduler:

    SCHEDULER_TIMEOUT = 10
    
  • Time to wait, when a reservation has not got all the resources that it reserved (some resources may have become Suspected or Absent since the job submission), before launching the job on the remaining resources:

    RESERVATION_WAITING_RESOURCES_TIMEOUT = 300
    
  • Time to add between each job (time for administration tasks or time to let computers reboot):

    SCHEDULER_JOB_SECURITY_TIME = 1
    
  • Minimum duration in seconds that can be considered as a hole where a job could be scheduled:

    SCHEDULER_GANTT_HOLE_MINIMUM_TIME = 300
    
  • You can add an order preference on the resources assigned by the system (SQL ORDER syntax):

    SCHEDULER_RESOURCE_ORDER = switch ASC, network_address DESC, resource_id ASC
    
  • This tells the scheduler to treat resources of these types, where there is a suspended job, as free ones, so that other jobs can be scheduled on these resources. (List the resource types separated by spaces; the default value is empty, so no other job can be scheduled on suspended job resources):

    SCHEDULER_AVAILABLE_SUSPENDED_RESOURCE_TYPE = default licence vlan
    
  • Name of the perl script that manages suspend/resume. You have to install your script in $OARDIR and give only the name of the file without the entire path. (default is suspend_resume_manager.pl):

    SUSPEND_RESUME_FILE = suspend_resume_manager.pl
    
  • Files to execute just after a job is suspended and just before a job is resumed:

    JUST_AFTER_SUSPEND_EXEC_FILE = /path/to/prog
    JUST_BEFORE_RESUME_EXEC_FILE = /path/to/prog
    
  • Timeout for the two previous scripts:

    SUSPEND_RESUME_SCRIPT_TIMEOUT = 60
    
  • Indicates the name of the database field that contains the cpu number of the node. If this option is set then users must use OARSH instead of ssh to log onto the nodes that they have reserved via oarsub.

    CPUSET_RESOURCE_PROPERTY_DB_FIELD = cpuset
    
  • Name of the perl script that manages cpuset. You have to install your script in $OARDIR and give only the name of the file without the entire path. (default is cpuset_manager.pl which handles the linux kernel cpuset)

    CPUSET_FILE = cpuset_manager.pl
    
  • If you have installed taktuk and want to use it to manage cpusets then give the full command path (with your options, except "-m", "-o" and "-c", and without any taktuk command to execute):

    TAKTUK_CMD = /usr/bin/taktuk -s
    
  • If you want nodes to be started and stopped on demand, OAR gives you this API:

  • When the OAR scheduler wants some nodes to wake up, it launches this command with the node list as arguments (the scheduler looks at the cm_availability field in the resources table to know if the node will be started for long enough):

    SCHEDULER_NODE_MANAGER_WAKE_UP_CMD = /path/to/the/command with your args
    
  • When OAR considers that some nodes can be shut down, it launches this command with the node list as arguments:

    SCHEDULER_NODE_MANAGER_SLEEP_CMD = /path/to/the/command args
    
  • Parameter for the scheduler to decide when a node is idle (number of seconds since the last job terminated on the node):

    SCHEDULER_NODE_MANAGER_IDLE_TIME = 600
    
  • Parameter for the scheduler to decide if a node will have enough time to sleep (number of seconds before the next job):

    SCHEDULER_NODE_MANAGER_SLEEP_TIME = 600
    
  • Command to use to connect to other nodes (default is "ssh" in the PATH)

    OPENSSH_CMD = /usr/bin/ssh
    
  • These are configuration tags for OAR in the desktop-computing mode:

    DESKTOP_COMPUTING_ALLOW_CREATE_NODE=0
    DESKTOP_COMPUTING_EXPIRY=10
    STAGEOUT_DIR=/var/lib/oar/stageouts/
    STAGEIN_DIR=/var/lib/oar/stageins
    STAGEIN_CACHE_EXPIRY=144
    

6   Module descriptions

OAR can be decomposed into several modules which perform different tasks.

6.1   Almighty

This module is the OAR server. It decides what actions must be performed. It is divided into 2 processes:

  • One listens to a TCP/IP socket. It waits for information or commands from the OAR user programs or from the other modules.
  • The other deals with commands thanks to an automaton and launches the right modules one after another.

6.2   Sarko

This module is executed periodically by the Almighty (default is every 30 seconds).

The tasks of Sarko are:

  • Look at the walltimes of running jobs and ask to frag those that have expired.
  • Detect if fragged jobs are really fragged, otherwise ask to exterminate them.
  • In "Desktop Computing" mode, detect if a node expiry date has passed and ask to change its state into "Suspected".
  • Change "Suspected" resources into "Dead" after DEAD_SWITCH_TIME seconds.

6.3   Judas

This is the module dedicated to printing and logging every debug, warning and error message.

6.4   Leon

This module is in charge of deleting jobs. Other OAR modules or commands can ask to kill a job, and Leon performs the kill.

There are 2 frag types:

  • normal : Leon tries to connect to the first node allocated to the job and asks the oarexec process to terminate the job and end itself.
  • exterminate : if the normal method did not succeed after a timeout, Leon logs this case and cleans up the database for these jobs. In that case OAR doesn't know what happened on the node, and so Suspects it.

6.5   NodeChangeState

This module is in charge of changing resource states and checking if there are jobs on these.

It also checks all pending events in the table event_logs.

6.6   Scheduler

This module checks, for each reservation job, whether it is valid, and launches it at the right time.

The Scheduler launches all the gantt schedulers in the priority order specified in the database, and updates all the visualization tables (gantt_jobs_predictions_visu and gantt_jobs_resources_visu).

6.6.1   oar_sched_gantt_with_timesharing

This is the default OAR scheduler. It implements all the main functionalities: timesharing, moldable jobs, besteffort jobs, ...

By default, this scheduler is used by all default queues.

It implements a FIFO algorithm with backfilling. Some parameters can be changed in the configuration file (see SCHEDULER_TIMEOUT, SCHEDULER_JOB_SECURITY_TIME, SCHEDULER_GANTT_HOLE_MINIMUM_TIME, SCHEDULER_RESOURCE_ORDER, and the illustrative snippet below).
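
For orientation only, these tags live in oar.conf; only the tag names come from this document, and the values below are purely illustrative, not OAR defaults:

    SCHEDULER_TIMEOUT = 30
    SCHEDULER_JOB_SECURITY_TIME = 60
    SCHEDULER_GANTT_HOLE_MINIMUM_TIME = 300
    SCHEDULER_RESOURCE_ORDER = switch ASC, network_address DESC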

6.7   Runner

This module launches the effective OAR jobs. These processes run asynchronously with respect to the other modules.

For each job, the Runner uses OPENSSH_CMD to connect to the first node of the reservation and propagate a Perl script which handles the execution of the user command.

7   Mechanisms

7.2   Job launch

For PASSIVE jobs, the mechanism is similar to the INTERACTIVE one, except for the shell launched from the frontal node.

The job finishes when the user command ends. oarexec then returns its exit value (indicating which errors occurred) to the Almighty via the SERVER_PORT if DETACH_JOB_FROM_SERVER was set to 1; otherwise it returns directly.

7.3   CPUSET

If the "--force_cpuset_name" option of the oarsub command is not defined then OAR will use job identifier. The CPUSET name is effectively created on each nodes and is composed as "user_cpusetname".

So if a user specifies "--force_cpuset_name" option, he will not be able to disturb other users.

OAR system steps:

  1. Before each job, the Runner initializes the CPUSET (see CPUSET definition) with OPENSSH_CMD and an efficient launching tool: Taktuk. If Taktuk is not installed and configured (TAKTUK_CMD), OAR falls back on a less optimized internal launching tool.
  2. After each job, OAR deletes all processes stored in the associated CPUSET. Thus all nodes are clean after an OAR job.

You may choose not to use this feature, but then nothing guarantees that every user process will be killed at the end of a job.

If you want, you can implement your own cpuset management. This is done by editing 3 files (see also CPUSET installation); a sketch of the kernel interface involved is given after this list:

  • cpuset_manager.pl : this script creates the cpuset on each node and deletes it at the end of the job. For more information, look at the script itself (it contains several comments).
  • oarsh : (OARSH) this script replaces the standard "ssh" command. It gets the name of the cpuset in which it is running and transfers this information via "ssh" and the "SendEnv" option. In this file, you have to change the "get_current_cpuset" function.
  • oarsh_shell : (OARSH_SHELL) this script is the shell of the oar user on each node. It reads the environment variables and looks for a cpuset name. If there is one, it assigns the current process and its parent to this cpuset, so that all further user processes remain in the cpuset. In this file, you just have to change the "add_process_to_cpuset" function.
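
As an illustration of what "add_process_to_cpuset" has to do, here is a minimal, hypothetical sketch using the 2.6-kernel cpuset filesystem (the /dev/cpuset mount point and the error handling are assumptions; the real function in oarsh_shell may differ):

    # Hypothetical sketch: attach a PID to an already created cpuset by
    # writing it into the "tasks" file of the cpuset filesystem
    # (assumed to be mounted on /dev/cpuset).
    sub add_process_to_cpuset {
        my ($cpuset_name, $pid) = @_;
        my $tasks_file = "/dev/cpuset/$cpuset_name/tasks";
        open(my $fh, ">", $tasks_file)
            or die "cannot open $tasks_file: $!";
        print $fh "$pid\n";    # the kernel moves the process immediately
        close($fh);
    }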

7.4   Suspend/resume

Jobs can be suspended with the oarhold command (which sends a "SIGSTOP" to every process on every node of the job) to allow other jobs to be executed.

"Suspended" jobs can be resumed with the oarresume command (which sends a "SIGCONT" to every suspended process on every node). They go back to "Running" once their assigned resources are free.

IMPORTANT: This feature is available only if CPUSET is configured.

You can specify 2 scripts if you have to perform actions just after suspend (JUST_AFTER_SUSPEND_EXEC_FILE) and just before resume (JUST_BEFORE_RESUME_EXEC_FILE).
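
In oar.conf this could look like (the paths are placeholders, following the conventions of the configuration section above):

    JUST_AFTER_SUSPEND_EXEC_FILE = /path/to/suspend/script
    JUST_BEFORE_RESUME_EXEC_FILE = /path/to/resume/script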

Moreover, you can perform other actions than sending signals to processes: just edit the "suspend_resume_manager.pl" file.

7.5   Job deletion

Leon tries to connect to the OAR Perl script running on the first job node (found thanks to the file /tmp/oar/pid_of_oarexec_for_jobId_id) and sends it a "SIGTERM" signal. The script catches it and ends the job normally (killing the processes that it has launched).

If this method does not succeed, Leon flushes the OAR database for the job, and the nodes are turned "Suspected" by NodeChangeState.

If your job is checkpointed, is of the type idempotent (oarsub "-t" option) and its exit code is 0, then another job is automatically created and scheduled with the same behaviour.

7.6   Checkpoint

The checkpoint is just a signal sent to the program specified with the oarsub command.

If the user used the "-k" option, Sarko asks the OAR Perl script running on the first node to send the signal to the process (SIGUSR2, or the one specified with "--signal").

You can also use the oardel command to send the signal, as in the example below.
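
For instance, using the "-c" option mentioned in the changelog (the job identifier is just an example):

    oardel -c 4242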

7.7   Scheduling

General steps used to schedule a job:

  1. All previous scheduled jobs are stored in a Gantt data structure.
  2. All resources matching the property constraints of the job (the "-p" option and the indications inside the "{...}" parts of the oarsub "-l" option) are stored in a tree data structure according to the hierarchy given with the "-l" option.
  3. This tree is then given to the Gantt library to find the first hole where the job can be launched.
  4. The scheduler stores its decision into the database in the gantt_jobs_predictions and gantt_jobs_resources tables.

See User section from the FAQ for more examples and features.

7.8   User notification

This section explains how the "--notify" oarsub option is handled by OAR:

  • The user wants to receive an email:

    The syntax is "mail:name@domain.com". Mail section in the Configuration file must be present otherwise the mail cannot be sent.

  • The user wants to launch a script:

    The syntax is "exec:/path/to/script args". OAR server will connect (using OPENSSH_CMD) on the node where the oarsub command was invoked and then launches the script with in argument : job_id, job_name, tag, comments.

    (tag is a value in : "START", "END", "ERROR")
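
A minimal script for the "exec" case could look like this (a hypothetical sketch; only the argument order comes from the text above):

    #!/usr/bin/perl
    # Hypothetical "--notify exec:" target: append each notification
    # to a log file. OAR passes the arguments in this order.
    use strict;
    use warnings;
    my ($job_id, $job_name, $tag, $comments) = @ARGV;
    open(my $log, ">>", "$ENV{HOME}/oar_notifications.log")
        or die "cannot open log file: $!";
    print $log "[$tag] job $job_id ($job_name): $comments\n";
    close($log);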

7.9   Accounting aggregator

In the Configuration file you can set the ACCOUNTING_WINDOW parameter. The oaraccounting command then splits time into windows of this size and feeds the accounting table.

This makes retrieving usage statistics of the cluster easy and fast. You can see it as a "data warehousing" information extraction method.
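
For example, to aggregate usage by windows of one day (86400 is an illustrative value in seconds, not a documented default):

    ACCOUNTING_WINDOW = 86400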

7.10   Dynamic nodes coupling features

We are working with the Icatis company on clusters made of Intranet computers. These nodes can be switched into computing mode only at specific times, so we have implemented a functionality that can request some hardware to be powered on if it can join the cluster.

We use the cm_availability field of the resources table to know until when a node will be available in cluster mode (easily set with the oarnodesetting command). So when the OAR scheduler wants some potentially available computers to launch jobs, it executes the command SCHEDULER_NODE_MANAGER_WAKE_UP_CMD.

Moreover, if a node has not executed a job for SCHEDULER_NODE_MANAGER_IDLE_TIME seconds and no job is scheduled on it within the next SCHEDULER_NODE_MANAGER_SLEEP_TIME seconds, then OAR launches the command SCHEDULER_NODE_MANAGER_SLEEP_CMD.
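
Such a wake-up or sleep command only has to accept the node list as arguments; everything else is site specific. A hypothetical sketch (the wakeonlan tool and the MAC address mapping are assumptions, not part of OAR):

    #!/usr/bin/perl
    # Hypothetical SCHEDULER_NODE_MANAGER_WAKE_UP_CMD target:
    # OAR passes the names of the nodes to wake up as arguments.
    use strict;
    use warnings;
    # Assumed site-specific mapping from node name to MAC address.
    my %mac_of = ( "node1" => "00:11:22:33:44:55" );
    foreach my $node (@ARGV) {
        if (exists $mac_of{$node}) {
            # Assumes the wakeonlan tool is installed on the server.
            system("wakeonlan", $mac_of{$node});
        } else {
            warn "no MAC address known for $node\n";
        }
    }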

7.11   Timesharing

It is possible to share the time slot of a job with other jobs. To use this feature, you have to specify the type timesharing when you run oarsub.

You have 4 different ways to share your slot:

  1. timesharing=*,* : This is the default behaviour if nothing more than "timesharing" is specified. It indicates that the job can be shared with all users and any job name.
  2. timesharing=user,* : This indicates that the job can be shared only with jobs of the same user, whatever their names.
  3. timesharing=*,job_name : This indicates that the job can be shared with all users, but only with jobs of the same name.
  4. timesharing=user,job_name : This indicates that the job can be shared only with jobs of the same user and the same job name.

See User section from the FAQ for more examples and features.

7.12   Besteffort jobs

Besteffort jobs are scheduled in the besteffort queue. Their particularity is that they are deleted if a non-besteffort job wants the resources on which they are running.

For example, you can use this feature to maximize the use of your cluster with multiparametric jobs. This is what the CIGRI project does.

When you submit a job you have to use "-t besteffort" option of oarsub to specify that this is a besteffort job.

Note : a besteffort job cannot be a reservation.

7.13   Cosystem jobs

This feature makes it possible to reserve some resources without launching any program on the corresponding nodes. Thus OAR does nothing when such a job starts (no prologue, no epilogue, neither on the server nor on the nodes).

This is useful with another launching system that declares its time slots in OAR. So you can have two different batch schedulers.

When you submit a job, use the "-t cosystem" option of oarsub to specify that this is a cosystem job.

These jobs are stopped by the oardel command or when they reach their walltime.

7.14   Deploy jobs

This feature is useful when you want to let users reinstall their reserved nodes. OAR jobs will then not log on the first node of the reservation but on DEPLOY_HOSTNAME.

The prologue and epilogue scripts are therefore executed on DEPLOY_HOSTNAME, and if the user launches a script, it is also executed on DEPLOY_HOSTNAME.

OAR does nothing on the computing nodes because they will normally be rebooted to install a new system image.

This feature is strongly used in the Grid5000 project with Kadeploy tools.

When you submit a job you have to use "-t deploy" option of oarsub to specify that this is a deploy job.

7.15   Desktop computing

If you cannot contact the computers via SSH you can install the "desktop computing" OAR mode. This kind of installation is based on two programs:

  • oar-cgi : this is a web CGI used by the nodes to communicate with the OAR server.
  • oar-agent.pl : this program periodically asks the server web CGI what it has to do.

This method replaces the SSH command. Computers that want to register themselves into OAR just have to be able to contact the OAR HTTP server.

In this situation there is no NFS file system to share the same directories over all the nodes, so a stagein/stageout solution is needed. You can use the oarsub option "stagein" to migrate your data.

8   FAQ

8.1   User

8.1.1   How can I submit a moldable job?

You just have to use several "-l" oarsub options (one for each moldable description); an example is given below. By default, the OAR scheduler launches the moldable variant which will finish first.

So even if some resources are currently free, the scheduler may decide to start your job later, because more resources will then be free and the job walltime will be smaller.
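
A sketch of such a submission (the resource counts and walltimes are purely illustrative):

    oarsub -l /nodes=8,walltime=1:00:00 -l /nodes=4,walltime=2:30:00 /path/to/prog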

8.1.2   How can I submit a job with a non uniform description?

Example:

oarsub -I -l '{switch = "sw1" or switch = "sw5"}/switch=1+/node=1'

This example asks OAR to reserve all the resources of one switch, which must be either sw1 or sw5, plus one node on another switch.

You can see the "+" syntax as a sub-reservation directive.

8.1.3   Can I perform a fix scheduled reservation and then launch several jobs in it?

Yes. You have to use the OAR scheduler's "timesharing" feature. To use it, the reservation and the jobs you launch inside it must be of the type timesharing (restricted to your own user).

Example:

  1. Make your reservation:

    oarsub -r "2006-09-12 8:00:00" -l /switch=1 -t 'timesharing=user,*'
    

    This command requests all the resources of one switch at the given date, for the default walltime. It also specifies that this job can be shared only with the same user, with no constraint on the job name.

  2. Once your reservation has begun then you can launch:

    oarsub -I -l /node=2,walltime=0:50:00 -p 'switch = "nom_du_switch_schedule"'\
    -t 'timesharing=user,*'
    

    So this job will be scheduled on nodes assigned to the previous reservation.

The "timesharing" oarsub command possibilities are enumerated in Timesharing.

8.1.4   How can a checkpointable job be resubmitted automatically?

You have to specify that your job is idempotent. Then, after a successful checkpoint, if the job is resubmitted, everything will go right and no problems (like file creation or deletion issues) will occur.

Example:

oarsub -k 600 --signal 2 -t idempotent /path/to/prog

So this job will be sent a SIGINT signal (see man kill for signal numbers; 2 is SIGINT) 600 seconds, i.e. 10 minutes, before the walltime ends. Then, if everything goes well, it will be resubmitted.

8.1.5   How to submit a non disturbing job for other users?

You can use the besteffort job type. Your job will then be launched only if there is a free slot, and it will be deleted if another job wants its resources.

Example:

oarsub -t besteffort /path/to/prog

8.2   Administrator

8.2.2   How can I handle licence tokens?

OAR does not try to manage resources with an empty "network_address", so you can define resources that are not linked to a real node.

The steps to configure OAR so that licences (or any other abstract resource) can be reserved are:

  1. Add a new field in the table resources to specify the licence name.

    oarproperty -a licence -c
    
  2. Add your licence name resources with oarnodesetting.

    oarnodesetting -a -h "" -p type=mathlab -p licence=l1
    oarnodesetting -a -h "" -p type=mathlab -p licence=l2
    oarnodesetting -a -h "" -p type=fluent -p licence=l1
    ...
    
  3. Now you have to write an admission rule that forces the oarsub "-l" option onto resources of the type "default" (node resources) when nothing else is specified.

    ADD ADMISSION RULE HERE
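
    A purely hypothetical illustration (admission rules are Perl fragments evaluated by oarsub; the variable name $job_properties is an assumption about that evaluation context and must be checked against your OAR version):

    # Hypothetical admission rule: confine jobs that do not mention
    # a resource type to resources of the type "default".
    # $job_properties is an ASSUMED name for the "-p" constraint string.
    if ((not defined($job_properties)) or ($job_properties !~ /type\s*=/)) {
        if (defined($job_properties) and ($job_properties ne "")) {
            $job_properties = "($job_properties) AND type = 'default'";
        } else {
            $job_properties = "type = 'default'";
        }
    }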
    

After this configuration, users can perform submissions like:

oarsub -I -l "/switch=2/nodes=10+{type = 'mathlab'}/licence=20"

Users thus ask OAR for these other resource types, but nothing prevents their programs from consuming more licences than requested. You can address this problem with the SERVER_SCRIPT_EXEC_FILE configuration: in these scripts you have to bind the OAR allocated resources to the licence servers, to restrict user consumption to what was requested. This is very dependent on the licence management system.

8.2.3   How can I write my own scheduler?

What a scheduler can and must do:
  • your program will get 3 arguments:
    1. queue name
    2. reference time in seconds
    3. reference time in SQL format
  • you must manipulate only the jobs of your queue whose state is "Waiting" and whose "reservation" field is "None"

  • you can read all the information stored in the database (read-only mode)

  • you have to load the previous decisions of the other schedulers (information from the tables gantt_jobs_predictions and gantt_jobs_resources), otherwise your decisions may conflict with them

  • you must store your decisions in the tables gantt_jobs_predictions and gantt_jobs_resources

  • you can set the state of jobs to "toError" and OAR will delete them; afterwards you must exit your program with exit code 1, otherwise 0

You can look at the default OAR scheduler "oar_sched_gantt_with_timesharing"; a minimal skeleton is sketched below. It uses a Gantt library and a resource tree library that are essential to take the right decisions.
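
To make the contract above concrete, here is a minimal, hypothetical skeleton (the DBI connection parameters and the exact column names, apart from the table names, states and arguments quoted above, are assumptions to adapt to your installation):

    #!/usr/bin/perl
    # Hypothetical skeleton of a custom OAR scheduler.
    use strict;
    use warnings;
    use DBI;

    # The three arguments passed to every scheduler.
    my ($queue_name, $ref_time_sec, $ref_time_sql) = @ARGV;
    defined($ref_time_sql) or die "usage: $0 queue time_s time_sql\n";

    # Assumed connection parameters -- adapt to your installation.
    my $dbh = DBI->connect("DBI:mysql:database=oar;host=localhost",
                           "oar", "secret", { RaiseError => 1 });

    # 1. Load the decisions already taken by higher-priority schedulers.
    my $taken = $dbh->selectall_arrayref(
        "SELECT * FROM gantt_jobs_predictions");

    # 2. Consider only Waiting, non-reservation jobs of our queue.
    my $jobs = $dbh->selectall_arrayref(
        "SELECT job_id FROM jobs
         WHERE queue_name = ? AND state = 'Waiting'
           AND reservation = 'None'", undef, $queue_name);

    my $error = 0;
    foreach my $job (@$jobs) {
        # ... decide a start time and a resource set for this job,
        # avoiding the slots already present in $taken, then:
        #   INSERT INTO gantt_jobs_predictions ...
        #   INSERT INTO gantt_jobs_resources ...
        # For an unschedulable job, set its state to "toError"
        # and remember to exit with code 1.
    }

    $dbh->disconnect();
    exit($error ? 1 : 0);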

8.2.4   Does OAR handle summer and winter hours?

No.

If you change the server time while OAR is executing jobs, their stop dates will be wrong. Users have to be warned about this limitation, and the database logs are not exact for these jobs.

8.2.5   What is the syntax of this documentation?

We are using the RST format from the Docutils project. This syntax is easily readable and can be converted into HTML, LaTeX or XML.

You can find basic information on http://docutils.sourceforge.net/docs/user/rst/quickref.html

9   OAR CHANGELOG

9.1   version 2.0.0++ enhanced:

  • Now that any type of resource (licences, VLANs, IP ranges, ...) can be declared, computing resources must have the type "default" and a non-null network_address.

  • Possibility to declare associated resources like licences, IP ranges, ... and to reserve them like others.

  • Now you can connect to your jobs (not only for reservations).

  • Add "cosystem" job type (execute and do nothing for these jobs).

  • New scheduler : "oar_sched_gantt_with_timesharing". You can submit jobs with the type "timesharing", which indicates that this scheduler may launch more than 1 job at a time on a resource. This feature can be restricted with the words "user" and "name". For example, '-t timesharing=user,name' indicates that only a job from the same user with the same name can be launched at the same time.

  • Add PostgreSQL support, so there is now a choice between MySQL and PostgreSQL.

  • New approach for the scheduling : administrators insert into the database descriptions of resources rather than nodes. Resources have a network address (physical node) and properties. For example, for dual-processor nodes you can create 2 different resources with the same network address but 2 different processor names.

  • The scheduler can now handle resource properties in a hierarchical manner. Thus, for example, you can run "oarsub -l /switch=1/cpu=5", which submits a job on 5 CPUs on the same switch.

  • Add a signal handler in oarexec and propagate this signal to the user process.

  • Support '#OAR -p ...' options in user script.

  • Add in oar.conf:
    • DB_BASE_PASSWD_RO : for security reasons, requests containing parts specified by users (like the "-p" option) can be executed with a read-only account.
    • OARSUB_DEFAULT_RESOURCES : when nothing is specified with the oarsub command then OAR takes this default resource description.
    • OAREXEC_DEBUG_MODE : turn on or off debug mode in oarexec (create /tmp/oar/oar.log on nodes).
    • FINAUD_FREQUENCY : the frequency at which OAR launches Finaud (detection of dead nodes).
    • SCHEDULER_TIMEOUT : tells the scheduler the amount of time after which it must terminate itself.
    • SCHEDULER_JOB_SECURITY_TIME : time between each job.
    • DEAD_SWITCH_TIME : after this time, Absent and Suspected resources are switched to the Dead state.
    • PROLOGUE_EPILOGUE_TIMEOUT : the possibility to specify a different timeout for the prologue and epilogue scripts.
    • PROLOGUE_EXEC_FILE : you can specify the path of the prologue script executed on nodes.
    • EPILOGUE_EXEC_FILE : you can specify the path of the epilogue script executed on nodes.
    • GENERIC_COMMAND : a specific script may be used instead of ping to check the aliveness of nodes. The script must print bad nodes on STDERR (1 line per bad node, with exactly the same name as OAR gave in the command's arguments).
    • JOBDEL_SOFTWALLTIME : time after a normal frag during which the system waits before retrying to frag the job.
    • JOBDEL_WALLTIME : time after a normal frag after which the system deletes the job arbitrarily and suspects its nodes.
    • LOG_FILE : specify the path of OAR log file (default : /var/log/oar.log).
  • Add wait() in pingchecker to avoid zombies.

  • Better code modularization.

  • Remove the node installation part needed to launch jobs, so it is easier to upgrade from one version to another (oarnodesetting must still be installed on each node if you want to use it).

  • Users can specify a method to be notified (mail or script).

  • Add cpuset support

  • Add prologue and epilogue script to be executed on the OAR server before and after launching a job.

  • Add dependency support between jobs ("-a" option in oarsub).

  • In oarsub you can specify the launching directory ("-d" option).

  • In oarsub you can specify a job name ("-n" option).

  • In oarsub you can specify stdout and stderr file names.

  • User can resubmit a job (option "--resubmit" in oarsub).

  • It is possible to specify a read-only database account; it will be used to evaluate the SQL properties given by the user with the oarsub command (more secure).

  • Add the possibility for the scheduler to order assigned resources according to their properties, so you can favour some resources over others (SCHEDULER_RESOURCE_ORDER tag in the oar.conf file).

  • a command can be specified to switch off idle nodes (SCHEDULER_NODE_MANAGER_SLEEP_CMD, SCHEDULER_NODE_MANAGER_IDLE_TIME and SCHEDULER_NODE_MANAGER_SLEEP_TIME in oar.conf)

  • a command can be specified to switch on nodes in the Absent state, according to the resource property cm_availability in the resources table (SCHEDULER_NODE_MANAGER_WAKE_UP_CMD in oar.conf).

  • if a job goes into the Error state through no fault of its own, OAR will resubmit it.

9.2   version 1.6.1:

  • initialise the "ganttJobsPrediction" table with a right reference date (1970-01-01 00:00:01)

  • oarsub has the "-k, --checkpoint" option. It specifies the number of seconds before the job walltime at which a SIGUSR2 will be sent to the job.

  • You can see the list of events for jobs with -f option in oarstat

  • oardel can now send SIGUSR2 to the oarexec of a job (-c option)

  • oardel can now delete several jobs

  • Add a signal handler in oarexec and propagate this signal to the user process

  • Support '#OAR -p ...' options in user script

  • Add in oar.conf the possibility to specify a different timeout for prologue and epilogue (PROLOGUE_EPILOGUE_TIMEOUT)

  • when a connection to the DB fails, OAR retries 5 times

  • handle the EXTERMINATE_JOB event and suspect all nodes if possible

  • add job state log table

  • add node properties log table

  • add possibility to use sentinelle.rb script in the ping_checker module

  • add a GRID5000 specific scheduler which implements specific policy (oar_sched_gant_g5k)

  • add -s option to oarnodes command --> show only state of nodes

  • add -l option to oarnodes command --> show only the node list

  • root can now delete any job, in the same way the oar user can

  • change a few oardel messages

  • limit the commands (with arguments) specified by users:

    regular expression : [\w\s\/\.\-]*
    
  • change oaremovenode command to oarremovenode

  • enhance job error management

  • enhance suspicious nodes detection

  • fix bugs about accounting

9.3   version 1.6:

  • fix reservation jobs: if a reservation does not obtain the right number of requested nodes, it waits for a delay; once this delay has expired, the job is launched on the nodes that are available.

  • add a cache for the visualization of the Gantt chart (add two gantt tables for the visualization)

  • add an event log mechanism. It makes it possible to trace all decisions and events occurring in OAR with regard to jobs.

  • detection of errors (they can be traced via event_log table):
    • job working directory does not exist on the node
    • output files cannot be created
  • add deploy scheduler awareness (schedule on non-deploy nodes first)

  • possibility to change a property value via the oarnodesetting command (-p option)

  • add the command "oaremovenode", which allows deleting a node from the database permanently

  • bug fix : now oarsub can use the user's script even if the oar user cannot

  • now, you can use the special value "all" in oarsub (for the "nodes" resource). It gives all the free nodes matching the specified weight and properties.

  • debug Gantt visualization

  • add the possibility to test nodes via nmap

  • add accounting features (accounting table)

  • change the nodeState_log table schema to increase speed