Univa Grid Engine 8.1.0 Features (Part 1) - The New NUMA Aware Scheduler (2012-05-17)

For the UGE 8.1 release we will add several new features to Univa Grid Engine. One of them is an enhancement of the Grid Engine scheduler component: it is now completely NUMA aware! That is, it makes its scheduling decisions based on the available NUMA nodes, on occupied and free cores/sockets, and on the NUMA memory utilization of each socket. This also means that the CPU core selection algorithm has moved into the scheduler component, which brings another advantage: it is now guaranteed that you get the requested binding on the execution host, because the scheduler automatically selects only hosts that can fulfill your core binding request (and NUMA memory request).

The new resources (i.e. complexes), which are automatically available after installation, also include the cache hierarchy (level 1 to level 3):

hl:m_cache_l1=32.000K
hl:m_cache_l2=256.000K
hl:m_cache_l3=12.000M

The new NUMA topology string shows the NUMA nodes of a host:
hl:m_topology_numa=[SCCCC][SCCCC][SCCCC][SCCCC]

In the NUMA topology string each bracket identifies a different NUMA node (usually a socket). The memory access latency depends on which NUMA node the process runs on and on which socket the memory is attached to. Hence the scheduler can set the memory allocation strategy so that only local (NUMA-node-local) memory is accessed. In case more memory is needed than is locally available, the scheduler can also set the strategy to merely prefer local memory.
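
These topology and cache values can be inspected per host like any other load value, for example with qhost (a sketch; mynode01 is a placeholder host name, and the standard qhost -F resource filter is assumed here):

qhost -h mynode01 -F m_topology_numa,m_cache_l1,m_cache_l2,m_cache_l3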

This is done with the new qsub parameter „-mbind“, which accepts the following options:

  • cores
  • cores:strict
  • round_robin
  • nlocal


„-mbind cores:strict“ means that the process will only get memory from the NUMA node on which it is running. Using memory from other NUMA nodes is not allowed.

„-mbind cores“ means that the job prefers memory from its home NUMA node. If no more local memory is available, it will also allocate memory from other NUMA nodes.
Both settings require a core binding to be requested (with -binding linear:1 for example).
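
As a minimal sketch, a submission using these two variants could look like this (yourapp.sh is a placeholder for your own job script):

qsub -binding linear:1 -mbind cores yourapp.sh
qsub -binding linear:1 -mbind cores:strict yourapp.sh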

„-mbind round_robin“ lets the job allocate memory in an interleaved fashion across the NUMA nodes. Depending on the hardware architecture, the net result can be a higher aggregate memory bandwidth at the cost of a higher average latency. This memory allocation strategy does not require a core binding.
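
A corresponding interleaved submission might simply look like this (again with a placeholder script name):

qsub -mbind round_robin yourapp.sh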

In order to manage the memory on the individual NUMA nodes (usually sockets), additional resources are available on such architectures.

Here is an example of a 4-node (4-socket) host:

hl:m_mem_total=31.250G
hl:m_mem_used=1.953G
hl:m_mem_free=29.297G
hl:m_mem_total_n0=7.812G
hl:m_mem_free_n0=7.324G
hl:m_mem_used_n0=500.000M
hl:m_mem_total_n1=7.812G
hl:m_mem_free_n1=7.324G
hl:m_mem_used_n1=500.000M
hl:m_mem_total_n2=7.812G
hl:m_mem_free_n2=7.324G
hl:m_mem_used_n2=500.000M
hl:m_mem_total_n3=7.812G
hl:m_mem_free_n3=7.324G
hl:m_mem_used_n3=500.000M
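
The per-node values can be monitored in the same way as the host-wide ones, for example (again a sketch with a placeholder host name, assuming the qhost -F resource filter):

qhost -h mynode01 -F m_mem_free,m_mem_free_n0,m_mem_free_n1,m_mem_free_n2,m_mem_free_n3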


The resource „m_mem_total“ is a consumable complex that is also reported as a load value by the execution daemon. It is a very special one: when you request this complex together with a memory allocation strategy, the request is internally and automatically turned into implicit per-NUMA-node requests, so that the NUMA node memory can be decremented according to the memory allocation strategy. Because at submission time you do not know how your job will be distributed on the execution host, the memory bank (NUMA node) requests are managed implicitly for you.

The following example demonstrates this:

Assume you are submitting a two-way multi-threaded job (the pe_slots allocation rule is required in the PE) together with the memory binding strategy „-mbind cores:strict“ (which means that the job is only allowed to use fast local memory), and you want to use 2 cores, if possible on one socket. The application needs 4G in total, i.e. 2G per slot. This results in the following request:

qsub -mbind cores:strict -binding linear:2 -pe mytestpe 2 -l m_mem_free=2G yourapp.sh

Now the scheduler skips all hosts that do not offer 2 free cores (as well as 2 free slots) and at least 4G of free memory. If there is a host where the job can be bound, the NUMA memory banks are additionally checked (the m_mem_free_n<node> complexes). If the job can run on one socket offering 2 free cores in a row, that NUMA node must offer the full 4G; if not, the host is skipped. If the host only offers 2 free cores on different sockets (NUMA nodes), each of them must offer just 2G. The memory on the NUMA nodes is then decremented according to the job distribution, the job's memory request, and the job's memory allocation strategy.
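
To illustrate the accounting, assume the job lands on the example host shown above (the numbers are hypothetical and simply reuse the listed load values):

Both cores on socket 0:            m_mem_free_n0: 7.324G - 4G = 3.324G
One core each on sockets 0 and 1:  m_mem_free_n0: 7.324G - 2G = 5.324G
                                   m_mem_free_n1: 7.324G - 2G = 5.324G

In both cases the host-level m_mem_free is decremented by the full 4G.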

There is much more to say about this, especially about the „-mbind nlocal“ option, which comes with implicit binding requests, but more about that later...