OpenHPC Technology Overview
Before diving into the details of OpenHPC, it is important to define HPC, or High Performance Computing. According to InsideHPC, “High Performance Computing most generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation in order to solve large problems in science, engineering and business”.
The National Institute for Computational Sciences offers another definition: “‘High-Performance Computing,’ or HPC, is the application of ‘supercomputers’ to computational problems that are either too large for standard computers or would take too long. A desktop computer generally has a single processing chip, commonly called a CPU. An HPC system, on the other hand, is essentially a network of nodes, each of which contains one or more processing chips, as well as its own memory”.
From these definitions we can see that an HPC cluster places great importance on the parallelization of workloads. To achieve this, an HPC cluster must have the following components:
- Server nodes
    - Head node – runs the administration software (provisioning, workload manager, monitoring, etc.) and manages the cluster tasks. This node can be merged with the login node.
    - Login node – allows users to log in to the HPC system and transfer data from the outside. It typically hosts the users' home directories.
    - Compute nodes – execute the work. The compute nodes provide the resources that determine the overall performance of the cluster.
- Storage infrastructure – typically there are two types of data storage on an HPC system:
    - User storage / home – tends to be NFS (Network File System), a distributed file system with moderate read/write performance. It is designed to hold user data that is not in use during an actual HPC computation.
    - Application storage / scratch – tends to be a parallel file system (PFS) such as Lustre or GPFS. It is usually designed to support high, parallel read/write performance and to hold the application data that is in use during an actual HPC computation. On small architectures NFS can also be used for application storage (see the mount sketch after this list).
- Networking infrastructure – used to interconnect all the pieces of the architecture. Several networks are needed:
    - Operational network – carries the workload traffic between the compute nodes, so it needs to be fast and reliable. The most common options are Ethernet (25GbE/100GbE), InfiniBand, Intel Omni-Path, Cray Aries XC, and SGI NUMAlink.
    - Provisioning / admin network – used for provisioning the operating system to the compute nodes (stateless) and for executing commands on the nodes. The traffic comes in short bursts, so Ethernet at 1GbE or 10GbE will usually do the job.
    - Monitoring / remote management network – used for monitoring the overall system with Ganglia or other software, and for executing IPMI commands on remote nodes. A separate 1GbE port is recommended for this network, but it can also be merged with the provisioning network.
    - Public network – connects the cluster to an external network or to the internet. For security reasons it is recommended that only the head node is connected to public networks. Usually 1GbE or 10GbE.
- Software – the glue of the HPC cluster infrastructure, responsible for giving the infrastructure its intelligence and workload management. The next sections provide more detailed information about each of the software components.
    - Operating system – provides the base for the infrastructure; it can be Linux, UNIX, or Windows. Linux is the most common on HPC clusters around the world.
    - Cluster provisioning – prepares the file system, network settings, and user attributes delivered from the head node to the compute nodes.
    - Workload management – responsible for managing the network, storage, and processors assigned to each workload.
    - Cluster monitoring – used to know, in real time, the resource utilization of the cluster.
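As a minimal illustration of the storage layout described above, a compute node could mount its home and scratch areas from an NFS server on the head node. The host name and export paths below are assumptions for the sketch:

```
# /etc/fstab fragment on a compute node ("head" and the paths are examples)
# NFS home: moderate performance, holds user data between runs
head:/home    /home    nfs  defaults  0 0
# On small clusters, scratch space can also be served over NFS
head:/scratch /scratch nfs  defaults  0 0
```

On larger systems the /scratch entry would instead be a Lustre or GPFS client mount.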
**Figure 1. Basic HPC System**
Figure 1 shows a typical HPC system: it has compute nodes, a head node, storage, and networking.
Now that the building blocks of an HPC system have been explained, it is time to define the OpenHPC project. According to its webpage, OpenHPC is a “collaborative, community effort that initiated from a desire to aggregate a number of common ingredients required to deploy and manage High Performance Computing (HPC) Linux clusters including provisioning tools, resource management, I/O clients, development tools and a variety of scientific libraries”.
1.1. HPC Software – Provisioning System – Warewulf
Warewulf is an operating system management toolkit designed to facilitate large-scale deployments of systems on physical, virtual, and cloud-based infrastructures. It facilitates elastic and large deployments consisting of groups of homogeneous systems. Some of the main characteristics of the system are:
- Warewulf is designed with a common set of core functionality in the form of the primary interface, shared libraries, data/object storage, and an event handler.
- Within Warewulf everything is made up of objects, and each object can have its own attributes, parameters, and configuration API.
- Warewulf utilizes the concept of a Virtual Node File System (VNFS) for managing the node operating systems. This means that each node can be managed through a chroot, a directory structure that represents a complete root file system.
Warewulf is currently developed and distributed using MySQL (MariaDB) as the backend data store, which must be configured and initialized before Warewulf can be used.
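As a sketch of the VNFS workflow described above, the Warewulf 3 command-line tools can build a chroot and package it as a provisionable image. The chroot path here is an assumption, and the available templates vary by distribution:

```bash
# Build a chroot that will become the node image (template/path are examples)
wwmkchroot centos-7 /opt/chroots/compute

# Package the chroot as a VNFS image
wwvnfs --chroot /opt/chroots/compute

# Import the running kernel and its initrd as the network-boot bootstrap
wwbootstrap $(uname -r)
```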
**Figure 2. Provisioning Process for Warewulf**
Figure 2 shows the process that Warewulf uses to provision operating system images to the compute nodes; the protocols involved are DHCP and TFTP (PXE boot). Once the OS has started there is no further activity for the provisioning system, and the cluster is managed by the workload manager.
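To tie a node into this DHCP/TFTP boot flow, Warewulf 3 registers the node's MAC address and regenerates the DHCP and PXE configuration. The node name, addresses, and image names below are hypothetical:

```bash
# Register a compute node (name, IP, and MAC are examples)
wwsh node new c1 --ipaddr=10.0.0.1 --hwaddr=aa:bb:cc:dd:ee:f0 -D eth0

# Associate a bootstrap and VNFS image with the node
wwsh provision set c1 --vnfs=compute --bootstrap=$(uname -r)

# Regenerate the DHCP and PXE configuration so the node boots the image
wwsh dhcp update
wwsh pxe update
```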
1.2. HPC Software – Workload Manager – SLURM
SLURM, the Simple Linux Utility for Resource Management, is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. SLURM requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work.
SLURM is composed of several components:
- slurmctld – the centralized manager; it monitors resources and work. There is one manager, and optionally a backup manager for failover.
- slurmd – a daemon that runs on each compute node; it waits for work, executes that work, returns its status, and waits for more work.
- slurmdbd (optional) – records accounting information for multiple clusters in a single database.
To execute work, SLURM includes a series of user tools: srun to initiate jobs, scancel to terminate queued or running jobs, sinfo to report system status, squeue to report the status of jobs, and sacct to get information about jobs and job steps that are running or have completed. The smap and sview commands graphically report system and job status, including network topology. The administrative tool scontrol is available to monitor and/or modify configuration and state information on the cluster. The administrative tool used to manage the accounting database is sacctmgr; it can be used to define the clusters, valid users, valid bank accounts, etc. APIs are available for all functions. A small usage sketch follows.
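As a minimal sketch of these tools in practice, the batch script below requests an allocation and launches a hypothetical MPI application; the partition name, node counts, and binary are assumptions:

```bash
#!/bin/bash
# job.sh - minimal SLURM batch script (partition and app names are examples)
#SBATCH --job-name=example
#SBATCH --partition=normal
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --time=00:30:00

# srun launches the tasks across the allocated nodes
srun ./my_mpi_app
```

The job would then be submitted with `sbatch job.sh`, watched with `squeue`, and cancelled, if necessary, with `scancel <jobid>`.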

**Figure 3. SLURM Architecture**
SLURM uses the concept of jobs, i.e., resource allocation requests: users submit jobs to a partition, and resources are then allocated to them.
The node state that is monitored includes: the count of processors, the size of real memory, the size of temporary disk space, and the state (UP, DOWN, etc.). Additional node information includes a weight (preference in being allocated work) and features (arbitrary information such as processor speed or type). Nodes are grouped into partitions, which may contain overlapping nodes, so they are best thought of as job queues. Partition information includes: name, list of associated nodes, state (UP or DOWN), maximum job time limit, maximum node count per job, group access list, priority (important if nodes are in multiple partitions), and shared node access policy with an optional over-subscription level for gang scheduling (e.g., YES, NO, or FORCE:2). A configuration sketch is shown below.
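A minimal slurm.conf fragment illustrating these node and partition attributes might look as follows; the node names, sizes, and limits are assumptions:

```
# slurm.conf fragment (all values are examples)
# Node definition: processor count, real memory (MB), and scheduling weight
NodeName=c[1-4] CPUs=8 RealMemory=32000 Weight=10 State=UNKNOWN

# Partition definition: node list, limits, and gang-scheduling policy
PartitionName=normal Nodes=c[1-4] Default=YES MaxTime=24:00:00 MaxNodes=4 State=UP OverSubscribe=FORCE:2
```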
1.3. HPC Software – Cluster Monitoring – Ganglia
Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.
Ganglia is composed of two daemons:
- Ganglia Monitoring Daemon (gmond) – installed on each monitored node; it delivers status information and relevant changes, and answers requests with an XML description of the node state.
- Ganglia Meta Daemon (gmetad) – periodically polls a collection of child data sources, which can be gmond daemons or other gmetad daemons. The collected data is stored with RRDtool for visualization of historical trends.
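As a sketch, gmetad's polling sources are declared with data_source lines in gmetad.conf; the cluster names and hosts below are hypothetical:

```
# /etc/ganglia/gmetad.conf fragment (cluster names and hosts are examples)
# Each data_source names a cluster and lists one or more hosts to poll
data_source "compute" head:8649
data_source "storage" 10.0.0.10:8649
```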
Ganglia uses RRDtool (Round Robin Database tool) to store and visualize historical information for cluster, host, and metric trends over different periods of time, ranging from minutes to years. RRDtool uses compact, constant-size databases designed for storing and summarizing time-series data, and it generates the graphs that are then exported to users through the PHP web front-end.
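For instance, the historical data behind the web front-end can be inspected directly with the rrdtool command line. The storage path below is a common default and is an assumption here:

```bash
# Read the last hour of a metric RRD written by gmetad (path is an example)
rrdtool fetch /var/lib/ganglia/rrds/compute/c1/load_one.rrd AVERAGE --start -1h
```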

**Figure 4. Ganglia Dashboard**
The Ganglia implementation consists of the two daemons, gmond and gmetad, a command-line program, gmetric, and a client-side library. Running multiple daemons allows monitoring information from multiple clusters to be aggregated. gmetric is a command-line program that applications can use to publish application-specific metrics, while the client-side library provides programmatic access to a subset of Ganglia's features.
All data stored by gmond is kept in memory, and by default nothing is ever written to disk. This, combined with the nodes announcing their metrics over multicast, means that new daemons can be added simply by listening and announcing on the multicast channel.
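The multicast channels mentioned above are configured in gmond.conf. The address and port shown are Ganglia's well-known defaults; the file location may vary by distribution:

```
# /etc/ganglia/gmond.conf fragment: default multicast send/receive channels
udp_send_channel {
  mcast_join = 239.2.11.71
  port = 8649
}
udp_recv_channel {
  mcast_join = 239.2.11.71
  port = 8649
}
```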
The gmond daemon publishes two types of metrics:
- Built-in metrics: comprise 28 to 37 different metrics, depending on the operating system, version, and CPU architecture. Some of them are: CPU clock speed, %CPU usage, CPU load, memory, processes, swap, system boot time, operating system name, architecture, and MTU. Built-in messages are 8 to 12 bytes in length and are represented as key = value pairs, where 4 bytes are used for the key and 4 to 8 bytes for the value.
- User-defined metrics: may represent arbitrary state, for example a compound metric built by combining two or more built-in metrics. In these messages, each key must be defined every time a message is sent.
Note: Only significant changes are sent on the multicast channel.
gmetric, the Ganglia Metric Tool, makes it easy to monitor any arbitrary host metric you like, expanding on the core metrics that gmond measures by default. The central repository for gmetric scripts is located at https://github.com/ganglia/gmetric.
The gmetric tool formats a multicast message and sends it to all gmond daemons that are listening.
Example:
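Below, a hypothetical metric is published using gmetric's standard name/value/type/units options (the metric name and value are made up):

```bash
# Publish a one-off temperature reading to the listening gmond daemons
gmetric --name "cpu_temperature" --value 57 --type int16 --units "Celsius"
```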