Research Computing Services

The Data Accelerator

About

Current burst buffer work is largely confined to hand-built implementations or proprietary systems. This work aims to provide a scalable, cost-effective solution that gives researchers the performance improvement burst buffers bring, and to help explore the use cases for this tier of accelerated storage. The term burst buffer has become overloaded and has moved beyond its original purpose of improving the performance of checkpoint-restart; the term Data Accelerator better encapsulates the characteristics of this higher-performing storage tier.

Building on existing SLURM functionality and on Lustre and BeeGFS, this project adds an orchestrator that joins the existing SLURM burst buffer plugin with established file systems to form the Data Accelerator. This reduces how much users and system administrators must learn in order to build and use the system, while leaving room to experiment with new workflows such as NVMe over Fabrics, alongside existing per-job file systems for staging data in.
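
As an illustration of the NVMe over Fabrics path mentioned above, the sketch below shows how a compute node might attach a remote NVMe namespace exported by an accelerator node using the standard nvme-cli tool. The transport, addresses, and NQN are hypothetical placeholders for illustration only; in practice the orchestrator performs the equivalent steps automatically, so users would not normally run these commands themselves.

    # Discover NVMe-oF subsystems exported by a (hypothetical) accelerator node
    nvme discover --transport=rdma --traddr=10.47.0.1 --trsvcid=4420

    # Attach one of the advertised namespaces; it appears as a local block device
    nvme connect --transport=rdma --traddr=10.47.0.1 --trsvcid=4420 \
        --nqn=nqn.2018-01.example.dac:nvme:dac-node-01

    # The remote SSD is now visible (e.g. as /dev/nvme1n1) and can be formatted
    # or used as a target for a per-job file system
    nvme list

    # Detach when the job allocation is torn down
    nvme disconnect --nqn=nqn.2018-01.example.dac:nvme:dac-node-01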

Workflows

One of the motivations for describing this work as a Data Accelerator rather than just a burst buffer is the aim to provide researchers with multiple workflows beyond checkpoint-restart. Some of the workflows described below are experimental and may not be available when the system goes into production.

  • Stage in/Stage out
    • Files from a scratch or project directory will be copied into a job allocation on the data accelerator and used as the scratch space for the job. On completion, the data can be staged out or discarded.
  • Transparent Caching
    • The data accelerator is used to improve access to files commonly used by a job: files are staged in from the parallel file system, and subsequent reads are directed at the accelerator rather than the slower main storage system.
  • Checkpoint
    • During long-running jobs, checkpoints of a simulation's data can be stored asynchronously on the accelerator and then on the main storage system. The performance improvement reduces the time taken to restore large data sets to memory.
  • Background data movement
    • Data sent to the accelerator can be drained back to the main storage system as required. 
  • Journaling
    • During a job's I/O, file data is not written directly; instead a delta of the data that would be written is saved. This journal can then be replayed when the data is required.
  • Swap memory
    • The accelerator can be used to extend the available main memory for applications that may grow beyond the compute nodes' available memory; Apache Spark is a possible candidate for this use case (see the sketch after this list).
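
As a rough illustration of the swap memory workflow, the hedged sketch below shows how an accelerator-backed block device could be turned into swap space for the duration of a job. The device path /dev/nvme1n1 is an assumption for illustration; in practice the orchestrator would manage this on the user's behalf.

    # Assume /dev/nvme1n1 is an accelerator SSD attached to this compute node (hypothetical path)
    mkswap /dev/nvme1n1                  # write a swap signature to the device
    swapon --priority 10 /dev/nvme1n1    # enable it as high-priority swap
    swapon --show                        # confirm the extra swap capacity is visible

    # ... run the memory-hungry application (e.g. an Apache Spark executor) ...

    swapoff /dev/nvme1n1                 # release the device when the job completes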


Hardware

The system is built using Dell EMC R740xd servers, each of which can hold up to 24 NVMe SSDs; the Data Accelerator uses 12 per server, with six attached to each PCIe bridge. Each server features two Intel CPUs and dual Intel Omni-Path fabric cards, and is attached to the Research Computing Services (RCS) HPC clusters as a non-blocking leaf.

Orchestrator

The glue that drives the SSDs and exposes them to jobs has been built by Research Computing Services and StackHPC to provide commodity storage with burst buffer-like semantics on the RCS HPC clusters. The public repository of this work can be found on GitHub.

SLURM Usage

For users of the RCS HPC clusters, the Data Accelerator is designed to be compatible with the semantics of the existing Cray Inc. burst buffer plugin contributed to SLURM. While the underlying implementation introduced by this work is distinct, users should not find the experience divergent except for experimental features. Users add the existing #BB and #DW directives to their SLURM batch submission scripts to stage data and configure their buffer.
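
As an illustrative example, the batch script below sketches how a job might request a per-job buffer and stage data in and out with #DW directives, following the DataWarp-style syntax supported by the SLURM burst buffer plugin. The capacity, paths, and application name are placeholders, and the exact directives and pool configuration available on the RCS clusters may differ.

    #!/bin/bash
    #SBATCH --nodes=4
    #SBATCH --time=01:00:00

    # Request a per-job scratch buffer on the accelerator (size is illustrative)
    #DW jobdw capacity=1TiB access_mode=striped type=scratch

    # Stage input data onto the buffer before the job starts,
    # and stage results back to the main storage system afterwards
    #DW stage_in  source=/rds/project/example/input  destination=$DW_JOB_STRIPED/input  type=directory
    #DW stage_out source=$DW_JOB_STRIPED/output destination=/rds/project/example/output type=directory

    srun ./my_application $DW_JOB_STRIPED/input $DW_JOB_STRIPED/output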

Availability on the RCS HPC Clusters

Research Computing Services aims to make this available to users as soon as possible. As this is an ongoing project, production service of the accelerator does not yet have a projected start date. Early access for select users will begin soon to test and evaluate the project. If you feel your project could benefit from this work, please contact us to discuss options.

Publications

  • The Dell-EMC Data Accelerator Solving the HPC I/O bottleneck