With the increasing scale of High-Performance Computing (HPC) systems and a new awareness of the environmental impact of HPC, new strategies are required to improve the efficiency of resource usage on these systems. One such strategy is Dynamic Resource Management (DRM), which allows changing the resources assigned to a job dynamically during its execution. This increased flexibility in resource allocation and job scheduling can lead to improvements in several system efficiency metrics.
Despite these benefits, DRM has not yet been established as a ready-to-use technology for production HPC systems. This is caused by the significant changes required in all the layers of the HPC system software stack, which are only achievable with an extensive and holistic co-design process between resource management software and applications.
In this work, we demonstrate the applicability of Dynamic Processes with PSets (DPP), a recently introduced, generic design approach for dynamic resources, to enable DRM in real-world systems. To this end, we developed an exemplary dynamic system software stack that follows the DPP design principles throughout all layers. Based on this, we propose a resource optimization approach using a discrete steepest ascent method, informed by application scalability models.
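As an illustration of the general idea of a discrete steepest ascent over node allocations, the following minimal sketch may be helpful. It is not the implementation evaluated in this work. The Amdahl-style scalability model, the objective (sum of modeled speedups as a throughput proxy), the single-node move set, and all names (`amdahl_speedup`, `steepest_ascent`, `min_nodes`) are assumptions made for this example only.

```python
from typing import Callable, List, Optional

ScalabilityModel = Callable[[int], float]


def amdahl_speedup(serial_fraction: float) -> ScalabilityModel:
    """Hypothetical scalability model: Amdahl's law for a given serial fraction."""
    def speedup(nodes: int) -> float:
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)
    return speedup


def steepest_ascent(models: List[ScalabilityModel],
                    total_nodes: int,
                    min_nodes: int = 1,
                    eps: float = 1e-12) -> List[int]:
    """Assign nodes to jobs by repeatedly taking the best single-node move.

    Assumed objective: the sum of modeled speedups over all jobs.
    Assumed moves: give one free node to a job, or shift one node
    from one job to another.
    """
    alloc = [min_nodes] * len(models)
    if sum(alloc) > total_nodes:
        raise ValueError("not enough nodes for the minimum allocation")

    def objective(a: List[int]) -> float:
        return sum(model(n) for model, n in zip(models, a))

    while True:
        current = objective(alloc)
        best_gain = eps
        best_alloc: Optional[List[int]] = None
        free = total_nodes - sum(alloc)

        for i in range(len(alloc)):
            candidates = []
            if free > 0:
                # Grow job i by one node taken from the free pool.
                grown = alloc.copy()
                grown[i] += 1
                candidates.append(grown)
            for j in range(len(alloc)):
                if j != i and alloc[j] > min_nodes:
                    # Shift one node from job j to job i.
                    moved = alloc.copy()
                    moved[i] += 1
                    moved[j] -= 1
                    candidates.append(moved)
            for cand in candidates:
                gain = objective(cand) - current
                if gain > best_gain:
                    best_gain, best_alloc = gain, cand

        if best_alloc is None:
            # No neighboring allocation improves the objective.
            return alloc
        alloc = best_alloc


if __name__ == "__main__":
    jobs = [amdahl_speedup(0.05), amdahl_speedup(0.5)]
    print(steepest_ascent(jobs, total_nodes=8))
```

In this sketch, each iteration evaluates every neighboring allocation and takes the one with the largest improvement, stopping at a local optimum. Because the objective strictly increases with every accepted move and the set of allocations is finite, the loop terminates.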
We assess the applicability and performance of our approach using both synthetic benchmarks and job mixes consisting of several dynamic, real-world applications. On up to 100 nodes, we measure moderate overheads for process reconfiguration in applications while significantly improving the system throughput and average job turnaround time compared to static scheduling in crowded system scenarios.
Link: www.martin-schreiber.info/data/publications/2025_huber_et_al_drm_in_hpc_with_psets.pdf