LAPSE:2023.16164
Published Article

Cost Efficient GPU Cluster Management for Training and Inference of Deep Learning
March 3, 2023
Abstract
Expanding the scale of GPU-based deep learning (DL) clusters brings not only accelerated AI services but also significant energy consumption costs. In this paper, we propose a cost-efficient deep learning job allocation (CE-DLA) approach that minimizes the energy consumption cost of DL cluster operation while guaranteeing the performance requirements of user requests. To do this, we first categorize DL jobs into two classes: training jobs and inference jobs. Through architecture-agnostic modeling, the CE-DLA approach can precisely map heterogeneous DL jobs to GPU computing nodes. Second, we design an electricity-price-aware DL job allocation to minimize the energy consumption cost of the cluster. We show that our approach efficiently avoids the peak-rate time slots of the GPU computing nodes by using a mixed-integer nonlinear programming (MINLP) formulation. We additionally integrate the dynamic right-sizing (DRS) method into the CE-DLA approach to minimize the energy consumption of idle nodes with no running jobs. To investigate the realistic behavior of our approach, we measure the actual power output of NVIDIA GPU devices running well-known deep neural network (DNN) models. Given real electricity-price trace data, we show that the CE-DLA approach outperforms competing approaches in terms of both energy consumption cost and DL job processing performance.
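To make the electricity-price-aware idea in the abstract concrete, here is a minimal toy sketch (not the paper's CE-DLA algorithm or its MINLP formulation): deferrable training-job node-hours are packed into the cheapest price slots first, subject to a per-slot node capacity, which is one simple way to avoid peak-rate time slots. All names, values, and the greedy strategy are illustrative assumptions.

```python
def price_aware_schedule(prices, job_hours, capacity):
    """prices: electricity price per time slot ($/kWh);
    job_hours: total GPU node-hours the deferrable training jobs need;
    capacity: maximum node-hours that fit in a single slot.
    Returns {slot_index: node_hours}, filling cheapest slots first."""
    schedule = {}
    remaining = job_hours
    # Visit slots from cheapest to most expensive, so peak-rate
    # (expensive) slots are only used when cheap slots are full.
    for slot in sorted(range(len(prices)), key=lambda s: prices[s]):
        if remaining == 0:
            break
        use = min(capacity, remaining)
        schedule[slot] = use
        remaining -= use
    return schedule

def energy_cost(schedule, prices, kwh_per_node_hour=1.0):
    """Total cost of a schedule, assuming a flat kWh draw per node-hour."""
    return sum(prices[s] * h * kwh_per_node_hour for s, h in schedule.items())

# Example: 6 hourly slots, 8 node-hours of training, 3 nodes per slot.
prices = [0.10, 0.25, 0.40, 0.40, 0.12, 0.08]
plan = price_aware_schedule(prices, job_hours=8, capacity=3)
print(plan)                     # cheapest slots (5, 0, 4) are filled first
print(energy_cost(plan, prices))
```

Unlike this greedy sketch, the paper's MINLP formulation also enforces per-job performance requirements (e.g., for latency-sensitive inference jobs) and couples allocation with dynamic right-sizing of idle nodes.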
Record ID
Keywords
deep learning, electricity price, energy consumption cost, GPU-based cluster, inference, job allocation, training
Subject
Suggested Citation
Kang DK, Lee KB, Kim YC. Cost Efficient GPU Cluster Management for Training and Inference of Deep Learning. (2023). LAPSE:2023.16164
Author Affiliations
Kang DK: Division of Electronic and Information, Department of Computer Science and Engineering, Jeonbuk National University, Jeonju 54896, Korea [ORCID]
Lee KB: Division of Electronic and Information, Department of Computer Science and Engineering, Jeonbuk National University, Jeonju 54896, Korea [ORCID]
Kim YC: Division of Electronic and Information, Department of Computer Science and Engineering, Jeonbuk National University, Jeonju 54896, Korea [ORCID]
Journal Name
Energies
Volume
15
Issue
2
First Page
474
Year
2022
Publication Date
2022-01-10
ISSN
1996-1073
Version Comments
Original Submission
Other Meta
PII: en15020474, Publication Type: Journal Article
Record Map
Published Article

LAPSE:2023.16164
This Record
External Link

https://doi.org/10.3390/en15020474
Publisher Version
Record Statistics
Record Views
191
Version History
[v1] (Original Submission)
Mar 3, 2023
Verified by curator on
Mar 3, 2023
This Version Number
v1
Citations
Most Recent
This Version
URL
https://psecommunity.org/LAPSE:2023.16164
Record Owner
Auto Uploader for LAPSE
Links to Related Works
