LAPSE:2023.16164
Published Article

Cost Efficient GPU Cluster Management for Training and Inference of Deep Learning
March 3, 2023
Abstract
Expanding the scale of GPU-based deep learning (DL) clusters brings not only accelerated AI services but also significant energy consumption costs. In this paper, we propose a cost-efficient deep learning job allocation (CE-DLA) approach that minimizes the energy consumption cost of DL cluster operation while guaranteeing the performance requirements of user requests. To do this, we first categorize DL jobs into two classes: training jobs and inference jobs. Through architecture-agnostic modeling, the CE-DLA approach can precisely map heterogeneous DL jobs to GPU computing nodes. Second, we design an electricity-price-aware DL job allocation to minimize the energy consumption cost of the cluster. We show that our approach efficiently avoids the peak-rate time slots of the GPU computing nodes by using a mixed-integer nonlinear programming (MINLP) formulation. We additionally integrate the dynamic right-sizing (DRS) method into the CE-DLA approach to minimize the energy consumption of idle nodes with no running jobs. To investigate the realistic behavior of our approach, we measure the actual power output of NVIDIA GPU devices running well-known deep neural network (DNN) models. Given real electricity-price trace data, we show that the CE-DLA approach outperforms competing approaches in terms of both energy consumption cost and DL job processing performance.
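To make the electricity-price-aware idea in the abstract concrete, here is a minimal toy sketch (not the paper's CE-DLA algorithm or its MINLP formulation): deferrable training-job node-hours are packed into the cheapest price slots first, subject to a per-slot node capacity, which is one simple way to avoid peak-rate time slots. All names, values, and the greedy strategy are illustrative assumptions.

```python
def price_aware_schedule(prices, job_hours, capacity):
    """prices: electricity price per time slot ($/kWh);
    job_hours: total GPU node-hours the deferrable training jobs need;
    capacity: maximum node-hours that fit in a single slot.
    Returns {slot_index: node_hours}, filling cheapest slots first."""
    schedule = {}
    remaining = job_hours
    # Visit slots from cheapest to most expensive, so peak-rate
    # (expensive) slots are only used when cheap slots are full.
    for slot in sorted(range(len(prices)), key=lambda s: prices[s]):
        if remaining == 0:
            break
        use = min(capacity, remaining)
        schedule[slot] = use
        remaining -= use
    return schedule

def energy_cost(schedule, prices, kwh_per_node_hour=1.0):
    """Total cost of a schedule, assuming a flat kWh draw per node-hour."""
    return sum(prices[s] * h * kwh_per_node_hour for s, h in schedule.items())

# Example: 6 hourly slots, 8 node-hours of training, 3 nodes per slot.
prices = [0.10, 0.25, 0.40, 0.40, 0.12, 0.08]
plan = price_aware_schedule(prices, job_hours=8, capacity=3)
print(plan)                     # cheapest slots (5, 0, 4) are filled first
print(energy_cost(plan, prices))
```

Unlike this greedy sketch, the paper's MINLP formulation also enforces per-job performance requirements (e.g., for latency-sensitive inference jobs) and couples allocation with dynamic right-sizing of idle nodes.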
Record ID
Keywords
deep learning, electricity price, energy consumption cost, GPU-based cluster, inference, job allocation, training
Subject
Suggested Citation
Kang DK, Lee KB, Kim YC. Cost Efficient GPU Cluster Management for Training and Inference of Deep Learning. (2023). LAPSE:2023.16164
Author Affiliations
Kang DK: Division of Electronic and Information, Department of Computer Science and Engineering, Jeonbuk National University, Jeonju 54896, Korea [ORCID]
Lee KB: Division of Electronic and Information, Department of Computer Science and Engineering, Jeonbuk National University, Jeonju 54896, Korea [ORCID]
Kim YC: Division of Electronic and Information, Department of Computer Science and Engineering, Jeonbuk National University, Jeonju 54896, Korea [ORCID]
Journal Name
Energies
Volume
15
Issue
2
First Page
474
Year
2022
Publication Date
2022-01-10
ISSN
1996-1073
Version Comments
Original Submission
Other Meta
PII: en15020474, Publication Type: Journal Article
Record Map
Published Article

LAPSE:2023.16164
This Record
External Link

https://doi.org/10.3390/en15020474
Publisher Version
Record Statistics
Record Views
191
Version History
[v1] (Original Submission)
Mar 3, 2023
Verified by curator on
Mar 3, 2023
This Version Number
v1
Citations
Most Recent
This Version
URL
https://psecommunity.org/LAPSE:2023.16164
Record Owner
Auto Uploader for LAPSE
Links to Related Works
