Optimal Data Reduction of Training Data in Machine Learning-Based Modelling: A Multidimensional Bin Packing Approach

Jelke Wibbeke; Payam Teimourzadeh Baboli; Sebastian Rohjans

LAPSE

Living Archive for Process Systems Engineering

LAPSE:2023.13455

Published Article

March 1, 2023

Complex, IT-controlled systems are now found in many areas, so models and the data on which they are based play an increasingly important role. The growing capability to collect data through sensor technology produces extensive data sets that must be handled. Concretely, this means extracting from the data, in high quality, the information required for a specific problem; in condition monitoring, for example, this includes the relevant system states. Data quality is especially important in machine learning, and various methods already exist to reduce the size of data sets without reducing their information value. This paper presents the multidimensional binned reduction (MdBR) method, an approach that has much lower complexity than existing methods and that addresses regression rather than classification, which most other approaches target. The approach merges discretization approaches with non-parametric numerosity reduction via histograms. MdBR has linear complexity and can be used to reduce large multivariate data sets to smaller subsets that can then be used for model training. The evaluation uses a data set from the photovoltaic sector with approximately 92 million samples to train a multilayer perceptron (MLP) model that estimates the output power of the system. The results show that the approach reduced the number of training samples by more than 99% while also increasing the model’s performance. It works best with large data sets of low-dimensional data. Although periodic data often include the most redundant samples and thus offer the best reduction potential, the presented approach can only handle time-invariant data, not sequences of samples as are commonly used in time series.
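The idea of combining discretization with histogram-based numerosity reduction can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `binned_reduce`, the equal-width bins, the choice to bin inputs and target jointly, and the rule of keeping the first sample seen in each occupied bin are all assumptions made for illustration. The sketch makes two passes over the data and uses a hash map for the bins, so its running time grows linearly with the number of samples.

```python
import math
import random


def binned_reduce(samples, bins_per_dim):
    """Keep one representative sample per occupied multidimensional bin.

    Assumptions (not taken from the paper): equal-width bins in every
    dimension, and the first sample that falls into a bin represents it.
    """
    if not samples:
        return []
    dims = len(samples[0])

    # Pass 1: per-dimension minimum and maximum.
    lows = [min(s[d] for s in samples) for d in range(dims)]
    highs = [max(s[d] for s in samples) for d in range(dims)]

    def bin_index(sample):
        key = []
        for d in range(dims):
            span = highs[d] - lows[d]
            if span == 0:
                key.append(0)  # constant dimension: a single bin
                continue
            i = int((sample[d] - lows[d]) / span * bins_per_dim)
            key.append(min(i, bins_per_dim - 1))  # maximum goes in the last bin
        return tuple(key)

    # Pass 2: hash every sample into its bin; only the first one is stored.
    representatives = {}
    for s in samples:
        representatives.setdefault(bin_index(s), s)
    return list(representatives.values())


def demo():
    """Reduce hypothetical periodic (input, target) data; return both sizes."""
    random.seed(0)
    data = []
    for _ in range(100_000):
        x = random.uniform(0.0, 2.0 * math.pi)
        data.append((x, math.sin(x)))
    reduced = binned_reduce(data, bins_per_dim=20)
    return len(data), len(reduced)
```

With 20 bins per dimension in two dimensions, at most 400 samples can survive regardless of how large the input is, which is why highly redundant data reduce so strongly. The number of possible bins grows exponentially with the number of dimensions, which is consistent with the statement that the approach works best on low-dimensional data.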

Record ID

LAPSE:2023.13455

Keywords

Big Data, discretization, histogram, neural network, numerosity reduction, regression, training data

Subject

Numerical Methods and Statistics

Suggested Citation

Wibbeke J, Teimourzadeh Baboli P, Rohjans S. Optimal Data Reduction of Training Data in Machine Learning-Based Modelling: A Multidimensional Bin Packing Approach. (2023). LAPSE:2023.13455

Author Affiliations

Wibbeke J: Department for Civil Engineering Geoinformation and Health Technology, Jade University of Applied Science, 26121 Oldenburg, Germany; Energy Department, OFFIS—Institute for Information Technology, 26121 Oldenburg, Germany [ORCID]
Teimourzadeh Baboli P: Energy Department, OFFIS—Institute for Information Technology, 26121 Oldenburg, Germany [ORCID]
Rohjans S: Department for Civil Engineering Geoinformation and Health Technology, Jade University of Applied Science, 26121 Oldenburg, Germany

Journal Name

Energies

Volume

15

Issue

9

First Page

3092

Year

2022

Publication Date

2022-04-23