LAPSE:2023.13455
Published Article
Optimal Data Reduction of Training Data in Machine Learning-Based Modelling: A Multidimensional Bin Packing Approach
March 1, 2023
Complex, IT-controlled systems have found their way into many areas, and the models and data on which they are based are playing an increasingly important role. As sensor technology makes it ever easier to collect data, extensive data sets are created that need to be mastered. In concrete terms, this means extracting the information required for a specific problem from the data at high quality. In condition monitoring, for example, this includes the relevant system states. Data quality is especially important in machine learning. Different methods already exist to reduce the size of data sets without reducing their information value. In this paper, the multidimensional binned reduction (MdBR) method is presented as an approach that, compared with these methods, has much lower complexity and addresses regression rather than classification, which most other approaches target. The approach merges discretization approaches with non-parametric numerosity reduction via histograms. MdBR has linear complexity and can be used to reduce large multivariate data sets to smaller subsets, which can then be used for model training. The evaluation, based on a data set from the photovoltaic sector with approximately 92 million samples, trains a multilayer perceptron (MLP) model to estimate the output power of the system. The results show that, using the approach, the number of training samples could be reduced by more than 99% while also increasing the model’s performance. The approach works best with large data sets of low-dimensional data. Although periodic data often include the most redundant samples and thus offer the best reduction capabilities, the presented approach can only handle time-invariant data and not sequences of samples, as are often used in time series.
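The general idea of combining discretization with histogram-based numerosity reduction can be illustrated with a short sketch. This is not the authors' MdBR implementation, whose details are not given in this record. It is a minimal example under stated assumptions: equal-width bins in every dimension, features and regression target binned together, and each occupied bin replaced by the mean of its samples. The function name `binned_reduction` and the parameter `bins_per_dim` are hypothetical.

```python
import numpy as np


def binned_reduction(X, y, bins_per_dim=10):
    """Hypothetical sketch: replace all samples in a bin by their mean.

    Assumptions (not taken from the paper): equal-width bins, the target
    y is binned together with the features, one representative per bin.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    data = np.column_stack([X, y])

    # Equal-width bin index per dimension; the maximum value is clipped
    # into the last bin. Constant columns get a span of 1 to avoid 0/0.
    lo = data.min(axis=0)
    hi = data.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    idx = np.floor((data - lo) / span * bins_per_dim).astype(np.int64)
    idx = np.clip(idx, 0, bins_per_dim - 1)

    # Single pass over the samples with a hash map keyed by the bin
    # coordinates, so the work grows linearly with the sample count.
    sums = {}
    counts = {}
    for key, row in zip(map(tuple, idx), data):
        if key in sums:
            sums[key] += row
            counts[key] += 1
        else:
            sums[key] = row.copy()
            counts[key] = 1

    reduced = np.array([sums[k] / counts[k] for k in sums])
    return reduced[:, :-1], reduced[:, -1]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(size=(100_000, 2))
    y = X[:, 0] + 0.1 * X[:, 1]
    Xr, yr = binned_reduction(X, y, bins_per_dim=10)
    print(len(X), "->", len(Xr), "samples")
```

The reduced arrays can be passed to any regression model in place of the full data set. In this sketch the number of retained samples is bounded by the number of bins, which grows exponentially with the number of dimensions; this is consistent with the abstract's remark that the approach suits large, low-dimensional data sets.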
Keywords
Big Data, discretization, histogram, neural network, numerosity reduction, regression, training data
Suggested Citation
Wibbeke J, Teimourzadeh Baboli P, Rohjans S. Optimal Data Reduction of Training Data in Machine Learning-Based Modelling: A Multidimensional Bin Packing Approach. (2023). LAPSE:2023.13455
Author Affiliations
Wibbeke J: Department for Civil Engineering Geoinformation and Health Technology, Jade University of Applied Science, 26121 Oldenburg, Germany; Energy Department, OFFIS—Institute for Information Technology, 26121 Oldenburg, Germany
Teimourzadeh Baboli P: Energy Department, OFFIS—Institute for Information Technology, 26121 Oldenburg, Germany
Rohjans S: Department for Civil Engineering Geoinformation and Health Technology, Jade University of Applied Science, 26121 Oldenburg, Germany
Journal Name
Energies
Volume
15
Issue
9
First Page
3092
Year
2022
Publication Date
2022-04-23
Published Version
ISSN
1996-1073
Version Comments
Original Submission
Other Meta
PII: en15093092, Publication Type: Journal Article
External Link

doi:10.3390/en15093092
Publisher Version
License
CC BY 4.0
Meta
Version History
[v1] (Original Submission)
Mar 1, 2023
 
Verified by curator on
Mar 1, 2023
This Version Number
v1
URL Here
https://psecommunity.org/LAPSE:2023.13455
 
Original Submitter
Auto Uploader for LAPSE
Links to Related Works
Directly Related to This Work
Publisher Version