LAPSE:2023.3014
Published Article
A Study of Text Vectorization Method Combining Topic Model and Transfer Learning
Xi Yang, Kaiwen Yang, Tianxu Cui, Min Chen, Liyan He
February 21, 2023
Abstract
With the development of Internet cloud technology, the scale of data keeps expanding, and traditional processing methods struggle to extract information from big data. Machine-learning-assisted intelligent processing is therefore needed to extract information from data and to solve optimization problems in complex systems. Data are stored in many forms; among them, text is an important type because it directly reflects semantic information. Text vectorization is a basic step in natural language processing tasks: because text data cannot be used directly for model parameter training, the original text must first be converted into numerical vectors before feature extraction can be carried out. Traditional text vectorization is often realized by constructing a bag of words, but the vectors generated this way cannot reflect the semantic relationships between words, and they easily cause data sparsity and dimension explosion. This paper therefore proposes a text vectorization method that combines a topic model with transfer learning. First, a topic model is used to model the text data and extract its keywords, capturing the main information of the text. Then the pretrained bidirectional encoder representations from transformers (BERT) model is used for transfer learning to generate vectors, which are applied to the calculation of similarity between texts. In a comparative experiment, this method is compared with traditional vectorization methods. The experimental results show that the vectors generated by the proposed topic-modeling- and transfer-learning-based text vectorization (TTTV) give better results when calculating the similarity between texts with the same topic, which means that TTTV can more accurately judge whether two given texts belong to the same topic.
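The bag-of-words limitation described in the abstract can be illustrated with a minimal sketch. The Python code below is not the authors' implementation: the helper names `bag_of_words` and `cosine_similarity` and the three sample sentences are assumptions made purely for illustration. It builds count vectors with one dimension per distinct token and compares them by cosine similarity.

```python
import math
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count vectors), one dimension per distinct token.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split.
    """
    tokenized = [text.lower().split() for text in texts]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Illustrative sentences (assumed, not taken from the paper's data set).
texts = [
    "doctors treat patients",
    "physicians cure illness",   # similar meaning, no shared tokens
    "doctors treat illness",     # shares two tokens with the first text
]
vocab, vectors = bag_of_words(texts)

sim_synonyms = cosine_similarity(vectors[0], vectors[1])
sim_overlap = cosine_similarity(vectors[0], vectors[2])
print(len(vocab), sim_synonyms, round(sim_overlap, 3))
```

In this toy case the first two sentences share no tokens, so their similarity is 0 even though their meanings are close, while the third sentence scores about 0.667 on token overlap alone. The vector length equals the vocabulary size, so it grows with every new word, which is the sparsity and dimensionality problem that motivates dense vectors from a pretrained model.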
Keywords
pretrained model, text vectorization, topic model, transfer learning
Suggested Citation
Yang X, Yang K, Cui T, Chen M, He L. A Study of Text Vectorization Method Combining Topic Model and Transfer Learning. (2023). LAPSE:2023.3014
Author Affiliations
Yang X: School of Information, Beijing Wuzi University, Beijing 101149, China; School of Computer & Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
Yang K: School of Information, Beijing Wuzi University, Beijing 101149, China
Cui T: School of Information, Beijing Wuzi University, Beijing 101149, China
Chen M: School of Information, Beijing Wuzi University, Beijing 101149, China
He L: School of Information, Beijing Wuzi University, Beijing 101149, China
Journal Name: Processes
Volume: 10
Issue: 2
First Page: 350
Year: 2022
Publication Date: 2022-02-11
ISSN: 2227-9717
Version Comments: Original Submission
Other Meta: PII: pr10020350, Publication Type: Journal Article
External Link (Publisher Version): https://doi.org/10.3390/pr10020350
License: CC BY 4.0
Version History: [v1] (Original Submission), Feb 21, 2023
Verified by curator on: Feb 21, 2023
This Version Number: v1
Record URL: https://psecommunity.org/LAPSE:2023.3014
Record Owner: Auto Uploader for LAPSE