
5 research outputs found

Can Automatic Classification Help to Increase Accuracy in Data Collection?

Author: Diego Chavarro
Frederique Lang
Yuxian Liu
Publication venue: 'Journal of Data and Information Science'
Publication date: 18/09/2016

Purpose: The authors aim to test the performance of a set of machine learning algorithms that could improve the process of data cleaning when building datasets.
Design/methodology/approach: The paper centers on cleaning datasets gathered from publishers and online resources by the use of specific keywords. In this case, we analyzed data from the Web of Science. The accuracy of various forms of automatic classification was compared with manual coding to determine their usefulness for data collection and cleaning. We assessed the performance of seven supervised classification algorithms (Support Vector Machine (SVM), Scaled Linear Discriminant Analysis, Lasso and elastic-net regularized generalized linear models, Maximum Entropy, Regression Tree, Boosting, and Random Forest) and analyzed two properties: accuracy and recall. We assessed not only each algorithm individually, but also their combinations through a voting scheme. We also tested the performance of these algorithms with different sizes of training data. When assessing the performance of different combinations, we used an indicator of coverage to account for the agreement and disagreement on classification between algorithms.
Findings: We found that the performance of the algorithms varies with the size of the training sample. However, for the classification exercise in this paper the best-performing algorithms were SVM and Boosting. The combination of these two algorithms achieved high agreement on coverage and was highly accurate. This combination performs well with a small training dataset (10%), which may reduce the manual work needed for classification tasks.
Research limitations: The dataset gathered has significantly more records related to the topic of interest than records on unrelated topics. This may affect the performance of some algorithms, especially in their identification of unrelated papers.
Practical implications: Although the classification achieved by this means is not completely accurate, the amount of manual coding needed can be greatly reduced by using classification algorithms. This can be of great help when the dataset is big. With the help of accuracy, recall, and coverage measures, it is possible to estimate the error involved in this classification, which could open the possibility of incorporating these algorithms in software specifically designed for data cleaning and classification.
Originality/value: We analyzed the performance of seven algorithms and whether combinations of these algorithms improve accuracy in data collection. Use of these algorithms could reduce the time needed for manual data cleaning.
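The abstract above evaluates a two-classifier combination with coverage, accuracy, and recall. The following minimal sketch illustrates how such measures might be computed. It is not the authors' implementation: the function name `ensemble_agreement`, the toy label lists, and the exact definitions are assumptions. Here coverage is taken to be the share of records on which both classifiers agree, and accuracy and recall are measured against manual coding on those agreed records only.

```python
def ensemble_agreement(pred_a, pred_b, manual, positive=1):
    """Compare two classifiers' labels against manual coding.

    Assumed definitions (not taken from the paper):
      coverage = share of records where both classifiers agree
      accuracy = share of agreed records matching the manual code
      recall   = share of manually relevant agreed records labelled relevant
    """
    if not (len(pred_a) == len(pred_b) == len(manual)):
        raise ValueError("all label lists must have the same length")

    # Keep only the records on which both classifiers give the same label.
    agreed = [(a, m) for a, b, m in zip(pred_a, pred_b, manual) if a == b]
    coverage = len(agreed) / len(manual) if manual else 0.0

    # Accuracy of the agreed label against manual coding.
    correct = sum(1 for p, m in agreed if p == m)
    accuracy = correct / len(agreed) if agreed else 0.0

    # Recall for the relevant (positive) class among agreed records.
    relevant = [p for p, m in agreed if m == positive]
    true_pos = sum(1 for p in relevant if p == positive)
    recall = true_pos / len(relevant) if relevant else 0.0

    return {"coverage": coverage, "accuracy": accuracy, "recall": recall}


# Hypothetical toy labels: 1 = related to the topic, 0 = unrelated.
svm_pred = [1, 1, 0, 1, 0, 1, 1, 0]
boosting_pred = [1, 1, 0, 0, 0, 1, 1, 1]
manual_codes = [1, 1, 0, 1, 1, 1, 1, 0]

result = ensemble_agreement(svm_pred, boosting_pred, manual_codes)
print(result)
```

In a workflow of this kind, records on which the two classifiers disagree would be the ones passed on for manual coding, so higher coverage means less manual work.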

National Science Library, Chinese Academy of Sciences

First observation of yrast band in odd-odd Lu-162

Author: GJLiu
QZSun
SXYuan
XFLei
XFZhu
XGGuo
XHZhao
YHZhou
YTWen
YXLiu
ZChen
Zhang

High spin states of the odd-odd Lu-162 nucleus have been studied via ...

Institutional Repository of Institute of Modern Physics, CAS

High-spin states in Lu-162 and signature inversion of yrast bands in doubly odd nuclei around the A=160 mass region

Author: GJLiu
QZZhou
SXYuan
XFLei
XFZhu
XGGuo
XHSun
YHZhao
YTWen
YXLiu
ZChen
Zhang

High-spin states of Lu-162 have been produced and studied via the ...

Institutional Repository of Institute of Modern Physics, CAS

Study of isotopic distributions of the fragments from the 25 MeV/u Ar-40 induced reactions

Author: GHWang
HHQi
JCLuo
JQZhao
Lin
WLGuo
WSQin
YFLei
YGZhan
YXLiu
ZYZhou
ZZhang

The forward emitted fragments in 25 MeV/u Ar-40 induced reactions ...

Institutional Repository of Institute of Modern Physics, CAS

Signature inversion of yrast band in odd-odd Lu-162 nucleus

Author: GJLiu
QZSun
SXYuan
XFLei
XFZhu
XGGuo
XHZhao
YHZhou
YTWen
YXLiu
ZChen
Zhang

High spin states of the odd-odd Lu-162 nucleus have been studied via ...

Institutional Repository of Institute of Modern Physics, CAS