TY - JOUR
T1 - An automatic data cleaning procedure for electron cyclotron emission imaging on EAST tokamak using machine learning algorithm
AU - Li, C.
AU - Lan, T.
AU - Wang, Y.
AU - Liu, J.
AU - Xie, J.
AU - Lan, T.
AU - Li, H.
AU - Qin, H.
N1 - Publisher Copyright: © 2018 IOP Publishing Ltd and Sissa Medialab.
PY - 2018/10/24
Y1 - 2018/10/24
AB - A new data cleaning procedure for the electron cyclotron emission imaging (ECEI) of the EAST tokamak is developed. Machine learning techniques, including support vector machines (SVM) and decision trees, are applied to identify saturated, zero, and weak signals in the ECEI raw data. As a result, the burden of data analysis is reduced, and the classification accuracy is improved. Proper training sets are sampled from the massive raw ECEI data of the EAST tokamak. The optimal window size of the temporal signals, the kernel function, and other model parameters are obtained through model training. Five-fold cross-validation (CV) is applied during modeling, and an external testing set is employed to validate the prediction performance of the models. The average recall rates on the CV sets for saturated, zero, and weak signals are 95.9%, 96.72%, and 100%, respectively, which demonstrates the accuracy of this procedure. Random Forest is also applied to the same data sets as a comparative method; its average recall rates on the CV sets for saturated, zero, and weak signals are 95.9%, 96.72%, and 95.88%, respectively. Our method is shown to outperform Random Forest on small data sets.
KW - Analysis and statistical methods
KW - Data processing methods
UR - https://www.scopus.com/pages/publications/85056115771
U2 - 10.1088/1748-0221/13/10/P10029
DO - 10.1088/1748-0221/13/10/P10029
M3 - Article
AN - SCOPUS:85056115771
SN - 1748-0221
VL - 13
JO - Journal of Instrumentation
JF - Journal of Instrumentation
IS - 10
M1 - P10029
ER -