A study on the mechanical behavior of rock-fill materials in railway roadbed
Master's thesis, Seoul National University Graduate School, Department of Energy Systems Engineering, February 2010.
Bias corrections for Random Forest in regression using residual rotation
This paper studies bias correction methods for Random Forest in regression. Random Forest is a bagging ensemble of trees that can be used for regression and classification, and it is popular because of its high prediction accuracy. However, we find that Random Forest can at times have significant bias in regression. We propose a method to reduce the bias of Random Forest in regression using residual rotation. The real data applications show that our method can reduce the bias of Random Forest significantly. © 2015 The Korean Statistical Society
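The abstract does not spell out the residual-rotation construction. As a minimal sketch, assuming simulated data and scikit-learn's RandomForestRegressor, the code below illustrates the kind of regression bias the paper targets and a simple two-stage residual correction; it is a generic variant for illustration, not the paper's residual rotation method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Simulated regression data (hypothetical setup, not from the paper).
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 5))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=500)

rf = RandomForestRegressor(n_estimators=300, oob_score=True, random_state=0).fit(X, y)

# Bias shows up as systematic structure in the out-of-bag residuals.
resid = y - rf.oob_prediction_

# A simple two-stage correction: model the residuals with a second forest
# and add its prediction back (a generic correction, not residual rotation).
rf_resid = RandomForestRegressor(n_estimators=300, random_state=1).fit(X, resid)
corrected = rf.predict(X) + rf_resid.predict(X)
print("mean residual before/after:", resid.mean(), (y - corrected).mean())
```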
A sequential clustering algorithm with applications to gene expression data
Clustering algorithms are used in the analysis of gene expression data to identify groups of genes with similar expression patterns. These algorithms group genes with respect to a predefined dissimilarity measure without using any prior classification of the data. Most clustering algorithms require the number of clusters as input, and all the objects in the dataset are usually assigned to one of the clusters. We propose a clustering algorithm that finds clusters sequentially and allows for sporadic objects, that is, objects that are not assigned to any cluster. The proposed sequential clustering algorithm has two steps. First, it finds candidates for cluster centers; multiple candidates are used to make the search for clusters more efficient. Second, it conducts a local search around the candidate centers to find the set of objects that defines a cluster. The candidate clusters are compared using a predefined score, the best cluster is removed from the data, and the procedure is repeated. We investigate the performance of this algorithm using simulated data, and we apply the method to analyze gene expression profiles in a study on the plasticity of dendritic cells. © 2008 The Korean Statistical Society
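As a rough illustration of the two-step search described above, here is a minimal Python sketch. The dissimilarity (Euclidean), the cluster radius, the scoring rule (mean within-cluster distance), and the stopping condition are all hypothetical choices, not the ones defined in the paper.

```python
import numpy as np

def sequential_clusters(X, n_candidates=5, radius=1.0, min_size=5):
    """Greedy sketch: pick candidate centers, locally grow a cluster around
    each, keep the best-scoring one, remove it from the data, and repeat.
    Objects never captured by any cluster remain as sporadic objects."""
    remaining = np.arange(len(X))
    clusters = []
    rng = np.random.default_rng(0)
    while len(remaining) >= min_size:
        candidates = rng.choice(remaining, size=min(n_candidates, len(remaining)),
                                replace=False)
        best = None
        for c in candidates:
            d = np.linalg.norm(X[remaining] - X[c], axis=1)
            members = remaining[d <= radius]           # local search around the candidate
            if len(members) < min_size:
                continue
            score = d[d <= radius].mean()              # hypothetical cluster score
            if best is None or score < best[0]:
                best = (score, members)
        if best is None:
            break
        clusters.append(best[1])
        remaining = np.setdiff1d(remaining, best[1])   # remove the best cluster
    return clusters, remaining                         # `remaining` = sporadic objects
```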
Robust gene selection methods using weighting schemes for microarray data analysis
Background: A common task in microarray data analysis is to identify informative genes that are differentially expressed between two different states. Owing to the high-dimensional nature of microarray data, identifying significant genes is essential in analyzing such data. However, the performance of many gene selection techniques depends strongly on the experimental conditions, such as the presence of measurement error or a limited number of sample replicates. Results: We propose new filter-based gene selection techniques, obtained by applying a simple modification to significance analysis of microarrays (SAM). To demonstrate the effectiveness of the proposed methods, we considered a series of synthetic datasets with different noise levels and sample sizes, along with two real datasets. The following findings were made. First, our proposed methods outperform conventional methods for all simulation set-ups; in particular, they are much better when the given data are noisy and the sample size is small. They showed relatively robust performance regardless of noise level and sample size, whereas the performance of SAM became significantly worse as the noise level increased or the sample size decreased. Second, when sufficient sample replicates were available, SAM and our methods showed similar performance. Finally, our proposed methods are competitive with traditional methods in classification tasks for microarrays. Conclusions: The results of the simulation study and real data analysis demonstrate that our proposed methods are effective for detecting significant genes and for classification tasks, especially when the given data are noisy or have few sample replicates. By employing weighting schemes, we can obtain robust and reliable results for microarray data analysis. © 2017 The Author(s)
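For reference, the standard SAM statistic that the proposed filters modify is sketched below. The per-replicate weighting hook is a hypothetical placeholder, since the abstract does not give the exact weighting scheme used in the paper.

```python
import numpy as np

def sam_statistic(x1, x2, s0=0.1, weights=None):
    """SAM-style relative difference d_i = (mean1 - mean2) / (s_i + s0),
    computed per gene for expression matrices x1, x2 of shape (genes, replicates).
    `weights` is a hypothetical (w1, w2) pair of per-replicate weights;
    the paper's actual weighting scheme is not specified in the abstract."""
    if weights is None:
        m1, m2 = x1.mean(axis=1), x2.mean(axis=1)
    else:
        w1, w2 = weights
        m1 = (x1 * w1).sum(axis=1) / w1.sum()
        m2 = (x2 * w2).sum(axis=1) / w2.sum()
    n1, n2 = x1.shape[1], x2.shape[1]
    pooled = ((x1 - x1.mean(axis=1, keepdims=True)) ** 2).sum(axis=1) + \
             ((x2 - x2.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
    s = np.sqrt((1 / n1 + 1 / n2) * pooled / (n1 + n2 - 2))
    return (m1 - m2) / (s + s0)
```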
Understanding recurrent neural network for texts using English-Korean corpora
Deep Learning is the most important key to the development of Artificial Intelligence (AI). There are several distinguishable architectures of neural networks, such as MLP, CNN, and RNN. Among them, we try to understand the Recurrent Neural Network (RNN), which differs from other networks in handling sequential data, including time series and texts. As one of the main recent tasks in Natural Language Processing (NLP), we consider Neural Machine Translation (NMT) using RNNs. We also summarize fundamental structures of recurrent networks and some approaches to representing natural words as reasonable numeric vectors. We organize these topics to explain the estimation procedure, from representing input source sequences to predicting target translated sequences. In addition, we apply multiple translation models with Gated Recurrent Units (GRUs) in Keras to English-Korean sentences, about 26,000 pairwise sequences in total from two different corpora, colloquial speech and news. We verified some crucial factors that influence the quality of training. We found that the loss decreases with more recurrent dimensions and with a bidirectional RNN in the encoder when dealing with short sequences. We also computed BLEU scores, the main measure of translation performance, and compared them with the scores from Google Translate on the same test sentences. We sum up some difficulties in training a proper translation model as well as in dealing with the Korean language. Using Keras in Python for the overall workflow, from processing raw texts to evaluating the translation model, also lets us draw on useful functions and vocabulary libraries. © 2020 The Korean Statistical Society, and Korean International Statistical Society
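The abstract does not give the exact model configuration. The sketch below, under assumed vocabulary sizes, sequence lengths, and dimensions, shows one way to build the kind of GRU encoder-decoder described (bidirectional GRU encoder, GRU decoder) in tf.keras; it is not the paper's trained model.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Hypothetical sizes; the paper's vocabularies and dimensions are not given here.
src_vocab, tgt_vocab, emb_dim, units, src_len, tgt_len = 8000, 8000, 128, 256, 20, 20

# Encoder: embedding + bidirectional GRU, as the abstract suggests for short sequences.
enc_in = layers.Input(shape=(src_len,))
enc_emb = layers.Embedding(src_vocab, emb_dim, mask_zero=True)(enc_in)
enc_out, fwd_h, bwd_h = layers.Bidirectional(
    layers.GRU(units, return_state=True))(enc_emb)
enc_state = layers.Concatenate()([fwd_h, bwd_h])

# Decoder: GRU initialized with the encoder state, then a softmax over the target vocabulary.
dec_in = layers.Input(shape=(tgt_len,))
dec_emb = layers.Embedding(tgt_vocab, emb_dim, mask_zero=True)(dec_in)
dec_out = layers.GRU(2 * units, return_sequences=True)(dec_emb, initial_state=enc_state)
dec_probs = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = Model([enc_in, dec_in], dec_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```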
Asymptotic option pricing under pure-jump Lévy processes via nonlinear regression
When the underlying asset price process follows a Lévy process, the market becomes incomplete, and option pricing can be a complicated problem. This paper proposes a method of asymptotic option pricing when the underlying asset price process follows a pure-jump Lévy process. We express the option price as the expected value of the discounted payoff and expand it around the Black-Scholes price, assuming that the price process converges weakly to the Black-Scholes model. The price can then be approximated by a formula with four parameters, which can easily be estimated using option prices observed in the market. The proposed price explains the market option data better than the Black-Scholes price in a real data application with KOSPI 200 options. © 2010 The Korean Statistical Society
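As an illustration of the nonlinear-regression step only, the sketch below fits a hypothetical four-parameter price formula (the Black-Scholes call price plus generic polynomial corrections in log-moneyness) to observed call prices with scipy. The correction terms, strikes, and prices are placeholders, not the paper's expansion or its KOSPI 200 data.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import curve_fit

def bs_call(K, S, r, T, sigma):
    """Black-Scholes price of a European call with strike K."""
    d1 = (np.log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * np.sqrt(T))
    d2 = d1 - sigma * np.sqrt(T)
    return S * norm.cdf(d1) - K * np.exp(-r * T) * norm.cdf(d2)

def model_price(K, sigma, c1, c2, c3, S=100.0, r=0.02, T=0.25):
    """Hypothetical four-parameter form: Black-Scholes price plus polynomial
    corrections in log-moneyness (a placeholder for the paper's expansion)."""
    m = np.log(K / S)
    return bs_call(K, S, r, T, sigma) + c1 * m + c2 * m ** 2 + c3 * m ** 3

# Placeholder quotes; in practice these would be observed market option prices.
strikes = np.array([90.0, 95.0, 100.0, 105.0, 110.0])
market_prices = np.array([11.2, 7.1, 3.9, 1.8, 0.7])
params, _ = curve_fit(model_price, strikes, market_prices, p0=[0.2, 0.0, 0.0, 0.0])
print("estimated (sigma, c1, c2, c3):", params)
```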
A new dissimilarity measure in time-dependent experiments
Most distance measures used in unsupervised learning methods, including the Euclidean distance and correlation-based distances, disregard the time order of observations. In this paper, we consider a new dissimilarity measure that incorporates the time order of observations for time-dependent experiments. It measures the distance between linear combinations of two consecutive observations. To account for the length of the time interval between observations, the measure is weighted by the time length Δt_i. We show that this measure has larger asymptotic discriminating power than the Euclidean distance, and that it also performs well in small samples. © 2008 The Korean Statistical Society
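A minimal numpy sketch of a dissimilarity built from consecutive observations and weighted by the time gaps Δt_i is given below. The particular linear combination (a convex combination of consecutive observations with weight alpha) is a hypothetical choice for illustration, not necessarily the one defined in the paper.

```python
import numpy as np

def time_order_dissimilarity(x, y, t, alpha=0.5):
    """Dissimilarity between two profiles x and y observed at times t.
    Each profile is summarized by linear combinations of consecutive
    observations, alpha * x_i + (1 - alpha) * x_{i+1}, and the squared
    differences are weighted by the time gaps Delta t_i.
    The combination weight alpha is a hypothetical choice."""
    x, y, t = map(np.asarray, (x, y, t))
    dt = np.diff(t)
    cx = alpha * x[:-1] + (1 - alpha) * x[1:]
    cy = alpha * y[:-1] + (1 - alpha) * y[1:]
    return float(np.sum(dt * (cx - cy) ** 2))

# Example: two short expression profiles on an uneven time grid.
print(time_order_dissimilarity([0.1, 0.5, 0.9], [0.2, 0.4, 1.1], t=[0, 1, 3]))
```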
Session-based classification of internet applications in 3G wireless networks
Accurately classifying and identifying wireless network traffic associated with various applications, such as Web, VoIP, and VoD, is a challenge for both service providers and network operators. Traditional classification schemes exploiting port or payload analysis are becoming ineffective in actual networks as many new applications emerge. This paper presents the classification of HSDPA network traffic applications using Classification and Regression Trees (CART) and Support Vector Machines (SVM), with session information as the basic measure. A session is a bidirectional traffic stream between two hosts, used here as the basic measure and unit of information. We acquired and processed HSDPA traffic from a real 3G network without sanitizing the data. CART and SVM are used to classify six application groups (download, game, upload, VoD, VoIP, and web) with a set of twelve easily retrievable features. These features are composed of simple statistics, such as the standard deviation of the packet sizes, the number of packets, and the duration of a session. Compared to the results of flow-based application classification, session-based classification produces 11.07% (CART) and 21.99% (SVM) increases in the true positive rate. The feature set is further reduced to two principal components using Principal Component Regression. This paper also compares CART to K-Means, a clustering scheme used for wired network traffic, and shows that CART is more accurate for classification than K-Means. © 2011 Elsevier B.V. All rights reserved
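A minimal scikit-learn sketch of the classification step is below. The randomly generated feature matrix stands in for the twelve session-level statistics described above, the labels are the six application groups, and the hyperparameters are defaults rather than the ones tuned in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

# Placeholder session features (e.g. packet-size std, packet count, duration, ...)
# and application labels; the real HSDPA session data are not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 12))
y = rng.choice(["download", "game", "upload", "VoD", "VoIP", "web"], size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

cart = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)            # CART-style tree
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_tr, y_tr)  # SVM with scaling

print("CART accuracy:", cart.score(X_te, y_te))
print("SVM accuracy:", svm.score(X_te, y_te))
```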
A quantile estimation for massive data with generalized Pareto distribution
This paper proposes a new method of estimating extreme quantiles of heavy-tailed distributions for massive data. The method uses the Peaks Over Threshold (POT) approach with the generalized Pareto distribution (GPD), which is commonly used to estimate extreme quantiles, and estimates the GPD parameters using the empirical distribution function (EDF) and nonlinear least squares (NLS). We first estimate the parameters of the GPD using the EDF and NLS, and then estimate multiple high quantiles for massive data from the observations over a given threshold using the conventional POT approach. The simulation results demonstrate that our parameter estimation method has a smaller mean squared error (MSE) than other common methods when the shape parameter of the GPD is at least 0. The estimated quantiles also show the best performance in terms of root MSE (RMSE) and absolute relative bias (ARB) for heavy-tailed distributions. © 2011 Elsevier B.V. All rights reserved
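A minimal sketch of the EDF/NLS fitting step and the resulting POT quantile is given below, assuming a simulated heavy-tailed sample and a 95% threshold as placeholders; the paper's actual estimation details and data are not reproduced.

```python
import numpy as np
from scipy.optimize import curve_fit

def gpd_cdf(y, xi, sigma):
    """Generalized Pareto CDF for exceedances y >= 0 (xi != 0 case)."""
    return 1.0 - np.power(1.0 + xi * y / sigma, -1.0 / xi)

# Placeholder heavy-tailed sample and threshold; in practice these come from the data.
rng = np.random.default_rng(0)
x = rng.pareto(2.0, size=100_000) + 1.0
u = np.quantile(x, 0.95)                        # threshold
exceed = np.sort(x[x > u] - u)                  # exceedances over the threshold
edf = (np.arange(1, len(exceed) + 1) - 0.5) / len(exceed)

# NLS: fit the GPD CDF to the empirical distribution function of the exceedances.
(xi_hat, sigma_hat), _ = curve_fit(gpd_cdf, exceed, edf, p0=[0.3, 1.0],
                                   bounds=([1e-6, 1e-6], [2.0, 50.0]))

# POT estimate of a high quantile x_p:
#   x_p = u + (sigma/xi) * [((n / N_u) * (1 - p))^(-xi) - 1]
n, n_u, p = len(x), len(exceed), 0.999
q_hat = u + (sigma_hat / xi_hat) * (((n / n_u) * (1 - p)) ** (-xi_hat) - 1.0)
print("estimated 99.9% quantile:", q_hat)
```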
Introduction to convolutional neural network using Keras; An understanding from a statistician
Deep Learning is one of the machine learning methods used to find features in huge data via non-linear transformations. It is now commonly used for supervised learning in many fields. In particular, the Convolutional Neural Network (CNN) has been the best-performing technique for image classification since 2012. For users who consider deep learning models for real-world applications, Keras is a popular API for neural networks, written in Python, that can also be used from R. We examine the parameter estimation procedures of deep neural networks and the structures of CNN models, from basics to advanced techniques. We also identify some crucial steps in CNNs that can improve image classification performance on the CIFAR10 dataset using Keras. We found that several stacks of convolutional layers and batch normalization could improve prediction performance. We also compared image classification performance with other machine learning methods, including K-Nearest Neighbors (K-NN), Random Forest, and XGBoost, on both the MNIST and CIFAR10 datasets. © Korean Statistical Society
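The sketch below, under assumed layer sizes and filter counts, shows the kind of stacked Conv2D + BatchNormalization architecture the abstract refers to, written in tf.keras for CIFAR-10; it is an illustrative model, not the exact network trained in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def conv_block(x, filters):
    """Two convolutional layers with batch normalization, then downsampling."""
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
    return layers.MaxPooling2D()(x)

inputs = layers.Input(shape=(32, 32, 3))          # CIFAR-10 images
x = conv_block(inputs, 32)
x = conv_block(x, 64)
x = layers.Flatten()(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(10, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# model.fit(x_train / 255.0, y_train, epochs=10, validation_split=0.1)
model.summary()
```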
