All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch
Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, though NLP-inspired research has focused on adding more complex readability features, there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and a crowd, we implement different types of text characteristics, ranging from easy-to-compute superficial text characteristics to features requiring deep linguistic processing, resulting in ten different feature groups. Both a regression and a classification setup are investigated, reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations for readability optimization using a wrapper-based genetic algorithm optimization approach is a promising task which provides considerable insights into which feature combinations contribute to the overall readability prediction. Since we also have gold standard information available for those features requiring deep processing, we are able to investigate the true upper bound of our Dutch system. Interestingly, we observe that the performance of our fully automatic readability prediction pipeline is on par with the pipeline using golden deep syntactic and semantic information.
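As a rough illustration of what a wrapper-based genetic algorithm for feature selection can look like, the sketch below evolves bit masks over features and scores each mask with a fitness function. It is a minimal sketch under stated assumptions, not the authors' system: `genetic_feature_selection`, `toy_fitness`, and `USEFUL` are hypothetical names, and the toy fitness merely stands in for the cross-validated model performance that a real wrapper would compute.

```python
import random

def genetic_feature_selection(n_features, fitness, pop_size=20,
                              generations=30, mutation_rate=0.1, seed=0):
    """Evolve 0/1 feature masks; `fitness` scores a mask (higher is better)."""
    rng = random.Random(seed)
    pop = [tuple(rng.randint(0, 1) for _ in range(n_features))
           for _ in range(pop_size)]
    best = max(pop, key=fitness)

    for _ in range(generations):
        def pick():
            # Binary tournament: keep the fitter of two random individuals.
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b

        children = [best]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n_features)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Flip each bit independently with probability mutation_rate.
            child = tuple(bit ^ (rng.random() < mutation_rate) for bit in child)
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
    return best

# Toy stand-in for a wrapper fitness (assumption): reward three "useful"
# features and penalize every other selected feature.
USEFUL = {0, 2, 5}

def toy_fitness(mask):
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in USEFUL)
    noise = sum(1 for i, bit in enumerate(mask) if bit and i not in USEFUL)
    return hits - 0.5 * noise

best_mask = genetic_feature_selection(8, toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

In a real wrapper setup, `toy_fitness` would be replaced by training and validating the learner on the selected feature subset, which is what makes the approach expensive but sensitive to feature interactions.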
Unbalanced load flow with hybrid wavelet transform and support vector machine based Error-Correcting Output Codes for power quality disturbances classification including wind energy
Purpose. The most common way to design a multiclass classifier is to train a set of binary classifiers and combine them. In this paper, a support vector machine with Error-Correcting Output Codes (ECOC-SVM) classifier is proposed to classify and characterize power quality disturbances such as harmonic distortion, voltage sag, and voltage swell in power transmission systems that include wind farm generators. First, a three-phase unbalanced load flow analysis is executed to calculate the different electrical network characteristics: voltage levels and active and reactive power. Next, the discrete wavelet transform is combined with the probabilistic ECOC-SVM model to construct the classifier. Finally, the ECOC-SVM classifies and identifies the disturbance type according to the energy deviation of the discrete wavelet transform. The proposed method achieves a satisfactory accuracy of 99.2% compared with well-known methods and shows that each power quality disturbance has specific deviations from the pure sinusoidal waveform, which helps to recognize and specify the type of disturbance generated by the wind power generator.
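The idea of combining binary classifiers through error-correcting codes can be sketched as follows. This is an illustrative sketch only: the class labels and the 7-bit code matrix (an exhaustive code for four classes) are assumptions chosen for the example, not the coding used in the paper, and the binary classifier outputs are supplied by hand rather than produced by SVMs.

```python
# Hypothetical code matrix: one codeword per class, one column per
# binary classifier. Every pair of codewords differs in 4 positions,
# so a single wrong binary decision can still be corrected.
CODE_MATRIX = {
    "normal":   (1, 1, 1, 1, 1, 1, 1),
    "harmonic": (0, 0, 0, 0, 1, 1, 1),
    "sag":      (0, 0, 1, 1, 0, 0, 1),
    "swell":    (0, 1, 0, 1, 0, 1, 0),
}

def hamming(a, b):
    """Number of positions where two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))

def ecoc_decode(bits):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(CODE_MATRIX, key=lambda label: hamming(CODE_MATRIX[label], bits))

# Simulate the seven binary classifiers voting for "sag" with one mistake.
noisy = list(CODE_MATRIX["sag"])
noisy[0] ^= 1
print(ecoc_decode(tuple(noisy)))
```

Each column of the matrix defines one binary problem (classes with bit 1 against classes with bit 0), so training amounts to fitting one binary classifier per column and decoding amounts to the nearest-codeword lookup shown here.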
On the design of an ECOC-compliant genetic algorithm
Genetic Algorithms (GA) have been previously applied to Error-Correcting Output Codes (ECOC) in state-of-the-art works in order to find a suitable coding matrix. Nevertheless, none of the presented techniques directly takes into account the properties of the ECOC matrix. As a result, the considered search space is unnecessarily large. In this paper, a novel genetic strategy to optimize the ECOC coding step is presented. This novel strategy redefines the usual crossover and mutation operators in order to take into account the theoretical properties of the ECOC framework. Thus, it reduces the search space and lets the algorithm converge faster. In addition, a novel operator that is able to enlarge the code in a smart way is introduced. The novel methodology is tested on several UCI datasets and four challenging computer vision problems. Furthermore, the analysis of the results in terms of performance, code length, and number of Support Vectors (SVs) shows that the optimization process is able to find very efficient codes in terms of the trade-off between classification performance and the number of classifiers. Finally, the classification performance per dichotomizer shows that the novel proposal is able to obtain similar or even better results while defining a more compact number of dichotomies and SVs compared to state-of-the-art approaches.
Elephant Search with Deep Learning for Microarray Data Analysis
Even though there is a plethora of research on microarray gene expression
data analysis, it remains challenging for researchers to effectively and
efficiently analyze large and complex gene expression data. The feature
(gene) selection method is of paramount importance for understanding the
differences in biological and non-biological variation between samples. In
order to address this problem, a novel elephant search (ES) based optimization
is proposed to select best gene expressions from the large volume of microarray
data. Further, a promising machine learning method is envisioned to leverage
such high dimensional and complex microarray dataset for extracting hidden
patterns inside to make a meaningful prediction and most accurate
classification. In particular, stochastic gradient descent based Deep learning
(DL) with softmax activation function is then used on the reduced features
(genes) for better classification of different samples according to their gene
expression levels. The experiments are carried out on nine of the most popular cancer
microarray gene selection datasets, obtained from the UCI machine learning
repository. The empirical results obtained by the proposed elephant search
based deep learning (ESDL) approach are compared with the most recently published
article to assess its suitability for future bioinformatics research. Comment: 12 pages, 5 Tabl
Soft Methodology for Cost-and-error Sensitive Classification
Many real-world data mining applications need varying cost for different
types of classification errors and thus call for cost-sensitive classification
algorithms. Existing algorithms for cost-sensitive classification are
successful in terms of minimizing the cost, but can result in a high error rate
as the trade-off. The high error rate holds back the practical use of those
algorithms. In this paper, we propose a novel cost-sensitive classification
methodology that takes both the cost and the error rate into account. The
methodology, called soft cost-sensitive classification, is established from a
multicriteria optimization problem of the cost and the error rate, and can be
viewed as regularizing cost-sensitive classification with the error rate. The
simple methodology allows immediate improvements of existing cost-sensitive
classification algorithms. Experiments on the benchmark and the real-world data
sets show that our proposed methodology indeed achieves lower test error rates
and similar (sometimes lower) test costs than existing cost-sensitive
classification algorithms. We also demonstrate that the methodology can be
extended for considering the weighted error rate instead of the original error
rate. This extension is useful for tackling unbalanced classification problems. Comment: A shorter version appeared in KDD '1
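One way to picture "regularizing cost-sensitive classification with the error rate" is to blend the cost matrix with a 0/1 error penalty before making a minimum-expected-cost prediction. The sketch below does exactly that; the additive form, the parameter `alpha`, and both function names are assumptions made for illustration and are not taken from the paper.

```python
def soft_cost_matrix(cost, alpha):
    """Add alpha to every off-diagonal entry: cost[y][k] + alpha * [y != k]."""
    n = len(cost)
    return [[cost[y][k] + (alpha if y != k else 0.0) for k in range(n)]
            for y in range(n)]

def predict_min_expected_cost(probs, cost):
    """Pick the prediction k minimizing sum_y probs[y] * cost[y][k]."""
    n = len(cost)
    expected = [sum(probs[y] * cost[y][k] for y in range(n)) for k in range(n)]
    return min(range(n), key=expected.__getitem__)

# cost[y][k]: cost of predicting class k when the true class is y.
cost = [[0.0, 5.0, 1.0],
        [5.0, 0.0, 1.0],
        [1.0, 1.0, 0.0]]
probs = [0.6, 0.3, 0.1]   # hypothetical class probabilities for one example

hard = predict_min_expected_cost(probs, cost)
soft = predict_min_expected_cost(probs, soft_cost_matrix(cost, alpha=5.0))
print(hard, soft)
```

With these numbers the purely cost-driven rule picks class 2, which is cheap to be wrong about but unlikely to be correct, while the blended rule picks class 0, the most probable class. Setting `alpha` to zero recovers the original cost-sensitive prediction.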
Parallelizing support vector machines for scalable image annotation
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Machine learning techniques have facilitated image retrieval by automatically classifying and annotating images with keywords. Among them, Support Vector Machines (SVMs) are used extensively due to their generalization properties. However, SVM training is a computationally intensive process, especially when the training dataset is large.
In this thesis, distributed computing paradigms have been investigated to speed up SVM training by partitioning a large training dataset into small data chunks and processing each chunk in parallel, utilizing the resources of a cluster of computers. A resource-aware parallel SVM algorithm is introduced for large-scale image annotation in parallel using a cluster of computers. A genetic algorithm based load balancing scheme is designed to optimize the performance of the algorithm in heterogeneous computing environments.
SVM was initially designed for binary classifications. However, most classification problems arising in domains such as image annotation usually involve more than two classes. A resource aware parallel multiclass SVM algorithm for large scale image annotation in parallel using a cluster of computers is introduced.
The combination of classifiers leads to a substantial reduction of classification error in a wide range of applications. Among such combinations, SVM ensembles with bagging are shown to outperform a single SVM in terms of classification accuracy. However, SVM ensemble training is a computationally intensive process, especially when the number of replicated samples based on bootstrapping is large. A distributed SVM ensemble algorithm for image annotation is introduced which re-samples the training data based on bootstrapping and trains an SVM on each sample in parallel using a cluster of computers.
The above algorithms are evaluated in both experimental and simulation environments, showing that the distributed SVM algorithm, the distributed multiclass SVM algorithm, and the distributed SVM ensemble algorithm reduce the training time significantly while maintaining a high level of accuracy in classification.
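The bagging pattern described above (bootstrap re-sampling, one model per sample trained in parallel, majority vote at prediction time) can be sketched in a few lines. This is a sketch under loud assumptions: a nearest-centroid learner on one-dimensional data stands in for the SVM, a local thread pool stands in for the cluster of computers, and all function names are hypothetical.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def train_centroid(sample):
    """Stand-in base learner (NOT an SVM): per-class mean of a 1-D feature."""
    sums, counts = {}, Counter()
    for x, y in sample:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_centroid(model, x):
    """Predict the class whose centroid is nearest to x."""
    return min(model, key=lambda y: abs(model[y] - x))

def bootstrap(data, seed):
    """Draw len(data) points with replacement, reproducibly per seed."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]

def train_ensemble(data, n_models=5, workers=4):
    """Train one base learner per bootstrap sample, in parallel."""
    samples = [bootstrap(data, seed) for seed in range(n_models)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(train_centroid, samples))

def predict_ensemble(models, x):
    """Majority vote over the base learners."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]

# Toy two-class data: class "a" near 0, class "b" near 1.
data = ([(i * 0.02, "a") for i in range(10)]
        + [(0.9 + i * 0.02, "b") for i in range(10)])
models = train_ensemble(data)
print(predict_ensemble(models, 0.05), predict_ensemble(models, 1.0))
```

Because each bootstrap sample is trained independently, the training step parallelizes without any communication between workers, which is what makes the ensemble a natural fit for a distributed setting.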