2,840 research outputs found

    All mixed up? Finding the optimal feature set for general readability prediction and its application to English and Dutch

    Get PDF
    Readability research has a long and rich tradition, but there has been too little focus on general readability prediction without targeting a specific audience or text genre. Moreover, though NLP-inspired research has focused on adding more complex readability features there is still no consensus on which features contribute most to the prediction. In this article, we investigate in close detail the feasibility of constructing a readability prediction system for English and Dutch generic text using supervised machine learning. Based on readability assessments by both experts and a crowd, we implement different types of text characteristics ranging from easy-to-compute superficial text characteristics to features requiring a deep linguistic processing, resulting in ten different feature groups. Both a regression and classification setup are investigated reflecting the two possible readability prediction tasks: scoring individual texts or comparing two texts. We show that going beyond correlation calculations for readability optimization using a wrapper-based genetic algorithm optimization approach is a promising task which provides considerable insights in which feature combinations contribute to the overall readability prediction. Since we also have gold standard information available for those features requiring deep processing we are able to investigate the true upper bound of our Dutch system. Interestingly, we will observe that the performance of our fully-automatic readability prediction pipeline is on par with the pipeline using golden deep syntactic and semantic information

    Unbalanced load flow with hybrid wavelet transform and support vector machine based Error-Correcting Output Codes for power quality disturbances classification including wind energy

    Get PDF
    Purpose. The most common methods to designa multiclass classification consist to determine a set of binary classifiers and to combine them. In this paper support vector machine with Error-Correcting Output Codes (ECOC-SVM) classifier is proposed to classify and characterize the power qualitydisturbances such as harmonic distortion,voltage sag, and voltage swell include wind farms generator in power transmission systems. Firstly three phases unbalanced load flow analysis is executed to calculate difference electric network characteristics, levels of voltage, active and reactive power. After, discrete wavelet transform is combined with the probabilistic ECOC-SVM model to construct the classifier. Finally, the ECOC-SVM classifies and identifies the disturbance type according tothe energy deviation of the discrete wavelet transform. The proposedmethod gives satisfactory accuracy with 99.2% compared with well known methods and shows that each power quality disturbances has specific deviations from the pure sinusoidal waveform,this is good at recognizing and specifies the type of disturbance generated from the wind power generator.Наиболее распространенные методы построения мультиклассовой классификации заключаются в определении набора двоичных классификаторов и их объединении. В данной статье предложена машина опорных векторов с классификатором выходных кодов исправления ошибок(ECOC-SVM) с целью классифицировать и характеризовать такие нарушения качества электроэнергии, как гармонические искажения, падение напряжения и скачок напряжения, включая генератор ветровых электростанций в системах передачи электроэнергии. Сначала выполняется анализ потока несимметричной нагрузки трех фаз для расчета разностных характеристик электрической сети, уровней напряжения, активной и реактивной мощности. После этого дискретное вейвлет-преобразование объединяется с вероятностной моделью ECOC-SVM для построения классификатора. Наконец, ECOC-SVM классифицирует и идентифицирует тип возмущения в соответствии с отклонением энергии дискретного вейвлет-преобразования. Предложенный метод дает удовлетворительную точность 99,2% по сравнению с хорошо известными методами и показывает, что каждое нарушение качества электроэнергии имеет определенные отклонения от чисто синусоидальной формы волны, что способствует распознаванию и определению типа возмущения, генерируемого ветровым генератором

    On the design of an ECOC-compliant genetic algorithm

    Get PDF
    Genetic Algorithms (GA) have been previously applied to Error-Correcting Output Codes (ECOC) in state-of-the-art works in order to find a suitable coding matrix. Nevertheless, none of the presented techniques directly take into account the properties of the ECOC matrix. As a result the considered search space is unnecessarily large. In this paper, a novel Genetic strategy to optimize the ECOC coding step is presented. This novel strategy redefines the usual crossover and mutation operators in order to take into account the theoretical properties of the ECOC framework. Thus, it reduces the search space and lets the algorithm to converge faster. In addition, a novel operator that is able to enlarge the code in a smart way is introduced. The novel methodology is tested on several UCI datasets and four challenging computer vision problems. Furthermore, the analysis of the results done in terms of performance, code length and number of Support Vectors shows that the optimization process is able to find very efficient codes, in terms of the trade-off between classification performance and the number of classifiers. Finally, classification performance per dichotomizer results shows that the novel proposal is able to obtain similar or even better results while defining a more compact number of dichotomies and SVs compared to state-of-the-art approaches

    Elephant Search with Deep Learning for Microarray Data Analysis

    Full text link
    Even though there is a plethora of research in Microarray gene expression data analysis, still, it poses challenges for researchers to effectively and efficiently analyze the large yet complex expression of genes. The feature (gene) selection method is of paramount importance for understanding the differences in biological and non-biological variation between samples. In order to address this problem, a novel elephant search (ES) based optimization is proposed to select best gene expressions from the large volume of microarray data. Further, a promising machine learning method is envisioned to leverage such high dimensional and complex microarray dataset for extracting hidden patterns inside to make a meaningful prediction and most accurate classification. In particular, stochastic gradient descent based Deep learning (DL) with softmax activation function is then used on the reduced features (genes) for better classification of different samples according to their gene expression levels. The experiments are carried out on nine most popular Cancer microarray gene selection datasets, obtained from UCI machine learning repository. The empirical results obtained by the proposed elephant search based deep learning (ESDL) approach are compared with most recent published article for its suitability in future Bioinformatics research.Comment: 12 pages, 5 Tabl

    Soft Methodology for Cost-and-error Sensitive Classification

    Full text link
    Many real-world data mining applications need varying cost for different types of classification errors and thus call for cost-sensitive classification algorithms. Existing algorithms for cost-sensitive classification are successful in terms of minimizing the cost, but can result in a high error rate as the trade-off. The high error rate holds back the practical use of those algorithms. In this paper, we propose a novel cost-sensitive classification methodology that takes both the cost and the error rate into account. The methodology, called soft cost-sensitive classification, is established from a multicriteria optimization problem of the cost and the error rate, and can be viewed as regularizing cost-sensitive classification with the error rate. The simple methodology allows immediate improvements of existing cost-sensitive classification algorithms. Experiments on the benchmark and the real-world data sets show that our proposed methodology indeed achieves lower test error rates and similar (sometimes lower) test costs than existing cost-sensitive classification algorithms. We also demonstrate that the methodology can be extended for considering the weighted error rate instead of the original error rate. This extension is useful for tackling unbalanced classification problems.Comment: A shorter version appeared in KDD '1
    corecore