
    Evaluating Feature Extraction Methods for Biomedical Word Sense Disambiguation

    Clint Cuffy, Sam Henry and Bridget McInnes, PhD. Virginia Commonwealth University, Richmond, Virginia, USA.
    Introduction. Biomedical text processing is a highly active research area, but ambiguity remains a barrier to the processing and understanding of these documents. Many word sense disambiguation (WSD) approaches represent instances of an ambiguous word as a distributional context vector. One problem with these vectors is noise: information that is overly general and does not contribute to the word's representation. Feature extraction approaches attempt to compensate for sparsity and reduce noise by transforming the data from a high-dimensional space to a space of fewer dimensions. Word embeddings [1] have become an increasingly popular method of reducing the dimensionality of vector representations. In this work, we evaluate word embeddings in a knowledge-based word sense disambiguation method.
    Methods. The context requiring disambiguation consists of an instance of an ambiguous word and multiple denotative senses. In our method, each word is replaced with its word embedding, and the embeddings are either summed or averaged to form a single instance vector representation. The same is done for each sense of the ambiguous word using the sense's definition obtained from the Unified Medical Language System (UMLS). We calculate the cosine similarity between the instance vector and each sense vector, and assign the instance the sense with the highest value.
    Evaluation. We evaluate our method on three biomedical WSD datasets: NLM-WSD, MSH-WSD and Abbrev. The word embeddings were trained on the titles and abstracts from the 2016 Medline baseline. We compare two word embedding models, Skip-gram and Continuous Bag of Words (CBOW), vary the word vector length from 100 to 1,000, and compare the differences in accuracy.
    Results. The method demonstrates fairly high accuracy at disambiguating biomedical instance contexts among groups of denotative senses. The Skip-gram model obtained higher disambiguation accuracy than CBOW, but the increase was not significant on all of the datasets. Similarly, vector representations of differing lengths showed minimal change in results, often differing by mere tenths of a percentage point. We also compared our results to current state-of-the-art knowledge-based WSD systems, including those that use word embeddings, and obtained comparable or higher disambiguation accuracy.
    Conclusion. Although biomedical literature can be ambiguous, our knowledge-based feature extraction method using word embeddings demonstrates high accuracy in disambiguating biomedical text while reducing the associated noise. In the future, we plan to explore additional dimensionality reduction methods and training data.
    [1] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, "Distributed representations of words and phrases and their compositionality," Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.
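The knowledge-based method described in this abstract — averaging (or summing) embeddings for the instance context and for each sense definition, then assigning the sense with the highest cosine similarity — can be sketched roughly as follows. This is a minimal illustration with toy vectors; the helper names and the toy embedding table are ours, not the authors':

```python
import numpy as np

def embed_context(words, embeddings, mode="avg"):
    """Combine word vectors into one context vector by sum or average.
    Out-of-vocabulary words are simply skipped."""
    vecs = [embeddings[w] for w in words if w in embeddings]
    if not vecs:
        return None
    total = np.sum(vecs, axis=0)
    return total / len(vecs) if mode == "avg" else total

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def disambiguate(instance_words, sense_definitions, embeddings):
    """Assign the sense whose definition vector is most similar
    to the instance context vector."""
    inst = embed_context(instance_words, embeddings)
    if inst is None:
        return None
    best_sense, best_sim = None, -1.0
    for sense, definition in sense_definitions.items():
        sense_vec = embed_context(definition, embeddings)
        if sense_vec is None:
            continue
        sim = cosine(inst, sense_vec)
        if sim > best_sim:
            best_sense, best_sim = sense, sim
    return best_sense
```

In the actual method, `embeddings` would hold Skip-gram or CBOW vectors trained on Medline, and each definition would come from the UMLS.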

    Feature Extraction Methods for Character Recognition


    Comparing feature extraction methods for texture classification

    This paper analyses three different methods for feature extraction in texture classification: the histogram, LBP (local binary patterns) and the Haar wavelet method. The theoretical part of the paper explains the feature extraction methods, some of the classifiers, methods for evaluating classifiers and ways of implementing the methods. The practical part comprises two sections: the implementation of the aforementioned methods in the C# programming language, and the analysis of the results in the Weka data mining tool. This topic is steadily gaining popularity in the current era of digitalization, in which as much work as possible is delegated to computers in order to save time, and thereby money. According to the obtained results, the best method on the tested textures is the histogram.
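Of the three feature extraction methods compared above, LBP is the least self-explanatory: each pixel is encoded by comparing it to its 8 neighbours, and the histogram of those codes is the texture feature. A minimal sketch of the basic 3x3 LBP descriptor (our own illustration, not the paper's C# implementation):

```python
import numpy as np

def lbp_image(gray):
    """Basic 3x3 local binary pattern code for each interior pixel.
    Each of the 8 neighbours contributes one bit: 1 if it is >= the centre."""
    h, w = gray.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.uint8)
    # Offsets of the 8 neighbours, ordered clockwise from the top-left.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = gray[1:h - 1, 1:w - 1]
    for bit, (dy, dx) in enumerate(offsets):
        neigh = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neigh >= centre).astype(np.uint8) << np.uint8(bit)
    return codes

def lbp_histogram(gray):
    """256-bin normalised histogram of LBP codes: the texture feature vector."""
    hist = np.bincount(lbp_image(gray).ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

The histogram method from the paper is even simpler (a normalised histogram of raw intensities), and the Haar wavelet method instead summarises wavelet coefficients.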

    Feature Extraction Methods by Various Concepts using SOM

    Image retrieval systems have gained traction with the increased use of visual and media data. Understanding and managing big data is critical, and much of that analysis is done in image retrieval applications. Given the considerable difficulty of handling big data with traditional approaches, there is a demand for efficient management, particularly with regard to accuracy and robustness. To address these issues, we employ content-based image retrieval (CBIR) methods on both supervised and unsupervised image collections. Self-Organizing Maps (SOM), a competitive unsupervised learning technique, are applied in our multilevel fusion methodology to extract features that are then categorised. The proposed methodology beat state-of-the-art algorithms with 90.3% precision, an approximate retrieval precision (ARP) of 0.91 and an approximate retrieval recall (ARR) of 0.82 when tested on several benchmark datasets.
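A SOM of the kind mentioned above is trained by repeatedly finding the best-matching unit (BMU) for each sample and pulling the BMU and its grid neighbours toward that sample; the BMU index then serves as the sample's category. A minimal sketch (our own simplification, not the paper's multilevel fusion pipeline; grid size and decay schedules are arbitrary choices):

```python
import numpy as np

def train_som(data, grid=(4, 4), epochs=20, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small Self-Organizing Map with competitive learning.
    Returns the weight grid of shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    dim = data.shape[1]
    weights = rng.random((rows, cols, dim))
    # Grid coordinates of each node, used for the neighbourhood function.
    coords = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                  indexing="ij"), axis=-1)
    for epoch in range(epochs):
        lr = lr0 * (1 - epoch / epochs)               # decaying learning rate
        sigma = sigma0 * (1 - epoch / epochs) + 0.1   # shrinking neighbourhood
        for x in data:
            # Best-matching unit: node whose weight is closest to the sample.
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighbourhood pulls nearby nodes toward the sample.
            grid_d = np.linalg.norm(coords - np.array(bmu), axis=-1)
            h = np.exp(-(grid_d ** 2) / (2 * sigma ** 2))
            weights += lr * h[..., None] * (x - weights)
    return weights

def bmu_index(weights, x):
    """Map a feature vector to its best-matching SOM node (its category)."""
    d = np.linalg.norm(weights - x, axis=-1)
    return np.unravel_index(np.argmin(d), d.shape)
```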

    Physiologically-Motivated Feature Extraction Methods for Speaker Recognition

    Speaker recognition has received a great deal of attention from the speech community, and significant gains in robustness and accuracy have been obtained over the past decade. However, the features used for identification are still primarily representations of overall spectral characteristics, and thus the models are primarily phonetic in nature, differentiating speakers based on overall pronunciation patterns. This creates difficulties in terms of the amount of enrollment data and the complexity of the models required to cover the phonetic space, especially in tasks such as identification where enrollment and testing data may not have similar phonetic coverage. This dissertation introduces new features based on vocal source characteristics, intended to capture physiological information related to the laryngeal excitation energy of a speaker. These features, including RPCC, GLFCC and TPCC, represent unique characteristics of speech production not represented in current state-of-the-art speaker identification systems. The proposed features are evaluated through three experimental paradigms: cross-lingual speaker identification, cross-song-type avian speaker identification and mono-lingual speaker identification. The experimental results show that the proposed features provide information about speaker characteristics that is significantly different in nature from the phonetically focused information present in traditional spectral features. The incorporation of the proposed glottal source features offers significant overall improvement to the robustness and accuracy of speaker identification tasks.

    Dropout Model Evaluation in MOOCs

    The field of learning analytics needs to adopt a more rigorous approach for predictive model evaluation that matches the complex practice of model-building. In this work, we present a procedure to statistically test hypotheses about model performance which goes beyond the state-of-the-practice in the community to analyze both algorithms and feature extraction methods from raw data. We apply this method to a series of algorithms and feature sets derived from a large sample of Massive Open Online Courses (MOOCs). While a complete comparison of all potential modeling approaches is beyond the scope of this paper, we show that this approach reveals a large gap in dropout prediction performance between forum-, assignment-, and clickstream-based feature extraction methods, where the latter is significantly better than the former two, which are in turn indistinguishable from one another. This work has methodological implications for evaluating predictive or AI-based models of student success, and practical implications for the design and targeting of at-risk student models and interventions
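One simple way to statistically test whether one feature set outperforms another across many courses — in the spirit of, though not necessarily identical to, the procedure above — is a paired permutation test on per-course model scores. A hedged sketch (the function name and setup are ours, not the paper's):

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided paired permutation test on per-course performance scores
    (e.g. per-course AUC of two feature sets). Randomly flips the sign of
    each paired difference to build a null distribution of the mean
    difference, then returns the fraction at least as extreme as observed."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    observed = abs(diffs.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = np.abs((signs * diffs).mean(axis=1))
    # Add-one smoothing so the p-value is never exactly zero.
    return (np.sum(null >= observed) + 1) / (n_perm + 1)
```

A small p-value indicates the gap between the two feature sets is unlikely under the null of no difference; identical score lists yield p = 1.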