
    Lexicon-corpus Based Korean Unknown Foreign Word Extraction and Updating Using Syllable Identification

    This paper presents an efficient text mining method for extracting and updating unknown words (specifically, unknown foreign words) to improve data classification and POS tagging. The proposed methods can also help improve the accuracy of mining frequent patterns and association rules from unstructured (textual) data. Much research has been done on estimating and segmenting unknown words, but it is limited to grammatical and linguistic rules with a limited vocabulary. In our project we consider the fact that no language is free from the influence of foreign languages. This is especially true in a country like Korea, where rapid development in culture and media and the frequent use of foreign languages have resulted in a mix of languages and styles, along with slang and abbreviated words, in daily life and conversation. The main function of our system is to find such unknown foreign words and update them to appropriate words, depending on the information available in dictionaries. We also explain the essential natural language processing (NLP) tools used for data processing. Our proposed method uses simple but efficient techniques. First, it converts the data into structured form using data preprocessing techniques. In this phase the data passes through several stages, such as cleaning, integration, and selection of important data, and is then organized into a database structure for further analysis and processing. This database consists of different kinds of dictionaries, on which our system relies heavily. With the help of our team members, we manually created various kinds of dictionaries for processing and analyzing different kinds of unknown foreign words.
Our proposed method for discovering and updating unknown foreign words first discovers the foreign word using morphological analysis with the help of automatically and manually created dictionaries, followed by suffix trimming and word segmentation. Next, our algorithm uses the dictionaries to check for the word's different written patterns according to its spelling and its synonym in the native language (Korean), and also updates the POS tags. We tested on different collections of data from economics news, beauty and fashion, and college student blogs; the results showed great efficiency and improvement, and were adequate to warrant further research.
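The suffix trimming and dictionary lookup steps described above might be sketched roughly as follows. This is a minimal illustration, not the authors' system: the variant dictionary, the particle list, and the `normalize` function are all hypothetical, and real morphological analysis is far more involved.

```python
# Hypothetical variant dictionary: observed spelling -> canonical spelling.
VARIANTS = {
    "컴퓨타": "컴퓨터",
    "컴퓨터": "컴퓨터",
    "메세지": "메시지",
    "메시지": "메시지",
}

# Hypothetical list of Korean particles to trim, longest first.
SUFFIXES = sorted(["에서", "은", "는", "이", "가", "을", "를", "의"], key=len, reverse=True)


def normalize(token):
    """Return (canonical_word, trimmed_suffix), or None if the token is unknown."""
    if token in VARIANTS:
        return VARIANTS[token], ""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix):
            stem = token[: -len(suffix)]
            if stem in VARIANTS:
                return VARIANTS[stem], suffix
    return None
```

A token that matches no dictionary entry, with or without a trimmed particle, is returned as `None`, which is where a fuller system would fall back to other analysis.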

    Feature fusion at the local region using localized maximum-margin learning for scene categorization

    In visual recognition tasks such as scene categorization, representing an image by its local features (e.g., the bag-of-visual-word (BOVW) model and the bag-of-contextual-visual-word (BOCVW) model) has become one of the most popular and successful approaches. In this paper, we propose a method that uses localized maximum-margin learning to fuse different types of features during BOCVW modeling for eventual scene classification. The proposed method fuses multiple features at the stage when the best contextual visual word is selected to represent a local region (hard assignment), or when the probabilities of the candidate contextual visual words used to represent the unknown region are estimated (soft assignment). The merits of the proposed method are that (1) errors caused by the ambiguity of a single feature when assigning local regions to contextual visual words can be corrected, or the probabilities of the candidate contextual visual words used to represent the region can be estimated more accurately; and (2) it offers a more flexible way of fusing these features by determining the similarity metric locally through localized maximum-margin learning. The proposed method has been evaluated experimentally, and the results indicate its effectiveness. © 2011 Elsevier Ltd. All rights reserved.
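The difference between hard and soft assignment can be illustrated with a generic sketch. It assumes a list of distances from one local region to each candidate visual word and a hypothetical sharpness parameter `beta`; it does not implement the localized maximum-margin feature fusion that the abstract proposes.

```python
import math


def hard_assign(distances):
    """Hard assignment: index of the single closest candidate visual word."""
    return min(range(len(distances)), key=distances.__getitem__)


def soft_assign(distances, beta=1.0):
    """Soft assignment: probabilities from a softmax over negative distances."""
    weights = [math.exp(-beta * d) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]
```

Hard assignment keeps one word per region, while soft assignment spreads the region's vote across candidates in proportion to their closeness.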

    Detecting acronyms from capital letter sequences in Spanish

    This paper presents an automatic strategy for deciding how to pronounce a Capital Letter Sequence (CLS) in a Text to Speech (TTS) system. If the CLS is known to the TTS system, it can be expanded into several words. When the CLS is unknown, however, the system has two alternatives: spelling it out (abbreviation) or pronouncing it as a new word (acronym). In Spanish, there is a close correspondence between letters and phonemes. Because of this, when a CLS resembles other Spanish words, there is a strong tendency to pronounce it as a standard word. This paper proposes an automatic method for detecting acronyms. Additionally, it analyses the discrimination capability of several features and several strategies for combining them in order to obtain the best classifier. For the best classifier, the classification error is 8.45%. In the feature analysis, the best features were the Letter Sequence Perplexity and the Average N-gram order.
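One plausible reading of a letter sequence perplexity feature is the perplexity of the CLS under a character n-gram model trained on ordinary words. The sketch below assumes a character bigram model with add-one smoothing and a tiny hypothetical lexicon; the exact definition used in the paper is not given in the abstract.

```python
import math
from collections import Counter

ALPHABET_SIZE = 28  # assumption: 27 Spanish letters plus an end-of-word marker


def train_bigrams(words):
    """Count character bigrams over words padded with start (^) and end ($) markers."""
    pair_counts, context_counts = Counter(), Counter()
    for word in words:
        chars = "^" + word.lower() + "$"
        for a, b in zip(chars, chars[1:]):
            pair_counts[(a, b)] += 1
            context_counts[a] += 1
    return pair_counts, context_counts


def letter_perplexity(model, sequence):
    """Perplexity of a letter sequence under the add-one smoothed bigram model."""
    pair_counts, context_counts = model
    chars = "^" + sequence.lower() + "$"
    log_prob = 0.0
    for a, b in zip(chars, chars[1:]):
        p = (pair_counts[(a, b)] + 1) / (context_counts[a] + ALPHABET_SIZE)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(chars) - 1))


# Hypothetical toy lexicon; a real model would be trained on a large Spanish word list.
MODEL = train_bigrams(["casa", "mesa", "pato", "luna", "sola"])
```

Under this sketch, a word-like sequence scores a lower perplexity than an unpronounceable consonant run, which is the signal a classifier could use to prefer the acronym reading.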

    Emotion Recognition for Japanese Short Sentences Including Slangs

    The growth of Internet communication sites such as weblogs and social networking sites has led younger people, especially those in their teens and 20s, to create new words and use them frequently. We prepared an emotion corpus by collecting weblog article texts that include new words, analyzed the corpus statistically, and proposed a method for estimating the emotions of the texts. Most slang words, such as youth slang, are too ambiguous in sense classification to be registered in existing dictionaries such as thesauri. To cope with these words, we created a large-scale Twitter corpus and calculated sense similarities between words. We proposed converting each unknown word to a semantic class ID so that we can process words that were not included in the training data. To calculate similarities between words and convert each word into a word cluster ID, we used word embedding algorithms such as word2vec and GloVe. We define this as a method that uses Bag of Concepts as a feature. In evaluation experiments with several classifiers, the proposed method proved robust to unknown expressions.
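The idea of replacing words with cluster IDs can be sketched as below. The two-dimensional embeddings and the two centroids are hypothetical stand-ins for vectors learned with word2vec or GloVe and for clusters found over them; the function names are illustrative only.

```python
from collections import Counter

# Hypothetical 2-D embeddings; real ones would come from word2vec or GloVe.
EMBEDDINGS = {
    "happy": (0.9, 0.1),
    "glad": (0.8, 0.2),
    "yay": (0.85, 0.15),
    "sad": (0.1, 0.9),
    "meh": (0.2, 0.8),
}

# Hypothetical cluster centroids, e.g. produced by k-means over the embeddings.
CENTROIDS = {0: (0.85, 0.15), 1: (0.15, 0.85)}


def concept_id(word):
    """Return the ID of the nearest centroid, or None if the word has no embedding."""
    vec = EMBEDDINGS.get(word)
    if vec is None:
        return None
    return min(
        CENTROIDS,
        key=lambda cid: sum((v - c) ** 2 for v, c in zip(vec, CENTROIDS[cid])),
    )


def bag_of_concepts(tokens):
    """Count concept IDs instead of surface words."""
    return Counter(cid for cid in map(concept_id, tokens) if cid is not None)
```

Because a slang token such as "yay" lands in the same cluster as "happy", a classifier trained on concept counts can handle it even if the token never appeared in the labeled training data.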

    Word Embedding, Neural Networks and Text Classification: What is the State-of-the-Art?

    In this bachelor thesis, I first introduce the machine learning methodology of text classification, with the goal of describing how neural networks function. Then, I identify and discuss current developments in Convolutional Neural Networks and Recurrent Neural Networks from a text classification perspective and compare both models. Furthermore, I introduce different techniques used to translate textual information into a form the computer can process, which ultimately serve as inputs for the models previously discussed. From there, I propose a method for the models to cope with words absent from a training corpus. This first part also aims to make the machine learning world accessible to a broader audience than computer science students and experts. To test the proposal, I implement and compare two state-of-the-art models and eight different word representations using pre-trained vectors on a dataset provided by LogMeIn and on a common benchmark. I find that, with my configuration, Convolutional Neural Networks are easier to train and also yield better results. Nevertheless, I highlight that models combining both architectures can potentially perform better, but need more work on identifying appropriate hyperparameters for training. Finally, I find that the efficacy of word embedding methods depends not only on the dataset but also on the model used to tackle the subsequent task. In my context, they can boost performance by up to 10.2% compared to a random initialization. However, further investigation is necessary to evaluate the value of my proposal with a corpus that contains a greater ratio of unknown relevant words. Keywords: neural networks; machine learning; word embedding; text classification; business analytic

    A hybrid method to trace technology evolution pathways: a case study of 3D printing

    © 2017, Akadémiai Kiadó, Budapest, Hungary. Whether it be for countries to improve the ability to undertake independent innovation or for enterprises to enhance their international competitiveness, tracing historical progression and forecasting future trends of technology evolution is essential for formulating technology strategies and policies. In this paper, we apply co-classification analysis to reveal the technical evolution process of a certain technical field, use co-word analysis to extract implicit or unknown patterns and topics, and employ main path analysis to discover significant clues about technology hotspots and development prospects. We illustrate this hybrid approach with 3D printing, referring to various technologies and processes used to synthesize a three-dimensional object. Results show how our method offers technical insights and traces technology evolution pathways, and then helps decision-makers guide technology development
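Co-word analysis starts from counting how often pairs of keywords appear in the same document. A minimal sketch of that counting step follows, assuming each document is already reduced to a keyword list; the keyword data is hypothetical, and the co-classification and main path analyses mentioned in the abstract are not covered.

```python
from collections import Counter
from itertools import combinations


def coword_counts(documents):
    """Count, for each keyword pair, the number of documents containing both."""
    counts = Counter()
    for keywords in documents:
        # Sort the deduplicated keywords so each pair has one canonical order.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return counts


# Hypothetical keyword lists, one per document.
DOCS = [
    ["3d printing", "laser sintering", "polymer"],
    ["3d printing", "polymer", "extrusion"],
    ["laser sintering", "metal powder"],
]
```

Pairs with high counts are candidates for topics, and the resulting matrix is what clustering or network methods would then operate on.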