38 research outputs found

    Exploiting Language Models to Classify Events from Twitter

    Classifying events on Twitter is challenging because tweet texts carry a large amount of temporal data with a lot of noise and cover many kinds of topics. In this paper, we propose a method to classify events from Twitter. We first find the distinguishing terms between tweets in events and measure their similarities using language models such as ConceptNet and a latent Dirichlet allocation method for selectional preferences (LDA-SP), which have been widely studied on large text corpora for computational linguistic relations. The relationships among the terms in tweets are discovered by checking them under each model. We then propose a method to compute the similarity between tweets based on their features, including common terms and the relationships among their distinguishing terms. The resulting similarity is explicit and convenient to use with k-nearest neighbor techniques for classification. Experiments on the Edinburgh Twitter Corpus show that our method achieves competitive results for classifying events.
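    The idea of feeding a tweet-to-tweet similarity into a k-nearest neighbor classifier can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `tweet_similarity` and `knn_classify`, the scoring formula, and the optional `related` callback, which merely stands in for a ConceptNet or LDA-SP lookup. The abstract does not give the authors' actual formula.

```python
from collections import Counter


def tweet_similarity(a, b, related=None):
    """Score two token lists by shared terms plus related distinguishing terms.

    `related` is a hypothetical callback (x, y) -> bool standing in for a
    ConceptNet or LDA-SP relation check; it is not the authors' model.
    """
    set_a, set_b = set(a), set(b)
    common = set_a & set_b
    only_a, only_b = set_a - common, set_b - common
    score = float(len(common))
    if related is not None:
        # Count pairs of distinguishing terms that the model considers related.
        score += sum(1.0 for x in only_a for y in only_b if related(x, y))
    # Scale by the number of distinct terms in the pair.
    size = len(set_a | set_b)
    return score / size if size else 0.0


def knn_classify(query, labelled, k=3, related=None):
    """Return the majority label among the k most similar labelled tweets."""
    ranked = sorted(
        labelled,
        key=lambda item: tweet_similarity(query, item[0], related),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data, invented for this sketch only.
labelled = [
    (["earthquake", "hits", "city"], "disaster"),
    (["flood", "hits", "town"], "disaster"),
    (["team", "wins", "final"], "sport"),
    (["team", "loses", "match"], "sport"),
]
predicted = knn_classify(["team", "wins", "match"], labelled, k=3)
```

    Keeping the similarity as a plain function makes it easy to swap in a different relation model without touching the voting step.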

    UPC: An Open Word-Sense Annotated Parallel Corpora for Machine Translation Study

    Machine translation (MT) has recently attracted much research on various advanced techniques (i.e., statistical and deep learning-based approaches) and has achieved strong results for popular languages. However, research involving low-resource languages such as Korean often suffers from a lack of openly available bilingual language resources. In this research, we built extensive open parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora: a Korean-English dataset and a Korean-Vietnamese dataset. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese dataset has over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes word ambiguity in MT. To address this problem, we developed a powerful word-sense annotation system named UTagger, based on a combination of sub-word conditional probability and knowledge-based methods. We applied UTagger to UPC and used these corpora to train both statistical and deep learning-based (neural) MT systems. The experimental results demonstrated that high-quality MT systems, as measured by Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores, can be built using UPC. Both UPC and UTagger are available for free download and use.

    A Korean Homonym Disambiguation System Based on Statistical Model Using Weights

    A homonym can be disambiguated by other words in its context, such as the nouns and predicates used with it. This paper uses semantic information (co-occurrence data) obtained from the definitions in the part-of-speech (POS) tagged UMRD-S. In this research, we analyzed the results of an experiment on a homonym disambiguation system based on a statistical model to which Bayes' theorem is applied, and proposed a model that adds a weight for the sense rate and a weight for the distance to adjacent words to improve accuracy. Applying the homonym disambiguation system with semantic information to homonyms appearing in dictionary definition sentences gave an average accuracy of 98.32% for the 200 most frequent homonyms. We selected 49 of the 200 homonyms used in the experiment (31 substantives and 18 predicates) and ran an experiment on 50,703 sentences, each containing one of the 49 homonyms, extracted from the 3.5-million-word Sejong Project tagged corpus (i.e., a corpus of morphologically analyzed words). Assigning the weight of the sense rate (prior probability) and the weight of distance for the 5 words before and after the homonym to be disambiguated gave accuracy 2.93% higher than disambiguation systems based on existing statistical models.
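    A Bayes-style sense scorer with a prior weight and a distance weight can be sketched as follows. This is only an illustration under stated assumptions: the class name `WeightedBayesDisambiguator`, the 1/distance weighting, the add-one smoothing, and the English toy data are all hypothetical, since the abstract does not specify how the two weights are computed.

```python
import math
from collections import Counter, defaultdict


class WeightedBayesDisambiguator:
    """Naive Bayes sense scorer with a prior weight and distance weights.

    Assumed form (not taken from the paper):
    score(s) = prior_weight * log P(s) + sum_w (1 / dist(w)) * log P(w | s)
    """

    def __init__(self, window=5, prior_weight=1.0, smoothing=1.0):
        self.window = window              # words considered on each side
        self.prior_weight = prior_weight  # weight of the sense rate (prior)
        self.smoothing = smoothing        # additive smoothing constant
        self.sense_counts = Counter()
        self.cooc = defaultdict(Counter)  # sense -> context word counts
        self.vocab = set()

    def _context(self, tokens, index):
        """Yield (word, distance) for neighbours inside the window."""
        lo = max(0, index - self.window)
        hi = min(len(tokens), index + self.window + 1)
        for i in range(lo, hi):
            if i != index:
                yield tokens[i], abs(i - index)

    def train(self, examples):
        """examples: iterable of (tokens, homonym_index, sense_label)."""
        for tokens, index, sense in examples:
            self.sense_counts[sense] += 1
            for word, _ in self._context(tokens, index):
                self.cooc[sense][word] += 1
                self.vocab.add(word)

    def score(self, tokens, index, sense):
        total = sum(self.sense_counts.values())
        result = self.prior_weight * math.log(self.sense_counts[sense] / total)
        denom = sum(self.cooc[sense].values()) + self.smoothing * (len(self.vocab) + 1)
        for word, dist in self._context(tokens, index):
            prob = (self.cooc[sense][word] + self.smoothing) / denom
            # Nearer words contribute more to the score.
            result += (1.0 / dist) * math.log(prob)
        return result

    def disambiguate(self, tokens, index):
        return max(self.sense_counts, key=lambda s: self.score(tokens, index, s))


# Toy English data, invented for this sketch only.
model = WeightedBayesDisambiguator(window=5)
model.train([
    (["river", "bank", "water"], 1, "bank/shore"),
    (["money", "bank", "loan"], 1, "bank/finance"),
    (["bank", "loan", "interest"], 0, "bank/finance"),
])
shore = model.disambiguate(["deep", "river", "bank"], 2)
finance = model.disambiguate(["money", "bank"], 1)
```

    Setting `prior_weight` to 0 removes the sense-rate term, which gives a simple way to compare scoring with and without the prior.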

    Toll-Like Receptor 10-1-6 Gene Cluster Polymorphisms Are Not Associated With Benign Prostatic Hyperplasia in Korean Population

    Purpose: Inflammation and infection have been associated with the pathogenesis of benign prostatic hyperplasia (BPH). Toll-like receptors (TLRs) play key roles in the innate immune system and initiate the inflammatory response to foreign pathogens. We investigated the relationship between TLR10-1-6 gene cluster polymorphisms and BPH. Methods: We genotyped four promoter single nucleotide polymorphisms (SNPs) (TLR10, rs10004195; TLR1, rs5743557; and TLR6, rs1039560 and rs1039559) by direct sequencing in 233 BPH patients and 214 control subjects. SNPStats and Haploview version 4.02 were used to analyze the data. Multiple logistic regression models (log-additive, dominant, and recessive) were used to determine odds ratios (ORs), 95% confidence intervals (CIs), and P-values. Results: The genotype and allele frequencies of each SNP did not differ between the BPH and control groups (P>0.05). Haplotype analysis showed no association between the haplotype in the linkage disequilibrium (LD) block and BPH (P>0.05), although the LD block was constructed. Conclusions: These results indicate that the TLR10-1-6 gene cluster may not be associated with the development of BPH in the Korean population.