38 research outputs found

    Fighting with the Sparsity of Synonymy Dictionaries

    Graph-based synset induction methods, such as MaxMax and Watset, induce synsets by performing a global clustering of a synonymy graph. However, such methods are sensitive to the structure of the input synonymy graph: sparseness of the input dictionary can substantially reduce the quality of the extracted synsets. In this paper, we propose two different approaches designed to alleviate the incompleteness of the input dictionaries. The first one performs a pre-processing of the graph by adding missing edges, while the second one performs a post-processing by merging similar synset clusters. We evaluate these approaches on two datasets for the Russian language and discuss their impact on the performance of synset induction methods. Finally, we perform an extensive error analysis of each approach and discuss prominent alternative methods for coping with the problem of the sparsity of the synonymy dictionaries. Comment: In Proceedings of the 6th Conference on Analysis of Images, Social Networks, and Texts (AIST'2017): Springer Lecture Notes in Computer Science (LNCS)

    Mining Entity Synonyms with Efficient Neural Set Generation

    Mining entity synonym sets (i.e., sets of terms referring to the same entity) is an important task for many entity-leveraging applications. Previous work either ranks terms based on their similarity to a given query term, or treats the problem as a two-phase task (i.e., detecting synonymy pairs, followed by organizing these pairs into synonym sets). However, these approaches fail to model the holistic semantics of a set and suffer from the error propagation issue. Here we propose a new framework, named SynSetMine, that efficiently generates entity synonym sets from a given vocabulary, using example sets from external knowledge bases as distant supervision. SynSetMine consists of two novel modules: (1) a set-instance classifier that jointly learns how to represent a permutation-invariant synonym set and whether to include a new instance (i.e., a term) in the set, and (2) a set generation algorithm that enumerates the vocabulary only once and applies the learned set-instance classifier to detect all entity synonym sets in it. Experiments on three real datasets from different domains demonstrate both the effectiveness and the efficiency of SynSetMine for mining entity synonym sets. Comment: AAAI 2019 camera-ready version

    Corpus Linguistics and 17th-Century Prostitution

    Corpus linguistics has much to offer history, since both disciplines engage so heavily in the analysis of large amounts of textual material. This book demonstrates the opportunities for exploring corpus linguistics as a method in historiography and in the humanities and social sciences more generally. Focusing on the topic of prostitution in 17th-century England, it shows how corpus methods can assist in social research and can be used to deepen our understanding of the subject. McEnery and Baker draw principally on two sources: the newsbook Mercurius Fumigosus and the Early English Books Online Corpus. This scholarship on prostitution and the sex trade offers insight into the social position of women in history.

    Explaining ambiguity in scientific language


    Data Mining as a Method Applicable in the Field of Japanese Studies

    In this thesis we address the possible use of text mining methods in the field of Japanese studies. In the first part, we review the fundamental text mining approaches and their practical applications. We then elaborate on the topic of preprocessing, with a special focus on techniques used for Japanese and English texts. In the main part of the thesis, we apply text mining methods to three concrete research questions relevant to Japanese studies. The first research topic illustrates the technique of clustering, applied to works written by two Japanese proletarian authors to reveal interesting topic patterns in their writings. The second topic uses sentiment analysis to study the extent of negative sentiment expressed in both foreign and Japanese newspaper articles that refer to Japanese officials' visits to Yasukuni Shrine. Finally, we address methods of automatic summarization and their application to Japanese as well as English sample texts. The results obtained are discussed in detail, with a special focus on assessing the viability of the presented methods in Japanese studies. Institute of East Asian Studies (Ústav Dálného východu), Faculty of Arts (Filozofická fakulta)

    Using Data Mining for Facilitating User Contributions in the Social Semantic Web

    This thesis utilizes recommender systems to aid the user in contributing to the Social Semantic Web. In this work, we propose a framework that maps domain properties to recommendation technologies. Next, we develop novel recommendation algorithms for improving personalized tag recommendation and for recommendation of semantic relations. Finally, we introduce a framework to analyze different types of potential attacks against social tagging systems and evaluate their impact on those systems.

    Microblogging Temporal Summarization: Filtering Important Twitter Updates for Breaking News

    While news stories are an important traditional medium to broadcast and consume news, microblogging has recently emerged as a place where people can discuss, disseminate, collect or report information about news. However, the massive information in the microblogosphere makes it hard for readers to keep up with these real-time updates. This is especially a problem when it comes to breaking news, where people are more eager to know “what is happening”. Therefore, this dissertation is intended as an exploratory effort to investigate computational methods to augment human effort when monitoring the development of breaking news on a given topic from a microblog stream by extractively summarizing the updates in a timely manner. More specifically, given an interest in a topic, either entered as a query or presented as an initial news report, a microblog temporal summarization system is proposed to filter microblog posts from a stream with three primary concerns: topical relevance, novelty, and salience. Considering the relatively high arrival rate of microblog streams, a cascade framework consisting of three stages is proposed to progressively reduce the quantity of posts. For each step in the cascade, this dissertation studies methods that improve over current baselines. In the relevance filtering stage, query and document expansion techniques are applied to mitigate sparsity and vocabulary mismatch issues. The use of word embedding as a basis for filtering is also explored, using unsupervised and supervised modeling to characterize lexical and semantic similarity. In the novelty filtering stage, several statistical ways of characterizing novelty are investigated and ensemble learning techniques are used to integrate results from these diverse techniques. These results are compared with a baseline clustering approach using both standard and delay-discounted measures. In the salience filtering stage, because of the real-time prediction requirement, a method of learning verb phrase usage from past relevant news reports is used in conjunction with some standard measures for characterizing writing quality. Following a Cranfield-like evaluation paradigm, this dissertation includes a series of experiments to evaluate the proposed methods for each step, and for the end-to-end system. New microblog novelty and salience judgments are created, building on existing relevance judgments from the TREC Microblog track. The results point to future research directions at the intersection of social media, computational journalism, information retrieval, automatic summarization, and machine learning.

    Neural machine translation for multimodal interaction

    Multimodal neural machine translation (MNMT) systems trained on a combination of visual and textual inputs typically produce better translations than systems trained using only textual inputs. The task of such systems can be decomposed into two sub-tasks: learning visually grounded representations from images and translating the textual counterparts using those representations. In a multi-task learning framework, translations are generated from an attention-based encoder-decoder framework and grounded representations that are learned from pretrained convolutional neural networks (CNNs) for classifying images. In this thesis, I study different computational techniques to translate the meaning of sentences from one language into another, considering the visual modality as a naturally occurring meaning representation bridging between languages. We examine the behaviour of state-of-the-art MNMT systems from the data perspective in order to understand the role of both textual and visual inputs in such systems. We evaluate our models on Multi30k, a large-scale multilingual multimodal dataset publicly available for machine learning research. Our results in the optimal and sparse data settings show that the differences in translation system performance are proportional to the amount of both visual and linguistic information, whereas in the adversarial condition the effect of the visual modality is rather small or negligible. The chapters of the thesis follow a progression, starting with using different state-of-the-art MMT models for incorporating images in optimal data settings, moving to creating synthetic image data under the low-resource scenario, and extending to the addition of adversarial perturbations to the textual input for evaluating the real contribution of images.