99 research outputs found

    Style Transfer via Zero-Shot Monolingual Neural Machine Translation

    Get PDF
    Text style transfer is the task of altering stylistic features of a text while preserving its meaning and fluency. It can be viewed as a sequence-to-sequence transformation task, but the scarcity of directly annotated parallel data makes it unfeasible for most settings. We propose an approach to style transfer that builds on the idea of zero-shot machine translation. It performs style transfer within a neural machine translation model, without requiring any parallel style-adapted texts, relying instead only on regular language-parallel data. The method is applicable to multiple languages within a single model. We outline the method, describe our experiments with it, and present a thorough automatic and manual evaluation of the approach, both in comparison to a baseline and independently. Our zero-shot model outperforms the supervised baseline on several aspects according to human judgments and is reliable for a number of style transfer aspects, while not depending on annotated data.
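The zero-shot mechanism can be illustrated with the tagging scheme common to multilingual NMT setups, where each source sentence is prefixed with tokens naming the desired target language and style; the token names below are illustrative assumptions, not the paper's exact notation:

```python
# Sketch of the tagging scheme used in multilingual/zero-shot NMT setups:
# each source sentence is prefixed with a token naming the desired target
# language and style. Token names here are illustrative assumptions.

def tag_example(source: str, tgt_lang: str, tgt_style: str) -> str:
    """Prepend target-language and target-style tokens to a source sentence."""
    return f"<2{tgt_lang}> <{tgt_style}> {source}"

# Training uses only language-parallel data; at inference time an unseen
# language/style combination requests the zero-shot style transfer.
print(tag_example("where r u", tgt_lang="en", tgt_style="formal"))
# -> "<2en> <formal> where r u"
```

Because the model never sees style-parallel pairs during training, the style token must generalise zero-shot at inference time, in the same way unseen language pairs do in multilingual NMT.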

    Pop culturally motivated lexical borrowing: Use of Korean in an English-majority fan forum

    Get PDF
    In this thesis, I examine the use of pop-culturally motivated loanwords from Korean in an English-language online community for fans of Korean pop music. Data are drawn from the subreddit /r/kpop and analysed quantitatively and qualitatively. Frequency analysis shows that the most frequently borrowed words are kinship terminology, but their usage in English does not correspond to their usage in Korean. The borrowing takes place across two writing systems, so Korean written in hangeul is romanised to fit the Latin script, but follows no established romanisation system. The romanised loanwords are integrated into the English-language frame and are used productively and creatively to produce linguistic innovations, both as loanwords and as hybrid constructions. The use of the loanwords corresponds only partly to their use in Korean, and in several cases this usage is unique to r/kpop as a marker of group belonging. Linguistic play, satire, and sarcasm are used to distance the users from other K-pop fans outside r/kpop. The thesis is intended as an exploratory study of a contemporary phenomenon that has not previously been examined from a linguistic perspective. Master's thesis in Linguistics (MAHF-LING, LING35).

    SOCIALQ&A: A NOVEL APPROACH TO NOTIFYING THE CORRECT USERS IN QUESTION AND ANSWERING SYSTEMS

    Get PDF
    Question and Answering (Q&A) systems are currently in use by a large number of Internet users. Q&A systems play a vital role in our daily life as an important platform for information and knowledge sharing. Hence, much research has been devoted to improving the performance of Q&A systems, with a focus on improving the quality of answers provided by users, reducing the wait time for users who ask questions, using a knowledge base to provide answers via text mining, and directing questions to appropriate users. Due to the growing popularity of Q&A systems, the number of questions in the system can become very large; thus, it is unlikely for an answer provider to simply stumble upon a question that he/she can answer properly. The primary objective of this research is to improve the quality of answers and to decrease wait times by forwarding questions to users who exhibit an interest or expertise in the area to which the question belongs. To that end, this research studies how to leverage social networks to enhance the performance of Q&A systems. We have proposed SocialQ&A, a social network based Q&A system that identifies and notifies the users who are most likely to answer a question. SocialQ&A incorporates three major components: User Interest Analyzer, Question Categorizer, and Question-User Mapper. The User Interest Analyzer associates each user with a vector of interest categories. The Question Categorizer algorithm associates a vector of interest categories with each question. Then, based on user interest and user social connectedness, the Question-User Mapper identifies a list of potential answer providers for each question. We have also implemented a real-world prototype for SocialQ&A and analyzed the data from questions/answers obtained from the prototype. Results suggest that social networks can be leveraged to improve the quality of answers and reduce the wait time for answers. Thus, this research provides a promising direction to improve the performance of Q&A systems.
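A minimal sketch of how a Question-User Mapper of this kind could combine interest similarity with social connectedness; the linear blend, the weight `alpha`, and the feature layout are assumptions for illustration, not the paper's actual implementation:

```python
import math

def cosine(u, v):
    """Cosine similarity between two interest-category vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_answerers(question_vec, user_vecs, social_weight, alpha=0.7):
    """Rank users by a blend of interest match and social connectedness.
    alpha and the linear blend are illustrative assumptions."""
    scores = {
        user: alpha * cosine(question_vec, vec)
              + (1 - alpha) * social_weight.get(user, 0.0)
        for user, vec in user_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

users = {"alice": [1, 0, 1], "bob": [0, 1, 0]}  # hypothetical interest vectors
conn = {"alice": 0.2, "bob": 0.9}               # hypothetical connectedness
print(rank_answerers([1, 0, 1], users, conn))   # alice's interests match best
```

The blend lets a well-connected but off-topic user lose to a less connected user whose interest vector matches the question's category vector.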

    Question Generation from Knowledge Graphs

    No full text

    Caught in the middle – language use and translation : a festschrift for Erich Steiner on the occasion of his 60th birthday

    Get PDF
    This book celebrates Erich Steiner’s scholarly work. In 25 contributions, colleagues and friends take up issues closely related to his research interests in linguistics and translation studies. The result is a colourful kaleidoscope reflecting the many strands of research questions that Erich Steiner helped advance in the past decades and the cheerful, inspiring atmosphere he continues to create

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Get PDF
    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. 
Definitions can be retrieved for up to 78% of terms, parent-child relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands the request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or due to the automatic classification. The Go3R search engine is available online at www.Go3R.org.
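As a rough illustration of term generation from statistically significant noun phrases, a domain-vs-background frequency score in the spirit of DOG4DAG might look like this; the statistic is a simplified stand-in, not the exact method of the thesis:

```python
import math
from collections import Counter

def significant_terms(domain_tokens, background_tokens, top_k=3):
    """Rank candidate terms by a simple log-likelihood-style score of their
    relative frequency in the domain corpus vs. a background corpus.
    (Illustrative stand-in for the statistics described in the thesis.)"""
    dom, bg = Counter(domain_tokens), Counter(background_tokens)
    n_dom, n_bg = sum(dom.values()), sum(bg.values())

    def score(t):
        p_dom = dom[t] / n_dom
        p_bg = (bg[t] + 1) / (n_bg + len(bg))  # add-one smoothing
        return dom[t] * math.log(p_dom / p_bg)

    return sorted(dom, key=score, reverse=True)[:top_k]

domain = ["apoptosis", "apoptosis", "cell", "pathway"]        # toy corpora
background = ["cell", "cell", "cell", "pathway", "the", "the"]
print(significant_terms(domain, background))  # "apoptosis" ranks first
```

Phrases that are frequent in the domain literature but rare in general text surface at the top, which is the intuition behind extracting ontology term candidates from PubMed abstracts.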

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    Get PDF
    Translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system mostly depends on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source, target, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations; results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction of out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.

    Identifying Experts in Question \& Answer Portals: A Case Study on Data Science Competencies in Reddit

    Full text link
    The irreplaceable key to the triumph of Question & Answer (Q&A) platforms is their users providing high-quality answers to the challenging questions posted across various topics of interest. Recently, the expert finding problem has attracted much attention in information retrieval research. In this work, we inspect the feasibility of a supervised learning model to identify data science experts in Reddit. Our method is based on manual coding results in which two data science experts labelled comments as expert, non-expert, and out-of-scope. We present a semi-supervised approach using the activity behaviour of every user, combining Natural Language Processing (NLP), crowdsourced, and user feature sets. We conclude that the NLP and user feature sets contribute the most to the identification of these three classes, which suggests that the method can generalise well within the domain. Moreover, we characterise different types of users, which can be helpful for detecting various user types in the future
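One way a classifier could combine NLP and user-activity features is a simple nearest-centroid rule over concatenated feature vectors; the feature layout and centroid values below are invented for illustration and are not the paper's trained model:

```python
def classify(comment_feats, user_feats, centroids):
    """Assign a comment to {expert, non-expert, out-of-scope} by nearest
    centroid over concatenated NLP and user-activity features.
    Feature layout and centroids are illustrative assumptions."""
    x = comment_feats + user_feats  # concatenate the two feature sets

    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))

    return min(centroids, key=lambda label: dist(centroids[label]))

centroids = {
    "expert":       [0.9, 0.8, 0.7],   # e.g. jargon density, karma, tenure
    "non-expert":   [0.4, 0.3, 0.5],
    "out-of-scope": [0.1, 0.1, 0.2],
}
print(classify([0.85, 0.75], [0.65], centroids))  # -> "expert"
```

In a semi-supervised setting, centroids seeded from the manually coded comments could then be refined on the unlabelled remainder of the forum.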

    Social Media Moderations, User Ban, and Content Generation: Evidence from Zhihu

    Get PDF
    Social media platforms have evolved into major outlets for many entities to distribute and consume information. The content on social media sites, however, is often considered inaccurate, misleading, or even harmful. To deal with such challenges, the platforms have developed rules and guidelines to moderate and regulate the content on their sites. In this study, we explore user banning as a moderation strategy that restricts, suspends, or bans a user whom the platform deems to be violating community rules from further participation on the platform for a predetermined period of time. We examine the impact of this moderation strategy using data from a major Q&A platform. Our analyses indicate that user banning increases a user's contribution after the platform lifts the ban. The magnitude of the impact, however, depends on the user's engagement level with the platform: the increase in contributions is smaller for a more engaged user. Additionally, we find that the quality of the user-generated content (UGC) decreases after the user ban is lifted. Our research is among the first to empirically evaluate the effectiveness of platform moderation. The findings have important implications for platform owners in managing the content on their sites