17 research outputs found

    Using Section Headings to Compute Cross-Lingual Similarity of Wikipedia Articles

    Measuring the similarity of interlanguage-linked Wikipedia articles often requires suitable language resources (e.g., dictionaries and MT systems), which can be problematic for languages with limited or poor translation resources. The size of Wikipedia can also present computational demands when computing similarity. This paper presents a ‘lightweight’ approach to measuring cross-lingual similarity in Wikipedia that uses section headings rather than the entire article, together with language resources derived from Wikipedia and Wiktionary to perform translation. Using an existing dataset, we evaluate the approach for 7 language pairs. Results show that performance using section headings is comparable to using all article content, that dictionaries derived from Wikipedia and Wiktionary are sufficient to compute cross-lingual similarity, and that combinations of features can further improve results.
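    A minimal sketch of the section-heading idea in Python (not the authors' implementation): headings are translated by simple dictionary lookup and compared by set overlap. The bilingual_dict lexicon, the tokenisation and the Jaccard score are illustrative assumptions; any Wikipedia/Wiktionary-derived word-to-translations mapping could be plugged in.

        # Score cross-lingual similarity of an article pair using only its section headings.
        def translate_headings(headings, bilingual_dict):
            """Map heading tokens into the target language via dictionary lookup."""
            translated = set()
            for heading in headings:
                for token in heading.lower().split():
                    translated.update(bilingual_dict.get(token, []))
            return translated

        def heading_similarity(src_headings, tgt_headings, bilingual_dict):
            """Jaccard overlap between translated source headings and target headings."""
            src_terms = translate_headings(src_headings, bilingual_dict)
            tgt_terms = {tok for h in tgt_headings for tok in h.lower().split()}
            if not src_terms or not tgt_terms:
                return 0.0
            return len(src_terms & tgt_terms) / len(src_terms | tgt_terms)

        # Toy German-English lexicon; values are purely illustrative.
        toy_dict = {"geschichte": ["history"], "literatur": ["literature"]}
        print(heading_similarity(["Geschichte", "Literatur"],
                                 ["History", "Modern literature"], toy_dict))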

    Do origin and facts identify automatically generated text?

    We present a proof of concept investigating whether native language identification and fact-checking information improve a language model (GPT-2) classifier that determines whether a piece of text was written by a human or a machine. Since automatic text generation is trained on the writings of many individuals, we hypothesize that there will not be a clear native language for 'the writer', and therefore that a native language identification module can be used in reverse, i.e. when a native language cannot be identified, the probability of automatic generation is higher. Automatic generation is also known to hallucinate, making up content. To this end, we integrate a Wikipedia fact-checking module. Both pieces of information are simply added to the input of the GPT-2 classifier, and result in an improvement over its baseline performance in the English-language human-or-generated subtask of the Automated Text Identification (AuTexTification) shared task [1].
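    A minimal sketch of the input-augmentation step described above: the auxiliary signals are rendered as text and prepended to the passage before it reaches the human-vs-machine classifier. The nli_confidence and fact_support values and their textual encoding are hypothetical stand-ins, not the paper's actual modules.

        def build_classifier_input(text, nli_confidence, fact_support):
            """Concatenate auxiliary signals with the passage as plain-text features."""
            signals = (f"native_language_confidence={nli_confidence:.2f} "
                       f"fact_support={fact_support:.2f} ")
            return signals + text

        augmented = build_classifier_input(
            "The Eiffel Tower was completed in 1889.",
            nli_confidence=0.12,  # no clear native language: may suggest machine generation
            fact_support=0.95,    # claims supported by Wikipedia
        )
        print(augmented)  # this string would then be fed to the GPT-2-based classifier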

    The SENSEI Annotated Corpus: Human Summaries of Reader Comment Conversations in On-line News

    Researchers are beginning to explore how to generate summaries of extended argumentative conversations in social media, such as those found in reader comments in on-line news. To date, however, there has been little discussion of what these summaries should be like and a lack of human-authored exemplars, quite likely because writing summaries of this kind of interchange is so difficult. In this paper we propose one type of reader comment summary – the conversation overview summary – that aims to capture the key argumentative content of a reader comment conversation. We describe a method we have developed to support humans in authoring conversation overview summaries and present a publicly available corpus – the first of its kind – of news articles plus comment sets, each multiply annotated, according to our method, with conversation overview summaries.

    The SENSEI Overview of Newspaper Readers’ Comments

    Automatic summarization of reader comments in on-line news is a challenging but clearly useful task. Work to date has produced extractive summaries using well-known techniques from other areas of NLP. But do users really want these, and do they support users in realistic tasks? We specify an alternative summary type for reader comments, based on the notions of issues and viewpoints, and demonstrate our user interface for presenting it. An evaluation assessing how well summarization systems support users in time-limited tasks (identifying issues and characterizing opinions) gives good results for this prototype.

    A Comparison of Approaches for Measuring Cross-Lingual Similarity of Wikipedia Articles

    Wikipedia has been used as a source of comparable texts for a range of tasks, such as Statistical Machine Translation and Cross-Language Information Retrieval. Articles written in different languages on the same topic are often connected through inter-language links. However, the extent to which these articles are similar is highly variable, and this may impact the use of Wikipedia as a comparable resource. In this paper we compare various language-independent methods for measuring cross-lingual similarity: character n-grams, cognateness, word count ratio, and an approach based on outlinks. These approaches are compared against a baseline utilising MT resources. Measures are also compared to human judgements of similarity using a manually created resource containing 700 pairs of Wikipedia articles (in 7 language pairs). Results indicate that a combination of language-independent models (character n-grams, outlinks and word-count ratio) is highly effective for identifying cross-lingual similarity and performs comparably to language-dependent models (translation and monolingual analysis). The work of the first author was carried out in the framework of the Tacardi research project (TIN2012-38523-C02-00); the work of the fourth author was in the framework of the DIANA-Applications (TIN2012-38603-C02-01) and WIQ-EI IRSES (FP7 Marie Curie No. 269180) research projects.
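    A minimal sketch (an assumption for illustration, not the paper's code) of two of the language-independent measures compared above: character n-gram cosine similarity and word count ratio for an article pair.

        from collections import Counter
        from math import sqrt

        def char_ngrams(text, n=3):
            """Counter of overlapping character n-grams."""
            text = text.lower()
            return Counter(text[i:i + n] for i in range(len(text) - n + 1))

        def ngram_cosine(text_a, text_b, n=3):
            """Cosine similarity between character n-gram profiles."""
            a, b = char_ngrams(text_a, n), char_ngrams(text_b, n)
            dot = sum(count * b[gram] for gram, count in a.items())
            norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
            return dot / norm if norm else 0.0

        def word_count_ratio(text_a, text_b):
            """Ratio of the shorter to the longer article, in words."""
            len_a, len_b = len(text_a.split()), len(text_b.split())
            return min(len_a, len_b) / max(len_a, len_b) if max(len_a, len_b) else 0.0

        print(ngram_cosine("Universitat de Valencia", "University of Valencia"))
        print(word_count_ratio("Universitat de Valencia", "University of Valencia"))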

    Bilingual dictionaries for all EU languages

    Bilingual dictionaries can be generated automatically using the GIZA++ tool. However, these dictionaries contain a lot of noise, which negatively affects the quality of the output of tools that rely on them. In this work, we present three different methods for cleaning noise from automatically generated bilingual dictionaries: an LLR-based, a pivot-based and a transliteration-based approach. We have applied these approaches to the GIZA++ dictionaries – dictionaries covering official EU languages – in order to remove noise. Our evaluation showed that all methods help to reduce noise; however, the best performance is achieved using the transliteration-based approach. We provide all bilingual dictionaries (the original GIZA++ dictionaries and the cleaned ones) free for download. We also provide the cleaning tools and scripts for free download.
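    A minimal sketch of the transliteration-style filtering idea: a candidate entry is kept only if the romanised source and target forms look sufficiently alike. The romanise() helper, the 0.7 threshold and the restriction to entries expected to be transliterations (e.g. named entities) are illustrative assumptions; the actual cleaning pipeline is more elaborate.

        from difflib import SequenceMatcher
        import unicodedata

        def romanise(word):
            """Very rough romanisation: strip diacritics and lowercase."""
            norm = unicodedata.normalize("NFKD", word)
            return "".join(c for c in norm if not unicodedata.combining(c)).lower()

        def keep_pair(source_word, target_word, threshold=0.7):
            """Accept an entry if its romanised forms look like transliterations."""
            ratio = SequenceMatcher(None, romanise(source_word), romanise(target_word)).ratio()
            return ratio >= threshold

        # Toy noisy dictionary entries; only the plausible transliteration survives.
        noisy_entries = [("université", "university"), ("université", "banana")]
        cleaned = [pair for pair in noisy_entries if keep_pair(*pair)]
        print(cleaned)  # [('université', 'university')]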

    Do you see what I see? Images of the COVID-19 pandemic through the lens of Google

    During times of crisis, information access is crucial. Given the opaque processes behind modern search engines, it is important to understand the extent to which the “picture” of the COVID-19 pandemic accessed by users differs. We explore variations in what users “see” concerning the pandemic through Google image search, using a two-step approach. First, we crowdsource a search task to users in four regions of Europe, asking them to help us create a photo documentary of COVID-19 by providing image search queries. Analysing the queries, we find five common themes describing information needs. Next, we study three sources of variation – users’ information needs, their geo-locations and query languages – and analyse their influence on the similarity of results. We find that users see the pandemic differently depending on where they live, as evidenced by the 46% similarity across results. When users expressed a given query in different languages, there was no overlap for most of the results. Our analysis suggests that localisation plays a major role in the (dis)similarity of results, and provides evidence of the diverse “picture” of the pandemic seen through Google.
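    A minimal sketch of one way the similarity of results across locations or languages could be quantified; the paper reports such overlap figures, but this particular Jaccard formulation and the example data are assumptions for illustration.

        def result_overlap(results_a, results_b):
            """Jaccard similarity between two sets of returned image URLs."""
            set_a, set_b = set(results_a), set(results_b)
            if not set_a and not set_b:
                return 0.0
            return len(set_a & set_b) / len(set_a | set_b)

        # Hypothetical result lists for the same query issued from two locations.
        results_region_a = ["img1.jpg", "img2.jpg", "img3.jpg"]
        results_region_b = ["img2.jpg", "img4.jpg", "img5.jpg"]
        print(result_overlap(results_region_a, results_region_b))  # 0.2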

    Preserving the memory of the first wave of COVID-19 pandemic: Crowdsourcing a collection of image search queries

    The unprecedented events of the COVID-19 pandemic have generated an enormous amount of information and populated the Web with new content relevant to the pandemic and its implications. Visual information such as images has been shown to be crucial in the context of scientific communication. Images are often interpreted as being closer to the truth than other forms of communication because they physically depict an event such as the COVID-19 pandemic. In this work, we ask crowdworkers across four regions of Europe that were severely affected by the first wave of the pandemic to provide us with image search queries related to the COVID-19 pandemic. The goal of this study is to understand the similarities and differences in the aspects that are most important to users across different locations regarding the first wave of the COVID-19 pandemic. Through a content analysis of their queries, we discovered five common themes of concern to all, although the frequency of their use differed across regions.

    Extracting bilingual terms from the Web

    In this paper we make two contributions. First, we describe a multi-component system called BiTES (Bilingual Term Extraction System), designed to automatically gather domain-specific bilingual term pairs from Web data. BiTES components consist of data gathering tools, domain classifiers, monolingual text extraction systems and bilingual term aligners. BiTES is readily extendable to new language pairs and has been successfully used to gather bilingual terminology for 24 language pairs, including English and all official EU languages, save Irish. Second, we describe a novel set of methods for evaluating the main components of BiTES and present the results of our evaluation for six language pairs. Results show that the BiTES approach can be used to successfully harvest quality bilingual term pairs from the Web. Our evaluation method delivers significant insights into the strengths and weaknesses of our techniques. It can be straightforwardly reused to evaluate other bilingual term extraction systems and makes a novel contribution to the study of how to evaluate bilingual terminology extraction systems.
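    A minimal sketch of how one component, the bilingual term aligner, could be scored against a gold standard; the paper's evaluation methodology is richer, and the term pairs below are purely illustrative.

        def pair_precision(extracted_pairs, gold_pairs):
            """Fraction of extracted (source, target) term pairs present in the gold set."""
            if not extracted_pairs:
                return 0.0
            gold = set(gold_pairs)
            return sum(1 for pair in extracted_pairs if pair in gold) / len(extracted_pairs)

        extracted = [("heart attack", "infarctus"), ("heart attack", "pomme")]
        gold = {("heart attack", "infarctus"), ("stroke", "AVC")}
        print(pair_precision(extracted, gold))  # 0.5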

    Report on the CyCAT winter school on fairness, accountability, transparency and ethics (FATE) in AI

    The first FATE Winter School, organized by the Cyprus Center for Algorithmic Transparency (CyCAT), provided a forum for both students and senior researchers to examine the complex topic of Fairness, Accountability, Transparency and Ethics (FATE). Through a program that included two invited keynotes, as well as sessions led by CyCAT partners across Europe and Israel, participants were exposed to a range of approaches to FATE in a holistic manner. During the Winter School, the team also organized a hands-on activity to evaluate a tool-based intervention, in which participants interacted with eight prototypes of bias-aware search engines. Finally, participants were invited to join one of four collaborative projects coordinated by CyCAT, thus furthering common understanding and interdisciplinary collaboration on this emerging topic.