
    Validation of Tagging Suggestion Models for a Hotel Ticketing Corpus

    This paper investigates methods for the prediction of tags on a textual corpus that describes hotel staff inputs in a ticketing system. The aim is to improve the tagging process and find the most suitable method for suggesting tags for a new text entry. The paper consists of two parts: (i) exploration of existing sample data, which includes statistical analysis and visualisation of the data to provide an overview, and (ii) evaluation of tag prediction approaches. We have included approaches from different research fields in order to cover a broad spectrum of possible solutions. As a result, we have tested a machine learning model for multi-label classification (using gradient boosting), a statistical approach (using frequency heuristics), and two simple similarity-based classification approaches (Nearest Centroid and k-Nearest Neighbours). The experiment that compares the approaches uses recall to measure the quality of results. Finally, we provide a recommendation of the modelling approach which produces the best accuracy in terms of tag prediction on the sample data.

    Comparing tagging suggestion models on discrete corpora

    This paper investigates methods for the prediction of tags on textual corpora drawn from diverse data sets of short messages; as an example, the authors demonstrate the methods on hotel staff inputs in a ticketing system as well as on the publicly available StackOverflow corpus. The aim is to improve the tagging process and find the most suitable method for suggesting tags for a new text entry.

    Semi-Supervised Tag Recommendation - Using Untagged Resources to Mitigate Cold-Start Problems

    Tag recommender systems are often used in social tagging systems, a popular family of Web 2.0 applications, to assist users in the tagging process. But in cold-start situations, i.e., when new users or resources enter the system, state-of-the-art tag recommender systems perform poorly and are not always able to generate recommendations. Many user profiles contain untagged resources, which could provide valuable information, especially for cold-start scenarios where tagged data is scarce. Existing methods do not explore this additional information source. In this paper we propose a purely graph-based semi-supervised relational approach that uses untagged posts to address the cold-start problem. We conduct experiments on two real-life datasets and show that our approach outperforms the state of the art in many cases.

    Dream Hunter: A National Wildlife Refuge Manager’s Memoir

    (From the Preface) The lives of my ancestors were often venturesome, sometimes dangerous and occasionally deadly. The Crozier Clan, who lived in Scotland, killed and stole from their neighbors on both sides of the border with England. In the 1600s they emigrated to Ireland and in the 1700s on to America, settling in the wilds of New York and serving in the Revolutionary War. Other Crozier ancestors pioneered in Illinois and Iowa and some served in the Civil War. One shirttail relative killed, with his bare fists, two brothers who had assaulted him for romancing their sister. On my maternal grandmother’s side of the family, the Tschepen men were noblemen’s gamekeepers for several generations. In the late 1800s, when my grandmother and her sister came to America, they crossed the Atlantic while other ship passengers died from cholera and were buried at sea. Unfortunately, there are only a few anecdotal fragments about these adventurous ancestors to be passed down through the generations and enjoyed by their descendants. Although my life has not been as interesting as my ancestors’ by any stretch of imagination, it has been full of some experiences that I wish to pass on to my descendants, thus this memoir -- with all of its detail. The recollections in this memoir are about my outdoor experiences and my work as a professional wildlife manager, a career I loved. These recollections range from my days as a youth through nearly fifty years of association with the National Wildlife Refuge System. The recounted memories of a man at age 71 are much like life: sometimes wearisome, sometimes flawed, sometimes redundant and occasionally unique and interesting. Consequently, potential readers should take that into account. They should review the table of contents, then browse through the stories or chapters to look for parts that appeal to them. Some stories in this book are about living my dreams as an employee of the U.S. Fish and Wildlife Service and pursuing my aspirations, including those of improving the National Wildlife Refuge System or parts of it, thus the title of this book – DREAM HUNTE

    Proceedings of the Eighth Italian Conference on Computational Linguistics CLiC-it 2021

    The eighth edition of the Italian Conference on Computational Linguistics (CLiC-it 2021) was held at Università degli Studi di Milano-Bicocca from 26th to 28th January 2022. After the edition of 2020, which was held in fully virtual mode due to the health emergency related to Covid-19, CLiC-it 2021 represented the first moment for the Italian research community of Computational Linguistics to meet in person after more than one year of full/partial lockdown.

    EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

    Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and it is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it).

    Eesti keele ühendverbide automaattuvastus lingvistiliste ja statistiliste meetoditega (Automatic detection of Estonian particle verbs with linguistic and statistical methods)

    Nowadays, applications that process human languages (including Estonian) are part of everyday life. However, computers are not yet able to understand every nuance of language. Machine translation is probably the most well-known application of natural language processing. Occasionally, the worst failures of machine translation systems (e.g. Google Translate) are shared on social media. Most such cases happen when sequences longer than words are translated. For example, translation systems are not able to catch the correct meaning of the particle verb alt (‘from under’) minema (‘to go’) (‘to get deceived’) in the sentence Ta läks lepinguga alt because the literal translation of the components of the expression is not correct. In order to improve the quality of machine translation systems and other useful applications, e.g. spam detection or question answering systems, such (idiomatic) multi-word expressions and their meanings must be well detected. The detection of multi-word expressions and their meaning is important in all languages and therefore much research has been done in the field, especially in English. However, the suggested methods have not been applied to the detection of Estonian multi-word expressions before. The dissertation fills that gap and applies well-known machine learning methods to detect one type of Estonian multi-word expression: the particle verb. Based on large textual data, the thesis demonstrates that the traditional binary division of Estonian particle verbs into non-compositional (ainukordne, meaning is not predictable from the meaning of its components) and compositional (korrapärane, meaning is predictable from the meaning of its components) is not comprehensive enough. The research confirms the widely adopted view in computational linguistics that multi-word expressions form a continuum between compositional and non-compositional units. Moreover, it is shown that in addition to context, there are some linguistic features, e.g. the animacy and cases of subject and object, that help computers to predict whether the meaning of a particle verb in a sentence is compositional or non-compositional. In addition, the research introduces novel resources for the Estonian language: trained embeddings and created compositionality datasets are available for future research. https://www.ester.ee/record=b5252157~S

    Pretrained Transformers for Text Ranking: BERT and Beyond

    The goal of text ranking is to generate an ordered list of texts retrieved from a corpus in response to a query. Although the most common formulation of text ranking is search, instances of the task can also be found in many natural language processing applications. This survey provides an overview of text ranking with neural network architectures known as transformers, of which BERT is the best-known example. The combination of transformers and self-supervised pretraining has been responsible for a paradigm shift in natural language processing (NLP), information retrieval (IR), and beyond. In this survey, we provide a synthesis of existing work as a single point of entry for practitioners who wish to gain a better understanding of how to apply transformers to text ranking problems and researchers who wish to pursue work in this area. We cover a wide range of modern techniques, grouped into two high-level categories: transformer models that perform reranking in multi-stage architectures and dense retrieval techniques that perform ranking directly. There are two themes that pervade our survey: techniques for handling long documents, beyond typical sentence-by-sentence processing in NLP, and techniques for addressing the tradeoff between effectiveness (i.e., result quality) and efficiency (e.g., query latency, model and index size). Although transformer architectures and pretraining techniques are recent innovations, many aspects of how they are applied to text ranking are relatively well understood and represent mature techniques. However, there remain many open research questions, and thus in addition to laying out the foundations of pretrained transformers for text ranking, this survey also attempts to prognosticate where the field is heading.

    Reinventing the Social Scientist and Humanist in the Era of Big Data

    This book explores the big data evolution by interrogating the notion that big data is a disruptive innovation that appears to be challenging existing epistemologies in the humanities and social sciences. Exploring various (controversial) facets of big data such as ethics, data power, and data justice, the book attempts to clarify the trajectory of the epistemology of (big) data-driven science in the humanities and social sciences.

    Tune your Brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact on any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.