
    State-of-the-art generalisation research in NLP: a taxonomy and review

    The ability to generalise well is one of the primary desiderata of natural language processing (NLP). Yet, what `good generalisation' entails and how it should be evaluated is not well understood, nor are there any common standards to evaluate it. In this paper, we aim to lay the groundwork to improve both of these issues. We present a taxonomy for characterising and understanding generalisation research in NLP, we use that taxonomy to present a comprehensive map of published generalisation studies, and we make recommendations for which areas might deserve attention in the future. Our taxonomy is based on an extensive literature review of generalisation research, and contains five axes along which studies can differ: their main motivation, the type of generalisation they aim to solve, the type of data shift they consider, the source by which this data shift is obtained, and the locus of the shift within the modelling pipeline. We use our taxonomy to classify over 400 previous papers that test generalisation, for a total of more than 600 individual experiments. Considering the results of this review, we present an in-depth analysis of the current state of generalisation research in NLP, and make recommendations for the future. Along with this paper, we release a webpage where the results of our review can be dynamically explored, and which we intend to update as new NLP generalisation studies are published. With this work, we aim to make steps towards making state-of-the-art generalisation testing the new status quo in NLP. Comment: 35 pages of content + 53 pages of references

    Tune your brown clustering, please

    Brown clustering, an unsupervised hierarchical clustering technique based on n-gram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parameter tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact on any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal.
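    The tuning dynamic described above can be made concrete by looking at how Brown clusters are typically consumed downstream. As a hedged sketch (assuming the tab-separated paths-file format emitted by Liang's widely used brown-cluster implementation; the words, bit-strings, and counts below are invented for illustration), bit-string prefixes of varying length give cluster features at several granularities, and the number of classes chosen at clustering time bounds how fine-grained those features can be:

```python
# Sketch: turning Brown cluster output into sequence-labelling features.
# Assumed format: one "bit-string<TAB>word<TAB>count" line per word,
# as in Liang's brown-cluster tool; this sample data is invented.
SAMPLE_PATHS = """\
0010\tlondon\t523
0010\tparis\t401
0111\tmonday\t387
01101\ttuesday\t290
"""

def load_clusters(paths_text):
    """Map each word to its full Brown cluster bit-string."""
    clusters = {}
    for line in paths_text.strip().splitlines():
        bits, word, _count = line.split("\t")
        clusters[word] = bits
    return clusters

def prefix_features(word, clusters, prefix_lengths=(2, 4)):
    """Bit-prefix features at several granularities -- the usual way
    Brown clusters are fed to a CRF or perceptron tagger."""
    bits = clusters.get(word)
    if bits is None:
        return []
    return [f"brown[:{n}]={bits[:n]}" for n in prefix_lengths if len(bits) >= n]

clusters = load_clusters(SAMPLE_PATHS)
print(prefix_features("london", clusters))  # → ['brown[:2]=00', 'brown[:4]=0010']
```

    The prefix lengths, like the class count itself, are exactly the kind of hyper-parameters the paper argues should be tuned rather than left at defaults.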

    Scalable and Quality-Aware Training Data Acquisition for Conversational Cognitive Services

    Dialog Systems (or simply bots) have recently become a popular human-computer interface for performing users' tasks, by invoking the appropriate back-end APIs (Application Programming Interfaces) based on the user's request in natural language. Building task-oriented bots, which aim at performing real-world tasks (e.g., booking flights), has become feasible with the continuous advances in Natural Language Processing (NLP), Artificial Intelligence (AI), and the countless number of devices which allow third-party software systems to invoke their back-end APIs. Nonetheless, bot development technologies are still in their preliminary stages, with several unsolved theoretical and technical challenges stemming from the ambiguous nature of human languages. Given the richness of natural language, supervised models require a large number of user utterances paired with their corresponding tasks -- called intents. To build a bot, developers need to manually translate APIs to utterances (called canonical utterances) and paraphrase them to obtain a diverse set of utterances. Crowdsourcing has been widely used to obtain such datasets, by paraphrasing the initial utterances generated by the bot developers for each task. However, several issues remain unsolved. First, generating canonical utterances requires manual effort, making bot development both expensive and hard to scale. Second, since crowd workers may be anonymous and are asked to provide open-ended text (paraphrases), crowdsourced paraphrases may be noisy and incorrect (not conveying the same intent as the given task). This thesis first surveys the state-of-the-art approaches for collecting large sets of training utterances for task-oriented bots. Next, we conduct an empirical study to identify quality issues of crowdsourced utterances (e.g., grammatical errors, semantic completeness).
We also propose novel approaches for identifying unqualified crowd workers and eliminating malicious workers from crowdsourcing tasks. In particular, we propose a novel technique to promote the diversity of crowdsourced paraphrases by dynamically generating word suggestions while crowd workers are paraphrasing a particular utterance. We further propose a novel technique to automatically translate APIs to canonical utterances. Finally, we present our platform to automatically generate bots out of API specifications. We also conduct thorough experiments to validate the proposed techniques and models.
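    One of the quality issues raised above -- low-diversity paraphrases -- can be screened with very simple surface measures. The sketch below is not the thesis's actual technique; it is a generic illustration using token-level Jaccard similarity to flag crowd submissions that barely differ from the canonical utterance (all utterances and the threshold are invented for the example):

```python
# Generic sketch of a low-diversity screen for crowdsourced paraphrases.
def jaccard(a, b):
    """Token-level Jaccard similarity between two utterances."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def flag_low_diversity(canonical, paraphrases, threshold=0.8):
    """Return paraphrases suspiciously close to the canonical utterance."""
    return [p for p in paraphrases if jaccard(canonical, p) >= threshold]

canonical = "book a flight from sydney to melbourne"
paraphrases = [
    "book a flight from sydney to melbourne please",   # near-duplicate
    "i need a plane ticket to melbourne leaving sydney",  # genuinely rephrased
]
print(flag_low_diversity(canonical, paraphrases))
```

    A production pipeline would combine such surface checks with semantic validation, since a paraphrase can be lexically diverse yet still convey the wrong intent.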

    Computational Methods for Medical and Cyber Security

    Over the past decade, computational methods, including machine learning (ML) and deep learning (DL), have grown exponentially in their application to solving problems in various domains, especially medicine, cybersecurity, finance, and education. While these applications of machine learning algorithms have proven beneficial in various fields, many shortcomings have also been highlighted, such as the lack of benchmark datasets, the inability to learn from small datasets, the cost of architectures, adversarial attacks, and imbalanced datasets. On the other hand, new and emerging algorithms, such as deep learning, one-shot learning, continuous learning, and generative adversarial networks, have successfully solved various tasks in these fields. Therefore, applying these new methods to life-critical missions is crucial, as is measuring these less-traditional algorithms' success when used in these fields.

    Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020

    On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    Evaluating visually grounded language capabilities using microworlds

    Deep learning has had a transformative impact on computer vision and natural language processing. As a result, recent years have seen the introduction of more ambitious holistic understanding tasks, comprising a broad set of reasoning abilities. Datasets in this context typically act not just as application-focused benchmarks, but also as a basis for examining higher-level model capabilities. This thesis argues that emerging issues related to dataset quality, experimental practice and learned model behaviour are symptoms of the inappropriate use of benchmark datasets for capability-focused assessment. To address this deficiency, a new evaluation methodology is proposed here, which specifically targets in-depth investigation of model performance based on configurable data simulators. This focus on analysing system behaviour is complementary to the use of monolithic datasets as application-focused comparative benchmarks. Visual question answering is an example of a modern holistic understanding task, unifying a range of abilities around visually grounded language understanding in a single problem statement. It has also been an early example for which some of the aforementioned issues were identified. To illustrate the new evaluation approach, this thesis introduces ShapeWorld, a diagnostic data generation framework. Its design is guided by the goal of providing a configurable and extensible testbed for the domain of visually grounded language understanding. Based on ShapeWorld data, the strengths and weaknesses of various state-of-the-art visual question answering models are analysed and compared in detail, with respect to their ability to correctly handle statements involving, for instance, spatial relations or numbers.
Finally, three case studies illustrate the versatility of this approach and the ShapeWorld generation framework: an investigation of multi-task and curriculum learning, a replication of a psycholinguistic study for deep learning models, and an exploration of a new approach to assessing generative tasks like image captioning.
Qualcomm Award Premium Research Studentship, Engineering and Physical Sciences Research Council Doctoral Training Studentship

    A Corpus-based Register Analysis of Corporate Blogs – Text Types and Linguistic Features

    A main theme in sociolinguistics is register variation: variation of language dependent on the situation of use. Numerous studies have provided evidence of linguistic variation across situations of use in English. However, very little attention has been paid to the language of corporate blogs (CBs), often seen as an emerging genre of computer-mediated communication (CMC). Previous studies on blogs and corporate blogs have provided important information about their linguistic features and functions; however, our understanding of linguistic variation in corporate blogs remains limited, because many of these studies have focused on individual linguistic features rather than on how features interact and on the possible relations between forms (linguistic features) and functions. Given these limitations, a more systematic perspective on linguistic variation in corporate blogs is needed. To study register variation in corporate blogs more systematically, a combined framework rooted in Systemic Functional Linguistics (SFL) and register theories (e.g., Biber, 1988, 1995; Halliday & Hasan, 1989) is adopted. This combination rests on the common ground they share: a functional view of language, attention to co-occurrence patterns of linguistic features, and the importance of large corpora to linguistic research. Guided by this framework, this thesis aims to: 1) investigate functional linguistic variation in corporate blogs, identify the text types that are distinguished linguistically, and examine how the CB text types cut across CB industry categories; and 2) identify salient linguistic differences across text types in corporate blogs in the configuration of the three components of the context of situation: field, tenor, and mode of discourse.
To achieve these goals, a 590,520-word corpus consisting of 1,020 textual posts from 41 top-ranked corporate blogs is created and mapped onto the combined framework, which consists of Biber’s multi-dimensional (MD) approach and Halliday’s SFL. Two sets of empirical analyses are then conducted in sequence. First, CB text types are identified using a corpus-based MD approach, which applies multivariate statistical techniques (including factor analysis and cluster analysis) to the investigation of register variation. Then, selected linguistic features -- including the most common verbs and their process types, personal pronouns, modals, lexical density, and grammatical complexity -- are drawn from the language metafunctions of mode, tenor, and field within the SFL framework, and their differences across text types are analysed. The results show not only that the corporate blog is a hybrid genre, combining various text types that serve different communicative purposes and functional goals, but also that certain text types are closely associated with particular industries: the CB texts categorised into a certain text type come mainly from a particular industry. On this basis, the lexical and grammatical features (i.e., the most common verbs, pronouns, modal verbs, lexical density, and grammatical complexity) associated with Halliday’s metafunctions are further explored and compared across six text types. Language features related to field, tenor, and mode in corporate blogs prove dynamic in nature: centring on an interpersonal function, these business blogs are used primarily for sales, customer relationship management, and branding.
This research project contributes to the existing field of knowledge in the following ways. Firstly, it develops the methodology used in corpus investigations of language variation, paving the way for further research into corporate blogs and other forms of electronic communication and, more generally, for corpus-based investigations of other language varieties. Secondly, it adds substantially to the description of the corporate blog as a language variety in its own right, including the different text types identified in CB discourse and the linguistic features realised in the context of situation. This highlights the fact that corporate blogs are not a single, uniform discourse; rather, they vary according to text type and context of situation.
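    Of the features compared across text types, lexical density is the most mechanical to compute: the proportion of content (lexical) words among all running words. A minimal sketch, with a tiny invented stand-in for a real function-word inventory:

```python
# Illustrative sketch of lexical density as used in SFL-style register
# analysis. The stop-word set below is a toy stand-in; a real analysis
# would use a full function-word list and POS tagging.
FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "to", "in", "on",
    "is", "are", "was", "were", "be", "it", "we", "you", "that", "this",
}

def lexical_density(tokens):
    """Content words / total words, in [0, 1]."""
    if not tokens:
        return 0.0
    content = [t for t in tokens if t.lower() not in FUNCTION_WORDS]
    return len(content) / len(tokens)

tokens = "our new product improves customer engagement and brand loyalty".split()
print(round(lexical_density(tokens), 2))
```

    Comparing such scores across the six text types is what makes the mode-of-discourse differences in the thesis quantifiable.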

    Understanding search

    This thesis provides a framework for information retrieval based on a set of models which together illustrate how users of search engines come to express their needs in a particular way. With such insights, we may be able to improve systems’ ability to understand users’ requests and, through that, eventually their ability to satisfy users’ needs. Developing the framework necessitates discussion of context, relevance, need development, and the cybernetics of search, all of which are controversial topics. Transaction log data from two enterprise search engines are analysed using a specially developed method which classifies queries according to the aspect of the need they refer to.

    CyberResearch on the Ancient Near East and Eastern Mediterranean

    CyberResearch on the Ancient Near East and Neighboring Regions provides case studies on archaeology, objects, cuneiform texts, and online publishing, digital archiving, and preservation. Eleven chapters present a rich array of material, spanning the fifth through the first millennium BCE, from Anatolia, the Levant, Mesopotamia, and Iran. Customized cyber- and general glossaries support readers who lack either a technical background or familiarity with the ancient cultures. Edited by Vanessa Bigot Juloux, Amy Rebecca Gansell, and Alessandro Di Ludovico, this volume is dedicated to broadening the understanding and accessibility of digital humanities tools, methodologies, and results to Ancient Near Eastern Studies. Ultimately, this book provides a model for introducing cyber-studies to the mainstream of humanities research.