
    Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction.

    Language-independent tokenisation (LIT) methods that do not require labelled language resources or lexicons have recently gained popularity because of their applicability in resource-poor languages. Moreover, they compactly represent a language using a fixed-size vocabulary and can efficiently handle unseen or rare words. On the other hand, language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources. Unlike the subtokens produced by LIT methods, LST methods produce valid morphological subwords. Despite the contrasting trade-offs between LIT and LST methods, their relative performance on downstream NLP tasks remains unclear. In this paper, we empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages. Our experimental results covering eight languages show that LST consistently outperforms LIT when the vocabulary size is large, but LIT can produce comparable or better results than LST in many languages with comparatively smaller (i.e. less than 100K words) vocabulary sizes, encouraging the use of LIT when language-specific resources are unavailable or incomplete, or when a smaller model is required. Moreover, we find smoothed inverse frequency (SIF) to be an accurate method for creating word embeddings from subword embeddings for multilingual semantic similarity prediction tasks. Further analysis of the nearest neighbours of tokens shows that semantically and syntactically related tokens are closely embedded in subword embedding spaces. Comment: To appear in the 12th Language Resources and Evaluation Conference (LREC 2020).
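
    The abstract singles out smoothed inverse frequency (SIF) as an accurate way to compose word embeddings from subword embeddings. Below is a minimal sketch of the core SIF idea, assuming each subtoken vector is weighted by a / (a + p(t)), with p(t) the subtoken's corpus probability; the subtokens, vectors, and probabilities are hypothetical toy values, and any further steps in the paper's exact formulation (such as principal-component removal) are omitted.

    ```python
    import numpy as np

    def sif_word_embedding(subtokens, subtoken_vectors, subtoken_probs, a=1e-3):
        """Combine subword vectors into one word vector using smoothed
        inverse frequency (SIF) weights a / (a + p(t))."""
        vecs = np.stack([subtoken_vectors[t] for t in subtokens])
        weights = np.array([a / (a + subtoken_probs[t]) for t in subtokens])
        return weights @ vecs / weights.sum()   # rare subtokens count for more

    # Hypothetical subtokens of "tokenisation" with toy 4-d vectors.
    vectors = {"token": np.array([0.1, 0.3, 0.0, 0.2]),
               "isation": np.array([0.0, 0.1, 0.4, 0.1])}
    probs = {"token": 0.01, "isation": 0.002}   # corpus unigram probabilities
    print(sif_word_embedding(["token", "isation"], vectors, probs))
    ```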

    A Survey on Text Classification Algorithms: From Text to Predictions

    In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies to encode natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, whose description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data, from raw text to output labels. We highlight the differences between earlier methods and more recent, deep learning-based methods, both in how they function and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language and supply instructions for synthesising two new multilabel datasets, which we found to be particularly scarce in this setting. Finally, we provide an outline of new experimental results and discuss the open research challenges posed by deep learning-based language models.
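
    Since the review's focus is the flow from raw text to output labels, a compact example of the classical end of that spectrum may help. This is a sketch assuming scikit-learn and toy data: an explicit feature-extraction step (TF-IDF) followed by a linear classifier, which is precisely the stage that deep learning-based models replace with learned representations.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    texts = ["the film was wonderful", "terrible plot and acting",
             "a delightful experience", "i want my money back"]
    labels = [1, 0, 1, 0]   # toy binary labels

    # Classical pipeline: hand-specified features, then a linear classifier.
    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(texts, labels)
    print(clf.predict(["what a wonderful film"]))
    ```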

    Investigating multilingual approaches for parsing Universal Dependencies

    Multilingual dependency parsing encapsulates any attempt to parse multiple languages. It can involve parsing multiple languages in isolation (poly-monolingual), leveraging training data from multiple languages to process any of the included languages (polyglot), or training on one or multiple languages to process a low-resource language with no training data (zero-shot). In this thesis, we explore multilingual dependency parsing across all three paradigms, first analysing whether polyglot training on a number of source languages is beneficial for processing a target language in a zero-shot cross-lingual dependency parsing experiment using annotation projection. The results of this experiment show that polyglot training produces an overall trend of better results on the target language, but a highly related single source language can still be better for transfer. We then look at the role of pretrained language models in processing Irish, a moderately low-resource language. Here, we develop gaBERT, our own monolingual Irish BERT model, from scratch and compare it to a number of multilingual baselines, showing that developing a monolingual language model for Irish is worthwhile. We then turn to parsing Enhanced Universal Dependencies (EUD) graphs, an extension of basic Universal Dependencies trees, and describe the DCU-EPFL submission to the 2021 IWPT shared task on EUD parsing. Here, we develop a multitask model to jointly learn basic dependency parsing and EUD graph parsing, showing improvements over a single-task basic dependency parser. Lastly, we revisit polyglot parsing and investigate whether multiview learning can be applied to multilingual dependency parsing, learning different views based on the dataset source. We show that multiview learning can be used to train parsers with multiple datasets, yielding a general improvement over single-view baselines.
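
    The multitask EUD model jointly learns basic dependency parsing and EUD graph parsing. The sketch below, assuming PyTorch, shows only the general shape of such a design: a shared encoder feeding two task heads whose losses are summed and backpropagated together. Real parsers score head-dependent pairs (e.g. with biaffine attention); here that is simplified to per-token label scores, and all dimensions and targets are hypothetical.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JointParser(nn.Module):
        """Shared encoder with two heads: basic tree arcs and EUD graph edges."""
        def __init__(self, dim=64, n_labels=40):
            super().__init__()
            self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
            self.basic_head = nn.Linear(2 * dim, n_labels)  # basic-parse scores
            self.eud_head = nn.Linear(2 * dim, n_labels)    # EUD-graph scores

        def forward(self, x):
            h, _ = self.encoder(x)
            return self.basic_head(h), self.eud_head(h)

    model = JointParser()
    x = torch.randn(2, 10, 64)                # dummy batch: 2 sentences, 10 tokens
    targets = torch.randint(0, 40, (2, 10))   # dummy labels for both tasks
    basic, eud = model(x)
    # Joint objective: one backward pass updates the shared encoder for both tasks.
    loss = (F.cross_entropy(basic.reshape(-1, 40), targets.reshape(-1))
            + F.cross_entropy(eud.reshape(-1, 40), targets.reshape(-1)))
    loss.backward()
    ```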

    Adaptive sentiment analysis

    Domain dependency is one of the most challenging problems in the field of sentiment analysis. Although most sentiment analysis methods have decent performance if they are targeted at a specific domain and writing style, they do not usually work well with texts that originate outside their domain boundaries. Often there is a need to perform sentiment analysis in a domain where no labelled document is available. To address this scenario, researchers have proposed many domain adaptation or unsupervised sentiment analysis methods. However, there is still much room for improvement, as those methods typically cannot match conventional supervised sentiment analysis methods. In this thesis, we propose a novel aspect-level sentiment analysis method that seamlessly integrates lexicon- and learning-based methods. While its performance is comparable to existing approaches, it is less sensitive to domain boundaries and can be applied to cross-domain sentiment analysis when the target domain is similar to the source domain. It also offers more structured and readable results by detecting individual topic aspects and determining their sentiment strengths. Furthermore, we investigate a novel approach to automatically constructing domain-specific sentiment lexicons based on distributed word representations (aka word embeddings). The induced lexicon has quality on a par with a handcrafted one and could be used directly in a lexicon-based algorithm for sentiment analysis, but we find that a two-stage bootstrapping strategy can further boost the sentiment classification performance. Compared to existing methods, such an end-to-end, nearly unsupervised approach to domain-specific sentiment analysis works out of the box for any target domain, requires no handcrafted lexicon or labelled corpus, and achieves sentiment classification accuracy comparable to that of fully supervised approaches. Overall, the contribution of this Ph.D. work to the research field of sentiment analysis is twofold. First, we develop a new sentiment analysis system which can, in a nearly unsupervised manner, adapt to the domain at hand and perform sentiment analysis with minimal loss of performance. Second, we showcase this system in several areas (including finance, politics, and e-business), and investigate in particular the temporal dynamics of sentiment in such contexts.
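
    The abstract describes inducing a domain-specific sentiment lexicon from word embeddings. One common seed-based realisation of that idea (a sketch, not necessarily the thesis's exact method) scores every word by how much closer it sits to positive seed words than to negative ones; the embeddings below are random toy vectors, whereas in practice they would come from word2vec or fastText trained on an in-domain corpus.

    ```python
    import numpy as np

    def induce_lexicon(embeddings, pos_seeds, neg_seeds):
        """Score each word by mean cosine similarity to positive seeds
        minus mean cosine similarity to negative seeds."""
        def cos(u, v):
            return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
        return {w: np.mean([cos(v, embeddings[s]) for s in pos_seeds])
                   - np.mean([cos(v, embeddings[s]) for s in neg_seeds])
                for w, v in embeddings.items()}

    rng = np.random.default_rng(0)   # toy stand-in for in-domain embeddings
    emb = {w: rng.normal(size=50)
           for w in ["bullish", "bearish", "good", "bad", "profit"]}
    print(induce_lexicon(emb, pos_seeds=["good"], neg_seeds=["bad"]))
    ```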

    A generic and open environment for the treatment of multiword expressions (Un environnement générique et ouvert pour le traitement des expressions polylexicales)

    The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of past, ongoing and future work.
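
    MWE acquisition pipelines of this kind typically extract candidate word combinations from a corpus and rank them with association measures. As a minimal illustration of the general technique (plain Python, not code from the mwetoolkit itself), the sketch below ranks adjacent word pairs by pointwise mutual information; the corpus and threshold are toy values.

    ```python
    import math
    from collections import Counter

    def pmi_bigrams(tokens, min_count=2):
        """Rank adjacent word pairs by pointwise mutual information (PMI),
        a standard association measure for MWE candidate extraction."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n = len(tokens)
        scores = {}
        for (w1, w2), c in bigrams.items():
            if c >= min_count:   # discard rare, unreliable candidates
                scores[(w1, w2)] = math.log(
                    (c / (n - 1)) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        return sorted(scores.items(), key=lambda kv: -kv[1])

    corpus = ("the bus stop near the bus stop sign is a big deal "
              "take off the big deal take off").split()
    print(pmi_bigrams(corpus))
    ```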

    Towards generic relation extraction

    A vast amount of usable electronic data is in the form of unstructured text. The relation extraction task aims to identify useful information in text (e.g., PersonW works for OrganisationX, GeneY encodes ProteinZ) and recode it in a format such as a relational database that can be more effectively used for querying and automated reasoning. However, adapting conventional relation extraction systems to new domains or tasks requires significant effort from annotators and developers. Furthermore, previous adaptation approaches based on bootstrapping start from example instances of the target relations, thus requiring that the correct relation type schema be known in advance. Generic relation extraction (GRE) addresses the adaptation problem by applying generic techniques that achieve comparable accuracy when transferred, without modification of model parameters, across domains and tasks. Previous work on GRE has relied extensively on various lexical and shallow syntactic indicators. I present new state-of-the-art models for GRE that incorporate governor-dependency information. I also introduce a dimensionality reduction step into the GRE relation characterisation sub-task, which serves to capture latent semantic information and leads to significant improvements over an unreduced model. Comparison of dimensionality reduction techniques suggests that latent Dirichlet allocation (LDA), a probabilistic generative approach, successfully incorporates a larger and more interdependent feature set than a model based on singular value decomposition (SVD) and performs as well as or better than SVD in all experimental settings. Finally, I introduce multi-document summarisation as an extrinsic test bed for GRE and present results which demonstrate that the relative performance of GRE models is consistent across tasks and that the GRE-based representation leads to significant improvements over a standard baseline from the literature. Taken together, the experimental results 1) show that GRE can be improved using dependency parsing and dimensionality reduction, 2) demonstrate the utility of GRE for the content selection step of extractive summarisation and 3) validate the GRE claim of modification-free adaptation for the first time with respect to both domain and task. This thesis also introduces data sets derived from publicly available corpora for the purpose of rigorous intrinsic evaluation in the news and biomedical domains.
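
    The thesis compares LDA and SVD for reducing the relation characterisation feature space. The sketch below, assuming scikit-learn, shows the two reductions side by side on toy data, where each short "context" string stands in for the sparse feature vector of one relation mention; the reduced representations would then be clustered into relation types.

    ```python
    from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
    from sklearn.feature_extraction.text import CountVectorizer

    # Toy stand-ins for the feature contexts of four relation mentions.
    contexts = ["works for employer company", "encodes protein gene expression",
                "company hires person works", "gene protein binding encodes"]
    X = CountVectorizer().fit_transform(contexts)

    # Two ways to map sparse relation features into a low-dimensional space.
    lda_z = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(X)
    svd_z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
    print(lda_z.shape, svd_z.shape)   # reduced vectors, ready for clustering
    ```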

    Evaluating the impact of social media on sales forecasting: a quantitative study of the world's biggest brands using Twitter, Facebook and Google Trends

    In the world of digital communication, data from online sources such as social networks might provide additional information about changing consumer interest and significantly improve the accuracy of forecasting models. In this thesis I investigate whether information from Twitter, Facebook and Google Trends has the ability to improve daily sales forecasts for companies with respect to forecasts from transactional sales data only. My original contribution to this domain, presented in this thesis, consists of the following main steps:
    1. Data collection. I collected Twitter, Facebook and Google Trends data for the period May 2013 to May 2015 for 75 brands. Historical transactional sales data was supplied by Certona Corporation.
    2. Sentiment analysis. I introduced a new sentiment classification approach based on combining the two standard techniques (lexicon-based and machine-learning-based). The proposed method outperforms the state-of-the-art approach by 7% in F-score.
    3. Identification and classification of events. I proposed a framework for event detection and a robust method for clustering Twitter events into different types based on the shape of the Twitter volume and sentiment peaks. This approach captures the varying dynamics of information propagation through the social network. I provide empirical evidence that it is possible to identify types of Twitter events that have significant power to predict spikes in sales.
    4. Forecasting next-day sales. I explored linear, non-linear and cointegrating relationships between sales and social-media variables for 18 brands and showed that social-media variables can improve daily sales forecasts for the majority of brands by capturing factors such as consumer sentiment and brand perception. Moreover, I found that social-media data alone, without sales information, can be used to predict sales direction with an accuracy of 63%.
    Industry experts consider the results obtained in this thesis valuable and useful for decision making and strategic planning.
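
    Step 4 amounts to asking whether adding social-media regressors to a lagged-sales model improves next-day forecasts. The sketch below, assuming numpy and scikit-learn with purely synthetic data, compares a sales-only baseline against a sales-plus-sentiment linear model; the thesis additionally explored non-linear and cointegration models, which are not shown.

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(1)
    days = 200
    sales = np.cumsum(rng.normal(0, 1, days)) + 100   # synthetic daily sales
    sentiment = rng.normal(0, 1, days)                # synthetic daily sentiment

    # Baseline: yesterday's sales only.  Augmented: plus yesterday's sentiment.
    X_base = sales[:-1].reshape(-1, 1)
    X_aug = np.column_stack([sales[:-1], sentiment[:-1]])
    y = sales[1:]

    for name, X in [("sales only", X_base), ("sales + sentiment", X_aug)]:
        model = LinearRegression().fit(X[:150], y[:150])
        print(name, "held-out R^2:", round(model.score(X[150:], y[150:]), 3))
    ```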

    Astronomia ex machina: a history, primer, and outlook on neural networks in astronomy

    In recent years, deep learning has infiltrated every field it has touched, reducing the need for specialist knowledge and automating the process of knowledge discovery from data. This review argues that astronomy is no different, and that we are currently in the midst of a deep learning revolution that is transforming the way we do astronomy. We trace the history of astronomical connectionism from the early days of multilayer perceptrons, through the second wave of convolutional and recurrent neural networks, to the current third wave of self-supervised and unsupervised deep learning. We then predict that we will soon enter a fourth wave of astronomical connectionism, in which fine-tuned versions of an all-encompassing 'foundation' model will replace expertly crafted deep learning models. We argue that such a model can only be brought about through a symbiotic relationship between astronomy and connectionism, whereby astronomy provides high-quality multimodal data to train the foundation model, and in turn the foundation model is used to advance astronomical research. Comment: 60 pages, 269 references, 29 figures. Review submitted to Royal Society Open Science. Comments and feedback welcome.
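
    The predicted fourth wave centres on fine-tuning a shared pretrained model rather than crafting bespoke networks per task. As a toy sketch of that pattern, assuming PyTorch, the snippet below freezes a stand-in "pretrained" backbone and trains only a small task head; the backbone, dimensions, and the five-class task are all hypothetical placeholders.

    ```python
    import torch
    import torch.nn as nn

    # Stand-in for a pretrained foundation-model backbone (weights assumed given).
    backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
    for p in backbone.parameters():
        p.requires_grad = False          # freeze the pretrained weights

    head = nn.Linear(256, 5)             # small task head, e.g. 5 galaxy classes
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)

    x, y = torch.randn(32, 128), torch.randint(0, 5, (32,))
    loss = nn.functional.cross_entropy(head(backbone(x)), y)
    loss.backward()
    opt.step()                           # only the head's parameters change
    ```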

    Positive emotions in the voice: Towards an ethological understanding
