
    Inferring multilingual domain-specific word embeddings from large document corpora

    The use of distributed vector representations of words in Natural Language Processing has become established. To tailor general-purpose vector spaces to the context under analysis, several domain adaptation techniques have been proposed. They all require sufficiently large document corpora tailored to the target domains. However, in several cross-lingual NLP domains, both sufficiently large domain-specific document corpora and pre-trained domain-specific word vectors are hard to find for languages other than English. This paper tackles that issue. It proposes a new methodology to automatically infer aligned domain-specific word embeddings for a target language from the general-purpose and domain-specific models available for a source language (typically English). The proposed inference method relies on a two-step process, which first automatically identifies domain-specific words and then opportunistically reuses the non-linear space transformations applied to the word vectors of the source language to learn how to tailor the vector space of the target language to the domain of interest. The performance of the proposed method was validated via extrinsic evaluation on the established word retrieval task. To this end, a new benchmark multilingual dataset, derived from Wikipedia, has been released. The results confirmed the effectiveness and usability of the proposed approach.
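    The two-step transfer described above can be sketched in a few lines. This toy uses a linear least-squares map as a simplified stand-in for the paper's non-linear transformation, and random matrices in place of real embeddings; all names and data here are illustrative assumptions, not the authors' implementation.

    ```python
    import numpy as np

    def fit_domain_mapping(general_vecs, domain_vecs):
        """Learn a map W from the general-purpose space to the domain-specific
        space of the source language (linear least squares; the paper's
        transformation is non-linear)."""
        W, *_ = np.linalg.lstsq(general_vecs, domain_vecs, rcond=None)
        return W

    def adapt_target_space(target_general_vecs, W):
        """Reuse the source-language transform on the aligned general-purpose
        vectors of the target language."""
        return target_general_vecs @ W

    # toy data: random stand-ins for real word embeddings
    rng = np.random.default_rng(0)
    src_general = rng.normal(size=(100, 50))   # source-language general vectors
    true_W = rng.normal(size=(50, 50))
    src_domain = src_general @ true_W          # pretend domain-specific vectors
    W = fit_domain_mapping(src_general, src_domain)

    tgt_general = rng.normal(size=(20, 50))    # aligned target-language vectors
    tgt_domain = adapt_target_space(tgt_general, W)
    ```

    Because the source spaces are aligned with the target language's general-purpose space, the same learned transform can be applied directly to the target vectors.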

    A Word Embedding-Based Method for Unsupervised Adaptation of Cooking Recipes

    Studying food recipes is indispensable to understanding the science of cooking. An essential problem in food computing is the adaptation of recipes to user needs and preferences. The main difficulty when adapting recipes lies in determining ingredient relations, which are complex and hard to interpret. Word embedding models can capture the semantics of food items in a recipe, helping to understand how ingredients are combined and substituted. In this work, we propose an unsupervised method for adapting ingredient recipes to user preferences. To learn food representations and relations, we create and apply a domain-specific word embedding model. In contrast to previous works, we train the model not only on the lists of ingredients but also on the cooking instructions. We enrich the ingredient data by mapping them to a nutrition database to guide the adaptation and find ingredient substitutes. We performed three kinds of recipe adaptation: based on nutrition preferences, adapting to similar ingredients, and satisfying vegetarian and vegan diet restrictions. With 95% confidence, our method can obtain quality adapted recipes without prior knowledge extraction in the recipe adaptation domain. Our results confirm the potential of using a domain-specific semantic model to tackle the recipe adaptation task. Funding: European Commission 816303; University of Granada.
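    The core substitution step can be sketched as a nearest-neighbour search in the embedding space, filtered by a dietary constraint from the nutrition database. The embedding values and the nutrition table below are invented for illustration and are not the paper's data or model.

    ```python
    import numpy as np

    # toy domain embeddings for a few ingredients (illustrative values only)
    emb = {
        "butter":    np.array([0.9, 0.1, 0.0]),
        "margarine": np.array([0.85, 0.15, 0.05]),
        "olive_oil": np.array([0.7, 0.2, 0.3]),
        "sugar":     np.array([0.1, 0.9, 0.0]),
    }
    # hypothetical nutrition table: (kcal per 100 g, vegan?)
    nutrition = {"butter": (717, False), "margarine": (717, True),
                 "olive_oil": (884, True), "sugar": (387, True)}

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def substitute(ingredient, require_vegan=False):
        """Return the most embedding-similar ingredient that satisfies
        the dietary constraint."""
        candidates = [(cosine(emb[ingredient], v), name)
                      for name, v in emb.items()
                      if name != ingredient
                      and (not require_vegan or nutrition[name][1])]
        return max(candidates)[1]

    print(substitute("butter", require_vegan=True))  # -> margarine
    ```

    A real system would learn the embeddings from the recipe corpus (ingredient lists plus cooking instructions, as the abstract notes) and filter candidates by the nutrition targets of the adaptation.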

    Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit

    The primary focus of this thesis is to make Sanskrit manuscripts more accessible to end-users through natural language technologies. The morphological richness, compounding, free word order, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks that are crucial for developing robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any downstream application. However, it is challenging due to the sandhi phenomenon, which modifies characters at word boundaries. Similarly, existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes several contributions: (1) it proposes linguistically-informed neural architectures for these tasks; (2) it showcases the interpretability and multilingual extension of the proposed systems; (3) the proposed systems report state-of-the-art performance; (4) finally, it presents a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and a web-based toolkit. Comment: Ph.D. dissertation
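    To see why segmentation is the gateway task, consider a plain dictionary-driven segmenter. The sketch below does ordinary dynamic-programming word splitting over ASCII-transliterated strings; it deliberately ignores sandhi, whose boundary-altering rules are exactly what makes real SWS hard. The vocabulary and input are illustrative, not from the thesis.

    ```python
    def segment(text, vocab, max_word_len=10):
        """Dynamic-programming segmentation: find a split of `text` into
        vocabulary words, or None. Real Sanskrit segmentation must also undo
        sandhi rules at each candidate boundary, which this toy omits."""
        back = {0: None}                       # reachable positions -> prev split
        for i in range(1, len(text) + 1):
            for j in range(max(0, i - max_word_len), i):
                if j in back and text[j:i] in vocab:
                    back[i] = j
                    break
        if len(text) not in back:
            return None
        words, i = [], len(text)
        while i:                               # walk backpointers to recover words
            words.append(text[back[i]:i])
            i = back[i]
        return words[::-1]

    # simplified transliteration, no sandhi at the junction
    print(segment("gacchatiramah", {"gacchati", "ramah"}))
    ```

    With sandhi, the surface form at a boundary may match neither word's underlying form, so the candidate space explodes; this is where the thesis's linguistically-informed neural architectures come in.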

    Machine learning to generate soil information

    This thesis is concerned with the novel use of machine learning (ML) methods in soil science research. ML adoption in soil science has increased considerably, especially in pedometrics (the use of quantitative methods to study the variation of soils). In parallel, the size of soil datasets has also increased, thanks to global-scale projects that aim to rescue legacy data and to new large-extent surveys that collect new information. While we have big datasets and global projects, modelling is currently based mostly on "traditional" ML approaches, which do not take full advantage of these large data compilations. The compilation of these global datasets is severely limited by privacy concerns, and no solution has yet been implemented to facilitate the process. Considering the performance differences derived from the generality of global models versus the specificity of local models, there is still debate about which approach is better. In both global and local digital soil mapping (DSM), most applications are static. Even with the large soil datasets available to date, there is not enough soil data to perform fully-empirical, space-time modelling. Considering these knowledge gaps, this thesis aims to introduce advanced ML algorithms and training techniques, specifically deep neural networks, for modelling large datasets at a global scale and providing new soil information. The research presented here has successfully applied the latest advances in ML to improve upon some of the current approaches for soil modelling with large datasets. It has also created opportunities to utilise information, such as descriptive data, that has generally been disregarded. ML methods have been embraced by the soil community, and their adoption is increasing. In the particular case of neural networks, their flexibility in terms of structure and training makes them a good candidate to improve on current soil modelling approaches.
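    The kind of neural-network regression used in digital soil mapping can be illustrated with a minimal one-hidden-layer network trained by gradient descent. The covariates (think elevation, rainfall, NDVI) and the target soil property below are synthetic and purely illustrative, not data or architecture from the thesis.

    ```python
    import numpy as np

    rng = np.random.default_rng(42)

    # synthetic environmental covariates and a non-linear "soil property" target
    X = rng.normal(size=(200, 3))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

    # one hidden layer of 16 tanh units, trained with full-batch gradient descent
    W1 = rng.normal(scale=0.5, size=(3, 16)); b1 = np.zeros(16)
    W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)
    lr = 0.05
    for _ in range(2000):
        h = np.tanh(X @ W1 + b1)                   # hidden activations
        pred = (h @ W2 + b2).ravel()
        err = pred - y                             # gradient of 0.5 * MSE
        gW2 = h.T @ err[:, None] / len(y)
        gb2 = err.mean(keepdims=True)
        dh = err[:, None] @ W2.T * (1 - h ** 2)    # backprop through tanh
        gW1 = X.T @ dh / len(y)
        gb1 = dh.mean(axis=0)
        W1 -= lr * gW1; b1 -= lr * gb1
        W2 -= lr * gW2; b2 -= lr * gb2

    mse = float(np.mean((pred - y) ** 2))
    print(f"training MSE: {mse:.3f}")
    ```

    The flexibility the abstract mentions shows up here in how freely the hidden layer can be widened, deepened, or given extra inputs (for instance, embeddings of descriptive data) without changing the training loop.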