76 research outputs found

    Word Sequence Modeling using Deep Learning: an End-to-end Approach and its Applications

    For a long time, natural language processing (NLP) has relied on generative models with task-specific, manually engineered features. Recently, there has been a resurgence of interest in neural networks in the machine learning community, which have obtained state-of-the-art results in fields such as computer vision, speech processing and natural language processing. The central idea behind these approaches is to learn features and models simultaneously, in an end-to-end manner, making as few assumptions as possible. In NLP, word embeddings, which map the words of a dictionary onto a continuous low-dimensional vector space, have proven very effective for a large variety of tasks while requiring almost no a priori linguistic assumptions. In this thesis, we investigate continuous representations of segments in a sentence for the purpose of solving NLP tasks that involve complex sentence-level relationships. Our sequence modeling approach is based on neural networks and takes advantage of word embeddings. A first approach models words in context as continuous vector representations which are used to solve the task of interest. With a compositional procedure that compresses arbitrarily-sized segments into continuous vectors, the model is able to consider long-range word dependencies as well. We first validate our approach on the task of bilingual word alignment, which consists in finding word correspondences between a sentence and its translation in another language. Source and target words in context are modeled using convolutional neural networks, yielding representations that are later used to compute alignment scores. An aggregation operation enables unsupervised training for this task. We show that our model outperforms a standard generative model. The model above is then extended to tackle phrase prediction tasks, where phrases rather than single words are to be tagged.
These tasks have typically been cast as classic word tagging problems, using special tagging schemes to identify segment boundaries. The proposed neural model instead focuses on learning fixed-size representations of arbitrarily-sized chunks of words that are used to solve the tagging task. A compositional operation is introduced for the purpose of computing these representations. We demonstrate the viability of the proposed representations by evaluating the approach on the multiword expression tagging task. The remainder of this thesis addresses the task of syntactic constituency parsing which, as opposed to the above tasks, aims at producing a structured output, in the form of a tree, for an input sentence. Syntactic parsing is cast as multiple phrase prediction problems that are solved recursively in a greedy manner. An extension using recursive compositional vector representations, allowing lexical information to be propagated from early stages, is explored as well. This approach is evaluated on a standard corpus, obtaining performance comparable to generative models with much shorter computation time. Finally, morphological tags are included as additional features, using a similar composition procedure, to improve parsing performance for morphologically rich languages. State-of-the-art results were obtained for these tasks and languages.
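    The central idea above, compressing an arbitrarily-sized chunk of words into a single fixed-size vector, can be illustrated with a minimal sketch. Mean pooling over word embeddings is used here as an illustrative stand-in; the thesis's actual compositional operation may differ.

```python
import numpy as np

def chunk_representation(embeddings, start, end):
    """Compress an arbitrarily-sized span of word vectors into one
    fixed-size vector by mean pooling (illustrative composition)."""
    span = embeddings[start:end]   # (span_length, dim)
    return span.mean(axis=0)       # (dim,) regardless of span length

# Toy sentence of 5 words with 4-dimensional embeddings.
rng = np.random.default_rng(0)
sent = rng.normal(size=(5, 4))

v = chunk_representation(sent, 1, 4)   # the 3-word chunk covering words 1..3
```

    Whatever the span length, the output dimension is fixed, which is what lets a downstream tagger treat phrases of different sizes uniformly.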

    Is Deep Learning Really Necessary for Word Embeddings?

    Word embeddings resulting from neural language models have been shown to be successful for a large variety of NLP tasks. However, such architectures can be difficult to train and time-consuming. Instead, we propose to drastically simplify the computation of word embeddings through a Hellinger PCA of the word co-occurrence matrix. We compare these new word embeddings with some well-known embeddings on NER and movie review tasks and show that we can reach similar or even better performance. Although deep learning is not really necessary for generating good word embeddings, we show that it can provide an easy way to adapt embeddings to specific tasks.
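    A minimal sketch of the recipe the abstract names: each row of the co-occurrence matrix is normalized into a distribution over contexts, the elementwise square root turns Euclidean distance into Hellinger distance, and a PCA (here via SVD) yields the embeddings. The matrix shape and toy counts are illustrative, not the paper's setup.

```python
import numpy as np

def hellinger_pca_embeddings(cooc, dim):
    """Word embeddings from a Hellinger PCA of a word co-occurrence
    matrix. cooc: (V, C) counts of words (rows) with contexts (cols)."""
    # Rows as probability distributions over contexts.
    p = cooc / cooc.sum(axis=1, keepdims=True)
    # Square root: Euclidean distance on h equals Hellinger distance on p
    # (up to the constant 1/sqrt(2)).
    h = np.sqrt(p)
    # PCA via SVD of the centered matrix; keep the top `dim` components.
    h = h - h.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(h, full_matrices=False)
    return u[:, :dim] * s[:dim]   # (V, dim) embeddings

# Toy counts: the first two words share contexts, the third does not.
cooc = np.array([[4., 1., 0.],
                 [3., 2., 0.],
                 [0., 1., 5.]])
emb = hellinger_pca_embeddings(cooc, 2)
```

    Words with similar context distributions end up close in the embedding space, which is the property the downstream NER and sentiment evaluations rely on.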

    PGxCorpus and PGxLOD: two shared resources for knowledge management in pharmacogenomics

    Pharmacogenomics (PGx) studies the impact of genetic factors on drug response phenotypes. Atomic knowledge units in PGx have the form of ternary relationships linking one or more drugs, one or more genetic factors, and one or more phenotypes. Such relationships state that a patient having the specified genetic factors and being treated with the specified drugs is likely to experience the given phenotypes. PGx knowledge is of particular interest for the development of precision medicine, which aims at tailoring drug treatments to each patient to reduce adverse effects and maximize drug efficacy. However, PGx knowledge is scattered across many sources (e.g., reference databases, the biomedical literature) and suffers from very heterogeneous levels of validation: some PGx relationships are extensively studied and have been translated into clinical practice, but most are only observed on small cohorts or not yet reproduced and necessitate further investigation. Consequently, there is a strong interest in extracting and integrating knowledge units from these different sources into a single place, to provide a consolidated view of the state-of-the-art knowledge of this domain and drive the validation, or moderation, of insufficiently validated knowledge units. To this aim, we created and share with the community two resources: PGxCorpus and PGxLOD.
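    The ternary knowledge units described above can be sketched as a small data structure; the field names below are illustrative and not the actual PGxCorpus/PGxLOD schema. Sets of drugs, genetic factors and phenotypes capture the "one or more" arity of each slot.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PGxRelationship:
    """One atomic PGx knowledge unit: a ternary relationship linking
    one or more drugs, genetic factors, and phenotypes (illustrative)."""
    drugs: frozenset
    genetic_factors: frozenset
    phenotypes: frozenset

rel = PGxRelationship(
    drugs=frozenset({"warfarin"}),
    genetic_factors=frozenset({"CYP2C9*3"}),
    phenotypes=frozenset({"bleeding risk"}),
)
```

    Making the record frozen and set-valued means two units extracted from different sources compare equal when they state the same relationship, which is exactly what integration across heterogeneous sources needs.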

    Syntax-based Transfer Learning for the Task of Biomedical Relation Extraction

    Transfer learning (TL) proposes to enhance machine learning performance on a problem by reusing labeled data originally designed for a related problem. In particular, domain adaptation consists, for a given task, in reusing training data developed for the same task but for a distinct domain. This is particularly relevant to applications of deep learning in Natural Language Processing, because they usually require large annotated corpora that may not exist for the targeted domain but do exist for related domains. In this paper, we experiment with TL for the task of Relation Extraction (RE) from biomedical texts, using the TreeLSTM model. We empirically show the impact of TreeLSTM alone and with domain adaptation, obtaining better performance than the state of the art on two biomedical RE tasks and equal performance on two others, for which few annotated data are available. Furthermore, we propose an analysis of the role that syntactic features may play in TL for RE.
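    The domain-adaptation recipe the abstract describes, train on a large source-domain corpus and then continue training on the small target-domain set, can be sketched on a toy classifier. This uses plain logistic regression on synthetic data purely to show the two-phase training pattern; it is not the paper's TreeLSTM model.

```python
import numpy as np

def train_logreg(x, y, w=None, lr=0.1, epochs=200):
    """Gradient-descent logistic regression; passing `w` warm-starts
    training, which is how the fine-tuning phase reuses source weights."""
    if w is None:
        w = np.zeros(x.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-x @ w))
        w -= lr * x.T @ (p - y) / len(y)
    return w

# Toy data: the label depends on feature 0 in both domains.
rng = np.random.default_rng(0)
x_src = rng.normal(size=(200, 5)); y_src = (x_src[:, 0] > 0).astype(float)
x_tgt = rng.normal(size=(20, 5));  y_tgt = (x_tgt[:, 0] > 0).astype(float)

w_pre = train_logreg(x_src, y_src)                              # source domain
w_ft = train_logreg(x_tgt, y_tgt, w=w_pre.copy(), epochs=50)    # fine-tune
acc = ((1 / (1 + np.exp(-x_tgt @ w_ft)) > 0.5) == y_tgt).mean()
```

    The point of the warm start is that the 20 target examples only need to nudge weights already shaped by the 200 source examples, rather than learn the task from scratch.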

    PGxO: A very lite ontology to reconcile pharmacogenomic knowledge units

    We present in this article a lightweight ontology named PGxO and a set of rules for its instantiation, which we developed as a frame for reconciling and tracing pharmacogenomics (PGx) knowledge. PGx studies how genomic variations impact variations in drug response phenotypes. Knowledge in PGx is typically composed of units that have the form of ternary relationships gene variant–drug–adverse event, stating that an adverse event may occur for patients having the gene variant when exposed to the drug. These knowledge units (i) are available in reference databases such as PharmGKB, (ii) are reported in the scientific biomedical literature, and (iii) may be discovered by mining clinical data such as Electronic Health Records (EHRs). Therefore, knowledge in PGx is heterogeneously described (i.e., with varying quality, granularity, vocabulary, etc.). It is consequently worthwhile to extract, then compare, assertions from distinct resources. Using PGxO, one can represent multiple provenances for pharmacogenomic knowledge units and reconcile duplicates when they come from distinct sources.
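    The reconciliation idea, recognizing that two assertions from different sources are the same knowledge unit and merging their provenances, can be sketched as follows. The tuple layout and source names are illustrative; PGxO's actual reconciliation rules are richer than exact-match grouping.

```python
from collections import defaultdict

def reconcile(units):
    """Group duplicate PGx knowledge units and merge their provenances.
    Each unit is a tuple (gene_variant, drug, adverse_event, source)."""
    merged = defaultdict(set)
    for variant, drug, event, source in units:
        # Identical ternary relationships collapse into one key;
        # the set accumulates every source that asserted it.
        merged[(variant, drug, event)].add(source)
    return dict(merged)

units = [
    ("CYP2C9*3", "warfarin", "bleeding", "PharmGKB"),
    ("CYP2C9*3", "warfarin", "bleeding", "literature"),
    ("TPMT*2", "azathioprine", "myelosuppression", "EHR mining"),
]
kb = reconcile(units)
```

    Keeping every provenance attached to a unit is what makes it possible to weigh well-replicated relationships against those seen in only one source.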

    Genetic control of plasticity of oil yield for combined abiotic stresses using a joint approach of crop modeling and genome-wide association

    Understanding the genetic basis of phenotypic plasticity is crucial for predicting and managing climate change effects on wild plants and crops. Here, we combined crop modeling and quantitative genetics to study the genetic control of oil yield plasticity under multiple abiotic stresses in sunflower. First, we developed stress indicators to characterize 14 environments for three abiotic stresses (cold, drought and nitrogen) using the SUNFLO crop model and phenotypic variations of three commercial varieties. The computed plant stress indicators explain yield variation better than descriptors at the climatic or crop levels. In those environments, we observed the oil yield of 317 sunflower hybrids and regressed it on three selected stress indicators. The slopes of the cold-stress reaction norms were used as plasticity phenotypes in the subsequent genome-wide association study. Among the 65,534 tested SNPs, we identified nine QTL controlling oil yield plasticity to cold stress. Associated SNPs are located in genes previously shown to be involved in cold stress responses: oligopeptide transporters, LTP, cystatin, alternative oxidase, or root development. This novel approach opens new perspectives to identify genomic regions involved in the genotype-by-environment interaction of complex traits under multiple stresses in realistic natural or agronomic conditions.
    Comment: 12 pages, 5 figures, Plant, Cell and Environment
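    The plasticity phenotype described above, the slope of a genotype's reaction norm, is just the least-squares slope of yield regressed on a stress indicator across environments. The numbers below are illustrative toy values, not the study's data.

```python
import numpy as np

def plasticity_slope(stress, oil_yield):
    """Reaction-norm slope: regress one hybrid's oil yield on a stress
    indicator across environments; the slope is the plasticity phenotype
    later tested for association with SNPs in the GWAS."""
    slope, _intercept = np.polyfit(stress, oil_yield, 1)
    return slope

# Toy example: one hybrid observed in 5 environments.
cold_stress = np.array([0.0, 0.2, 0.4, 0.6, 0.8])  # stress indicator
oil_yield   = np.array([3.0, 2.8, 2.5, 2.3, 2.0])  # yield (illustrative units)
slope = plasticity_slope(cold_stress, oil_yield)
```

    A strongly negative slope marks a hybrid whose yield degrades quickly under cold stress; repeating this per hybrid yields the vector of plasticity phenotypes fed to the association study.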