
    Leveraging syntactic and semantic graph kernels to extract pharmacokinetic drug-drug interactions from biomedical literature

    BACKGROUND: Information about drug-drug interactions (DDIs) supported by scientific evidence is crucial for establishing computational knowledge bases for applications such as pharmacovigilance. Since new reports of DDIs are rapidly accumulating in the scientific literature, text-mining techniques for automatic DDI extraction are critical. We propose a novel approach for automated pharmacokinetic (PK) DDI detection that incorporates syntactic and semantic information into graph kernels, to address the sparseness problem associated with syntactic-structural approaches. First, we used a novel all-path graph kernel over a shallow semantic representation of sentences. Next, we statistically integrated fine-grained semantic classes into the dependency and shallow semantic graphs. RESULTS: When evaluated on the PK DDI corpus, our approach significantly outperformed the original all-path graph kernel based on dependency structure. The system that combined the dependency graph kernel with semantic classes achieved the best F-scores, 81.94% for in vivo PK DDIs and 69.34% for in vitro PK DDIs. Further, combining the shallow semantic graph kernel with semantic classes achieved the highest precisions, 84.88% for in vivo PK DDIs and 74.83% for in vitro PK DDIs. CONCLUSIONS: We presented a graph-kernel-based approach that combines syntactic and semantic information for extracting pharmacokinetic DDIs from the biomedical literature. Experimental results showed that the proposed approach extracts PK DDIs from the literature effectively and significantly improves on the original all-path graph kernel based on dependency structure.
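
    As an aside for readers unfamiliar with the technique, the following Python sketch illustrates the general all-path graph kernel idea the abstract builds on: path weights of every length are accumulated through a Neumann series over a weighted adjacency matrix, and two labelled sentence graphs are compared by summing path weights between label-matched vertex pairs. The decay factor, the toy graphs and the exact-label match are illustrative assumptions, not the authors' implementation.

    import numpy as np

    def all_path_weights(adj, decay=0.5):
        # Sum of weighted paths of every length: decay*A + (decay*A)^2 + ...,
        # computed via the Neumann series (I - decay*A)^-1 - I, which
        # converges when decay * spectral_radius(adj) < 1.
        n = adj.shape[0]
        return np.linalg.inv(np.eye(n) - decay * adj) - np.eye(n)

    def all_path_kernel(adj1, labels1, adj2, labels2, decay=0.5):
        # Compare two labelled sentence graphs by summing path weights
        # between every pair of vertices whose labels match across graphs.
        P1 = all_path_weights(np.asarray(adj1, dtype=float), decay)
        P2 = all_path_weights(np.asarray(adj2, dtype=float), decay)
        M = (np.array(labels1)[:, None] == np.array(labels2)[None, :]).astype(float)
        return float(np.sum((M.T @ P1 @ M) * P2))

    # Toy dependency chains for "DRUG1 inhibits DRUG2" / "DRUG1 increases DRUG2".
    a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    print(all_path_kernel(a, ["DRUG", "verb", "DRUG"], a, ["DRUG", "verb", "DRUG"]))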

    Toward the definition of new controlled natural languages in order to prevent risks related to language use

    The content of this paper is based on several studies carried out over the last decade at the CLLE-ERSS lab on controlled natural languages (CNLs). The main finding is that CNLs are not always well adapted and usable. Moreover, the actual impact of their deployment on readability has rarely been evaluated. The paper gives an overview of the problems associated with CNLs and proposes new directions for designing them. To this end, NLP and psycholinguistic methods could be used to improve existing CNLs or to propose new ones.

    A corpus-based approach to ESP: EST vocabulary in information technology

    Master's dissertation - Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Letras/Inglês e Literatura Correspondente. The aim of this research is to contribute some ideas to the investigation of vocabulary instruction. In the 1990s, Applied Linguistics emphasized the importance of vocabulary teaching, since grammar and lexis came to be understood as inseparable concerns. Nowadays, specialized vocabulary is studied as a sublanguage, English for Science and Technology (EST). This sublanguage, which is present in technical contexts, is part of the knowledge of students who study English for Specific or Academic Purposes (ESP or EAP) while enhancing their skills. In TEFL, the tradition was to teach grammar and language structures. Typical pedagogical materials for ESP present some limitations: (i) they use unauthentic texts; and (ii) they are not constantly updated, since they are printed and still based on grammar. In an attempt to help remedy this situation, new methodologies are emerging in Corpus Linguistics (CL). Three approaches that make use of CL methodology (the Lexical Syllabus, the Lexical Approach, and Data-Driven Learning) were applied in one specific, unexplored area: EST vocabulary. In this context, the following research questions are addressed: (i) How does EST vocabulary behave in IT texts from Linux guides? (ii) What are the advantages of using CL methodology to produce ESP materials in information technology? (iii) How do learners respond to corpus-based activities designed to focus on vocabulary instruction? The research methodology includes corpus compilation, the study of language patterns retrieved with WordSmith Tools (Scott, 1996), the development of pedagogical material supported by corpus-based approaches, and the application and analysis of the developed material. The most important results pointed out that: (i) lexical combinations are context dependent; (ii) satisfactory material was designed for teachers and learners; and (iii) corpus-based exercises assumed a different position in the classroom.
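
    As a rough, hypothetical stand-in for the kind of keyword analysis the dissertation performs with WordSmith Tools, the Python sketch below ranks words of a small study corpus against a reference corpus by log-likelihood keyness; the tokenisation, the statistic and the toy corpora are simplifying assumptions rather than the author's procedure.

    import math
    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    def keyness(study_text, reference_text, top=10):
        # Dunning-style log-likelihood keyness of each study-corpus word
        # relative to a reference corpus (zero reference counts are handled).
        s, r = Counter(tokenize(study_text)), Counter(tokenize(reference_text))
        ns, nr = sum(s.values()), sum(r.values())
        scores = {}
        for word, a in s.items():
            b = r.get(word, 0)
            e1 = ns * (a + b) / (ns + nr)
            e2 = nr * (a + b) / (ns + nr)
            scores[word] = 2 * (a * math.log(a / e1)
                                + (b * math.log(b / e2) if b else 0.0))
        return sorted(scores.items(), key=lambda kv: -kv[1])[:top]

    # Invented mini-corpora: a Linux guide fragment vs. general English.
    print(keyness("the kernel loads the module and mounts the filesystem",
                  "the cat sat on the mat and the dog slept on the rug"))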

    Foundation, Implementation and Evaluation of the MorphoSaurus System: Subword Indexing, Lexical Learning and Word Sense Disambiguation for Medical Cross-Language Information Retrieval

    In everyday medical practice, which involves a great deal of documentation and search work, the majority of textually encoded information is now available electronically. This makes the development of powerful methods for efficient retrieval a top priority. Judged from the perspective of medical sublanguage, common text retrieval systems lack morphological functionality (inflection, derivation and compounding), lexical-semantic functionality, and the ability to analyse large document collections across languages. This doctoral thesis covers the theoretical foundations of the MorphoSaurus system (an acronym for morpheme thesaurus). Its methodological core is a thesaurus organised around morphemes of medical expert and lay language, whose entries are linked across languages by semantic relations. Building on this, a procedure is presented that segments (complex) words into morphemes, which are then replaced by language-independent, concept-class-like symbols. The resulting representation is the basis for cross-language, morpheme-oriented text retrieval. In addition to this core technology, a method for the automatic acquisition of lexicon entries is presented, by which existing morpheme lexicons are extended to further languages. The consideration of cross-language phenomena then leads to a novel procedure for resolving semantic ambiguities. The performance of morpheme-oriented text retrieval is tested empirically in extensive, standardised evaluations and compared against established approaches.
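
    A minimal Python sketch of the subword idea described above, using invented lexicon entries and identifier names: compound medical words are segmented by greedy longest match against a small morpheme lexicon whose entries map to language-independent concept identifiers, so that German and English variants reduce to the same codes.

    def segment(word, lexicon, min_len=3):
        # Greedy longest-match segmentation into known subwords; each matched
        # subword is replaced by its language-independent concept identifier.
        word = word.lower()
        codes, i = [], 0
        while i < len(word):
            for j in range(len(word), i + min_len - 1, -1):
                if word[i:j] in lexicon:
                    codes.append(lexicon[word[i:j]])
                    i = j
                    break
            else:
                i += 1  # skip unrecognised characters (e.g. linking elements)
        return codes

    # Invented entries: German and Greek/Latin subwords mapped to shared codes.
    lexicon = {"nieren": "#kidney", "nephr": "#kidney",
               "entzuendung": "#inflammation", "itis": "#inflammation"}
    print(segment("Nierenentzuendung", lexicon))  # ['#kidney', '#inflammation']
    print(segment("nephritis", lexicon))          # ['#kidney', '#inflammation']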

    Adaptive Semantic Annotation of Entity and Concept Mentions in Text

    Recent years have seen an increase in interest in knowledge repositories that are useful across applications, in contrast to the creation of ad hoc or application-specific databases. These knowledge repositories figure as a central provider of unambiguous identifiers and semantic relationships between entities. As such, these shared entity descriptions serve as a common vocabulary to exchange and organize information in different formats and for different purposes. There has therefore been remarkable interest in systems that are able to automatically tag textual documents with identifiers from shared knowledge repositories, so that the content of those documents is described in a vocabulary that is unambiguously understood across applications. Tagging textual documents according to these knowledge bases is a challenging task. It involves recognizing the entities and concepts mentioned in a particular passage and attempting to resolve the ambiguity of language in order to choose one of many possible meanings for a phrase. There has been substantial work on recognizing and disambiguating entities for specialized applications, or constrained to limited entity types and particular types of text. In the context of shared knowledge bases, since each application has potentially very different needs, systems must have unprecedented breadth and flexibility to ensure their usefulness across applications. Documents may exhibit different language and discourse characteristics, discuss very diverse topics, or require focus on parts of the knowledge repository that are inherently harder to disambiguate. In practice, for developers looking for a system to support their use case, it is often unclear whether an existing solution is applicable, leading those developers to trial and error and to ad hoc use of multiple systems in an attempt to achieve their objective. In this dissertation, I propose a conceptual model that unifies related techniques in this space under a common multi-dimensional framework, enabling the elucidation of the strengths and limitations of each technique and supporting developers in their search for a suitable tool for their needs. Moreover, the model serves as the basis for the development of flexible systems that can support document tagging for different use cases. I describe such an implementation, DBpedia Spotlight, along with extensions we made to the DBpedia knowledge base to support it. I report evaluations of this tool on several well-known data sets and demonstrate applications to diverse use cases for further validation.
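
    For orientation, a minimal Python sketch of annotating text through a DBpedia Spotlight web service follows; the public demo endpoint, its availability, the confidence value and the response fields used here are assumptions based on the service's documented JSON interface, and a local deployment would be queried the same way.

    import requests

    def annotate(text, confidence=0.4,
                 endpoint="https://api.dbpedia-spotlight.org/en/annotate"):
        # Ask the Spotlight service to spot and disambiguate mentions; each
        # returned resource pairs a surface form with a DBpedia URI.
        resp = requests.get(endpoint,
                            params={"text": text, "confidence": confidence},
                            headers={"Accept": "application/json"})
        resp.raise_for_status()
        return [(r["@surfaceForm"], r["@URI"])
                for r in resp.json().get("Resources", [])]

    print(annotate("Berlin was the capital of the Kingdom of Prussia."))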

    Toward knowledge-based automatic 3D spatial topological modeling from LiDAR point clouds for urban areas

    The processing of very large LiDAR data sets is costly and calls for automated 3D modeling approaches. In addition, incomplete point clouds caused by occlusion and uneven density, together with the uncertainties introduced while processing LiDAR data, make the automatic creation of semantically enriched 3D models difficult. This research work aims at developing new solutions for the automatic creation of complete 3D geometric models with semantic labels from incomplete point clouds. A framework integrating knowledge about objects in urban scenes into 3D modeling is proposed to improve the completeness of 3D geometric models, using qualitative reasoning based on semantic information about objects and their components and on their geometric and spatial relations. Moreover, we aim at taking advantage of this qualitative knowledge of objects in automatic feature recognition and, further, in the creation of complete 3D geometric models from incomplete point clouds. To achieve this goal, several algorithms are proposed for automatic segmentation, the identification of topological relations between object components, feature recognition, and the creation of complete 3D geometric models. (1) Machine learning solutions were proposed for automatic semantic segmentation and CAD-like segmentation, in order to segment objects with complex structures. (2) We proposed an algorithm to efficiently identify topological relationships between object components extracted from point clouds in order to assemble a Boundary Representation model. (3) The integration of object knowledge and feature recognition was developed to automatically obtain semantic labels for objects and their components. To deal with uncertain information, a rule-based automatic uncertain-reasoning solution was developed to recognize building components from the uncertain information extracted from point clouds. (4) A heuristic method for creating complete 3D geometric models was designed using building knowledge, the geometric and topological relations of building components, and the semantic information obtained from feature recognition. Finally, the proposed framework for improving automatic 3D modeling from point clouds of urban areas was validated by a case study aimed at creating a complete 3D building model. The experiments demonstrate that integrating knowledge into the steps of 3D modeling is effective for creating a complete building model from incomplete point clouds.
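
    As a heavily simplified, hypothetical illustration of one step mentioned above (identifying topological relations between components extracted from point clouds), the Python sketch below flags two segmented components as adjacent when their point sets come within a distance tolerance; the thesis itself relies on much richer geometric and qualitative reasoning.

    import numpy as np

    def are_adjacent(points_a, points_b, tol=0.05):
        # Treat two segmented components as touching when the minimum
        # distance between their point sets falls below `tol` (metres).
        d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=-1)
        return bool(d.min() < tol)

    # Synthetic points: the top of a wall segment meets a roof edge.
    wall = np.array([[0.0, 0.0, z] for z in np.linspace(0.0, 3.0, 50)])
    roof = np.array([[x, 0.0, 3.01] for x in np.linspace(0.0, 5.0, 50)])
    print(are_adjacent(wall, roof))  # True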

    Tibetan Buddhist English: a corpus approach to the Tibetan Buddhist genre of shastra within the Kagyu Shedra curriculum

    Against the backdrop of the argument that Buddhist English is incomprehensible to non-specialist audiences, owing to the high frequency of Sanskrit loanwords and unexplained terminology, and of a general lack of data-driven, empirical research on the use of Buddhist English beyond Buddhology and translation studies, this thesis investigates the following research questions: (1) What are pervasive linguistic features of the genre shastra in Tibetan Buddhist English? (2) Based on question 1, what are the characteristics of such linguistic features? (3) What is the link between such linguistic features and the situational context of Tibetan Buddhist shastras? (4) How do the linguistic features of Tibetan Buddhist shastras compare to other written registers? Compilation and frequency-based analysis of a small specialised corpus of Tibetan Buddhist shastras (commentaries) identified four typical linguistic features: lexical closure, low type-token ratio (TTR), frequent use of the indefinite pronoun one, and frequent use of Sanskrit loanwords. The analysis followed Biber and Conrad's (2013) framework for register analysis, comprising situational, linguistic and functional analyses. Lexical closure properties in the corpus provided a reliability measure for the findings of the study. Together with a low frequency of personal pronouns and a high frequency of the generic pronoun one, they aligned with characteristics of general and academic written registers. Existing characterisations of written registers were challenged for their dissociation of high TTR and of the specific pronoun one, which in Buddhist English were found to be features of the written register, indicative of the frequent repetition of titles and headings and of frequent anaphoric referencing in aid of the Buddhist practice of memorisation. The high frequency of loanwords supported the claim that Buddhist language is incomprehensible to a non-specialist audience, yet the relationship between the situational and linguistic analyses indicated that such shortcomings of Buddhist English are mitigated through the common Buddhist practice of textual study as part of so-called "Shedras in the West". Contributions include the provision of empirical data on the under-investigated register of Buddhist English shastras and additions to the register classifications of written and academic registers. Methodological contributions were made through the first corpus-based study of Buddhist English, thereby testing the validity of established corpus approaches in a small, specialised context. Theoretical contributions include an evaluation of Biber's multidimensional analysis framework (1988, 2007), calling for an extension of the existing frameworks to account for the deviations found in the Buddhist English register shastra. Furthermore, the study provides a template for the calculation of lexical closure as a measure of representativeness in small corpora. Additional contributions are made by illustrating the pedagogic application of corpus data in the classroom by means of sample classroom tasks.
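
    For readers unfamiliar with the measures, the Python sketch below shows the raw type-token ratio and a windowed (standardised) variant that makes texts of different lengths comparable; the tokenisation and window size are assumptions, not the thesis's exact procedure.

    import re

    def tokens(text):
        return re.findall(r"[a-z']+", text.lower())

    def type_token_ratio(text):
        # Distinct word forms divided by the total number of tokens.
        toks = tokens(text)
        return len(set(toks)) / len(toks) if toks else 0.0

    def standardised_ttr(text, window=1000):
        # Mean TTR over consecutive full windows; raw TTR drops as texts get
        # longer, so a windowed mean is the usual basis for comparison.
        toks = tokens(text)
        full = [toks[i:i + window] for i in range(0, len(toks), window)
                if len(toks[i:i + window]) == window]
        if not full:
            return type_token_ratio(text)
        return sum(len(set(c)) / window for c in full) / len(full)

    print(type_token_ratio("one who trains the mind sees what one truly is"))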

    Development of a text mining approach to disease network discovery

    Scientific literature is one of the major sources of knowledge for systems biology, in the form of papers, patents and other types of written reports. Text mining methods aim at automatically extracting relevant information from the literature. The hypothesis of this thesis was that biological systems could be elucidated by the development of text mining solutions that automatically extract relevant information from documents. The first objective consisted in developing software components to recognize biomedical entities in text, which is the first step in generating a network about a biological system. To this end, a machine learning solution was developed that can be trained for specific biological entities using an annotated dataset, obtaining high-quality results. Additionally, a rule-based solution was developed that can be easily adapted to various types of entities. The second objective consisted in developing an automatic approach to link the recognized entities to a reference knowledge base. A solution based on the PageRank algorithm was developed in order to match the entities to the concepts that contribute most to the overall coherence. The third objective consisted in automatically extracting relations between entities in order to generate knowledge graphs about biological systems. Due to the lack of annotated datasets available for this task, distant supervision was employed to train a relation classifier on a corpus of documents and a knowledge base. The applicability of this approach was demonstrated in two case studies: microRNA-gene relations for cystic fibrosis, obtaining a network of 27 relations from the abstracts of 51 recently published papers; and cell-cytokine relations for tolerogenic cell therapies, obtaining a network of 647 relations from 3264 abstracts. Through a manual evaluation, the information contained in these networks was determined to be relevant. Additionally, a solution combining deep learning techniques with ontology information was developed, to take advantage of the domain knowledge provided by ontologies. This thesis contributed several solutions that demonstrate the usefulness of text mining methods to systems biology by extracting domain-specific information from the literature. These solutions make it easier to integrate various areas of research, leading to a better understanding of biological systems.
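
    A minimal Python sketch of the distant-supervision step described above, assuming entity mentions have already been recognised per sentence: co-occurring (miRNA, gene) pairs are labelled positive when they appear in a knowledge base and negative otherwise, yielding weakly labelled training examples for a relation classifier. The knowledge-base pairs and the sentence tuple format are invented for illustration.

    from itertools import product

    # Invented knowledge-base pairs (miRNA, gene) used as weak supervision.
    KNOWN_PAIRS = {("mir-509", "CFTR"), ("mir-101", "CFTR")}

    def label_examples(sentences):
        # Each sentence comes with the miRNA and gene mentions recognised in
        # it; co-occurring pairs found in the knowledge base become positive
        # training examples, the remaining co-occurrences become negatives.
        examples = []
        for text, mirnas, genes in sentences:
            for m, g in product(mirnas, genes):
                label = 1 if (m.lower(), g) in KNOWN_PAIRS else 0
                examples.append((text, m, g, label))
        return examples

    sentences = [("miR-509 represses CFTR expression in airway cells.",
                  ["miR-509"], ["CFTR"])]
    print(label_examples(sentences))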