
    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the depth and breadth of computational semantic processing can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.
    Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing in the published version due to publication policies. Please contact Prof. Erik Cambria for details.
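Word sense disambiguation, the first of the surveyed tasks, can be illustrated with a minimal Lesk-style sketch: pick the sense whose dictionary gloss shares the most words with the target word's context. The two-sense inventory below is invented for illustration and is not taken from the survey.

```python
# Simplified Lesk word sense disambiguation: choose the sense whose
# gloss has the largest word overlap with the surrounding context.
# The tiny sense inventory here is a made-up example.

def lesk(context_words, senses):
    """senses: dict mapping sense id -> gloss string."""
    context = set(w.lower() for w in context_words)
    best, best_overlap = None, -1
    for sense, gloss in senses.items():
        overlap = len(context & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

senses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "sloping land beside a body of water",
}
print(lesk("she sat on the sloping land beside the water".split(), senses))
# → bank/river
```

Real systems replace the toy inventory with WordNet glosses and weight overlaps, but the core idea is the same.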

    Improving Relation Extraction From Unstructured Genealogical Texts Using Fine-Tuned Transformers

    Though exploring one’s family lineage through genealogical family trees can be insightful for developing one’s identity, this knowledge is typically held behind closed doors by private companies or requires expensive technologies, such as DNA testing, to uncover. With the explosion of data on the world wide web, many unstructured text documents, both old and new, that contain rich genealogical information are being discovered, written, and processed. Access to this immense amount of data, however, entails a costly process whereby people, typically volunteers, must read large amounts of text to find relationships between people. This delays making genealogical information open and accessible to all. This thesis explores state-of-the-art methods for relation extraction across the genealogical and biomedical domains and bridges new and old research by proposing an updated three-tier system for parsing unstructured documents. The system uses massively pre-trained transformers and fine-tuning techniques to take advantage of these deep neural models’ inherent understanding of English syntax and semantics for classification. With only a fraction of the labeled data typically needed to train large models, fine-tuning a LUKE relation classification model with minimal added features can identify genealogical relationships with macro precision, recall, and F1 scores of 0.880, 0.867, and 0.871, respectively, on data sets with scarce (∼10%) positive relations. Furthermore, with a modern coreference resolution system utilizing SpanBERT embeddings and a modern named entity parser, our end-to-end pipeline can extract and correctly classify relationships within unstructured documents with macro precision, recall, and F1 scores of 0.794, 0.616, and 0.676, respectively. This thesis also evaluates individual components of the system and discusses future improvements to be made.
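The macro-averaged scores quoted above are straightforward to compute from per-class counts: average the per-class precision, recall, and F1 rather than pooling counts. The sketch below shows the arithmetic with made-up counts; the thesis's actual per-class counts are not given here.

```python
# Macro-averaged precision/recall/F1 from per-class TP/FP/FN counts.
# The two relation classes and their counts below are invented examples.

def macro_prf(per_class):
    """per_class: list of (tp, fp, fn) tuples, one per relation class."""
    ps, rs, fs = [], [], []
    for tp, fp, fn in per_class:
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        ps.append(p); rs.append(r); fs.append(f)
    n = len(per_class)
    return sum(ps) / n, sum(rs) / n, sum(fs) / n

p, r, f = macro_prf([(8, 2, 2), (9, 1, 3)])
print(round(p, 3), round(r, 3), round(f, 3))
# → 0.85 0.775 0.809
```

Macro averaging gives each class equal weight, which matters when positive relations are scarce, as in the ∼10% setting described above.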

    Entity-centric knowledge discovery for idiosyncratic domains

    Technical and scientific knowledge is produced at an ever-accelerating pace, leading to increasing issues when trying to automatically organize or process it, e.g., when searching for relevant prior work. Knowledge can today be produced both in unstructured (plain text) and structured (metadata or linked data) forms. However, unstructured content is still the most dominant form used to represent scientific knowledge. In order to facilitate the extraction and discovery of relevant content, new automated and scalable methods for processing, structuring and organizing scientific knowledge are called for. In this context, a number of applications are emerging, ranging from Named Entity Recognition (NER) and Entity Linking tools for scientific papers to specific platforms leveraging information extraction techniques to organize scientific knowledge. In this thesis, we tackle the tasks of Entity Recognition, Disambiguation and Linking in idiosyncratic domains, with an emphasis on scientific literature. Furthermore, we study the related task of co-reference resolution with a specific focus on named entities. We start by exploring Named Entity Recognition, a task that aims to identify the boundaries of named entities in textual contents. We propose a new method to generate candidate named entities based on n-gram collocation statistics and design several entity recognition features to further classify them. In addition, we show how external knowledge bases (either domain-specific, like DBLP, or generic, like DBpedia) can be leveraged to improve the effectiveness of NER for idiosyncratic domains. Subsequently, we move to Entity Disambiguation, which is typically performed after entity recognition in order to link an entity to a knowledge base. We propose novel semi-supervised methods for word disambiguation leveraging the structure of a community-based ontology of scientific concepts.
Our approach exploits the graph structure that connects different terms and their definitions to automatically identify the correct sense originally intended by the authors of a scientific publication. We then turn to co-reference resolution, a task aiming at identifying entities that appear in various forms throughout the text. We propose an approach to type entities leveraging an inverted index built on top of a knowledge base, and to subsequently re-assign entities based on the semantic relatedness of the introduced types. Finally, we describe an application whose goal is to help researchers discover and manage scientific publications. In that context, we focus on the problem of selecting relevant tags to organize collections of research papers. We experimentally demonstrate that using a community-authored ontology together with information about the position of the concepts in the documents significantly increases the precision of tag selection over standard methods.
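Candidate generation from n-gram collocation statistics can be sketched with pointwise mutual information (PMI) over bigrams: word pairs that co-occur far more often than chance predicts are kept as entity candidates. The thresholds and the toy token sequence below are assumptions for illustration, not the thesis's actual features.

```python
import math
from collections import Counter

# Bigram collocation candidates ranked by pointwise mutual information.
# min_count filters rare bigrams, which PMI otherwise overrates.

def pmi_bigrams(tokens, min_pmi=1.0, min_count=2):
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = len(tokens), len(tokens) - 1
    candidates = []
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue  # skip bigrams seen too rarely to trust
        pmi = math.log2((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        if pmi >= min_pmi:
            candidates.append((f"{a} {b}", round(pmi, 2)))
    return sorted(candidates, key=lambda x: -x[1])

tokens = "support vector machine and support vector regression use support".split()
print(pmi_bigrams(tokens))
# → [('support vector', 1.75)]
```

A real system would run this over a large corpus and then apply the classification features described above to separate true entities from frequent but non-entity collocations.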

    The structural and content aspects of abstracts versus bodies of full text journal articles are different

    <p>Abstract</p> <p>Background</p> <p>An increase in work on the full text of journal articles and the growth of PubMed Central offer the opportunity to create a major paradigm shift in how biomedical text mining is done. However, there has until now been no comprehensive characterization of how the bodies of full-text journal articles differ from the abstracts that have been the subject of most biomedical text mining research.</p> <p>Results</p> <p>We examined the structural and linguistic aspects of abstracts and bodies of full-text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences and much heavier use of parenthesized material in the article bodies than in the abstracts. We found content differences with respect to linguistic features: three of the four linguistic features that we examined were distributed significantly differently between the two genres. We also found content differences with respect to the distribution of semantic features: there were significantly different densities per thousand words for three of four semantic classes, and clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but a wide variety of gene mention systems performed much worse on article bodies than on abstracts. POS tagging was also more accurate in abstracts than in article bodies.</p> <p>Conclusions</p> <p>Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves further into processing full-text articles.
However, these differences also present opportunities for the extraction of data types, particularly those found in parenthesized text, that are present in article bodies but not in abstracts.</p>
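The structural measures discussed above (sentence length, parenthesized material) are easy to approximate with simple tokenization. The sketch below computes both for two invented snippets; it is an illustration of the kind of measurement involved, not the paper's actual methodology.

```python
import re

# Rough structural profile of a text: average sentence length in words
# and the fraction of word tokens touching parentheses. The two sample
# texts are invented, not drawn from the study's corpus.

def structure_stats(text):
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    words = text.split()
    avg_len = sum(len(s.split()) for s in sentences) / len(sentences)
    paren_density = sum(1 for w in words if "(" in w or ")" in w) / len(words)
    return round(avg_len, 1), round(paren_density, 3)

abstract = "We studied gene expression. Results were significant."
body = "Expression of BRCA1 (breast cancer 1) was measured (n = 12) in all samples."
print(structure_stats(abstract))  # → (3.5, 0.0)
print(structure_stats(body))      # → (14.0, 0.286)
```

Even this crude measurement reproduces the pattern the paper reports: article-body prose runs longer and leans far more on parenthesized material than abstract prose.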

    The RareDis corpus: A corpus annotated with rare diseases, their signs and symptoms

    Rare diseases affect a small number of people compared to the general population. However, more than 6,000 different rare diseases exist and, in total, they affect more than 300 million people worldwide. Rare diseases share, as central problems, delayed diagnosis and the sparse information available to researchers, clinicians, and patients. Finding a diagnosis can be a very long and frustrating experience for patients and their families; the average diagnostic delay is between 6 and 8 years. Many of these diseases have different manifestations among patients, which further hampers their detection and the correct treatment choice. Therefore, there is an urgent need to increase the scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) can help to extract relevant information about rare diseases to facilitate their diagnosis and treatment, but most NLP techniques require manually annotated corpora. Our goal is therefore to create a gold-standard corpus annotated with rare diseases and their clinical manifestations. It can be used to train and test NLP approaches, and the information extracted through NLP could enrich the knowledge of rare diseases and thereby help to reduce the diagnostic delay and improve the treatment of rare diseases. The paper describes the selection of 1,041 texts to be included in the corpus, the annotation process and the annotation guidelines. The entities (disease, rare disease, symptom, sign and anaphor) and the relationships (produces, is a, is acron, is synon, increases risk of, anaphora) were annotated. The RareDis corpus contains annotations for more than 5,000 rare diseases and almost 6,000 clinical manifestations. Moreover, the Inter-Annotator Agreement evaluation shows relatively high agreement (F1-measure of 83.5% under exact-match criteria for the entities and 81.3% for the relations).
Based on these results, the corpus is of high quality, representing a significant step for the field, since available corpora annotated with rare diseases are scarce. It could open the door to further NLP applications that would facilitate the diagnosis and treatment of these rare diseases and, therefore, dramatically improve the quality of life of these patients. This work was supported by the Madrid Government (Comunidad de Madrid) under the Multiannual Agreement with UC3M in the line of "Fostering Young Doctors Research" (NLP4RARE-CM-UC3M) and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation); the Multiannual Agreement with UC3M in the line of "Excellence of University Professors" (EPUC3M17); and a grant from the Spanish Ministry of Economy and Competitiveness (SAF2017-86810-R).
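Exact-match agreement of the kind reported above can be computed by treating each annotation as a (start, end, label) triple and counting an entity as agreed only when both annotators produced the identical triple. The spans and labels below are invented examples written in the corpus's label style.

```python
# Inter-annotator F1 under exact-match criteria: one annotator's set is
# treated as prediction, the other's as reference. The annotations
# below are invented; they are not from the RareDis corpus.

def exact_match_f1(ann_a, ann_b):
    """Each annotation set: {(start, end, label), ...}."""
    agreed = len(ann_a & ann_b)
    precision = agreed / len(ann_a) if ann_a else 0.0
    recall = agreed / len(ann_b) if ann_b else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

a = {(0, 12, "RAREDISEASE"), (20, 28, "SIGN"), (40, 52, "SYMPTOM")}
b = {(0, 12, "RAREDISEASE"), (20, 28, "SYMPTOM"), (40, 52, "SYMPTOM")}
print(round(exact_match_f1(a, b), 3))
# → 0.667
```

Note that a span marked by both annotators but with different labels (the SIGN/SYMPTOM disagreement above) counts as a full miss under exact match, which is why this criterion is stricter than span-only agreement.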

    Application of information extraction techniques to the pharmacological domain: extracting drug-drug interactions

    A drug-drug interaction occurs when one drug influences the level or activity of another drug. The detection of drug interactions is an important research area in patient safety, since these interactions can become very dangerous and increase health care costs. Although there are different databases supporting health care professionals in the detection of drug interactions, this kind of resource is rarely complete. Drug interactions are frequently reported in journals of clinical pharmacology, making the medical literature the most effective source for their detection. However, the increasing volume of the literature overwhelms health care professionals trying to keep an up-to-date collection of all reported drug-drug interactions. The development of automatic methods for collecting, maintaining and interpreting this information is crucial for achieving a real improvement in their early detection. Information Extraction (IE) techniques can provide an interesting way of reducing the time spent by health care professionals on reviewing the literature. Nevertheless, no previous approach had been carried out to extract drug-drug interactions from biomedical texts. In this thesis, we have conducted a detailed study of various IE techniques applied to the biomedical domain. Based on this study, we have proposed two different approaches for the extraction of drug-drug interactions from texts.
The first approach is hybrid, combining shallow parsing and pattern matching to extract relations between drugs from biomedical texts. The second is based on supervised machine learning, in particular kernel methods. In addition, we have created and annotated the first corpus, DrugDDI, annotated with drug-drug interactions, which allows us to evaluate and compare both approaches. To the best of our knowledge, the DrugDDI corpus is the only available corpus annotated for drug-drug interactions, and this thesis is the first work to address the problem of extracting drug-drug interactions from biomedical texts. We believe the DrugDDI corpus is an important contribution because it could encourage other research groups to investigate this problem. We have also defined three auxiliary processes that provide crucial information used by the aforementioned approaches: (1) a text analysis process based on the UMLS MetaMap Transfer tool (MMTx) that provides shallow syntactic and semantic information from texts, (2) a process for drug name recognition and classification, and (3) a process for drug anaphora resolution. Finally, we have developed a pipeline prototype which integrates the different auxiliary processes. The pipeline architecture allows us to easily combine these modules with each of the approaches proposed in this thesis: pattern matching or kernels. Several experiments were performed on the DrugDDI corpus. They show that while the pattern-matching approach achieves low performance, the kernel-based approach achieves performance comparable to that obtained by approaches to similar tasks, such as the extraction of protein-protein interactions.
This work has been partially supported by the Spanish research projects: the MAVIR consortium (S-0505/TIC-0267, www.mavir.net), a network of excellence funded by the Madrid Regional Government, and TIN2007-67407-C03-01 (BRAVO: Advanced Multimodal and Multilingual Question Answering).
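The pattern-based approach can be sketched as a small set of lexical patterns applied to sentences. The trigger phrases and the example sentence below are invented and far simpler than the pharmacist-defined patterns used in the thesis; they only illustrate the mechanism.

```python
import re

# A minimal lexical-pattern sketch for drug-drug interaction extraction:
# match "DRUG <trigger phrase> DRUG" with a handful of invented triggers.

PATTERN = re.compile(
    r"(?P<d1>\w+)\s+(?:interacts with|inhibits|increases the effect of)\s+(?P<d2>\w+)",
    re.IGNORECASE,
)

def extract_interactions(sentence):
    """Return (drug1, drug2) pairs matched by the trigger patterns."""
    return [(m.group("d1"), m.group("d2")) for m in PATTERN.finditer(sentence)]

print(extract_interactions(
    "Warfarin interacts with aspirin, and ketoconazole inhibits simvastatin."
))
# → [('Warfarin', 'aspirin'), ('ketoconazole', 'simvastatin')]
```

Such patterns are precise but brittle, which is consistent with the thesis's finding that the pattern-matching approach underperforms the kernel-based one on varied sentence structures.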

    Event extraction of bacteria biotopes: a knowledge-intensive NLP-based approach

    Background: Bacteria biotopes cover a wide range of diverse habitats, including animal and plant hosts and natural, medical and industrial environments. The high volume of publications in the microbiology domain provides a rich source of up-to-date information on bacteria biotopes. This information, as found in scientific articles, is expressed in natural language and is rarely available in a structured format, such as a database. It is of great importance for fundamental research and microbiology applications (e.g., medicine, agronomy, food, bioenergy). The automatic extraction of this information from texts will provide a great benefit to the field.