614 research outputs found

    Automatic keyword assignment system for medical research articles using nearest-neighbor searches

    Get PDF
    Assigning accurate keywords to research articles is increasingly important concern. Keywords should be selected meticulously to describe the article well since keywords play an important role in matching readers with research articles in order to reach a bigger audience. So, improper selection of keywords may result in less attraction to readers which results in degradation in its audience. Hence, we designed and developed an automatic keyword assignment system (AKAS) for research articles based on k-nearest neighbor (k-NN) and threshold-nearest neighbor (t-NN) accompanied with information retrieval systems (IRS), which is a corpus-based method by utilizing IRS using the Medline dataset in PubMed. First, AKAS accepts an abstract of the research article or a particular text as a query to the IRS. Next, the IRS returns a ranked list of articles to the given query. Then, we selected a set of documents from this list using two different methods, which are k-NN and t-NN representing the first k documents and documents whose similarity is greater than the threshold value of t, respectively. To evaluate our proposed system, we conducted a set of experiments on a selected subset of 458,594 PubMed articles. Then, we performed an experiment to observe the performance of AKAS results by comparing with the original keywords assigned by authors. The results we obtained showed that our system suggests keywords more than 55% match in terms of F-score. We presented both methods we used and results of experiments, in detail

    Using ontology in query answering systems: Scenarios, requirements and challenges

    Get PDF
    Equipped with the ultimate query answering system, computers would finally be in a position to address all our information needs in a natural way. In this paper, we describe how Language and Computing nv (L&C), a developer of ontology-based natural language understanding systems for the healthcare domain, is working towards the ultimate Question Answering (QA) System for healthcare workers. L&C’s company strategy in this area is to design in a step-by-step fashion the essential components of such a system, each component being designed to solve some one part of the total problem and at the same time reflect well-defined needs on the prat of our customers. We compare our strategy with the research roadmap proposed by the Question Answering Committee of the National Institute of Standards and Technology (NIST), paying special attention to the role of ontology

    An Ontology-based Two-Stage Approach to Medical Text Classification with Feature Selection by Particle Swarm Optimisation

    Full text link
    © 2019 IEEE. Document classification (DC) is the task of assigning pre-defined labels to unseen documents by utilizing a model trained on the available labeled documents. DC has attracted much attention in medical fields recently because many issues can be formulated as a classification problem. It can assist doctors in decision making and correct decisions can reduce the medical expenses. Medical documents have special attributes that distinguish them from other texts and make them difficult to analyze. For example, many acronyms and abbreviations, and short expressions make it more challenging to extract information. The classification accuracy of the current medical DC methods is not satisfactory. The goal of this work is to enhance the input feature sets of the DC method to improve the accuracy. To approach this goal, a novel two-stage approach is proposed. In the first stage, a domain-specific dictionary, namely the Unified Medical Language System (UMLS), is employed to extract the key features belonging to the most relevant concepts such as diseases or symptoms. In the second stage, PSO is applied to select more related features from the extracted features in the first stage. The performance of the proposed approach is evaluated on the 2010 Informatics for Integrating Biology and the Bedside (i2b2) data set which is a widely used medical text dataset. The experimental results show substantial improvement by the proposed method on the accuracy of classification

    Fusion architectures for automatic subject indexing under concept drift:Analysis and empirical results on short texts

    Get PDF
    Indexing documents with controlled vocabularies enables a wealth of semantic applications for digital libraries. Due to the rapid growth of scientific publications, machine learning-based methods are required that assign subject descriptors automatically. While stability of generative processes behind the underlying data is often assumed tacitly, it is being violated in practice. Addressing this problem, this article studies explicit and implicit concept drift, that is, settings with new descriptor terms and new types of documents, respectively. First, the existence of concept drift in automatic subject indexing is discussed in detail and demonstrated by example. Subsequently, architectures for automatic indexing are analyzed in this regard, highlighting individual strengths and weaknesses. The results of the theoretical analysis justify research on fusion of different indexing approaches with special consideration on information sharing among descriptors. Experimental results on titles and author keywords in the domain of economics underline the relevance of the fusion methodology, especially under concept drift. Fusion approaches outperformed non-fusion strategies on the tested data sets, which comprised shifts in priors of descriptors as well as covariates. These findings can help researchers and practitioners in digital libraries to choose appropriate methods for automatic subject indexing, as is finally shown by a recent case study

    Multi-Instance Multi-Label Learning

    Get PDF
    In this paper, we propose the MIML (Multi-Instance Multi-Label learning) framework where an example is described by multiple instances and associated with multiple class labels. Compared to traditional learning frameworks, the MIML framework is more convenient and natural for representing complicated objects which have multiple semantic meanings. To learn from MIML examples, we propose the MimlBoost and MimlSvm algorithms based on a simple degeneration strategy, and experiments show that solving problems involving complicated objects with multiple semantic meanings in the MIML framework can lead to good performance. Considering that the degeneration process may lose information, we propose the D-MimlSvm algorithm which tackles MIML problems directly in a regularization framework. Moreover, we show that even when we do not have access to the real objects and thus cannot capture more information from real objects by using the MIML representation, MIML is still useful. We propose the InsDif and SubCod algorithms. InsDif works by transforming single-instances into the MIML representation for learning, while SubCod works by transforming single-label examples into the MIML representation for learning. Experiments show that in some tasks they are able to achieve better performance than learning the single-instances or single-label examples directly.Comment: 64 pages, 10 figures; Artificial Intelligence, 201

    MeSH Up: effective MeSH text classification for improved document retrieval

    Get PDF
    Motivation: Controlled vocabularies such as the Medical Subject Headings (MeSH) thesaurus and the Gene Ontology (GO) provide an efficient way of accessing and organizing biomedical information by reducing the ambiguity inherent to free-text data. Different methods of automating the assignment of MeSH concepts have been proposed to replace manual annotation, but they are either limited to a small subset of MeSH or have only been compared with a limited number of other systems

    Learning structure and schemas from heterogeneous domains in networked systems: a survey

    Get PDF
    The rapidly growing amount of available digital documents of various formats and the possibility to access these through internet-based technologies in distributed environments, have led to the necessity to develop solid methods to properly organize and structure documents in large digital libraries and repositories. Specifically, the extremely large size of document collections make it impossible to manually organize such documents. Additionally, most of the document sexist in an unstructured form and do not follow any schemas. Therefore, research efforts in this direction are being dedicated to automatically infer structure and schemas. This is essential in order to better organize huge collections as well as to effectively and efficiently retrieve documents in heterogeneous domains in networked system. This paper presents a survey of the state-of-the-art methods for inferring structure from documents and schemas in networked environments. The survey is organized around the most important application domains, namely, bio-informatics, sensor networks, social networks, P2Psystems, automation and control, transportation and privacy preserving for which we analyze the recent developments on dealing with unstructured data in such domains.Peer ReviewedPostprint (published version

    BIOMEDICAL WORD SENSE DISAMBIGUATION WITH NEURAL WORD AND CONCEPT EMBEDDINGS

    Get PDF
    Addressing ambiguity issues is an important step in natural language processing (NLP) pipelines designed for information extraction and knowledge discovery. This problem is also common in biomedicine where NLP applications have become indispensable to exploit latent information from biomedical literature and clinical narratives from electronic medical records. In this thesis, we propose an ensemble model that employs recent advances in neural word embeddings along with knowledge based approaches to build a biomedical word sense disambiguation (WSD) system. Specifically, our system identities the correct sense from a given set of candidates for each ambiguous word when presented in its context (surrounding words). We use the MSH WSD dataset, a well known public dataset consisting of 203 ambiguous terms each with nearly 200 different instances and an average of two candidate senses represented by concepts in the unified medical language system (UMLS). We employ a popular biomedical concept, Our linear time (in terms of number of senses and context length) unsupervised and knowledge based approach improves over the state-of-the-art methods by over 3% in accuracy. A more expensive approach based on the k-nearest neighbor framework improves over prior best results by 5% in accuracy. Our results demonstrate that recent advances in neural dense word vector representations offer excellent potential for solving biomedical WSD
    corecore