313 research outputs found

    mARC: Memory by Association and Reinforcement of Contexts

    This paper introduces Memory by Association and Reinforcement of Contexts (mARC). mARC is a novel data modeling technology rooted in the second quantization formulation of quantum mechanics. It is an all-purpose incremental and unsupervised data storage and retrieval system which can be applied to all types of signal or data, structured or unstructured, textual or not. mARC can be applied to a wide range of information classification and retrieval problems such as e-Discovery or contextual navigation. It can also be formulated in the artificial life framework, a.k.a. Conway's "Game of Life" theory. In contrast to Conway's approach, the objects evolve in a massively multidimensional space. In order to start evaluating the potential of mARC, we have built a mARC-based Internet search engine demonstrator with contextual functionality. We compare the behavior of the mARC demonstrator with Google search both in terms of performance and relevance. In the study we find that the mARC search engine demonstrator outperforms Google search by an order of magnitude in response time while providing more relevant results for some classes of queries.

    Unsupervised Biomedical Named Entity Recognition

    Named entity recognition (NER) from text is an important task for several applications, including in the biomedical domain. Supervised machine learning based systems have been the most successful on the NER task; however, they require large quantities of correct annotations for training. Annotating text manually is very labor intensive and also needs domain expertise. The purpose of this research is to reduce human annotation effort and to decrease the cost of annotation for building NER systems in the biomedical domain. The method developed in this work is based on leveraging the availability of resources such as the UMLS (Unified Medical Language System), which contains a list of biomedical entities, together with a large unannotated corpus, to build an unsupervised NER system that does not require any manual annotations. The method we developed in this research has two phases. In the first phase, a biomedical corpus is automatically annotated with some named entities using UMLS through unambiguous exact matching; we call this weakly-labeled data. In this data, positive examples are the entities in the text that exactly match in UMLS and have only one semantic type, which belongs to the desired entity class to be extracted (for example, diseases and disorders). Negative examples are the entities in the text that exactly match in UMLS but are of semantic types other than those that belong to the desired entity class. These examples are then used to train a machine learning classifier using features that represent the contexts in which they appeared in the text. The trained classifier is applied back to the text to gather more examples iteratively through the process of self-training. The trained classifier is then capable of classifying mentions in unseen text as belonging to the desired entity class or not, based on the contexts in which they appear. Although the trained named entity detector is good at detecting the presence of entities of the desired class in text, it cannot determine their correct boundaries. In the second phase of our method, called "Boundary Expansion", the correct boundaries of the entities are determined. This method is based on a novel idea that utilizes machine learning and UMLS. Training examples for boundary expansion are gathered directly from UMLS and do not require any manual annotations. We also developed a new WordNet based approach for boundary expansion. Our method was evaluated on three datasets: the SemEval 2014 Task 7 dataset, which has diseases and disorders as the desired entity class; the GENIA dataset, which has proteins, DNAs, RNAs, cell types, and cell lines as the desired entity classes; and the i2b2 dataset, which has problems, tests, and treatments as the desired entity classes. Our method performed well and obtained performance close to supervised methods on the SemEval dataset. On the other datasets, it outperformed an existing unsupervised method on most entity classes. A list of entity names with their semantic types and a large unannotated corpus are the only requirements for our method to work well. Given these, our method generalizes across different types of entities and different types of biomedical text. Being unsupervised, the method can be easily applied to new NER tasks without needing costly annotations.
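
    The weak-labeling idea in the first phase can be made concrete with a short sketch. The snippet below is a minimal illustration only: a toy in-memory gazetteer stands in for UMLS, single-token matching stands in for unambiguous exact matching, and the names `GAZETTEER`, `context_features`, and `build_weak_labels` are hypothetical rather than taken from the thesis. The resulting context/label pairs would feed the classifier that is then improved by self-training.

```python
# Minimal sketch of weak labeling from a gazetteer; UMLS and the real matcher
# are replaced by toy stand-ins for illustration.
from collections import Counter

GAZETTEER = {
    "diabetes": "disorder",   # unambiguous hit of the desired class -> positive
    "asthma": "disorder",
    "aspirin": "chemical",    # unambiguous hit of another semantic type -> negative
    "insulin": "chemical",
}

def context_features(tokens, i, window=2):
    """Bag of surrounding words: the only signal the classifier later sees."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return Counter(w.lower() for w in left + right)

def build_weak_labels(sentences, target_type="disorder"):
    """Label unambiguous gazetteer hits: desired type -> 1, other types -> 0."""
    examples = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            sem_type = GAZETTEER.get(tok.lower())
            if sem_type is not None:
                examples.append((context_features(tokens, i),
                                 1 if sem_type == target_type else 0))
    return examples

sentences = [["She", "was", "diagnosed", "with", "diabetes", "last", "year"],
             ["He", "takes", "aspirin", "daily"]]
weak_data = build_weak_labels(sentences)  # feeds a classifier that is then
                                          # self-trained on its confident predictions
```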

    Complex Network Analysis for Scientific Collaboration Prediction and Biological Hypothesis Generation

    With the rapid development of digitalized literature, more and more knowledge has been discovered by computational approaches. This thesis addresses the problem of link prediction in co-authorship networks and protein--protein interaction networks derived from the literature. These networks (and most other types of networks) grow over time, and we assume that a machine can learn from past link creations by examining the network's state at the time of their creation. Our goal is to create a computationally efficient approach to recommend new links for a node in a network (e.g., new collaborations in co-authorship networks and new interactions in protein--protein interaction networks). We consider edges in a network that satisfy certain criteria as training instances for the machine learning algorithms. We analyze the neighborhood structure of each node and derive topological features. Furthermore, each node carries rich semantic information when linked to the literature, which can be used to derive semantic features. Using both types of features, we train machine learning models to predict the probability of connection for new node pairs. We apply our idea of link prediction to two distinct networks: a co-authorship network and a protein--protein interaction network. We demonstrate that the novel features we derive from both the network topology and literature content help improve link prediction accuracy. We also analyze the factors involved in establishing a new link and recurrent connections.
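
    As a rough illustration of the topological side of such a feature set, the sketch below computes a few standard neighborhood features (common neighbors, Jaccard coefficient, preferential attachment) for currently unconnected node pairs in a toy co-authorship graph. The semantic features derived from the literature and the trained classifier are omitted, and the function names and graph are invented for the example, not taken from the thesis.

```python
# Sketch of topological link-prediction features for candidate node pairs.
from itertools import combinations

def pair_features(graph, u, v):
    nu, nv = graph.get(u, set()), graph.get(v, set())
    common, union = nu & nv, nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        "preferential_attachment": len(nu) * len(nv),
    }

# Toy co-authorship graph: author -> set of co-authors.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}

# Score every currently unconnected pair as a candidate future link.
candidates = [
    (u, v, pair_features(graph, u, v))
    for u, v in combinations(graph, 2)
    if v not in graph[u]
]
```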

    Scalable and Declarative Information Extraction in a Parallel Data Analytics System

    Information extraction (IE) on very large data sets requires highly complex, scalable, and adaptive systems. Although numerous IE algorithms exist, their seamless and extensible combination in a scalable system is still a major challenge. This work presents a query-based IE system for a parallel data analysis platform, which is configurable for specific application domains and scales to terabyte-sized text collections. First, configurable operators are defined for basic IE and Web analytics tasks, which can be used to express complex IE tasks in the form of declarative queries. All operators are characterized in terms of their properties to highlight the potential and importance of optimizing non-relational, user-defined operators (UDFs) in dataflows. Subsequently, we survey the state of the art in optimizing non-relational dataflows and show that a comprehensive optimization of UDFs is still a challenge. Based on this observation, an extensible, logical optimizer (SOFA) is introduced, which incorporates the semantics of UDFs into the optimization process. SOFA analyzes a compact set of operator properties and combines automated analysis with manual UDF annotations to enable a comprehensive optimization of dataflows. SOFA is able to logically optimize arbitrary dataflows from different application areas, resulting in significant runtime improvements compared to other techniques. Finally, the applicability of the presented system to terabyte-sized corpora is investigated; we systematically evaluate the scalability and robustness of the employed methods and tools in order to pinpoint the most critical challenges in building an IE system for very large data sets.
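
    To make the idea of property-driven UDF optimization more tangible, the following sketch shows one simple rewrite such an optimizer can perform: pulling a cheap, selective filter ahead of an expensive operator whose output it does not depend on, based on declared read/write fields. This is only an illustration in the spirit of SOFA; the actual operator property model and rewrite rules are considerably richer, and all names and numbers here are invented for the example.

```python
# Annotation-driven dataflow reordering sketch: operators declare the fields
# they read and write, and selective operators bubble toward the start when
# the declared dependencies allow it.
from dataclasses import dataclass, field

@dataclass
class Operator:
    name: str
    reads: set = field(default_factory=set)
    writes: set = field(default_factory=set)
    cost: float = 1.0          # relative per-record cost (illustrative)
    selectivity: float = 1.0   # fraction of records passed on

def can_swap(first, second):
    """second may move before first if it does not read what first writes."""
    return not (second.reads & first.writes)

def push_down_filters(plan):
    """Bubble more selective operators toward the start when dependencies allow."""
    plan = list(plan)
    changed = True
    while changed:
        changed = False
        for i in range(len(plan) - 1):
            a, b = plan[i], plan[i + 1]
            if b.selectivity < a.selectivity and can_swap(a, b):
                plan[i], plan[i + 1] = b, a
                changed = True
    return plan

plan = [
    Operator("sentence_split", reads={"text"}, writes={"sentences"}, cost=1),
    Operator("ner_tagger", reads={"sentences"}, writes={"entities"}, cost=50),
    Operator("language_filter", reads={"text"}, selectivity=0.3, cost=0.1),
]
optimized = push_down_filters(plan)   # the selective filter ends up first
```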

    Health systems data interoperability and implementation

    Objective: The objective of this study was to use machine learning and health standards to address the problem of clinical data interoperability across healthcare institutions. Addressing this problem has the potential to make clinical data comparable, searchable, and exchangeable between healthcare providers. Data sources: Structured and unstructured data were used to conduct the experiments in this study. The data were collected from two disparate sources, MIMIC-III and NHANES. The MIMIC-III database stores data from two electronic health record systems, CareVue and MetaVision. The data stored in these systems were not recorded with the same standards and were therefore not comparable: some values conflicted, one system would store an abbreviation of a clinical concept while the other stored the full concept name, and some attributes contained missing information. These issues make this form of data a good candidate for this study. From the identified sources, laboratory, physical examination, vital signs, and behavioural data were used. Methods: This research employed the CRISP-DM framework as a guideline for all stages of data mining. Two sets of classification experiments were conducted, one for the classification of structured data and the other for unstructured data. In the first experiment, edit distance, TF-IDF, and Jaro-Winkler were used to calculate similarity weights between two datasets, one coded with the LOINC terminology standard and one uncoded. Similar sets of data were classified as matches while dissimilar sets were classified as non-matches. The Soundex indexing method was then used to reduce the number of potential comparisons. Thereafter, three classification algorithms were trained and tested, and the performance of each was evaluated through the ROC curve. The second experiment was aimed at extracting patients' smoking status from a clinical corpus. A sequence-oriented classification algorithm, CRF, was used for learning related concepts from the given clinical corpus, with word embedding, random indexing, and word shape features used to capture meaning in the corpus. Results: Having optimized all the model's parameters through v-fold cross-validation on a sampled training set of structured data, only 8 out of 24 features were selected for the classification task. RapidMiner was used to train and test all the classification algorithms. On the final run of the classification process, the last contenders were SVM and the decision tree classifier. SVM yielded an accuracy of 92.5% once its parameters were tuned. These results were obtained after more relevant features were identified, having observed that the classifiers were biased on the initial data. The unstructured data was annotated via the UIMA Ruta scripting language and then trained through CRFSuite, which comes with the CLAMP toolkit. The CRF classifier obtained an F-measure of 94.8% for the "non-smoker" class, 83.0% for "current smoker", and 65.7% for "past smoker". It was observed that as more relevant data were added, the performance of the classifier improved. The results show that there is a need for the use of FHIR resources for exchanging clinical data between healthcare institutions. FHIR is free and uses profiles to extend coding standards, a RESTful API to exchange messages, and JSON, XML, and Turtle for representing messages. Data could be stored in JSON format in a NoSQL database such as CouchDB, which makes it available for further post-extraction exploration. Conclusion: This study has provided a method for learning a clinical coding standard by a computer algorithm and then applying that learned standard to unstandardized data, so that unstandardized data can be easily exchanged, compared, and searched, ultimately achieving data interoperability. Even though this study was applied on a limited scale, future work would explore the standardization of patients' long-lived data from multiple sources using the SHARPn open-source tools and data scaling platforms.
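
    The structured-data matching step can be pictured with the short sketch below. It is a simplified stand-in, not the study's pipeline: difflib's sequence ratio replaces the edit-distance, TF-IDF, and Jaro-Winkler weights, a bare-bones Soundex implementation provides the blocking, and the threshold, data, and function names are invented for the example.

```python
# Similarity-based matching with Soundex blocking: record pairs are compared
# only when their Soundex codes agree, then linked if similar enough.
from difflib import SequenceMatcher

def soundex(word):
    """Classic 4-character Soundex code, used here only to block comparisons."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return "0000"
    encoded, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":
            prev = code
    return (encoded + "000")[:4]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match(coded, uncoded, threshold=0.85):
    """Pair each uncoded lab name with a LOINC-coded name in the same block."""
    blocks = {}
    for name, loinc in coded:
        blocks.setdefault(soundex(name), []).append((name, loinc))
    links = []
    for name in uncoded:
        for cand, loinc in blocks.get(soundex(name), []):
            if similarity(name, cand) >= threshold:
                links.append((name, cand, loinc))
    return links
```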

    Linking social media, medical literature, and clinical notes using deep learning.

    Researchers analyze data, information, and knowledge through many sources, formats, and methods. The dominant data formats include text and images. In the healthcare industry, professionals generate a large quantity of unstructured data. The complexity of this data and the lack of computational power cause delays in analysis. However, with emerging deep learning algorithms and access to computational resources such as graphics processing units (GPUs) and tensor processing units (TPUs), processing text and images is becoming more accessible. Deep learning algorithms achieve remarkable results in natural language processing (NLP) and computer vision. In this study, we focus on NLP in the healthcare industry and collect data not only from electronic medical records (EMRs) but also from medical literature and social media. We propose a framework for linking social media, medical literature, and EMR clinical notes using deep learning algorithms. Connecting data sources requires defining a link between them, and our key is finding concepts in the medical text. The National Library of Medicine (NLM) provides the Unified Medical Language System (UMLS), and we use this system as the foundation of our own system. We recognize social media's dynamic nature and apply supervised and semi-supervised methodologies to generate concepts. Named entity recognition (NER) allows efficient extraction of information, or entities, from medical literature, and we extend the model to process the EMRs' clinical notes via transfer learning. The results include an integrated, end-to-end, web-based system solution that unifies social media, literature, and clinical notes, and improves access to medical knowledge for the public and experts.
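
    A compressed view of how shared concepts can tie the three sources together is sketched below. A toy dictionary tagger stands in for the trained NER and transfer-learned models, the concept identifiers are illustrative UMLS-style CUIs, and the documents, names, and function are invented for the example.

```python
# Link documents from different sources through shared extracted concepts.
from collections import defaultdict

CONCEPTS = {                      # surface form -> illustrative UMLS-style CUI
    "heart attack": "C0027051",
    "myocardial infarction": "C0027051",
    "aspirin": "C0004057",
}

def extract_cuis(text):
    """Toy dictionary tagger; a trained NER model would do this in practice."""
    text = text.lower()
    return {cui for phrase, cui in CONCEPTS.items() if phrase in text}

def link_by_concept(documents):
    """documents: list of (source, doc_id, text); returns cui -> linked docs."""
    index = defaultdict(list)
    for source, doc_id, text in documents:
        for cui in extract_cuis(text):
            index[cui].append((source, doc_id))
    return index

docs = [
    ("social_media", "post-1", "my dad just had a heart attack"),
    ("literature", "article-42", "Aspirin after myocardial infarction reduces risk."),
    ("clinical_note", "note-7", "Pt admitted for myocardial infarction."),
]
links = link_by_concept(docs)   # the shared concept links all three sources
```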

    Getting More out of Biomedical Documents with GATE's Full Lifecycle Open Source Text Analytics.

    This software article describes the GATE family of open source text analysis tools and processes. GATE is one of the most widely used systems of its type, with yearly download rates of tens of thousands and many active users in both academic and industrial contexts. In this paper we report three examples of GATE-based systems operating in the life sciences and in medicine. The first is in genome-wide association studies, where it has contributed to the discovery of a head and neck cancer mutation association. The second is medical records analysis, which has significantly increased the statistical power of treatment/outcome models in the UK's largest psychiatric patient cohort. The third is richer constructs in drug-related searching. We also explore the ways in which the GATE family supports the various stages of the lifecycle present in our examples. We conclude that the deployment of text mining for document abstraction or rich search and navigation is best thought of as a process, and that with the right computational tools and data collection strategies this process can be made well defined and repeatable. The GATE research programme is now 20 years old and has grown from its roots as a specialist development tool for text processing to become a rather comprehensive ecosystem, bringing together software developers, language engineers and research staff from diverse fields. GATE now has a strong claim to cover a uniquely wide range of the lifecycle of text analysis systems. It forms a focal point for the integration and reuse of advances that have been made by many people (the majority outside of the authors' own group) who work in text processing for biomedicine and other areas. GATE is available online under GNU open source licences and runs on all major operating systems. Support is available from an active user and developer community and also on a commercial basis.

    Content Recognition and Context Modeling for Document Analysis and Retrieval

    The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge. In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale and rotation invariant local shape feature that is generic enough to be detected repeatably and is segmentation free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine printed text and handwriting. Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (e.g., signatures and logos) provides a practical and reliable supplement to the OCR recognition of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on the structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification. Third, we present a model-based approach for extracting relevant named entities from unstructured documents. In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively based on contextual information from both page layout and text features. Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and the broader web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance.
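
    For the last part, a generic baseline helps make the setup concrete: the sketch below estimates page importance by building a transition model from observed user sessions and running power iteration over it. This is a plain random-surfer-style computation, not the dissertation's intentional surfer model, and the session data and function name are invented for the example.

```python
# Estimate page importance from session transition counts via power iteration.
from collections import defaultdict

def page_importance(sessions, damping=0.85, iterations=50):
    # Count page-to-page transitions observed within user sessions.
    out = defaultdict(lambda: defaultdict(int))
    pages = set()
    for session in sessions:
        pages.update(session)
        for a, b in zip(session, session[1:]):
            out[a][b] += 1
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for a, targets in out.items():
            total = sum(targets.values())
            for b, count in targets.items():
                new[b] += damping * rank[a] * count / total
        # Pages with no observed outgoing transitions spread their mass uniformly.
        dangling = sum(rank[p] for p in pages if p not in out)
        for p in pages:
            new[p] += damping * dangling / n
        rank = new
    return rank

sessions = [["home", "search", "doc1"], ["home", "doc2"], ["search", "doc1", "doc2"]]
scores = page_importance(sessions)  # higher score = more important page
```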