1,263 research outputs found
Clustering cliques for graph-based summarization of the biomedical research literature
BACKGROUND: Graph-based notions are increasingly used in biomedical data mining and knowledge discovery tasks. In this paper, we present a clique-clustering method to automatically summarize graphs of semantic predications produced from PubMed citations (titles and abstracts). RESULTS: SemRep was used to extract semantic predications from the citations returned by a PubMed search. Cliques were identified from frequently occurring predications with highly connected arguments filtered by degree centrality. Themes contained in the summary were identified with a hierarchical clustering algorithm based on common arguments shared among cliques. The validity of the clusters in the summaries produced was compared to the Silhouette-generated baseline for cohesion, separation and overall validity. The theme labels were also compared to a reference standard produced with major MeSH headings. CONCLUSIONS: For 11 topics in the testing data set, the overall validity of clusters from the system summary was 10 percentage points better than the baseline (43% versus 33%). When compared with the reference standard from MeSH headings, the results for recall, precision and F-score were 0.64, 0.65, and 0.65, respectively.
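The abstract does not spell out its clustering procedure, so the following is only a minimal sketch of the general idea of grouping cliques by the arguments they share. It assumes Jaccard overlap as the similarity measure, single-linkage merging, and a hypothetical `threshold` parameter; none of these choices is taken from the paper, and the example cliques are invented for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of arguments two non-empty cliques have in common."""
    return len(a & b) / len(a | b)


def cluster_cliques(cliques, threshold=0.25):
    """Single-linkage agglomerative clustering of cliques.

    Assumption: two clusters are merged when their most similar pair of
    cliques has a Jaccard overlap of at least `threshold` (hypothetical).
    """
    clusters = [[frozenset(c)] for c in cliques]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sim = max(jaccard(a, b) for a in clusters[i] for b in clusters[j])
            if sim >= threshold and (best is None or sim > best[0]):
                best = (sim, i, j)
        if best is None:
            return clusters  # no remaining pair is similar enough
        _, i, j = best
        clusters[i].extend(clusters[j])  # j > i, so deleting j keeps i valid
        del clusters[j]


# Invented example: each clique is the set of arguments it contains.
cliques = [
    {"aspirin", "stroke", "platelet"},
    {"aspirin", "stroke", "bleeding"},
    {"insulin", "diabetes", "glucose"},
]
themes = cluster_cliques(cliques)
```

In this toy input the first two cliques share two of four distinct arguments and end up in one theme, while the third stays on its own.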
Reasoning About User Feedback Under Identity Uncertainty in Knowledge Base Construction
Intelligent, automated systems that are intertwined with everyday life, such as Google Search and virtual assistants like Amazon’s Alexa or Apple’s Siri, are often powered in part by knowledge bases (KBs), i.e., structured data repositories of entities, their attributes, and the relationships among them. Despite a wealth of research focused on automated KB construction methods, KBs are inevitably imperfect, with errors stemming from various points in the construction pipeline. Making matters more challenging, new data is created daily and must be integrated with existing KBs so that they remain up-to-date. As the primary consumers of KBs, human users have tremendous potential to aid in KB construction by contributing feedback that identifies spurious and missing entity attributes and relations. However, correctly integrating user feedback with an existing KB is complicated by the need to resolve identity uncertainty, i.e., uncertainty about which real-world entity a piece of data refers to. Identity uncertainty abounds in the collection of raw evidence from which a KB is built. Moreover, it carries over into user feedback when the KB entities affected by that feedback are split or merged.
In this dissertation, we present a continuous reasoning framework capable of integrating user feedback with a KB under identity uncertainty. To begin, we introduce Grinch, an online entity resolution (ER) algorithm with provable correctness guarantees that is capable of merging and splitting KB entities as new data arrives. We show that Grinch is efficient and achieves state-of-the-art performance in ER as well as in clustering. Next, we propose a method for using Grinch to resolve identity uncertainty in a KB's underlying data as well as in user feedback. Our approach is based on representing user feedback as mentions, i.e., first-class KB objects that participate in all parts of KB construction. Furthermore, we introduce a structured representation for feedback composed of packaging and payload, which facilitates recovery from KB errors that stem from both identity uncertainty and noisy data. Finally, we evaluate our framework's efficacy using data from the KB that supports OpenReview.net, a deployed conference management system that solicits feedback from users. The demands of OpenReview.net led us to develop XGrinch-Shallow (XGS), a variant of Grinch that builds trees with arbitrary branching factors, and subsequently instantiates 60% fewer internal nodes than Grinch. Empirically, we show that XGS is efficient and is able to effectively utilize user feedback to improve the correctness and completeness of the OpenReview.net KB. We conclude with seven concrete suggestions for future research on this topic.
PROCESS-ORIENTED KNOWLEDGE DISCOVERY TO SUPPORT PRODUCT DESIGN USING TEXT MINING
Ph.D. (DOCTOR OF PHILOSOPHY)
Machine learning of structured and unstructured healthcare data
The widespread adoption of Electronic Health Records (EHR) systems in healthcare institutions in the United States makes machine learning based on large-scale, real-world clinical data feasible and affordable. Machine learning of healthcare data, or healthcare data analytics, has achieved numerous successes in various applications. However, many challenges remain for machine learning of both structured and unstructured healthcare data. Longitudinal structured clinical data (e.g., lab test results, diagnoses, and medications) have an enormous variety of categories, are collected at irregularly spaced visits, and are sparsely distributed. Studies on analyzing longitudinal structured EHR data for tasks such as disease prediction and visualization are still limited. For unstructured clinical notes, existing studies mostly focus on disease prediction or cohort selection. Studies on mining clinical notes with the direct purpose of reducing costs for healthcare providers or institutions are limited. To fill these gaps, this dissertation addresses three research topics. The first topic is about developing state-of-the-art predictive models to detect diabetic retinopathy using longitudinal structured EHR data. Major deep-learning-based temporal models for disease prediction are studied, implemented, and evaluated. Experimental results on a large-scale dataset show that temporal deep learning models outperform non-temporal random forest models in terms of AUPRC and recall. The second topic is about clustering temporal disease networks to visualize comorbidity progression. We propose a clustering technique to outline comorbidity progression phases as well as a new disease clustering method to simplify the visualization. Two case studies on Clostridioides difficile and stroke show the methods are effective. The third topic is clinical information extraction for medical billing. We propose a framework that consists of two methods, one rule-based and one deep-learning-based, to extract patient history information directly from clinical notes to facilitate Evaluation and Management Services (E/M) billing. Initial results of the two prototype systems on an annotated dataset are promising and point to potential improvements.
SIIMCO: A forensic investigation tool for identifying the influential members of a criminal organization
Members of a criminal organization who hold central positions in the organization are usually targeted by criminal investigators for removal or surveillance. This is because they play key and influential roles by acting as commanders who issue instructions or serve as gatekeepers. Removing these central members (i.e., influential members) is most likely to disrupt the organization and put it out of business. Most often, criminal investigators are even more interested in knowing the subset of these influential members who are the immediate leaders of lower level criminals. These lower level criminals are the ones who usually carry out the criminal work; therefore, they are easier to identify. The ultimate goal of investigators is to identify the immediate leaders of these lower level criminals in order to disrupt future crimes. We propose, in this paper, a forensic analysis system called SIIMCO that can identify the influential members of a criminal organization. Given a list of lower level criminals in a criminal organization, SIIMCO can also identify the immediate leaders of these criminals. SIIMCO first constructs a network representing a criminal organization from either mobile communication data that belongs to the organization or crime incident reports. It adopts the concept space approach to automatically construct a network from crime incident reports. In such a network, a vertex represents an individual criminal, and a link represents the relationship between two criminals. SIIMCO employs formulas that quantify the degree of influence/importance of each vertex in the network relative to all other vertices. We present these formulas through a series of refinements. All the formulas incorporate novel weighting schemes for the edges of networks. We evaluated the quality of SIIMCO by comparing it experimentally with two other systems. Results showed marked improvement.
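The abstract does not give SIIMCO's formulas, so the sketch below is not the system's method. It only illustrates the general notion of scoring each vertex relative to all others on a network with weighted edges, using normalized weighted degree as a stand-in measure. The function name, the edge weights, and the vertex labels are hypothetical.

```python
from collections import defaultdict


def weighted_degree_influence(edges):
    """Score each vertex by its share of the total edge weight.

    `edges` is an iterable of (u, v, weight) tuples for an undirected
    network. Assumption: the weight stands for relationship strength,
    for example the number of calls between two individuals.
    """
    strength = defaultdict(float)
    for u, v, w in edges:
        strength[u] += w  # each edge contributes to both endpoints
        strength[v] += w
    total = sum(strength.values())
    if not total:
        return {}
    return {node: s / total for node, s in strength.items()}


# Invented example network with three individuals.
edges = [("A", "B", 3.0), ("A", "C", 1.0), ("B", "C", 1.0)]
scores = weighted_degree_influence(edges)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because the scores are normalized by the total weight, they sum to 1 and can be compared across vertices of the same network.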
- …