Doctor of Philosophy dissertation
Medical knowledge learned in medical school can quickly become outdated given the tremendous growth of the biomedical literature. It is the responsibility of medical practitioners to continuously update their knowledge with the most recent and best available clinical evidence in order to make informed decisions about patient care. However, clinicians often have little time to read the primary literature, even within their narrow specialty. As a result, they often rely on systematic evidence reviews developed by medical experts to fulfill their information needs. At present, systematic reviews of clinical research are created and updated manually, which is expensive, slow, and unable to keep up with the rapidly growing medical literature. This dissertation research aims to enhance the traditional systematic review development process with computer-aided solutions. The first study investigates query expansion and scientific quality ranking approaches to enhance literature search on clinical guideline topics. It showed that unsupervised methods can improve the retrieval performance of a popular biomedical search engine (PubMed). The proposed methods improve the comprehensiveness of literature search and increase the proportion of relevant studies found while reducing screening effort. The second and third studies aim to enhance the traditional manual data extraction process. The second study developed a framework to extract and classify text from PDF reports. It demonstrated that a rule-based multipass sieve approach is more effective than a machine-learning approach in categorizing document-level structures, and that classifying and filtering publication metadata and semistructured text enhances the performance of an information extraction system. The proposed method could serve as a document processing step in any text mining research on PDF documents.
The third study proposed a solution for computer-aided data extraction that recommends relevant sentences and key phrases extracted from publication reports. It demonstrated that using a machine-learning classifier to prioritize sentences for specific data elements performs as well as or better than an abstract screening approach, and may save time and reduce errors in the full-text screening process. In summary, this dissertation showed that there are promising opportunities for technology to assist in the development of systematic reviews. In an age when computing resources are becoming cheaper and more powerful, failing to apply computer technologies to assist and optimize these manual processes is a lost opportunity to improve the timeliness of systematic reviews. This research provides methodologies and tests hypotheses that can serve as the basis for further large-scale software engineering projects aimed at fully realizing the prospect of computer-aided systematic reviews.
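The abstract does not specify the classifier used, but the idea of scoring and ranking sentences by their likelihood of mentioning a target data element can be sketched with a minimal multinomial Naive Bayes log-odds scorer. All class names, training sentences, and the "randomization" data element below are illustrative assumptions, not the dissertation's actual design.

```python
import math
from collections import Counter

def tokenize(text):
    # crude whitespace tokenizer; real systems would use an NLP pipeline
    return [w.lower().strip(".,;:()") for w in text.split()]

class SentenceRanker:
    """Naive Bayes log-odds scorer that prioritizes sentences likely to
    mention a target data element (illustrative sketch only)."""

    def __init__(self):
        self.words = {True: Counter(), False: Counter()}
        self.classes = Counter()

    def fit(self, sentences, labels):
        for sent, lab in zip(sentences, labels):
            self.classes[lab] += 1
            self.words[lab].update(tokenize(sent))

    def score(self, sentence):
        # Laplace-smoothed log-odds that the sentence is relevant
        vocab = set(self.words[True]) | set(self.words[False])
        n_pos = sum(self.words[True].values()) + len(vocab)
        n_neg = sum(self.words[False].values()) + len(vocab)
        s = math.log((self.classes[True] + 1) / (self.classes[False] + 1))
        for w in tokenize(sentence):
            s += math.log((self.words[True][w] + 1) / n_pos)
            s -= math.log((self.words[False][w] + 1) / n_neg)
        return s

    def rank(self, sentences):
        # highest-scoring sentences first, for reviewer triage
        return sorted(sentences, key=self.score, reverse=True)

# hypothetical hand-labeled examples for the "randomization" data element
ranker = SentenceRanker()
ranker.fit(
    ["A total of 120 patients were randomized.",
     "Participants were randomized to two arms.",
     "The study was funded by a grant.",
     "Ethics approval was obtained."],
    [True, True, False, False])
ranked = ranker.rank(["Funding was provided by the sponsor.",
                      "We randomized 60 patients into groups."])
```

A reviewer would then read the top-ranked sentences first instead of the full text, which is the time saving the study measures.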
A review of associative classification mining
Associative classification mining is a promising data mining approach that uses association rule discovery techniques to construct classification systems, also known as associative classifiers. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, and MMAC. These algorithms employ several different rule discovery, rule ranking, rule pruning, rule prediction, and rule evaluation methods. This paper surveys and compares state-of-the-art associative classification techniques with regard to these criteria. Finally, future directions in associative classification, such as incremental learning and mining low-quality data sets, are also highlighted.
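The core pipeline shared by the algorithms the survey names (rule discovery, ranking, and prediction) can be sketched in a few lines. This is a simplified CBA-style miner, not any one of the surveyed algorithms; the antecedent-size cap and the toy data are assumptions made for brevity.

```python
from itertools import combinations

def mine_class_rules(transactions, labels, min_sup=0.2, min_conf=0.6):
    """Mine class association rules (itemset -> label), CBA-style,
    restricted to antecedents of size 1 or 2 for brevity."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = ([frozenset([i]) for i in items]
                  + [frozenset(p) for p in combinations(items, 2)])
    rules = []
    for ant in candidates:
        covered = [lab for t, lab in zip(transactions, labels)
                   if ant <= set(t)]
        support = len(covered) / n
        if support < min_sup:
            continue
        for cls in set(covered):
            conf = covered.count(cls) / len(covered)
            if conf >= min_conf:
                rules.append((ant, cls, conf, support))
    # rule ranking: higher confidence, then higher support, then shorter antecedent
    rules.sort(key=lambda r: (-r[2], -r[3], len(r[0])))
    return rules

def classify(rules, transaction, default="unknown"):
    """Rule prediction: fire the first (best-ranked) matching rule."""
    for ant, cls, _conf, _sup in rules:
        if ant <= set(transaction):
            return cls
    return default

# toy labeled transactions (purely illustrative)
tx = [["a", "b"], ["a", "c"], ["b", "c"], ["c"]]
y = ["pos", "pos", "neg", "neg"]
rules = mine_class_rules(tx, y)
```

The surveyed algorithms differ precisely in how each of these steps is done: candidate generation (CMAR builds on FP-growth), pruning (database coverage, chi-square), and prediction (single best rule versus multiple-rule voting).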
Automatic pattern-taxonomy extraction for web mining
In this paper, we propose a model for discovering frequent sequential patterns (phrases) that can be used as profile descriptors of documents. Data mining algorithms can undoubtedly produce numerous phrases; however, it is difficult to use these phrases effectively to answer what users want. We therefore present a pattern taxonomy extraction model that extracts descriptive frequent sequential patterns by pruning the meaningless ones. The model is then extended and tested by applying it to an information filtering system. The experimental results show that pattern-based methods outperform keyword-based methods. The results also indicate that removing meaningless patterns not only reduces the cost of computation but also improves the effectiveness of the system.
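The pruning idea can be illustrated with a toy sketch: mine frequent word subsequences, then keep only "closed" patterns, i.e. those not subsumed by a longer pattern with the same support. This closedness criterion is one common proxy for discarding redundant sub-patterns; the paper's actual taxonomy model is more involved, so treat this as an assumption-laden sketch.

```python
from itertools import combinations

def subseq(p, s):
    """True if tuple p is a (possibly non-contiguous) subsequence of s."""
    it = iter(s)
    return all(x in it for x in p)  # membership test advances the iterator

def frequent_sequences(docs, min_sup=2, max_len=3):
    """Naively enumerate word subsequences up to max_len per document
    and keep those appearing in at least min_sup documents."""
    counts = {}
    for doc in docs:
        seen = set()
        for k in range(1, max_len + 1):
            for idx in combinations(range(len(doc)), k):
                seen.add(tuple(doc[i] for i in idx))
        for p in seen:  # document frequency, so count each doc once
            counts[p] = counts.get(p, 0) + 1
    return {p: c for p, c in counts.items() if c >= min_sup}

def prune_non_closed(freq):
    """Drop patterns subsumed by a longer pattern with equal support."""
    return {p: c for p, c in freq.items()
            if not any(len(q) > len(p) and c == cq and subseq(p, q)
                       for q, cq in freq.items())}

docs = [["data", "mining", "rules"], ["data", "mining", "text"]]
freq = frequent_sequences(docs)
pruned = prune_non_closed(freq)
```

Here the singletons "data" and "mining" are pruned because the phrase ("data", "mining") carries the same support, which is exactly the kind of redundancy reduction the abstract credits for both lower cost and better effectiveness.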
A Simplicial Complex, a Hypergraph, Structure in the Latent Semantic Space of Document Clustering
This paper presents a novel approach to document clustering based on a geometric structure from combinatorial topology. Given a set of documents, the set of associations among frequently co-occurring terms naturally forms a simplicial complex. Our general thesis is that each connected component of this simplicial complex represents a concept in the collection, and that documents can be clustered into meaningful classes based on these concepts. In this paper, however, we address a weaker notion: instead of connected components, we use maximal simplexes of highest dimension as representatives of connected components; the concepts so defined are called maximal primitive concepts. Experiments with three data sets drawn from web pages and the medical literature show that the proposed unsupervised clustering approach performs significantly better than traditional clustering algorithms such as k-means, AutoClass, and hierarchical agglomerative clustering (HAC). This abstract geometric model appears to capture the latent semantic structure of documents.
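One concrete reading of the construction: if edges of the complex connect terms that co-occur in enough documents, then maximal simplexes correspond to maximal cliques in that co-occurrence graph, and the largest cliques play the role of maximal primitive concepts. The sketch below makes that reading explicit with a plain Bron-Kerbosch clique enumeration; thresholds and data are invented for illustration, and the paper's full construction may differ.

```python
from itertools import combinations

def cooccurrence_graph(docs, min_docs=2):
    """Adjacency map with an edge between two terms when they
    co-occur in at least min_docs documents."""
    edge_counts = {}
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            edge_counts[(a, b)] = edge_counts.get((a, b), 0) + 1
    adj = {}
    for (a, b), c in edge_counts.items():
        if c >= min_docs:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques (maximal simplexes)."""
    out = []
    def bk(r, p, x):
        if not p and not x:
            out.append(r)
            return
        for v in list(p):
            bk(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    bk(set(), set(adj), set())
    return out

docs = [["gene", "protein", "cell"], ["gene", "protein", "cell"],
        ["web", "page"], ["web", "page"], ["gene", "web"]]
adj = cooccurrence_graph(docs)
cliques = maximal_cliques(adj)
# "maximal primitive concepts": cliques of highest dimension
concepts = [c for c in cliques if len(c) == max(map(len, cliques))]
```

Documents would then be assigned to the concept (clique) whose terms they share most heavily, yielding the clusters the experiments evaluate.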
Methods for Mining Association Rules from Data
The aim of this thesis is to implement the Multipass-Apriori method for mining association rules from text data. After an introduction to the field of knowledge discovery, the specific aspects of text mining are discussed. Preprocessing plays a very important role in the mining process; in this case, stemming and a stop-word dictionary are essential. The next part of the thesis deals with the meaning, use, and generation of association rules. The main part focuses on the Multipass-Apriori method, which was implemented, and describes how it works. Based on the tests performed, the optimal way of dividing the partitions and ordering the itemsets was determined. As part of the testing, the Multipass-Apriori method was compared with the Apriori method.
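For reference, the baseline against which Multipass-Apriori is compared is the classic level-wise Apriori algorithm, which this sketch implements in plain form (each stemmed, stop-word-filtered document would be treated as one transaction of terms; the grocery items below are just stand-ins).

```python
from itertools import combinations

def apriori(transactions, min_sup=2):
    """Plain Apriori: level-wise frequent-itemset mining with the
    downward-closure prune. Returns {itemset: support_count}."""
    tx = [frozenset(t) for t in transactions]
    level = {frozenset([i]) for t in tx for i in t}
    freq = {}
    k = 1
    while level:
        counts = {c: sum(1 for t in tx if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_sup}
        freq.update(current)
        # candidate generation: join frequent k-itemsets, then keep only
        # candidates whose every k-subset is itself frequent
        nxt = set()
        for a, b in combinations(list(current), 2):
            u = a | b
            if len(u) == k + 1 and all(frozenset(s) in current
                                       for s in combinations(u, k)):
                nxt.add(u)
        level = nxt
        k += 1
    return freq

tx = [["bread", "milk"], ["bread", "butter"],
      ["milk", "butter", "bread"], ["milk"]]
freq = apriori(tx, min_sup=2)
```

Multipass variants partition the data so each partition fits in memory and candidate counting needs fewer full passes, which is why the thesis tunes how the partitions are divided and how the itemsets are ordered.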
Association rule method for analyzing affinity between text-type objects
Master's thesis in Engineering. Data mining is considered a tool for extracting knowledge from large volumes of information. One of the analyses performed in data mining is association rules, whose purpose is to find co-occurrences among the records of a data set.
Its main application is in market basket analysis, where criteria for decision making are established based on the buying behavior of customers. Some of the algorithms are Apriori, Frequent Pattern Growth (FP-Growth), the QFP algorithm, CBA, CMAR, and CPAR. These algorithms were designed to analyze structured databases; at present, various applications require the processing of unstructured data known as text-type objects. The purpose of this research is to develop a method for establishing the relationships between the elements that make up a text-type object, in order to acquire relevant information from the analysis of massive data sources of the same type.
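The affinity measure at the heart of this line of work is rule confidence over itemset supports. As a sketch under the assumption that each document's term set is treated as one transaction, rules A -> B can be derived from frequent itemsets as confidence = support(A ∪ B) / support(A); the toy corpus below is invented.

```python
from itertools import combinations

def association_rules(freq, min_conf=0.6):
    """Derive rules (antecedent -> consequent, confidence) from
    frequent itemset support counts."""
    rules = []
    for itemset, sup in freq.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for ant in map(frozenset, combinations(itemset, r)):
                # ant is frequent by downward closure, so freq[ant] exists
                conf = sup / freq[ant]
                if conf >= min_conf:
                    rules.append((ant, itemset - ant, conf))
    return rules

# each text-type object contributes its term set as a "transaction"
docs = [{"data", "mining"}, {"data", "mining"}, {"data", "text"}]
freq = {}
for size in (1, 2):
    for d in docs:
        for c in map(frozenset, combinations(sorted(d), size)):
            freq[c] = freq.get(c, 0) + 1
freq = {c: n for c, n in freq.items() if n >= 2}  # min support = 2 docs
rules = association_rules(freq)
```

A rule like {mining} -> {data} with confidence 1.0 then expresses the affinity between elements of the text-type objects that the research aims to surface.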
Design and implementation of a workflow for quality improvement of the metadata of scientific publications
In this paper, a detailed workflow for analyzing and improving the quality of metadata of scientific publications is presented and tested.
The workflow was developed based on approaches from the literature. Frequently occurring error types reported in the literature were compiled, mapped to the data-quality dimensions most relevant for publication data (completeness, correctness, and consistency), and made measurable. Based on the identified data errors, a process for improving data quality was developed. This process includes parsing hidden data, correcting incorrectly formatted attribute values, enriching records with external data, deduplicating, and filtering erroneous records.
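Three of those steps (correcting formatting, deduplicating, and filtering incomplete records) can be sketched as below. The field names `title` and `doi` and the normalization rules are hypothetical stand-ins, not the paper's actual schema.

```python
import re

def normalize(record):
    """Correct common formatting issues: collapse whitespace in the
    title, lowercase and trim the DOI (hypothetical field names)."""
    r = dict(record)
    r["title"] = re.sub(r"\s+", " ", r.get("title", "")).strip()
    r["doi"] = r.get("doi", "").strip().lower()
    return r

def deduplicate(records):
    """Keep the first record per DOI, falling back to the normalized
    title when no DOI is present."""
    seen, out = set(), []
    for rec in map(normalize, records):
        key = rec["doi"] or rec["title"].lower()
        if key and key not in seen:
            seen.add(key)
            out.append(rec)
    return out

def filter_complete(records, required=("title", "doi")):
    """Completeness check: drop records missing a required attribute."""
    return [r for r in records if all(r.get(f) for f in required)]

records = [
    {"title": "Deep  Learning ", "doi": "10.1000/ABC"},
    {"title": "Deep Learning", "doi": "10.1000/abc"},
    {"title": "No Identifier", "doi": ""},
]
clean = deduplicate(records)
complete = filter_complete(clean)
```

Real pipelines would add fuzzy title matching and external enrichment (e.g. against a registry) before filtering, which is where most of the remaining errors the paper reports would be addressed.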
The effectiveness of the workflow was confirmed in an exemplary application to publication data from Open Researcher and Contributor ID (ORCID), where 56% of the identified data errors were corrected. The workflow will be applied to publication data from other source systems in the future to further improve its performance.