16 research outputs found

    Framework for data quality in knowledge discovery tasks

    The creation and consumption of data continue to grow by leaps and bounds. Due to advances in Information and Communication Technologies (ICT), the data explosion in the digital universe is a new trend, and Knowledge Discovery in Databases (KDD) has gained importance because of the abundance of available data. A successful knowledge discovery process requires careful data preparation: experts estimate that the preprocessing phase takes 50% to 70% of the total time of a knowledge discovery process. Software tools based on popular knowledge discovery methodologies offer algorithms for data preprocessing. According to the Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms, KNIME, RapidMiner, SAS, Alteryx, and H2O.ai are the leading tools for knowledge discovery. These tools provide a variety of techniques that facilitate the evaluation of a dataset; however, they lack a user-oriented process for addressing data quality issues and offer no guidance as to which techniques can or should be used in which contexts. Consequently, selecting suitable data cleaning techniques is a headache for inexpert users, who do not know which methods can be confidently used and often resort to trial and error. This thesis presents three contributions to address these problems: (i) a conceptual framework that gives the user a guided process for addressing data quality issues in knowledge discovery tasks, (ii) a case-based reasoning system that recommends suitable algorithms for data cleaning, and (iii) an ontology that represents knowledge about data quality issues and data cleaning methods. This ontology also supports the case-based reasoning system in formal case representation and in the reuse (adaptation) phase.
Programa Oficial de Doctorado en Ciencia y Tecnología Informática. Presidente: Fernando Fernández Rebollo. Secretario: Gustavo Adolfo Ramírez. Vocal: Juan Pedro Caraça-Valente Hernández
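To make the case-based reasoning contribution concrete, the sketch below shows the core retrieve-and-reuse step such a recommender could perform; the quality features, the case base, and the algorithm names are invented for illustration and are not taken from the thesis.

```python
# Hypothetical sketch of case-based retrieval for recommending a data
# cleaning algorithm: each stored case pairs a feature vector describing a
# data quality problem with the cleaning method that solved it; a new
# problem reuses the solution of the nearest stored case.
from math import sqrt

# Invented features: (missing-value ratio, outlier ratio, duplicate ratio).
CASE_BASE = [
    ((0.30, 0.02, 0.00), "mean-imputation"),
    ((0.02, 0.15, 0.00), "iqr-outlier-removal"),
    ((0.01, 0.01, 0.20), "record-deduplication"),
]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recommend(query):
    """Retrieve the most similar stored case and reuse its solution."""
    best = min(CASE_BASE, key=lambda case: distance(case[0], query))
    return best[1]

print(recommend((0.25, 0.03, 0.01)))  # nearest case suggests mean-imputation
```

In a full CBR cycle the reused solution would also be revised and, if successful, retained as a new case; the thesis's ontology supports exactly those representation and reuse steps.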

    WikiSensing: A collaborative sensor management system with trust assessment for big data

    Big Data for sensor networks and collaborative systems has become ever more important in the digital economy and is a focal point of technological interest, while posing many noteworthy challenges. This research addresses some of those challenges in the areas of online collaboration and Big Data for sensor networks. It demonstrates WikiSensing (www.wikisensing.org), a high-performance, heterogeneous, collaborative data cloud for the management and analysis of real-time sensor data. The system is based on a Big Data architecture with comprehensive functionality for smart-city sensor data integration and analysis. It is fully functional and served as the main data management platform for the 2013 UPLondon Hackathon. The system is unique in that it introduces a novel methodology incorporating online collaboration with sensor data. While other platforms for sensor data management exist, WikiSensing is one of the first to enable online collaboration by providing services to store and query dynamic sensor information without any restriction on the type and format of the sensor data. An emerging challenge for collaborative sensor systems is modelling and assessing the trustworthiness of sensors and their measurements, which is directly relevant to WikiSensing as an open collaborative sensor data management system. If the trustworthiness of the sensor data can be accurately assessed, WikiSensing becomes more than a collaborative data management system for sensors: it also provides users with information on the validity of its data. Hence this research presents a new generic framework for capturing and analysing sensor trustworthiness, considering the different forms of evidence available to the user. It uses an extensible set of metrics that represent such evidence and applies Bayesian analysis to develop a trust classification model.
Based on this work, several publications have been produced and others are at the final stage of submission. Further improvements are also planned to make the platform a cloud service accessible to any online user, building up a community of collaborators for smart-city research.
Open Access
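The trust model described above rests on Bayesian updating over evidence about each sensor. A minimal sketch of that idea, assuming a simple Beta-Bernoulli model and an invented classification threshold (the thesis's actual metrics and model are richer than this):

```python
# Hypothetical sketch of Bayesian trust assessment: treat each agreement or
# disagreement of a sensor's reading with corroborating evidence as a
# Bernoulli trial and maintain a Beta posterior over the sensor's reliability.

def trust_posterior(agreements, disagreements, prior_a=1.0, prior_b=1.0):
    """Posterior mean and evidence strength of a Beta-Bernoulli trust model."""
    a = prior_a + agreements          # pseudo-counts of reliable behaviour
    b = prior_b + disagreements       # pseudo-counts of unreliable behaviour
    mean = a / (a + b)                # expected reliability of the sensor
    return mean, a + b                # strength grows as evidence accumulates

def classify(mean, threshold=0.8):
    """Binary trust classification from the posterior mean (threshold invented)."""
    return "trusted" if mean >= threshold else "untrusted"

mean, strength = trust_posterior(agreements=18, disagreements=2)
print(round(mean, 2), classify(mean))  # prints: 0.86 trusted
```

The uniform Beta(1, 1) prior means a brand-new sensor starts at reliability 0.5, and the posterior sharpens only as evidence arrives, which matches the framework's goal of weighing the different forms of evidence available to the user.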

    Developing bioinformatics approaches for the analysis of influenza virus whole genome sequence data

    Influenza viruses represent a major public health burden worldwide, resulting in an estimated 500,000 deaths per year, with the potential for devastating pandemics. Considerable effort is expended in the surveillance of influenza, including major World Health Organization (WHO) initiatives such as the Global Influenza Surveillance and Response System (GISRS). To this end, whole-genome sequencing (WGS), and corresponding bioinformatics pipelines, have emerged as powerful tools. However, due to the inherent diversity of influenza genomes, circulation in several different host species, and noise in short-read data, several pitfalls can appear during bioinformatics processing and analysis. Results: Conventional mapping approaches can be insufficient when a sub-optimal reference strain is chosen. For short-read datasets simulated from human-origin influenza H1N1 HA sequences, read recovery after single-reference mapping was routinely as low as 90% for human-origin influenza sequences, and often lower than 10% for those from avian hosts. To address this, I developed software using de Bruijn graphs (DBGs) for classification of influenza WGS datasets: VAPOR. In real-data benchmarking using 257 WGS read sets with corresponding de novo assemblies, VAPOR provided classifications for all samples with a mean of >99.8% identity to assembled contigs. This increased the number of mapped reads by 6.8% on average, up to a maximum of 13.3%. Additionally, using simulations, I demonstrate that classification from reads may be applied to the detection of reassorted strains. Conclusions: The approach used in this study has the potential to simplify bioinformatics pipelines for surveillance, providing a novel method for detecting influenza strains of human and non-human origin directly from reads, minimizing the potential data loss and bias associated with conventional mapping, and facilitating alignments that would otherwise require slow de novo assembly.
Whilst with expertise and time these pitfalls can largely be avoided, pre-classification remedies them in a single step. Furthermore, this algorithm could be adapted in future to the surveillance of other RNA viruses. VAPOR is available at https://github.com/connor-lab/vapor. Lastly, VAPOR could be improved by a future implementation in C++ and by more efficient methods for DBG representation.
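To illustrate the classification-from-reads idea, the following sketch scores a read set against candidate reference strains by shared k-mers and picks the best-matching reference; VAPOR's actual DBG-based scoring is more sophisticated, and all strain names and sequences here are invented.

```python
# Hypothetical sketch of k-mer-based reference selection: decompose reads and
# candidate references into k-mers, then choose the reference whose k-mer set
# best covers the k-mers observed in the reads.

K = 5  # toy k-mer size; real influenza tools use much longer k-mers

def kmers(seq, k=K):
    """All k-length substrings of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify_reads(reads, references):
    """Return the name of the reference sharing the most k-mers with the reads."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read)
    return max(references, key=lambda name: len(read_kmers & kmers(references[name])))

refs = {
    "strain_A": "ACGTACGTTGCAACGT",
    "strain_B": "TTTTGGGGCCCCAAAA",
}
reads = ["ACGTACG", "GTTGCAA"]
print(classify_reads(reads, refs))  # -> strain_A
```

Choosing the reference before mapping in this way is what avoids the low read recovery seen when a sub-optimal reference strain is fixed in advance.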

    Acta Cybernetica : Volume 25. Number 2.


    Environmental genetics of root system architecture

    The root system is the plant’s principal organ for water and mineral nutrient supply. Root growth follows an endogenous developmental programme, yet this programme can be modulated by external cues, which makes root system architecture (RSA), the spatial configuration of all root parts, a highly plastic trait. The presence or absence of nutrients such as nitrate (N), phosphate (P), potassium (K) and sulphate (S) serves as an environmental signal to which a plant responds with targeted proliferation or restriction of main or lateral root growth. In turn, RSA serves as a quantitative reporter system for nutrient starvation responses and can therefore be used to study nutrient sensing and signalling mechanisms. In this study, I have analysed the root architectural responses of various Arabidopsis thaliana genotypes (wildtype, mutants and natural accessions) to single and multiple nutrient deficiency treatments. A comprehensive analysis of combinatorial N, P, K and S supply allowed me to dissect the effect of individual nutrients on individual root parameters. It also highlighted the existence of interactive effects arising from simultaneous environmental stimuli. Quantification of appropriate RSA parameters allowed for targeted testing of known regulatory genes in specific nutritional settings. This revealed, for example, a novel role for CIPK23, AKT1 and NRT1.1 in integrating K and N effects on higher-order lateral root branching and main root angle. A significant contribution to phenotypic variation also arose from P*K interactions. I could show that the iron (Fe) concentration in the external medium is an important driving force of RSA responses to low P and low K. In fact, P and K deprivation caused Fe accumulation in distinct parts of the root system, as demonstrated by Fe staining and synchrotron X-ray fluorescence. Again, selected K, P and Fe transport and signalling mutants were tested for aberrant low-K and/or low-P phenotypes.
Most notably, the two paralogous ER-localised multicopper oxidases LPR1 and LPR2 emerged as important signalling components of P and K deprivation, potentially integrating Fe homeostasis with meristematic activity under these conditions. In addition to the targeted characterisation of specific genotype-environment interactions, I investigated novel RSA responses to low K via a non-targeted approach based on natural variation. A morphological gradient spanned the entire genotype set, linking two extreme strategies of low-K response. Strategy I accessions responded to low K with a moderate reduction of main root growth but a severe restriction of lateral root elongation. In contrast, strategy II genotypes ceded main root growth in favour of lateral root proliferation. The genetic basis of these low-K responses was subsequently mapped onto the A. thaliana genome via quantitative trait locus (QTL) analysis, using recombinant inbred lines derived from parental accessions that adopt either strategy I (Col-0) or strategy II (Ct-1). In sum, this study addresses the question of how plants incorporate environmental signals to modulate the developmental programmes that underlie RSA formation. I present evidence for novel phenotypic responses to nutrient deprivation and for novel genetic regulators involved in nutrient signalling and crosstalk.

    Preface


    Front-Line Physicians' Satisfaction with Information Systems in Hospitals

    Day-to-day operations management in hospital units is difficult due to continuously varying situations, the several actors involved, and the vast number of information systems in use. The aim of this study was to describe front-line physicians' satisfaction with the existing information systems needed to support day-to-day operations management in hospitals. A cross-sectional survey was used, and data selected by stratified random sampling were collected in nine hospitals. Data were analyzed with descriptive and inferential statistical methods. The response rate was 65% (n = 111). The physicians reported that information systems support their decision making to some extent, but that the systems neither improve access to information nor are tailored to physicians. The respondents also reported that they need to use several information systems to support decision making and that they would prefer a single information system for accessing important information. Improved information access would better support physicians' decision making and has the potential to improve the quality of decisions and speed up the decision-making process.
Peer reviewed

    A Corpus-driven Approach toward Teaching Vocabulary and Reading to English Language Learners in U.S.-based K-12 Context through a Mobile App

    To ease teachers’ decisions about which vocabulary instruction should focus on, a recent line of research argues that pedagogically prepared word lists may offer the most efficient order for learning vocabulary, with an optimized context for instruction, in each of four K-12 content areas (math, science, social studies, and language arts), by providing English Language Learners (ELLs) with the most frequent words in each area. Educators and school experts have acknowledged the need to develop new materials, including computerized enhanced texts and effective strategies aimed at improving ELLs’ mastery of academic and STEM-related lexicon, since not all words in a language play an equal role in comprehending the language and expressing ideas or thoughts. For this study, I used a corpus-driven approach, operationalized through a text analysis method, with a focus on word lemmas rather than the inflectional and derivational variants of word families. I built two corpora: the Teacher’s U.S. Corpus (TUSC) and the Science and Math Academic Corpus for Kids (SMACK). To create them, I collected and analyzed a total of 122 textbooks commonly used in the states of Florida and California. Recruiting, scanning and converting the textbooks was carried out over more than two years, from October 2014 to March 2017. In total, this school corpus contains 10,519,639 running words and 16,344 lemmas saved in 16,315 Word document pages. From the corpora, I developed six word lists: three frequency-based word lists (high-, mid-, and low-frequency), academic and STEM-related word lists, and an essential word list (EWL). I then used the word lists as the database for a mobile app, Vocabulary in Reading Study (VIRS), available on the App Store, Android and Google Play, alongside a website (www.myvirs.com).
I also developed a new K-12 dictionary targeting the vocabulary needs of ELLs in the K-12 context. This frequency-based dictionary categorizes words into three groups (high-, medium- and low-frequency) as well as two separate sections for academic and STEM words. The dictionary has 16,500 lemmas with their derivational and inflectional forms.
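The frequency-band word lists described above can be illustrated with a minimal lemma-counting sketch; the cutoffs and the toy corpus are invented, and real lemmatization would require an NLP toolkit rather than the simple lowercase tokenization used here.

```python
# Hypothetical sketch of frequency-band word-list construction: count word
# frequencies across a corpus, then split the vocabulary into high-, mid-,
# and low-frequency bands by raw count (cutoffs are invented).
from collections import Counter
import re

def band_word_lists(texts, high_cutoff=3, mid_cutoff=2):
    """Split corpus vocabulary into frequency bands, most frequent first."""
    counts = Counter(w for t in texts for w in re.findall(r"[a-z]+", t.lower()))
    bands = {"high": [], "mid": [], "low": []}
    for word, n in counts.most_common():
        if n >= high_cutoff:
            bands["high"].append(word)
        elif n >= mid_cutoff:
            bands["mid"].append(word)
        else:
            bands["low"].append(word)
    return bands

corpus = ["the cell divides", "the cell grows", "the energy flows"]
print(band_word_lists(corpus))
```

A production pipeline would count lemmas rather than surface forms and derive cutoffs from the corpus size, but the banding step itself is essentially this.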

    27th Annual European Symposium on Algorithms: ESA 2019, September 9-11, 2019, Munich/Garching, Germany
