A systematic review of data quality issues in knowledge discovery tasks

Author: Corrales David Camilo
Corrales Juan Carlos
Ledezma Agapito Ismael
Publication venue: 'Universidad de Medellin'
Publication date: 07/11/2015

The volume of data is growing rapidly because organizations continuously capture data to support better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of the data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Universidad de Medellín: Revistas Científicas

Repositorio Institucional Universidad de Medellín

DIALNET

Predictive modeling of housing instability and homelessness in the Veterans Health Administration

Author: Baggett
Bejan
Burt
DeVoe
Dichter
Elixhauser
Folsom
Fung
Gamache
Garg
Garg
Garg
Gold
Gottlieb
Gottlieb
Green
Greer
Gundlapalli
Hosmer
Hwang
Hwang
James
Japkowicz
Kessler
Kuhn
Kuhn
LaForge
Latimer
McCarthy
Montgomery
Montgomery
Montgomery
Morone
O'Toole
Oreskovic
Peterson
Salit
Shaw
Shelton
Shinn
Tsai
Vickery
Zech
Publication venue: 'Wiley'
Publication date: 01/02/2019

OBJECTIVE: To develop and test predictive models of housing instability and homelessness based on responses to a brief screening instrument administered throughout the Veterans Health Administration (VHA). DATA SOURCES/STUDY SETTING: Electronic medical record data from 5.8 million Veterans who responded to the VHA's Homelessness Screening Clinical Reminder (HSCR) between October 2012 and September 2015. STUDY DESIGN: We randomly selected 80% of Veterans in our sample to develop predictive models. We evaluated the performance of both logistic regression and random forests (a machine learning algorithm) using the remaining 20% of cases. DATA COLLECTION/EXTRACTION METHODS: Data were extracted from two sources: VHA's Corporate Data Warehouse and National Homeless Registry. PRINCIPAL FINDINGS: Performance for all models was acceptable or better. Random forests models were more sensitive in predicting housing instability and homelessness than logistic regression, but less specific in predicting housing instability. Rates of positive screens for both outcomes were highest among Veterans in the top strata of model-predicted risk. CONCLUSIONS: Predictive models based on medical record data can identify Veterans likely to report housing instability and homelessness, making the HSCR screening process more efficient and informing new engagement strategies. Our findings have implications for similar instruments in other health care systems.

Funding: U.S. Department of Veterans Affairs (VA) Health Services Research and Development (HSR&D), Grant/Award Number: IIR 13-334. Accepted manuscript.
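The study design in the abstract above (develop on a random 80% of cases, evaluate sensitivity and specificity on the remaining 20%) can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_80_20` and `sensitivity_specificity`, the fixed seed, and the toy labels are all assumptions made for illustration, and no model is fitted here.

```python
import random


def split_80_20(records, seed=0):
    """Shuffle a copy of `records` and return (development 80%, held-out 20%).

    Hypothetical helper: the seed and the integer rounding are assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy usage with made-up labels (not data from the study).
development, held_out = split_80_20(range(100))
sens, spec = sensitivity_specificity([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

A more sensitive model, as the abstract reports for random forests, would show a higher first value on the held-out cases; a less specific one would show a lower second value.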

Crossref

Boston University Institutional Repository (OpenBU)

Analysis and Detection of Information Types of Open Source Software Issue Discussions

Author: Arya Deeksha
Cheng Jinghui
Guo Jin L. C.
Wang Wenting
Publication date: 01/01/2019

Most modern Issue Tracking Systems (ITSs) for open source software (OSS) projects allow users to add comments to issues. Over time, these comments accumulate into discussion threads embedded with rich information about the software project, which can potentially satisfy the diverse needs of OSS stakeholders. However, discovering and retrieving relevant information from the discussion threads is a challenging task, especially when the discussions are lengthy and the number of issues in ITSs is vast. In this paper, we address this challenge by identifying the information types presented in OSS issue discussions. Through qualitative content analysis of 15 complex issue threads across three projects hosted on GitHub, we uncovered 16 information types and created a labeled corpus containing 4656 sentences. Our investigation of supervised, automated classification techniques indicated that, when prior knowledge about the issue is available, Random Forest can effectively detect most sentence types using conversational features such as the sentence length and its position. When classifying sentences from new issues, Logistic Regression can yield satisfactory performance using textual features for certain information types, while falling short on others. Our work represents a nontrivial first step towards tools and techniques for identifying and obtaining the rich information recorded in the ITSs to support various software engineering activities and to satisfy the diverse needs of OSS stakeholders.

Comment: 41st ACM/IEEE International Conference on Software Engineering (ICSE 2019)

arXiv.org e-Print Archive

Crossref

PolyPublie