
    Towards transparent machine learning models using feature sensitivity algorithm

    Despite advances in health care, diabetic ketoacidosis (DKA) remains a potentially serious risk for people with diabetes. Directing diabetes patients to the appropriate unit of care is critical for both patient outcomes and healthcare resources. Missing data occur in almost all machine learning models, especially in production, and can reduce predictive power and produce biased estimates. Imputing a missing value whose estimate sits near a 50 percent probability may lead to a completely different decision. The objective of this paper was to introduce a feature sensitivity score using the proposed feature sensitivity algorithm. The data were electronic health records containing 644 records and 28 attributes. We designed a model using a random forest classifier that predicts the likelihood of a patient developing DKA at the time of admission. The model achieved an accuracy of 80 percent using five attributes; this new model has fewer features than any model mentioned in the literature review. A feature sensitivity score (FSS) was also introduced, which identifies sensitivity within each feature; the proposed algorithm enables physicians to make transparent and accurate decisions at the time of admission. This method can be applied to different diseases and datasets.
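
    One common way to realise a per-feature sensitivity score is to perturb one attribute at a time and measure how much the model's predicted probability moves. The sketch below is a hypothetical illustration of that idea on synthetic data, not the paper's exact FSS algorithm, which is not reproduced in the abstract; all variable names and the perturbation rule are assumptions.

```python
# Hypothetical per-feature sensitivity sketch: nudge one feature and
# measure the mean shift in the predicted DKA probability.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(644, 5))  # 644 records, 5 attributes (counts as in the paper)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=644) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def feature_sensitivity(model, X, j, delta=0.1):
    """Mean absolute change in P(class 1) when feature j is nudged by delta*std."""
    X_pert = X.copy()
    X_pert[:, j] += delta * X[:, j].std()
    p0 = model.predict_proba(X)[:, 1]
    p1 = model.predict_proba(X_pert)[:, 1]
    return np.abs(p1 - p0).mean()

scores = [feature_sensitivity(model, X, j) for j in range(X.shape[1])]
```

    Features whose perturbation barely moves the predicted probability can be dropped, which is one route to the small five-attribute model the abstract describes.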

    Working Paper No. 2017-01, April 2017

    This working paper aggregates the findings of 125 simulation studies that compare imputation methods. First, the design of the studies is examined and those with reliable results are selected. These studies form the basis for an analysis of the imputation methods, which are first considered separately and then compared pairwise. In summary, both analyses show that imputation by adaptive regression, multiple imputation, and maximum-likelihood (ML) parameter-estimation methods are best suited for handling missing values. Beyond the comparison of methods, the studies also permit conclusions about factors that influence imputation quality: both a larger number of objects and a smaller proportion of missing values lead to better results. The aggregation of the studies also reveals further research needs. On the one hand, the effects of the attributes on imputation quality are not clear-cut; on the other hand, many methods have never been compared with one another, or not often enough for robust conclusions. In particular, the three best methods were not directly compared in any study.

    Advances in clustering based on inter-cluster mapping

    Data mining involves searching for certain patterns and facts about the structure of data within large complex datasets. Data mining can reveal valuable and interesting relationships which can improve the operations of business, health and many other disciplines. Extraction of hidden patterns and strategic knowledge from large datasets which are stored electronically is therefore a challenge faced by many organizations. One commonly used technique in data mining for producing useful results is cluster analysis. A basic issue in cluster analysis is deciding the optimal number of clusters for a dataset. A solution to this issue is not straightforward as this form of clustering is unsupervised learning and no clear definition of cluster quality exists. In addition, this issue is more challenging and complicated for multi-dimensional datasets. Finding the estimated number of clusters and their quality is generally based on so-called validation indexes. A limitation of typical existing validation indexes is that they only work well with specific types of datasets compatible with their design assumptions. Also, their results may be inconsistent, and an algorithm may need to be run multiple times to find the best estimate of the number of clusters. Furthermore, these existing approaches may not be effective for complex problems in large datasets with varied structure. To help overcome these deficiencies, an efficient and effective approach for stable estimation of the number of clusters is essential. Many clustering techniques are available, including partitioning, hierarchical, grid-based and model-based clustering. Here we consider only the partitioning method, e.g. the k-means clustering algorithm, for analysing data. This thesis will describe a new approach for stable estimation of the number of clusters, based on use of the k-means clustering algorithm.
First, results obtained from the k-means clustering algorithm will be used to gain a forward and backward mapping of common elements for adjacent and non-adjacent clusters. These will be represented in the form of proportion matrices which will be used to compute combined mapped information using a matrix inner product similarity measure. This will provide indicators for the similarity of mapped elements and overlap (dissimilarity), average similarity and average overlap (average dissimilarity) between clusters. Finally, the estimated number of clusters will be decided using the maximum average similarity, minimum average overlap and coefficient of variation measure. The new approach provides more information than an application of typical existing validation indexes. For example, the new approach offers not only the estimated number of clusters but also gives an indication of fully or partially separated clusters and defines a set of stable clusters for the estimated number of clusters. The advantage of the new approach over several existing validation indexes for evaluating clustering results is demonstrated empirically by applying it to a variety of simulated and real datasets.
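
    The forward/backward mapping idea can be sketched roughly as follows: cluster the data for adjacent values of k, build a contingency table of shared elements, row- and column-normalise it into forward and backward proportion matrices, and combine them with an element-wise inner product. This is a minimal illustration under assumed details; the thesis's actual similarity, overlap, and coefficient-of-variation measures are more elaborate.

```python
# Hedged sketch of inter-cluster mapping between k-means solutions for
# adjacent k, using proportion matrices and an inner-product similarity.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

def proportion_matrices(labels_a, labels_b):
    """Contingency counts normalised by row (forward) and column (backward)."""
    ka, kb = labels_a.max() + 1, labels_b.max() + 1
    counts = np.zeros((ka, kb))
    for a, b in zip(labels_a, labels_b):
        counts[a, b] += 1
    fwd = counts / counts.sum(axis=1, keepdims=True)   # P(cluster b | cluster a)
    bwd = counts / counts.sum(axis=0, keepdims=True)   # P(cluster a | cluster b)
    return fwd, bwd

def avg_similarity(X, k):
    la = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    lb = KMeans(n_clusters=k + 1, n_init=10, random_state=1).fit_predict(X)
    fwd, bwd = proportion_matrices(la, lb)
    # element-wise inner product of the two mappings, averaged over k
    return float((fwd * bwd).sum() / k)

sims = {k: avg_similarity(X, k) for k in range(2, 6)}
```

    A k whose clusters map almost one-to-one into the (k+1)-solution scores high, which is the kind of stability signal the thesis uses to pick the estimated number of clusters.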

    An analysis of missing data treatment methods and their application to health care dataset

    It is well accepted that many real-life datasets are full of missing data. In this paper we introduce, analyze and compare several well-known treatment methods for handling missing data, and propose new methods based on a Naive Bayes classifier to estimate and replace missing values. We conduct extensive experiments on datasets from the UCI repository to compare these methods. Finally, we apply these models to a geriatric hospital dataset in order to assess their effectiveness on a real-life dataset.
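
    A Naive Bayes imputer typically treats the attribute with missing values as the class variable: train on the complete records using the remaining attributes as predictors, then predict the missing entries. The sketch below illustrates this on synthetic categorical data; the data, attribute layout, and missingness rate are all assumptions, not the paper's setup.

```python
# Hedged sketch: impute a missing categorical attribute with a Naive
# Bayes classifier trained on the complete records.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 4))    # 4 categorical attributes, values 0..2
target_col = 2
missing = rng.random(200) < 0.1          # pretend ~10% of rows miss attribute 2

others = np.delete(X, target_col, axis=1)         # predictor attributes
nb = CategoricalNB().fit(others[~missing], X[~missing, target_col])

imputed = X.copy()
imputed[missing, target_col] = nb.predict(others[missing])
```

    The same pattern can be repeated per attribute with missing values, which is presumably how such a method would scale to a full dataset.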

    Quality of service assessment over multiple attributes

    The development of the Internet and the World Wide Web has led to many services being offered electronically. When there is sufficient demand from consumers for a certain service, multiple providers may exist, each offering identical service functionality but with varying qualities. It is therefore desirable that we are able to assess the quality of a service (QoS), so that service consumers can be given additional guidance in selecting their preferred services. Various methods have been proposed to assess QoS using the data collected by monitoring tools, but they do not deal with multiple QoS attributes adequately. Typically these methods assume that the quality of a service may be assessed by first assessing the quality level delivered by each of its attributes individually, and then aggregating these in some way to give an overall verdict for the service. These methods, however, do not consider interaction among the multiple attributes of a service when some packaging of qualities exists (i.e. multiple levels of quality over multiple attributes for the same service). In this thesis, we propose a method that can give a better prediction in assessing QoS over multiple attributes, especially when the qualities of these attributes are monitored asynchronously. We do so by assessing QoS attributes collectively rather than individually, and employ a k-nearest-neighbour-based technique to deal with asynchronous data. To quantify the confidence of a QoS assessment, we present a probabilistic model that integrates two reliability measures: the number of QoS data items used in the assessment and the variation of data in this dataset. Our empirical evaluation shows that the new method is able to give a better prediction over multiple attributes, and thus provides better guidance for consumers in selecting their preferred services than existing methods do.
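
    One simple way to apply a k-nearest-neighbour rule to asynchronously monitored attributes is to pair each observation of one attribute with the k temporally nearest observations of the other, producing joint samples that can be assessed collectively. The following is an illustrative sketch under assumed data (latency and availability series with independent timestamps); the thesis's actual assessment and confidence model is richer than this.

```python
# Hedged illustration: join asynchronous QoS observations with a
# k-nearest-neighbour rule over timestamps, then assess them jointly.
import numpy as np

rng = np.random.default_rng(0)
t_lat = np.sort(rng.uniform(0, 100, 50))      # latency sample times
latency = rng.normal(120, 15, 50)             # ms, illustrative
t_av = np.sort(rng.uniform(0, 100, 40))       # availability sample times
avail = rng.uniform(0.95, 1.0, 40)            # fraction, illustrative

def knn_join(t_a, v_a, t_b, v_b, k=3):
    """Pair each observation of attribute A with the mean of the k
    temporally nearest observations of attribute B."""
    joined = []
    for ta, va in zip(t_a, v_a):
        idx = np.argsort(np.abs(t_b - ta))[:k]
        joined.append((va, v_b[idx].mean()))
    return np.array(joined)

pairs = knn_join(t_lat, latency, t_av, avail)
qos = pairs.mean(axis=0)   # a crude collective verdict over both attributes
```

    Assessing the joined pairs rather than each series alone preserves cross-attribute structure, e.g. whether low availability coincides with high latency, which per-attribute aggregation discards.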

    Metadata structures for better use of information in animal feeding

    For efficient use of feeds, it is necessary to know in depth both the needs of the animals and the characteristics of the feeds. Regarding the latter, data on the chemical composition and nutritive value of feeds have been obtained systematically in animal nutrition laboratories over the last 200 years (Gizzi and Givens, 2004). However, most of these data are used for a single purpose, whether quality control or the production of scientific results, overlooking their residual value when analysed jointly. Since the beginning of the 20th century, part of this information has been collected in tables, but these have limitations, such as their small size and static nature. Feed databases emerged to overcome these limitations. The Feed Information Service (SIA) of the University of Córdoba has been working for years on building this type of database (Gómez Cabrera et al., 2003), but has encountered difficulties in managing and analysing the accumulated information. The search for solutions to these problems is the starting point of this doctoral thesis. 2. Content of the research: The data accumulated in the SIA lacked the auxiliary information needed for proper interpretation and use. To remedy this, a metadata structure was designed, adapted to the needs of daily information recording in laboratories. In addition, naming conventions and controlled vocabularies were designed for the metadata in order to avoid heterogeneity of the descriptors. Regarding the information-analysis phase, the importance of pre-processing had been identified.
This doctoral thesis studies the behaviour, with respect to the most common outputs of feed databases, of different techniques for integrating diverse data, searching for duplicates, detecting outliers, and managing gaps in the information (missing data). Uni- and multivariate algorithms were studied, as well as global and local approaches to the aforementioned aspects. 3. Conclusion: It is concluded that feed databases built on metadata structures are a strong option for sharing research results (data sharing) and for controlling the heterogeneity typical of animal-feed data. Pre-processing of the information, in particular outlier detection and missing-data handling, proves to be an essential step, with the most suitable algorithms in each case depending on the characteristics of the database and the type of analysis to be carried out. Moreover, although both aspects are usually seen as a problem, studying them yields very valuable qualitative and quantitative information.
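
    The uni- versus multivariate distinction drawn above can be made concrete with two standard outlier rules: a univariate interquartile-range filter applied per attribute, and a multivariate Mahalanobis-distance filter over all attributes at once. This is a generic sketch on synthetic feed-composition-like data, not the thesis's exact procedure; the column meanings and thresholds are assumptions.

```python
# Illustrative uni- vs multivariate outlier detection on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal([30, 8], [3, 1], size=(200, 2))  # e.g. fibre %, ash % (assumed)
data[0] = [60, 8]                                  # plant a gross outlier in row 0

def iqr_outliers(x, k=1.5):
    """Univariate rule: flag values beyond k interquartile ranges."""
    q1, q3 = np.percentile(x, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return (x < lo) | (x > hi)

def mahalanobis_outliers(X, threshold=3.0):
    """Multivariate rule: flag points far from the mean in Mahalanobis distance."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.einsum('ij,jk,ik->i', X - mu, cov_inv, X - mu))
    return d > threshold

uni = iqr_outliers(data[:, 0])
multi = mahalanobis_outliers(data)
```

    The multivariate rule can also catch records whose individual values look normal but whose combination is implausible, which is why the choice between the two depends on the database and the intended analysis, as the conclusion notes.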