
    Towards transparent machine learning models using feature sensitivity algorithm

    Despite advances in health care, diabetic ketoacidosis (DKA) remains a potentially serious risk for people with diabetes. Directing diabetes patients to the appropriate unit of care is critical for both patient outcomes and healthcare resources. Missing data occur in almost all machine learning models, especially in production, and can reduce predictive power and produce biased estimates. Imputing a missing value whose estimate sits near a 50 percent probability may lead to a completely different decision. The objective of this paper was to introduce a feature sensitivity score using the proposed feature sensitivity algorithm. The data were electronic health records containing 644 records and 28 attributes. We designed a model using a random forest classifier that predicts the likelihood of a patient developing DKA at the time of admission. The model achieved an accuracy of 80 percent using five attributes; this new model has fewer features than any model mentioned in the literature review. A feature sensitivity score (FSS) was also introduced, which identifies sensitivity within each feature; the proposed algorithm enables physicians to make transparent and accurate decisions at the time of admission. This method can be applied to different diseases and datasets.
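
    One common way to realise a per-feature sensitivity score is to perturb one attribute at a time and measure how much the model's predicted probability moves. The sketch below is a hypothetical illustration of that idea on synthetic data, not the paper's exact FSS algorithm, which is not reproduced in the abstract; all variable names and the perturbation rule are assumptions.

```python
# Hypothetical per-feature sensitivity sketch: nudge one feature and
# measure the mean shift in the predicted DKA probability.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(644, 5))  # 644 records, 5 attributes (counts as in the paper)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=644) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def feature_sensitivity(model, X, j, delta=0.1):
    """Mean absolute change in P(class 1) when feature j is nudged by delta*std."""
    X_pert = X.copy()
    X_pert[:, j] += delta * X[:, j].std()
    p0 = model.predict_proba(X)[:, 1]
    p1 = model.predict_proba(X_pert)[:, 1]
    return np.abs(p1 - p0).mean()

scores = [feature_sensitivity(model, X, j) for j in range(X.shape[1])]
```

    Features whose perturbation barely moves the predicted probability can be dropped, which is one route to the small five-attribute model the abstract describes.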

    Working Paper No. 2017-01, April 2017

    This working paper aggregates the findings of 125 simulation studies that compare imputation methods. First, the design of the studies is examined and those with reliable results are selected. These studies form the basis for an analysis of the imputation methods, which are first considered separately and then compared pairwise. In summary, both analyses show that imputation by adaptive regression, multiple imputation, and maximum-likelihood (ML) parameter-estimation methods are best suited for handling missing values. Beyond the comparison of methods, the studies also permit conclusions about factors that influence imputation quality: both a larger number of objects and a smaller proportion of missing values lead to better results. The aggregation of the studies also reveals further research needs. On the one hand, the effects of the attributes on imputation quality are not clear-cut; on the other hand, many methods have never been compared with one another, or not often enough for robust conclusions. In particular, the three best methods were not directly compared in any study.

    Advances in clustering based on inter-cluster mapping

    Data mining involves searching for certain patterns and facts about the structure of data within large complex datasets. Data mining can reveal valuable and interesting relationships which can improve the operations of business, health and many other disciplines. Extraction of hidden patterns and strategic knowledge from large datasets which are stored electronically is therefore a challenge faced by many organizations. One commonly used technique in data mining for producing useful results is cluster analysis. A basic issue in cluster analysis is deciding the optimal number of clusters for a dataset. A solution to this issue is not straightforward as this form of clustering is unsupervised learning and no clear definition of cluster quality exists. In addition, this issue is more challenging and complicated for multi-dimensional datasets. Finding the estimated number of clusters and their quality is generally based on so-called validation indexes. A limitation of typical existing validation indexes is that they only work well with specific types of datasets compatible with their design assumptions. Also, their results may be inconsistent, and an algorithm may need to be run multiple times to find the best estimate of the number of clusters. Furthermore, these existing approaches may not be effective for complex problems in large datasets with varied structure. To help overcome these deficiencies, an efficient and effective approach for stable estimation of the number of clusters is essential. Many clustering techniques are available, including partitioning, hierarchical, grid-based and model-based clustering. Here we consider only the partitioning method, e.g. the k-means clustering algorithm, for analysing data. This thesis will describe a new approach for stable estimation of the number of clusters, based on use of the k-means clustering algorithm.
First, results obtained from the k-means clustering algorithm will be used to gain a forward and backward mapping of common elements for adjacent and non-adjacent clusters. These will be represented in the form of proportion matrices which will be used to compute combined mapped information using a matrix inner product similarity measure. This will provide indicators for the similarity of mapped elements and overlap (dissimilarity), average similarity and average overlap (average dissimilarity) between clusters. Finally, the estimated number of clusters will be decided using the maximum average similarity, minimum average overlap and coefficient of variation measure. The new approach provides more information than an application of typical existing validation indexes. For example, the new approach offers not only the estimated number of clusters but also gives an indication of fully or partially separated clusters and defines a set of stable clusters for the estimated number of clusters. The advantage of the new approach over several existing validation indexes for evaluating clustering results is demonstrated empirically by applying it to a variety of simulated and real datasets.
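
    The forward/backward mapping idea can be sketched roughly as follows: cluster the data for adjacent values of k, build a contingency table of shared elements, row- and column-normalise it into forward and backward proportion matrices, and combine them with an element-wise inner product. This is a minimal illustration under assumed details; the thesis's actual similarity, overlap, and coefficient-of-variation measures are more elaborate.

```python
# Hedged sketch of inter-cluster mapping between k-means solutions for
# adjacent k, using proportion matrices and an inner-product similarity.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

def proportion_matrices(labels_a, labels_b):
    """Contingency counts normalised by row (forward) and column (backward)."""
    ka, kb = labels_a.max() + 1, labels_b.max() + 1
    counts = np.zeros((ka, kb))
    for a, b in zip(labels_a, labels_b):
        counts[a, b] += 1
    fwd = counts / counts.sum(axis=1, keepdims=True)   # P(cluster b | cluster a)
    bwd = counts / counts.sum(axis=0, keepdims=True)   # P(cluster a | cluster b)
    return fwd, bwd

def avg_similarity(X, k):
    la = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    lb = KMeans(n_clusters=k + 1, n_init=10, random_state=1).fit_predict(X)
    fwd, bwd = proportion_matrices(la, lb)
    # element-wise inner product of the two mappings, averaged over k
    return float((fwd * bwd).sum() / k)

sims = {k: avg_similarity(X, k) for k in range(2, 6)}
```

    A k whose clusters map almost one-to-one into the (k+1)-solution scores high, which is the kind of stability signal the thesis uses to pick the estimated number of clusters.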

    An analysis of missing data treatment methods and their application to health care dataset

    It is well accepted that many real-life datasets are full of missing data. In this paper we introduce, analyze and compare several well-known treatment methods for handling missing data, and propose new methods based on a Naive Bayes classifier to estimate and replace missing values. We conduct extensive experiments on datasets from the UCI repository to compare these methods. Finally, we apply these models to a geriatric hospital dataset in order to assess their effectiveness on a real-life dataset.
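
    A Naive Bayes imputer typically treats the attribute with missing values as the class variable: train on the complete records using the remaining attributes as predictors, then predict the missing entries. The sketch below illustrates this on synthetic categorical data; the data, attribute layout, and missingness rate are all assumptions, not the paper's setup.

```python
# Hedged sketch: impute a missing categorical attribute with a Naive
# Bayes classifier trained on the complete records.
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 4))    # 4 categorical attributes, values 0..2
target_col = 2
missing = rng.random(200) < 0.1          # pretend ~10% of rows miss attribute 2

others = np.delete(X, target_col, axis=1)         # predictor attributes
nb = CategoricalNB().fit(others[~missing], X[~missing, target_col])

imputed = X.copy()
imputed[missing, target_col] = nb.predict(others[missing])
```

    The same pattern can be repeated per attribute with missing values, which is presumably how such a method would scale to a full dataset.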

    Quality of service assessment over multiple attributes

    The development of the Internet and the World Wide Web has led to many services being offered electronically. When there is sufficient demand from consumers for a certain service, multiple providers may exist, each offering identical service functionality but with varying qualities. It is therefore desirable that we are able to assess the quality of a service (QoS), so that service consumers can be given additional guidance in selecting their preferred services. Various methods have been proposed to assess QoS using the data collected by monitoring tools, but they do not deal with multiple QoS attributes adequately. Typically these methods assume that the quality of a service may be assessed by first assessing the quality level delivered by each of its attributes individually, and then aggregating these in some way to give an overall verdict for the service. These methods, however, do not consider interaction among the multiple attributes of a service when some packaging of qualities exists (i.e. multiple levels of quality over multiple attributes for the same service). In this thesis, we propose a method that can give a better prediction in assessing QoS over multiple attributes, especially when the qualities of these attributes are monitored asynchronously. We do so by assessing QoS attributes collectively rather than individually, and employ a k-nearest-neighbour-based technique to deal with asynchronous data. To quantify the confidence of a QoS assessment, we present a probabilistic model that integrates two reliability measures: the number of QoS data items used in the assessment and the variation of data in this dataset. Our empirical evaluation shows that the new method is able to give a better prediction over multiple attributes, and thus provides better guidance for consumers in selecting their preferred services than existing methods do.
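
    One simple way to apply a k-nearest-neighbour rule to asynchronously monitored attributes is to pair each observation of one attribute with the k temporally nearest observations of the other, producing joint samples that can be assessed collectively. The following is an illustrative sketch under assumed data (latency and availability series with independent timestamps); the thesis's actual assessment and confidence model is richer than this.

```python
# Hedged illustration: join asynchronous QoS observations with a
# k-nearest-neighbour rule over timestamps, then assess them jointly.
import numpy as np

rng = np.random.default_rng(0)
t_lat = np.sort(rng.uniform(0, 100, 50))      # latency sample times
latency = rng.normal(120, 15, 50)             # ms, illustrative
t_av = np.sort(rng.uniform(0, 100, 40))       # availability sample times
avail = rng.uniform(0.95, 1.0, 40)            # fraction, illustrative

def knn_join(t_a, v_a, t_b, v_b, k=3):
    """Pair each observation of attribute A with the mean of the k
    temporally nearest observations of attribute B."""
    joined = []
    for ta, va in zip(t_a, v_a):
        idx = np.argsort(np.abs(t_b - ta))[:k]
        joined.append((va, v_b[idx].mean()))
    return np.array(joined)

pairs = knn_join(t_lat, latency, t_av, avail)
qos = pairs.mean(axis=0)   # a crude collective verdict over both attributes
```

    Assessing the joined pairs rather than each series alone preserves cross-attribute structure, e.g. whether low availability coincides with high latency, which per-attribute aggregation discards.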

    Metadata structures for better use of information in animal feeding

    For efficient use of feeds, it is necessary to know in depth both the needs of the animals and the characteristics of the feeds. Regarding the latter, data on the chemical composition and nutritive value of feeds have been obtained systematically in animal nutrition laboratories over the last 200 years (Gizzi and Givens, 2004). However, most of these data are used for a single purpose, whether quality control or the production of scientific results, overlooking their residual value when analysed jointly. Since the beginning of the 20th century, part of this information has been collected in tables, but these have limitations, such as their small size and static nature. Feed databases emerged to overcome these limitations. The Feed Information Service (SIA) of the University of Córdoba has been working for years on building this type of database (Gómez Cabrera et al., 2003), but has encountered difficulties in managing and analysing the accumulated information. The search for solutions to these problems is the starting point of this doctoral thesis. 2. Content of the research: The data accumulated in the SIA lacked the auxiliary information needed for proper interpretation and use. To remedy this, a metadata structure was designed, adapted to the needs of daily information recording in laboratories. In addition, naming conventions and controlled vocabularies were designed for the metadata in order to avoid heterogeneity of the descriptors. Regarding the information-analysis phase, the importance of pre-processing had been identified.
This doctoral thesis studies the behaviour, with respect to the most common outputs of feed databases, of different techniques for integrating diverse data, searching for duplicates, detecting outliers, and managing gaps in the information (missing data). Uni- and multivariate algorithms were studied, as well as global and local approaches to the aforementioned aspects. 3. Conclusion: It is concluded that feed databases built on metadata structures are a strong option for sharing research results (data sharing) and for controlling the heterogeneity typical of animal-feed data. Pre-processing of the information, in particular outlier detection and missing-data handling, proves to be an essential step, with the most suitable algorithms in each case depending on the characteristics of the database and the type of analysis to be carried out. Moreover, although both aspects are usually seen as a problem, studying them yields very valuable qualitative and quantitative information.
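
    The uni- versus multivariate distinction drawn above can be made concrete with two standard outlier rules: a univariate interquartile-range filter applied per attribute, and a multivariate Mahalanobis-distance filter over all attributes at once. This is a generic sketch on synthetic feed-composition-like data, not the thesis's exact procedure; the column meanings and thresholds are assumptions.

```python
# Illustrative uni- vs multivariate outlier detection on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal([30, 8], [3, 1], size=(200, 2))  # e.g. fibre %, ash % (assumed)
data[0] = [60, 8]                                  # plant a gross outlier in row 0

def iqr_outliers(x, k=1.5):
    """Univariate rule: flag values beyond k interquartile ranges."""
    q1, q3 = np.percentile(x, [25, 75])
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return (x < lo) | (x > hi)

def mahalanobis_outliers(X, threshold=3.0):
    """Multivariate rule: flag points far from the mean in Mahalanobis distance."""
    mu = X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = np.sqrt(np.einsum('ij,jk,ik->i', X - mu, cov_inv, X - mu))
    return d > threshold

uni = iqr_outliers(data[:, 0])
multi = mahalanobis_outliers(data)
```

    The multivariate rule can also catch records whose individual values look normal but whose combination is implausible, which is why the choice between the two depends on the database and the intended analysis, as the conclusion notes.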