30 research outputs found

    Data Science With Excel

    Data science consists of several stages, one of which is data preparation. At this stage, many operations are performed to turn dirty data into clean data that is ready for modeling. Many applications make data science easier by simplifying data processing; one of them is Excel. This Microsoft application can process data so that it is ready for modeling, although it has limitations: a worksheet holds at most 1,048,576 rows and 16,384 columns. For datasets of no more than about one million rows, however, Excel can still handle the work using features such as error detection, duplicate removal, correction of erroneous values, outlier detection, missing-data handling, and data validation. This study demonstrates several of these features along with examples of their use.
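    The cleaning steps the abstract attributes to Excel can be mirrored in a short script. A minimal sketch in pure Python, where the data rows and the median-based heuristics are invented for illustration:

```python
# Sketch of three cleaning steps: removing duplicates, filling missing
# values, and flagging outliers. All data and thresholds are illustrative.
from statistics import median

rows = [
    {"id": 1, "sales": 100.0},
    {"id": 1, "sales": 100.0},   # exact duplicate row
    {"id": 2, "sales": None},    # missing value
    {"id": 3, "sales": 105.0},
    {"id": 4, "sales": 900.0},   # likely outlier
    {"id": 5, "sales": 98.0},
]

# 1. Remove exact duplicates (Excel: "Remove Duplicates").
seen, deduped = set(), []
for r in rows:
    key = (r["id"], r["sales"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2. Fill missing values with the column median (robust to outliers).
observed = [r["sales"] for r in deduped if r["sales"] is not None]
fill = median(observed)
for r in deduped:
    if r["sales"] is None:
        r["sales"] = fill

# 3. Flag values more than 3x the median as likely outliers (crude heuristic).
outliers = [r["id"] for r in deduped if r["sales"] > 3 * fill]

print(len(deduped), outliers)
```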

    Handling incomplete heterogeneous data using VAEs.

    Variational autoencoders (VAEs), like other generative models, have been shown to capture the latent structure of vast amounts of complex high-dimensional data efficiently and accurately. However, existing VAEs still cannot directly handle data that are heterogeneous (mixed continuous and discrete) or incomplete (with values missing at random), which is common in real-world applications. In this paper, we propose a general framework for designing VAEs suitable for fitting incomplete heterogeneous data. The proposed HI-VAE includes likelihood models for real-valued, positive real-valued, interval, categorical, ordinal and count data, and allows accurate estimation (and potentially imputation) of missing data. Furthermore, HI-VAE presents competitive predictive performance in supervised tasks, outperforming supervised models when trained on incomplete data.
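    The core trick for training a generative model on incomplete data is to evaluate the reconstruction log-likelihood only on observed entries, via a binary mask. A minimal sketch with numpy; the names, shapes and values are illustrative, not the actual HI-VAE code:

```python
# Masked Gaussian log-likelihood: unobserved entries (mask == 0) contribute
# nothing, so the decoder's predictions for missing values cannot hurt.
import numpy as np

def masked_gaussian_loglik(x, mask, mu, log_var):
    """Sum log N(x | mu, exp(log_var)) over observed entries (mask == 1)."""
    ll = -0.5 * (np.log(2 * np.pi) + log_var + (x - mu) ** 2 / np.exp(log_var))
    return float(np.sum(ll * mask))

x = np.array([[1.0, 0.0, 2.0]])        # 0.0 is a placeholder for missing
mask = np.array([[1.0, 0.0, 1.0]])     # second feature unobserved
mu = np.array([[1.0, 5.0, 2.0]])       # decoder means
log_var = np.zeros((1, 3))             # unit variances

# The wildly wrong prediction for the missing entry (5.0) does not affect
# the result, because the mask zeroes out its contribution.
print(masked_gaussian_loglik(x, mask, mu, log_var))
```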

    The Two Types of Society: Computationally Revealing Recurrent Social Formations and Their Evolutionary Trajectories

    Comparative social science has a long history of attempts to classify societies and cultures in terms of shared characteristics. However, only recently has it become feasible to conduct quantitative analysis of large historical datasets to mathematically approach the study of social complexity and classify shared societal characteristics. Such methods have the potential to identify recurrent social formations in human societies and contribute to social evolutionary theory. However, in order to achieve this potential, repeated studies are needed to assess the robustness of results to changing methods and data sets. Using an improved derivative of the Seshat: Global History Databank, we perform a clustering analysis of 271 past societies from sampling points across the globe to study plausible categorizations inherent in the data. Analysis indicates that the best fit to Seshat data is five subclusters existing as part of two clearly delineated superclusters (that is, two broad “types” of society in terms of social-ecological configuration). Our results add weight to the idea that human societies form recurrent social formations by replicating previous studies with different methods and data. Our results also contribute nuance to previously established measures of social complexity, illustrate diverse trajectories of change, and shed further light on the finite bounds of human social diversity.
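    The kind of clustering analysis described above can be sketched with a plain k-means on a low-dimensional feature space. The data, seed, and `kmeans` helper below are invented for illustration and are not derived from the Seshat databank:

```python
# Toy k-means splitting invented 2-D "complexity" features into two
# superclusters; fixed initial centroids keep the run reproducible.
import numpy as np

def kmeans(X, k, init_idx, iters=50):
    """Plain k-means with fixed initial centroid indices."""
    centroids = X[list(init_idx)].copy()
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(0)
# Two well-separated "types" of society in a 2-D feature space.
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(5.0, 0.5, (20, 2))])
labels = kmeans(X, 2, init_idx=(0, 20))

# Each blob should fall entirely into one supercluster.
print(sorted(set(labels[:20].tolist())), sorted(set(labels[20:].tolist())))
```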

    Ensemble classification of incomplete data – a non-imputation approach with an application in ovarian tumour diagnosis support

    Faculty of Mathematics and Computer Science (Wydział Matematyki i Informatyki). In this doctoral dissertation I focus on the problem of classification of incomplete data. The motivation for the research comes from medicine, where missing data are commonly encountered. The most popular method of dealing with missingness is imputation, that is, filling in missing values on the basis of statistical relationships among features. In my research I choose a different strategy for dealing with this issue. Previously developed classifiers can be transformed into a form that returns an interval of possible predictions; then, with the use of aggregation operators and thresholding methods, one can make a final classification. I show how to perform such transformations of classifiers and how to use aggregation strategies for interval data in classification. These methods improve the quality of classification of incomplete data in the problem of ovarian tumour diagnosis support. Additional analysis carried out on external datasets from the University of California, Irvine (UCI) Machine Learning Repository shows that the presented methods are complementary to imputation.
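    The interval-prediction pipeline described above can be sketched schematically: each classifier outputs an interval of possible scores given the missing features, the intervals are aggregated, and the result is thresholded. All weights, numbers, and helper names below are illustrative, not the dissertation's actual classifiers:

```python
# Non-imputation handling of missing features: linear classifiers return
# score intervals, which are aggregated and thresholded into a decision.

def interval_predict(weights, x):
    """Linear score as an interval: unknown features (None) range over [0, 1]."""
    lo = hi = 0.0
    for w, v in zip(weights, x):
        if v is None:                      # missing: take worst/best case
            lo += min(w * 0.0, w * 1.0)
            hi += max(w * 0.0, w * 1.0)
        else:
            lo += w * v
            hi += w * v
    return lo, hi

def aggregate(intervals):
    """Mean aggregation operator applied endpoint-wise."""
    lows, highs = zip(*intervals)
    return sum(lows) / len(lows), sum(highs) / len(highs)

def classify(interval, threshold=0.5):
    """Threshold the interval midpoint to obtain a final crisp decision."""
    lo, hi = interval
    return (lo + hi) / 2 >= threshold

# Ensemble of two linear classifiers scoring a sample with one missing feature.
x = [0.8, None, 0.6]
ensemble = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]
intervals = [interval_predict(w, x) for w in ensemble]
print(aggregate(intervals), classify(aggregate(intervals)))
```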

    Classifying Emergency Department Data to Improve Syndromic Surveillance: From Mixed Data Types to ICD Codes and Syndromes

    Syndromic surveillance systems are used to monitor public health and enable timely outbreak detection. Emergency department (ED) data can serve as an important data source for syndromic surveillance, but a large share of missing diagnosis codes can make analyses relying on this information impossible. This study aims to enhance an ED dataset from a piloted syndromic surveillance system in Germany to enable the monitoring of an influenza-like illness (ILI) syndrome. Routinely collected data from one ED containing mixed-type variables are analysed, and two different approaches are implemented to deal with the missing data. In the first approach, the missing diagnosis codes are imputed by predicting them from the remaining variables, using a multi-class naive Bayes classifier and a deep-learning imputation package. In the second approach, a logistic regression model and a binary naive Bayes classifier are used to predict the ILI syndrome from all variables except the diagnosis code. The resulting ILI cases are evaluated at the time-series level with regard to seasonal patterns. The diagnosis codes were predicted from mixed-type input variables with sufficient precision (34.37% F1-measure in the best model), and taking the hierarchical structure of the ICD-10 codes into account improved performance. Predicting the ILI syndrome from the remaining variables, independent of the diagnosis code, also worked well (39.63% F1-measure in the best model), and the predictions showed medical similarity to the ILI syndrome. The models differed in their sensitivity of case inclusion, which can be adjusted by changing the classifiers' threshold. The resulting ILI cases from all models were positively correlated with the reference cases on a time-series basis (r = 0.865 for the best model) and were comparable with an external data source, a surveillance of severe acute respiratory infections (SARI) (r = 0.867 for the best model).
The present study showed that the ED dataset can be enhanced to enable syndromic surveillance of an ILI syndrome based on the diagnosis codes, even when this variable is missing. Additionally, a flexible case definition for an ILI syndrome was developed that is independent of the diagnosis code, and the underlying generic method can be applied to other syndromes as well.
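    The second approach, predicting an ILI flag from the remaining variables with a classifier whose decision threshold tunes sensitivity, can be sketched with a toy naive Bayes. The training data and feature names below are invented, not the German ED dataset:

```python
# Tiny binary naive Bayes over categorical features, with a tunable
# decision threshold on the posterior probability.
from collections import defaultdict

train = [  # (features, is_ili) -- invented examples
    ({"fever": "yes", "age": "adult"}, 1),
    ({"fever": "yes", "age": "child"}, 1),
    ({"fever": "no",  "age": "adult"}, 0),
    ({"fever": "no",  "age": "child"}, 0),
    ({"fever": "yes", "age": "adult"}, 1),
    ({"fever": "no",  "age": "adult"}, 0),
]

def fit(data):
    prior = defaultdict(int)               # class -> count
    counts = defaultdict(int)              # (class, feature, value) -> count
    for feats, y in data:
        prior[y] += 1
        for f, v in feats.items():
            counts[(y, f, v)] += 1
    return prior, counts

def prob_ili(feats, prior, counts):
    # Class scores with Laplace smoothing, normalised to a posterior.
    scores = {}
    for y in (0, 1):
        p = prior[y] / sum(prior.values())
        for f, v in feats.items():
            p *= (counts[(y, f, v)] + 1) / (prior[y] + 2)
        scores[y] = p
    return scores[1] / (scores[0] + scores[1])

prior, counts = fit(train)
p = prob_ili({"fever": "yes", "age": "child"}, prior, counts)
# Lowering the threshold includes more cases (higher sensitivity).
print(p, p >= 0.5, p >= 0.9)
```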

    ANN-MIND: dropout for neural network training with missing data

    M.Sc. (Computer Science). Abstract: It is a well-known fact that the quality of a dataset plays a central role in the results and conclusions drawn from its analysis; as the saying goes, "garbage in, garbage out". In recent years, neural networks have displayed good performance in solving a diverse range of problems. Unfortunately, neural networks are not immune to the misfortune presented by missing values. Furthermore, in most real-world settings, the only data available for training neural networks often contain missing values. In such cases, we are left with little choice but to use these data for training, although doing so may result in a poorly trained neural network. Most systems currently in use merely discard the incomplete observations from the training datasets, while others proceed to use the data while ignoring the problems presented by the missing values. Still other approaches impute the missing values with fixed constants such as the mean or mode. Most neural network models work under the assumption that the supplied data contain no missing values. This dissertation explores a method for training neural networks when the training dataset contains missing values.
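    The dropout analogy can be made concrete: missing inputs are zeroed by a mask, just as dropout zeroes random units, so the forward pass never sees NaNs. A minimal numpy sketch; the layer shapes and values are illustrative only:

```python
# One dense layer whose input has missing entries removed by a mask,
# mirroring how dropout multiplies activations by a binary mask.
import numpy as np

def forward(x, mask, W, b):
    """Dense layer + ReLU; masked (missing) inputs contribute nothing."""
    x_clean = np.where(mask, x, 0.0)         # zero out missing values
    return np.maximum(0.0, x_clean @ W + b)  # ReLU activation

x = np.array([1.0, np.nan, 2.0])             # second feature missing
mask = ~np.isnan(x)
W = np.ones((3, 2))
b = np.zeros(2)

print(forward(x, mask, W, b))                # NaN never propagates
```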

    Big Data Analysis application in the renewable energy market: wind power

    Among renewable energy sources, wind power is one of the fastest-growing technologies worldwide. Its output is uncertain, however, and this uncertainty should be minimised in order to better schedule and manage the traditional generation assets that compensate for electricity shortfalls in power grids. The emergence of data-driven and machine-learning techniques has made it possible to produce high-resolution spatial and temporal predictions of wind speed and power. In this work, three different ANN models are developed, addressing three major problems in time-series prediction with this technique: data quality assurance and imputation of invalid data, hyperparameter assignment, and feature selection. The developed models rely on clustering, optimisation and signal-processing techniques to provide short- and medium-term (minutes to hours) predictions of wind speed and power.
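    The first of the three problems, quality assurance and imputation of invalid readings in a wind-speed series, can be sketched with a simple rule: replace invalid values (negative or absent) by linear interpolation between the nearest valid neighbours. The data and helper below are invented for illustration:

```python
# Impute invalid readings in a time series by linear interpolation
# between the nearest valid neighbours (edges copy the nearest value).

def impute_invalid(series):
    valid = lambda v: v is not None and v >= 0.0
    out = list(series)
    idx = [i for i, v in enumerate(out) if valid(v)]   # original valid points
    for i, v in enumerate(out):
        if valid(v):
            continue
        left = max((j for j in idx if j < i), default=None)
        right = min((j for j in idx if j > i), default=None)
        if left is None:
            out[i] = out[right]
        elif right is None:
            out[i] = out[left]
        else:  # linear interpolation between valid neighbours
            t = (i - left) / (right - left)
            out[i] = out[left] + t * (out[right] - out[left])
    return out

wind = [5.0, None, 7.0, -1.0, 9.0]   # m/s; None and -1.0 are invalid
print(impute_invalid(wind))
```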

    Interpretable Models Capable of Handling Systematic Missingness in Imbalanced Classes and Heterogeneous Datasets

    The application of interpretable machine learning techniques to medical datasets facilitates early and fast diagnoses and yields deeper insight into the data, while the transparency of these models increases trust among application domain experts. Medical datasets face common issues, such as heterogeneous measurements, imbalanced classes with limited sample sizes, and missing data, which hinder the straightforward application of machine learning techniques. In this paper we present a family of prototype-based (PB) interpretable models that are capable of handling these issues. The models introduced in this contribution show comparable or superior performance to alternative techniques applicable in such situations; however, unlike ensemble-based models, which have to compromise on easy interpretation, the PB models here do not. Moreover, we propose a strategy for harnessing the power of ensembles while maintaining the intrinsic interpretability of the PB models, by averaging the model parameter manifolds. All the models were evaluated on a synthetic (publicly available) dataset, in addition to detailed analyses of two real-world medical datasets (one publicly available). The results indicate that the models and strategies we introduce address the challenges of real-world medical data while remaining computationally inexpensive and transparent, as well as similar or superior in performance compared to their alternatives.
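    The prototype-based idea can be sketched in a few lines: classify by the nearest class prototype, computing the distance only over observed features so that systematically missing entries are tolerated. The prototypes, labels, and samples below are invented, not taken from the paper:

```python
# Nearest-prototype classification with a distance restricted to
# observed (non-NaN) dimensions, so missing features are simply ignored.
import numpy as np

def masked_distance(x, prototype):
    obs = ~np.isnan(x)                 # ignore missing dimensions
    d = x[obs] - prototype[obs]
    return float(np.sqrt(np.sum(d * d)))

def predict(x, prototypes, labels):
    dists = [masked_distance(x, p) for p in prototypes]
    return labels[int(np.argmin(dists))]

prototypes = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
labels = ["healthy", "disease"]

x = np.array([4.8, np.nan, 5.2])       # one measurement missing
print(predict(x, prototypes, labels))  # nearest prototype decides
```

Interpretability follows from the representation itself: each decision is explained by pointing at the prototype, a point in the same feature space as the patients, that the sample was closest to.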