45 research outputs found

    High-throughput machine learning algorithms

    The field of machine learning has become strongly compute driven, such that emerging research and applications require larger amounts of specialised hardware or smarter algorithms to advance beyond the state-of-the-art. This thesis develops specialised techniques and algorithms for a subset of computationally difficult machine learning problems. The applications under investigation are quantile approximation in the limited-memory data streaming setting, interpretability of decision tree ensembles, efficient sampling methods in the space of permutations, and the generation of large numbers of pseudorandom permutations. These specific applications are investigated as they represent significant bottlenecks in real-world machine learning pipelines, where improvements to throughput have significant impact on the outcomes of machine learning projects in both industry and research. To address these bottlenecks, we discuss both theoretical improvements, such as improved convergence rates, and hardware/software related improvements, such as optimised algorithm design for high throughput hardware accelerators. Some contributions include: the evaluation of bin-packing methods for efficiently scheduling small batches of dependent computations to GPU hardware execution units, numerically stable reduction operators for higher-order statistical moments, and memory bandwidth optimisation for GPU shuffling. Additionally, we apply theory of the symmetric group of permutations in reproducing kernel Hilbert spaces, resulting in improved analysis of Monte Carlo methods for Shapley value estimation and new, computationally more efficient algorithms based on kernel herding and Bayesian quadrature. We also utilise reproducing kernels over permutations to develop a novel statistical test for the hypothesis that a sample of permutations is drawn from a uniform distribution. 
The techniques discussed lie at the intersection of machine learning, high-performance computing, and applied mathematics. Much of the above work resulted in open source software used in real applications, including the GPUTreeShap library [38], shuffling primitives for the Thrust parallel computing library [2], extensions to the Shap package [31], and extensions to the XGBoost library [6].
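
The abstract above mentions Monte Carlo methods for Shapley value estimation over permutations. As a rough illustration only, the sketch below shows the plain permutation-sampling baseline: it averages each player's marginal contribution over uniformly sampled orderings. It is not the thesis's kernel herding or Bayesian quadrature algorithm, and the function name, the toy game and the `WEIGHTS` values are assumptions made for this example.

```python
import random
from typing import Callable, FrozenSet, List


def shapley_permutation_estimate(
    value: Callable[[FrozenSet[int]], float],
    n_players: int,
    n_permutations: int,
    seed: int = 0,
) -> List[float]:
    """Plain Monte Carlo Shapley estimate from uniformly sampled permutations."""
    rng = random.Random(seed)
    players = list(range(n_players))
    totals = [0.0] * n_players
    for _ in range(n_permutations):
        rng.shuffle(players)  # draw one uniform random ordering of the players
        coalition: FrozenSet[int] = frozenset()
        previous = value(coalition)
        for p in players:
            # Marginal contribution of p when it joins the players before it.
            coalition = coalition | {p}
            current = value(coalition)
            totals[p] += current - previous
            previous = current
    return [t / n_permutations for t in totals]


# Hypothetical toy game: in an additive game the exact Shapley values
# are simply the individual weights, which makes the estimate easy to check.
WEIGHTS = [1.0, 2.0, 3.0]


def additive_game(coalition: FrozenSet[int]) -> float:
    return sum(WEIGHTS[i] for i in coalition)


estimates = shapley_permutation_estimate(additive_game, 3, 50)
```

For games that are not additive the estimate carries sampling error, which is the kind of error an improved sampling scheme over permutations would aim to reduce.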

    Intelligent Data Analysis for Energy Management

    Predictive data analysis has been identified as essential to support intelligent energy management for better energy sustainability and efficiency. Previous studies have shown that predicted energy information can benefit consumers economically by optimising energy usage while assisting energy suppliers in efficiently planning power distribution and implementing demand response (DR) energy management. Recent advances in the Internet of Things (IoT) and Information and Communication Technologies (ICT) simplify the collection of desired energy data streams for further informatics analysis. With such energy data, machine learning (ML) is widely used to infer future knowledge associated with online energy resource scheduling, e.g., renewable energy generation, load demands and electricity prices. Although some early efforts have been dedicated to incorporating ML into energy management, computation resource limitations and data scarcity are two pressing challenges for on-site predictive energy analysis. Due to privacy concerns, users prefer on-premise model establishment instead of placing the training task in the cloud and sharing sensitive energy data. However, most ML algorithms rely heavily on substantial computational resources and vast amounts of labelled data to succeed, and users are often unable to fulfil these requirements in real-world scenarios. To this end, this thesis proposes, from different perspectives, several affordable solutions for performing on-demand intelligent data analysis on local resource-constrained devices. In addition, three algorithm-specific training frameworks have been developed to address data shortage by leveraging easily obtainable but extensive data sources through transfer learning and federated learning. We implement our designs under practical settings for photovoltaic (PV) power prediction and non-intrusive load monitoring (NILM) as case studies to fully evaluate their performance.

    Diagnosis and Prognosis of Occupational disorders based on Machine Learning Techniques applied to Occupational Profiles

    Work-related disorders have a global influence on people’s well-being and quality of life and are a financial burden for organizations because they reduce productivity, increase absenteeism, and promote early retirement. Work-related musculoskeletal disorders, in particular, represent a significant fraction of the total in all occupational contexts. In automotive and industrial settings where workers are exposed to risk factors for work-related musculoskeletal disorders, occupational physicians are responsible for monitoring workers’ health protection profiles. Occupational technicians refer to the Occupational Health Protection Profiles database to understand which exposure to occupational work-related musculoskeletal disorder risk factors should be ensured for a given worker. Occupational Health Protection Profiles databases describe the occupational physician’s statements and which exposure the physician considers necessary to ensure the worker’s health protection in terms of their functional work ability. The application of Human-Centered explainable artificial intelligence can support decision making, going from a worker’s Functional Work Ability to explanations, by integrating explainability into medical (restriction) and supporting two decision contexts: prognosis and diagnosis of individual, work-related and organizational risk conditions. Although previous machine learning approaches provided good predictions, their application in an actual occupational setting is limited because their predictions are difficult to interpret and hence not actionable. This thesis targets the injured body parts for which the ability changed in a worker’s functional work ability status.
On the one hand, artificial intelligence algorithms can help technical teams, occupational physicians, and ergonomists determine a worker’s workplace risk via the diagnosis and prognosis of body part(s) injuries; on the other hand, these approaches can help prevent work-related musculoskeletal disorders by identifying which processes are lacking in working condition improvement and which workplaces have a better match between the remaining functional work abilities. This thesis used a database of Occupational Health Protection Profiles based on Functional Work Ability textual reports in the Portuguese language from an automotive industry factory (Auto Europa). A sample of 2025 files was used for the prognosis part (from 2019 to 2020) and a sample of 7857 files for the diagnosis part. Machine learning-based Natural Language Processing methods were implemented to extract standardized information. The prognosis and diagnosis of Occupational Health Protection Profiles factors were developed in a trustworthy Human-Centered explainable artificial intelligence system (entitled Industrial microErgo application). The most suitable regression models to predict the next medical appointment for the injured body regions were those based on CatBoost regression, with an R-squared of 0.84 and an RMSLE of 1.23 weeks. In parallel, for predicting the next injured body parts, CatBoost was the best regression model for most body parts according to these two error measures. This information can help technical industrial teams understand potential risk factors for Occupational Health Protection Profiles and identify warning signs of the early stages of musculoskeletal disorders.

    Applications of Machine Learning: From Single Cell Biology to Algorithmic Fairness

    It is common practice to obtain answers to complex questions by analyzing large amounts of data. Formal modeling and careful mathematical definitions are essential to extracting relevant answers from data, and establishing a mathematical framework requires deliberate interdisciplinary collaboration between the specialists who provide the questions and the mathematicians who translate them. This dissertation details the results of two of these interdisciplinary collaborations: one in single cell RNA sequencing, and the other in fairness. High throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect integer valued mRNA counts from many individual cells in a single experiment; this enables high resolution studies of rare cell types and cell development pathways. ScRNA-seq data are sparse: often 90% of the collected reads are zeros. Specialized methods are required to obtain solutions to biological questions from these sparse, integer-valued data. Determining genetic markers that can identify specific cell populations is one of the major objectives of the analysis of mRNA count data. We introduce RANKCORR, a fast method with robust mathematical underpinnings that performs multi-class marker selection. RANKCORR proceeds by ranking the mRNA count data before linearly separating the ranked data using a small number of genes. Ranking scRNA-seq count data provides a reasonable non-parametric method for analyzing these data; we further include an analysis of the statistical properties of this rank transformation. We compare the performance of RANKCORR to a variety of other marker selection methods. These experiments show that RANKCORR is consistently one of the top-performing marker selection methods on scRNA-seq data, though other methods show similar overall performance. This suggests that the speed of the algorithm is the most important consideration for large data sets. 
RANKCORR is efficient and able to handle the largest data sets; as such, it is a useful tool for dealing with high throughput scRNA-seq data. The second collaboration combines state of the art machine learning methods with formal definitions of fairness. Machine learning methods have a tendency to preserve or exacerbate biases that exist in data; consequently, the algorithms that influence our daily lives often display biases against certain protected groups. It is both objectionable and often illegal to allow daily decisions (e.g. mortgage approvals, job advertisements) to disadvantage protected groups; a growing body of literature in the field of algorithmic fairness aims to mitigate these issues. We contribute two methods towards this goal. We first introduce a preprocessing method designed to debias the training data. Specifically, the method attempts to remove any variation in the original data that comes from protected group status. This is accomplished by leveraging knowledge of groups that we expect to receive similar outcomes from a fair algorithm. We further present a method for training a classifier (from potentially biased data) that is both accurate and fair using the gradient boosting framework. Gradient boosting is a powerful method for constructing predictive models that can be superior to neural networks on tabular data; the development of a fair gradient boosting method is thus desirable for the adoption of fair methods. Moreover, the method that we present is designed to construct predictors that are fair at an individual level - that is, two comparable individuals will be assigned similar results. This is different from most of the existing fair algorithms that ensure fairness at a statistical level.
PhD, Mathematics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163215/1/ahsvargo_1.pd

    Season-Based Occupancy Prediction in Residential Buildings Using Data Mining Techniques

    Considering the continuous increase of global energy consumption and the fact that buildings account for a large part of electricity use, it is essential to reduce energy consumption in buildings to mitigate greenhouse gas emissions and costs for both building owners and tenants. A reliable occupancy prediction model plays a critical role in improving the performance of energy simulation and occupant-centric building operations. In general, occupancy and occupant activities differ by season, and it is important to account for the dynamic nature of occupancy in simulations and to propose energy-efficient strategies. The present work aims to develop a data mining-based framework, including feature selection and the establishment of seasonal-customized occupancy prediction (SCOP) models to predict the occupancy in buildings considering different seasons. In the proposed framework, the recursive feature elimination with cross-validation (RFECV) feature selection was first implemented to select the optimal variables concerning the highest prediction accuracy. Later, six machine learning (ML) algorithms were considered to establish four SCOP models to predict occupancy presence, and their prediction performances were compared in terms of prediction accuracy and computational cost. To evaluate the effectiveness of the developed data mining framework, it was applied to an apartment in Lyon, France. The results show that the RFECV process reduced the computational time while improving the ML models’ prediction performances. Additionally, the SCOP models could achieve higher prediction accuracy than the conventional prediction model measured by performance evaluation metrics of F-1 score and area under the curve. Among the considered ML models, the gradient-boosting decision tree, random forest, and artificial neural network showed better performances, achieving more than 85% accuracy in Summer, Fall, and Winter, and over 80% in Spring. 
The essence of the framework is valuable for developing strategies for building energy consumption estimation and higher-resolution occupancy level prediction, which are easily influenced by seasons.

    Interpretable AI-based large-scale 3D pathloss prediction model for enabling emerging self-driving networks

    In modern wireless communication systems, radio propagation modeling to estimate pathloss has always been a fundamental task in system design and optimization. The state-of-the-art empirical propagation models are based on measurements in specific environments and are limited in their ability to capture the idiosyncrasies of various propagation environments. To cope with this problem, ray-tracing based solutions are used in commercial planning tools, but they tend to be extremely time-consuming and expensive. We propose a Machine Learning (ML)-based model that leverages novel key predictors for estimating pathloss. By quantitatively evaluating various ML algorithms in terms of predictive, generalization and computational performance, we show that the Light Gradient Boosting Machine (LightGBM) algorithm overall outperforms the others, even with sparse training data, providing a 65% increase in prediction accuracy compared to empirical models and a 13x decrease in prediction time compared to ray-tracing. To address the interpretability challenge that thwarts the adoption of most ML-based models, we perform extensive secondary analysis using the SHapley Additive exPlanations (SHAP) method, yielding many practically useful insights that can be leveraged for intelligently tuning the network configuration, selectively enriching training data in real networks, and building lighter ML-based propagation models to enable low-latency use-cases.

    Framework for collaborative intelligence in forecasting day-ahead electricity price

    Electricity price forecasting in wholesale markets is an essential asset for deciding bidding strategies and operational schedules. The decision making process is limited if no understanding is given of how and why such electricity price points have been forecast. The present article proposes a novel framework that promotes human–machine collaboration in forecasting day-ahead electricity price in wholesale markets. The framework is based on a new model architecture that uses a plethora of statistical and machine learning models, a wide range of exogenous features, a combination of several time series decomposition methods and a collection of time series characteristics based on signal processing and time series analysis methods. The model architecture is supported by open-source automated machine learning platforms that provide a baseline reference used for comparison purposes. The objective of the framework is not only to provide forecasts, but to promote a human-in-the-loop approach by providing a data story based on a collection of model-agnostic methods aimed at interpreting the mechanisms and behavior of the new model architecture and its predictions. The framework has been applied to the Spanish wholesale market. The forecasting results show good accuracy on mean absolute error (1.859, 95% HDI [0.575, 3.924] EUR (MWh)−1) and mean absolute scaled error (0.378, 95% HDI [0.091, 0.934]). Moreover, the framework demonstrates its human-centric capabilities by providing graphical and numeric explanations that augment understanding of the model and its electricity price point forecasts.
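
The two accuracy measures quoted above can be computed as in the following sketch. It assumes the common definition of the mean absolute scaled error, in which the forecast's mean absolute error is divided by the in-sample mean absolute error of a (seasonal) naive forecast; the abstract does not state which scaling the article uses. The function names and the price values are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """Average absolute difference between observed and forecast values."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and equally long")
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mean_absolute_scaled_error(
    actual: Sequence[float],
    forecast: Sequence[float],
    training: Sequence[float],
    season: int = 1,
) -> float:
    """MAE of the forecast divided by the in-sample MAE of a naive forecast
    that repeats the value observed `season` steps earlier."""
    if len(training) <= season:
        raise ValueError("training series must be longer than the season length")
    naive_errors = [
        abs(training[i] - training[i - season]) for i in range(season, len(training))
    ]
    naive_mae = sum(naive_errors) / len(naive_errors)
    if naive_mae == 0:
        raise ValueError("naive forecast has zero error; MASE is undefined")
    return mean_absolute_error(actual, forecast) / naive_mae


# Hypothetical prices in EUR per MWh, for illustration only.
training_prices = [40.0, 42.0, 41.0, 45.0, 43.0]
observed = [44.0, 46.0]
predicted = [43.0, 47.5]

mae = mean_absolute_error(observed, predicted)
mase = mean_absolute_scaled_error(observed, predicted, training_prices)
```

Under this definition a value below 1 means the forecast's average error is smaller than that of the naive baseline on the training series. For hourly day-ahead prices, `season=24` would be one plausible choice, though that is also an assumption.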

    Micro-estimates of Multidimensional Child Poverty in sub-Saharan Africa

    Child poverty maps allow governments and other organizations to design policies to track and evaluate their impact in the fight against child poverty. However, reliable data on the geographic distribution of child poverty is scarce, sparse in coverage and expensive to collect. For some countries, the only available measurements are at the country level. In this thesis, we propose to train Machine Learning models to obtain finely grained predictions of child poverty using heterogeneous and publicly available data sources as geographical, demographic and economic georeferenced inputs. Benchmarks of child poverty, computed from nationally representative household survey data, are used as targets to train and calibrate our proposed prediction models. The multidimensional child poverty index has six dimensions: sanitation, water, education, housing, health and nutrition, and is defined such that the predictions can be compared across countries. Using the techniques that are introduced in this thesis, we compute and release a complete and publicly available set of micro-estimates of prevalence, depth and specific poverty dimensions at a 5.2 km² resolution for sub-Saharan African countries. Prediction intervals are included to facilitate responsible downstream use. The resulting micro-estimates have the potential of being used to deepen the understanding of the causes of child poverty in sub-Saharan Africa and to gain insights on the impact of future actions.

    Applications of Hyper-parameter Optimisations for Static Malware Detection

    Malware detection is a major security concern and a great deal of academic and commercial research and development is directed at it. Machine Learning is a natural technology to harness for malware detection and many researchers have investigated its use. However, drawing comparisons between different techniques is a fraught affair. For example, the performance of ML algorithms often depends significantly on parametric choices, so the question arises as to what parameter choices are optimal. In this thesis, we investigate the use of a variety of ML algorithms for building malware classifiers and also how best to tune the parameters of those algorithms – a process generally known as hyper-parameter optimisation (HPO). Firstly, we examine the effects of some simple (model-free) ways of parameter tuning together with a state-of-the-art Bayesian model-building approach. We demonstrate that optimal parameter choices may differ significantly from default choices and argue that hyper-parameter optimisation should be adopted as a ‘formal outer loop’ in the research and development of malware detection systems. Secondly, we investigate the use of covering arrays (combinatorial testing) as a way to combat the curse of dimensionality in Grid Search. Four ML techniques were used: Random Forests, XGBoost, LightGBM and Decision Trees. cAgen (a tool that is used for combinatorial testing) is shown to be capable of generating high-performing subsets of the full parameter grid of Grid Search and so provides a rigorous but highly efficient means of performing HPO. This may be regarded as a ‘design of experiments’ approach. Thirdly, evolutionary algorithms (EAs) were used to enhance machine learning classifier accuracy. The baseline accuracy of six traditional machine learning techniques is recorded. Two evolutionary algorithm frameworks, Tree-Based Pipeline Optimization Tool (TPOT) and Distributed Evolutionary Algorithms in Python (Deap), are compared.
Deap shows very promising results for our malware detection problem. Fourthly, we compare the use of Grid Search and covering arrays for tuning the hyper-parameters of Neural Networks. Several major hyper-parameters were studied with various values and results. We achieve significant improvements over the benchmark model. Our work is carried out using EMBER, a major published malware benchmark dataset of Windows Portable Executable (PE) metadata samples, and a smaller dataset from kaggle.com (also comprising Windows Portable Executable metadata). Overall, we conclude that HPO is an essential part of credible evaluations of ML-based malware detection models. We also demonstrate that high-performing hyper-parameter values can be found by HPO and that these can be found efficiently.
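
To illustrate how a covering array can shrink the full Grid Search space, the sketch below greedily picks configurations until every pair of values from any two hyper-parameters appears together at least once (a strength-2 covering array). This is a simple greedy heuristic written for illustration, not the cAgen tool or the thesis's procedure, and the hyper-parameter names and values in `grid` are hypothetical.

```python
from itertools import combinations, product
from typing import Dict, Hashable, List, Sequence


def pairwise_covering_subset(
    grid: Dict[str, Sequence[Hashable]],
) -> List[Dict[str, Hashable]]:
    """Greedily select grid points so that every value pair of every two
    parameters is covered by at least one selected configuration."""
    names = list(grid)
    # Enumerate the full grid, exactly what Grid Search would evaluate.
    full = [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]

    def pairs(config: Dict[str, Hashable]) -> set:
        # All (parameter, value) pairs that this configuration exercises together.
        return {((a, config[a]), (b, config[b])) for a, b in combinations(names, 2)}

    uncovered = set()
    for config in full:
        uncovered |= pairs(config)

    chosen: List[Dict[str, Hashable]] = []
    while uncovered:
        # Pick the configuration that covers the most still-uncovered pairs.
        best = max(full, key=lambda c: len(pairs(c) & uncovered))
        chosen.append(best)
        uncovered -= pairs(best)
    return chosen


# Hypothetical grid: 3 * 3 * 3 * 3 = 81 configurations in a full Grid Search.
grid = {
    "n_estimators": [100, 200, 400],
    "max_depth": [4, 8, 16],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.6, 0.8, 1.0],
}
subset = pairwise_covering_subset(grid)
```

Each selected configuration would then be trained and scored as in an ordinary Grid Search. The saving comes from evaluating only `len(subset)` configurations instead of all 81, at the cost of guaranteeing coverage of pairwise interactions only.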