8 research outputs found

    DC-Check: A Data-Centric AI checklist to guide the development of reliable machine learning systems

    While there have been a number of remarkable breakthroughs in machine learning (ML), much of the focus has been placed on model development. However, to truly realize the potential of machine learning in real-world settings, additional aspects must be considered across the ML pipeline. Data-centric AI is emerging as a unifying paradigm that could enable such reliable end-to-end pipelines. However, this remains a nascent area with no standardized framework to guide practitioners to the necessary data-centric considerations or to communicate the design of data-centric driven ML systems. To address this gap, we propose DC-Check, an actionable checklist-style framework to elicit data-centric considerations at different stages of the ML pipeline: Data, Training, Testing, and Deployment. This data-centric lens on development aims to promote thoughtfulness and transparency prior to system development. Additionally, we highlight specific data-centric AI challenges and research opportunities. DC-Check is aimed at both practitioners and researchers to guide day-to-day development. As such, to easily engage with and use DC-Check and associated resources, we provide a DC-Check companion website (https://www.vanderschaar-lab.com/dc-check/). The website will also serve as an updated resource as methods and tooling evolve over time. Comment: Main paper: 11 pages, supplementary & case studies follo

    A Data Mining Approach to Predict urban Fires in Lisbon using H2o.ai python Library

    Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. Technologies have enabled societies to prosper socially and economically and to be more interconnected. With the decreasing cost of data storage and processing, cities are now trying to extract actionable information from the available data to improve and optimize their resource allocation and planning. This thesis aims to develop a data-mining approach to predicting urban fires in Lisbon, leveraging climate, building, and population data to predict where a fire will happen within a particular future period. The goal is to help RSB reduce its overall response time to fires by predicting the areas where positive emergency events are probable, and to understand the driving factors that lead to these events in Lisbon. This supervised learning task, developed using the CRISP-DM methodology, uses standard machine learning estimators through the h2o.ai Python module to incorporate parallel distributed computing, combined with an AutoML package, and is evaluated using cross-validation, PR-AUC, and the F-0.5 score. The main conclusion is that applying predictive data-mining methods to emergency events has large potential to aid resource allocation and the understanding of the drivers of emergency events, but requires large amounts of data for algorithms to learn and to extract actionable insights from their predictions.
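
The F-0.5 score named in the abstract above is the F-beta score with beta = 0.5, which weights precision more heavily than recall. The following is a minimal sketch of that formula only, not the thesis's own evaluation code; the function name `f_beta` and the confusion counts are hypothetical and chosen for illustration.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Compute the F-beta score from confusion-matrix counts.

    beta < 1 weights precision above recall; beta = 1 gives the usual F1.
    The counts passed in below are illustrative, not results from the thesis.
    """
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical classifier: 30 fires predicted correctly, 10 false alarms, 30 missed.
score_f05 = f_beta(tp=30, fp=10, fn=30, beta=0.5)
score_f1 = f_beta(tp=30, fp=10, fn=30, beta=1.0)
print(f"F-0.5 = {score_f05:.4f}, F1 = {score_f1:.4f}")
```

With precision 0.75 and recall 0.5, the F-0.5 score comes out higher than F1 because the metric rewards the higher precision, which suits a setting where false alarms are costlier than misses.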

    Urban Anomaly Analytics: Description, Detection, and Prediction

    Urban anomalies may result in loss of life or property if not handled properly. Automatically raising alerts on anomalies at an early stage, or even predicting anomalies before they happen, is of great value for populations. Recently, data-driven urban anomaly analysis frameworks have been forming, which utilize urban big data and machine learning algorithms to detect and predict urban anomalies automatically. In this survey, we make a comprehensive review of the state-of-the-art research on urban anomaly analytics. We first give an overview of four main types of urban anomalies: traffic anomalies, unexpected crowds, environment anomalies, and individual anomalies. Next, we summarize various types of urban datasets obtained from diverse devices, i.e., trajectories, trip records, CDRs, urban sensors, event records, environment data, social media, and surveillance cameras. Subsequently, a comprehensive survey of issues in detection and prediction techniques for urban anomalies is presented. Finally, research challenges and open problems are discussed. Peer reviewed

    Onnettomuusennusteiden hyödyntäminen pelastustoimessa (Utilizing accident forecasts in rescue services)

    Accident forecasting models for the rescue services have been developed, and their use has been examined, as part of the project entity "Performance and planning criteria for rescue services and civil preparedness" (Pelastustoimen ja siviilivalmiuden suorituskyky ja suunnitteluperusteet). The work is based on the proposal for developing the risk analysis process made by the risk analysis working group in the rescue services reform project (2015–2019) (Sisäministeriö 2018). The report presents alternative methods for modelling daily accident forecasts and for using them in planning rescue service provision, as well as a reform proposal for calculating the risk level used in planning the service level of rescue operations. In preparing accident forecasts, statistical and geospatial datasets describing the operating environment should be used in addition to the rescue services' own statistical and geospatial datasets. The operating-environment factors that affect accident forecasts differ by accident type. Using data that is broader than before, both in content and in time span, yields a better forecast of accident occurrence and enables a uniform calculation of the risk level. The future challenge in optimizing accident forecasting is how to make all the statistical and geospatial datasets identified as necessary available to the rescue services' risk analysis work, for computing accident forecasts and for supporting the planning and targeting of rescue services. Publications of the "Performance and planning criteria for rescue services and civil preparedness" project: Summary report of the project; Operating environment analysis of rescue services and civil preparedness; Utilizing accident forecasts in rescue services; Updating and renewing the knowledge base of rescue services; Customer work and customer understanding in rescue services; Who helps in sparsely populated and rural areas? More understanding and solutions for rescue services; Performance requirements for rescue operations; Accident prevention performance in relation to performance requirements; The role of rescue services in regional preparedness cooperation and civil defence capabilities; Proactive financial and personnel planning: the foundation of rescue service performance; Rescue services population survey and segmentation: sub-publication of the project's final report.

    A GIS-Based Risk Assessment for Fire Departments: Case Study of Richland County, SC

    Risk assessments enable fire departments to be better prepared for future incidents and to engage in more effective prevention activities. Physical, demographic, and behavioral risk factors together form a community’s level of risk. This research shows how spatial and nonspatial statistical methods can be used within a GIS framework to create such a risk assessment, with the Columbia-Richland Fire Department in Richland County, SC being used as a case study. Hot spot analysis and thematic mapping of incident rates were used to assess the first research question: what is the spatial variability of structure fires, carbon monoxide incidents, and emergency medical calls? Correlation analysis, principal component analysis (PCA), and factor analysis were applied to a few dozen social and physical risk factors at the block group level to assess the second research question: how are the risk factors correlated with each other, and how do these risk factors vary across the county? The results of all methods were compared against each other to assess how risk factors correlated with incident types. These methods were able to map hot and cold spots of incidents, identify the most relevant risk factors, and show which risk factors were most prevalent in hot spot areas. The primary hot spot for EMS and fire incidents was found in northern Columbia, with a secondary hot spot located in far Lower Richland. PCA identified nine primary factors, the top three of which were related to systematic hard times, older homeowners, and rural location. Factor analysis was able to cluster block groups into fourteen groupings of similar risk traits. There were very clear differences in incident rates between the fourteen groupings, although hot spots contained block groups from multiple groupings. Given the snapshot-in-time nature of risk assessments, this research builds a baseline for future risk assessments, both in terms of methods and results.

    Continuous Estimation of Smoking Lapse Risk from Noisy Wrist Sensor Data Using Sparse and Positive-Only Labels

    Estimating the imminent risk of adverse health behaviors provides opportunities for developing effective behavioral intervention mechanisms to prevent the occurrence of the target behavior. One of the key goals is to find opportune moments for intervention by passively detecting the rising risk of an imminent adverse behavior. Significant progress in mobile health research and the ability to continuously sense internal and external states of individual health and behavior has paved the way for detecting diverse risk factors from mobile sensor data. The next frontier in this research is to account for the combined effects of these risk factors to produce a composite risk score of adverse behaviors using wearable sensors convenient for daily use. Developing a machine learning-based model for assessing the risk of smoking lapse in the natural environment faces significant outstanding challenges requiring the development of novel and unique methodologies for each of them. The first challenge is coming up with an accurate representation of noisy and incomplete sensor data to encode the present and historical influence of behavioral cues, mental states, and the interactions of individuals with their ever-changing environment. The next noteworthy challenge is the absence of confirmed negative labels of low-risk states and adequate precise annotations of high-risk states. Finally, the model should work on convenient wearable devices to facilitate widespread adoption in research and practice. In this dissertation, we develop methods that account for the multi-faceted nature of smoking lapse behavior to train and evaluate a machine learning model capable of estimating composite risk scores in the natural environment. We first develop mRisk, which combines the effects of various mHealth biomarkers such as stress, physical activity, and location history in producing the risk of smoking lapse using sequential deep neural networks. 
We propose an event-based encoding of sensor data to reduce the effect of noise and then present an approach to efficiently model the historical influence of recent and past sensor-derived contexts on the likelihood of smoking lapse. To work around the lack of confirmed negative labels (i.e., annotated low-risk moments) and the availability of only a few positive labels (i.e., sensor-based detection of smoking lapse corroborated by self-reports), we propose a new loss function to accurately optimize the models. We build the mRisk models using biomarker (stress, physical activity) streams derived from chest-worn sensors. Adapting the models to work with less invasive and more convenient wrist-based sensors requires adapting the biomarker detection models to work with wrist-worn sensor data. To that end, we develop robust stress and activity inference methodologies from noisy wrist-sensor data. We first propose CQP, which quantifies the quality of PPG data collected by wrist sensors. Next, we show that integrating CQP within the inference pipeline improves accuracy-yield trade-offs associated with stress detection from wrist-worn PPG sensors in the natural environment. mRisk also requires precise sensor-based detection of smoking events and confirmation through self-reports to extract positive labels. Hence, we develop rSmoke, an orientation-invariant smoking detection model that is robust to the variations in sensor data resulting from orientation switches in the field. We train the proposed mRisk risk estimation models using the wrist-based inferences of lapse risk factors. To evaluate the utility of the risk models, we simulate the delivery of intelligent smoking interventions to at-risk participants as informed by the composite risk scores. Our results demonstrate the envisaged impact of machine learning-based models operating on wrist-worn wearable sensor data to output continuous smoking lapse risk scores. 
The novel methodologies we propose throughout this dissertation help open a new frontier in smoking research that can potentially improve the smoking abstinence rate in participants willing to quit.