On The Reliability Of Machine Learning Applications In Manufacturing Environments
The increasing deployment of advanced digital technologies such as Internet of Things (IoT) devices and Cyber-Physical Systems (CPS) in industrial environments is enabling the productive use of machine learning (ML) algorithms in the manufacturing domain. As ML applications move from research to productive use in real-world industrial environments, the question of reliability arises. Since the majority of ML models are trained and evaluated on static datasets, continuous online monitoring of their performance is required to build reliable systems. Furthermore, concept and sensor drift can degrade the accuracy of an algorithm over time, compromising safety, acceptance and economics if it goes undetected and is not properly addressed. In this work, we illustrate the severity of the issue using a publicly available industrial dataset recorded over the course of 36 months and explain possible sources of drift. We assess the robustness of ML algorithms commonly used in manufacturing and show that accuracy declines strongly with increasing drift for all tested algorithms. We further investigate how uncertainty estimation may be leveraged for online performance estimation as well as drift detection, as a first step towards continually learning applications. The results indicate that ensemble algorithms like random forests show the least decay of confidence calibration under drift.
Tagvisor: A Privacy Advisor for Sharing Hashtags
The hashtag has emerged as a widely used concept of popular culture and campaigns, but its implications for people's privacy have not been investigated so far. In this paper, we present the first systematic analysis of privacy issues induced by hashtags. We concentrate in particular on location, which is recognized as one of the key privacy concerns in the Internet era. By relying on a random forest model, we show that we can infer a user's precise location from hashtags with an accuracy of 70% to 76%, depending on the city. To remedy this situation, we introduce a system called Tagvisor that systematically suggests alternative hashtags if the user-selected ones constitute a threat to location privacy. Tagvisor realizes this by means of three conceptually different obfuscation techniques and a semantics-based metric for measuring the consequent utility loss. Our findings show that obfuscating as few as two hashtags already provides a near-optimal trade-off between privacy and utility in our dataset. This in particular renders Tagvisor highly time-efficient, and thus practical in real-world settings.
Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction
Random forests consisting of an ensemble of regression trees with equal weights are frequently used for the design of predictive models. In this article, we consider an extension of the methodology by representing the regression trees in the form of probabilistic trees and analyzing the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide stricter CIs with comparable performance in mean error. We approached the ensemble of probabilistic trees' prediction from the perspectives of a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic data and the Cancer Cell Line Encyclopedia dataset and illustrated that tree weights can be selected to reduce the average length of the CI without an increase in mean error.
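The two perspectives named in this abstract (a mixture distribution and a weighted sum of correlated random variables) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, and the per-tree means, variances and covariance matrix are assumed to be given. The sketch only applies the standard moment formulas for a mixture and for a linear combination.

```python
def mixture_mean_variance(means, variances, weights):
    """Mean and variance of a weighted mixture of per-tree predictive distributions.

    means, variances: per-tree predictive mean and variance (assumed inputs).
    weights: non-negative tree weights; normalized here to sum to 1.
    """
    total = sum(weights)
    w = [x / total for x in weights]
    mean = sum(wi * mi for wi, mi in zip(w, means))
    # Second moment of a mixture: sum_i w_i * (sigma_i^2 + mu_i^2)
    second_moment = sum(wi * (vi + mi * mi)
                        for wi, mi, vi in zip(w, means, variances))
    return mean, second_moment - mean * mean


def weighted_sum_variance(weights, cov):
    """Variance of sum_i w_i * T_i for correlated tree outputs T_i: w^T C w."""
    n = len(weights)
    return sum(weights[i] * weights[j] * cov[i][j]
               for i in range(n) for j in range(n))


# Two equally weighted trees that disagree on the mean
mix_mean, mix_var = mixture_mean_variance([1.0, 3.0], [1.0, 1.0], [0.5, 0.5])
sum_var = weighted_sum_variance([0.5, 0.5], [[1.0, 0.5], [0.5, 1.0]])
```

Under the mixture view, disagreement between trees inflates the variance, while under the weighted-sum view the variance depends on the covariance between trees. This is one way to see why the choice of tree weights can change interval length.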
Design of Probabilistic Random Forests with Applications to Anticancer Drug Sensitivity Prediction- 2016
Random forests consisting of an ensemble of regression trees with equal weights are frequently used for the design of predictive models. In this article, we consider an extension of the methodology by representing the regression trees in the form of probabilistic trees and analyzing the nature of heteroscedasticity. The probabilistic tree representation allows for analytical computation of confidence intervals (CIs), and the tree weight optimization is expected to provide stricter CIs with comparable performance in mean error. We approached the ensemble of probabilistic trees' prediction from the perspectives of a mixture distribution and as a weighted sum of correlated random variables. We applied our methodology to the drug sensitivity prediction problem on synthetic data and the Cancer Cell Line Encyclopedia dataset and illustrated that tree weights can be selected to reduce the average length of the CI without an increase in mean error.
Estimating habitat extent and carbon loss from an eroded northern blanket bog using UAV derived imagery and topography
Peatlands are important reserves of terrestrial carbon and biodiversity, and given that many peatlands across the UK and Europe exist in a degraded state, their conservation is a major area of concern and a focus of considerable research. Aerial surveys are valuable tools for habitat mapping and conservation and provide useful insights into habitat condition. We investigate how topography and habitat classes derived from structure-from-motion (SfM) photogrammetry may be used to construct an estimate of carbon loss from erosion features in a remote blanket bog habitat. An autonomous, unmanned, aerial, fixed-wing remote sensing platform (Quest UAV 300™) collected imagery over Moor House, in the Upper Teesdale National Nature Reserve, a site with a high degree of peatland erosion. The images were processed with SfM photogrammetry techniques to generate point clouds, orthomosaics and digital surface models, which were georeferenced and subsequently used to classify vegetation and peatland features. A classification of peat bog feature types was developed using a random forest classification model trained on field survey data and applied to UAV-captured products including the orthomosaic, digital surface model and derived surfaces such as topographic index, slope and aspect maps. Using the area classified as eroded peat and the derived digital surface model, we estimated a loss of 438 tonnes of carbon from a single gully. The UAV system was relatively straightforward to deploy in such a remote and unimproved area. SfM photogrammetry, imagery and random forest modelling obtained classification accuracies of between 42% and 100%, and were able to discern between bare peat, saturated bog and Sphagnum habitats. This paper shows what can be achieved with low-cost UAVs equipped with consumer-grade camera equipment and relatively straightforward ground control, and demonstrates their potential for the carbon and peatland conservation research community.
Finding Respondents in the Forest: A Comparison of Logistic Regression and Random Forest Models for Response Propensity Weighting and Stratification
Survey response rates for modern surveys using many different modes are trending downward, leaving the potential for nonresponse biases in estimates derived from using only the respondents. The reasons for nonresponse may be complex functions of known auxiliary variables or unknown latent variables not measured by practitioners. The degree to which the propensity to respond is associated with survey outcomes casts light on the overall potential for nonresponse biases for estimates of means and totals. The most common methods for nonresponse adjustments to compensate for the potential bias in estimates have been logistic and probit regression models. However, for more complex nonresponse mechanisms that may be nonlinear or involve many interaction effects, these methods may fail to converge and thus fail to generate nonresponse adjustments for the sampling weights. In this paper we compare these traditional techniques to a relatively new data mining technique, random forests, under a simple and a complex nonresponse propensity population model using both direct and propensity stratification nonresponse adjustments. Random forests appear to offer marginal improvements for the complex response model over logistic regression in direct propensity adjustment, but have some surprising results for propensity stratification across both response models.
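The two adjustment styles compared in this abstract can be sketched as follows. This is a minimal illustration, not the paper's procedure. It assumes the response propensities have already been estimated by some model (logistic regression or a random forest, for example), and the function names and the default of five classes are hypothetical choices.

```python
def direct_propensity_weights(base_weights, propensities, responded):
    """Direct adjustment: divide each respondent's weight by its estimated propensity.

    Nonrespondents receive a weight of 0.0.
    """
    return [w / p if r else 0.0
            for w, p, r in zip(base_weights, propensities, responded)]


def stratified_propensity_weights(base_weights, propensities, responded,
                                  n_classes=5):
    """Stratification adjustment: group units into classes by sorted propensity,
    then inflate respondent weights by (total weight / respondent weight)
    within each class."""
    n = len(propensities)
    order = sorted(range(n), key=lambda i: propensities[i])
    adjusted = [0.0] * n
    for k in range(n_classes):
        cell = order[k * n // n_classes:(k + 1) * n // n_classes]
        total = sum(base_weights[i] for i in cell)
        resp = sum(base_weights[i] for i in cell if responded[i])
        if resp == 0:
            # No respondents in this class; a real application would need
            # to collapse classes. This sketch leaves the weights at zero.
            continue
        factor = total / resp
        for i in cell:
            if responded[i]:
                adjusted[i] = base_weights[i] * factor
    return adjusted


# Toy sample: four units with equal base weights, one nonrespondent
base = [1.0, 1.0, 1.0, 1.0]
prop = [0.2, 0.4, 0.6, 0.8]
resp = [False, True, True, True]
direct = direct_propensity_weights(base, prop, resp)
strat = stratified_propensity_weights(base, prop, resp, n_classes=2)
```

Stratification uses the estimated propensities only to rank and group units, so it is less sensitive to extreme individual propensity values than the direct adjustment, at the cost of assuming similar response behaviour within each class.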