
    Beyond Volume: The Impact of Complex Healthcare Data on the Machine Learning Pipeline

    From medical charts to national censuses, healthcare has traditionally operated under a paper-based paradigm. However, the past decade has marked a long and arduous transformation bringing healthcare into the digital age. From electronic health records, to digitized imaging and laboratory reports, to public health datasets, healthcare now generates an incredible amount of digital information. Such a wealth of data presents an exciting opportunity for integrated machine learning solutions to address problems across multiple facets of healthcare practice and administration. Unfortunately, deriving accurate and informative insights requires more than the ability to execute machine learning models. Rather, a deeper understanding of the data on which the models are run is imperative for their success. While a significant effort has been undertaken to develop models able to process the volume of data obtained during the analysis of millions of digitized patient records, it is important to remember that volume represents only one aspect of the data. In fact, because it draws on an increasingly diverse set of sources, healthcare data presents an incredibly complex set of attributes that must be accounted for throughout the machine learning pipeline. This chapter focuses on highlighting such challenges, and is broken down into three distinct components, each representing a phase of the pipeline. We begin with attributes of the data accounted for during preprocessing, then move to considerations during model building, and end with challenges to the interpretation of model output. For each component, we present a discussion around data as it relates to the healthcare domain and offer insight into the challenges each may impose on the efficiency of machine learning techniques. Comment: Healthcare Informatics, Machine Learning, Knowledge Discovery: 20 Pages, 1 Figure

    Automated Website Fingerprinting through Deep Learning

    Several studies have shown that the network traffic that is generated by a visit to a website over Tor reveals information specific to the website through the timing and sizes of network packets. By capturing traffic traces between users and their Tor entry guard, a network eavesdropper can leverage this metadata to reveal which website Tor users are visiting. The success of such attacks heavily depends on the particular set of traffic features that are used to construct the fingerprint. Typically, these features are manually engineered and, as such, any change introduced to the Tor network can render these carefully constructed features ineffective. In this paper, we show that an adversary can automate the feature engineering process, and thus automatically deanonymize Tor traffic by applying our novel method based on deep learning. We collect a dataset comprising more than three million network traces, which is the largest dataset of web traffic ever used for website fingerprinting, and find that the performance achieved by our deep learning approaches is comparable to known methods which include various research efforts spanning multiple years. The obtained success rate exceeds 96% for a closed world of 100 websites and 94% for our biggest closed world of 900 classes. In our open world evaluation, the most performant deep learning model is 2% more accurate than the state-of-the-art attack. Furthermore, we show that the implicit features automatically learned by our approach are far more resilient to dynamic changes of web content over time. We conclude that the ability to automatically construct the most relevant traffic features and perform accurate traffic recognition makes our deep learning based approach an efficient, flexible and robust technique for website fingerprinting. Comment: To appear in the 25th Symposium on Network and Distributed System Security (NDSS 2018)

    DetectA: abrupt concept drift detection in non-stationary environments

    Almost all drift detection mechanisms designed for classification problems work reactively: after receiving the complete data set (input patterns and class labels), they apply a sequence of procedures to identify some change in the class-conditional distribution (a concept drift). However, detecting a change only after it has occurred can, in some situations, be harmful to the process under analysis. This paper proposes a proactive approach for abrupt drift detection, called DetectA (Detect Abrupt Drift). Briefly, this method is composed of three steps: (i) label the patterns from the test set (an unlabelled data block), using an unsupervised method; (ii) compute some statistics from the train and test sets, conditioned on the class labels given for the train set; and (iii) compare the training and testing statistics using a multivariate hypothesis test. Based on the results of the hypothesis tests, we attempt to detect the drift on the test set before the real labels are obtained. A procedure for creating datasets with abrupt drift has been proposed to perform a sensitivity analysis of the DetectA model. The result of the sensitivity analysis suggests that the detector is efficient and suitable for datasets of high dimensionality, blocks with any proportion of drifts, and datasets with class imbalance. The performance of the DetectA method, with different configurations, was also evaluated on real and artificial datasets, using an MLP as a classifier. The best results were obtained using one of the detection methods, with the proactive approach a top contender for improving the accuracy of the underlying base classifier.
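The three steps listed in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the unsupervised labelling method, the statistics, or the hypothesis test, so the sketch assumes nearest-centroid labelling for step (i), class-conditional mean vectors for step (ii), and a permutation test on the distance between mean vectors (with a Bonferroni correction) for step (iii). All function names and parameters are hypothetical.

```python
import random
from statistics import fmean


def centroid(rows):
    """Mean vector of a list of equal-length numeric rows."""
    return [fmean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def pseudo_label(test_rows, class_centroids):
    """Step (i), assumed variant: label each unlabelled row by its nearest training centroid."""
    return [
        min(class_centroids, key=lambda lab: sq_dist(row, class_centroids[lab]))
        for row in test_rows
    ]


def permutation_pvalue(a, b, n_perm=500, seed=0):
    """Step (iii), assumed variant: permutation test on the distance between mean vectors."""
    rng = random.Random(seed)
    observed = sq_dist(centroid(a), centroid(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        stat = sq_dist(centroid(pooled[:len(a)]), centroid(pooled[len(a):]))
        if stat >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def detect_abrupt_drift(train_rows, train_labels, test_rows, alpha=0.05):
    """Return (drift_flag, per-class p-values) for one unlabelled test block."""
    by_class = {}
    for row, lab in zip(train_rows, train_labels):
        by_class.setdefault(lab, []).append(row)
    # Step (ii): class-conditional statistics (here only the mean vector).
    cents = {lab: centroid(rows) for lab, rows in by_class.items()}
    test_labels = pseudo_label(test_rows, cents)
    pvals = {}
    for lab, rows in by_class.items():
        block = [r for r, l in zip(test_rows, test_labels) if l == lab]
        if len(block) >= 2:
            pvals[lab] = permutation_pvalue(rows, block)
    # Bonferroni correction across the per-class tests.
    drift = any(p < alpha / max(len(pvals), 1) for p in pvals.values())
    return drift, pvals


# Toy data: two well-separated classes on a small grid, then a shifted block.
grid = [[0.1 * i, 0.1 * j] for i in range(5) for j in range(5)]
train_rows = grid + [[x + 5.0, y + 5.0] for x, y in grid]
train_labels = [0] * len(grid) + [1] * len(grid)
shifted_block = [[x + 1.5, y + 1.5] for x, y in train_rows]

print(detect_abrupt_drift(train_rows, train_labels, train_rows))
print(detect_abrupt_drift(train_rows, train_labels, shifted_block))
```

The point of the sketch is that the decision uses only the unlabelled block, which is what makes the approach proactive. A test that also compares covariances would be closer to a full multivariate comparison than this means-only variant.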

    The Influence of Dams on Downstream Larval and Juvenile Fish and Benthic Macroinvertebrate Community Structure and Associated Physicochemical Variables

    R. Daniel Hanks. The influence of dams on downstream biotic and abiotic components of aquatic ecosystems has been largely studied within the context of the River Continuum Concept (RCC) and Serial Discontinuity Concept (SDC). Few of these studies have sufficiently examined how these variables change along the longitudinal gradient below the impoundments in a systematic manner, comparing equal distances below both epilimnetic and hypolimnetic dams to a reference condition. This is especially true of early life stages of fish (i.e., larval and juvenile stages) and macroinvertebrate functional groups. Here, we systematically evaluated the effects of dams at 16 sites downstream of dams for their impact on physicochemical variables (instream habitat [e.g., substrate, flow, etc.], water quality [i.e., DO, pH, conductivity, and temperature], and landcover [i.e., % forested land, % developed land, and % grassland]) and on various metrics for larval and juvenile fish and benthic macroinvertebrates. Effective capture of larval and juvenile fish was paramount for the evaluation of dam influences on larval and juvenile fish. Sampling larval fish at various life stages can be difficult in shallow, structurally and spatially diverse streams. We evaluated three commonly employed methods (light traps, drift nets, and spot-and-sweep) for sampling larval fish in these systems. We found the spot-and-sweep method captured a higher abundance of larvae than either drift nets or light traps during both daytime and nighttime hours. Additionally, the spot-and-sweep method captured as many different taxa as drift nets and more than light traps. The coefficient of variation was lower for spot-and-sweep than for either drift nets or light traps for both taxa richness and larval abundance. Richness for daytime and nighttime spot-and-sweep sampling was equal.
Mean richness was also equal between the two periods, and mean CPUE was not significantly different between periods. The coefficient of variation was lowest for daytime spot-and-sweep sampling, suggesting it was less variable than nighttime sampling. The spot-and-sweep method showed promise for determining taxa presence and relative abundance. Discrepancies in the ability of personnel while performing spot-and-sweep sampling were investigated and found to be insignificant. Of the three methods evaluated for sampling structurally complex and spatially heterogeneous streams, the spot-and-sweep method was found to be the most effective. We investigated the effects of dams on downstream larval and juvenile fish. Generalized additive models indicated that there was a general increase in abundance, genus richness, and Shannon diversity associated with increasing distance from dams. Principal component analysis (PCA) indicated three influential PCs that were structured by landcover, habitat and water quality, and disturbance. Nonmetric multidimensional scaling (NMDS) indicated larval and juvenile fish communities were structured differently between epilimnetic and hypolimnetic releases and that habitat variables structuring those communities were more variable in epilimnetic releases than hypolimnetic releases. We systematically evaluated both the abiotic and biotic (i.e., benthic macroinvertebrates at the family level) components along the stream continuum below impoundments with both epilimnetic and hypolimnetic releases and compared those findings to a reference stream. Generalized additive models (GAMs) identified six habitat variables (i.e., substrate coarseness, substrate diversity, pH, temperature, stream width, and stream depth) as significantly related to distance from dam.
GAMs also indicated that abundance was not significantly related to distance from dam but both family level richness and Shannon diversity exhibited significant increases with increasing distance from dams. We evaluated patterns of changes in physicochemical and macroinvertebrate functional group components of aquatic systems along the longitudinal gradient below dams and compared changes in these variables to an undammed reference stream. Generalized additive models indicated that genus richness, functional richness, tolerance, dispersal, percent five dominant genera, EPT, and GLIMPSS were lower in dammed streams than in our reference stream. Genus and functional richness, percent five dominant genera, EPT, and GLIMPSS all increased as distance from dams increased while they remained relatively consistent within our reference stream. Tolerance and dispersal changed with distance from dams in dammed streams but showed little change in our reference stream. Percent composition of functional groups was different between dammed and reference streams; in dammed streams the percent composition changed with increasing distance from dams, but remained relatively stable in our reference stream. Genus and functional richness also exhibited two distinct gradients within the 5,100 m that we sampled below dams: a short, rapidly changing gradient existed from immediately below dams to approximately 2,000 m, followed by a more gradual, steadily increasing gradient that appeared to continue beyond our most distant sampling location below dams (i.e., 5,100 m). Important explanatory variables that varied in statistical significance between response variables but were commonly significant with distance from dams were substrate coarseness and percent forested land.
Eighty-five percent of our measured abiotic variables below dams had higher r values where curvilinear relationships were modeled as compared to linear relationships, whereas only 46% of the biotic variables had higher r values with curvilinear models. Nonmetric multidimensional scaling (NMDS) confirmed our GAM results, indicating benthic macroinvertebrates below dams show structural changes along the stream continuum. In all cases (larval and juvenile fish, family level aquatic macroinvertebrates, and genus level aquatic macroinvertebrate metrics) our findings generally agreed with the SDC, but future studies should aim to sample in a spatially systematic manner, as this will improve our understanding of how dams influence abiotic and biotic components of aquatic systems. Additionally, our studies consistently indicated two gradients existed for most biotic measures. We believe further studies are required to understand the two recovery gradients that exist below dams and the extent of dam influences along the stream continuum.
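Two of the summary metrics named in this abstract, Shannon diversity and the coefficient of variation, have standard definitions that a short sketch makes concrete. The abstract does not state the logarithm base or whether a sample or population standard deviation was used, so the natural logarithm and the sample standard deviation below are assumptions. The function names are hypothetical.

```python
from math import log
from statistics import fmean, stdev


def shannon_diversity(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (dimensionless)."""
    return stdev(values) / fmean(values)


# Hypothetical per-taxon counts at one site and catch values across samples.
print(shannon_diversity([12, 7, 3, 1]))
print(coefficient_of_variation([14.0, 9.0, 11.0, 16.0]))
```

A lower coefficient of variation means less spread relative to the mean catch, which is the sense in which the abstract calls one sampling method less variable than another.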

    Combining univariate approaches for ensemble change detection in multivariate data

    Detecting change in multivariate data is a challenging problem, especially when class labels are not available. There is a large body of research on univariate change detection, notably in control charts developed originally for engineering applications. We evaluate univariate change detection approaches, including those in the MOA framework, built into ensembles where each member observes a feature in the input space of an unsupervised change detection problem. We present a comparison between the ensemble combinations and three established ‘pure’ multivariate approaches over 96 data sets, and a case study on the KDD Cup 1999 network intrusion detection dataset. We found that ensemble combination of univariate methods consistently outperformed multivariate methods on the four experimental metrics. Project RPG-2015-188 funded by The Leverhulme Trust, UK; Spanish Ministry of Economy and Competitiveness through project TIN 2015-67534-P and the Spanish Ministry of Education, Culture and Sport through Mobility Grant PRX16/00495. The 96 datasets were originally curated for use in the work of Fernández-Delgado et al. [53] and accessed from the personal web page of the author. The KDD Cup 1999 dataset used in the case study was accessed from the UCI Machine Learning Repository [10].
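The ensemble idea in this abstract, one univariate detector per feature with the members' outputs combined, can be sketched as follows. This is only an illustration under stated assumptions: the member detector here is a simple Shewhart-style control-chart check on the window mean, and the combination rule is a vote threshold. The abstract names neither, and none of the MOA detectors are reproduced. All names and default parameters are hypothetical.

```python
from math import sqrt
from statistics import fmean, stdev


def feature_changed(reference, window, k=3.0):
    """Shewhart-style check: flag a change if the window mean leaves the k-sigma band."""
    mu = fmean(reference)
    sigma = stdev(reference)
    if sigma == 0:
        return fmean(window) != mu
    limit = k * sigma / sqrt(len(window))
    return abs(fmean(window) - mu) > limit


def ensemble_change(reference_rows, window_rows, vote_fraction=0.5):
    """Each ensemble member watches one feature; flag change if enough members agree."""
    ref_cols = list(zip(*reference_rows))
    win_cols = list(zip(*window_rows))
    votes = [feature_changed(r, w) for r, w in zip(ref_cols, win_cols)]
    return sum(votes) / len(votes) > vote_fraction, votes


# Toy data: three features; the second window shifts the first two features.
reference = [[i % 5, (i * 2) % 7, (i * 3) % 4] for i in range(40)]
shifted = [[a + 10, b + 10, c] for a, b, c in reference]

print(ensemble_change(reference, reference))
print(ensemble_change(reference, shifted))
```

No class labels are needed, since each member looks only at the marginal distribution of its own feature. The vote threshold controls how many features must move before the ensemble reports a change.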