
    Pyramid: Enhancing Selectivity in Big Data Protection with Count Featurization

    Protecting vast quantities of data poses a daunting challenge for the growing number of organizations that collect, stockpile, and monetize it. The ability to distinguish data that is actually needed from data collected "just in case" would help these organizations to limit the latter's exposure to attack. A natural approach might be to monitor data use and retain only the working-set of in-use data in accessible storage; unused data can be evicted to a highly protected store. However, many of today's big data applications rely on machine learning (ML) workloads that are periodically retrained by accessing, and thus exposing to attack, the entire data store. Training set minimization methods, such as count featurization, are often used to limit the data needed to train ML workloads to improve performance or scalability. We present Pyramid, a limited-exposure data management system that builds upon count featurization to enhance data protection. As such, Pyramid uniquely introduces both the idea and proof-of-concept for leveraging training set minimization methods to instill rigor and selectivity into big data management. We integrated Pyramid into Spark Velox, a framework for ML-based targeting and personalization. We evaluate it on three applications and show that Pyramid approaches state-of-the-art models while training on less than 1% of the raw data
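A minimal sketch of count featurization in its common form, where each categorical value is replaced by a smoothed estimate of how often it co-occurs with a positive label. The abstract does not specify Pyramid's implementation; the function names, the binary-label setting, and the smoothing parameters `prior` and `strength` are illustrative assumptions.

```python
from collections import defaultdict


def build_count_table(rows, labels):
    """Count, per (feature index, value), how often each binary label occurs."""
    # Maps (feature index, value) -> [negative count, positive count].
    table = defaultdict(lambda: [0, 0])
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            table[(i, value)][label] += 1
    return table


def featurize(row, table, prior=0.5, strength=1.0):
    """Replace each categorical value with a smoothed estimate of P(label=1 | value).

    `prior` and `strength` are hypothetical smoothing parameters: unseen values
    fall back to `prior`.
    """
    features = []
    for i, value in enumerate(row):
        neg, pos = table.get((i, value), (0, 0))
        features.append((pos + prior * strength) / (neg + pos + strength))
    return features


# Toy data: (ad id, country) pairs with click labels.
rows = [("ad1", "us"), ("ad1", "uk"), ("ad2", "us"), ("ad2", "us")]
labels = [1, 0, 0, 0]

table = build_count_table(rows, labels)
vec = featurize(("ad1", "us"), table)
unseen = featurize(("ad9", "fr"), table)
```

Once the count table is built, a downstream model can be trained on the compact featurized vectors, which is how this family of methods reduces the amount of raw data that training needs to access.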

    An explorative study of interface support for image searching

    In this paper we study interfaces for image retrieval systems. Current image retrieval interfaces are limited to providing query facilities and result presentation. The user can inspect the results and possibly provide feedback on their relevance for the current query. Our approach, in contrast, encourages the user to group and organise their search results and thus provide more fine-grained feedback for the system. It combines the search and management process, which, according to our hypothesis, helps the user to conceptualise their search tasks and to overcome the query formulation problem. An evaluation, involving young design professionals and different types of information-seeking scenarios, shows that the proposed approach succeeds in encouraging the user to conceptualise their tasks and that it leads to increased user satisfaction. However, it could not be shown to increase performance. We identify the problems in the current setup, which, when eliminated, should lead to more effective searching overall

    Does GPS supervision of intimate partner violence defendants reduce pretrial misconduct? Evidence from a quasi-experimental study

    Objectives: This research examines the effect global positioning system (GPS) technology supervision has on pretrial misconduct for defendants facing intimate partner violence charges. Methods: Drawing on data from one pretrial services division, a retrospective quasi-experimental design was constructed to examine failure to appear to court, failure to appear to meetings with pretrial services, and rearrest outcomes between defendants ordered to pretrial GPS supervision and a comparison group of defendants ordered to pretrial supervision without the use of monitoring technology. Cox regression models were used to assess differences between quasi-experimental conditions. To enhance internal validity and mitigate model dependence, we utilized and compared results across four counterfactual comparison groups (propensity score matching, Mahalanobis distance matching, inverse probability of treatment weighting, and marginal mean weighting through stratification). Results: Pretrial GPS supervision was no more or less effective than traditional, non-technology-based pretrial supervision in reducing the risk of failure to appear to court or the risk of rearrest. GPS supervision did reduce the risk of failing to appear to meetings with pretrial services staff. Conclusions: The results suggest that GPS supervision may hold untapped case management benefits for pretrial probation officers, a pragmatic focus that may be overshadowed by efforts to mitigate the risk of pretrial misconduct. Further, the results contribute to ongoing discussions on bail reform, pretrial practice, and the movement to reduce local jail populations. Although the cost savings are not entirely clear, relatively higher risk defendants can be managed in the community and produce outcomes that are comparable to other defendants. The results also call into question the ability of matching procedures to construct appropriate counterfactuals in an era where risk assessment informs criminal justice decision-making. Weighting techniques outperformed matching strategies

    How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility

    Recommendation systems are ubiquitous and impact many domains; they have the potential to influence product consumption, individuals' perceptions of the world, and life-altering decisions. These systems are often evaluated or trained with data from users already exposed to algorithmic recommendations; this creates a pernicious feedback loop. Using simulations, we demonstrate how using data confounded in this way homogenizes user behavior without increasing utility

    The effect of mode and context on survey results: analysis of data from the Health Survey for England 2006 and the Boost Survey for London.

    BACKGROUND: Health-related data at local level could be provided by supplementing national health surveys with local boosts. Self-completion surveys are less costly than interviews, enabling larger samples to be achieved for a given cost. However, even when the same questions are asked with the same wording, responses to survey questions may vary by mode of data collection. These measurement differences need to be investigated further. METHODS: The Health Survey for England in London ('Core') and a London Boost survey ('Boost') used identical sampling strategies but different modes of data collection. Some data were collected by face-to-face interview in the Core and by self-completion in the Boost; other data were collected by self-completion questionnaire in both, but the context differed. Results were compared by mode of data collection using two approaches. The first examined differences in results that remained after adjusting the samples for differences in response. The second compared results after using propensity score matching to reduce any differences in sample composition. RESULTS: There were no significant differences between the two samples for prevalence of some variables including long-term illness, limiting long-term illness, current rates of smoking, whether participants drank alcohol, and how often they usually drank. However, there were a number of differences, some quite large, between some key measures including: general health, GHQ12 score, portions of fruit and vegetables consumed, levels of physical activity, and, to a lesser extent, smoking consumption, the number of alcohol units reported consumed on the heaviest day of drinking in the last week and perceived social support (among women only). CONCLUSION: Survey mode and context can both affect the responses given. The effect is largest for complex question modules but was also seen for identical self-completion questions. Some data collected by interview and self-completion can be safely combined

    Geometric deep learning: going beyond Euclidean data

    Many scientific fields study data with an underlying structure that is a non-Euclidean space. Some examples include social networks in computational social sciences, sensor networks in communications, functional networks in brain imaging, regulatory networks in genetics, and meshed surfaces in computer graphics. In many applications, such geometric data are large and complex (in the case of social networks, on the scale of billions), and are natural targets for machine learning techniques. In particular, we would like to use deep neural networks, which have recently proven to be powerful tools for a broad range of problems from computer vision, natural language processing, and audio analysis. However, these tools have been most successful on data with an underlying Euclidean or grid-like structure, and in cases where the invariances of these structures are built into networks used to model them. Geometric deep learning is an umbrella term for emerging techniques attempting to generalize (structured) deep neural models to non-Euclidean domains such as graphs and manifolds. The purpose of this paper is to overview different examples of geometric deep learning problems and present available solutions, key difficulties, applications, and future research directions in this nascent field

    Optimal client recommendation for market makers in illiquid financial products

    The process of liquidity provision in financial markets can result in prolonged exposure to illiquid instruments for market makers. In this case, where a proprietary position is not desired, proactively targeting the right client who is likely to be interested can be an effective means to offset this position, rather than relying on commensurate interest arising through natural demand. In this paper, we consider the inference of a client profile for the purpose of corporate bond recommendation, based on typical recorded information available to the market maker. Given a historical record of corporate bond transactions and bond metadata, we use a topic-modelling analogy to develop a probabilistic technique for compiling a curated list of client recommendations for a particular bond that needs to be traded, ranked by probability of interest. We show that a model based on Latent Dirichlet Allocation offers promising performance to deliver relevant recommendations for sales traders. Comment: 12 pages, 3 figures, 1 tabl