22 research outputs found

    Efficient Data Management and Statistics with Zero-Copy Integration

    Statistical analysts have long struggled with ever-growing data volumes. While specialized data management systems such as relational databases would be able to handle the data, statistical analysis tools are far more convenient for expressing complex data analyses. An integration of these two classes of systems has the potential to overcome the data management issue while at the same time keeping analysis convenient. However, one must keep a careful eye on implementation overheads such as serialization. In this paper, we propose the in-process integration of data management and analytical tools. Furthermore, we argue that a zero-copy integration is feasible due to the omnipresence of C-style arrays containing native types. We discuss the general concept and present a prototype of this integration based on the columnar relational database MonetDB and the R environment for statistical computing. We evaluate the performance of this prototype in a series of micro-benchmarks of common data management tasks.

    Model-Based Time Series Management at Scale

    Scalable diversification for data exploration platforms

    Predictive Data Analytics for Energy Demand Flexibility

    Differential expression of microRNAs and other small RNAs in muscle tissue of patients with ALS and healthy age-matched controls

    Amyotrophic lateral sclerosis (ALS) is a late-onset disorder primarily affecting motor neurons and leading to progressive and lethal skeletal muscle atrophy. Small RNAs, including microRNAs (miRNAs), can serve as important regulators of gene expression and can act both globally and in a tissue-/cell-type-specific manner. In muscle, miRNAs called myomiRs govern important processes and are deregulated in various disorders. Several myomiRs have shown promise for therapeutic use in cellular and animal models of ALS; however, the exact miRNA species differentially expressed in muscle tissue of ALS patients remain unknown. Following small RNA-Seq, we compared the expression of small RNAs in muscle tissue of ALS patients and healthy age-matched controls. The identified snoRNAs, mtRNAs and other small RNAs provide possible molecular links between insulin signaling and ALS. Furthermore, the identified miRNAs are predicted to target proteins that are involved in both normal processes and various muscle disorders, and indicate that muscle tissue is undergoing active reinnervation/compensatory attempts, thus providing targets for further research and therapy development in ALS.

    Similarity-aware query refinement for data exploration

    PAUSANIAS: Final activity report

    Search engines, such as Google and Yahoo!, provide efficient retrieval and ranking of web pages based on queries consisting of a set of given keywords. Recent studies show that 20% of all Web queries also have location constraints, i.e., also refer to the location of a geotagged web page. An increasing number of applications support location-based keyword search, including Google Maps, Bing Maps, Yahoo! Local, and Yelp. Such applications depict points of interest on the map and combine their location with the keywords provided by the associated document(s). The posed queries consist of two conditions: a set of keywords and a spatial location. The goal is to find points of interest with these keywords close to the location. We refer to such a query as a spatial-keyword query. Moreover, mobile devices nowadays are equipped with built-in GPS receivers, which permit applications (such as search engines or yellow page services) to acquire the location of the user implicitly and provide location-based services. For instance, Google Mobile App provides a simple search service for smartphones where the location of the user is automatically captured and employed to retrieve results relevant to her current location. As an example, a search for pizza results in a list of pizza restaurants near the user. In this research project, we studied how preference queries can be extended to also support keywords. To this end, we first studied preference queries in order to establish techniques that can be extended to support keywords (Chapter 1). Moreover, we introduced Top-k Spatio-Textual Preference Queries and proposed a novel indexing scheme and two algorithms for efficient query processing (Chapter 2). We also studied the problem of maximizing the influence of spatio-textual objects based on reverse top-k queries and keyword selection (Chapter 3). Finally, we analyzed the properties of geotagged photos on Flickr and proposed novel location-aware tag recommendation methods (Chapter 4).

    Provenance Management for Collaborative Data Science Workflows

    Collaborative data science activities are becoming pervasive in a variety of communities, and are often conducted in teams, with people of different expertise performing back-and-forth modeling and analysis on time-evolving datasets. Current data science systems mainly focus on specific steps in the process such as training machine learning models, scaling to large data volumes, or serving the data or the models, while the issues of end-to-end data science lifecycle management are largely ignored. Such issues include, for example, tracking provenance and derivation history of models, identifying data processing pipelines and keeping track of their evolution, analyzing unexpected behaviors and monitoring the project health, and providing the ability to reason about specific analysis results. We address these challenges by ingesting, managing, and analyzing rich provenance information generated during data science projects, and using it to enable users to easily publish, share, and discover data analytics projects. We first describe the design of our unified provenance and metadata management system, called ProvDB. We adopt a schema-later approach and use a flexible graph-based provenance representation model that combines the core concepts in version control and provenance management. We describe several ingestion mechanisms for this provenance model and show how heterogeneous data analysis environments can be served with natural extensions to this framework. We also describe a set of novel features of the system including graph queries for retrospective provenance, fileviews for data transformations, introspective queries for debugging, and continuous monitoring queries for anomaly detection. We then illustrate how to support the deep learning modeling lifecycle via the extensibility mechanism in ProvDB. We describe techniques to compactly store and efficiently query the rich set of data artifacts generated during the deep learning modeling lifecycle. We also describe a high-level domain-specific language that helps raise the abstraction level during model exploration and enumeration and accelerates the modeling process. Lastly, we propose graph query operators and develop efficient evaluation techniques to address the verbose and evolving nature of such provenance graphs. First, we introduce a graph segmentation operator, which queries the provenance of a collection of user-given vertices (e.g., versioned files, author names) via flexible boundary criteria. Second, we propose a graph summarization operator to aggregate the results of multiple segmentation operations, and allow multi-resolution interaction with the aggregation result to understand similar and abnormal behaviors in those segments.