26 research outputs found

    Collaboration in an Open Data eScience: A Case Study of Sloan Digital Sky Survey

    Current science and technology have produced more and more publicly accessible scientific data. However, little is known about how the open data trend impacts a scientific community, specifically in terms of its collaboration behaviors. This paper aims to enhance our understanding of the dynamics of scientific collaboration in the open data eScience environment via a case study of the co-author networks of an active and highly cited open data project, the Sloan Digital Sky Survey (SDSS). We visualized the co-authoring networks and measured their properties over time at three levels: author, institution, and country. We compared these measurements to a random network model and also compared results across the three levels. The study found that 1) the collaboration networks of the SDSS community transformed from random networks to small-world networks; 2) the number of author-level collaboration instances has not changed much over time, while the number of collaboration instances at the other two levels has increased over time; 3) pairwise institutional collaboration has become common in recent years. The open data trend may have both positive and negative impacts on scientific collaboration. Comment: iConference 201
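The abstract above does not say which network properties were measured. As a minimal sketch, assuming the two usual small-world indicators (average clustering coefficient and average shortest path length), the following builds a co-author graph from a toy paper list and computes both. The function names, the toy data, and the random-graph approximations (clustering of about k/n and path length of about ln n / ln k for mean degree k) are illustrative assumptions, not the study's method or its random network model.

```python
from collections import deque
from itertools import combinations
import math

def build_graph(papers):
    """Link every pair of authors who share a paper (undirected graph)."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set())
        for a, b in combinations(authors, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def avg_clustering(graph):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2 * links / (k * (k - 1))
    return total / len(graph)

def avg_path_length(graph):
    """Mean shortest-path length over reachable ordered pairs, via BFS."""
    total = pairs = 0
    for src in graph:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def random_baseline(n, k):
    """Textbook approximations for a random graph with n nodes and mean degree k > 1."""
    return k / n, math.log(n) / math.log(k)

# Toy data (hypothetical): each inner list is the author list of one paper.
papers = [["A", "B", "C"], ["C", "D"], ["D", "E"]]
g = build_graph(papers)
mean_degree = sum(len(nbrs) for nbrs in g.values()) / len(g)
c_obs, l_obs = avg_clustering(g), avg_path_length(g)
c_rand, l_rand = random_baseline(len(g), mean_degree)
```

A network is commonly called small-world when its clustering is well above the random baseline while its path length stays comparable; a five-node toy graph is far too small for that comparison to mean anything.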

    Qualitative Analysis of the SQLShare Workload for Session Segmentation

    This paper presents ongoing work aimed at better understanding the workload of SQLShare [9]. SQLShare is a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of [9], this workload is the only one containing primarily ad-hoc handwritten queries over user-uploaded datasets. We analyzed this workload by extracting features that characterize SQL queries, and we show how to use these features to separate sequences of SQL queries into meaningful sessions. We ran a few tests over various query workloads to empirically validate our approach.
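The abstract above does not list the query features or the segmentation rule. As a minimal sketch, assuming that the only feature is the set of table names in a query and that a new session starts when consecutive queries share too few tables, segmentation could look like this. The `tables` extractor, the `threshold` value, and the sample log are all hypothetical.

```python
import re

def tables(query):
    """Hypothetical feature extractor: identifiers that follow FROM or JOIN."""
    pattern = r"\b(?:from|join)\s+([A-Za-z_][\w.]*)"
    return {t.lower() for t in re.findall(pattern, query, flags=re.IGNORECASE)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def segment(queries, threshold=0.5):
    """Start a new session whenever a query shares too few tables with the previous one."""
    sessions = []
    for q in queries:
        if sessions and jaccard(tables(sessions[-1][-1]), tables(q)) >= threshold:
            sessions[-1].append(q)
        else:
            sessions.append([q])
    return sessions

# Made-up log: three queries about stars, then an unrelated one.
log = [
    "SELECT * FROM stars",
    "SELECT ra, dec FROM stars WHERE mag < 15",
    "SELECT s.ra FROM stars s JOIN spectra p ON s.id = p.id",
    "SELECT count(*) FROM sales",
]
sessions = segment(log)
```

Comparing each query only with its predecessor keeps the sketch simple. A real segmenter would probably combine several features, and possibly timestamps, before cutting a session.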

    SQL query log analysis for identifying user interests and query recommendations

    In the sciences and elsewhere, the use of relational databases has become ubiquitous. To get the most out of a database, one needs in-depth knowledge of both SQL and the domain (the structure and meaning of the data the database contains). To help inexperienced users formulate their needs, SQL query recommendation systems (SQL QRS) have been proposed. Such a system uses the experience of previous users, captured in the SQL query log, together with the current user's query history to suggest queries. Constructing such a system requires solving two related problems: (1) cleaning the query log and (2) defining appropriate query similarity functions. These two tasks are not only necessary for building an SQL QRS; they also apply to other problems. In what follows, we describe three scenarios of SQL query log analysis: (1) cleaning an SQL query log, (2) clustering an SQL query log when testing SQL query similarity functions, and (3) recommending SQL queries. We also explain how these three branches relate to each other.
    Scenario 1. Cleaning the SQL query log as a general pre-processing step. The raw query log is often unsuitable for query log analysis tasks such as clustering and giving recommendations, because it contains antipatterns and robotic data downloads, also known as Sliding Window Search (SWS). An antipattern in software engineering is a special case of a pattern: while a pattern is a standard solution, an antipattern is a pattern with a negative effect. For SQL query recommendation, leaving such artifacts in the log during analysis leads to wrong suggestions. Firstly, the behaviour of "mortal" users who need a recommendation differs from that of robots, which perform SWS. Secondly, one does not want to recommend antipatterns, so they need to be excluded from the query pool. Thirdly, the bigger a log is, the slower a recommendation engine operates. Thus, excluding SWS and antipatterns from the input data makes recommendation better and faster. The effect of SWS and antipatterns on query log clustering depends on the chosen similarity function. The result either (1) does not change or (2) gains clusters that cover a large part of the data. In either case, leaving antipatterns and SWS in the input log increases only the time needed for clustering, not the quality of the results.
    Scenario 2. Identifying user interests via clustering. To identify the hot spots of user interest, one clusters SQL queries. In a scientific domain, clustering exposes research trends. In business, it points to popular data slices which one might want to refactor for better accessibility. A good clustering result must be precise (match ground truth) and interpretable. Query similarity relies on the SQL query representation, and there are three strategies for representing an SQL query. The FB (feature-based) representation treats a query as a structure, without considering the data the query accesses. The WB (witness-based) approach treats a query as the set of tuples in its result set. The AAB (access area-based) representation considers a query as an expression in relational algebra. While WB and FB query similarity functions are straightforward (Jaccard or cosine similarity), AAB query similarity requires additional definition. We proposed two variants of the AAB similarity measure: overlap (AABovl) and closeness (AABcl). In AABovl, the similarity of two queries is the overlap of their access areas. AABcl relies on the distance between two access areas in the data space, so two queries may be similar even if their access areas do not overlap. The extensive experiments consist of two parts. The first clusters a rather small dataset with ground truth; it serves to study the precision of various similarity functions by comparing clustering results to supervised insights. The second investigates the interpretability of clustering results obtained with different similarity functions; it clusters a big real-world query log, and a domain expert then evaluates the results. Both experiments show that AAB similarity functions produce better results in both precision and interpretability.
    Scenario 3. SQL query recommendation. A sound SQL query recommendation system (1) provides a query which can be run directly, (2) supports comparison operators and various logical operators, (3) is scalable and has low response times, and (4) provides recommendations of high quality. The existing approaches fail to fulfill all of these requirements. We proposed DASQR, a scalable and data-aware query recommendation approach, to meet all four. In a nutshell, DASQR is a hybrid (collaborative filtering + content-based) approach. Its variants use all the similarity functions that we define or find in related work. Measuring the quality of an SQL query recommendation system (QRS) is particularly challenging, since there is no standard way of approaching it. Previous studies have evaluated their results using quality metrics that rely only on the query representations used in those studies. Such an evaluation is somewhat subjective, since the similarity function and the quality metric depend on each other. We propose AAB quality metrics and then evaluate each approach on all the metrics. The experiments test the DASQR approaches and their competitors. Both performance and runtime experiments indicate that the DASQR approaches outperform the existing ones.
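To make the difference between the similarity functions in Scenario 2 concrete, here is a minimal sketch. It assumes a strong simplification that is not in the abstract: each access area is a closed numeric interval on a single attribute. The formulas for `overlap_similarity` and `closeness_similarity` are illustrative stand-ins for AABovl and AABcl, not the authors' definitions, and the feature sets are made up.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets, as used for FB or WB representations."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def overlap_similarity(x, y):
    """Overlap-style score: shared interval length over the combined span (assumed formula)."""
    shared = min(x[1], y[1]) - max(x[0], y[0])
    if shared <= 0:
        return 0.0
    span = max(x[1], y[1]) - min(x[0], y[0])
    return shared / span

def closeness_similarity(x, y):
    """Closeness-style score: decays with the gap between intervals (assumed formula)."""
    gap = max(0.0, max(x[0], y[0]) - min(x[1], y[1]))
    return 1.0 / (1.0 + gap)

# Hypothetical access areas: the range of one attribute that each query selects.
q1, q2, q3 = (10, 15), (14, 20), (16, 18)

fb = jaccard({"stars", "mag"}, {"stars", "ra"})   # structural features only
ovl_12 = overlap_similarity(q1, q2)               # intervals intersect
ovl_13 = overlap_similarity(q1, q3)               # disjoint intervals score 0
cl_13 = closeness_similarity(q1, q3)              # disjoint but nearby, so > 0
```

The last two values show the point made in the abstract: an overlap measure treats `q1` and `q3` as unrelated, while a closeness measure still rates them as similar because their access areas lie near each other in the data space.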

    The Sloan Digital Sky Survey Science Archive: Migrating a Multi-Terabyte Astronomical Archive from Object to Relational DBMS

    The Sloan Digital Sky Survey Science Archive is the first in a series of multi-Terabyte digital archives in Astronomy and other data-intensive sciences. To facilitate data mining in the SDSS archive, we adapted a commercial database engine and built specialized tools on top of it. Originally we chose an object-oriented database management system due to its data organization capabilities, platform independence, query performance and conceptual fit to the data. However, after we had used the object database for the first couple of years of the project, it began to fall short in terms of its query support and data mining performance. This was as much due to the inability of the database vendor to respond to our demands for features and bug fixes as it was due to their failure to keep up with the rapid improvements in hardware performance, particularly faster RAID disk systems. In the end, we were forced to abandon the object database and migrate our data to a relational database. We describe below the technical issues that we faced with the object database and how and why we migrated to relational technology.

    Query Formulation and Recommendation for Relational Databases Using User Sessions and Collaborative Filtering

    Structured Query Language (SQL) has a uniform structure across different programming languages. Queries issued to a database management system (DBMS) contain textual information along with selected segments of data, which the DBMS parses in order to execute them as structured queries. Database administrators (DBAs) currently need to execute complex queries on large databases, and a user or DBA often issues similar queries against the database server to obtain useful information. Queries that are similar to each other can be categorized into two types: a) the tuples retrieved by the queries are similar; b) the fragments of the queries are similar. The system recommends such similar queries, which saves the DBA the time of constructing them again and again. Query suggestions given to DBAs or users are known as query recommendations. To develop a query recommendation system, many authors have suggested using the query log. Query suggestions fall into two main areas: collaborative recommendations and single-log recommendations. The system uses a single or a collaborative log, governed by a parameter known as the mixing factor. In this paper we analyze SQL query recommendation concepts and their uses. Two types of similarity measure for query recommendation are considered in [1]: 1) fragment-based and 2) tuple-based. In this paper we aim to generate recommendations for nested SQL queries. We apply hierarchical classification to the query log to create classes of similar queries; to generate recommendations for an SQL query, we then find the matching class, from which the recommendations can be modeled. DOI: 10.17762/ijritcc2321-8169.15070
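The abstract above names a mixing factor but does not define it. One plausible reading, shown here purely as an assumption, is a weight that blends scores from the user's own log with scores from a shared (collaborative) log. The `recommend` function, its frequency-based scoring, and the sample logs are hypothetical.

```python
from collections import Counter

def recommend(user_log, shared_log, mixing_factor=0.5, k=2):
    """Rank queries by a weighted blend of their relative frequency in each log.

    mixing_factor = 1.0 uses only the user's own log (single-log mode);
    mixing_factor = 0.0 uses only the shared log (collaborative mode).
    """
    own, others = Counter(user_log), Counter(shared_log)
    n_own = sum(own.values()) or 1        # avoid division by zero on empty logs
    n_others = sum(others.values()) or 1
    scores = {
        q: mixing_factor * own[q] / n_own + (1 - mixing_factor) * others[q] / n_others
        for q in set(own) | set(others)
    }
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda q: (-scores[q], q))[:k]

# Query identifiers stand in for full SQL text.
user_log = ["A", "A", "B"]
shared_log = ["C", "C", "C", "B"]

single = recommend(user_log, shared_log, mixing_factor=1.0)
collaborative = recommend(user_log, shared_log, mixing_factor=0.0)
```

Intermediate values trade personal history against community behaviour. Scoring by raw frequency is the simplest choice here; the fragment-based or tuple-based similarity mentioned in the abstract could replace it.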

    A pragmatic approach: Achieving acceptable security mechanisms for high speed data transfer protocol-UDT

    The development of next-generation protocols, such as UDT (UDP-based data transfer), promptly addresses various infrastructure requirements for transmitting data in high-speed networks. However, this development creates new vulnerabilities when these protocols are designed to rely solely on the security solutions of existing protocols such as TCP and UDP. It is clear that not all security protocols (such as TLS) can be used to protect UDT, just as security solutions devised for wired networks cannot be used to protect wireless ones. The development of UDT, like the development of TCP/UDP many years ago, lacked a well-thought-out security architecture to address the problems that networks are presently experiencing. This paper proposes and analyses practical security mechanisms for UDT.

    Usage Bibliometrics

    Scholarly usage data provides unique opportunities to address the known shortcomings of citation analysis. However, the collection, processing and analysis of usage data remain an area of active research. This article provides a review of the state of the art in usage-based informetrics, i.e., the use of usage data to study the scholarly process. Comment: Publisher's PDF (by permission). Publisher web site: books.infotoday.com/asist/arist44.shtm