26 research outputs found
Collaboration in an Open Data eScience: A Case Study of Sloan Digital Sky Survey
Current science and technology have produced more and more publicly
accessible scientific data. However, little is known about how the open data
trend impacts a scientific community, specifically in terms of its
collaboration behaviors. This paper aims to enhance our understanding of the
dynamics of scientific collaboration in the open data eScience environment via
a case study of co-author networks of an active and highly cited open data
project, called Sloan Digital Sky Survey. We visualized the co-authoring
networks and measured their properties over time at three levels: author,
institution, and country levels. We compared these measurements to a random
network model and also compared results across the three levels. The study
found that 1) the collaboration networks of the SDSS community transformed from
random networks to small-world networks; 2) the number of author-level
collaboration instances has not changed much over time, while the number of
collaboration instances at the other two levels has increased over time; 3)
pairwise institutional collaboration has become common in recent years. The open
data trend may have both positive and negative impacts on scientific
collaboration. Comment: iConference 201
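As a rough illustration of the random-versus-small-world comparison described above, the sketch below computes two quantities commonly used for such comparisons, the average clustering coefficient and the average shortest path length, on a toy co-author graph. The toy papers, the function names, and the choice of these two metrics are assumptions made for illustration; the abstract does not state which measures the study used.

```python
from collections import deque
from itertools import combinations


def build_graph(papers):
    """Build an undirected co-author graph: authors are nodes, and two
    authors are linked if they share at least one paper."""
    graph = {}
    for authors in papers:
        for a in authors:
            graph.setdefault(a, set())
        for a, b in combinations(sorted(set(authors)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def average_clustering(graph):
    """Mean local clustering coefficient (nodes with degree < 2 count as 0)."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count links that exist between pairs of neighbours.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in graph[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over all connected ordered pairs (BFS)."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical toy data: each inner list is the author list of one paper.
toy_papers = [["A", "B", "C"], ["C", "D"], ["D", "E"]]
toy_graph = build_graph(toy_papers)
clustering = average_clustering(toy_graph)      # 7/15, about 0.467
path_length = average_path_length(toy_graph)    # 1.7
```

A small-world network is usually characterised by clustering well above that of a random graph with the same number of nodes and edges, combined with a similarly short average path length, so in practice the two values would be compared against a random baseline rather than read in isolation.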
Qualitative Analysis of the SQLShare Workload for Session Segmentation
This paper presents ongoing work aimed at better understanding the workload of SQLShare [9]. SQLShare is a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of [9], this workload is the only one containing primarily ad-hoc handwritten queries over user-uploaded datasets. We analyzed this workload by extracting features that characterize SQL queries, and we show how to use these features to separate sequences of SQL queries into meaningful sessions. We ran a few tests over various query workloads to validate our approach empirically.
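One simple way to picture feature-based session segmentation is to represent each query by the set of tables it references and to start a new session whenever a query shares no table with its predecessor. The sketch below does only that. The regular expression, the single feature, and the cut rule are hypothetical choices made for illustration; they are not the features or the method of the paper.

```python
import re

# Naive pattern: an identifier following FROM or JOIN is taken as a table name.
TABLE_RE = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables(query):
    """Return the lower-cased set of table names referenced by a query."""
    return {name.lower() for name in TABLE_RE.findall(query)}


def segment(queries):
    """Split an ordered list of queries into sessions; a new session starts
    whenever a query shares no table with the previous query."""
    sessions = []
    previous = None
    for q in queries:
        current = tables(q)
        if previous is None or not (current & previous):
            sessions.append([])
        sessions[-1].append(q)
        previous = current
    return sessions


# Hypothetical log, in submission order.
log = [
    "SELECT * FROM stars",
    "SELECT ra FROM stars WHERE ra > 10",
    "SELECT * FROM genes JOIN samples ON genes.id = samples.gid",
    "SELECT count(*) FROM samples",
]
sessions = segment(log)  # two sessions of two queries each
```

A real workload would need a proper SQL parser and more features (for example projected columns, predicates, or timestamps), since a pattern like this misses subqueries and comma-separated table lists.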
SQL query log analysis for identifying user interests and query recommendations
In the sciences and elsewhere, the use of relational databases has become ubiquitous.
To get the most out of a database, one needs in-depth knowledge of both
SQL and the domain (the structure and meaning of the data the database contains). To assist
inexperienced users in formulating their needs, an SQL query recommendation system
(SQL QRS) has been proposed. It uses the experience of previous users, captured in the
SQL query log, as well as the user's own query history to suggest queries. When constructing such
a system, one must solve two related problems: (1) clean the query log and (2) define
appropriate query similarity functions. These two tasks are necessary not only for
building an SQL QRS; they also apply to other problems. In what follows, we describe
three scenarios of SQL query log analysis: (1) cleaning an SQL query log, (2) SQL
query log clustering when testing SQL query similarity functions and (3) recommending
SQL queries. We also explain how these three branches are related to each other.
Scenario 1. Cleaning SQL query log as a general pre-processing step
The raw query log is often not suitable for query log analysis tasks such as clustering
or giving recommendations, because it contains antipatterns and robotic data
downloads, also known as Sliding Window Search (SWS). An antipattern in software
engineering is a special case of a pattern. While a pattern is a standard solution, an
antipattern is a pattern with a negative effect.
When it comes to SQL query recommendation, leaving such artifacts in the log during
analysis results in wrong suggestions. Firstly, the behaviour of "mortal" users who
need a recommendation differs from that of robots, which perform SWS. Secondly, one
does not want to recommend antipatterns, so they need to be excluded from the query
pool. Thirdly, the bigger a log is, the slower a recommendation engine operates. Thus,
excluding SWS and antipatterns from the input data makes the recommendation
better and faster.
The effect of SWS and antipatterns on query log clustering depends on the chosen
similarity function. The result can either (1) remain unchanged or (2) gain clusters which
cover a large part of the data. In any case, having antipatterns and SWS in an input log
only increases the time needed for clustering and does not improve the quality of the results.
Scenario 2. Identifying User Interests via Clustering
To identify the hot spots of user interests, one clusters SQL queries. In a scientific
domain, it exposes research trends. In business, it points to popular data slices which
one might want to refactor for better accessibility. A good clustering result must be
precise (match ground truth) and interpretable.
Query similarity relies on SQL query representation. There are three strategies to
represent an SQL query. The FB (feature-based) representation sees a query as a
structure, without considering the data the query accesses. The WB (witness-based) approach
treats a query as the set of tuples in its result set. The AAB (access area-based) representation
considers a query as an expression in relational algebra. While WB and FB query
similarity functions are straightforward (Jaccard or cosine similarities), AAB query
similarity requires additional definition. We proposed two variants of AAB similarity
measure – overlap (AABovl) and closeness (AABcl). In AABovl, the similarity of two
queries is the overlap of their access areas. AABcl relies on the distance between two
access areas in the data space – two queries may be similar even if their access areas
do not overlap.
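To make the contrast concrete, here is a minimal sketch of a Jaccard similarity over query features and of two access-area-style measures on a single numeric attribute. Modelling an access area as a closed interval (low, high), the feature labels, and the particular overlap and closeness formulas are all assumptions made for illustration; they are not the definitions of AABovl and AABcl used by the authors.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def interval_overlap(x, y):
    """Overlap-style similarity of two intervals: length of their
    intersection divided by the length of their union."""
    (lo1, hi1), (lo2, hi2) = x, y
    inter = max(0.0, min(hi1, hi2) - max(lo1, lo2))
    union = (hi1 - lo1) + (hi2 - lo2) - inter
    return inter / union if union > 0 else 0.0


def interval_closeness(x, y, domain_width):
    """Closeness-style similarity: 1 minus the gap between the intervals,
    normalised by the width of the attribute's domain. Overlapping
    intervals have a gap of 0 and therefore a similarity of 1."""
    (lo1, hi1), (lo2, hi2) = x, y
    gap = max(0.0, max(lo1, lo2) - min(hi1, hi2))
    return max(0.0, 1.0 - gap / domain_width)


# Feature-based view: two queries described by hypothetical feature labels.
q1_features = {"table:stars", "col:ra", "op:>"}
q2_features = {"table:stars", "col:ra", "op:<"}
fb_sim = jaccard(q1_features, q2_features)                  # 2 shared of 4 -> 0.5

# Access-area view: ranges selected on one attribute with domain [0, 100].
ovl_sim = interval_overlap((0, 10), (5, 15))                # 5 / 15
ovl_disjoint = interval_overlap((0, 10), (12, 20))          # 0.0
cl_disjoint = interval_closeness((0, 10), (12, 20), 100)    # 1 - 2/100
```

The last two values show the point made in the text: two queries whose access areas do not overlap score zero under an overlap measure but can still be rated as similar under a closeness measure.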
The extensive experiments consist of two parts. The first one is clustering a rather
small dataset with ground truth. This experiment serves to study the precision of
various similarity functions by comparing clustering results to supervised insights. The
second experiment aims to investigate the interpretability of clustering results with
different similarity functions. It clusters a big real-world query log. A domain expert
then evaluates the results. Both experiments show that AAB similarity functions
produce better results in both precision and interpretability.
Scenario 3. SQL Query Recommendation
A sound SQL query recommendation system (1) provides a query which can be run
directly, (2) supports comparison operators and various logical operators, (3) is scalable
and has low response times, (4) provides recommendations of high quality. The existing
approaches fail to fulfill all the requirements. We proposed DASQR, a scalable and
data-aware query recommendation approach, to meet all four needs. In a nutshell, DASQR is
a hybrid (collaborative filtering + content-based) approach. Its variations utilize all
similarity functions, which we define or find in the related work.
Measuring the quality of an SQL query recommendation system (QRS) is particularly
challenging since there is no standard way of approaching it. Previous studies have
evaluated the results using quality metrics which only rely on the query representations
used in these studies. This is somewhat subjective since the similarity function and the
quality metric are dependent. We propose AAB quality metrics and then evaluate
each approach based on all the metrics.
The experiments test DASQR approaches and competitors. Both performance and
runtime experiments indicate that DASQR approaches outperform the existing ones.
The Sloan Digital Sky Survey Science Archive: Migrating a Multi-Terabyte Astronomical Archive from Object to Relational DBMS
The Sloan Digital Sky Survey Science Archive is the first in a series of
multi-Terabyte digital archives in Astronomy and other data-intensive sciences.
To facilitate data mining in the SDSS archive, we adapted a commercial database
engine and built specialized tools on top of it. Originally we chose an
object-oriented database management system due to its data organization
capabilities, platform independence, query performance and conceptual fit to
the data. However, after using the object database for the first couple of
years of the project, it soon began to fall short in terms of its query support
and data mining performance. This was as much due to the inability of the
database vendor to respond to our demands for features and bug fixes as it was due
to their failure to keep up with the rapid improvements in hardware
performance, particularly faster RAID disk systems. In the end, we were forced
to abandon the object database and migrate our data to a relational database.
We describe below the technical issues that we faced with the object database
and how and why we migrated to relational technology.
Query Formulation and Recommendation for Relational Databases Using User Sessions and Collaborative Filtering
Structured Query Language (SQL) has a uniform structure across different programming languages. Queries issued to a Database Management System (DBMS) contain textual information along with selected segments of data, which the DBMS parses in order to execute them as structured queries. Database administrators (DBAs) currently need to execute complex queries on large databases, and a user or DBA often issues similar queries to the database server to obtain useful information. Queries can be similar in two ways: a) the tuples they retrieve are similar, or b) their fragments are similar. A system that recommends such similar queries saves the DBA the time of constructing them again and again. Query suggestions given to DBAs or users are known as Query Recommendation. To develop a Query Recommendation system, many authors have suggested using the query log. Query suggestions fall into two main areas: Collaborative Recommendations and Single Log Recommendations. The system is built on a single or collaborative log, using a parameter known as the mixing factor. In this paper we analyze SQL Query Recommendation concepts and their uses. Two types of similarity measure for Query Recommendation are considered in [1]: 1) Fragment Based and 2) Tuple Based. In this paper we are motivated to generate recommendations for nested SQL queries. We apply hierarchical classification to the query log to create classes of similar queries; to generate recommendations for an SQL query, we then find the matching class, from which the recommendations can be modeled.
DOI: 10.17762/ijritcc2321-8169.15070
A pragmatic approach: Achieving acceptable security mechanisms for high speed data transfer protocol-UDT
The development of next generation protocols, such as UDT (UDP-based data transfer), promptly addresses various infrastructure requirements for transmitting data in high speed networks. However, this development creates new vulnerabilities when these protocols are designed to rely solely on the security solutions of existing protocols such as TCP and UDP. It is clear that not all security protocols (such as TLS) can be used to protect UDT, just as security solutions devised for wired networks cannot be used to protect wireless ones. The development of UDT, like the development of TCP/UDP many years ago, lacked a well-thought-out security architecture to address the problems that networks are presently experiencing. This paper proposes and analyses practical security mechanisms for UDT.
Usage Bibliometrics
Scholarly usage data provides unique opportunities to address the known
shortcomings of citation analysis. However, the collection, processing and
analysis of usage data remain an area of active research. This article
provides a review of the state of the art in usage-based informetrics, i.e. the
use of usage data to study the scholarly process. Comment: Publisher's PDF (by permission). Publisher web site:
books.infotoday.com/asist/arist44.shtm