26 research outputs found

Collaboration in an Open Data eScience: A Case Study of Sloan Digital Sky Survey

Author: Chen Chaomei
Zhang Jian
Publication date: 20/01/2010

Current science and technology have produced more and more publicly accessible scientific data. However, little is known about how the open data trend impacts a scientific community, specifically in terms of its collaboration behaviors. This paper aims to enhance our understanding of the dynamics of scientific collaboration in the open data eScience environment via a case study of co-author networks of an active and highly cited open data project, the Sloan Digital Sky Survey (SDSS). We visualized the co-authoring networks and measured their properties over time at three levels: author, institution, and country. We compared these measurements to a random network model and also compared results across the three levels. The study found that 1) the collaboration networks of the SDSS community transformed from random networks to small-world networks; 2) the number of author-level collaboration instances has not changed much over time, while the number of collaboration instances at the other two levels has increased over time; 3) pairwise institutional collaboration has become common in recent years. The open data trend may have both positive and negative impacts on scientific collaboration. Comment: iConference 201
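
The comparison against a random network model mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names and the toy paper list are hypothetical, and only one property (the average clustering coefficient) is computed. A small-world network would typically show clustering well above the random baseline while keeping short path lengths; path length is left out here for brevity.

```python
from itertools import combinations


def build_coauthor_graph(papers):
    """Build an undirected co-author graph as {author: set of co-authors}.

    `papers` is a list of author lists (hypothetical input format).
    """
    adj = {}
    for authors in papers:
        for a in authors:
            adj.setdefault(a, set())
        for a, b in combinations(set(authors), 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    if not adj:
        return 0.0
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist between pairs of this node's neighbours.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def random_graph_clustering(adj):
    """Expected clustering of an Erdos-Renyi graph with the same size and
    mean degree: p = mean_degree / (n - 1)."""
    n = len(adj)
    if n < 2:
        return 0.0
    mean_degree = sum(len(nbrs) for nbrs in adj.values()) / n
    return mean_degree / (n - 1)


# Toy data, for illustration only.
papers = [["A", "B", "C"], ["C", "D"]]
graph = build_coauthor_graph(papers)
observed = average_clustering(graph)
baseline = random_graph_clustering(graph)
```

On a real co-authorship dataset, one would compute `observed` and `baseline` per time slice and per level (author, institution, country) and look at how their ratio changes over time.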

arXiv.org e-Print Archive

CiteSeerX

Illinois Digital Environment for Access to Learning and Scholarship Repository

Qualitative Analysis of the SQLShare Workload for Session Segmentation

Author: Marcel Patrick
Peralta Veronika
Raimont Yann
Verdeaux Willeme
Publication venue: HAL CCSD
Publication date: 26/03/2019

This paper presents ongoing work aimed at better understanding the workload of SQLShare [9]. SQLShare is a database-as-a-service platform targeting scientists and data scientists with minimal database experience, whose workload was made available to the research community. According to the authors of [9], this workload is the only one containing primarily ad-hoc handwritten queries over user-uploaded datasets. We analyzed this workload by extracting features that characterize SQL queries, and we show how to use these features to separate sequences of SQL queries into meaningful sessions. We ran a few tests over various query workloads to validate our approach empirically.

HAL Université de Tours

HAL UVSQ

SQL query log analysis for identifying user interests and query recommendations

Author: Arzamasova Natalia
Publication venue: KIT-Bibliothek, Karlsruhe
Publication date: 01/01/2020

In the sciences and elsewhere, the use of relational databases has become ubiquitous. To get the most out of a database, one needs in-depth knowledge of both SQL and the domain (the structure and meaning of the data the database contains). To assist inexperienced users in formulating their needs, SQL query recommendation systems (SQL QRS) have been proposed. Such a system uses the experience of previous users, captured in the SQL query log, together with the user's own query history to suggest queries. When constructing such a system, one has to solve two related problems: (1) clean the query log and (2) define appropriate query similarity functions. These two tasks are not only necessary for building an SQL QRS but also apply to other problems. In what follows, we describe three scenarios of SQL query log analysis: (1) cleaning an SQL query log, (2) clustering an SQL query log when testing SQL query similarity functions, and (3) recommending SQL queries. We also explain how these three branches are related to each other.
Scenario 1. Cleaning the SQL query log as a general pre-processing step. The raw query log is often not suitable for query log analysis tasks such as clustering or giving recommendations, because it contains antipatterns and robotic data downloads, also known as Sliding Window Search (SWS). An antipattern in software engineering is a special case of a pattern: while a pattern is a standard solution, an antipattern is a pattern with a negative effect. When it comes to SQL query recommendation, leaving such artifacts in the log during analysis results in wrong suggestions. Firstly, the behaviour of "mortal" users who need a recommendation differs from that of robots, which perform SWS. Secondly, one does not want to recommend antipatterns, so they need to be excluded from the query pool. Thirdly, the bigger a log is, the slower a recommendation engine operates. Thus, excluding SWS and antipatterns from the input data makes the recommendation better and faster. The effect of SWS and antipatterns on query log clustering depends on the chosen similarity function. The result either (1) does not change or (2) gains clusters that cover a large part of the data. In either case, having antipatterns and SWS in the input log only increases the time needed for clustering and does not improve the quality of the results.
Scenario 2. Identifying user interests via clustering. To identify the hot spots of user interest, one clusters SQL queries. In a scientific domain, this exposes research trends. In business, it points to popular data slices which one might want to refactor for better accessibility. A good clustering result must be precise (match the ground truth) and interpretable. Query similarity relies on the SQL query representation, and there are three strategies for representing an SQL query. The FB (feature-based) representation sees a query as a structure, without considering the data the query accesses. The WB (witness-based) approach treats a query as the set of tuples in its result set. The AAB (access area-based) representation considers a query as an expression in relational algebra. While WB and FB query similarity functions are straightforward (Jaccard or cosine similarity), AAB query similarity requires an additional definition. We proposed two variants of the AAB similarity measure: overlap (AABovl) and closeness (AABcl). In AABovl, the similarity of two queries is the overlap of their access areas. AABcl relies on the distance between two access areas in the data space, so two queries may be similar even if their access areas do not overlap. The extensive experiments consist of two parts. The first clusters a rather small dataset with ground truth; it serves to study the precision of various similarity functions by comparing clustering results to supervised insights. The second investigates the interpretability of clustering results obtained with different similarity functions; it clusters a big real-world query log, and a domain expert then evaluates the results. Both experiments show that AAB similarity functions produce better results in both precision and interpretability.
Scenario 3. SQL query recommendation. A sound SQL query recommendation system (1) provides a query which can be run directly, (2) supports comparison operators and various logical operators, (3) is scalable and has low response times, and (4) provides recommendations of high quality. The existing approaches fail to fulfill all of these requirements. We proposed DASQR, a scalable and data-aware query recommendation approach, to meet all four needs. In a nutshell, DASQR is a hybrid (collaborative filtering + content-based) approach. Its variations utilize all the similarity functions which we define or find in the related work. Measuring the quality of an SQL query recommendation system (QRS) is particularly challenging since there is no standard way of approaching it. Previous studies have evaluated the results using quality metrics which rely only on the query representations used in those studies. This is somewhat subjective, since the similarity function and the quality metric are then dependent. We propose AAB quality metrics and then evaluate each approach based on all the metrics. The experiments test the DASQR approaches and their competitors. Both performance and runtime experiments indicate that the DASQR approaches outperform the existing ones.
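
The difference between a feature-based similarity and an overlap-based access area similarity can be illustrated with a minimal sketch. It is not the thesis's implementation: the token-based feature extractor, the one-dimensional range model of an access area, and all function names are assumptions made purely for illustration.

```python
import re


def query_features(sql: str) -> set[str]:
    """Hypothetical feature extractor: the set of lowercase keywords and
    identifiers in the query text. Numeric literals are ignored, so the
    accessed data plays no role (the feature-based view)."""
    return set(re.findall(r"[A-Za-z_][A-Za-z_0-9.]*", sql.lower()))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def interval_overlap(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Overlap-style similarity for access areas modelled as closed 1-D ranges:
    length of the intersection divided by length of the union."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# Two structurally similar queries that touch disjoint parts of the data.
q1 = "SELECT ra, dec FROM photoobj WHERE ra > 10 AND ra < 20"
q2 = "SELECT ra FROM photoobj WHERE ra > 200 AND ra < 210"

fb_similarity = jaccard(query_features(q1), query_features(q2))
area_similarity = interval_overlap((10, 20), (200, 210))
```

Here `fb_similarity` is high because the two queries share almost all of their tokens, while `area_similarity` is zero because the two ranges on `ra` do not intersect. A closeness-style measure such as the AABcl described above would additionally take the distance between the two ranges into account; that is not sketched here.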

KITopen

The Sloan Digital Sky Survey Science Archive: Migrating a Multi-Terabyte Astronomical Archive from Object to Relational DBMS

Author: Gray Jim
Kunszt Peter Z.
Szalay Alexander S.
Thakar Aniruddha R.
Publication date: 01/01/2004

The Sloan Digital Sky Survey Science Archive is the first in a series of multi-Terabyte digital archives in Astronomy and other data-intensive sciences. To facilitate data mining in the SDSS archive, we adapted a commercial database engine and built specialized tools on top of it. Originally we chose an object-oriented database management system due to its data organization capabilities, platform independence, query performance and conceptual fit to the data. However, after we had used the object database for the first couple of years of the project, it began to fall short in terms of its query support and data mining performance. This was as much due to the inability of the database vendor to respond to our demands for features and bug fixes as it was due to their failure to keep up with the rapid improvements in hardware performance, particularly faster RAID disk systems. In the end, we were forced to abandon the object database and migrate our data to a relational database. We describe below the technical issues that we faced with the object database and how and why we migrated to relational technology.

arXiv.org e-Print Archive

CERN Document Server

Query Formulation and Recommendation for Relational Databases Using User Sessions and Collaborative Filtering

Author: Mr. S. D. Chopade, Prof. S. S. Bere
Publication venue: 'Auricle Technologies, Pvt., Ltd.'
Publication date: 31/07/2015

Structured Query Language (SQL) has a uniform structure across different programming languages. The queries fired on a Database Management System (DBMS) contain textual information along with selected segments of data, which the DBMS parses in order to execute them as structured queries. Currently, a DBA needs to execute complex queries on large databases. Users and DBAs often fire similar queries on the database server to get useful information. Queries can be similar to each other in two ways: a) the tuples retrieved by the SQL queries are similar, or b) the fragments of the queries are similar. The system recommends such similar queries, saving the DBA the time of constructing them again and again. Query suggestions given to a DBA or to users are known as query recommendations. To develop a query recommendation system, many authors have suggested the use of a query log. Query suggestions fall into two main areas: collaborative recommendations and single-log recommendations. This system uses a single or a collaborative log, controlled by a parameter known as the mixing factor. In this paper we analyze SQL query recommendation concepts and their uses. Two types of similarity measure for query recommendation are considered in [1]: 1) fragment-based and 2) tuple-based. In this paper we are motivated to generate recommendations for nested SQL queries. We apply hierarchical classification to the query log to create classes of similar queries; to generate recommendations for an SQL query, we then find the matching class, from which the recommendations can be modeled. DOI: 10.17762/ijritcc2321-8169.15070

International Journal on Recent and Innovation Trends in Computing and Communication

A pragmatic approach: Achieving acceptable security mechanisms for high speed data transfer protocol-UDT

Author: Bernardo DV
Hoang DB
Publication date: 01/12/2010

The development of next-generation protocols, such as UDT (UDP-based data transfer), promptly addresses various infrastructure requirements for transmitting data in high-speed networks. However, this development creates new vulnerabilities when these protocols are designed to rely solely on the existing security solutions of existing protocols such as TCP and UDP. It is clear that not all security protocols (such as TLS) can be used to protect UDT, just as security solutions devised for wired networks cannot be used to protect wireless ones. The development of UDT, like the development of TCP/UDP many years ago, lacked a well-thought-out security architecture to address the problems that networks are presently experiencing. This paper proposes and analyses practical security mechanisms for UDT.

OPUS - University of Technology Sydney

Usage Bibliometrics

Author: Abt
Accomazzi
Aggarwal
Baldi
Bar-Ilan
Bensman
Bertot
Blecic
Bollen
Bonitz
Borgman
Boyack
Brin
Broadus
Brody
Brookes
Burton
Börner
Castellano
Chen
Cooper
Craig
Cronin
Darmoni
Davis
Drott
Duy
Eason
Egghe
Eichhorn
Eysenbach
Fortunato
Freire
Galvin
Gardner
Garfield
Gargouri
Georgakopoulos
Ginsparg
Goldberg
Gosnell
Grant
Gross
Hajjem
Harnad
He
Henneken
Hider
Hood
Huntington
Jamali
Jansen
Kaplan
King
Kurtz
Ladwig
Lawrence
Leydesdorff
Line
Liu
Ludascher
Luther
MacRoberts
May
Mayr
McDonald
Meadows
Merton
Moed
Moya-Anegón
Nicholas
Norris
Pan
Parker
Peters
Pinski
Pirolli
Price
Rice
Rosvall
Rowlands
Scales
Shepherd
Small
Stankus
Szalay
Tenopir
Tonta
Trimble
Tsay
Van de Sompel
Walter
Wang
Wasserman
White
Wilson
York
Publication venue: 'Wiley'
Publication date: 14/02/2011

Scholarly usage data provides unique opportunities to address the known shortcomings of citation analysis. However, the collection, processing and analysis of usage data remains an area of active research. This article provides a review of the state of the art in usage-based informetrics, i.e. the use of usage data to study the scholarly process. Comment: Publisher's PDF (by permission). Publisher web site: books.infotoday.com/asist/arist44.shtm

arXiv.org e-Print Archive

Crossref