16 research outputs found

Big Data Management Challenges, Approaches, Tools and their limitations

Author: Adiba Michel
Castrejon-Castillo Juan-Carlos
Espinosa Oviedo Javier Alfonso
Vargas-Solar Genoveva
Zechinelli-Martini José-Luis
Publication venue: Chapman and Hall/CRC
Publication date: 01/02/2016

Big Data is the buzzword everyone talks about. Independently of the application domain, today there is a consensus about the V's characterizing Big Data: Volume, Variety, and Velocity. By focusing on data management issues and past experiences in the area of database systems, this chapter examines the main challenges involved in the three V's of Big Data. Then it reviews the main characteristics of existing solutions for addressing each of the V's (e.g., NoSQL, parallel RDBMS, stream data management systems, and complex event processing systems). Finally, it provides a classification of different functions offered by NewSQL systems and discusses their benefits and limitations for processing Big Data.

Hal - Université Grenoble Alpes

Organizing hidden-web databases by clustering visible web documents

Author: Barbosa Luciano
Freire Juliana
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/04/2007

In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context (both within and in the neighborhood of forms) as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured both in terms of entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.

The University of Utah: J. Willard Marriott Digital Library

Web Archiving Bibliography 2013

Author: Reyes Ayala Brenda
Publication date: 28/06/2013

The following document is a bibliography of the field of web archiving. It includes a preface as well as a list of bibliographical resources.

UNT Digital Library

CCIndex: A Complemental Clustering Index on Distributed Ordered Tables for Multi-dimensional Range Queries

Author: A. Guttman
A. Rowstron
D. Comer
D.J. DeWitt
F. Chang
H.C. Yang
I. Stoica
J. Albrecht
J. MacCormick
M. Cafarella
M. Cai
R. Bayer
R. Sylvia
S. Fushimi
S. Padmandabhan
T.K. Sellis
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

Crossref

CrawNet: Multimedia Crawler Resources for Both Surface and Hidden Web

Author: Estrada-Esquivel Hugo
Martínez-Rebollar Alicia
Pech-May Fernando
Pedroza-Landa Eduardo
Publication venue: 'Universidad Catolica Luis Amigo'
Publication date: 01/01/2015

The web is the most used information source in academic, scientific, and industry forums. Its explosive growth has generated billions of pages of information, which may be categorized as surface web, composed of static pages that are indexed, and hidden web, accessible through search templates. This paper presents the development of a crawler that supports searching, querying, and analyzing information in the surface web and the hidden web in specific domains.

Fundación Universitaria Luis Amigó (FUNLAM): Revistas en Línea

Hidden-web induced by client-side scripting: An empirical study

Author: Ali Mesbah
Zahra Behfarshad
Publication venue: 'Springer Fachmedien Wiesbaden GmbH'
Publication date: 01/01/2013

Client-side JavaScript is increasingly used for enhancing web application functionality, interactivity, and responsiveness. Through the execution of JavaScript code in browsers, the DOM tree representing a webpage at runtime can be incrementally updated without requiring a URL change. This dynamically updated content is hidden from general search engines. In this paper, we present the first empirical study on measuring and characterizing the hidden-web induced as a result of client-side JavaScript execution. Our study reveals that this type of hidden-web content is prevalent in online web applications today: of the 500 websites we analyzed, 95% contain client-side hidden-web content. On those websites that contain client-side hidden-web content, (1) on average, 62% of the web states are hidden, (2) per hidden state, there is an average of 19 kilobytes of hidden data, of which 0.6 kilobytes contain textual content, (3) the DIV element is the most common clickable element used (61%) to initiate this type of hidden-web state transition, and (4) on average 25 minutes is required to dynamically crawl 50 DOM states. Further, our study indicates that there is a correlation between DOM tree size and hidden-web content, but no correlation exists between the amount of JavaScript code and client-side hidden-web content.

CiteSeerX

Combining classifiers to identify online databases

Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2007

Crossref