210,408 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

    The Dark Energy Survey Data Management System

    Full text link
    The Dark Energy Survey collaboration will study cosmic acceleration with a 5000 deg2 griZY survey in the southern sky over 525 nights from 2011-2016. The DES data management (DESDM) system will be used to process and archive these data and the resulting science ready data products. The DESDM system consists of an integrated archive, a processing framework, an ensemble of astronomy codes and a data access framework. We are developing the DESDM system for operation in the high performance computing (HPC) environments at NCSA and Fermilab. Operating the DESDM system in an HPC environment offers both speed and flexibility. We will employ it for our regular nightly processing needs, and for more compute-intensive tasks such as large scale image coaddition campaigns, extraction of weak lensing shear from the full survey dataset, and massive seasonal reprocessing of the DES data. Data products will be available to the Collaboration and later to the public through a virtual-observatory compatible web portal. Our approach leverages investments in publicly available HPC systems, greatly reducing hardware and maintenance costs to the project, which must deploy and maintain only the storage, database platforms and orchestration and web portal nodes that are specific to DESDM. In Fall 2007, we tested the current DESDM system on both simulated and real survey data. We used Teragrid to process 10 simulated DES nights (3TB of raw data), ingesting and calibrating approximately 250 million objects into the DES Archive database. We also used DESDM to process and calibrate over 50 nights of survey data acquired with the Mosaic2 camera. Comparison to truth tables in the case of the simulated data and internal crosschecks in the case of the real data indicate that astrometric and photometric data quality is excellent.Comment: To be published in the proceedings of the SPIE conference on Astronomical Instrumentation (held in Marseille in June 2008). This preprint is made available with the permission of SPIE. Further information together with preprint containing full quality images is available at http://desweb.cosmology.uiuc.edu/wik

    Discovering the Impact of Knowledge in Recommender Systems: A Comparative Study

    Get PDF
    Recommender systems engage user profiles and appropriate filtering techniques to assist users in finding more relevant information over the large volume of information. User profiles play an important role in the success of recommendation process since they model and represent the actual user needs. However, a comprehensive literature review of recommender systems has demonstrated no concrete study on the role and impact of knowledge in user profiling and filtering approache. In this paper, we review the most prominent recommender systems in the literature and examine the impression of knowledge extracted from different sources. We then come up with this finding that semantic information from the user context has substantial impact on the performance of knowledge based recommender systems. Finally, some new clues for improvement the knowledge-based profiles have been proposed.Comment: 14 pages, 3 tables; International Journal of Computer Science & Engineering Survey (IJCSES) Vol.2, No.3, August 201

    Intelligent Self-Repairable Web Wrappers

    Get PDF
    The amount of information available on the Web grows at an incredible high rate. Systems and procedures devised to extract these data from Web sources already exist, and different approaches and techniques have been investigated during the last years. On the one hand, reliable solutions should provide robust algorithms of Web data mining which could automatically face possible malfunctioning or failures. On the other, in literature there is a lack of solutions about the maintenance of these systems. Procedures that extract Web data may be strictly interconnected with the structure of the data source itself; thus, malfunctioning or acquisition of corrupted data could be caused, for example, by structural modifications of data sources brought by their owners. Nowadays, verification of data integrity and maintenance are mostly manually managed, in order to ensure that these systems work correctly and reliably. In this paper we propose a novel approach to create procedures able to extract data from Web sources -- the so called Web wrappers -- which can face possible malfunctioning caused by modifications of the structure of the data source, and can automatically repair themselves.\u

    A unified view of data-intensive flows in business intelligence systems : a survey

    Get PDF
    Data-intensive flows are central processes in today’s business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet complex requirements of next generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources, and more real-time and operational data flows that integrate source data at runtime. Both academia and industry thus must have a clear understanding of the foundations of data-intensive flows and the challenges of moving towards next generation BI environments. In this paper we present a survey of today’s research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing challenges that still are to be addressed, and how the current solutions can be applied for addressing these challenges.Peer ReviewedPostprint (author's final draft
    corecore