6,827 research outputs found

    Organizing hidden-web databases by clustering visible web documents

    Journal Article. In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context, both within and in the neighborhood of forms, as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured in terms of both entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.
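    A minimal sketch of the general idea, not the paper's implementation: represent each form by the visible text in and around it, cluster with an off-the-shelf algorithm (k-means over TF-IDF features is an assumption here), and score the clusters against the known database domains with entropy. The toy inputs and parameters below are hypothetical.

        # Sketch only: cluster Web forms by visible context text, then measure
        # how pure each cluster is with respect to the true database domains.
        from collections import Counter
        from math import log2
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.cluster import KMeans

        # toy stand-ins for the visible text of each form context and its true domain
        form_texts = [
            "search used cars by make model year price mileage",
            "find new and used vehicles near you zip code",
            "book flights departure city arrival city date passengers",
            "search airfare one way round trip travel dates",
        ]
        true_domains = ["auto", "auto", "airfare", "airfare"]

        X = TfidfVectorizer().fit_transform(form_texts)
        clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

        def weighted_entropy(clusters, domains):
            """Average entropy of the domain distribution inside each cluster
            (0 means every cluster contains a single domain)."""
            total, score = len(domains), 0.0
            for c in set(clusters):
                members = [d for k, d in zip(clusters, domains) if k == c]
                probs = [n / len(members) for n in Counter(members).values()]
                score += len(members) / total * -sum(p * log2(p) for p in probs)
            return score

        print(weighted_entropy(clusters, true_domains))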

    Information maps: tools for document exploration


    DeepPeep: A Form Search Engine

    Poster. We present DeepPeep (http://www.deeppeep.org), a new search engine specialized in Web forms. DeepPeep uses a scalable infrastructure for discovering, organizing, and analyzing Web forms that serve as entry points to hidden-Web sites, and it provides an intuitive interface that allows users to explore and visualize large form collections. We present the overall architecture of DeepPeep, which supports both general and domain-specific deep-Web search and benefits not only casual users but also application builders. The system provides a scalable and automatic solution to deep-Web search and can adapt to the dynamic evolution of the deep Web, which is growing fast and will play an important role in the future of search.

    Doctor of Philosophy

    Dissertation. The explosion of structured Web data (e.g., online databases, Wikipedia infoboxes) creates many opportunities for integrating and querying these data that go far beyond the simple search capabilities provided by search engines. Although much work has been devoted to data integration in the database community, the Web brings new challenges: Web scale (the large and growing volume of data) and the heterogeneity of Web data. Because there is so much data, we need scalable techniques that require little or no manual intervention and that are robust to noisy data. In this dissertation, we propose a new and effective approach for matching Web-form interfaces and for matching multilingual Wikipedia infoboxes. As a further step toward solving these problems, we propose a general, prudent schema-matching framework that matches large numbers of schemas effectively. Our comprehensive experiments on Web-form interfaces and Wikipedia infoboxes show that it can enable on-the-fly, automatic integration of large collections of structured Web data. Another problem we address in this dissertation is schema discovery. While existing integration approaches assume that the relevant data sources and their schemas have been identified in advance, schemas are not always available for structured Web data. Approaches exist that exploit information in Wikipedia to discover entity types and their associated schemas. However, due to inconsistencies, sparseness, and noise in community contributions, these approaches are error-prone and require substantial human intervention. Given the schema heterogeneity of Wikipedia infoboxes, we developed a new approach that uses the structured information available in infoboxes to cluster similar infoboxes and infer the schemata for entity types. Our approach is unsupervised and resilient to the unpredictable skew in the entity-class distribution. Our experiments, using over one hundred thousand infoboxes extracted from Wikipedia, indicate that our approach is effective and produces accurate schemata for Wikipedia entities.
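    As a rough illustration of the infobox-clustering idea (an assumption, not the dissertation's algorithm), each infobox can be reduced to its set of attribute names, similar infoboxes grouped by hierarchical clustering over Jaccard distances, and the attributes that recur within a cluster read off as the inferred schema for that entity type. The toy data and thresholds below are hypothetical.

        # Sketch: hierarchical clustering of infoboxes by attribute overlap,
        # then a simple frequency threshold to read off a schema per cluster.
        from collections import Counter
        from scipy.cluster.hierarchy import linkage, fcluster

        infoboxes = [  # toy attribute sets; real input would be parsed from Wikipedia
            {"name", "birth_date", "birth_place", "occupation"},
            {"name", "birth_date", "death_date", "occupation"},
            {"name", "population", "area_km2", "country"},
            {"name", "population", "country", "mayor"},
        ]

        def jaccard_distance(a, b):
            return 1.0 - len(a & b) / len(a | b)

        # condensed pairwise distance vector expected by scipy's linkage()
        n = len(infoboxes)
        dists = [jaccard_distance(infoboxes[i], infoboxes[j])
                 for i in range(n) for j in range(i + 1, n)]
        cluster_ids = fcluster(linkage(dists, method="average"),
                               t=0.6, criterion="distance")

        for c in sorted(set(cluster_ids)):
            members = [infoboxes[i] for i in range(n) if cluster_ids[i] == c]
            counts = Counter(attr for box in members for attr in box)
            schema = sorted(a for a, k in counts.items() if k / len(members) >= 0.5)
            print(f"entity type {c}: {schema}")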

    Learning structure and schemas from heterogeneous domains in networked systems: a survey

    The rapidly growing amount of digital documents in various formats, and the possibility of accessing them through Internet-based technologies in distributed environments, have led to the need for solid methods to properly organize and structure documents in large digital libraries and repositories. In particular, the extremely large size of document collections makes it impossible to organize such documents manually. Moreover, most of these documents exist in an unstructured form and do not follow any schema. Research efforts are therefore being dedicated to automatically inferring structure and schemas, which is essential both for organizing huge collections and for effectively and efficiently retrieving documents in heterogeneous domains in networked systems. This paper presents a survey of state-of-the-art methods for inferring structure from documents and schemas in networked environments. The survey is organized around the most important application domains, namely bioinformatics, sensor networks, social networks, P2P systems, automation and control, transportation, and privacy preservation, for which we analyze recent developments in dealing with unstructured data.

    Applying Fourier-Transform Infrared Spectroscopy and Self-Organizing Maps for Forensic Classification of White-Copy Papers

    White copy A4 paper is an important substrate for preparing most formal as well as informal documents. It is often encountered as a questioned document in cases such as falsification, embezzlement, or forgery. By comparing a questioned piece (e.g., of a contract) against the rest of the document deemed authentic, an indicator of forgery can be derived from an inconsistent chemical composition. However, classification and even differentiation of white copy papers has been difficult because their physical properties and chemical compositions are highly similar. The self-organizing map (SOM) has proven useful in many published works as a good tool for clustering and classifying samples, especially with high-dimensional data. In this preliminary paper, we explore the feasibility of SOMs for classifying white copy paper for forensic purposes. A total of 150 infrared spectra were collected from three varieties of white paper using attenuated total reflectance Fourier-transform infrared (ATR-FTIR) spectroscopy. Each IR spectrum is composed of thousands of wavenumbers (i.e., input variables) and serves as a chemical fingerprint of the sample. We also compared the performance of SOM models built from the raw wavenumbers with models built from a reduced representation (i.e., principal components, PCs). Results showed that the SOM built with PCs is much more efficient than the one built with raw wavenumbers, with a classification accuracy of over 90% obtained on an external validation test. This study shows that SOM coupled with ATR-FTIR spectroscopy could be a potential non-destructive approach for forensic paper analysis.
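    A minimal sketch of this kind of workflow, assuming scikit-learn for PCA and the MiniSom package for the self-organizing map (neither is stated in the paper): compress the spectra to a few principal components, train a SOM on the scores, label each map node by the majority class of the training spectra it wins, and score held-out spectra by the label of their best-matching unit. File names, map size, and all parameters are hypothetical.

        # Sketch: PCA-compressed ATR-FTIR spectra classified with a self-organizing map.
        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.model_selection import train_test_split
        from minisom import MiniSom

        spectra = np.load("ftir_spectra.npy")   # (n_samples, n_wavenumbers), hypothetical file
        labels = np.load("paper_variety.npy")   # one paper variety per spectrum

        scores = PCA(n_components=10).fit_transform(spectra)  # thousands of wavenumbers -> 10 PCs
        X_tr, X_te, y_tr, y_te = train_test_split(scores, labels,
                                                  test_size=0.3, random_state=0)

        som = MiniSom(8, 8, X_tr.shape[1], sigma=1.5, learning_rate=0.5, random_seed=0)
        som.train_random(X_tr, 5000)

        # assign each SOM node the majority class of the training samples it wins
        node_label = {pos: counts.most_common(1)[0][0]
                      for pos, counts in som.labels_map(X_tr, y_tr).items()}

        # external validation: classify held-out spectra by their best-matching unit
        pred = [node_label.get(som.winner(x)) for x in X_te]
        accuracy = np.mean([p == t for p, t in zip(pred, y_te)])
        print(f"validation accuracy: {accuracy:.1%}")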

    Visual exploration and retrieval of XML document collections with the generic system X2

    This article reports on the XML retrieval system X2, which has been developed at the University of Munich over the last five years. In a typical session with X2, the user first browses a structural summary of the XML database in order to select interesting elements and keywords occurring in documents. Using this intermediate result, queries combining structure and textual references are composed semi-automatically. After query evaluation, the full set of answers is presented in a visual and structured way. X2 largely exploits the structure found in documents, queries, and answers to enable new interactive visualization and exploration techniques that support mixed IR and database-oriented querying, thus bridging the gap between these three views on the data to be retrieved. Another salient characteristic of X2, which distinguishes it from other visual query systems for XML, is that it supports various degrees of detail in the presentation of answers, as well as techniques for dynamically reordering and grouping retrieved elements once the complete answer set has been computed.
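    X2's own query language is not shown in the abstract; as a rough analogue, a standard XPath expression over an XML collection (via lxml, with hypothetical element names and file) can combine a structural constraint with a keyword condition in a single query, which is the flavor of the mixed structure-and-text queries described above.

        # Illustration only: mixing a structural constraint with a keyword filter
        # in a single XPath query over an XML document collection.
        from lxml import etree

        doc = etree.parse("articles.xml")   # hypothetical XML collection

        # "articles that have a section whose title mentions XML, from which we
        # want the paragraphs containing the keyword 'retrieval'"
        hits = doc.xpath(
            "//article[section/title[contains(., 'XML')]]"
            "//para[contains(., 'retrieval')]"
        )
        for para in hits:
            print(etree.tostring(para, pretty_print=True).decode())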