498,984 research outputs found

    Query-related data extraction of hidden web documents

    Most of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is generated dynamically by querying these databases, which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented as template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.

    Information extraction from template-generated hidden web documents

    Most of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (such as Google and Yahoo). These databases, referred to as Hidden Web databases, dynamically generate a list of documents in response to a user query. Such documents are typically presented to users as template-generated Web pages. This paper presents a new approach that identifies Web page templates in order to extract query-related information from documents. We propose two forms of representation to analyse the content of a document: Text with Immediate Adjacent Tag Segments (TIATS) and Text with Neighbouring Adjacent Tag Segments (TNATS). Our techniques exploit the tag structures that surround the textual contents of documents in order to detect Web page templates and thereby extract query-related information. Experimental results demonstrate that TNATS detects Web page templates most effectively and extracts information with high recall and precision.
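The idea of pairing each piece of text with its immediately adjacent tags can be sketched roughly as follows. This is only an illustration and not the authors' algorithm: the abstract does not say how segments are compared, so the rule used here (a segment counts as template content when the same tag-text-tag triple appears on every result page) is an assumption, and the names `AdjacentTagSegmenter`, `segment`, `template_segments`, and `query_related` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class AdjacentTagSegmenter(HTMLParser):
    """Collects (preceding_tag, text, following_tag) triples from an HTML page."""

    def __init__(self):
        super().__init__()
        self.segments = []      # completed triples
        self._last_tag = None   # most recent tag seen
        self._pending = None    # (preceding_tag, text) waiting for its following tag

    def close_pending(self, tag):
        # A tag ends the current text run, so the waiting triple can be completed.
        if self._pending is not None:
            before, text = self._pending
            self.segments.append((before, text, tag))
            self._pending = None
        self._last_tag = tag

    def handle_starttag(self, tag, attrs):
        self.close_pending("<" + tag + ">")

    def handle_endtag(self, tag):
        self.close_pending("</" + tag + ">")

    def handle_data(self, data):
        text = " ".join(data.split())  # normalise whitespace
        if not text:
            return
        if self._pending is not None:
            # Consecutive text chunks belong to the same segment.
            before, previous = self._pending
            self._pending = (before, previous + " " + text)
        else:
            self._pending = (self._last_tag, text)


def segment(html):
    """Return the list of tag-text-tag triples for one page."""
    parser = AdjacentTagSegmenter()
    parser.feed(html)
    parser.close()
    parser.close_pending(None)  # flush text that is not followed by any tag
    return parser.segments


def template_segments(pages, min_share=1.0):
    """Assumed rule: a triple is template content if it occurs on at least
    min_share of the pages."""
    counts = Counter()
    for page in pages:
        counts.update(set(segment(page)))
    needed = min_share * len(pages)
    return {seg for seg, n in counts.items() if n >= needed}


def query_related(page, template):
    """Keep only the text whose triple is not part of the detected template."""
    return [text for before, text, after in segment(page)
            if (before, text, after) not in template]
```

Given two result pages that share a heading and a footer but differ in one paragraph, `query_related` would keep only the differing paragraph text. A representation that also considers neighbouring tags, as TNATS appears to, would need a wider context than the single adjacent tag used here.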

    A Comparison between Two Main Academic Literature Collections: Web of Science and Scopus Databases

    Nowadays, the world’s scientific community publishes an enormous number of papers in different scientific fields. In such an environment, it is essential to know which databases are equally efficient and objective for literature searches. The two most extensive databases appear to be Web of Science and Scopus. Besides supporting literature searches, these two databases are used to rank journals in terms of their productivity and total citations received, as indicators of a journal’s impact, prestige, or influence. This article attempts to provide a comprehensive comparison of these databases to answer questions that researchers frequently ask, such as: How do Web of Science and Scopus differ? In which aspects are the two databases similar? Or, if researchers are forced to choose one of them, which should they prefer? To answer these questions, the two databases are compared based on their qualitative and quantitative characteristics.
    Cite as: Aghaei Chadegani, A., Salehi, H., Yunus, M. M., Farhadi, H., Fooladi, M., Farhadi, M., & Ale Ebrahim, N. (2013). A Comparison between Two Main Academic Literature Collections: Web of Science and Scopus Databases. Asian Social Science, 9(5), 18-26. doi: 10.5539/ass.v9n5p1

    Poster Presentation: Xcerpt and XChange – Logic Programming Languages for Querying and Evolution on the Web

    age Xcerpt and provides advanced, Web-specific capabilities, such as propagation of changes on the Web (change) and event-based communications between Web sites (exchange).
    Xcerpt: Querying Data on the Web
    Xcerpt is a declarative, rule-based query language for Web data (i.e. XML documents or semistructured databases) based on logic programming. An Xcerpt program contains at least one goal and some (maybe zero) rules. Rules and goals consist of query and construction patterns, called terms in analogy to other logic programming languages. Terms represent tree-like (or graph-like) structures. The children of a node may be either ordered (as in standard XML) or unordered (as is common in databases). Data terms are used to represent XML documents and the data items of a semistructured database. They are similar to ground functional programming expressions and logical atoms. A database is a (multi-)set of data terms (e.g. the Web). Query terms are patterns matched against Web resources.
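The difference between ordered and unordered children can be illustrated with a small tree model. This is a sketch in Python, not Xcerpt syntax or semantics: the `Term` class and the `equal_terms` function are hypothetical names introduced here, and the comparison covers only ground (variable-free) terms.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    """A tree node with a label and children that are either ordered or unordered."""
    label: str
    children: tuple = ()
    ordered: bool = True


def equal_terms(a, b):
    """Structural equality that respects the ordered/unordered flag of each node."""
    if (a.label != b.label or a.ordered != b.ordered
            or len(a.children) != len(b.children)):
        return False
    if a.ordered:
        # Ordered children must agree position by position.
        return all(equal_terms(x, y) for x, y in zip(a.children, b.children))
    # Unordered children are compared as multisets: each child of a must be
    # paired with a distinct, equal child of b.
    remaining = list(b.children)
    for x in a.children:
        for i, y in enumerate(remaining):
            if equal_terms(x, y):
                del remaining[i]
                break
        else:
            return False
    return True
```

For example, two `book` terms whose `title` and `author` children appear in swapped order compare as different when `ordered=True` and as equal when `ordered=False`. Matching query terms that contain variables against such data would require unification on top of this comparison, which the sketch does not attempt.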

    A three-year study on the freshness of Web search engine databases

    This paper deals with one aspect of the index quality of search engines: index freshness. The purpose is to analyse the update strategies of the major Web search engines Google, Yahoo, and MSN/Live.com. We tested the updates of 40 daily updated pages and 30 irregularly updated pages. We used data from a time span of six weeks in the years 2005, 2006, and 2007. We found that the best search engine in terms of up-to-dateness changes over the years and that none of the engines has an ideal solution for index freshness. Frequency distributions for the pages’ ages are skewed, which means that search engines do differentiate between often- and seldom-updated pages. This is confirmed by the difference between the average ages of daily updated pages and our control group of pages. Indexing patterns are often irregular, and there seems to be no clear policy regarding when to revisit Web pages. A major problem identified in our research is the delay in making crawled pages available for searching, which differs from one engine to another.