3,477 research outputs found

    Web Video in Numbers - An Analysis of Web-Video Metadata

    Full text link
    Web video is often used as a source of data in various fields of study. While specialized subsets of web video, mainly earmarked for dedicated purposes, are often analyzed in detail, there is little information available about the properties of web video as a whole. In this paper we present insights gained from the analysis of the metadata associated with more than 120 million videos harvested from two popular web video platforms, vimeo and YouTube, in 2016 and compare their properties with the ones found in commonly used video collections. This comparison has revealed that existing collections do not (or no longer) properly reflect the properties of web video "in the wild".Comment: Dataset available from http://download-dbis.dmi.unibas.ch/WWIN

    Approximate algorithms for efficient indexing, clustering, and classification in Peer-to-peer networks

    Get PDF
    [no abstract

    Dynamic set kNN self-join

    Full text link
    In many applications, data objects can be represented as sets. For example, in video on-demand and social network services, the user data consists of a set of movies that have been watched and a set of users (friends), respectively, and they can be used for recommendation and information extraction. The problem of set similarity self-join hence has been studied extensively. Existing studies assume that sets are static, but in the above applications, sets are dynamically updated, and this requires continuous updating the join result. In this paper, we study a novel problem, dynamic set kNN self-join, i.e., for each set, we continuously compute its k nearest neighbor sets. Our problem poses a challenge for the efficiency of computation, because just an element insertion (deletion) into (from) a set may affect the kNN results of many sets. To address this challenge, we first investigate the property of the dynamic set kNN self-join problem to observe the search space derived from a set update. Then, based on this observation, we propose an efficient algorithm. This algorithm employs an indexing technique that enables incremental similarity computation and prunes unnecessary similarity computation. Our empirical studies using real datasets show the efficiency and scalability of our algorithm.Amagata D., Hara T., Xiao C.. Dynamic set kNN self-join. Proceedings - International Conference on Data Engineering 2019-April, 818 (2019); https://doi.org/10.1109/ICDE.2019.00078

    Cleaning Web pages for effective Web content mining.

    Get PDF
    Web pages usually contain many noisy blocks, such as advertisements, navigation bar, copyright notice and so on. These noisy blocks can seriously affect web content mining because contents contained in noise blocks are irrelevant to the main content of the web page. Eliminating noisy blocks before performing web content mining is very important for improving mining accuracy and efficiency. A few existing approaches detect noisy blocks with exact same contents, but are weak in detecting near-duplicate blocks, such as navigation bars. In this thesis, given a collection of web pages in a web site, a new system, WebPageCleaner, which eliminates noisy blocks from these web pages so as to improve the accuracy and efficiency of web content mining, is proposed. WebPageCleaner detects both noisy blocks with exact same contents as well as those with near-duplicate contents. It is based on the observation that noisy blocks usually share common contents, and appear frequently on a given web site. WebPageCleaner consists of three modules: block extraction, block importance retrieval, and cleaned files generation. A vision-based technique is employed for extracting blocks from web pages. Blocks get their importance degree according to their block features such as block position, and level of similarity of block contents to each other. A collection of cleaned files with high importance degree are generated finally and used for web content mining. The proposed technique is evaluated using Naive Bayes text classification. Experiments show that WebPageCleaner is able to lead to a more efficient and accurate web page classification results than existing approaches.Dept. of Computer Science. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .L5. Source: Masters Abstracts International, Volume: 45-01, page: 0359. Thesis (M.Sc.)--University of Windsor (Canada), 2006
    • …
    corecore