5 research outputs found

    Building a domain-specific document collection for evaluating metadata effects on information retrieval

    This paper describes the development of a structured document collection containing user-generated text and numerical metadata for exploring the exploitation of metadata in information retrieval (IR). The collection consists of more than 61,000 documents extracted from YouTube video pages on basketball in general and the NBA (National Basketball Association) in particular, together with a set of 40 topics and their relevance judgements. In addition, a collection of nearly 250,000 user profiles related to the NBA collection is available. Several baseline IR experiments report the effect of using video-associated metadata on retrieval effectiveness. Surprisingly, the results show that searching the video titles alone performs significantly better than also searching additional metadata text fields of the videos, such as the tags or the description.
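
    To make the title-only versus full-metadata comparison concrete, here is a minimal sketch (not the collection's actual evaluation setup) that scores a query against a title-only index and against an index over concatenated metadata fields with a small hand-rolled BM25; the toy videos, field names, and query are hypothetical.

```python
# Minimal sketch: title-only retrieval vs. retrieval over concatenated
# metadata fields, scored with a small BM25 implementation.
# The documents, field names, and query below are hypothetical.
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

videos = [
    {"title": "NBA Finals highlights game 7",
     "tags": "basketball lakers celtics",
     "description": "Best plays from the deciding game."},
    {"title": "How to improve your jump shot",
     "tags": "basketball training shooting drills",
     "description": "A coaching session on shooting form."},
]
query = "nba game highlights".split()

title_index = [v["title"].lower().split() for v in videos]
full_index = [(v["title"] + " " + v["tags"] + " " + v["description"]).lower().split()
              for v in videos]

print("title-only :", bm25_scores(title_index, query))
print("all fields :", bm25_scores(full_index, query))
```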

    Exploiting User Comments for Audio-Visual Content Indexing and Retrieval

    State-of-the-art content sharing platforms often require users to assign tags to pieces of media in order to make them easily retrievable. Since this task is sometimes perceived as tedious or boring, annotations can be sparse. Commenting, on the other hand, is a frequently used means of expressing user opinion towards shared media items. This work makes use of time series analyses in order to infer potential tags and indexing terms for audio-visual content from user comments. In this way, we mitigate the vocabulary gap between queries and document descriptors. Additionally, we show how large-scale encyclopaedias such as Wikipedia can aid the task of tag prediction by serving as surrogates for high-coverage natural language vocabulary lists. Our evaluation is conducted on a corpus of several million real-world user comments from the popular video sharing platform YouTube, and demonstrates significant improvements in retrieval performance.
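
    One way such a pipeline could look is sketched below; this is an illustration under my own assumptions, not the paper's method. Candidate tags are terms that both occur in a Wikipedia-title vocabulary (used as a high-coverage word list) and show a burst in the comment stream over time; the burst heuristic, thresholds, and inputs are hypothetical.

```python
# Sketch (assumptions, not the paper's pipeline): propose tags for a video by
# keeping terms that (a) burst in the comment time series and (b) occur in a
# Wikipedia-title vocabulary serving as a high-coverage word list.
from collections import Counter, defaultdict

# Hypothetical inputs: (day, comment_text) pairs and a vocabulary set.
comments = [
    (0, "great dunk by lebron"),
    (0, "that dunk was insane"),
    (3, "lebron carried the whole game"),
    (7, "still the best dunk compilation"),
]
wiki_vocabulary = {"dunk", "lebron", "game", "compilation"}

def candidate_tags(comments, vocabulary, min_total=2, burst_ratio=2.0):
    per_day = defaultdict(Counter)   # term counts per day bucket
    total = Counter()                # overall term counts
    for day, text in comments:
        for term in text.lower().split():
            if term in vocabulary:
                per_day[day][term] += 1
                total[term] += 1
    days = len(per_day) or 1
    tags = []
    for term, count in total.items():
        peak = max(c[term] for c in per_day.values())
        mean = count / days
        # A term is a tag candidate if it is frequent enough overall and its
        # peak day clearly exceeds its average rate (a simple burst test).
        if count >= min_total and peak >= burst_ratio * mean:
            tags.append(term)
    return sorted(tags, key=lambda t: -total[t])

print(candidate_tags(comments, wiki_vocabulary))
```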

    Analyzing User Comments On YouTube Coding Tutorial Videos

    Video coding tutorials enable expert and novice programmers to visually observe real developers write, debug, and execute code. Previous research in this domain has focused on helping programmers find relevant content in coding tutorial videos as well as on understanding the motivation and needs of content creators. In this thesis, we focus on the link connecting programmers who create coding videos with their audience. More specifically, we analyze user comments on YouTube coding tutorial videos. Our main objective is to help content creators effectively understand the needs and concerns of their viewers, and thus respond faster to these concerns and deliver higher-quality content. A dataset of 6,000 comments sampled from 12 YouTube coding videos is used to conduct our analysis. Important user questions and concerns are then automatically classified and summarized. The results show that Support Vector Machines can detect useful viewers' comments on coding videos with an average accuracy of 77%. The results also show that SumBasic, an extractive frequency-based summarization technique with redundancy control, can sufficiently capture the main concerns present in viewers' comments.
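
    The following is a simplified rendering of SumBasic-style extractive summarization (my own sketch of the general algorithm, not the thesis implementation): sentences whose words are frequent across the collection are selected, and the probabilities of already-covered words are squared to control redundancy. The toy comments are hypothetical.

```python
# Simplified SumBasic-style summarizer: frequency-driven sentence selection
# with redundancy control via squaring the probabilities of covered words.
from collections import Counter

def sumbasic(sentences, summary_size=2):
    tokenized = [s.lower().split() for s in sentences]
    counts = Counter(w for sent in tokenized for w in sent)
    total = sum(counts.values())
    prob = {w: c / total for w, c in counts.items()}

    selected = []
    while len(selected) < min(summary_size, len(sentences)):
        # Score each remaining sentence by the average probability of its words.
        best_i, best_score = None, -1.0
        for i, sent in enumerate(tokenized):
            if i in selected or not sent:
                continue
            score = sum(prob[w] for w in sent) / len(sent)
            if score > best_score:
                best_i, best_score = i, score
        if best_i is None:
            break
        selected.append(best_i)
        # Redundancy control: down-weight words already covered by the summary.
        for w in tokenized[best_i]:
            prob[w] = prob[w] ** 2
    return [sentences[i] for i in selected]

comments = [
    "the audio is too low in the second half",
    "please fix the audio volume",
    "great explanation of recursion",
    "could you cover recursion with memoization next",
]
print(sumbasic(comments, summary_size=2))
```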

    Ranking, Labeling, and Summarizing Short Text in Social Media

    One of the key features driving the growth and success of the Social Web is large-scale participation through user-contributed content – often in the form of short text in social media. Unlike traditional long-form documents – e.g., Web pages, blog posts – these short text resources are typically quite brief (on the order of hundreds of characters), often of a personal nature (reflecting opinions and reactions of users), and generated at an explosive rate. Coupled with this explosion of short text in social media is the need for new methods to organize, monitor, and distill relevant information from these large-scale social systems, even in the face of the inherent “messiness” of short text, given the wide variability in quality, style, and substance of short text generated by a legion of Social Web participants. Hence, this dissertation seeks to develop new algorithms and methods to ensure the continued growth of the Social Web by enhancing how users engage with short text in social media. Concretely, this dissertation takes a three-fold approach. First, it develops a learning-based algorithm to automatically rank short text comments associated with a Social Web object (e.g., Web document, image, video) based on the expressed preferences of the community itself, so that low-quality short text may be filtered and user attention may be focused on highly-ranked short text. Second, it organizes short text through labeling, via a graph-based framework for automatically assigning relevant labels to short text; in this way, meaningful semantic descriptors may be assigned to short text for improved classification, browsing, and visualization. Third, it presents a cluster-based summarization approach for extracting high-quality viewpoints expressed in a collection of short text while maintaining diverse viewpoints. By summarizing short text, users may quickly assess the aggregate viewpoints expressed in a collection without the need to scan each of possibly thousands of short text items.
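
    As a rough illustration of the cluster-based summarization idea (a generic sketch, not the dissertation's method), the snippet below clusters short comments by TF-IDF similarity and keeps the comment nearest each cluster centroid, so diverse viewpoints survive the summary; the comments and the number of clusters are hypothetical.

```python
# Generic sketch of cluster-based summarization of short texts: cluster TF-IDF
# vectors with k-means and keep one representative comment per cluster.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances_argmin_min

comments = [  # hypothetical short-text comments about one video
    "the refereeing was terrible tonight",
    "awful calls by the refs all game",
    "what a clutch three pointer at the buzzer",
    "that buzzer beater was unbelievable",
    "ticket prices are getting out of hand",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(comments)

k = 3  # number of viewpoints to surface
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)

# Pick the comment closest to each cluster centroid as that cluster's summary.
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
for i in sorted(set(closest)):
    print(comments[i])
```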

    Technologies for Reusing Text from the Web

    Texts from the web can be reused individually or in large quantities. The former is called text reuse and the latter language reuse. We first present a comprehensive overview of the different ways in which text and language are reused today, and of how information retrieval technologies can be applied in this respect. The remainder of the thesis then deals with specific retrieval tasks. In general, our contributions consist of models and algorithms, their evaluation, and, for that purpose, large-scale corpus construction. The thesis divides into two parts. The first part introduces technologies for text reuse detection, and our contributions are as follows: (1) A unified view of projection-based and embedding-based fingerprinting for near-duplicate detection, and the first evaluation of fingerprinting algorithms on Wikipedia revision histories as a new, large-scale corpus of near-duplicates. (2) A new retrieval model for the quantification of cross-language text similarity, which does not require parallel corpora. We have evaluated the model, in comparison to other models, on many different pairs of languages. (3) An evaluation framework for text reuse and particularly plagiarism detectors, which consists of tailored detection performance measures and a large-scale corpus of automatically generated and manually written plagiarism cases, the latter obtained via crowdsourcing. This framework has been successfully applied to evaluate many different state-of-the-art plagiarism detection approaches within three international evaluation competitions. The second part introduces technologies that solve three retrieval tasks based on language reuse, and our contributions are as follows: (4) A new model for the comparison of textual and non-textual web items across media, which exploits web comments as a source of information about the topic of an item. In this connection, we identify web comments as a largely neglected information source and introduce the rationale of comment retrieval. (5) Two new algorithms for query segmentation, which exploit web n-grams and Wikipedia as a means of discerning the user intent behind a keyword query. Moreover, we crowdsource a new corpus for the evaluation of query segmentation which surpasses existing corpora by two orders of magnitude. (6) A new writing assistance tool called Netspeak, which is a search engine for commonly used language. Netspeak indexes the web in the form of web n-grams as a source of writing examples and implements a wildcard query processor on top of it.
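
    As a rough illustration of the fingerprinting idea in contribution (1) (a generic sketch under my own assumptions, not the thesis's algorithms), the snippet below hashes word shingles into a small MinHash fingerprint; two documents whose fingerprints agree on most positions are near-duplicate candidates.

```python
# Generic sketch of shingling + MinHash fingerprints for near-duplicate
# detection (illustrative only; not the fingerprinting methods evaluated
# in the thesis).
import hashlib

def shingles(text, w=3):
    """Return the set of w-word shingles of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

def minhash(shingle_set, num_hashes=64):
    """Fingerprint: for each of num_hashes seeded hash functions, keep the minimum."""
    fingerprint = []
    for seed in range(num_hashes):
        fingerprint.append(min(
            int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set
        ))
    return fingerprint

def estimated_similarity(fp_a, fp_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumped over the lazy dog near the river bank"
fp_a, fp_b = minhash(shingles(doc_a)), minhash(shingles(doc_b))
print(round(estimated_similarity(fp_a, fp_b), 2))
```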