2,407 research outputs found

    Link Graph Analysis for Adult Images Classification

    To protect the users of an image search engine from undesirable results, a classifier for adult images should be built. Information about links from websites to images is used to create such a classifier. These links are represented as a bipartite website-image graph. Each vertex is assigned an adultness score and a decentness score. The scores of image vertices are initialized to zero; those of website vertices are initialized according to a text-based website classifier. An iterative algorithm that propagates scores within the website-image graph is described. The resulting scores are used to classify images by choosing an appropriate threshold. Experiments on Internet-scale data show that, at the same precision level, the algorithm increases classification recall by 17% compared with a simple algorithm that classifies an image as adult if it is linked from at least one adult site.
    Comment: 7 pages. Young Scientists Conference, 4th Russian Summer School in Information Retrieval
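    The abstract does not give the update rule, so the following is only a minimal sketch of score propagation on a bipartite website-image graph under stated assumptions: an image takes the mean score of the sites linking to it, and a site blends its initial text-based score with the mean score of its images. The function names, the blending weight `alpha`, the iteration count, and the decision rule (adultness minus decentness above a threshold) are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def propagate_scores(links, site_init, iterations=10, alpha=0.5):
    """Propagate (adultness, decentness) scores over a website-image graph.

    links:     iterable of (site, image) pairs, one per link
    site_init: dict site -> (adultness, decentness) from a text classifier
    alpha:     assumed weight kept on a site's initial score at every step
    """
    site_imgs = defaultdict(set)
    img_sites = defaultdict(set)
    for site, img in links:
        site_imgs[site].add(img)
        img_sites[img].add(site)

    # Websites start from the text-based scores, images start from zero.
    site_score = {s: site_init.get(s, (0.0, 0.0)) for s in site_imgs}
    img_score = {i: (0.0, 0.0) for i in img_sites}

    def mean(pairs):
        pairs = list(pairs)
        n = len(pairs)
        return (sum(p[0] for p in pairs) / n, sum(p[1] for p in pairs) / n)

    for _ in range(iterations):
        # Assumed rule: an image averages the scores of its linking sites.
        img_score = {i: mean(site_score[s] for s in sites)
                     for i, sites in img_sites.items()}
        # Assumed rule: a site mixes its initial score with its images' mean.
        new_site = {}
        for s, imgs in site_imgs.items():
            m = mean(img_score[i] for i in imgs)
            init = site_init.get(s, (0.0, 0.0))
            new_site[s] = (alpha * init[0] + (1 - alpha) * m[0],
                           alpha * init[1] + (1 - alpha) * m[1])
        site_score = new_site
    return img_score


def classify(img_score, threshold=0.5):
    """Label an image as adult when adultness exceeds decentness by threshold."""
    return {i: (a - d) > threshold for i, (a, d) in img_score.items()}
```

    For example, with links `[("s1", "a"), ("s1", "b"), ("s2", "b"), ("s2", "c")]` and `s1` initialized as adult and `s2` as decent, image `a` ends up with a higher adultness than decentness, while `b`, which both sites link to, stays balanced. In practice the threshold would be tuned on labeled data to reach a target precision.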

    Term-driven E-Commerce

    This thesis addresses the textual dimension of e-commerce. Its underlying hypothesis is that information and transactions in electronic commerce are bound to text. Wherever products and services are offered, requested, perceived, and evaluated, natural-language expressions are used. Two consequences follow: first, it is important to capture the variance of textual descriptions in e-commerce; second, the extensive textual resources that arise from e-commerce interactions can be drawn on to gain a better understanding of natural language

    D4.1. Technologies and tools for corpus creation, normalization and annotation

    The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of monolingual and bilingual language resources (LRs) required in the PANACEA context. Therefore, the CAA subsystem includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a component for cleanup and normalization (CNC) of these data and iii) a text processing component (TPC) which consists of NLP tools including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition

    A history and theory of textual event detection and recognition


    Semantic Sort: A Supervised Approach to Personalized Semantic Relatedness

    We propose and study a novel supervised approach to learning statistical semantic relatedness models from subjectively annotated training examples. The proposed semantic model consists of parameterized co-occurrence statistics associated with textual units of a large background knowledge corpus. We present an efficient algorithm for learning such semantic models from a training sample of relatedness preferences. Our method is corpus independent and can essentially rely on any sufficiently large (unstructured) collection of coherent texts. Moreover, the approach facilitates the fitting of semantic models for specific users or groups of users. We present the results of an extensive range of experiments, from small to large scale, indicating that the proposed method is effective and competitive with the state of the art.
    Comment: 37 pages, 8 figures. A short version of this paper was already published at ECML/PKDD 201

    Proceedings of the First Workshop on Computing News Storylines (CNewsStory 2015)

    This volume contains the proceedings of the 1st Workshop on Computing News Storylines (CNewsStory 2015), held in conjunction with the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2015) at the China National Convention Center in Beijing on July 31st, 2015. Narratives are at the heart of information sharing. Ever since people began to share their experiences, they have connected them to form narratives. The study of storytelling and the field of literary theory called narratology have developed complex frameworks and models related to various aspects of narrative, such as plot structures, narrative embeddings, characters’ perspectives, reader response, point of view, narrative voice, narrative goals, and many others. These notions from narratology have been applied mainly in Artificial Intelligence and in formal semantic approaches to modeling narratives (e.g. Plot Units developed by Lehnert (1981)). In recent years, computational narratology has established itself as an autonomous field of study and research. Narrative has been the focus of a number of workshops and conferences (AAAI Symposia, Interactive Storytelling Conference (ICIDS), Computational Models of Narrative). Furthermore, reference annotation schemes for narratives have been proposed (NarrativeML by Mani (2013)). The workshop aimed to bring together researchers from different communities working on representing and extracting narrative structures in news, a text genre that is widely used in NLP but has received little attention with respect to narrative structure, representation and analysis. Advances in NLP technology have now made it feasible to look beyond scenario-driven, atomic extraction of events from single documents and to work towards extracting story structures from multiple documents as they are published over time in news streams.
    Policy makers, NGOs, information specialists (such as journalists and librarians) and others increasingly need tools that help them find salient stories in large amounts of information, so that they can implement policies more effectively, monitor the actions of “big players” in society, and check facts. Their tasks often revolve around reconstructing cases with respect to either specific entities (e.g. persons or organizations) or events (e.g. Hurricane Katrina). Storylines represent explanatory schemas that enable us not only to make better selections of relevant information but also to make projections into the future. They offer valuable potential for exploiting news data in an innovative way.
    JRC.G.2-Global security and crisis management

    Similarity and Diversity in Information Retrieval

    Inter-document similarity is used for clustering, classification, and other purposes within information retrieval. In this thesis, we investigate several aspects of document similarity. In particular, we investigate the quality of several measures of inter-document similarity, providing a framework suitable for measuring and comparing the effectiveness of inter-document similarity measures. We also explore areas of research related to novelty and diversity in information retrieval. The goal of diversity and novelty is to be able to satisfy as many users as possible while simultaneously minimizing or eliminating duplicate and redundant information from search results. In order to evaluate the effectiveness of diversity-aware retrieval functions, user query logs and other information captured from user interactions with commercial search engines are mined and analyzed in order to uncover various informational aspects underlying queries, which are known as subtopics. We investigate the suitability of implicit associations between document content as an alternative to subtopic mining. We also explore subtopic mining from document anchor text and anchor links. In addition, we investigate the suitability of inter-document similarity as a measure for diversity-aware retrieval models, with the aim of using measured inter-document similarity as a replacement for diversity-aware evaluation models that rely on subtopic mining. Finally, we investigate the suitability and application of document similarity for requirements traceability. We present a fast algorithm that uncovers associations between various versions of frequently edited documents, even in the face of substantial changes
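    As a rough illustration of how inter-document similarity can drive diversity-aware ranking, the sketch below combines bag-of-words cosine similarity with maximal marginal relevance (MMR) re-ranking. MMR is a standard technique swapped in here for illustration; the abstract does not say which similarity measures or retrieval functions the thesis uses. The whitespace tokenization, the function names, and the trade-off weight `lam` are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two texts using raw term counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def mmr(query, docs, k=2, lam=0.5):
    """Greedily pick k document indices, trading relevance against redundancy.

    lam is an assumed weight: 1.0 ranks by relevance only, lower values
    penalize documents similar to those already selected.
    """
    selected = []
    remaining = list(range(len(docs)))
    while remaining and len(selected) < k:
        def score(i):
            # Redundancy is the highest similarity to any already chosen doc.
            redundancy = max((cosine(docs[i], docs[j]) for j in selected),
                             default=0.0)
            return lam * cosine(query, docs[i]) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

    With the query `"apple pie"` and the documents `["apple pie recipe", "apple pie recipe", "apple orchard tours"]`, the second pick skips the exact duplicate in favor of the less similar third document, which is the duplicate-suppression behavior the abstract describes as the goal of diversity and novelty.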