6 research outputs found

    Leveraging Semantic Annotations for Event-focused Search & Summarization

    In today's Big Data era, overwhelming amounts of textual information, spread across different sources with a high degree of redundancy, make it hard for a consumer to retrospect on past events. A plausible solution is to link semantically similar information across the different sources, thereby imposing a structure that provides multiple access paths to relevant information. Keeping this larger goal in view, this work uses Wikipedia and online news articles as two prominent yet disparate information sources to address the following three problems:
    • We address a linking problem that connects Wikipedia excerpts to news articles by casting it as an IR task. Our novel approach integrates time, geolocations, and entities with text to identify relevant documents that can be linked to a given excerpt.
    • We address an unsupervised extractive multi-document summarization task to generate a fixed-length event digest that facilitates efficient consumption of the information contained within a large set of documents. Our novel approach formulates an integer linear program (ILP) for global inference across the text, time, geolocations, and entities associated with the event (see the sketch after this abstract).
    • To estimate the temporal focus of short event descriptions, we present a semi-supervised approach that leverages redundancy within a longitudinal news collection to estimate accurate probabilistic time models.
    Extensive experimental evaluations demonstrate the effectiveness and viability of our proposed approaches towards achieving the larger goal.
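    A minimal sketch of a budgeted-selection ILP in the spirit of the event-digest formulation above, using the PuLP solver. The sentence scores, lengths, and budget here are hypothetical inputs; the thesis's actual objective additionally models time, geolocation, and entity coverage rather than a single per-sentence score.

```python
# Minimal sketch (not the thesis's exact model): a global ILP that selects
# sentences for a fixed-length digest, maximizing total relevance under a
# length budget. Scores, lengths, and budget are illustrative assumptions.
from pulp import LpProblem, LpVariable, LpMaximize, LpBinary, lpSum, PULP_CBC_CMD

def summarize(sentences, scores, lengths, budget):
    """Select sentences maximizing total relevance within the length budget."""
    prob = LpProblem("event_digest", LpMaximize)
    x = [LpVariable(f"s{i}", cat=LpBinary) for i in range(len(sentences))]
    # objective: total relevance of the selected sentences
    prob += lpSum(scores[i] * x[i] for i in range(len(sentences)))
    # constraint: selected sentences must fit the digest budget
    prob += lpSum(lengths[i] * x[i] for i in range(len(sentences))) <= budget
    prob.solve(PULP_CBC_CMD(msg=False))
    return [s for s, v in zip(sentences, x) if v.value() > 0.5]

digest = summarize(
    ["Quake hits city.", "Rescue teams arrive.", "Markets unaffected."],
    scores=[0.9, 0.7, 0.2], lengths=[3, 3, 2], budget=6,
)
print(digest)
```

    Because the selection is solved globally rather than greedily, adding coverage terms for time, place, and entities only requires extending the objective and constraints, which is the appeal of an ILP formulation here.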

    Spatiotemporal enabled Content-based Image Retrieval


    Knowledge-Driven Harmonization of Sensor Observations: Exploiting Linked Open Data for IoT Data Streams

    The rise of the Internet of Things leads to an unprecedented number of continuous sensor observations that are available as IoT data streams. Harmonization of such observations is a labor-intensive task due to heterogeneity in format, syntax, and semantics. We aim to reduce the effort for such harmonization tasks by employing a knowledge-driven approach. To this end, we pursue the idea of exploiting the large body of formalized public knowledge represented as statements in Linked Open Data.
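    A minimal sketch of the kind of Linked Open Data lookup such a knowledge-driven approach relies on: resolving a raw unit label from a sensor stream to a shared LOD resource before merging heterogeneous streams. The DBpedia endpoint and the label "Celsius" are illustrative assumptions, not the paper's pipeline.

```python
# Minimal sketch: map a sensor stream's unit label to Linked Open Data
# resources via SPARQL, so observations from different streams can be
# harmonized to a common vocabulary. Endpoint and label are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

def resolve_unit(label):
    """Return LOD resources whose English rdfs:label matches the unit label."""
    sparql = SPARQLWrapper("https://dbpedia.org/sparql")
    sparql.setQuery(f"""
        PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
        SELECT ?resource WHERE {{
            ?resource rdfs:label "{label}"@en .
        }} LIMIT 5
    """)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["resource"]["value"] for b in results["results"]["bindings"]]

# e.g. normalize a raw field from one stream before merging it with others
print(resolve_unit("Celsius"))
```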

    An Online Analytical System for Multi-Tagged Document Collections

    The New York Times Annotated Corpus and the ACM Digital Library are two prototypical examples of document collections in which each document is tagged with keywords and significant phrases. Such collections can be viewed as high-dimensional document cubes against which browsers and search systems can be applied in a manner similar to online analytical processing against data cubes. The tagging patterns in these collections are examined and a generative tagging model is developed that can mimic the tag assignments observed in those collections.

    When a user browses the collection by means of a Boolean query over tags, the result is a subset of documents that can be summarized by a centroid derived from their document term vectors. A partial materialization strategy is developed to provide efficient storage and access to centroids for such document subsets. A customized local term vocabulary storage approach is incorporated into the partial materialization to ensure that a rich and relevant term vocabulary is available for representing centroids while maintaining a low storage footprint. By adopting this strategy, summary measures dependent on centroids (including bursty terms or larger sets of indicative documents) can be efficiently and accurately computed for important subsets of documents. The proposed design is evaluated on the two collections along with PubMed (a held-back document collection) and several synthetic collections to validate that it outperforms alternative storage strategies.

    Finally, an enhanced faceted browsing system is developed to support users' exploration of large multi-tagged document collections. It provides summary measures of document result sets at each step of navigation through a set of indicative terms and a diverse set of documents, as well as information scent that helps to guide users' exploration. These summaries are derived from pre-materialized views that allow for quick calculation of centroids for various result sets. The utility and efficiency of the system are demonstrated on the New York Times Annotated Corpus.
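    A minimal sketch of the core operation described above: evaluating a Boolean query over tags and summarizing the matching subset by the centroid of its term vectors. The toy corpus, tags, and TF-IDF weighting are illustrative assumptions; the thesis's contribution concerns materializing such centroids efficiently, which this sketch does not attempt.

```python
# Minimal sketch: select documents matching a Boolean tag query and compute
# the centroid of their TF-IDF vectors as a summary. Corpus and tags are toys.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

docs = ["stocks fall on weak earnings", "senate passes budget bill",
        "markets rally after fed decision"]
tags = [{"finance", "markets"}, {"politics"}, {"finance", "markets", "politics"}]

# Boolean query over tags: finance AND markets
subset = [i for i, t in enumerate(tags) if {"finance", "markets"} <= t]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                            # document-term matrix
centroid = np.asarray(X[subset].mean(axis=0)).ravel()  # centroid of the subset

# top indicative terms for this document subset
terms = vec.get_feature_names_out()
print([terms[j] for j in centroid.argsort()[::-1][:3]])
```

    Computing such centroids on the fly is what becomes expensive at scale, which motivates the partial materialization strategy the abstract describes.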

    Adaptive and Reactive Rich Internet Applications

    In this thesis we present the client-side approach of Adaptive and Reactive Rich Internet Applications as the main result of our research into how to bring in-time adaptivity to Rich Internet Applications. Our approach leverages previous work on adaptive hypermedia, event processing, and other research disciplines. We present a holistic framework covering both the design-time and the run-time aspects of Adaptive and Reactive Rich Internet Applications, focusing especially on the run-time aspects.

    Topical relevance models

    An inherent characteristic of information retrieval (IR) is that the query expressing a user's information need is often multi-faceted; that is, it encapsulates more than one specific potential sub-information need. This multi-facetedness of queries manifests itself as a topic distribution in the retrieved set of documents, where each document can be considered a mixture of topics, one or more of which may correspond to the sub-information needs expressed in the query. In some specific domains of IR, such as patent prior art search, where the queries are full patent articles and the objective is to (in)validate the claims contained therein, the queries themselves are multi-topical in addition to the retrieved set of documents. The overall objective of the research described in this thesis is to investigate techniques that recognize and exploit these multi-topical characteristics of the retrieved documents and the queries, both in retrieval and in relevance feedback.

    First, we hypothesize that segments of documents in close proximity to the query terms are topically related to those terms. An intuitive choice for the unit of such segments is the sentence, which characteristically represents a collection of semantically related terms. This way of utilizing term proximity through sentences is empirically shown to select potentially relevant topics from among those present in a retrieved document set and thus to improve relevance feedback in IR. Secondly, to handle the very long queries of patent prior art search, which are essentially multi-topical in nature, we hypothesize that segmenting these queries into topically focused segments and using each segment as a separate query can retrieve potentially relevant documents for each segment. The results for the individual segments are then merged to obtain a final result set for the whole query.

    These two conceptual approaches for utilizing the topical relatedness of terms in both the retrieved documents and the queries are then integrated more formally within a single statistical generative model, called the topical relevance model (TRLM), which utilizes the underlying multi-topical nature of both the retrieved documents and the query. Moreover, the model serves as the basis for a novel search interface, called TopicVis, which lets the user visualize the topic distributions in the retrieved set of documents and the query. This visualization benefits the user in two ways. Firstly, through visualization of the ranked retrieval list, TopicVis lets the user choose one or more facets of interest from the query in a feedback step, after which it retrieves documents primarily composed of the selected facets at top ranks. Secondly, the system provides an access link to the first segment within a document focusing on the selected topic and supports navigation links to subsequent segments on the same topic in other documents.

    The methods proposed in this thesis are evaluated on datasets from the TREC IR benchmarking workshop series and on the CLEF-IP 2010 patent prior art search dataset. Experimental results show that relevance feedback using sentences and segmented retrieval for patent prior art search queries significantly improve IR effectiveness for the standard ad-hoc and patent prior art search tasks. Moreover, the topical relevance model (TRLM), designed to encapsulate these two complementary approaches within a single framework, significantly improves IR effectiveness for both tasks. Furthermore, a task-based user study shows that the novel features of topic visualization, topic-based feedback, and topic-based navigation implemented in the TopicVis interface lead to effective and efficient task completion and good user satisfaction.
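    A minimal sketch of topic-aware relevance feedback in the spirit of the abstract, though not the thesis's TRLM itself: fit a topic model over pseudo-relevant documents, pick the topic the query is most strongly drawn from, and expand the query with that topic's top terms. The toy documents, topic count, and expansion size are assumptions.

```python
# Minimal sketch: topic-based query expansion over pseudo-relevant documents.
# Not the TRLM; a plain LDA stand-in to illustrate selecting a query-aligned
# topic and expanding with its top terms. All inputs are toy assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

feedback_docs = ["patent claims battery electrode lithium",
                 "electrode coating process lithium cell",
                 "court ruling patent infringement damages"]
query = "lithium battery electrode"

vec = CountVectorizer()
X = vec.fit_transform(feedback_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# pick the topic the query itself is most strongly drawn from
q_topic = lda.transform(vec.transform([query])).argmax()

# expand the query with the top terms of that topic
terms = vec.get_feature_names_out()
top = lda.components_[q_topic].argsort()[::-1][:5]
expanded = query + " " + " ".join(terms[j] for j in top)
print(expanded)
```

    The point of restricting expansion to a query-aligned topic, rather than taking top terms from all feedback documents, is to avoid drifting toward facets of the multi-topical result set that the user never asked about.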