584 research outputs found
Identifying the \u27aboutness\u27 of highly structured expository documents
The increases in commercial documentation over the past 50 years and the permeation of computers into all areas of business has led to a major increase in the individual\u27s reading load. This thesis proposes a method of writing procedural documentation to enable rapid appreciation of the \u27aboutness\u27 of such material, thus making the reading task more efficient. The method is derived from a document structure which is used as a basis for the development of rules to construct a hierarchy of in-text headings which encapsulates the \u27aboutness\u27 of the text. Reading efficiency is achieved through needing to only interpret the headings to understand what the document is about. The method was tested by having control and experimental groups complete the same series of questions, answers to which were derived from a set of documents. The set used by participants in the experimental group contained headings structured according to the proposed method; the set used by participants in the control group contained headings which were not structured according to the proposed method. All variables other than headings were negated or neutralised. Answer accuracy and completion times of the groups were compared. On average the experimental group, who used documents containing headings structured according to the proposed method, had 7.5% better accuracy and completed the questions in 13.5% less time overall. These improvements are assumed to be due to the differences in heading effects
The Generation of Compound Nominals to Represent the Essence of Text The COMMIX System
This thesis concerns the COMMIX system, which automatically extracts
information on what a text is about, and generates that information in the highly
compacted form of compound nominal expressions. The expressions generated
are complex and may include novel terms which do not appear themselves in
the input text.
From the practical point of view, the work is driven by the need for better
representations of content: for representations which are shorter and more
concise than would appear in an abstract, yet more informative and
representative of the actual aboutness than commonly occurs in indexing
expressions and key terms. This additional layer of representation is referred to in
this work as pertaining to the essence of a particular text.
From a theoretical standpoint, the thesis shows how the compound
nominal as a construct can be successfully employed in these highly informative
representations. It involves an exploration of the claim that there is sufficient
semantic information contained within the standard dictionary glosses for
individual words to enable the construction of useful and highly representative
novel compound nominal expressions, without recourse to standard syntactic
and statistical methods. It shows how a shallow semantic approach to content
identification which is based on lexical overlap can produce some very
encouraging results.
The methodology employed, and described herein, is domain-independent,
and does not require the specification of templates with which the
input text must comply. In these two respects, the methodology developed in this
work avoids two of the most common problems associated with information
extraction.
As regards the evaluation of this type of work, the thesis introduces and
utilises the notion of percentage attainment value, which is used in conjunction
with subjects' opinions about the degree to which the aboutness terms succeed in
indicating the subject matter of the texts for which they were generated
Recommended from our members
A framework for evaluating automatic indexing or classification in the context of retrieval
Tools for automatic subject assignment help deal with scale and sustainability in creating and enriching metadata, establishing more connections across and between resources and enhancing consistency. While some software vendors and experimental researchers claim the tools can replace manual subject indexing, hard scientific evidence of their performance in operating information environments is scarce. A major reason for this is that research is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The paper reviews and discusses issues with existing evaluation approaches such as problems of aboutness and relevance assessments, implying the need to use more than a single “gold standard” method when evaluating indexing and retrieval and proposes a comprehensive evaluation framework. The framework is informed by a systematic review of the literature on indexing, classification and approaches: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard; evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance
SEL: A unified algorithm for entity linking and saliency detection
The Entity Linking task consists in automatically identifying and linking the entities mentioned in a text to their URIs in a given Knowledge Base, e.g., Wikipedia. Entity Linking has a large impact in several text analysis and information retrieval related tasks. This task is very challenging due to natural language ambiguity. However, not all the entities mentioned in a document have the same relevance and utility in understanding the topics being discussed. Thus, the related problem of identifying the most relevant entities present in a document, also known as Salient Entities, is attracting increasing interest. In this paper we propose SEL, a novel supervised two-step algorithm comprehensively addressing both entity linking and saliency detection. The first step is based on a classifier aimed at identifying a set of candidate entities that are likely to be mentioned in the document, thus maximizing the precision of the method without hindering its recall. The second step is still based on machine learning, and aims at choosing from the previous set the entities that actually occur in the document. Indeed, we tested two different versions of the second step, one aimed at solving only the entity linking task, and the other that, besides detecting linked entities, also scores them according to their saliency. Experiments conducted on two different datasets show that the proposed algorithm outperforms state-of-the-art competitors, and is able to detect salient entities with high accuracy
Using Web Archives to Enrich the Live Web Experience Through Storytelling
Much of our cultural discourse occurs primarily on the Web. Thus, Web preservation is a fundamental precondition for multiple disciplines. Archiving Web pages into themed collections is a method for ensuring these resources are available for posterity. Services such as Archive-It exists to allow institutions to develop, curate, and preserve collections of Web resources. Understanding the contents and boundaries of these archived collections is a challenge for most people, resulting in the paradox of the larger the collection, the harder it is to understand. Meanwhile, as the sheer volume of data grows on the Web, storytelling is becoming a popular technique in social media for selecting Web resources to support a particular narrative or story .
In this dissertation, we address the problem of understanding the archived collections through proposing the Dark and Stormy Archive (DSA) framework, in which we integrate storytelling social media and Web archives. In the DSA framework, we identify, evaluate, and select candidate Web pages from archived collections that summarize the holdings of these collections, arrange them in chronological order, and then visualize these pages using tools that users already are familiar with, such as Storify.
To inform our work of generating stories from archived collections, we start by building a baseline for the structural characteristics of popular (i.e., receiving the most views) human-generated stories through investigating stories from Storify. Furthermore, we checked the entire population of Archive-It collections for better understanding the characteristics of the collections we intend to summarize. We then filter off-topic pages from the collections the using different methods to detect when an archived page in a collection has gone off-topic. We created a gold standard dataset from three Archive-It collections to evaluate the proposed methods at different thresholds. From the gold standard dataset, we identified five behaviors for the TimeMaps (a list of archived copies of a page) based on the page’s aboutness. Based on a dynamic slicing algorithm, we divide the collection and cluster the pages in each slice. We then select the best representative page from each cluster based on different quality metrics (e.g., the replay quality, and the quality of the generated snippet from the page). At the end, we put the selected pages in chronological order and visualize them using Storify.
For evaluating the DSA framework, we obtained a ground truth dataset of hand-crafted stories from Archive-It collections generated by expert archivists. We used Amazon’s Mechanical Turk to evaluate the automatically generated stories against the stories that were created by domain experts. The results show that the automatically generated stories by the DSA are indistinguishable from those created by human subject domain experts, while at the same time both kinds of stories (automatic and human) are easily distinguished from randomly generated storie
Detecting, Modeling, and Predicting User Temporal Intention
The content of social media has grown exponentially in the recent years and its role has evolved from narrating life events to actually shaping them. Unfortunately, content posted and shared in social networks is vulnerable and prone to loss or change, rendering the context associated with it (a tweet, post, status, or others) meaningless. There is an inherent value in maintaining the consistency of such social records as in some cases they take over the task of being the first draft of history as collections of these social posts narrate the pulse of the street during historic events, protest, riots, elections, war, disasters, and others as shown in this work.
The user sharing the resource has an implicit temporal intent: either the state of the resource at the time of sharing, or the current state of the resource at the time of the reader \clicking . In this research, we propose a model to detect and predict the user\u27s temporal intention of the author upon sharing content in the social network and of the reader upon resolving this content. To build this model, we first examine the three aspects of the problem: the resource, time, and the user.
For the resource we start by analyzing the content on the live web and its persistence. We noticed that a portion of the resources shared in social media disappear, and with further analysis we unraveled a relationship between this disappearance and time. We lose around 11% of the resources after one year of sharing and a steady 7% every following year. With this, we turn to the public archives and our analysis reveals that not all posted resources are archived and even they were an average 8% per year disappears from the archives and in some cases the archived content is heavily damaged. These observations prove that in regards to archives resources are not well-enough populated to consistently and reliably reconstruct the missing resource as it existed at the time of sharing. To analyze the concept of time we devised several experiments to estimate the creation date of the shared resources. We developed Carbon Date, a tool which successfully estimated the correct creation dates for 76% of the test sets. Since the resources\u27 creation we wanted to measure if and how they change with time. We conducted a longitudinal study on a data set of very recently-published tweet-resource pairs and recording observations hourly. We found that after just one hour, ~4% of the resources have changed by ≥30% while after a day the change rate slowed to be ~12% of the resources changed by ≥40%.
In regards to the third and final component of the problem we conducted user behavioral analysis experiments and built a data set of 1,124 instances manually assigned by test subjects. Temporal intention proved to be a difficult concept for average users to understand. We developed our Temporal Intention Relevancy Model (TIRM) to transform the highly subjective temporal intention problem into the more easily understood idea of relevancy between a tweet and the resource it links to, and change of the resource through time. On our collected data set TIRM produced a significant 90.27% success rate. Furthermore, we extended TIRM and used it to build a time-based model to predict temporal intention change or steadiness at the time of posting with 77% accuracy. We built a service API around this model to provide predictions and a few prototypes. Future tools could implement TIRM to assist users in pushing copies of shared resources into public web archives to ensure the integrity of the historical record. Additional tools could be used to assist the mining of the existing social media corpus by derefrencing the intended version of the shared resource based on the intention strength and the time between the tweeting and mining
- …