
    Dating Texts without Explicit Temporal Cues

    This paper tackles temporal resolution of documents, such as determining the period a document is about or when it was written, based only on its text. We apply techniques from information retrieval that predict dates via language models over a discretized timeline. Unlike most previous works, we rely solely on temporal cues implicit in the text. We consider both document-likelihood and divergence-based techniques and several smoothing methods for both of them. Our best model predicts the mid-point of individuals' lives with a median error of 22 years and a mean error of 36 years for Wikipedia biographies from 3800 B.C. to the present day. We also show that this approach works well when training on such biographies and predicting dates both for non-biographical Wikipedia pages about specific years (500 B.C. to 2010 A.D.) and for publication dates of short stories (1798 to 2008). Together, our work shows that, even in the absence of temporal extraction resources, it is possible to achieve remarkable temporal locality across a diverse set of texts.
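    The document-likelihood idea in this abstract can be illustrated with a minimal sketch: one unigram language model per time bin, and a prediction that picks the bin under which the text is most likely. The bin labels, the toy corpus, the add-alpha smoothing, and all function names below are assumptions made for illustration; the abstract does not specify the paper's actual smoothing methods or bin widths.

```python
import math
from collections import Counter

def train_bin_models(docs_by_bin):
    """Build unigram counts per time bin.

    docs_by_bin maps a bin label (e.g. a start year) to a list of token lists.
    """
    counts = {
        b: Counter(tok for doc in docs for tok in doc)
        for b, docs in docs_by_bin.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab

def log_likelihood(tokens, bin_counts, vocab_size, alpha=1.0):
    """Log-likelihood of tokens under one bin's add-alpha smoothed unigram model."""
    total = sum(bin_counts.values())
    return sum(
        math.log((bin_counts[t] + alpha) / (total + alpha * vocab_size))
        for t in tokens
    )

def predict_bin(tokens, counts, vocab):
    """Return the time bin whose language model gives the text the highest likelihood."""
    return max(counts, key=lambda b: log_likelihood(tokens, counts[b], len(vocab)))

# Toy training data (hypothetical): two bins on a discretized timeline.
docs_by_bin = {
    1800: [["steam", "engine", "railway"], ["railway", "telegraph"]],
    2000: [["internet", "software", "email"], ["software", "website"]],
}
counts, vocab = train_bin_models(docs_by_bin)
prediction = predict_bin(["software", "internet"], counts, vocab)
```

    A divergence-based variant would instead compare the document's own word distribution against each bin's distribution and pick the closest bin.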

    Category Estimation of Trend Queries by Label Propagation

    Query classification is an important technique for web search engines, allowing them to improve users' search experience. Specifically, query classification methods classify queries according to topical categories, such as celebrities and sports. Such category information is effective in improving web search results, online advertisements, and so on. Unlike previous studies, our research focuses on trend queries that have suddenly become popular and are extensively searched. Our aim is to classify such trend queries in a timely manner, i.e., classify the queries on the same day when they become popular, in order to provide a better search experience. To reduce the expensive manual annotation costs to train supervised learning methods, we focus on a label propagation method that belongs to the semi-supervised learning family. Specifically, the proposed method is based on our previous method that constructs a graph using a corpus, and propagates a small number of ground-truth categories of labeled queries in order to estimate the categories of unlabeled queries. We extend this method to cut ineffective edges to improve both classification accuracy and computational efficiency. Furthermore, we investigate in detail the effects of different corpora, i.e., web/blog/news search results, Tweets, and news pages, on the trend query classification task. Our experiments replicate the situation of an emerging trend query; the results show that web search results are the most effective for trend query classification, achieving a 50.1% F-score, which significantly outperforms the state-of-the-art method by 7.2 points. These results provide useful insights into selecting an appropriate dataset for query classification from the various types of data available.
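    As a rough illustration of label propagation over a query graph, the sketch below clamps a few labeled seed queries and repeatedly averages category scores over weighted neighbours. The graph, the edge weights, the `min_weight` pruning threshold, and the function name are hypothetical; the abstract does not say how the graph is built from a corpus or which criterion the authors use to cut ineffective edges.

```python
def propagate_labels(edges, seeds, categories, n_iter=20, min_weight=0.0):
    """Estimate categories of unlabeled nodes by iterative label propagation.

    edges: dict mapping (u, v) pairs to undirected similarity weights.
    seeds: dict mapping labeled nodes to their ground-truth category.
    """
    # Build the adjacency map, dropping edges at or below min_weight
    # (an assumed stand-in for "cutting ineffective edges").
    adj = {}
    for (u, v), w in edges.items():
        if w > min_weight:
            adj.setdefault(u, {})[v] = w
            adj.setdefault(v, {})[u] = w

    nodes = set(adj) | set(seeds)
    scores = {n: {c: 0.0 for c in categories} for n in nodes}
    for n, c in seeds.items():
        scores[n][c] = 1.0

    for _ in range(n_iter):
        new_scores = {}
        for n in nodes:
            if n in seeds:
                new_scores[n] = scores[n]  # labeled nodes stay clamped
                continue
            nbrs = adj.get(n, {})
            total = sum(nbrs.values())
            new_scores[n] = {
                c: (sum(w * scores[m][c] for m, w in nbrs.items()) / total
                    if total else 0.0)
                for c in categories
            }
        scores = new_scores

    # Report the top category for each unlabeled node that received any score.
    return {
        n: max(categories, key=lambda c: scores[n][c])
        for n in nodes
        if n not in seeds and any(scores[n].values())
    }

# Toy graph (hypothetical queries and weights).
edges = {
    ("final score", "world cup"): 0.9,
    ("final score", "oscars"): 0.1,
    ("red carpet", "oscars"): 0.8,
}
seeds = {"world cup": "sports", "oscars": "celebrities"}
predicted = propagate_labels(edges, seeds, ["sports", "celebrities"])
```

    Raising `min_weight` above 0.1 in this toy graph would remove the weak "final score" to "oscars" edge, which reduces both noise and the number of neighbours visited per iteration.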

    Understanding in-video dropouts and interaction peaks in online lecture videos

    With thousands of learners watching the same online lecture videos, analyzing video watching patterns provides a unique opportunity to understand how students learn with videos. This paper reports a large-scale analysis of in-video dropout and peaks in viewership and student activity, using second-by-second user interaction data from 862 videos in four Massive Open Online Courses (MOOCs) on edX. We find higher dropout rates in longer videos, re-watching sessions (vs first-time), and tutorials (vs lectures). Peaks in re-watching sessions and play events indicate points of interest and confusion. Results show that tutorials (vs lectures) and re-watching sessions (vs first-time) lead to more frequent and sharper peaks. In attempting to reason about why peaks occur by sampling 80 videos, we observe that 61% of the peaks accompany visual transitions in the video, e.g., a slide view to a classroom view. Based on this observation, we identify five student activity patterns that can explain peaks: starting from the beginning of new material, returning to missed content, following a tutorial step, replaying a brief segment, and repeating a non-visual explanation. Our analysis has design implications for video authoring, editing, and interface design, providing a richer understanding of video learning on MOOCs.

    Leveraging Semantic Annotations to Link Wikipedia and News Archives

    The incomprehensible amount of information available online has made it difficult to retrospect on past events. We propose a novel linking problem to connect excerpts from Wikipedia summarizing events to online news articles elaborating on them. To address the linking problem, we cast it into an information retrieval task by treating a given excerpt as a user query with the goal of retrieving a ranked list of relevant news articles. We find that Wikipedia excerpts often come with additional semantics, in their textual descriptions, representing the time, geolocations, and named entities involved in the event. Our retrieval model leverages text and semantic annotations as different dimensions of an event by estimating independent query models to rank documents. In our experiments on two datasets, we compare methods that consider different combinations of dimensions and find that the approach that leverages all dimensions suits our problem best.

    Multiple Models for Recommending Temporal Aspects of Entities

    Entity aspect recommendation is an emerging task in semantic search that helps users discover serendipitous and prominent information with respect to an entity, of which salience (e.g., popularity) is the most important factor in previous work. However, entity aspects are temporally dynamic and often driven by events happening over time. For such cases, aspect suggestion based solely on salience features can give unsatisfactory results, for two reasons. First, salience is often accumulated over a long time period and does not account for recency. Second, many aspects related to an event entity are strongly time-dependent. In this paper, we study the task of temporal aspect recommendation for a given entity, which aims at recommending the most relevant aspects and takes into account time in order to improve search experience. We propose a novel event-centric ensemble ranking method that learns from multiple time and type-dependent models and dynamically trades off salience and recency characteristics. Through extensive experiments on real-world query logs, we demonstrate that our method is robust and achieves better effectiveness than competitive baselines. Comment: In proceedings of the 15th Extended Semantic Web Conference (ESWC 2018).

    Ranking Models for the Temporal Dimension of Text

    Temporal features of text have been shown to improve clustering and organization of documents, text classification, visualization, and ranking. Temporal ranking models consider the temporal expressions found in text (e.g., “in 2021” or “last year”) as time units, rather than as keywords, to define a temporal relevance and improve ranking. This paper introduces a new class of ranking models called Temporal Metric Space Models (TMSM), based on a new domain for representing temporal information found in documents and queries, where each temporal expression is represented as a time interval. Furthermore, we introduce a new frequency-based baseline called Temporal BM25 (TBM25). We evaluate the effectiveness of each proposed metric against a purely textual baseline, as well as several variations of the metrics themselves, where we change the aggregate function, the time granularity and the combination weight. Our extensive experiments on five test collections show statistically significant improvements of TMSM and TBM25 over state-of-the-art temporal ranking models. Combining the temporal similarity scores with the text similarity scores always improves the results when the combination weight for the temporal scores is between 2% and 6%. This also holds for test collections where only 5% of queries contain explicit temporal expressions.
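    To make the interval representation and the combination weight concrete, here is a small sketch under stated assumptions. The distance function (sum of endpoint differences), the conversion from distance to similarity, the `max` aggregate, and the default weight of 0.04 are illustrative choices only; the abstract names the ingredients (a metric over intervals, an aggregate function, a combination weight between 2% and 6%) but not their exact definitions.

```python
def interval_distance(a, b):
    """Distance between two time intervals given as (start_year, end_year).

    Assumed metric: sum of absolute differences of the endpoints.
    """
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def temporal_similarity(query_intervals, doc_intervals):
    """Aggregate interval similarities with max (one possible aggregate function)."""
    if not query_intervals or not doc_intervals:
        return 0.0
    return max(
        1.0 / (1.0 + interval_distance(q, d))
        for q in query_intervals
        for d in doc_intervals
    )

def combined_score(text_score, temporal_score, weight=0.04):
    """Linear combination; weight is the share given to the temporal score."""
    return (1.0 - weight) * text_score + weight * temporal_score

# "in 2021" becomes the interval (2021, 2021); a document mentioning the
# decade 2020-2029 becomes (2020, 2029). Both encodings are hypothetical.
query = [(2021, 2021)]
doc_a = [(2021, 2021)]
doc_b = [(2020, 2029)]
score_a = combined_score(0.5, temporal_similarity(query, doc_a))
score_b = combined_score(0.5, temporal_similarity(query, doc_b))
```

    With equal text scores, the document whose interval matches the query exactly ranks above the one that only overlaps it loosely.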