15 research outputs found
Learning Dynamic Classes of Events using Stacked Multilayer Perceptron Networks
People often use a web search engine to find information about events of
interest, for example, sport competitions, political elections, festivals and
entertainment news. In this paper, we study a problem of detecting
event-related queries, which is the first step before selecting a suitable
time-aware retrieval model. In general, event-related information needs can be
observed in query streams through various temporal patterns of user search
behavior, e.g., spiky peaks for popular events, and periodicities for
repetitive events. However, it is also common that users search for non-popular
events, which may not exhibit temporal variations in query streams, e.g., past
events recently occurred, historical events triggered by anniversaries or
similar events, and future events anticipated to happen. To address the
challenge of detecting dynamic classes of events, we propose a novel deep
learning model to classify a given query into a predetermined set of multiple
event types. Our proposed model, a Stacked Multilayer Perceptron (S-MLP)
network, consists of multilayer perceptron used as a basic learning unit. We
assemble stacked units to further learn complex relationships between neutrons
in successive layers. To evaluate our proposed model, we conduct experiments
using real-world queries and a set of manually created ground truth.
Preliminary results have shown that our proposed deep learning model
outperforms the state-of-the-art classification models significantly.Comment: Neu-IR '16 SIGIR Workshop on Neural Information Retrieval, 6 pages, 4
figure
Time and information retrieval: Introduction to the special issue
The Special Issue of Information Processing and Management includes research papers on the intersection between time and information retrieval. In 'Evaluating Document Filtering Systems over Time', Tom Kenter and Krisztian Balog propose a time-aware way of measuring a system's performance at filtering documents. Manika Kar, SeAa7acute;rgio Nunes and Cristina Ribeiro present interesting methods for summarizing changes in dynamic text collections over time in their paper 'Summarization of Changes in Dynamic Text Collection using Latent Dirichlet Allocation Model.' Hideo Joho, Adam Jatowt and Roi Blanco report on the temporal information searching behaviour of users and their strategies for dealing with searches that have a temporal nature in 'Temporal Information Searching Behaviour and Strategies', a user study. In controlled settings, thirty participants are asked to perform searches on an array of topics on the web to find information related to particular time scopes. Adam Jatowt, Ching-man Au Yeung and Katsumi Tanaka present a 'Generic Method for Detecting Content Time of Documents'. The authors propose several methods for estimating the focus time of documents, i.e. the time a document's content refers to. Xujian Zhao, Peiquan Jin and Lihua Yue present an approach to determining the time of the underlying topic or event in their article entitled 'Discovering Topic Time from Web News'
Diversification Based Static Index Pruning - Application to Temporal Collections
Nowadays, web archives preserve the history of large portions of the web. As
medias are shifting from printed to digital editions, accessing these huge
information sources is drawing increasingly more attention from national and
international institutions, as well as from the research community. These
collections are intrinsically big, leading to index files that do not fit into
the memory and an increase query response time. Decreasing the index size is a
direct way to decrease this query response time.
Static index pruning methods reduce the size of indexes by removing a part of
the postings. In the context of web archives, it is necessary to remove
postings while preserving the temporal diversity of the archive. None of the
existing pruning approaches take (temporal) diversification into account.
In this paper, we propose a diversification-based static index pruning
method. It differs from the existing pruning approaches by integrating
diversification within the pruning context. We aim at pruning the index while
preserving retrieval effectiveness and diversity by pruning while maximizing a
given IR evaluation metric like DCG. We show how to apply this approach in the
context of web archives. Finally, we show on two collections that search
effectiveness in temporal collections after pruning can be improved using our
approach rather than diversity oblivious approaches
LEARNING WORD RELATEDNESS OVER TIME FOR TEMPORAL RANKING
Queries and ranking with temporal aspects gain significant attention in field of Information Retrieval. While searching for articles published over time, the relevant documents usually occur in certain temporal patterns. Given a query that is implicitly time sensitive, we develop a temporal ranking using the important times of query by drawing from the distribution of query trend relatedness over time. We also combine the model with Dual Embedding Space Model (DESM) in the temporal model according to document timestamp. We apply our model using three temporal word embeddings algorithms to learn relatedness of words from news archive in Bahasa Indonesia: (1) QT-W2V-Rank using Word2Vec (2) QT-OW2V-Rank using OrthoTrans-Word2Vec (3) QT-DBE-Rank using Dynamic Bernoulli Embeddings. The highest score was achieved with static word embeddings learned separately over time, called QT-W2V-Rank, which is 66% in average precision and 68% in early precision. Furthermore, studies of different characteristics of temporal topics showed that QT-W2V-Rank is also more effective in capturing temporal patterns such as spikes, periodicity, and seasonality than the baselines
Normalisation of imprecise temporal expressions extracted from text
Information extraction systems and techniques have been largely used to deal with the increasing amount of unstructured data available nowadays. Time is among the different kinds of information that may be extracted from such unstructured data sources, including text documents. However, the inability to correctly identify and extract temporal information from text makes it difficult to understand how the extracted events are organised in a chronological order. Furthermore, in many situations, the meaning of temporal expressions (timexes) is imprecise, such as in “less than 2 years” and “several weeks”, and cannot be accurately normalised, leading to interpretation errors. Although there are some approaches that enable representing imprecise timexes, they are not designed to be applied to specific scenarios and difficult to generalise. This paper presents a novel methodology to analyse and normalise imprecise temporal expressions by representing temporal imprecision in the form of membership functions, based on human interpretation of time in two different languages (Portuguese and English). Each resulting model is a generalisation of probability distributions in the form of trapezoidal and hexagonal fuzzy membership functions. We use an adapted F1-score to guide the choice of the best models for each kind of imprecise timex and a weighted F1-score (F1 3 D ) as a complementary metric in order to identify relevant differences when comparing two normalisation models. We apply the proposed methodology for three distinct classes of imprecise timexes, and the resulting models give distinct insights in the way each kind of temporal expression is interpreted
Exploiting Query’s Temporal Patterns for Query Autocompletion
Query autocompletion (QAC) is a common interactive feature of web search engines. It aims at assisting users to formulate queries and avoiding spelling mistakes by presenting them with a list of query completions as soon as they start typing in the search box. Existing QAC models mostly rank the query completions by their past popularity collected in the query logs. For some queries, their popularity exhibits relatively stable or periodic behavior while others may experience a sudden rise in their query popularity. Current time-sensitive QAC models focus on either periodicity or recency and are unable to respond swiftly to such sudden rise, resulting in a less optimal QAC performance. In this paper, we propose a hybrid QAC model that considers two temporal patterns of query’s popularity, that is, periodicity and burst trend. In detail, we first employ the Discrete Fourier Transform (DFT) to identify the periodicity of a query’s popularity, by which we forecast its future popularity. Then the burst trend of query’s popularity is detected and incorporated into the hybrid model with its cyclic behavior. Extensive experiments on a large, real-world query log dataset infer that modeling the temporal patterns of query popularity in the form of its periodicity and its burst trend can significantly improve the effectiveness of ranking query completions
Ranking Models for the Temporal Dimension of Text
Temporal features of text have been shown to improve clustering and organization of documents, text classification, visualization, and ranking. Temporal ranking models consider the temporal expressions found in text (e.g., “in 2021” or “last year”) as time units, rather than as keywords, to define a temporal relevance and improve ranking. This paper introduces a new class of ranking models called Temporal Metric Space Models (TMSM), based on a new domain for representing temporal information found in documents and queries, where each temporal expression is represented as a time interval. Furthermore, we introduce a new frequency-based baseline called Temporal BM25 (TBM25). We evaluate the effectiveness of each proposed metric against a purely textual baseline, as well as several variations of the metrics themselves, where we change the aggregate function, the time granularity and the combination weight. Our extensive experiments on five test collections show statistically significant improvements of TMSM and TBM25 over state-of-the-art temporal ranking models. Combining the temporal similarity scores with the text similarity scores always improves the results, when the combination weight is between 2% and 6% for the temporal scores. This is true also for test collections where only 5% of queries contain explicit temporal expressions
Temporal Information Models for Real-Time Microblog Search
Real-time search in Twitter and other social media services is often biased
towards the most recent results due to the “in the moment” nature of topic
trends and their ephemeral relevance to users and media in general. However,
“in the moment”, it is often difficult to look at all emerging topics and single-out
the important ones from the rest of the social media chatter. This thesis proposes
to leverage on external sources to estimate the duration and burstiness of live
Twitter topics. It extends preliminary research where itwas shown that temporal
re-ranking using external sources could indeed improve the accuracy of results.
To further explore this topic we pursued three significant novel approaches: (1)
multi-source information analysis that explores behavioral dynamics of users,
such as Wikipedia live edits and page view streams, to detect topic trends
and estimate the topic interest over time; (2) efficient methods for federated
query expansion towards the improvement of query meaning; and (3) exploiting
multiple sources towards the detection of temporal query intent. It differs from
past approaches in the sense that it will work over real-time queries, leveraging
on live user-generated content. This approach contrasts with previous methods
that require an offline preprocessing step
Web Archive Services Framework for Tighter Integration Between the Past and Present Web
Web archives have contained the cultural history of the web for many years, but they still have a limited capability for access. Most of the web archiving research has focused on crawling and preservation activities, with little focus on the delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all the users\u27 needs. In this dissertation, we focus on the access methods for archived web data to enable users, third-party developers, researchers, and others to gain knowledge from the web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique to divide the archived corpus into four levels. For each level, we will propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level that extracts the content from the archived web data. We develop ArcContent to expose the web archive content processed through various filters. The second level is the metadata level; we extract the metadata from the archived web data and make it available to users. We implement two services, ArcLink for temporal web graph and ArcThumb for optimizing the thumbnail creation in the web archives. The third level is the URI level that focuses on using the URI HTTP redirection status to enhance the user query. Finally, the highest level in the web archiving service framework pyramid is the archive level. In this level, we define the web archive by the characteristics of its corpus and building Web Archive Profiles. The profiles are used by the Memento Aggregator for query optimization