4 research outputs found
Modeling Temporal Evidence from External Collections
Newsworthy events are broadcast through multiple mediums and prompt the
crowds to produce comments on social media. In this paper, we propose to
leverage these behavioral dynamics to estimate the most relevant time periods
for an event (i.e., query). Recent advances have shown how to improve the
estimation of the temporal relevance of such topics. Our approach builds
on two major novelties. First, we mine temporal evidence from hundreds of
external sources into topic-based external collections to improve the
robustness of the detection of relevant time periods. Second, we propose a
formal retrieval model that generalizes the use of the temporal dimension
across different aspects of the retrieval process. In particular, we show that
temporal evidence of external collections can be used to (i) infer a topic's
temporal relevance, (ii) select the query expansion terms, and (iii) re-rank
the final results for improved precision. Experiments with TREC Microblog
collections show that the proposed time-aware retrieval model makes
effective and extensive use of the temporal dimension to improve search results
over the most recent temporal models. Interestingly, we observe a strong
correlation between precision and the temporal distribution of retrieved and
relevant documents.
Comment: To appear in WSDM 201
Biomedical information extraction for matching patients to clinical trials
Digital medical information has grown astonishingly over the last decades, driven
by an unprecedented number of medical writers, which has led to a complete revolution in
what and how much information is available to health professionals.
The problem with this wave of information is that precisely selecting
the information retrieved from medical information repositories is exhausting and time
consuming for physicians. This is one of the biggest challenges for physicians in the
new digital era: how to reduce the time spent finding the best-matching document for a
patient (e.g. intervention articles, clinical trials, prescriptions).
Precision Medicine (PM) 2017 is a track of the Text REtrieval Conference (TREC)
that focuses on this type of challenge exclusively for oncology. Using a dataset with a
large number of clinical trials, this track is a good real-life example of how information
retrieval solutions can be used to solve these types of problems. This track can be a very
good starting point for applying information extraction and retrieval methods in a very
complex domain.
The purpose of this thesis is to improve a system designed by the NovaSearch team
for the TREC PM 2017 Clinical Trials task, which ranked among the top-5 systems of 2017.
The NovaSearch team also participated in the 2018 track and achieved a 15% increase in
precision compared to the 2017 one. Multiple IR techniques were used for information
extraction and data processing, including rank fusion, query expansion (e.g. pseudo
relevance feedback, MeSH term expansion) and experiments with Learning to Rank
(LETOR) algorithms. Our goal is to retrieve the best possible set of trials for a given
patient, using precise document filters to exclude unwanted clinical trials. This work
can open doors in what can be done for searching and understanding the criteria to exclude or
include trials, helping physicians even with the more complex and difficult information
retrieval tasks.
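The abstract mentions rank fusion without saying which fusion method was used. As a minimal sketch only, the following shows reciprocal rank fusion (RRF), one common way to merge ranked lists; the choice of RRF, the constant k=60, the function name, and the trial ids are all assumptions for illustration and are not taken from the thesis.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a commonly used default; assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, descending; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical trial ids from two hypothetical retrieval runs.
run_a = ["trial_1", "trial_2", "trial_3"]
run_b = ["trial_2", "trial_4", "trial_1"]
fused = reciprocal_rank_fusion([run_a, run_b])
print(fused)
```

Trials that appear near the top of several runs accumulate the highest fused score, which is why fusion tends to reward agreement between retrieval strategies.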
D.W.: HLTCOE at TREC 2014: Microblog and clinical decision support
Our team submitted runs for both the Microblog and Clinical Decision Support tracks. For the Microblog track, we participated in both the temporally anchored ad hoc search and the tweet timeline generation subtasks. On the Clinical Decision Support task, our efforts were time limited, and our main contribution was to investigate controlling for morphological variation in this technical domain.
Microblogging Temporal Summarization: Filtering Important Twitter Updates for Breaking News
While news stories are an important traditional medium to broadcast and consume news, microblogging has recently emerged as a place where people can discuss, disseminate, collect or report information about news. However, the massive information in the microblogosphere makes it hard for readers to keep up with these real-time updates. This is especially a problem when it comes to breaking news, where people are more eager to know “what is happening”. Therefore, this dissertation is intended as an exploratory effort to investigate computational methods to augment human effort when monitoring the development of breaking news on a given topic from a microblog stream by extractively summarizing the updates in a timely manner.
More specifically, given an interest in a topic, either entered as a query or presented as an initial news report, a microblog temporal summarization system is proposed to filter microblog posts from a stream with three primary concerns: topical relevance, novelty, and salience. Considering the relatively high arrival rate of microblog streams, a cascade framework consisting of three stages is proposed to progressively reduce the quantity of posts. For each step in the cascade, this dissertation studies methods that improve over current baselines.
In the relevance filtering stage, query and document expansion techniques are applied to mitigate sparsity and vocabulary mismatch issues. The use of word embeddings as a basis for filtering is also explored, using unsupervised and supervised modeling to characterize lexical and semantic similarity. In the novelty filtering stage, several statistical ways of characterizing novelty are investigated, and ensemble learning techniques are used to integrate results from these diverse techniques. These results are compared with a baseline clustering approach using both standard and delay-discounted measures. In the salience filtering stage, because of the real-time prediction requirement, a method of learning verb phrase usage from past relevant news reports is used in conjunction with some standard measures for characterizing writing quality.
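The three-stage cascade described above (relevance, then novelty, then salience) can be sketched roughly as follows. This is only an illustration of the cascade structure: the per-stage tests used here (query term overlap, Jaccard similarity against already kept posts, and a minimum token count), along with all function names, thresholds, and example posts, are hypothetical stand-ins and not the dissertation's methods.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def is_relevant(post, query, min_overlap=1):
    # Stage 1 (stand-in): the post shares at least min_overlap terms with the query.
    return len(tokens(post) & tokens(query)) >= min_overlap


def is_novel(post, kept, max_jaccard=0.6):
    # Stage 2 (stand-in): reject posts too similar to any post already kept.
    p = tokens(post)
    for previous in kept:
        q = tokens(previous)
        union = p | q
        if union and len(p & q) / len(union) > max_jaccard:
            return False
    return True


def is_salient(post, min_tokens=4):
    # Stage 3 (stand-in): very short posts are treated as uninformative.
    return len(post.split()) >= min_tokens


def cascade(stream, query):
    """Pass each post through the three filters in order, keeping survivors."""
    kept = []
    for post in stream:
        if not is_relevant(post, query):
            continue
        if not is_novel(post, kept):
            continue
        if not is_salient(post):
            continue
        kept.append(post)
    return kept


# Hypothetical stream of posts about a hypothetical event.
stream = [
    "Earthquake hits the city center this morning",
    "earthquake hits the city center this morning",
    "Lovely weather today",
    "earthquake update",
    "Officials report earthquake damage to several bridges",
]
query = "earthquake"
summary = cascade(stream, query)
print(summary)
```

Ordering the stages from cheapest to most selective means later, potentially more expensive checks only see the posts that survived the earlier ones, which matches the stated motivation of progressively reducing the quantity of posts.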
Following a Cranfield-like evaluation paradigm, this dissertation includes a series of experiments to evaluate the proposed methods for each step, and for the end-to-end system. New microblog novelty and salience judgments are created, building on existing relevance judgments from the TREC Microblog track. The results point to future research directions at the intersection of social media, computational journalism, information retrieval, automatic summarization, and machine learning.