Efficient Inference, Search and Evaluation for Latent Variable Models of Text with Applications to Information Retrieval and Machine Translation
Latent variable models of text, such as topic models, have been explored in many areas of natural language processing, information retrieval and machine translation to aid tasks such as exploratory data analysis, automated topic clustering and finding similar documents in mono- and multilingual collections. Many additional applications of these models, however, could be enabled by more efficient techniques for processing large datasets.
In this thesis, we introduce novel methods that offer efficient inference, search and evaluation for latent variable models of text. We present efficient, online inference for representing documents in several languages in a common topic space and fast approximations for finding near neighbors in the probability simplex representation of mono- and multilingual document collections. Empirical evaluations show that these methods are as accurate as, and significantly faster than, Gibbs sampling and brute-force all-pairs search, respectively. In addition, we present a new extrinsic evaluation metric that achieves very high correlation with common performance metrics while being more efficient to compute. We showcase the efficacy and efficiency of our new approaches on the problems of modeling and finding similar documents in a retrieval system for scientific papers, detecting document translation pairs, and extracting parallel sentences from large comparable corpora. This last task, in turn, allows us to efficiently train a translation model from comparable corpora that outperforms a model trained on parallel data.
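The abstract does not spell out the fast approximation for near-neighbor search in the probability simplex. One standard trick, sketched below under that assumption, is the square-root map: Hellinger distance between two distributions equals (up to a constant) the Euclidean distance between their element-wise square roots, so any off-the-shelf Euclidean nearest-neighbor index can be reused unchanged on topic-space representations.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two points on the probability simplex."""
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0)

def sqrt_embed(p):
    """Map a distribution onto the unit sphere; Euclidean distance in this
    embedded space is sqrt(2) times the Hellinger distance, so standard
    Euclidean (approximate) nearest-neighbour structures apply directly."""
    return np.sqrt(p)

# Two toy topic distributions over a 3-topic model.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

d_direct = hellinger(p, q)
d_embed = np.linalg.norm(sqrt_embed(p) - sqrt_embed(q)) / np.sqrt(2.0)
```

The two quantities agree exactly, which is what licenses indexing the embedded vectors instead of the raw distributions.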
Lastly, we improve the latent variable model representation of large documents in mono- and multilingual collections by introducing online inference for topic models with a hierarchical Dirichlet prior structure over textual regions such as document sections. Modeling variation across textual regions using online inference offers a document representation that is more effective and efficient than a bag of words, which usually handicaps the performance of these models on large documents.
Use Case Oriented Medical Visual Information Retrieval & System Evaluation
Large amounts of medical visual data are produced daily in hospitals, while new imaging techniques continue to emerge. In addition, many images are made available continuously via publications in the scientific literature and can also be valuable for clinical routine, research and education. Information retrieval systems are useful tools to provide access to the biomedical literature and fulfil the information needs of medical professionals. The tools developed in this thesis can potentially help clinicians make decisions about difficult diagnoses via a case-based retrieval system based on a use case associated with a specific evaluation task. This system retrieves articles from the biomedical literature when querying with a case description and attached images. This thesis proposes a multimodal approach for medical case-based retrieval with a focus on the integration of visual information connected to text. Furthermore, the ImageCLEFmed evaluation campaign was organised during this thesis, promoting medical retrieval system evaluation.
Real-time event detection in massive streams
Grant award number: EP/J020664/1

New event detection, also known as first story detection (FSD), has become very popular in recent years. The task consists of finding previously unseen events in a stream of documents. Despite its apparent simplicity, FSD is very challenging and has applications wherever timely access to fresh information is crucial: from journalism to stock market trading, homeland security, and emergency response. With the rise of user-generated content and citizen journalism we have entered an era of big and noisy data, yet traditional approaches to FSD are not designed to deal with this new type of data.
The amount of information generated today exceeds previously available datasets by many orders of magnitude, making traditional approaches obsolete for modern event detection. In this thesis, we propose a modern approach to event detection that scales to unbounded streams of text without sacrificing accuracy. This crucial property enables us to detect events in large streams like Twitter, which none of the previous approaches were able to do.
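The abstract does not detail how per-document cost is kept bounded on an unbounded stream. A common way to achieve this, shown as a minimal sketch below (not necessarily the exact scheme used in the thesis), is cosine locality-sensitive hashing: each incoming document is compared only against earlier documents that fall in the same hash bucket, rather than against the entire history.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, BITS = 100, 16
# Random hyperplanes: the sign pattern of a vector's projections forms a
# 16-bit signature, and vectors with high cosine similarity tend to collide.
planes = rng.standard_normal((BITS, DIM))

def signature(v):
    return tuple((planes @ v) > 0)

buckets = {}  # signature -> list of previously seen document vectors

def process(v):
    """Return a novelty score in [0, 1]; high means a likely new event.
    Only the document's own bucket is searched, so per-document cost
    stays roughly constant instead of growing with the stream."""
    cands = buckets.get(signature(v), [])
    sim = max((float(v @ c) / (np.linalg.norm(v) * np.linalg.norm(c))
               for c in cands), default=0.0)
    buckets.setdefault(signature(v), []).append(v)
    return 1.0 - sim
```

A first-seen document scores maximal novelty; a near-duplicate arriving later lands in the same bucket and scores close to zero.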
One of the major problems in detecting new events is vocabulary mismatch, also known as lexical variation: different authors use different words to describe the same event, a problem inherent to human language. We show how to mitigate this problem in FSD by using paraphrases. Our paraphrase-based approach achieves state-of-the-art results on the FSD task while remaining efficient and able to process unbounded streams.
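To make the lexical-variation problem concrete, the toy sketch below expands each document's terms with paraphrases before comparing documents. The paraphrase table here is invented purely for illustration; real systems derive one from large paraphrase resources.

```python
# Hypothetical paraphrase table, invented for this example.
PARAPHRASES = {"blast": {"explosion"}, "quake": {"earthquake"}}

def expand(terms):
    """Add known paraphrases of each term to the document's term set."""
    out = set(terms)
    for t in terms:
        out |= PARAPHRASES.get(t, set())
    return out

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Two reports of the same event using different words for it.
d1 = {"blast", "downtown"}
d2 = {"explosion", "downtown"}

raw = jaccard(d1, d2)                   # misses the match on the event word
exp = jaccard(expand(d1), expand(d2))   # paraphrase expansion recovers it
```

After expansion the two reports overlap on the event word, so they are less likely to be mistaken for two distinct events.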
Another important property of user-generated content is its high level of noise, and Twitter is no exception. Traditional approaches were not designed to deal with this problem either, and here we investigate different methods of reducing the amount of noise. We show that, by using information from Wikipedia, it is possible to significantly reduce the number of spurious events detected in Twitter while maintaining a very small detection latency.
A question often raised is whether Twitter is at all useful, especially if one has access to a high-quality stream such as newswire, or whether it should be considered a sort of poor man's newswire. In our comparison of these two streams we find that Twitter contains events not present in the newswire, and that it also breaks some events sooner, showing that it is useful for event detection even in the presence of newswire.