Document re-ranking using cluster validation and label propagation
This paper proposes a novel document re-ranking approach for information retrieval (IR), based on a label propagation semi-supervised learning algorithm that exploits the intrinsic structure of large document collections. Since labeled relevant or irrelevant documents are generally not available in IR, our approach extracts pseudo-labeled documents from the ranked list returned by the initial retrieval. For pseudo-relevant documents, we select a cluster of documents from the top of the list via cluster validation-based k-means clustering; for pseudo-irrelevant documents, we pick a set of documents from the bottom of the list. The documents are then re-ranked via label propagation. Evaluation on benchmark corpora shows that the approach achieves significant improvement over standard baselines and performs better than other related approaches.
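The re-ranking step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-similarity graph, the neutral starting score of 0.5, the fixed iteration count, and the function names `cosine` and `rerank_by_label_propagation` are all assumptions made for illustration, and the pseudo-relevant and pseudo-irrelevant seeds are passed in directly rather than chosen by cluster validation-based k-means.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rerank_by_label_propagation(vectors, relevant, irrelevant, iterations=50):
    """Re-rank documents by propagating seed labels over a similarity graph.

    vectors    -- one term-weight vector per document (hypothetical input)
    relevant   -- indices of pseudo-relevant seeds (score clamped to 1.0)
    irrelevant -- indices of pseudo-irrelevant seeds (score clamped to 0.0)
    """
    n = len(vectors)
    # Dense similarity graph without self-loops (an assumed graph construction).
    weights = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    scores = [0.5] * n  # unlabeled documents start at a neutral score
    for i in relevant:
        scores[i] = 1.0
    for i in irrelevant:
        scores[i] = 0.0
    clamped = set(relevant) | set(irrelevant)

    for _ in range(iterations):
        updated = []
        for i in range(n):
            total = sum(weights[i])
            if i in clamped or total == 0.0:
                # Seeds keep their labels; isolated documents keep their score.
                updated.append(scores[i])
            else:
                # Each unlabeled score becomes the weighted mean of its neighbours.
                weighted = sum(w * s for w, s in zip(weights[i], scores))
                updated.append(weighted / total)
        scores = updated

    ranking = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranking, scores
```

For example, with four toy documents where document 0 is a pseudo-relevant seed and document 3 a pseudo-irrelevant seed, `rerank_by_label_propagation([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]], [0], [3])` places the document most similar to the relevant seed directly after it.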
Document Retrieval on Repetitive Collections
Document retrieval aims at finding the most important documents where a
pattern appears in a collection of strings. Traditional pattern-matching
techniques yield brute-force document retrieval solutions, which has motivated
the research on tailored indexes that offer near-optimal performance. However,
an experimental study establishing which alternatives are actually better than
brute force, and which perform best depending on the collection
characteristics, has not been carried out. In this paper we address this
shortcoming by exploring the relationship between the nature of the underlying
collection and the performance of current methods. Via extensive experiments we
show that established solutions are often beaten in practice by brute-force
alternatives. We also design new methods that offer superior time/space
trade-offs, particularly on repetitive collections.
Comment: Accepted to ESA 2014. Implementation and experiments at
http://www.cs.helsinki.fi/group/suds/rlcsa
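As a point of reference for the brute-force baseline this abstract mentions, the following is a minimal sketch of top-k document retrieval by plain scanning. It assumes that "most important" means highest pattern frequency and that overlapping occurrences are counted; the function names `count_occurrences` and `top_k_documents` are hypothetical, and the paper's actual brute-force methods operate on compressed indexes rather than raw strings.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(collection, pattern, k):
    """Return up to k (document index, frequency) pairs, most frequent first.

    Every document is scanned in full, which is the brute-force baseline;
    ties are broken by the lower document index.
    """
    if not pattern:
        return []
    hits = []
    for index, document in enumerate(collection):
        frequency = count_occurrences(document, pattern)
        if frequency > 0:
            hits.append((index, frequency))
    hits.sort(key=lambda pair: (-pair[1], pair[0]))
    return hits[:k]
```

Calling `top_k_documents(["abracadabra", "banana", "cabra", "xyz"], "a", 2)` scans all four strings and keeps the two with the most occurrences of the pattern.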
Automated legal sensemaking: the centrality of relevance and intentionality
Introduction: In a perfect world, discovery would be conducted by the senior litigator who is
responsible for developing and fully understanding all nuances of their client’s legal strategy. Of
course, today we must deal with the explosion of electronically stored information (ESI), which
is never less than tens of thousands of documents in small cases and now increasingly involves
multi-million-document populations for internal corporate investigations and litigations.
Therefore scalable processes and technologies are required as a substitute for the authority’s
judgment. The approaches taken have typically either substituted large teams of surrogate
human reviewers using vastly simplified issue coding reference materials or employed
increasingly sophisticated computational resources with little focus on quality metrics to ensure
retrieval consistent with the legal goal. What is required is a system (people, process, and
technology) that replicates and automates the senior litigator’s human judgment.
In this paper we utilize 15 years of sensemaking research to establish the minimum acceptable
basis for conducting a document review that meets the needs of a legal proceeding. There is
no substitute for a rigorous characterization of the explicit and tacit goals of the senior litigator.
Once a process has been established for capturing the authority’s relevance criteria, we argue
that literal translation of requirements into technical specifications does not properly account for
the activities or states-of-affairs of interest. Having only a data warehouse of written records, it
is also necessary to discover the intentions of actors involved in textual communications. We
present quantitative results for a process and technology approach that automates effective
legal sensemaking.
Modeling Documents with Deep Boltzmann Machines
We introduce a Deep Boltzmann Machine model suitable for modeling and
extracting latent semantic representations from a large unstructured collection
of documents. We overcome the apparent difficulty of training a DBM with
judicious parameter tying. This parameter tying enables an efficient
pretraining algorithm and a state initialization scheme that aids inference.
The model can be trained just as efficiently as a standard Restricted Boltzmann
Machine. Our experiments show that the model assigns better log probability to
unseen data than the Replicated Softmax model. Features extracted from our
model outperform LDA, Replicated Softmax, and DocNADE models on document
retrieval and document classification tasks.
Comment: Appears in Proceedings of the Twenty-Ninth Conference on Uncertainty
in Artificial Intelligence (UAI2013)
The State-of-the-arts in Focused Search
The continuous influx of text data on the Web requires search engines to improve their ability to retrieve more specific information. The need for results relevant to a user’s topic of interest has gone beyond searching for domain-specific or type-specific documents to more focused results (e.g. document fragments or answers to a query). The introduction of XML provides a standard format for data representation, storage, and exchange. It allows focused search to be carried out at different granularities of a structured document with XML markup. This report reviews the state of the art in focused search, particularly techniques for topic-specific document retrieval, passage retrieval, XML retrieval, and entity ranking. It concludes by highlighting open problems.
Bridging the Semantic Gap in Multimedia Information Retrieval: Top-down and Bottom-up approaches
Semantic representation of multimedia information is vital for enabling the kind of multimedia search capabilities that professional searchers require. Manual annotation is often not possible because of the sheer scale of the multimedia information that needs indexing. This paper explores the ways in which we are using both top-down, ontologically driven approaches and bottom-up, automatic-annotation approaches to provide retrieval facilities to users. We also discuss many of the current techniques that we are investigating to combine these top-down and bottom-up approaches.