1,268 research outputs found
Using Learning to Rank Approach to Promoting Diversity for Biomedical Information Retrieval with Wikipedia
In most of the traditional information retrieval (IR) models, the independent
relevance assumption is taken, which assumes the relevance of a document is
independent of other documents. However, the pitfall of this is the high redundancy
and low diversity of retrieval result. This has been seen in many scenarios, especially
in biomedical IR, where the information need of one query may refer to different
aspects. Promoting diversity in IR takes the relationship between documents into
account. Unlike previous studies, we tackle this problem in the learning to rank
perspective. The main challenges are how to find salient features for biomedical data
and how to integrate dynamic features into the ranking model. To address these
challenges, Wikipedia is used to detect topics of documents for generating diversity
biased features. A combined model is proposed and studied to learn a diversified
ranking result. Experiment results show the proposed method outperforms baseline
models
Semantic concept extraction from electronic medical records for enhancing information retrieval performance
With the healthcare industry increasingly using EMRs, there emerges an opportunity for knowledge discovery within the healthcare domain that was not possible with paper-based medical records. One such opportunity is to discover UMLS concepts from EMRs. However, with opportunities come challenges that need to be addressed. Medical verbiage is very different from common English verbiage and it is reasonable to assume extracting any information from medical text requires different protocols than what is currently used in common English text. This thesis proposes two new semantic matching models: Term-Based Matching and CUI-Based Matching. These two models use specialized biomedical text mining tools that extract medical concepts from EMRs. Extensive experiments to rank the extracted concepts are conducted on the University of Pittsburgh BLULab NLP Repository for the TREC 2011 Medical Records track dataset that consists of 101,711 EMRs that contain concepts in 34 predefined topics. This thesis compares the proposed semantic matching models against the traditional weighting equations and information retrieval tools used in the academic world today
Search beyond traditional probabilistic information retrieval
"This thesis focuses on search beyond probabilistic information retrieval. Three ap- proached are proposed beyond the traditional probabilistic modelling. First, term associ- ation is deeply examined. Term association considers the term dependency using a factor analysis based model, instead of treating each term independently. Latent factors, con- sidered the same as the hidden variables of ""eliteness"" introduced by Robertson et al. to gain understanding of the relation among term occurrences and relevance, are measured by the dependencies and occurrences of term sequences and subsequences. Second, an entity-based ranking approach is proposed in an entity system named ""EntityCube"" which has been released by Microsoft for public use. A summarization page is given to summarize the entity information over multiple documents such that the truly relevant entities can be highly possibly searched from multiple documents through integrating the local relevance contributed by proximity and the global enhancer by topic model. Third, multi-source fusion sets up a meta-search engine to combine the ""knowledge"" from different sources. Meta-features, distilled as high-level categories, are deployed to diversify the baselines. Three modified fusion methods are employed, which are re- ciprocal, CombMNZ and CombSUM with three expanded versions. Through extensive experiments on the standard large-scale TREC Genomics data sets, the TREC HARD data sets and the Microsoft EntityCube Web collections, the proposed extended models beyond probabilistic information retrieval show their effectiveness and superiority.
Statistical Modeling to Information Retrieval for Searching from Big Text Data and Higher Order Inference for Reliability
This thesis examined two research projects: probabilistic information retrieval modeling and third-order inference on reliability.
In the first part of this dissertation, two research topics in the information retrieval are carried out and experimented on large-scale text data set. First, we conduct an in-depth study of relationship between information of document length and document relevance to user need. Two statistical methods are proposed which incorporates document length as a substantial weighting factor to achieve higher retrieval performance. Second, we utilize the property of survival function to propose a cost-based re-ranking method to promote ranking diversity for biomedical information retrieval, and to model the proximity between query terms to improve retrieval performance. Through extensive experiments on standard TREC collections, our proposed models perform significantly better than the classical probabilistic information retrieval models.
In the second part of this dissertation, a small sample asymptotic method is proposed for higher order inference in the stress-strength reliability model, R=P(Y<X), where X and Y are independently distributed. A penalized likelihood method is proposed to handle the numerical complications of maximizing the constrained likelihood model. Simulation studies are conducted on two distributions: Burr type X distribution and exponentiated exponential distribution. Results from simulation studies show that the proposed method is very accurate even when the sample sizes are small
- …