30 research outputs found
Incremental Algorithms for Effective and Efficient Query Recommendation
Query recommender systems give users hints on possibly interesting queries relative to their information needs. Most query recommenders are based on static knowledge models built on the basis of past user behaviors recorded in query logs. These models should be periodically updated, or rebuilt from scratch, to keep up with the possible variations in the interests of users. We study query recommender algorithms that generate suggestions on the basis of models that are updated continuously, each time a new query is submitted. We extend two state-of-the-art query recommendation algorithms and evaluate the effects of continuous model updates on their effectiveness and efficiency. Tests conducted on an actual query log show that counteracting model aging by continuously updating the recommendation model is a viable and effective solution.
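The idea of updating the model on every submitted query can be illustrated with a minimal sketch. This is not one of the two algorithms the abstract extends (it does not name them); it is a generic successor-count recommender, and the class name `IncrementalRecommender`, the `submit` method, and the per-user session handling are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class IncrementalRecommender:
    """Toy model: counts which query tends to follow which (hypothetical design)."""

    def __init__(self):
        # successors[q] counts the queries observed right after q.
        self.successors = defaultdict(Counter)
        # last_query[user] is the previous query submitted by that user.
        self.last_query = {}

    def submit(self, user, query, k=3):
        """Return up to k suggestions for `query`, then update the model."""
        suggestions = [q for q, _ in self.successors[query].most_common(k)]
        prev = self.last_query.get(user)
        if prev is not None and prev != query:
            # The update happens immediately, so the next call already sees it.
            self.successors[prev][query] += 1
        self.last_query[user] = query
        return suggestions
```

Because the counts change inside `submit`, no periodic rebuild is needed: a reformulation made by one user is available as a suggestion to the next user who submits the same query.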
Using Noun Phrases for Navigating Biomedical Literature on Pubmed: How Many Updates Are We Losing Track of?
Author-supplied citations are a fraction of the related literature for a paper. The “related citations” list on PubMed is typically dozens or hundreds of results long and offers no hints as to why these results are related. Using noun phrases derived from the sentences of the paper, we show it is possible to navigate more transparently to PubMed updates through search terms that can associate a paper with its citations. The algorithm that generates these search terms automatically extracts noun phrases from the paper using natural language processing tools and ranks them by the number of occurrences in the paper compared to the number of occurrences on the web. We define search queries having at least one instance of overlap between the author-supplied citations of the paper and the top 20 search results as citation validated (CV). When the overlapping citations were written by the same authors as the paper itself, we label the query CV-S; when they were written by different authors, we label it CV-D. For a systematic sample of 883 papers on PubMed Central, at least one of the search terms for 86% of the papers is CV-D versus 65% for the top 20 PubMed “related citations.” We hypothesize that these quantities, computed for the 20 million papers on PubMed, would be within 5% of these percentages. Averaged across all 883 papers, 5 search terms are CV-D, 10 search terms are CV-S, and 6 unique citations validate these searches. Potentially related literature uncovered by citation-validated searches (either CV-S or CV-D) is on the order of ten per paper, and many more if the remaining searches that are not citation-validated are taken into account. The significance and relationship of each search result to the paper can only be vetted and explained by a researcher with knowledge of or interest in that paper.
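The CV, CV-S and CV-D definitions above can be sketched in a few lines. The `Paper` record, the `classify_query` function, and the reading of "same authors" as "shares at least one author with the paper" are assumptions for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paper:
    pid: str            # hypothetical identifier, e.g. a PubMed ID
    authors: frozenset  # author names


def classify_query(results, citations, paper_authors, top_k=20):
    """Return the labels ("CV-S", "CV-D") earned by one search query.

    An empty set means the query is not citation validated.
    Assumption: a cited hit counts as "same authors" if it shares
    at least one author with the paper.
    """
    cited_ids = {c.pid for c in citations}
    labels = set()
    for hit in results[:top_k]:
        if hit.pid not in cited_ids:
            continue  # only overlaps with author-supplied citations count
        if hit.authors & paper_authors:
            labels.add("CV-S")
        else:
            labels.add("CV-D")
    return labels
```

Under this sketch a single query can earn both labels when its top 20 results contain a self-citation and a citation to other authors.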
Improved Topic-Dependent Language Modeling Using Information Retrieval Techniques
N-gram language models are frequently used by speech recognition systems to constrain and guide the search. N-gram models use only the last N-1 words to predict the next word. Typical values of N range from 2 to 4. N-gram language models thus lack long-term context information. We show that the predictive power of N-gram language models can be improved by using long-term context information about the topic of discussion. We use information retrieval techniques to generalize the available context information for topic-dependent language modeling. We demonstrate the effectiveness of this technique by performing experiments on the Wall Street Journal text corpus, which is a relatively difficult task for topic-dependent language modeling since the text is relatively homogeneous. The proposed method can reduce the perplexity of the baseline language model by 37%, indicating the predictive power of the topic-dependent language model.
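For readers unfamiliar with the two quantities the abstract relies on, the following sketch shows a bigram model (N = 2, so one word of context) and its perplexity. It is a plain baseline, not the topic-dependent model from the paper; the add-one smoothing and the function names are assumptions chosen for brevity.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Collect counts for a bigram model from a token list."""
    contexts = Counter(tokens[:-1])             # how often each word is a context
    bigrams = Counter(zip(tokens, tokens[1:]))  # (previous word, word) counts
    vocab = set(tokens)
    return contexts, bigrams, vocab


def bigram_prob(prev, word, model):
    """P(word | prev) with add-one smoothing (an assumed, simple choice)."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


def perplexity(tokens, model):
    """exp of the average negative log-probability per predicted word.

    Assumes every token is in the training vocabulary.
    """
    pairs = list(zip(tokens, tokens[1:]))
    log_sum = sum(math.log(bigram_prob(p, w, model)) for p, w in pairs)
    return math.exp(-log_sum / len(pairs))
```

Lower perplexity means the model is less surprised by the text, so a 37% reduction relative to the baseline corresponds to multiplying the baseline perplexity by 0.63.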
Parents’ online school reviews reflect several racial and socioeconomic disparities in K–12 education
Parents often select schools by relying on subjective assessments of quality made by other parents, which are increasingly becoming available through written reviews on school ratings websites. To identify relationships between review content and school quality, we apply recent advances in natural language processing to nearly half a million parent reviews posted for more than 50,000 publicly funded U.S. K–12 schools on a popular ratings website. We find: (1) schools in urban areas and those serving affluent families are more likely to receive reviews, (2) review language correlates with standardized test scores (which generally track race and family income) but not school effectiveness, measured by how much students improve in their test scores over time, and (3) the linguistics of reviews reveal several racial and income-based disparities in K–12 education. These findings suggest that parents who reference school reviews may be accessing, and making decisions based on, biased perspectives that reinforce achievement gaps.
Optimal Mixture Models in IR
We explore the use of Optimal Mixture Models to represent topics. We analyze two broad classes of mixture models: set-based and weighted. We provide an original proof that estimation of set-based models is NP-hard, and therefore not feasible. We argue that weighted models are superior to set-based models, and that the solution can be estimated by a simple gradient descent technique. We demonstrate that Optimal Mixture Models can be successfully applied to the task of document retrieval. Our experiments show that weighted mixtures outperform a simple language modeling baseline. We also observe that weighted mixtures are more robust than other approaches to estimating topical models.
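The abstract does not define the weighted model or its objective, so the sketch below rests on stated assumptions: a topic is a convex combination of fixed component word distributions, the weights are fit by maximizing the log-likelihood of observed topic word counts, and a softmax parametrization keeps the weights on the simplex. The function name `fit_mixture_weights` and its parameters are hypothetical.

```python
import math


def fit_mixture_weights(components, counts, steps=500, lr=1.0):
    """Estimate mixture weights by gradient ascent on the log-likelihood.

    components: list of dicts mapping word -> probability (held fixed)
    counts:     dict mapping word -> observed count for the topic
    Assumes every counted word has nonzero probability in some component.
    """
    k = len(components)
    total = sum(counts.values())
    theta = [0.0] * k  # unconstrained parameters; weights = softmax(theta)
    for _ in range(steps):
        z = [math.exp(t) for t in theta]
        w = [v / sum(z) for v in z]
        # g[j] = sum over words of count * p_j(word) / mixture(word)
        g = [0.0] * k
        for word, c in counts.items():
            mix = sum(w[j] * components[j].get(word, 0.0) for j in range(k))
            for j in range(k):
                g[j] += c * components[j].get(word, 0.0) / mix
        # Gradient of the average log-likelihood with respect to theta[j].
        for j in range(k):
            theta[j] += lr * w[j] * (g[j] / total - 1.0)
    z = [math.exp(t) for t in theta]
    return [v / sum(z) for v in z]
```

For example, with components `{"a": 0.9, "b": 0.1}` and `{"a": 0.1, "b": 0.9}` and counts of 7 for "a" and 3 for "b", the likelihood is maximized when the mixture assigns probability 0.7 to "a", which happens at weights 0.75 and 0.25.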
An Iterative Approach to Text Segmentation
We present divSeg, a novel method for text segmentation that iteratively splits a portion of text at its weakest point in terms of the connectivity strength between two adjacent parts. To search for the weakest point, we apply two different measures: one is based on language modeling of text segmentation and the other, on the interconnectivity between two segments. Our solution produces a deep and narrow binary tree, a dynamic object that describes the structure of a text and that is fully adaptable to a user’s segmentation needs. We treat it as a separate task to flatten the tree into a broad and shallow hierarchy either through supervised learning of a document set or explicit input of how a text should be segmented. The rich structure of our created tree further allows us to segment documents at varying levels such as topic, sub-topic, etc. We evaluated our new solution on a set of 265 articles from Discover magazine where the topic structures are unknown and need to be discovered. Our experimental results show that the iterative approach has the potential to generate better segmentation results than several leading baselines, and the separate flattening step allows us to adapt the results to different levels of detail and user preferences.
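The split-at-the-weakest-point recursion can be sketched as follows. This is not divSeg itself: the cosine similarity between bag-of-words vectors stands in for the two connectivity measures the abstract mentions, and `split_tree`, `min_len`, and the tuple-based tree are assumptions made for illustration.

```python
import math
from collections import Counter


def bag(sentences):
    """Bag-of-words vector for a run of sentences."""
    return Counter(w for s in sentences for w in s.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def split_tree(sentences, min_len=1):
    """Recursively split at the boundary with the weakest connectivity.

    Leaves are lists of sentences; internal nodes are (left, right) tuples.
    """
    if len(sentences) <= min_len:
        return list(sentences)
    # Candidate boundaries sit between sentence i-1 and sentence i.
    weakest = min(
        range(1, len(sentences)),
        key=lambda i: cosine(bag(sentences[:i]), bag(sentences[i:])),
    )
    return (
        split_tree(sentences[:weakest], min_len),
        split_tree(sentences[weakest:], min_len),
    )
```

The result is a binary tree, matching the structure the abstract describes; flattening it into a broad and shallow hierarchy is treated there as a separate step and is not attempted in this sketch.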