Info Navigator: A visualization tool for document searching and browsing
In this paper we investigate the retrieval performance of monophonic and polyphonic queries made on a polyphonic music database. We extend the n-gram approach for full-music indexing of monophonic music data to polyphonic music using both rhythm and pitch information. We define an experimental framework for a comparative and fault-tolerance study of various n-gramming strategies and encoding levels. For monophonic queries, we focus in particular on query-by-humming systems, and for polyphonic queries on query-by-example. Error models addressed in several studies are surveyed for the fault-tolerance study. Our experiments show that different n-gramming strategies and encoding precision differ widely in their effectiveness. We present the results of our study on a collection of 6366 polyphonic MIDI-encoded music pieces
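As a rough illustration of n-gram indexing for melodies, the sketch below builds overlapping n-grams from the pitch intervals of a monophonic note sequence. It is a minimal example under stated assumptions, not the authors' method: it uses pitch only (no rhythm), it does not handle polyphony, and the function name `interval_ngrams` and the sample melody are hypothetical.

```python
def interval_ngrams(pitches, n):
    """Return overlapping n-grams of pitch intervals (in semitones).

    `pitches` is a list of MIDI note numbers for a monophonic line.
    Using intervals rather than absolute pitches makes the n-grams
    invariant to transposition.
    """
    # Interval between each pair of consecutive notes.
    intervals = [b - a for a, b in zip(pitches, pitches[1:])]
    # Slide a window of length n over the interval sequence.
    return [tuple(intervals[i:i + n]) for i in range(len(intervals) - n + 1)]


# Hypothetical example: a short melody and the same melody transposed up a fourth.
melody = [60, 62, 64, 62, 60]
transposed = [p + 5 for p in melody]

melody_grams = interval_ngrams(melody, 2)
transposed_grams = interval_ngrams(transposed, 2)
```

Here `melody_grams` and `transposed_grams` are identical, which is why a hummed query in a different key can still match indexed n-grams.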
Relevance thresholds in system evaluations
We introduce and explore the concept of an individual's relevance threshold as a way of reconciling differences in outcomes between batch and user experiments
Categorical Dimensions of Human Odor Descriptor Space Revealed by Non-Negative Matrix Factorization
In contrast to most other sensory modalities, the basic perceptual dimensions of olfaction remain unclear. Here, we use non-negative matrix factorization (NMF) – a dimensionality reduction technique – to uncover structure in a panel of odor profiles, with each odor defined as a point in multi-dimensional descriptor space. The properties of NMF are favorable for the analysis of such lexical and perceptual data, and lead to a high-dimensional account of odor space. We further provide evidence that odor dimensions apply categorically. That is, odor space is not occupied homogeneously, but rather in a discrete and intrinsically clustered manner. We discuss the potential implications of these results for the neural coding of odors, as well as for developing classifiers on larger datasets that may be useful for predicting perceptual qualities from chemical structures
ArborZ: Photometric Redshifts Using Boosted Decision Trees
Precision photometric redshifts will be essential for extracting cosmological
parameters from the next generation of wide-area imaging surveys. In this paper
we introduce a photometric redshift algorithm, ArborZ, based on the
machine-learning technique of Boosted Decision Trees. We study the algorithm
using galaxies from the Sloan Digital Sky Survey and from mock catalogs
intended to simulate both the SDSS and the upcoming Dark Energy Survey. We show
that it improves upon the performance of existing algorithms. Moreover, the
method naturally leads to the reconstruction of a full probability density
function (PDF) for the photometric redshift of each galaxy, not merely a single
"best estimate" and error, and also provides a photo-z quality figure-of-merit
for each galaxy that can be used to reject outliers. We show that the stacked
PDFs yield a more accurate reconstruction of the redshift distribution N(z). We
discuss limitations of the current algorithm and ideas for future work.
Comment: 10 pages, 13 figures, submitted to Ap
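The idea of stacking per-galaxy PDFs to estimate N(z) can be sketched as follows. This is only an illustration, assuming every galaxy's PDF is tabulated on the same redshift bins; the function name `stack_pdfs` and the toy values are hypothetical and do not come from ArborZ.

```python
def stack_pdfs(pdfs):
    """Average per-galaxy redshift PDFs tabulated on a shared grid of bins.

    Each PDF is normalized to sum to 1 before stacking, so every galaxy
    contributes equally to the estimated redshift distribution.
    """
    n_bins = len(pdfs[0])
    total = [0.0] * n_bins
    for pdf in pdfs:
        norm = sum(pdf)
        for i, p in enumerate(pdf):
            total[i] += p / norm
    # Divide by the number of galaxies so the stacked result sums to 1.
    return [t / len(pdfs) for t in total]


# Toy example: two galaxies, three redshift bins (values are made up).
toy_pdfs = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
]
n_of_z = stack_pdfs(toy_pdfs)
```

A single "best estimate" per galaxy would place each galaxy in exactly one bin; stacking instead spreads each galaxy across the bins according to its PDF.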
PageRank without hyperlinks: Reranking with PubMed related article networks for biomedical text retrieval
Graph analysis algorithms such as PageRank and HITS have been successful in Web environments because they are able to extract important inter-document relationships from manually-created hyperlinks. We consider the application of these algorithms to related document networks composed of automatically-generated content-similarity links. Specifically, this work tackles the problem of document retrieval in the biomedical domain, in the context of the PubMed search engine. A series of reranking experiments demonstrates that incorporating evidence extracted from link structure yields significant improvements in terms of standard ranked retrieval metrics. These results extend the applicability of link analysis algorithms to different environments
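To make the link-analysis step concrete, here is a minimal PageRank power iteration over a small "related article" graph. It is a sketch under assumptions: the graph, the document names, and the damping value of 0.85 are illustrative, and the abstract does not say how the PageRank scores are combined with the original retrieval scores, so that step is left out.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration.

    `links` maps each document to the list of documents it links to.
    Every document must appear as a key, even if its list is empty.
    """
    nodes = sorted(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for src in nodes:
            targets = links[src]
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new_rank[dst] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for dst in nodes:
                    new_rank[dst] += damping * rank[src] / n
        rank = new_rank
    return rank


# Hypothetical related-article links among four retrieved documents.
related = {
    "doc_a": ["doc_b", "doc_c"],
    "doc_b": ["doc_c"],
    "doc_c": ["doc_a"],
    "doc_d": ["doc_c"],
}
scores = pagerank(related)
reranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph `doc_c` receives the most incoming similarity links and ends up ranked first, while `doc_d`, which nothing links to, ends up last.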
A Novel ILP Framework for Summarizing Content with High Lexical Variety
Summarizing content contributed by individuals can be challenging, because
people make different lexical choices even when describing the same events.
However, there remains a significant need to summarize such content. Examples
include the student responses to post-class reflective questions, product
reviews, and news articles published by different news agencies related to the
same events. High lexical diversity of these documents hinders the system's
ability to effectively identify salient content and reduce summary redundancy.
In this paper, we overcome this issue by introducing an integer linear
programming-based summarization framework. It incorporates a low-rank
approximation to the sentence-word co-occurrence matrix to intrinsically group
semantically-similar lexical items. We conduct extensive experiments on
datasets of student responses, product reviews, and news documents. Our
approach compares favorably to a number of extractive baselines as well as a
neural abstractive summarization system. The paper finally sheds light on when
and why the proposed framework is effective at summarizing content with high
lexical variety.
Comment: Accepted for publication in the journal of Natural Language
Engineering, 201