13,777 research outputs found
Graph-Sparse LDA: A Topic Model with Structured Sparsity
Originally designed to model text, topic modeling has become a powerful tool
for uncovering latent structure in domains including medicine, finance, and
vision. The goals for the model vary depending on the application: in some
cases, the discovered topics may be used for prediction or some other
downstream task. In other cases, the content of the topic itself may be of
intrinsic scientific interest.
Unfortunately, even using modern sparse techniques, the discovered topics are
often difficult to interpret due to the high dimensionality of the underlying
space. To improve topic interpretability, we introduce Graph-Sparse LDA, a
hierarchical topic model that leverages knowledge of relationships between
words (e.g., as encoded by an ontology). In our model, topics are summarized by
a few latent concept-words from the underlying graph that explain the observed
words. Graph-Sparse LDA recovers sparse, interpretable summaries on two
real-world biomedical datasets while matching state-of-the-art prediction
performance
Improve and Implement an Open Source Question Answering System
A question answer system takes queries from the user in natural language and returns a short concise answer which best fits the response to the question. This report discusses the integration and implementation of question answer systems for English and Hindi as part of the open source search engine Yioop. We have implemented a question answer system for English and Hindi, keeping in mind users who use these languages as their primary language. The user should be able to query a set of documents and should get the answers in the same language. English and Hindi are very different when it comes to language structure, characters etc. We have implemented the Question Answer System so that it supports localization and improved Part of Speech tagging performance by storing the lexicon in the database instead of a file based lexicon. We have implemented a brill tagger variant for Part of Speech tagging of Hindi phrases and grammar rules for triplet extraction. We also improve Yioop’s lexical data handling support by allowing the user to add named entities. Our improvements to Yioop were then evaluated by comparing the retrieved answers against a dataset of answers known to be true. The test data for the question answering system included creating 2 indexes, 1 each for English and Hindi. These were created by configuring Yioop to crawl 200,000 wikipedia pages for each crawl. The crawls were configured to be domain specific so that English index consists of pages restricted to English text and Hindi index is restricted to pages with Hindi text. We then used a set of 50 questions on the English and Hindi systems. We recored, Hindi system to have an accuracy of about 55% for simple factoid questions and English question answer system to have an accuracy of 63%
Joint Learning of Answer Selection and Answer Summary Generation in Community Question Answering
Community question answering (CQA) gains increasing popularity in both
academy and industry recently. However, the redundancy and lengthiness issues
of crowdsourced answers limit the performance of answer selection and lead to
reading difficulties and misunderstandings for community users. To solve these
problems, we tackle the tasks of answer selection and answer summary generation
in CQA with a novel joint learning model. Specifically, we design a
question-driven pointer-generator network, which exploits the correlation
information between question-answer pairs to aid in attending the essential
information when generating answer summaries. Meanwhile, we leverage the answer
summaries to alleviate noise in original lengthy answers when ranking the
relevancy degrees of question-answer pairs. In addition, we construct a new
large-scale CQA corpus, WikiHowQA, which contains long answers for answer
selection as well as reference summaries for answer summarization. The
experimental results show that the joint learning method can effectively
address the answer redundancy issue in CQA and achieves state-of-the-art
results on both answer selection and text summarization tasks. Furthermore, the
proposed model is shown to be of great transferring ability and applicability
for resource-poor CQA tasks, which lack of reference answer summaries.Comment: Accepted by AAAI 2020 (oral
A Survey of Symbolic Execution Techniques
Many security and software testing applications require checking whether
certain properties of a program hold for any possible usage scenario. For
instance, a tool for identifying software vulnerabilities may need to rule out
the existence of any backdoor to bypass a program's authentication. One
approach would be to test the program using different, possibly random inputs.
As the backdoor may only be hit for very specific program workloads, automated
exploration of the space of possible inputs is of the essence. Symbolic
execution provides an elegant solution to the problem, by systematically
exploring many possible execution paths at the same time without necessarily
requiring concrete inputs. Rather than taking on fully specified input values,
the technique abstractly represents them as symbols, resorting to constraint
solvers to construct actual instances that would cause property violations.
Symbolic execution has been incubated in dozens of tools developed over the
last four decades, leading to major practical breakthroughs in a number of
prominent software reliability applications. The goal of this survey is to
provide an overview of the main ideas, challenges, and solutions developed in
the area, distilling them for a broad audience.
The present survey has been accepted for publication at ACM Computing
Surveys. If you are considering citing this survey, we would appreciate if you
could use the following BibTeX entry: http://goo.gl/Hf5FvcComment: This is the authors pre-print copy. If you are considering citing
this survey, we would appreciate if you could use the following BibTeX entry:
http://goo.gl/Hf5Fv
Automatic summarising: factors and directions
This position paper suggests that progress with automatic summarising demands
a better research methodology and a carefully focussed research strategy. In
order to develop effective procedures it is necessary to identify and respond
to the context factors, i.e. input, purpose, and output factors, that bear on
summarising and its evaluation. The paper analyses and illustrates these
factors and their implications for evaluation. It then argues that this
analysis, together with the state of the art and the intrinsic difficulty of
summarising, imply a nearer-term strategy concentrating on shallow, but not
surface, text analysis and on indicative summarising. This is illustrated with
current work, from which a potentially productive research programme can be
developed
Accounting for Calibration Uncertainties in X-ray Analysis: Effective Areas in Spectral Fitting
While considerable advance has been made to account for statistical
uncertainties in astronomical analyses, systematic instrumental uncertainties
have been generally ignored. This can be crucial to a proper interpretation of
analysis results because instrumental calibration uncertainty is a form of
systematic uncertainty. Ignoring it can underestimate error bars and introduce
bias into the fitted values of model parameters. Accounting for such
uncertainties currently requires extensive case-specific simulations if using
existing analysis packages. Here we present general statistical methods that
incorporate calibration uncertainties into spectral analysis of high-energy
data. We first present a method based on multiple imputation that can be
applied with any fitting method, but is necessarily approximate. We then
describe a more exact Bayesian approach that works in conjunction with a Markov
chain Monte Carlo based fitting. We explore methods for improving computational
efficiency, and in particular detail a method of summarizing calibration
uncertainties with a principal component analysis of samples of plausible
calibration files. This method is implemented using recently codified Chandra
effective area uncertainties for low-resolution spectral analysis and is
verified using both simulated and actual Chandra data. Our procedure for
incorporating effective area uncertainty is easily generalized to other types
of calibration uncertainties.Comment: 61 pages double spaced, 8 figures, accepted for publication in Ap
On Role Logic
We present role logic, a notation for describing properties of relational
structures in shape analysis, databases, and knowledge bases. We construct role
logic using the ideas of de Bruijn's notation for lambda calculus, an encoding
of first-order logic in lambda calculus, and a simple rule for implicit
arguments of unary and binary predicates. The unrestricted version of role
logic has the expressive power of first-order logic with transitive closure.
Using a syntactic restriction on role logic formulas, we identify a natural
fragment RL^2 of role logic. We show that the RL^2 fragment has the same
expressive power as two-variable logic with counting C^2 and is therefore
decidable. We present a translation of an imperative language into the
decidable fragment RL^2, which allows compositional verification of programs
that manipulate relational structures. In addition, we show how RL^2 encodes
boolean shape analysis constraints and an expressive description logic.Comment: 20 pages. Our later SAS 2004 result builds on this wor
- …