31,059 research outputs found
Entity Linking for Queries by Searching Wikipedia Sentences
We present a simple yet effective approach for linking entities in queries.
The key idea is to search sentences similar to a query from Wikipedia articles
and directly use the human-annotated entities in the similar sentences as
candidate entities for the query. Then, we employ a rich set of features, such
as link-probability, context-matching, word embeddings, and relatedness among
candidate entities as well as their related entities, to rank the candidates
under a regression based framework. The advantages of our approach lie in two
aspects, which contribute to the ranking process and final linking result.
First, it can greatly reduce the number of candidate entities by filtering out
irrelevant entities with the words in the query. Second, we can obtain the
query sensitive prior probability in addition to the static link-probability
derived from all Wikipedia articles. We conduct experiments on two benchmark
datasets on entity linking for queries, namely the ERD14 dataset and the GERDAQ
dataset. Experimental results show that our method outperforms state-of-the-art
systems and yields 75.0% in F1 on the ERD14 dataset and 56.9% on the GERDAQ
dataset
Building a wordnet for Turkish
This paper summarizes the development process of a wordnet for Turkish as part of the Balkanet project. After discussing the basic method-ological issues that had to be resolved during the course of the project, the paper presents the basic steps of the construction process in chronological order. Two applications using Turkish wordnet are summarized and links to resources for wordnet builders are provided at the end of the paper
Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context
Mathematical formulae represent complex semantic information in a concise
form. Especially in Science, Technology, Engineering, and Mathematics,
mathematical formulae are crucial to communicate information, e.g., in
scientific papers, and to perform computations using computer algebra systems.
Enabling computers to access the information encoded in mathematical formulae
requires machine-readable formats that can represent both the presentation and
content, i.e., the semantics, of formulae. Exchanging such information between
systems additionally requires conversion methods for mathematical
representation formats. We analyze how the semantic enrichment of formulae
improves the format conversion process and show that considering the textual
context of formulae reduces the error rate of such conversions. Our main
contributions are: (1) providing an openly available benchmark dataset for the
mathematical format conversion task consisting of a newly created test
collection, an extensive, manually curated gold standard and task-specific
evaluation metrics; (2) performing a quantitative evaluation of
state-of-the-art tools for mathematical format conversions; (3) presenting a
new approach that considers the textual context of formulae to reduce the error
rate for mathematical format conversions. Our benchmark dataset facilitates
future research on mathematical format conversions as well as research on many
problems in mathematical information retrieval. Because we annotated and linked
all components of formulae, e.g., identifiers, operators and other entities, to
Wikidata entries, the gold standard can, for instance, be used to train methods
for formula concept discovery and recognition. Such methods can then be applied
to improve mathematical information retrieval systems, e.g., for semantic
formula search, recommendation of mathematical content, or detection of
mathematical plagiarism.Comment: 10 pages, 4 figure
Domain discovery method for topological profile searches in protein structures
We describe a method for automated domain discovery for topological profile searches in protein
structures. The method is used in a system TOPStructure for fast prediction of CATH classification
for protein structures (given as PDB files). It is important for profile searches in multi-domain
proteins, for which the profile method by itself tends to perform poorly. We also present an
O(C(n)k +nk2) time algorithm for this problem, compared to the O(C(n)k +(nk)2) time used by
a trivial algorithm (where n is the length of the structure, k is the number of profiles and C(n) is the
time needed to check for a presence of a given motif in a structure of length n). This method has
been developed and is currently used for TOPS representations of protein structures and prediction
of CATH classification, but may be applied to other graph-based representations of protein or RNA
structures and/or other prediction problems. A protein structure prediction system incorporating
the domain discovery method is available at http://bioinf.mii.lu.lv/tops/
The C++0x "Concepts" Effort
C++0x is the working title for the revision of the ISO standard of the C++
programming language that was originally planned for release in 2009 but that
was delayed to 2011. The largest language extension in C++0x was "concepts",
that is, a collection of features for constraining template parameters. In
September of 2008, the C++ standards committee voted the concepts extension
into C++0x, but then in July of 2009, the committee voted the concepts
extension back out of C++0x.
This article is my account of the technical challenges and debates within the
"concepts" effort in the years 2003 to 2009. To provide some background, the
article also describes the design space for constrained parametric
polymorphism, or what is colloquially know as constrained generics. While this
article is meant to be generally accessible, the writing is aimed toward
readers with background in functional programming and programming language
theory. This article grew out of a lecture at the Spring School on Generic and
Indexed Programming at the University of Oxford, March 2010
- …