5,780 research outputs found
A path following algorithm for the graph matching problem
We propose a convex-concave programming approach for the labeled weighted
graph matching problem. The convex-concave programming formulation is obtained
by rewriting the weighted graph matching problem as a least-square problem on
the set of permutation matrices and relaxing it to two different optimization
problems: a quadratic convex and a quadratic concave optimization problem on
the set of doubly stochastic matrices. The concave relaxation has the same
global minimum as the initial graph matching problem, but the search for its
global minimum is also a hard combinatorial problem. We therefore construct an
approximation of the concave problem solution by following a solution path of a
convex-concave problem obtained by linear interpolation of the convex and
concave formulations, starting from the convex relaxation. This method allows
to easily integrate the information on graph label similarities into the
optimization problem, and therefore to perform labeled weighted graph matching.
The algorithm is compared with some of the best performing graph matching
methods on four datasets: simulated graphs, QAPLib, retina vessel images and
handwritten chinese characters. In all cases, the results are competitive with
the state-of-the-art.Comment: 23 pages, 13 figures,typo correction, new results in sections 4,5,
Connecting Dream Networks Across Cultures
Many species dream, yet there remain many open research questions in the
study of dreams. The symbolism of dreams and their interpretation is present in
cultures throughout history. Analysis of online data sources for dream
interpretation using network science leads to understanding symbolism in dreams
and their associated meaning. In this study, we introduce dream interpretation
networks for English, Chinese and Arabic that represent different cultures from
various parts of the world. We analyze communities in these networks, finding
that symbols within a community are semantically related. The central nodes in
communities give insight about cultures and symbols in dreams. The community
structure of different networks highlights cultural similarities and
differences. Interconnections between different networks are also identified by
translating symbols from different languages into English. Structural
correlations across networks point out relationships between cultures.
Similarities between network communities are also investigated by analysis of
sentiment in symbol interpretations. We find that interpretations within a
community tend to have similar sentiment. Furthermore, we cluster communities
based on their sentiment, yielding three main categories of positive, negative,
and neutral dream symbols.Comment: 6 pages, 3 figure
Automatic Construction of Chinese-English Translation Lexicons
The process of constructing translation lexicons from parallel texts (bitexts) can be broken down into three stages: mapping bitext correspondence, counting co-occurrences, and estimating a translation model. State-of-the-art techniques for accomplishing each stage of the process had already been developed, but only for bitexts involving fairly similar languages. Correct and efficient implementation of each stage poses special challenges when the parallel texts involve two very different languages. This report describes our theoretical and empirical investigations into how existing techniques might be extended and applied to Chinese/English bitexts
Measuring the accuracy of page-reading systems
Given a bitmapped image of a page from any document, a page-reading system identifies the characters on the page and stores them in a text file. This OCR-generated text is represented by a string and compared with the correct string to determine the accuracy of this process. The string editing problem is applied to find an optimal correspondence of these strings using an appropriate cost function. The ISRI annual test of page-reading systems utilizes the following performance measures, which are defined in terms of this correspondence and the string edit distance: character accuracy, throughput, accuracy by character class, marked character efficiency, word accuracy, non-stopword accuracy, and phrase accuracy. It is shown that the universe of cost functions is divided into equivalence classes, and the cost functions related to the longest common subsequence (LCS) are identified. The computation of a LCS can be made faster by a linear-time preprocessing step
Multilingual Schema Matching for Wikipedia Infoboxes
Recent research has taken advantage of Wikipedia's multilingualism as a
resource for cross-language information retrieval and machine translation, as
well as proposed techniques for enriching its cross-language structure. The
availability of documents in multiple languages also opens up new opportunities
for querying structured Wikipedia content, and in particular, to enable answers
that straddle different languages. As a step towards supporting such queries,
in this paper, we propose a method for identifying mappings between attributes
from infoboxes that come from pages in different languages. Our approach finds
mappings in a completely automated fashion. Because it does not require
training data, it is scalable: not only can it be used to find mappings between
many language pairs, but it is also effective for languages that are
under-represented and lack sufficient training samples. Another important
benefit of our approach is that it does not depend on syntactic similarity
between attribute names, and thus, it can be applied to language pairs that
have distinct morphologies. We have performed an extensive experimental
evaluation using a corpus consisting of pages in Portuguese, Vietnamese, and
English. The results show that not only does our approach obtain high precision
and recall, but it also outperforms state-of-the-art techniques. We also
present a case study which demonstrates that the multilingual mappings we
derive lead to substantial improvements in answer quality and coverage for
structured queries over Wikipedia content.Comment: VLDB201
Automatic Palaeographic Exploration of Genizah Manuscripts
The Cairo Genizah is a collection of hand-written documents containing approximately
350,000 fragments of mainly Jewish texts discovered in the late 19th
century. The
fragments are today spread out in some 75 libraries and private collections worldwide,
but there is an ongoing effort to document and catalogue all extant fragments.
Palaeographic information plays a key role in the study of the Genizah collection.
Script style, and–more specifically–handwriting, can be used to identify fragments that
might originate from the same original work. Such matched fragments, commonly
referred to as “joins”, are currently identified manually by experts, and presumably only
a small fraction of existing joins have been discovered to date. In this work, we show
that automatic handwriting matching functions, obtained from non-specific features
using a corpus of writing samples, can perform this task quite reliably. In addition, we
explore the problem of grouping various Genizah documents by script style, without
being provided any prior information about the relevant styles. The automatically
obtained grouping agrees, for the most part, with the palaeographic taxonomy. In cases
where the method fails, it is due to apparent similarities between related scripts
A Sketch-Based Educational System for Learning Chinese Handwriting
Learning Chinese as a Second Language (CSL) is a difficult task for students in English-speaking countries due to the large symbol set and complicated writing techniques. Traditional classroom methods of teaching Chinese handwriting have major drawbacks due to human experts’ bias and the lack of assessment on writing techniques. In this work, we propose a sketch-based educational system to help CSL students learn Chinese handwriting faster and better in a novel way. Our system allows students to draw freehand symbols to answer questions, and uses sketch recognition and AI techniques to recognize, assess, and provide feedback in real time. Results have shown that the system reaches a recognition accuracy of 86% on novice learners’ inputs, higher than 95% detection rate for mistakes in writing techniques, and 80.3% F-measure on the classification between expert and novice handwriting inputs
- …