8,236 research outputs found
Automatic Construction of Clean Broad-Coverage Translation Lexicons
Word-level translational equivalences can be extracted from parallel texts by
surprisingly simple statistical techniques. However, these techniques are
easily fooled by {\em indirect associations} --- pairs of unrelated words whose
statistical properties resemble those of mutual translations. Indirect
associations pollute the resulting translation lexicons, drastically reducing
their precision. This paper presents an iterative lexicon cleaning method. On
each iteration, most of the remaining incorrect lexicon entries are filtered
out, without significant degradation in recall. This lexicon cleaning technique
can produce translation lexicons with recall and precision both exceeding 90\%,
as well as dictionary-sized translation lexicons that are over 99\% correct.Comment: PostScript file, 10 pages. To appear in Proceedings of AMTA-9
Achieving Secure and Efficient Cloud Search Services: Cross-Lingual Multi-Keyword Rank Search over Encrypted Cloud Data
Multi-user multi-keyword ranked search scheme in arbitrary language is a
novel multi-keyword rank searchable encryption (MRSE) framework based on
Paillier Cryptosystem with Threshold Decryption (PCTD). Compared to previous
MRSE schemes constructed based on the k-nearest neighbor searcha-ble encryption
(KNN-SE) algorithm, it can mitigate some draw-backs and achieve better
performance in terms of functionality and efficiency. Additionally, it does not
require a predefined keyword set and support keywords in arbitrary languages.
However, due to the pattern of exact matching of keywords in the new MRSE
scheme, multilingual search is limited to each language and cannot be searched
across languages. In this pa-per, we propose a cross-lingual multi-keyword rank
search (CLRSE) scheme which eliminates the barrier of languages and achieves
semantic extension with using the Open Multilingual Wordnet. Our CLRSE scheme
also realizes intelligent and per-sonalized search through flexible keyword and
language prefer-ence settings. We evaluate the performance of our scheme in
terms of security, functionality, precision and efficiency, via extensive
experiments
Concept Extraction and Clustering for Topic Digital Library Construction
This paper is to introduce a new approach to build
topic digital library using concept extraction and
document clustering. Firstly, documents in a special
domain are automatically produced by document
classification approach. Then, the keywords of each
document are extracted using the machine learning
approach. The keywords are used to cluster the
documents subset. The clustered result is the taxonomy
of the subset. Lastly, the taxonomy is modified to the
hierarchical structure for user navigation by manual
adjustments. The topic digital library is constructed
after combining the full-text retrieval and hierarchical
navigation function
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.Comment: 37 page
Fighting with the Sparsity of Synonymy Dictionaries
Graph-based synset induction methods, such as MaxMax and Watset, induce
synsets by performing a global clustering of a synonymy graph. However, such
methods are sensitive to the structure of the input synonymy graph: sparseness
of the input dictionary can substantially reduce the quality of the extracted
synsets. In this paper, we propose two different approaches designed to
alleviate the incompleteness of the input dictionaries. The first one performs
a pre-processing of the graph by adding missing edges, while the second one
performs a post-processing by merging similar synset clusters. We evaluate
these approaches on two datasets for the Russian language and discuss their
impact on the performance of synset induction methods. Finally, we perform an
extensive error analysis of each approach and discuss prominent alternative
methods for coping with the problem of the sparsity of the synonymy
dictionaries.Comment: In Proceedings of the 6th Conference on Analysis of Images, Social
Networks, and Texts (AIST'2017): Springer Lecture Notes in Computer Science
(LNCS
- …