Document Clustering based on Topic Maps
The importance of document clustering is now widely acknowledged by researchers
for better management, smart navigation, efficient filtering, and concise
summarization of large collections of documents such as the World Wide Web
(WWW). The next challenge lies in performing clustering based on the semantic
contents of the documents. The problem of document clustering has two main
components: (1) representing the document in a form that inherently captures
the semantics of the text, which may also help to reduce the dimensionality of
the document; and (2) defining a similarity measure, based on the semantic
representation, that assigns higher numerical values to document pairs with a
stronger semantic relationship. The feature space of documents can be very
challenging for document clustering: a document may contain multiple topics, a
large set of class-independent general words, and only a handful of
class-specific core words. With these characteristics in mind, traditional
agglomerative clustering algorithms, which are based on either the Document
Vector Model (DVM) or the Suffix Tree Clustering (STC) model, are less
effective at producing results of high cluster quality. This paper introduces a
new approach to document clustering based on a Topic Map representation of the
documents. Each document is transformed into a compact form, and a similarity
measure is proposed based on the information inferred from the topic map's data
and structures. The suggested method is implemented using agglomerative
hierarchical clustering and tested on standard information retrieval (IR)
datasets. The comparative experiments reveal that the proposed approach is
effective in improving cluster quality.
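The average-link agglomerative procedure the abstract refers to can be sketched in a few lines. Everything below is illustrative: bag-of-words `Counter` vectors and cosine similarity stand in for the paper's topic-map representation and its inferred similarity measure, neither of which is specified in the abstract.

```python
from collections import Counter
import math

def cosine_sim(a, b):
    # Cosine similarity over bag-of-words Counters.
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def agglomerative(docs, sim, threshold):
    # Average-link agglomerative clustering: repeatedly merge the most
    # similar pair of clusters until no pair exceeds `threshold`.
    clusters = [[i] for i in range(len(docs))]
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = sum(sim(docs[a], docs[b])
                        for a in clusters[i] for b in clusters[j])
                s /= len(clusters[i]) * len(clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)  # j > i, so index i stays valid
    return clusters

# Toy corpus: two semantically related documents and one unrelated one.
docs = [Counter("topic map semantic".split()),
        Counter("semantic topic representation".split()),
        Counter("protein structure search".split())]
print(agglomerative(docs, cosine_sim, threshold=0.3))  # → [[0, 1], [2]]
```

Any similarity function with the same signature, such as one derived from topic-map structure, could be dropped in for `cosine_sim` without changing the clustering loop.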
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.
Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
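The core covering idea, searching coarsely over cluster centers and then finely only inside clusters that the triangle inequality cannot rule out, can be sketched generically. This is not the Ammolite, MICA, or esFragBag implementation; the greedy cover construction and the Euclidean metric are assumptions made for illustration.

```python
import math

def dist(a, b):
    # Euclidean distance; any metric satisfying the triangle inequality works.
    return math.dist(a, b)

def build_cover(points, r):
    # Greedy covering: assign each point to the first center within r,
    # or make it a new center. The number of centers reflects the
    # metric entropy (covering number) of the dataset at scale r.
    cover = []  # list of (center, member indices)
    for i, p in enumerate(points):
        for center, members in cover:
            if dist(p, center) <= r:
                members.append(i)
                break
        else:
            cover.append((p, [i]))
    return cover

def range_search(points, cover, q, eps, r):
    # Triangle inequality: a cluster can contain a point within eps of q
    # only if its center is within eps + r of q, so most clusters are
    # skipped without examining their members.
    hits = []
    for center, members in cover:
        if dist(q, center) <= eps + r:
            hits.extend(i for i in members if dist(q, points[i]) <= eps)
    return sorted(hits)

points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0)]
cover = build_cover(points, r=1.0)
print(range_search(points, cover, q=(0.0, 0.0), eps=0.6, r=1.0))  # → [0, 1]
```

When the fractal dimension is low, the number of clusters the coarse pass admits grows slowly with `eps`, which is the source of the speedups claimed above.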
Reconstructing Native Language Typology from Foreign Language Usage
Linguists and psychologists have long been studying cross-linguistic
transfer, the influence of native language properties on linguistic performance
in a foreign language. In this work we provide empirical evidence for this
process in the form of a strong correlation between language similarities
derived from structural features in English as a Second Language (ESL) texts and
equivalent similarities obtained from the typological features of the native
languages. We leverage this finding to recover native language typological
similarity structure directly from ESL text, and perform prediction of
typological features in an unsupervised fashion with respect to the target
languages. Our method achieves 72.2% accuracy on the typology prediction task,
a result that is highly competitive with equivalent methods that rely on
typological resources.
Comment: CoNLL 201
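The correlation at the heart of this result can be sketched by comparing pairwise language similarities computed in two different feature spaces. The feature vectors below are made up for illustration, and Pearson correlation over the upper triangles is a simplified stand-in for the paper's analysis.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u))
           * math.sqrt(sum(b * b for b in v)))
    return num / den if den else 0.0

def upper_triangle_sims(vectors):
    # One similarity per unordered language pair (upper triangle).
    n = len(vectors)
    return [cosine(vectors[i], vectors[j])
            for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical vectors for three native languages: structural features
# extracted from ESL text vs. typological features of those languages.
esl = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [0.0, 1.0, 0.8]]
typ = [[1.0, 0.1], [0.8, 0.2], [0.1, 0.9]]
r = pearson(upper_triangle_sims(esl), upper_triangle_sims(typ))
print(round(r, 3))
```

A high correlation between the two similarity lists is what licenses recovering typological structure from ESL text alone.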
Building Morphological Chains for Agglutinative Languages
In this paper, we build morphological chains for agglutinative languages by
using a log-linear model for the morphological segmentation task. The model is
based on the unsupervised morphological segmentation system called
MorphoChains. We extend the MorphoChains log-linear model by expanding the
candidate space recursively to cover more split points for agglutinative
languages such as Turkish, whereas in the original model candidates are
generated by considering only binary segmentation of each word. The results
show that we improve the state-of-the-art Turkish scores by 12%, achieving an
F-measure of 72%, and improve the English scores by 3%, achieving an F-measure
of 74%.
Eventually, the system outperforms both MorphoChains and other well-known
unsupervised morphological segmentation systems. The results indicate that
candidate generation plays an important role in such an unsupervised log-linear
model that is learned using contrastive estimation with negative samples.
Comment: 10 pages, accepted and presented at the CICLing 2017 (18th
International Conference on Intelligent Text Processing and Computational
Linguistics)
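The difference between binary and recursively expanded candidate spaces can be sketched as follows. This is only an illustration of candidate generation, not the authors' scoring model; `min_len` and the Turkish example "evlerde" ('in the houses') are assumptions chosen for clarity.

```python
def binary_candidates(word):
    # Baseline: one candidate per split point, two segments each.
    return [(word[:i], word[i:]) for i in range(1, len(word))]

def recursive_candidates(word, min_len=2):
    # Extension for agglutinative languages: split the suffix recursively,
    # so one word can yield a chain of several morpheme segments.
    results = [[word]]  # the unsegmented word is always a candidate
    for i in range(min_len, len(word) - min_len + 1):
        stem, suffix = word[:i], word[i:]
        for rest in recursive_candidates(suffix, min_len):
            results.append([stem] + rest)
    return results

print(binary_candidates("evler"))
# Turkish "evlerde" = ev + ler + de (house + PLURAL + LOCATIVE):
print(["ev", "ler", "de"] in recursive_candidates("evlerde"))  # → True
```

The binary scheme can never propose the three-segment analysis directly, which is why enlarging the candidate space matters for heavily suffixing languages such as Turkish.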