A Graph-Based Semi-Supervised k Nearest-Neighbor Method for Nonlinear Manifold Distributed Data Classification
k Nearest Neighbors (kNN) is one of the most widely used supervised
learning algorithms to classify Gaussian distributed data, but it does not
achieve good results when it is applied to nonlinear manifold distributed data,
especially when only a very limited number of labeled samples is available. In this
paper, we propose a new graph-based kNN algorithm which can effectively
handle both Gaussian distributed data and nonlinear manifold distributed data.
To achieve this goal, we first propose a constrained Tired Random Walk (TRW) by
constructing an -level nearest-neighbor strengthened tree over the graph,
and then compute a TRW matrix for similarity measurement purposes. After this,
the nearest neighbors are identified according to the TRW matrix and the class
label of a query point is determined by the sum of all the TRW weights of its
nearest neighbors. To deal with online situations, we also propose a new
algorithm to handle sequential samples based on a local neighborhood
reconstruction. Comparison experiments are conducted on both synthetic data
sets and real-world data sets to demonstrate the validity of the proposed new
kNN algorithm and its improvements over other versions of kNN algorithms.
Given the widespread appearance of manifold structures in real-world problems
and the popularity of the traditional kNN algorithm, the proposed manifold
version kNN shows promising potential for classifying manifold-distributed
data.
Comment: 32 pages, 12 figures, 7 tables
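The abstract above describes scoring a query point by summing random-walk weights over its nearest neighbors. A minimal sketch of that general idea follows. It is not the paper's constrained TRW: the nearest-neighbor strengthened tree is omitted, and the closed form (I - alpha P)^-1 for a damped walk, the symmetric kNN graph, and the names `trw_matrix`, `classify`, `k` and `alpha` are all assumptions made for illustration.

```python
import numpy as np

def trw_matrix(X, k=5, alpha=0.9):
    """Similarity matrix from a damped random walk on a symmetric kNN graph.

    Assumption: the walk similarity is sum_{t>=0} (alpha * P)^t, which
    equals inv(I - alpha * P) for 0 < alpha < 1.
    """
    n = X.shape[0]
    # Pairwise Euclidean distances between all points.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest points
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nn.ravel()] = 1.0
    W = np.maximum(W, W.T)                 # make the graph undirected
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    return np.linalg.inv(np.eye(n) - alpha * P)

def classify(T, labels, query, k=5):
    """Label `query` by summing walk weights over its k top-weighted labeled points.

    `labels` holds a class id per point, with -1 marking unlabeled points.
    """
    labeled = np.flatnonzero(labels >= 0)
    labeled = labeled[labeled != query]
    order = np.argsort(T[query, labeled])[::-1][:k]
    scores = {}
    for j in labeled[order]:
        scores[labels[j]] = scores.get(labels[j], 0.0) + T[query, j]
    return max(scores, key=scores.get)
```

Because the weights come from walks along the graph rather than straight-line distance, a query can inherit a label from a point that is far away in Euclidean terms but reachable along the manifold, which is the situation the abstract targets.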
An Investigation into the Pedagogical Features of Documents
Characterizing the content of a technical document in terms of its learning
utility can be useful for applications related to education, such as generating
reading lists from large collections of documents. We refer to this learning
utility as the "pedagogical value" of the document to the learner. While
pedagogical value is an important concept that has been studied extensively
within the education domain, there has been little work exploring it from a
computational, i.e., natural language processing (NLP), perspective. To allow a
computational exploration of this concept, we introduce the notion of
"pedagogical roles" of documents (e.g., Tutorial and Survey) as an intermediary
component for the study of pedagogical value. Given the lack of available
corpora for our exploration, we create the first annotated corpus of
pedagogical roles and use it to test baseline techniques for automatic
prediction of such roles.
Comment: 12th Workshop on Innovative Use of NLP for Building Educational
Applications (BEA) at EMNLP 2017; 12 pages
Discriminating word senses with tourist walks in complex networks
Patterns of topological arrangement are widely used by both animal and human
brains in the learning process. Nevertheless, automatic learning techniques
frequently overlook these patterns. In this paper, we apply a learning
technique based on the structural organization of the data in the attribute
space to the problem of discriminating the senses of 10 polysemous words. Using
two types of meaning characterization, namely semantic and topological
approaches, we observed significant accuracy rates in identifying the
suitable meanings with both techniques. Most importantly, we have found that the
characterization based on the deterministic tourist walk improves the
disambiguation process compared with the discrimination achieved with
traditional complex network measurements such as assortativity and clustering
coefficient. To our knowledge, this is the first time that such a deterministic
walk has been applied to this kind of problem. Therefore, our finding
suggests that the tourist walk characterization may be useful in other related
applications.
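The abstract does not define the deterministic tourist walk. Under the usual definition, the walker repeatedly moves to the nearest point it has not visited within a short memory window, and the trajectory ends in a repeating cycle after an initial transient. The sketch below illustrates that rule on raw points in an attribute space; how the paper turns walks into features is not stated in the abstract, and the names `tourist_walk`, `memory` and `max_steps` are hypothetical.

```python
import numpy as np

def tourist_walk(X, start, memory=1, max_steps=10_000):
    """Return (transient_length, cycle_period) of a deterministic tourist walk.

    At each step the walker moves to the nearest point that is not among the
    last `memory` points on its path (the current point included).
    Requires memory < number of points.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    path = [start]
    first_seen = {}
    for step in range(max_steps):
        # The walk is deterministic given the memory window, so a repeated
        # window means the walker has entered its cycle.
        state = tuple(path[-memory:])
        if state in first_seen:
            transient = first_seen[state]
            return transient, step - transient
        first_seen[state] = step
        dist = d[path[-1]].copy()
        dist[list(state)] = np.inf         # forbid recently visited points
        path.append(int(np.argmin(dist)))
    raise RuntimeError("no cycle found within max_steps")
```

With `memory=1` the walker always ends up bouncing between a pair of mutually nearest points, so the period is 2; larger memory values force longer cycles and tend to expose more of the data's structure.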
A complex network approach to stylometry
Statistical methods have been widely employed to study the fundamental
properties of language. In recent years, methods from complex and dynamical
systems have proved useful for creating several language models. Despite the large
number of studies devoted to representing texts with physical models, only a
limited number of studies have shown how the properties of the underlying
physical systems can be employed to improve the performance of natural language
processing tasks. In this paper, I address this problem by devising complex
network methods that are able to improve the performance of current
statistical methods. Using a fuzzy classification strategy, I show that the
topological properties extracted from texts complement the traditional textual
description. In several cases, hybrid approaches outperformed both the
traditional and the network-based methods used alone.
Because the proposed model is generic, the framework devised here
could be straightforwardly used to study similar textual applications where the
topology plays a pivotal role in the description of the interacting agents.
Comment: PLoS ONE, 2015 (to appear)