26,510 research outputs found
Taming Wild High Dimensional Text Data with a Fuzzy Lash
The bag of words (BOW) represents a corpus in a matrix whose elements are the
frequency of words. However, each row in the matrix is a very high-dimensional
sparse vector. Dimension reduction (DR) is a popular method to address sparsity
and high-dimensionality issues. Among different strategies to develop DR
method, Unsupervised Feature Transformation (UFT) is a popular strategy to map
all words on a new basis to represent BOW. The recent increase of text data and
its challenges imply that DR area still needs new perspectives. Although a wide
range of methods based on the UFT strategy has been developed, the fuzzy
approach has not been considered for DR based on this strategy. This research
investigates the application of fuzzy clustering as a DR method based on the
UFT strategy to collapse BOW matrix to provide a lower-dimensional
representation of documents instead of the words in a corpus. The quantitative
evaluation shows that fuzzy clustering produces superior performance and
features to Principal Components Analysis (PCA) and Singular Value
Decomposition (SVD), two popular DR methods based on the UFT strategy
Sparse Coding of Neural Word Embeddings for Multilingual Sequence Labeling
In this paper we propose and carefully evaluate a sequence labeling framework
which solely utilizes sparse indicator features derived from dense distributed
word representations. The proposed model obtains (near) state-of-the art
performance for both part-of-speech tagging and named entity recognition for a
variety of languages. Our model relies only on a few thousand sparse
coding-derived features, without applying any modification of the word
representations employed for the different tasks. The proposed model has
favorable generalization properties as it retains over 89.8% of its average POS
tagging accuracy when trained at 1.2% of the total available training data,
i.e.~150 sentences per language
Economic Complexity Unfolded: Interpretable Model for the Productive Structure of Economies
Economic complexity reflects the amount of knowledge that is embedded in the
productive structure of an economy. It resides on the premise of hidden
capabilities - fundamental endowments underlying the productive structure. In
general, measuring the capabilities behind economic complexity directly is
difficult, and indirect measures have been suggested which exploit the fact
that the presence of the capabilities is expressed in a country's mix of
products. We complement these studies by introducing a probabilistic framework
which leverages Bayesian non-parametric techniques to extract the dominant
features behind the comparative advantage in exported products. Based on
economic evidence and trade data, we place a restricted Indian Buffet Process
on the distribution of countries' capability endowment, appealing to a culinary
metaphor to model the process of capability acquisition. The approach comes
with a unique level of interpretability, as it produces a concise and
economically plausible description of the instantiated capabilities
- …