Similarity Cluster of Indonesian Ethnic Languages
Lexicostatistic and language similarity clusters are useful for computational-linguistics research that depends on language similarity or cognate recognition. Nevertheless, no published lexicostatistic or language-similarity clusters of Indonesian ethnic languages are available. We formulate an approach to creating language similarity clusters: we use the ASJP database to generate a language similarity matrix, generate hierarchical clusters with complete-linkage and mean-linkage clustering, and extract two stable clusters with high internal language similarity. We introduce an extended k-means semi-supervised clustering procedure to evaluate how stably the two hierarchical stable clusters remain grouped together as the number of clusters changes. The higher the number of trials, the more likely the two hierarchical stable clusters appear distinctly in the generated k clusters. Across all five experiments, the stability of the two hierarchical stable clusters is highest at 5 clusters, so we take 5 clusters as the best clustering of Indonesian ethnic languages. Finally, we plot the 5 clusters on a geographical map.
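A minimal sketch of the pipeline described above, using scipy. The language names and similarity percentages here are hypothetical placeholders; in the actual work they would be derived from the ASJP database:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical pairwise similarity matrix (%) for five languages;
# real values would come from the ASJP database.
langs = ["Javanese", "Sundanese", "Madurese", "Minangkabau", "Buginese"]
sim = np.array([
    [100, 60, 55, 40, 30],
    [ 60, 100, 50, 38, 28],
    [ 55, 50, 100, 35, 25],
    [ 40, 38, 35, 100, 45],
    [ 30, 28, 25, 45, 100],
], dtype=float)

# Convert similarity to distance and condense for scipy.
dist = 100.0 - sim
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist)

# Complete-linkage hierarchical clustering, cut into 2 stable clusters.
Z = linkage(condensed, method="complete")
labels = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(langs, labels)))
```

The same distance matrix can then be re-clustered with k-means at varying k to check how often the hierarchical clusters survive intact, in the spirit of the stability evaluation the abstract describes.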
On Geometric Alignment in Low Doubling Dimension
Many real-world problems can be formulated as the alignment of two geometric patterns. Previously, a great deal of research focused on the alignment of 2D or 3D patterns, especially in the field of computer vision. Recently, the alignment of geometric patterns in high dimensions has found several novel applications and attracted increasing attention. However, the research is still rather limited in terms of algorithms. To the best of our knowledge, most existing approaches for high-dimensional alignment are simple extensions of their 2D and 3D counterparts and often suffer from issues such as high complexity. In this paper, we propose an effective framework that compresses high-dimensional geometric patterns while approximately preserving alignment quality. As a consequence, existing alignment approaches can be applied to the compressed patterns, significantly reducing the time complexity. Our idea is inspired by the observation that high-dimensional data often has a low intrinsic dimension. We adopt the widely used notion of "doubling dimension" to measure the extent of our compression and the resulting approximation. Finally, we test our method on both random and real datasets; the experimental results reveal that running the alignment algorithm on compressed patterns achieves quality similar to running it on the original patterns, while the running times (including the time cost of compression) are substantially lower.
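To illustrate the compress-then-align idea on data with low intrinsic dimension, here is a toy numpy sketch. It uses a Johnson-Lindenstrauss random projection as an illustrative stand-in for the paper's doubling-dimension-based compression (all sizes and data are synthetic assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two point sets in D=1000 ambient dimensions that actually lie
# near a d=5-dimensional subspace (low intrinsic dimension).
D, d, n = 1000, 5, 200
basis = rng.standard_normal((d, D))
A = rng.standard_normal((n, d)) @ basis
B = A + 0.01 * rng.standard_normal((n, D))  # noisy copy to align

# Random projection to k dimensions as an illustrative stand-in
# for the paper's doubling-dimension compression.
k = 50
P = rng.standard_normal((D, k)) / np.sqrt(k)
A_c, B_c = A @ P, B @ P

# The alignment cost (sum of squared point distances) is roughly
# preserved after compression, at a fraction of the dimensionality.
cost_full = np.sum((A - B) ** 2)
cost_comp = np.sum((A_c - B_c) ** 2)
print(cost_full, cost_comp)
```

Any downstream alignment algorithm can then run on the k-dimensional compressed patterns instead of the D-dimensional originals.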
Unsupervised Multilingual Alignment using Wasserstein Barycenter
We investigate the language alignment problem when there are multiple languages and we are interested in finding translations between all pairs of languages. Language alignment has long been an exciting topic for Natural Language Processing researchers. Current methods for learning cross-domain correspondences at the word level rely on distributed word representations, and recent developments in computational linguistics and neural language modeling have given rise to the so-called zero-shot learning paradigm. Many algorithms have been proposed to solve the bilingual alignment problem in supervised or unsupervised manners. One popular way to extend bilingual alignment to the multilingual setting is to pick one of the input languages as a pivot and transit through it. However, transiting through a pivot language degrades translation quality, since it assumes transitive relations among all pairs of languages, and such transitivity is typically not enforced when training bilingual models. Motivated by the observation that using information from other languages during training improves the translation of language pairs, we propose a new algorithm for unsupervised multilingual alignment in which languages are aligned through the Wasserstein barycenter of all language word embeddings rather than through a pivot language. The barycenter encapsulates information from all input languages and is closely related to their joint mapping, thereby facilitating bilingual alignment. We evaluate our method by jointly aligning word vectors in 6 languages on standard benchmarks and demonstrate noticeable improvements over the current state of the art.
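The barycenter idea can be sketched with generalized Procrustes analysis: iterate between a barycenter of the embedding spaces and per-language rotations onto it. This toy version uses a Euclidean mean as a simplified stand-in for the Wasserstein barycenter and assumes known word correspondences, unlike the unsupervised setting in the abstract; the data is synthetic:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(1)

# Toy "word embeddings" for three languages: each is a rotation of
# one shared latent space (rows correspond across languages).
n, d = 100, 10
latent = rng.standard_normal((n, d))

def random_rotation(d):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return Q

X = [latent @ random_rotation(d) for _ in range(3)]

# Generalized Procrustes: alternate between the barycenter and
# per-language orthogonal maps onto it.
bary = X[0].copy()
for _ in range(10):
    R = [orthogonal_procrustes(Xi, bary)[0] for Xi in X]
    aligned = [Xi @ Ri for Xi, Ri in zip(X, R)]
    bary = np.mean(aligned, axis=0)

# Once every language is mapped through the barycenter, any pair
# of languages agrees without a designated pivot language.
err = np.linalg.norm(aligned[0] - aligned[1])
print(err)
```

Replacing the Euclidean mean and fixed correspondences with a Wasserstein barycenter over empirical distributions recovers the unsupervised formulation the abstract proposes.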
Multiview Learning with Sparse and Unannotated Data
PhD Thesis
Obtaining annotated training data for supervised learning is a bottleneck in many
contemporary machine learning applications. The increasing prevalence of multi-modal
and multi-view data creates both new opportunities for circumventing this issue, and
new application challenges. In this thesis we explore several approaches to alleviating
annotation issues in multi-view scenarios.
We start by studying the problem of zero-shot learning (ZSL) for image recognition,
where class-level annotations for image recognition are eliminated by transferring information
from text modality instead. We next look at cross-modal matching, where
paired instances across views provide the supervised label information for learning. We
develop methodology for unsupervised and semi-supervised learning of the pairing, thus
eliminating the annotation requirement.
We first apply these ideas to unsupervised multi-view matching in the context of
bilingual dictionary induction (BLI), where instances are words in two languages and
finding a correspondence between the words produces a cross-lingual word translation
model. We then return to vision and language and look at learning unsupervised pairing
between images and text. This can be viewed as a limiting case of ZSL
where text-image pairing annotation requirements are completely eliminated.
Overall, these contributions in multi-view learning provide a suite of methods for
reducing annotation requirements, both in conventional classification and in cross-view
matching settings.
Embedding Multilingual and Relational Data Using Linear Mappings
This thesis presents our research on the embedding method, a machine learning technique that encodes real-world signals into high-dimensional vectors. Specifically, we focus on a family of algorithms whose backbone is one simple yet elegant type of operation, the linear mapping, also known as a linear transformation or vector space homomorphism. Past studies have shown the usefulness of these approaches for modelling complex data, such as lexicons from different languages and networks storing factual relations. However, they also exhibit crucial limitations, including a lack of theoretical justification, precision drops in challenging setups, and considerable environmental impact during training, among others.
To bridge these gaps, we first identify the previously unnoticed link between the success of linear Cross-Lingual Word Embedding (CLWE) mappings and the preservation of the implicit analogy relation, using both theoretical and empirical evidence. Next, we propose a post-hoc L1-norm rotation step that substantially improves the performance of existing CLWE mappings. Then, beyond conventional settings involving only modern languages, we extend the application of CLWE mappings to summarising lengthy and opaque historical text. Finally, motivated by the learning procedure of CLWE models, we adopt linear mappings to optimise Knowledge Graph Embeddings (KGEs) iteratively, significantly reducing the carbon footprint required to train the algorithm.
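A linear CLWE mapping of the kind this thesis studies is commonly fitted as an orthogonal Procrustes problem over a seed dictionary. The following minimal sketch uses synthetic embeddings (real systems would use pretrained vectors and a bilingual seed lexicon):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic source/target embeddings: the target space is an
# orthogonal rotation of the source plus small noise. Rows are
# assumed to be seed-dictionary translation pairs.
n, d = 50, 8
src = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
tgt = src @ Q + 0.01 * rng.standard_normal((n, d))

# Orthogonal Procrustes solution: W = U V^T from the SVD of
# src^T tgt, the closest orthogonal map from source to target.
U, _, Vt = np.linalg.svd(src.T @ tgt)
W = U @ Vt

# Mapped source vectors land near their target translations.
err = np.linalg.norm(src @ W - tgt) / np.linalg.norm(tgt)
print(err)
```

Constraining W to be orthogonal preserves dot products and hence the analogy structure of the source space, which is the property the thesis connects to the success of linear CLWE mappings.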