13,417 research outputs found
Crosslingual Document Embedding as Reduced-Rank Ridge Regression
There has recently been much interest in extending vector-based word
representations to multiple languages, such that words can be compared across
languages. In this paper, we shift the focus from words to documents and
introduce a method for embedding documents written in any language into a
single, language-independent vector space. For training, our approach leverages
a multilingual corpus where the same concept is covered in multiple languages
(but not necessarily via exact translations), such as Wikipedia. Our method,
Cr5 (Crosslingual reduced-rank ridge regression), starts by training a
ridge-regression-based classifier that uses language-specific bag-of-word
features in order to predict the concept that a given document is about. We
show that, when constraining the learned weight matrix to be of low rank, it
can be factored to obtain the desired mappings from language-specific
bags-of-words to language-independent embeddings. As opposed to most prior
methods, which use pretrained monolingual word vectors, postprocess them to
make them crosslingual, and finally average word vectors to obtain document
vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as
document-level. Moreover, since our algorithm uses the singular value
decomposition as its core operation, it is highly scalable. Experiments show
that our method achieves state-of-the-art performance on a crosslingual
document retrieval task. Finally, although not trained for embedding sentences
and words, it also achieves competitive performance on crosslingual sentence
and word retrieval tasks.Comment: In The Twelfth ACM International Conference on Web Search and Data
Mining (WSDM '19
Metadata Augmentation for Semantic- and Context- Based Retrieval of Digital Cultural Objects
Cultural objects are increasingly stored and generated in digital form, yet effective methods for their indexing and retrieval still remain an open area of research. The main problem arises from the disconnection between the content-based indexing approach used by computer scientists and the description-based approach used by information scientists. There is also a lack of representational schemes that allow the alignment of the semantics and context with keywords and low-level features that can be automatically extracted from the content of these cultural objects. This paper presents an integrated approach to address these problems, taking advantage of both computer science and information science approaches. The focus is on the rationale and conceptual design of the system and its various components. In particular, we discuss techniques for augmenting commonly used metadata with visual features and domain knowledge to generate high-level abstract metadata which in turn can be used for semantic and context-based indexing and retrieval. We use a sample collection of Vietnamese traditional woodcuts to demonstrate the usefulness of this approach
Topic Identification for Speech without ASR
Modern topic identification (topic ID) systems for speech use automatic
speech recognition (ASR) to produce speech transcripts, and perform supervised
classification on such ASR outputs. However, under resource-limited conditions,
the manually transcribed speech required to develop standard ASR systems can be
severely limited or unavailable. In this paper, we investigate alternative
unsupervised solutions to obtaining tokenizations of speech in terms of a
vocabulary of automatically discovered word-like or phoneme-like units, without
depending on the supervised training of ASR systems. Moreover, using automatic
phoneme-like tokenizations, we demonstrate that a convolutional neural network
based framework for learning spoken document representations provides
competitive performance compared to a standard bag-of-words representation, as
evidenced by comprehensive topic ID evaluations on both single-label and
multi-label classification tasks.Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201
Combination of Domain Knowledge and Deep Learning for Sentiment Analysis of Short and Informal Messages on Social Media
Sentiment analysis has been emerging recently as one of the major natural
language processing (NLP) tasks in many applications. Especially, as social
media channels (e.g. social networks or forums) have become significant sources
for brands to observe user opinions about their products, this task is thus
increasingly crucial. However, when applied with real data obtained from social
media, we notice that there is a high volume of short and informal messages
posted by users on those channels. This kind of data makes the existing works
suffer from many difficulties to handle, especially ones using deep learning
approaches. In this paper, we propose an approach to handle this problem. This
work is extended from our previous work, in which we proposed to combine the
typical deep learning technique of Convolutional Neural Networks with domain
knowledge. The combination is used for acquiring additional training data
augmentation and a more reasonable loss function. In this work, we further
improve our architecture by various substantial enhancements, including
negation-based data augmentation, transfer learning for word embeddings, the
combination of word-level embeddings and character-level embeddings, and using
multitask learning technique for attaching domain knowledge rules in the
learning process. Those enhancements, specifically aiming to handle short and
informal messages, help us to enjoy significant improvement in performance once
experimenting on real datasets.Comment: A Preprint of an article accepted for publication by Inderscience in
IJCVR on September 201
ViCGCN: Graph Convolutional Network with Contextualized Language Models for Social Media Mining in Vietnamese
Social media processing is a fundamental task in natural language processing
with numerous applications. As Vietnamese social media and information science
have grown rapidly, the necessity of information-based mining on Vietnamese
social media has become crucial. However, state-of-the-art research faces
several significant drawbacks, including imbalanced data and noisy data on
social media platforms. Imbalanced and noisy are two essential issues that need
to be addressed in Vietnamese social media texts. Graph Convolutional Networks
can address the problems of imbalanced and noisy data in text classification on
social media by taking advantage of the graph structure of the data. This study
presents a novel approach based on contextualized language model (PhoBERT) and
graph-based method (Graph Convolutional Networks). In particular, the proposed
approach, ViCGCN, jointly trained the power of Contextualized embeddings with
the ability of Graph Convolutional Networks, GCN, to capture more syntactic and
semantic dependencies to address those drawbacks. Extensive experiments on
various Vietnamese benchmark datasets were conducted to verify our approach.
The observation shows that applying GCN to BERTology models as the final layer
significantly improves performance. Moreover, the experiments demonstrate that
ViCGCN outperforms 13 powerful baseline models, including BERTology models,
fusion BERTology and GCN models, other baselines, and SOTA on three benchmark
social media datasets. Our proposed ViCGCN approach demonstrates a significant
improvement of up to 6.21%, 4.61%, and 2.63% over the best Contextualized
Language Models, including multilingual and monolingual, on three benchmark
datasets, UIT-VSMEC, UIT-ViCTSD, and UIT-VSFC, respectively. Additionally, our
integrated model ViCGCN achieves the best performance compared to other
BERTology integrated with GCN models
Vietnamese Women and Children Refugees in Hong Kong: An Argument against Arbitrary Detention
Anteckningar frÄn Sundre, Gotland sydligaste socken. Onsdag 11 april 2012 Rivet, en smal udde av grus. Stark vind, öppen horisont. Vatten pÄ nÀra nog alla hÄll. HÀr, lÀngst ut, drabbades jag av fasa nÀr havet ville döda mig, sluka min kropp. Letade efter gravarna men fann inget. Gick upp till Arendt i fyren. --- Mitt examensprojekt Àr min promenad tillsammans med historia, tempo, minne, tröghet, förflyttningar och förÀndringar, ruiner, myter samt högst personliga reflektioner. Det finns platser man verkligen tycker om, som man Àlskar. Jag vet flera platser som jag Àlskar att vara pÄ, det hÀr Àr en av dem. FÄ stannar till hÀr, det Àr ett stÀlle man passerar. Jag vill vara ensam hÀr. Vid mina besök, som pÄ senare tid under projektets gÄng har varit mycket riktade, blir jag extremt fokuserad pÄ platsen och bara platsen. OcksÄ som att kliva ur tjockan.Notes from the parish of Sundre, at the very south of Gotland. Wednesday April 11, 2012 Rivet, a small cpe of gravel. Strong wind, open horizon. Water in almost all directions. Here, at the nab, fright hit me when the sea wanted to liquidate me, swallow my body. Searched for the tombs but found nothing. Visited Arendt in the lighthouse. --- My degree project is my promenade together with history, tempo, memory, inertia, movements, ruins, myths and my very personal reflections. There are places you really like, that you love. There are several places that I love to be at, this is one of them. Few people halt here, this is a place you pass. I want to be alone here. At my visits, that during the project became more and more addressed, I become extremely focused at the place and the place only. Like stepping out of the mist
- âŠ