13,417 research outputs found

    Crosslingual Document Embedding as Reduced-Rank Ridge Regression

    There has recently been much interest in extending vector-based word representations to multiple languages, such that words can be compared across languages. In this paper, we shift the focus from words to documents and introduce a method for embedding documents written in any language into a single, language-independent vector space. For training, our approach leverages a multilingual corpus where the same concept is covered in multiple languages (but not necessarily via exact translations), such as Wikipedia. Our method, Cr5 (Crosslingual reduced-rank ridge regression), starts by training a ridge-regression-based classifier that uses language-specific bag-of-words features to predict the concept that a given document is about. We show that, when the learned weight matrix is constrained to be of low rank, it can be factored to obtain the desired mappings from language-specific bags-of-words to language-independent embeddings. As opposed to most prior methods, which use pretrained monolingual word vectors, postprocess them to make them crosslingual, and finally average word vectors to obtain document vectors, Cr5 is trained end-to-end and is thus natively crosslingual as well as document-level. Moreover, since our algorithm uses the singular value decomposition as its core operation, it is highly scalable. Experiments show that our method achieves state-of-the-art performance on a crosslingual document retrieval task. Finally, although not trained for embedding sentences and words, it also achieves competitive performance on crosslingual sentence and word retrieval tasks. Comment: In The Twelfth ACM International Conference on Web Search and Data Mining (WSDM '19)
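The low-rank factoring described in the abstract above can be illustrated with a generic reduced-rank ridge regression in NumPy. This is a minimal single-language sketch under stated assumptions, not the Cr5 authors' implementation: the function name `reduced_rank_ridge`, the toy data, and the parameters `rank` and `lam` are hypothetical, and the multilingual stacking of language-specific features is omitted. It uses the standard closed form in which the ridge solution is projected onto the top right singular vectors of the fitted values.

```python
import numpy as np

def reduced_rank_ridge(X, Y, rank, lam):
    """Generic reduced-rank ridge regression (illustrative, not the Cr5 code).

    X: (n_docs, n_features) bag-of-words matrix.
    Y: (n_docs, n_concepts) one-hot concept labels.
    Returns Phi (n_features, rank) and V_r (n_concepts, rank) such that
    the rank-constrained weight matrix is W = Phi @ V_r.T.
    """
    d = X.shape[1]
    # Unconstrained ridge solution: (X^T X + lam * I)^{-1} X^T Y
    W_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    # Fitted values on the ridge-augmented design [X; sqrt(lam) * I]
    X_aug = np.vstack([X, np.sqrt(lam) * np.eye(d)])
    Y_fit = X_aug @ W_ridge
    # Top right singular vectors of the fitted values
    _, _, Vt = np.linalg.svd(Y_fit, full_matrices=False)
    V_r = Vt[:rank].T
    # Phi maps bag-of-words features to a rank-dimensional embedding
    Phi = W_ridge @ V_r
    return Phi, V_r

# Toy data (hypothetical): 50 documents, 20 vocabulary terms, 5 concepts
rng = np.random.default_rng(0)
X = rng.random((50, 20))
Y = np.eye(5)[rng.integers(0, 5, size=50)]

Phi, V_r = reduced_rank_ridge(X, Y, rank=3, lam=1.0)
W = Phi @ V_r.T        # low-rank weight matrix
embeddings = X @ Phi   # document embeddings
```

In this sketch `Phi` plays the role of the mapping from bag-of-words features to embeddings, while `V_r.T` maps embeddings back to concept scores. When `rank` equals the number of concepts, the product `Phi @ V_r.T` reduces to the ordinary ridge solution.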

    Metadata Augmentation for Semantic- and Context- Based Retrieval of Digital Cultural Objects

    Cultural objects are increasingly stored and generated in digital form, yet effective methods for indexing and retrieving them remain an open area of research. The main problem arises from the disconnect between the content-based indexing approach used by computer scientists and the description-based approach used by information scientists. There is also a lack of representational schemes that align semantics and context with the keywords and low-level features that can be automatically extracted from the content of these cultural objects. This paper presents an integrated approach to these problems that draws on both computer science and information science. The focus is on the rationale and conceptual design of the system and its various components. In particular, we discuss techniques for augmenting commonly used metadata with visual features and domain knowledge to generate high-level abstract metadata, which in turn can be used for semantic and context-based indexing and retrieval. We use a sample collection of Vietnamese traditional woodcuts to demonstrate the usefulness of this approach.

    Topic Identification for Speech without ASR

    Modern topic identification (topic ID) systems for speech use automatic speech recognition (ASR) to produce speech transcripts, and perform supervised classification on such ASR outputs. However, under resource-limited conditions, the manually transcribed speech required to develop standard ASR systems can be severely limited or unavailable. In this paper, we investigate alternative unsupervised solutions to obtaining tokenizations of speech in terms of a vocabulary of automatically discovered word-like or phoneme-like units, without depending on the supervised training of ASR systems. Moreover, using automatic phoneme-like tokenizations, we demonstrate that a convolutional neural network based framework for learning spoken document representations provides competitive performance compared to a standard bag-of-words representation, as evidenced by comprehensive topic ID evaluations on both single-label and multi-label classification tasks. Comment: 5 pages, 2 figures; accepted for publication at Interspeech 201

    Combination of Domain Knowledge and Deep Learning for Sentiment Analysis of Short and Informal Messages on Social Media

    Sentiment analysis has recently emerged as one of the major natural language processing (NLP) tasks in many applications. As social media channels (e.g. social networks or forums) have become significant sources for brands to observe user opinions about their products, the task is increasingly crucial. However, real data obtained from social media contains a high volume of short and informal messages posted by users. Such data is difficult for existing approaches to handle, especially those based on deep learning. In this paper, we propose an approach to this problem. This work extends our previous work, in which we proposed combining Convolutional Neural Networks, a typical deep learning technique, with domain knowledge. The combination is used to obtain additional training data through augmentation and a more reasonable loss function. In this work, we further improve our architecture with several substantial enhancements, including negation-based data augmentation, transfer learning for word embeddings, the combination of word-level and character-level embeddings, and a multitask learning technique for attaching domain knowledge rules to the learning process. These enhancements, which specifically target short and informal messages, yield significant performance improvements in experiments on real datasets. Comment: A Preprint of an article accepted for publication by Inderscience in IJCVR on September 201

    ViCGCN: Graph Convolutional Network with Contextualized Language Models for Social Media Mining in Vietnamese

    Social media processing is a fundamental task in natural language processing with numerous applications. As Vietnamese social media and information science have grown rapidly, information-based mining on Vietnamese social media has become crucial. However, state-of-the-art research faces several significant drawbacks, including imbalanced and noisy data on social media platforms. Imbalance and noise are two essential issues that need to be addressed in Vietnamese social media texts. Graph Convolutional Networks can address the problems of imbalanced and noisy data in text classification on social media by taking advantage of the graph structure of the data. This study presents a novel approach based on a contextualized language model (PhoBERT) and a graph-based method (Graph Convolutional Networks). In particular, the proposed approach, ViCGCN, jointly trains contextualized embeddings with Graph Convolutional Networks (GCN) to capture more syntactic and semantic dependencies and address those drawbacks. Extensive experiments on various Vietnamese benchmark datasets were conducted to verify our approach. The results show that applying GCN to BERTology models as the final layer significantly improves performance. Moreover, the experiments demonstrate that ViCGCN outperforms 13 powerful baseline models, including BERTology models, fusion BERTology and GCN models, other baselines, and SOTA on three benchmark social media datasets. Our proposed ViCGCN approach demonstrates a significant improvement of up to 6.21%, 4.61%, and 2.63% over the best Contextualized Language Models, including multilingual and monolingual, on three benchmark datasets, UIT-VSMEC, UIT-ViCTSD, and UIT-VSFC, respectively. Additionally, our integrated model ViCGCN achieves the best performance compared to other BERTology models integrated with GCN.

    Vietnamese Women and Children Refugees in Hong Kong: An Argument against Arbitrary Detention

    Notes from the parish of Sundre, at the very south of Gotland. Wednesday April 11, 2012. Rivet, a narrow cape of gravel. Strong wind, open horizon. Water in almost all directions. Here, at the very tip, fright struck me when the sea wanted to kill me, swallow my body. Searched for the graves but found nothing. Went up to Arendt in the lighthouse. --- My degree project is my walk together with history, tempo, memory, inertia, movements and changes, ruins, myths and highly personal reflections. There are places you really like, that you love. I know several places that I love to be in; this is one of them. Few people stop here; it is a place you pass. I want to be alone here. On my visits, which lately during the project have been very targeted, I become extremely focused on the place and only the place. Also like stepping out of the fog.