3 research outputs found

    Comparison of automatic vs. manual language identification in multilingual social media texts

    No full text
    Multilingual speakers communicate in more than one language in daily life and on social media. In order to process or investigate multilingual communication, there is a need for language identification. This study compares the performance of human annotators with automatic ways of language identification on a multilingual (mainly German-Italian-English) social media corpus collected in South Tyrol, Italy. Our results indicate that humans and Natural Language Processing (NLP) systems follow their individual techniques to make a decision about multilingual text messages. This results in low agreement when different annotators or NLP systems execute the same task. In general, annotators agree with each other more than NLP systems. However, there is also variation in human agreement depending on the prior establishment of guidelines for the annotation task or not.

    Discovering similarities for content-based recommendation and browsing in multimedia collections

    No full text
    The purpose of the research described in this paper is to examine the existence of correlation between low level audio, visual and textual features and movie content similarity. In order to focus on a well defined and controlled case, we have built a small dataset of movie scenes from three sequel movies. In addition, manual annotations have led to a ground-truth similarity matrix between the adopted scenes. Then, three similarity matrices (one for each medium) have been computed based on Gaussian Mixture Models (audio and visual) and Latent Semantic Indexing (text). We have evaluated the automatically extracted similarities along with two simple fusion approaches and results indicate that the low-level features can lead to an accurate representation of the movie content. In addition, the fusion approach seems to outperform the individual modalities, which is a strong indication that individual modules lead to diverse similarities (in terms of content). Finally, we have evaluated the extracted similarities for different groups of human annotators, based on what a human interprets as similar and the results show that different groups of people correlate better with different modalities. This last result is very important and can be either used in (a) a personalized content-based retrieval and recommender system and (b) in a local weighted fusion approach, in future research.

    Open machine translation for low resource South American languages (AmericasNLP 2021 shared task contribution)

    No full text
    This paper describes the team (“Tamalli”)’s submission to AmericasNLP2021 shared task on Open Machine Translation for low resource South American languages. Our goal was to evaluate different Machine Translation (MT) techniques, statistical and neural-based, under several configuration settings. We obtained the second-best results for the language pairs “Spanish-Bribri”, “Spanish-Asháninka”, and “Spanish-Rarámuri” in the category “Development set not used for training”. Our performed experiments will serve as a point of reference for researchers working on MT with low-resource languages.