23 research outputs found
Machine translation of user-generated content
The world of social media has undergone huge evolution during the last few years. With the spread of social media and online forums, individual users actively participate in the generation of online content in different languages from all over the world. Sharing of online content has become much easier than before with the advent of popular websites such as Twitter, Facebook etc. Such content is referred to as ‘User-Generated Content’ (UGC). Some examples of UGC are user reviews, customer feedback, tweets etc. In general, UGC is informal and noisy in terms of linguistic norms. Such noise does not create significant problems for human to understand the content, but it can pose challenges for several natural language processing
applications such as parsing, sentiment analysis, machine translation (MT),
etc.
An additional challenge for MT is sparseness of bilingual (translated) parallel UGC corpora. In this research, we explore the general issues in MT of UGC and set some research goals from our findings. One of our main goals is to exploit comparable corpora in order to extract parallel or semantically similar sentences. To accomplish this task, we design a document alignment system to extract semantically similar bilingual document pairs using the bilingual comparable corpora. We then apply strategies to extract parallel or semantically similar sentences from comparable corpora by transforming the document alignment system into a sentence alignment system. We seek to improve the quality of parallel data extraction for UGC translation and assemble the extracted data with the existing human translated resources.
Another objective of this research is to demonstrate the usefulness of MT-based sentiment analysis. However, when using openly available systems such as Google Translate, the translation process may alter the sentiment in the target language. To cope with this phenomenon, we instead build fine-grained sentiment translation models that focus on sentiment preservation in the target language during translation
Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme
Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologie
Recommended from our members
NBS monograph
From Introduction: "This report is the first of a series intended to provide a selective overview of research and development efforts and requirements in the somewhat overlapping fields of the computer and information sciences and technologies. The projected series of reports will attempt to outline the probable range of R & D activities in the computer and information sciences and technologies through selective reviews of the literature and to develop a reasonable consensus with respect to the opinions of workers in these and potentially related fields as to areas of continuing R & D concern for research program planning or review in these areas.
CLARIN
The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after establishing CLARIN as an Europ. Research Infrastructure Consortium
CLARIN. The infrastructure for language resources
CLARIN, the "Common Language Resources and Technology Infrastructure", has established itself as a major player in the field of research infrastructures for the humanities. This volume provides a comprehensive overview of the organization, its members, its goals and its functioning, as well as of the tools and resources hosted by the infrastructure. The many contributors representing various fields, from computer science to law to psychology, analyse a wide range of topics, such as the technology behind the CLARIN infrastructure, the use of CLARIN resources in diverse research projects, the achievements of selected national CLARIN consortia, and the challenges that CLARIN has faced and will face in the future.
The book will be published in 2022, 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium by the European Commission (Decision 2012/136/EU)
Tune your brown clustering, please
Brown clustering, an unsupervised hierarchical clustering technique based on ngram mutual information, has proven useful in many NLP applications. However, most uses of Brown clustering employ the same default configuration; the appropriateness of this configuration has gone predominantly unexplored. Accordingly, we present information for practitioners on the behaviour of Brown clustering in order to assist hyper-parametre tuning, in the form of a theoretical model of Brown clustering utility. This model is then evaluated empirically in two sequence labelling tasks over two text types. We explore the dynamic between the input corpus size, chosen number of classes, and quality of the resulting clusters, which has an impact for any approach using Brown clustering. In every scenario that we examine, our results reveal that the values most commonly used for the clustering are sub-optimal
CLARIN
The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure – CLARIN – for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and challenges that CLARIN will tackle in the future. The book is published 10 years after establishing CLARIN as an Europ. Research Infrastructure Consortium
Engineering data compendium. Human perception and performance, volume 3
The concept underlying the Engineering Data Compendium was the product of a research and development program (Integrated Perceptual Information for Designers project) aimed at facilitating the application of basic research findings in human performance to the design of military crew systems. The principal objective was to develop a workable strategy for: (1) identifying and distilling information of potential value to system design from existing research literature, and (2) presenting this technical information in a way that would aid its accessibility, interpretability, and applicability by system designers. The present four volumes of the Engineering Data Compendium represent the first implementation of this strategy. This is Volume 3, containing sections on Human Language Processing, Operator Motion Control, Effects of Environmental Stressors, Display Interfaces, and Control Interfaces (Real/Virtual)