258 research outputs found
Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources
The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and the results showed improved automatic scores (BLEU and Meteor) and fewer out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model.
JRC.G.2-Global security and crisis managemen
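The expansion step described in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the phrase table, the variant lexicon, the threshold, and the use of `difflib.SequenceMatcher` as a stand-in for the morphosyntactic similarity score, which the abstract does not specify.

```python
from difflib import SequenceMatcher

# Hypothetical toy data; none of these entries come from the paper.
PHRASE_TABLE = {("small house", "petite maison"): 0.8}
# Hypothetical morphological resource: word -> variants of the same lemma.
VARIANTS = {"house": ["houses"], "maison": ["maisons"], "petite": ["petites"]}


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (not the paper's score)."""
    return SequenceMatcher(None, a, b).ratio()


def phrase_variants(phrase: str) -> list[str]:
    """Phrases that differ from `phrase` by one word's morphological variant."""
    words = phrase.split()
    out = []
    for i, word in enumerate(words):
        for variant in VARIANTS.get(word, []):
            out.append(" ".join(words[:i] + [variant] + words[i + 1:]))
    return out


def expand(table: dict, threshold: float = 0.8) -> dict:
    """Add source/target/both variants, scaling each score by similarity."""
    new = dict(table)
    for (src, tgt), score in table.items():
        for s in [src] + phrase_variants(src):
            for t in [tgt] + phrase_variants(tgt):
                if (s, t) in new:
                    continue  # keep existing associations untouched
                sim = similarity(src, s) * similarity(tgt, t)
                if sim >= threshold:
                    new[(s, t)] = score * sim
    return new


expanded = expand(PHRASE_TABLE)
```

In this sketch a new association inherits a discounted score from the association it was derived from, so a variant never outranks the original pair. A real system would also need to filter variants for agreement between source and target.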
The Near-Synonymous Classifiers in Mandarin Chinese: Etymology, Modern Usage, And Possible Problems in L2 Classroom
Many Chinese classifiers are near-synonyms: they can be used with the same head nouns without changing the meaning of the sentence, so they are interchangeable or almost interchangeable. This poses a challenge for learners of Chinese, especially those whose native language lacks such a grammatical category. Another complication arises from the ambiguous English translations of many classifiers.
In this paper we investigate the collocation behavior of near-synonymous Chinese classifiers, focusing on their semantic nuances and interchangeability. We analyze six pairs of classifiers drawn from the HSK exam glossary (栋 and 幢, 匹 and 头, 批 and 派, 颗 and 粒, 辆 and 台, and 根 and 支). The dataset comprises 1200 samples (100 per classifier) and 416 distinct head nouns.
Using a corpus-based approach, we analyze the collocation behavior of each classifier on its own and as part of its pair. The results show that not all pairs are fully interchangeable. The collocation behavior of 批 and 派 differs significantly: 批 primarily quantifies batches with a 'first' connotation, while 派 is used more in artistic expressions. The interchangeability of 栋 and 幢 varies with context. 幢 emerges as the least frequent morpheme in the corpus, which points to its specific contextual usage. While both are used in address lines, 栋 predominantly quantifies standalone buildings, whereas 幢 is more closely associated with larger architectural complexes. The analysis of 匹 and 头 highlights their distinctiveness, with 匹 counting horses and wolves and 头 being more versatile across various animals. 颗 and 粒 appear partially interchangeable, particularly with 珠-related head nouns and items associated with plants, fruits, and trees. The research also shows that 辆 is primarily linked to car-related nouns, while 台 is a more versatile classifier for machines and electronic devices, including computers, printers, phones, and cameras. 根 and 支 overlap only in the head noun 笔, and their roles diverge, with 根 being a versatile classifier and 支 also appearing as part of medical terms.
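One simple way to quantify how interchangeable a classifier pair is would be the overlap between the sets of head nouns each classifier takes. The sketch below uses Jaccard overlap. Both the measure and the head-noun sets are assumptions for illustration, not the study's method or data.

```python
def jaccard(a: set, b: set) -> float:
    """Share of head nouns that two classifiers have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical head-noun sets, for illustration only.
heads = {
    "根": {"笔", "绳子", "头发"},
    "支": {"笔", "队伍", "歌"},
}

overlap = heads["根"] & heads["支"]          # shared head nouns
score = jaccard(heads["根"], heads["支"])    # 1 shared noun out of 5 distinct
```

A score near 1 would suggest near-complete interchangeability, while a low score with a single shared noun matches the pattern the abstract reports for 根 and 支.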
Improving Product-related Patent Information Access with Automated Technology Ontology Extraction
Ph.D. (Doctor of Philosophy)
Exploratory Search on Mobile Devices
The goal of this thesis is to provide a general framework (MobEx) for exploratory search, especially on mobile devices. The central part is the design, implementation, and evaluation of several core modules for on-demand unsupervised information extraction that are well suited to exploratory search on mobile devices and that together form the MobEx framework. These core processing elements, combined with a multitouch-able user interface specially designed for two families of mobile devices, i.e. smartphones and tablets, have been implemented in a research prototype. The initial information request, in the form of a query topic description, is issued online by a user to the system. The system then retrieves web snippets using standard search engines. These snippets are passed through a chain of NLP components that perform on-demand or ad-hoc interactive Query Disambiguation, Named Entity Recognition, and Relation Extraction. By on-demand or ad-hoc we mean that the components are capable of performing their operations on an unrestricted open domain within special time constraints. The result of the whole process is a topic graph containing the detected associated topics as nodes and the extracted relationships as labelled edges between the nodes. The Topic Graph is presented to the user in different ways depending on the size of the device she is using. Various evaluations have been conducted that help us to understand the potential and limitations of the framework and the prototype.
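The topic graph described above, with topics as nodes and extracted relationships as labelled edges, can be sketched as a small data structure. The class name, method names, and the choice of undirected edges are assumptions for illustration; the abstract does not describe the MobEx implementation.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class TopicGraph:
    """Topics as nodes; extracted relations as labelled, undirected edges."""
    edges: dict = field(default_factory=lambda: defaultdict(dict))

    def add_relation(self, a: str, b: str, label: str) -> None:
        # Store the label in both directions (undirected is an assumption).
        self.edges[a][b] = label
        self.edges[b][a] = label

    def neighbours(self, topic: str) -> dict:
        """Return {associated topic: relation label} for one node."""
        return dict(self.edges.get(topic, {}))


# Hypothetical example entries.
graph = TopicGraph()
graph.add_relation("Moses", "machine translation", "toolkit for")
graph.add_relation("Moses", "phrase table", "uses")
```

A small-screen client could then render only `graph.neighbours(topic)` for the currently selected node, while a tablet could draw the whole graph.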
Getting Past the Language Gap: Innovations in Machine Translation
In this chapter, we review state-of-the-art machine translation systems and discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful, complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods were introduced by Chinese researchers, which allowed syntactic information to be used in translation modeling. Furthermore, advances in the related field of computational linguistics, which made off-the-shelf taggers and parsers readily available, gave MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board; this remains a major challenge, though many advanced research groups are currently pursuing ways to meet it head-on. The next generation of MT will consist of a collection of hybrid systems. This also augurs well for the mobile environment, as we look forward to more advanced and improved technologies that enable Speech-To-Speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT.
Big Data Computing for Geospatial Applications
The convergence of big data and geospatial computing has brought challenges and opportunities to Geographic Information Science with regard to geospatial data management, processing, analysis, modeling, and visualization. This book highlights recent advancements in integrating new computing approaches, spatial methods, and data management strategies to tackle geospatial big data challenges, while also demonstrating opportunities for using big data in geospatial applications. Crucial to the advancements highlighted in this book is the integration of computational thinking and spatial thinking, and the transformation of abstract ideas and models into concrete data structures and algorithms.
Extending Ethos in Digital Rhetorics
This dissertation examines the concept of ethos, or appeal to authority or trust, on the social media platform Twitter. Looking at collections of tweets, I found that the characteristics of the Twitter platform, as well as the general qualities of writing online, pushed users to rely on shortcuts to trust, such as focusing on specific buzzwords or referencing well-known organizations and individuals. Users also used internet culture as its own source of authority: they demonstrated that they were up to date on the latest trends and memes, and so were trustworthy accounts to follow. Users appealed to ethos this way because Twitter conversations traveled faster and farther, and involved people who were either unfamiliar to most users or completely anonymous. Essentially, Twitter users rely on shortcuts to trust and authority in conversations because they are less often engaging with a stable, known audience. Twitter users must continually reassert and redefine themselves as their posts circulate widely across and beyond the platform.
Applying Wikipedia to Interactive Information Retrieval
There are many opportunities to improve the interactivity of information retrieval systems beyond the ubiquitous search box. One idea is to use knowledge bases (e.g. controlled vocabularies, classification schemes, thesauri and ontologies) to organize, describe and navigate the information space. These resources are popular in libraries and specialist collections, but have proven too expensive and narrow to be applied to everyday web-scale search. Wikipedia has the potential to bring structured knowledge into more widespread use. This online, collaboratively generated encyclopaedia is one of the largest and most consulted reference works in existence. It is broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. Rendering this resource machine-readable is a challenging task that has captured the interest of many researchers. Many see it as a key step required to break the knowledge acquisition bottleneck that crippled previous efforts. This thesis claims that the roadblock can be sidestepped: Wikipedia can be applied effectively to open-domain information retrieval with minimal natural language processing or information extraction. The key is to focus on gathering and applying human-readable rather than machine-readable knowledge. To demonstrate this claim, the thesis tackles three separate problems: extracting knowledge from Wikipedia; connecting it to textual documents; and applying it to the retrieval process. First, we demonstrate that a large thesaurus-like structure can be obtained directly from Wikipedia, and that accurate measures of semantic relatedness can be efficiently mined from it. Second, we show that Wikipedia provides the necessary features and training data for existing data mining techniques to accurately detect and disambiguate topics when they are mentioned in plain text. Third, we provide two systems and user studies that demonstrate the utility of the Wikipedia-derived knowledge base for interactive information retrieval.
Satellite Workshop On Language, Artificial Intelligence and Computer Science for Natural Language Processing Applications (LAICS-NLP): Discovery of Meaning from Text
This paper proposes a novel method to disambiguate important words from a collection of documents. The hypothesis that underlies this approach is that there is a minimal set of senses that are significant in characterizing a context. We extend Yarowsky’s one sense per discourse [13] further to a collection of related documents rather than a single document. We perform distributed clustering on a set of features representing each of the top ten categories of documents in the Reuters-21578 dataset. Groups of terms that have a similar term distributional pattern across documents were identified. WordNet-based similarity measurement was then computed for terms within each cluster. An aggregation of the associations in WordNet that was employed to ascertain term similarity within clusters has provided a means of identifying clusters’ root senses.
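The grouping of terms by similar distributional pattern can be illustrated with a minimal sketch. It is a simplified stand-in, not the paper's clustering algorithm: it compares per-document term counts with cosine similarity and groups terms greedily. The counts, the threshold, and the function names are assumptions, and the subsequent WordNet-based similarity step is omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Hypothetical term counts over four documents (not Reuters-21578 data).
counts = {
    "bank": [3, 0, 2, 0],
    "loan": [2, 0, 3, 0],
    "wheat": [0, 4, 0, 1],
    "grain": [0, 3, 0, 2],
}


def group_terms(counts, threshold=0.8):
    """Greedy pass: a term joins the first cluster whose seed term is similar enough."""
    clusters = []
    for term, vec in counts.items():
        for cluster in clusters:
            if cosine(vec, counts[cluster[0]]) >= threshold:
                cluster.append(term)
                break
        else:
            clusters.append([term])  # no similar seed found: start a new cluster
    return clusters


clusters = group_terms(counts)
```

Terms that co-occur in the same documents end up in the same group, which is the input a sense-level similarity measure would then work on.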
A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.
Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing from the published version due to the publication policies. Please contact Prof. Erik Cambria for details.