52 research outputs found
SMT and Hybrid systems of the QTLeap project in the WMT16 IT-task
This paper presents the description of 12
systems submitted to the WMT16 IT-task,
covering six different languages, namely
Basque, Bulgarian, Dutch, Czech, Portuguese
and Spanish. All these systems
were developed under the scope of the
QTLeap project, presenting a common
strategy. For each language two different
systems were submitted, namely a phrase-based
MT system built using Moses, and
a system exploiting deep language engineering
approaches, that in all the languages
but Bulgarian was implemented
using TectoMT. For 4 of the 6 languages,
the TectoMT-based system performs better
than the Moses-based one
Traducción automática basada en tectogramática para inglés-español e inglés-euskara
Presentamos los primeros sistemas de traducción automática para inglés-español e inglés-euskara basados en tectogramática. A partir del modelo ya existente inglés-checo, describimos las herramientas para el análisis y síntesis, y los recursos para la trasferencia. La evaluación muestra el potencial de estos sistemas para adaptarse a nuevas lenguas y dominios.We present the first attempt to build machine translation systems for the English-Spanish and English-Basque language pairs following the tectogrammar approach. Based on the English-Czech system, we describe the language-specific tools added in the analysis and synthesis steps, and the resources for bilingual transfer. Evaluation shows the potential of these systems for new languages and domains.The research leading to these results has received funding from FP7-ICT-2013-10-610516 (QTLeap project, qtleap.eu)
Using Parallel Texts and Lexicons for Verbal Word Sense Disambiguation
We present a system for verbal Word Sense Disambiguation (WSD) that is able to exploit additional information from parallel texts and lexicons. It is an extension of our previous WSD method, which gave promising results but used only monolingual features. In the follow-up work described here, we have explored two additional ideas: using English-Czech bilingual resources (as features only - the task itself remains a monolingual WSD task), and using a 'hybrid' approach, adding features extracted both from a parallel corpus and from manually aligned bilingual valency lexicon entries, which contain subcategorization information. Albeit not all types of features proved useful, both ideas and additions have led to significant improvements for both languages explored
Extrakce znalostních grafů z projektové dokumentace
Název práce: Extrakce znalostních grafů z projektové dokumentace Autor: Bc. Tomáš Helešic Katedra: Katedra softwarového inženýrství Vedoucí diplomové práce: Mgr. Martin Nečaský, Ph.D. Abstrakt: Cílem této práce je prozkoumat možnosti automatické extrakce infor- mací z firemní projektové dokumentace s využitím nástroje pro strojové zpra- cování přirozeného jazyka a analýza přesnosti lingvistického zpracování těchto dokumentů. Dále navrhnout metody, jak získat klíčové pojmy a vazby mezi nimi. Z těchto pojmů a vazeb se vytváří znalostní grafy, které se uchovávají ve vhodném úložisti s vyhledávací službou. Práce se snaží propojit již ex- istující technologie, implementovat je do jednoduché aplikace a ověřit jejich připravenost pro praktické využití. Cílem je inspirovat budoucí výzkum v této oblasti, identifikovat kritická místa a navhrnout zlepšení. Hlavní přínos tkví v propojení zpracování přirozeného jazyka, metod extrakce informací, sémantické vyhledávání s firemnímy dokumenty. Přínos praktické části spočívá ve způsobu identifikace důležitých informací, které popisují jednotlivé dokumenty a jejich využití ve vyhledávání. Klíčová slova: Znalostní grafy, Extrakce informace, Zpracování...Title: Knowledge Graph Extraction from Project Documentation Author: Bc. Tomáš Helešic Department: Department of Software Engineering Supervisor: Mgr. Martin Nečaský, Ph.D. Abstract: The goal of this thesis is to explore the possibilities of automatic in- formation extraction from company project documentation with the use of ma- chine natural language processing and the analysis of the precision of linguistic processing of these documents. Furthermore suggest methods how acquire key terms and dependencies between them. From this terms and dependencies cre- ate knowledge graphs, that are stored in an appropriate database with search engine. The work is trying to interconnect already existing technologies in a shape of a simple application and test their readiness for a practical use. The goal is to inspire future research in this field, identify critical parts and propose improvements. The main gain is in the interconnection between natural lan- guage processing, methods of information extraction and semantic searching in corporate documents. The gain of the practical part reside in the way how to identify key information that is uniquely describing each document and its use in search. Keywords: Knowledge graphs, Information extraction, Natural language pro- cessing, Resource Description Framework 1Katedra softwarového inženýrstvíDepartment of Software EngineeringMatematicko-fyzikální fakultaFaculty of Mathematics and Physic
New Language Pairs in TectoMT
The TectoMT tree-to-tree machine translation system has been updated this year to support easier retraining for more translation directions. We use multilingual standards for morphology and syntax annotation and language-independent base rules. We include a simple, non-parametric way of combining TectoMT’s transfer model outputs
The Weight Function in the Subtree Kernel is Decisive
Tree data are ubiquitous because they model a large variety of situations,
e.g., the architecture of plants, the secondary structure of RNA, or the
hierarchy of XML files. Nevertheless, the analysis of these non-Euclidean data
is difficult per se. In this paper, we focus on the subtree kernel that is a
convolution kernel for tree data introduced by Vishwanathan and Smola in the
early 2000's. More precisely, we investigate the influence of the weight
function from a theoretical perspective and in real data applications. We
establish on a 2-classes stochastic model that the performance of the subtree
kernel is improved when the weight of leaves vanishes, which motivates the
definition of a new weight function, learned from the data and not fixed by the
user as usually done. To this end, we define a unified framework for computing
the subtree kernel from ordered or unordered trees, that is particularly
suitable for tuning parameters. We show through eight real data classification
problems the great efficiency of our approach, in particular for small
datasets, which also states the high importance of the weight function.
Finally, a visualization tool of the significant features is derived.Comment: 36 page
Recommended from our members
Bayesian Logic Programs for plan recognition and machine reading
textSeveral real world tasks involve data that is uncertain and relational in nature. Traditional approaches like first-order logic and probabilistic models either deal with structured data or uncertainty, but not both. To address these limitations, statistical relational learning (SRL), a new area in machine learning integrating both first-order logic and probabilistic graphical models, has emerged in the recent past. The advantage of SRL models is that they can handle both uncertainty and structured/relational data. As a result, they are widely used in domains like social network analysis, biological data analysis, and natural language processing. Bayesian Logic Programs (BLPs), which integrate both first-order logic and Bayesian net- works are a powerful SRL formalism developed in the recent past. In this
dissertation, we develop approaches using BLPs to solve two real world tasks – plan recognition and machine reading.
Plan recognition is the task of predicting an agent’s top-level plans based on its observed actions. It is an abductive reasoning task that involves inferring cause from effect. In the first part of the dissertation, we develop an approach to abductive plan recognition using BLPs. Since BLPs employ logical deduction to construct the networks, they cannot be used effectively for abductive plan recognition as is. Therefore, we extend BLPs to use logical abduction to construct Bayesian networks and call the resulting model Bayesian Abductive Logic Programs (BALPs).
In the second part of the dissertation, we apply BLPs to the task of machine reading, which involves automatic extraction of knowledge from natural language text. Most information extraction (IE) systems identify facts that are explicitly stated in text. However, much of the information conveyed in text must be inferred from what is explicitly stated since easily inferable facts are rarely mentioned. Human readers naturally use common sense knowledge and “read between the lines” to infer such implicit information from the explicitly stated facts. Since IE systems do not have access to common sense knowledge, they cannot perform deeper reasoning to infer implicitly stated facts. Here, we first develop an approach using BLPs to infer implicitly stated facts from natural language text. It involves learning uncertain common sense knowledge in the form of probabilistic first-order rules by mining a large corpus of automatically extracted facts using an existing rule learner. These rules are then used to derive additional facts from extracted information using BLP inference. We then develop an online rule learner that handles the concise, incomplete nature of natural-language text and learns first-order rules from noisy IE extractions. Finally, we develop a novel approach to calculate the weights of the rules using a curated lexical ontology like WordNet.
Both tasks described above involve inference and learning from partially
observed or incomplete data. In plan recognition, the underlying cause or the top-level plan that resulted in the observed actions is not known or observed. Further, only a subset of the executed actions can be observed by the plan recognition system resulting in partially observed data. Similarly, in machine reading, since some information is implicitly stated, they are rarely observed in the data. In this dissertation, we demonstrate the efficacy of BLPs for inference and learning from incomplete data. Experimental comparison on various benchmark data sets on both tasks demonstrate the superior performance of BLPs over state-of-the-art methods.Computer Science
Target-Side Context for Discriminative Models in Statistical Machine Translation
Discriminative translation models utilizing source context have been shown to
help statistical machine translation performance. We propose a novel extension
of this work using target context information. Surprisingly, we show that this
model can be efficiently integrated directly in the decoding process. Our
approach scales to large training data sizes and results in consistent
improvements in translation quality on four language pairs. We also provide an
analysis comparing the strengths of the baseline source-context model with our
extended source-context and target-context model and we show that our extension
allows us to better capture morphological coherence. Our work is freely
available as part of Moses.Comment: Accepted as a long paper for ACL 201
- …