Getting Past the Language Gap: Innovations in Machine Translation
In this chapter, we review state-of-the-art machine translation systems and discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has been revitalized over the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful, complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods were introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, advances in the related field of computational linguistics, which made off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge, though many advanced research groups are currently pursuing ways to meet this challenge head-on. The next generation of MT will consist of a collection of hybrid systems. This also augurs well for the mobile environment, as we look forward to more advanced and improved technologies, namely speech recognition and speech synthesis, that enable Speech-To-Speech machine translation on hand-held devices. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT.
SAGA: A project to automate the management of software production systems
The SAGA system is a software environment that is designed to support most of the software development activities that occur in a software lifecycle. The system can be configured to support specific software development applications using given programming languages, tools, and methodologies. Meta-tools are provided to ease configuration. The SAGA system consists of a small number of software components that are adapted by the meta-tools into specific tools for use in the software development application. The modules are designed so that the meta-tools can construct an environment which is both integrated and flexible. The SAGA project is documented in several papers which are presented.
Cross-Lingual Transfer of Natural Language Processing Systems
Accurate natural language processing systems rely heavily on annotated datasets. In the absence of such datasets, transfer methods can help to develop a model by transferring annotations from one or more rich-resource languages to the target language of interest. These methods are generally divided into two approaches: 1) annotation projection from translation data, aka parallel data, using supervised models in rich-resource languages, and 2) direct model transfer from annotated datasets in rich-resource languages.
In this thesis, we demonstrate different methods for transfer of dependency parsers and sentiment analysis systems. We propose an annotation projection method that performs well in the scenarios for which a large amount of in-domain parallel data is available. We also propose a method which is a combination of annotation projection and direct transfer that can leverage a minimal amount of information from a small out-of-domain parallel dataset to develop highly accurate transfer models. Furthermore, we propose an unsupervised syntactic reordering model to improve the accuracy of dependency parser transfer for non-European languages. Finally, we conduct a diverse set of experiments for the transfer of sentiment analysis systems in different data settings.
A summary of our contributions is as follows:
* We develop accurate dependency parsers using parallel text in an annotation projection framework. We make use of the fact that the density of word alignments is a valuable indicator of reliability in annotation projection.
* We develop accurate dependency parsers in the absence of a large amount of parallel data. We use the Bible data, which is orders of magnitude smaller than a conventional parallel dataset, to provide minimal cues for creating cross-lingual word representations. Our model is also capable of boosting the performance of annotation projection with a large amount of parallel data. Our model develops cross-lingual word representations for going beyond the traditional delexicalized direct transfer methods. Moreover, we propose a simple but effective word translation approach that brings in explicit lexical features from the target language in our direct transfer method.
* We develop different syntactic reordering models that can change the source treebanks in rich-resource languages, thus preventing the learning of a wrong model for an unrelated language. Our experimental results show substantial improvements for non-European languages.
* We develop transfer methods for sentiment analysis in different data availability scenarios. We show that we can leverage cross-lingual word embeddings to create accurate sentiment analysis systems in the absence of annotated data in the target language of interest.
We believe that the novelties that we introduce in this thesis indicate the usefulness of transfer methods. This is appealing in practice, especially since we suggest eliminating the requirement for annotating new datasets for low-resource languages, which are expensive, if not impossible, to obtain.
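The first contribution above relies on alignment density as a reliability signal for annotation projection. A minimal sketch of that idea follows, under stated assumptions: alignments are one-to-one, density is taken to be the fraction of target tokens that are aligned, and the function names and the `min_density` threshold are hypothetical rather than taken from the thesis.

```python
from typing import Dict, List, Optional, Tuple


def alignment_density(alignment: Dict[int, int], target_len: int) -> float:
    """Fraction of target tokens that have an aligned source token (assumed definition)."""
    if target_len == 0:
        return 0.0
    return len(alignment) / target_len


def project_heads(
    source_heads: List[int], alignment: Dict[int, int], target_len: int
) -> List[Optional[int]]:
    """Project dependency heads from a source sentence onto a target sentence.

    source_heads[i] is the head index of source token i, or -1 for the root.
    alignment maps a target index to a source index (one-to-one assumed).
    Each target token gets a head index, -1 for root, or None when unknown.
    """
    source_to_target = {s: t for t, s in alignment.items()}
    projected: List[Optional[int]] = []
    for t in range(target_len):
        s = alignment.get(t)
        if s is None:
            projected.append(None)  # unaligned target token: no annotation
            continue
        head = source_heads[s]
        # The head is projected only if the source head is itself aligned.
        projected.append(-1 if head == -1 else source_to_target.get(head))
    return projected


def keep_reliable(
    examples: List[Tuple[List[int], Dict[int, int], int]], min_density: float = 0.8
) -> List[List[Optional[int]]]:
    """Keep projected trees only for sentence pairs with dense alignments."""
    return [
        project_heads(heads, align, n)
        for heads, align, n in examples
        if alignment_density(align, n) >= min_density
    ]


# Toy sentence pair: source heads for a three-token sentence whose root is token 1.
dense = ([1, -1, 1], {0: 0, 1: 1, 2: 2}, 3)
sparse = ([1, -1, 1], {0: 0, 2: 1}, 4)
print(keep_reliable([dense, sparse]))
```

Filtering whole sentences is only one possible use of the density signal; it could equally be used to weight training examples.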
Posterior Regularization for Learning with Side Information and Weak Supervision
Supervised machine learning techniques have been very successful for a variety of tasks and domains including natural language processing, computer vision, and computational biology. Unfortunately, their use often requires creation of large problem-specific training corpora that can make these methods prohibitively expensive. At the same time, we often have access to external problem-specific information that we cannot always easily incorporate. We might know how to solve the problem in another domain (e.g. for a different language); we might have access to cheap but noisy training data; or a domain expert might be available who would be able to guide a human learner much more efficiently than by simply creating an IID training corpus. A key challenge for weakly supervised learning is then how to incorporate such kinds of auxiliary information arising from indirect supervision.
In this thesis, we present Posterior Regularization, a probabilistic framework for structured, weakly supervised learning. Posterior Regularization is applicable to probabilistic models with latent variables and exports a language for specifying constraints or preferences about posterior distributions of latent variables. We show that this language is powerful enough to specify realistic prior knowledge for a variety of applications in natural language processing. Additionally, because Posterior Regularization separates model complexity from the complexity of structural constraints, it can be used for structured problems with relatively little computational overhead. We apply Posterior Regularization to several problems in natural language processing including word alignment for machine translation, transfer of linguistic resources across languages, and grammar induction. Additionally, we find that we can apply Posterior Regularization to the problem of multi-view learning, achieving particularly good results for transfer learning. We also explore the theoretical relationship between Posterior Regularization and other proposed frameworks for encoding this kind of prior knowledge, and show a close relationship to Constraint Driven Learning as well as to Generalized Expectation Constraints.
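For orientation, the objective usually associated with Posterior Regularization in the literature can be written as below. The abstract does not state it, so the notation is assumed: $\theta$ are model parameters, $\mathcal{L}(\theta)$ is the marginal log-likelihood, $\mathbf{Y}$ are latent variables, $\boldsymbol{\phi}$ are constraint features, and $\mathbf{b}$ are bounds on their posterior expectations.

```latex
\begin{align}
\mathcal{Q} &= \bigl\{\, q(\mathbf{Y}) \;:\; \mathbb{E}_{q}\bigl[\boldsymbol{\phi}(\mathbf{X},\mathbf{Y})\bigr] \le \mathbf{b} \,\bigr\} \\
J_{\mathcal{Q}}(\theta) &= \mathcal{L}(\theta) \;-\; \min_{q \in \mathcal{Q}} \mathrm{KL}\bigl(q(\mathbf{Y}) \,\big\|\, p_{\theta}(\mathbf{Y} \mid \mathbf{X})\bigr)
\end{align}
```

Read this way, the constraints act on the posterior over latent variables (the set $\mathcal{Q}$) rather than on the model parameters, which is one way to see the separation of model complexity from constraint complexity mentioned in the abstract.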
at the 14th Conference of the Spanish Association for Artificial Intelligence (CAEPIA 2011)
Technical Report TR-2011/1, Department of Languages and Computation, University of Almería, November 2011. Joaquín Cañadas, Grzegorz J. Nalepa, Joachim Baumeister (Editors). The seventh workshop on Knowledge Engineering and Software Engineering (KESE7) was held at the Conference of the Spanish Association for Artificial Intelligence (CAEPIA-2011) in La Laguna (Tenerife), Spain, and brought together researchers and practitioners from both fields of software engineering and artificial intelligence. The intention was to give ample space for exchanging the latest research results as well as knowledge about practical experience. University of Almería, Almería, Spain. AGH University of Science and Technology, Kraków, Poland. University of Würzburg, Würzburg, Germany.
Learning with Joint Inference and Latent Linguistic Structure in Graphical Models
Constructing end-to-end NLP systems requires the processing of many types of linguistic information prior to solving the desired end task. A common approach to this problem is to construct a pipeline, one component for each task, with each system's output becoming input for the next. This approach poses two problems. First, errors propagate, and, much like the childhood game of telephone, combining systems in this manner can lead to unintelligible outcomes. Second, each component task requires annotated training data to act as supervision for training the model. These annotations are often expensive and time-consuming to produce, may differ from each other in genre and style, and may not match the intended application.
In this dissertation we present a general framework for constructing and reasoning on joint graphical model formulations of NLP problems. Individual models are composed using weighted Boolean logic constraints, and inference is performed using belief propagation. The systems we develop are composed of two parts: one a representation of syntax, the other a desired end task (semantic role labeling, named entity recognition, or relation extraction). By modeling these problems jointly, both models are trained in a single, integrated process, with uncertainty propagated between them. This mitigates the accumulation of errors typical of pipelined approaches.
Additionally, we propose a novel marginalization-based training method in which the error signal from end task annotations is used to guide the induction of a constrained latent syntactic representation. This allows training in the absence of syntactic training data, where the latent syntactic structure is instead optimized to best support the end task predictions. We find that across many NLP tasks this training method offers performance comparable to fully supervised training of each individual component, and in some instances improves upon it by learning latent structures which are more appropriate for the task.
Statistical and Machine Learning Techniques Applied to Algorithm Selection for Solving Sparse Linear Systems
Many applications and problems in science and engineering require large-scale numerical simulations and computations. Choosing an appropriate method to solve these problems is a common but non-trivial task, principally because the decision is often too hard for humans to make, or because it requires a certain degree of expertise and knowledge in the particular discipline or in mathematics. A methodology that can facilitate or automate this process, and that helps in understanding the problem, would therefore be of great interest and help. The proposal is to use various statistically based machine-learning and data mining techniques to analyze and automate the process of choosing an appropriate numerical algorithm for solving a specific set of problems (sparse linear systems) based on their individual properties.
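To make the idea of property-based algorithm selection concrete, here is a minimal sketch, not the method of the abstract. It assumes three hand-picked matrix properties (density, symmetry, fraction of diagonally dominant rows), a one-nearest-neighbour rule, and a toy training set of two small matrices; the labels `"cg"` and `"gmres"` and all names are illustrative choices.

```python
import math
from typing import Dict, List, Tuple

# A sparse matrix stored as {(row, col): nonzero value}.
Sparse = Dict[Tuple[int, int], float]


def features(a: Sparse, n: int) -> List[float]:
    """Describe an n x n sparse matrix by a few structural properties."""
    density = len(a) / (n * n)
    symmetric = all(math.isclose(v, a.get((j, i), 0.0)) for (i, j), v in a.items())
    dominant_rows = 0
    for i in range(n):
        diag = abs(a.get((i, i), 0.0))
        off = sum(abs(v) for (r, c), v in a.items() if r == i and c != i)
        if diag > 0 and diag >= off:
            dominant_rows += 1
    return [density, float(symmetric), dominant_rows / n]


def nearest_label(query: List[float], training: List[Tuple[List[float], str]]) -> str:
    """Return the solver label of the closest training example (1-nearest neighbour)."""
    return min(training, key=lambda item: math.dist(query, item[0]))[1]


def tridiagonal(n: int, diag: float, off: float) -> Sparse:
    """Build a symmetric tridiagonal test matrix."""
    a: Sparse = {(i, i): diag for i in range(n)}
    for i in range(n - 1):
        a[(i, i + 1)] = off
        a[(i + 1, i)] = off
    return a


# Toy training set (illustrative labels, not experimental results).
spd_like = tridiagonal(3, 2.0, -1.0)
nonsymmetric = {(0, 0): 1.0, (0, 1): 3.0, (1, 1): 1.0, (2, 0): 2.0, (2, 2): 1.0}
training = [(features(spd_like, 3), "cg"), (features(nonsymmetric, 3), "gmres")]

query = tridiagonal(4, 4.0, -1.0)
print(nearest_label(features(query, 4), training))
```

A realistic study would use many more features and labelled solver runs; the point here is only the shape of the pipeline, from matrix properties to a predicted solver.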
Plant-Wide Diagnosis: Cause-and-Effect Analysis Using Process Connectivity and Directionality Information
Production plants used in modern process industry must produce products that meet stringent
environmental, quality and profitability constraints. In such integrated plants, non-linearity and
strong process dynamic interactions among process units complicate root-cause diagnosis of
plant-wide disturbances because disturbances may propagate to units at some distance away
from the primary source of the upset. Similarly, implemented advanced process control
strategies, backup and recovery systems, use of recycle streams and heat integration may
hamper detection and diagnostic efforts.
It is important to track down the root-cause of a plant-wide disturbance because once
corrective action is taken at the source, secondary propagated effects can be quickly eliminated
with minimum effort and reduced down time with the resultant positive impact on process
efficiency, productivity and profitability.
In order to diagnose the root-cause of disturbances that manifest plant-wide, it is crucial to
incorporate and utilize knowledge about the overall process topology or interrelated physical
structure of the plant, such as is contained in Piping and Instrumentation Diagrams (P&IDs).
Traditionally, process control engineers have intuitively referred to the physical structure of
the plant by visual inspection and manual tracing of fault propagation paths within the process
structures, such as the process drawings on printed P&IDs, in order to make logical
conclusions based on the results from data-driven analysis. This manual approach, however, is
prone to various sources of errors and can quickly become complicated in real processes.
The aim of this thesis, therefore, is to establish innovative techniques for the electronic
capture and manipulation of process schematic information from large plants such as
refineries in order to provide an automated means of diagnosing plant-wide performance
problems. This report also describes the design and implementation of a computer application
program that integrates: (i) process connectivity and directionality information from intelligent
P&IDs, (ii) results from data-driven cause-and-effect analysis of process measurements, and (iii)
process know-how to help process control engineers and plant operators gain process insight.
This work explored process intelligent P&IDs, created with AVEVA® P&ID, a Computer
Aided Design (CAD) tool, and exported as an ISO 15926-compliant, platform-independent and
vendor-independent text-based XML description of the plant. The XML output was processed by a
software tool developed in the Microsoft® .NET environment in this research project to
computationally generate a connectivity matrix that shows plant items and their connections.
The connectivity matrix produced can be exported to the Excel® spreadsheet application as a basis
for other applications and has served as a precursor to other research work. The final version of
the developed software tool links statistical results of cause-and-effect analysis of process data
with the connectivity matrix to simplify and gain insights into the cause and effect analysis
using the connectivity information. Process know-how and understanding are incorporated to
generate logical conclusions.
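As a rough illustration of the connectivity matrix described above, the sketch below builds one from a list of directed connections and traces a possible propagation path through it. It is a minimal sketch under assumptions: the plant items and connections are hypothetical, the input is a plain list of pairs rather than the ISO 15926 XML export, and breadth-first search stands in for whatever tracing the thesis tool performs.

```python
from collections import deque
from typing import List, Optional, Tuple


def connectivity_matrix(items: List[str], connections: List[Tuple[str, str]]) -> List[List[int]]:
    """matrix[i][j] == 1 when material or signal flows directly from items[i] to items[j]."""
    index = {name: k for k, name in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for upstream, downstream in connections:
        matrix[index[upstream]][index[downstream]] = 1
    return matrix


def propagation_path(
    items: List[str], matrix: List[List[int]], source: str, target: str
) -> Optional[List[str]]:
    """Shortest directed path from source to target, or None if target is unreachable."""
    index = {name: k for k, name in enumerate(items)}
    start, goal = index[source], index[target]
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(items[node])
                node = parent[node]
            return path[::-1]
        for nxt, connected in enumerate(matrix[node]):
            if connected and nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical plant items; the last connection models a heat-integration loop.
items = ["feed_pump", "preheater", "furnace", "column"]
connections = [
    ("feed_pump", "preheater"),
    ("preheater", "furnace"),
    ("furnace", "column"),
    ("column", "preheater"),
]
matrix = connectivity_matrix(items, connections)
print(propagation_path(items, matrix, "feed_pump", "column"))
print(propagation_path(items, matrix, "column", "feed_pump"))
```

A root-cause candidate from data-driven analysis is physically plausible only if such a directed path exists from it to each affected unit; here no path leads back to `feed_pump`, so a disturbance first seen in the column could not have propagated to the pump.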
The thesis presents a case study in an atmospheric crude heating unit as an illustrative example
to drive home key concepts and also describes an industrial case study involving refinery
operations. In the industrial case study, in addition to confirming the root-cause candidate, the
developed software tool was tasked with determining the physical sequence of the fault
propagation path within the plant.
This was then compared with the hypothesis about the disturbance propagation sequence
generated by a purely data-driven method. The results show a high degree of overlap, which helps
to validate the statistical data-driven technique and to easily identify any spurious results from the
data-driven multivariable analysis. This significantly increases control engineers' confidence in
the data-driven methods being used for root-cause diagnosis.
The thesis concludes with a discussion of the approach and presents ideas for further
development of the methods.