535,012 research outputs found
Joint morphological-lexical language modeling for processing morphologically rich languages with application to dialectal Arabic
Language modeling for an inflected language
such as Arabic poses new challenges for speech recognition and
machine translation due to its rich morphology. Rich morphology
results in large increases in out-of-vocabulary (OOV) rate and
poor language model parameter estimation in the absence of large
quantities of data. In this study, we present a joint
morphological-lexical language model (JMLLM) that takes
advantage of Arabic morphology. JMLLM combines
morphological segments with the underlying lexical items and
additional available information sources regarding
morphological segments and lexical items in a single joint model.
Joint representation and modeling of morphological and lexical
items reduces the OOV rate and provides smooth probability
estimates while keeping the predictive power of whole words.
Speech recognition and machine translation experiments in
dialectal Arabic show improvements over word-based and morpheme-based
trigram language models. We also show that as the
tightness of integration between different information sources
increases, speech recognition and machine translation performance both
improve.
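The core idea, scoring word sequences with both whole-word and morpheme-level statistics, can be sketched compactly. The code below is an illustrative simplification, not the JMLLM itself: the `segment` stub stands in for a real Arabic morphological analyzer, and the interpolation weight `lam`, the smoothing, and all names are assumptions.

```python
# Hedged sketch: interpolating word-level and morpheme-level trigram scores.
from collections import defaultdict

class Trigram:
    def __init__(self):
        self.tri = defaultdict(int)   # (w1, w2, w3) -> count
        self.bi = defaultdict(int)    # (w1, w2) -> count

    def train(self, tokens):
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.tri[(a, b, c)] += 1
            self.bi[(a, b)] += 1

    def prob(self, a, b, c, alpha=0.1, vocab=10_000):
        # Add-alpha smoothing keeps unseen trigrams from zeroing the product.
        return (self.tri[(a, b, c)] + alpha) / (self.bi[(a, b)] + alpha * vocab)

def segment(word):
    # Placeholder segmenter: a real system would use a morphological
    # analyzer; here we naively split off a two-character "prefix".
    return [word[:2], word[2:]] if len(word) > 4 else [word]

def joint_score(words, word_lm, morph_lm, lam=0.5):
    """Interpolate the word-based and morpheme-based trigram scores."""
    morphs = [m for w in words for m in segment(w)]
    p_word = p_morph = 1.0
    for ngram in zip(words, words[1:], words[2:]):
        p_word *= word_lm.prob(*ngram)
    for ngram in zip(morphs, morphs[1:], morphs[2:]):
        p_morph *= morph_lm.prob(*ngram)
    return lam * p_word + (1 - lam) * p_morph
```

Even this naive interpolation shows why morpheme statistics help: an OOV word can still receive a nonzero, smoothly estimated score through its segments.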
Geostatistical data integration in complex reservoirs
One of the most challenging issues in reservoir modeling is to integrate information coming from different sources at disparate scales and precision. The primary data are borehole measurements, but in most cases these are too sparse to construct accurate reservoir models, so they have to be supplemented with other, secondary data. The secondary data for reservoir modeling can be static data such as seismic data, or dynamic data such as production history, well test data or time-lapse seismic data. Several algorithms for integrating different types of data have been developed. A novel method for data integration based on the permanence of ratio hypothesis was proposed by Journel in 2002. The premise of the permanence of ratio hypothesis is to assess the information from each data source separately and then merge the information, accounting for the redundancy between the information sources. The redundancy between the information from different sources is accounted for using parameters (tau or nu parameters, Krishnan, 2004). The primary goal of this thesis is to derive a practical expression for the tau parameters and demonstrate the procedure for calibrating these parameters using the available data.

This thesis presents two new algorithms for data integration in reservoir modeling that overcome some of the limitations of current methods. First, we present an extension to the direct sampling based multiple-point statistics method, together with a methodology for integrating secondary soft data in that framework. The algorithm is based on direct pattern search through an ensemble of realizations. We show that the proposed methodology is suitable for modeling complex channelized reservoirs and reduces the uncertainty associated with production performance due to integration of secondary data. We subsequently present the permanence of ratio hypothesis for data integration in detail, giving analytical equations for calculating the redundancy factor for discrete or continuous variable modeling, and showing how this factor can be inferred from available data under different scenarios. We implement the method to model a carbonate reservoir in the Gulf of Mexico and show that it performs better than the traditional geostatistical framework in which primary hard and secondary soft data are used.
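The permanence-of-ratio combination itself is compact. Below is a minimal sketch of the tau model in its usual formulation (Journel 2002; Krishnan 2004), where each probability is converted to a distance ratio and the sources multiply in an update of the prior; the function name, the default tau values, and the example probabilities are illustrative assumptions.

```python
# Tau-model combination under the permanence of ratio hypothesis.
# p0: prior P(A); ps: single-source conditionals P(A|D_i); taus: redundancy weights.
def tau_combine(p0, ps, taus=None):
    taus = taus if taus is not None else [1.0] * len(ps)  # tau=1: strict permanence of ratios
    x0 = (1 - p0) / p0                  # prior "distance" to the event
    x = x0
    for p_i, tau_i in zip(ps, taus):
        x_i = (1 - p_i) / p_i           # distance given source i alone
        x *= (x_i / x0) ** tau_i        # each source updates the prior ratio
    return 1.0 / (1.0 + x)              # combined P(A | all sources)

# Example: prior 0.3, seismic suggests 0.6, well test suggests 0.7,
# with the second source treated as partly redundant (tau < 1).
print(tau_combine(0.3, [0.6, 0.7], taus=[1.0, 0.8]))
```

Calibrating the taus from data, as the thesis sets out to do, amounts to choosing these exponents so that the combined probabilities honor the observed joint behavior of the sources.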
Challenges in Modeling Geospatial Provenance
The surge in availability of geospatial data sources, the increased use of crowdsourced maps and the advent of geospatial mashups have brought us to an era where geospatial information is delivered to users after integration from diverse sources. Understanding the provenance of geospatial data is crucial for assessing its quality and deciding whether or not to trust the information. In this paper we describe user requirements for modeling geospatial provenance.
Memory-Based Learning: Using Similarity for Smoothing
This paper analyses the relation between the use of similarity in
Memory-Based Learning and the notion of backed-off smoothing in statistical
language modeling. We show that the two approaches are closely related, and we
argue that feature weighting methods in the Memory-Based paradigm can offer the
advantage of automatically specifying a suitable domain-specific hierarchy
between most specific and most general conditioning information without the
need for a large number of parameters. We report two applications of this
approach: PP-attachment and POS-tagging. Our method achieves state-of-the-art
performance in both domains, and allows the easy integration of diverse
information sources, such as rich lexical representations.
Comment: 8 pages, uses aclap.sty, To appear in Proc. ACL/EACL 9
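The connection the paper draws can be made concrete with a few lines of memory-based (k-nearest-neighbour) classification using information-gain feature weighting, in the spirit of IB1-IG. This is a hedged sketch, not the authors' implementation; all names and the toy example are assumptions.

```python
# Memory-based classification with information-gain feature weights.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(X, y, f):
    # H(y) minus the weighted entropy of y after splitting on feature f.
    by_value = {}
    for xi, yi in zip(X, y):
        by_value.setdefault(xi[f], []).append(yi)
    rem = sum(len(part) / len(y) * entropy(part) for part in by_value.values())
    return entropy(y) - rem

def classify(X, y, query, k=3):
    weights = [info_gain(X, y, f) for f in range(len(query))]
    # Weighted overlap metric: distance grows by w_f when feature f mismatches,
    # so informative features dominate the neighbourhood, like backing off
    # from the most specific to the most general conditioning information.
    dist = lambda xi: sum(w for w, a, b in zip(weights, xi, query) if a != b)
    nearest = sorted(zip(X, y), key=lambda p: dist(p[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy PP-attachment-style usage with (preposition, noun-tag) features.
X = [("on", "NN"), ("on", "VB"), ("in", "NN")]
y = ["noun-attach", "verb-attach", "noun-attach"]
print(classify(X, y, ("on", "NN"), k=1))
```

The feature weights play the role of the back-off hierarchy: exact matches on high-gain features are preferred, and the classifier degrades gracefully to more general evidence, without tuning a large parameter set.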
Introducing Dynamic Behavior in Amalgamated Knowledge Bases
The problem of integrating knowledge from multiple and heterogeneous sources
is a fundamental issue in current information systems. In order to cope with
this problem, the concept of mediator has been introduced as a software
component providing intermediate services, linking data resources and
application programs, and making transparent the heterogeneity of the
underlying systems. In designing a mediator architecture, we believe that an
important aspect is the definition of a formal framework by which one is able
to model integration in a declarative style. To this end, the use
of a logical approach seems very promising. Another important aspect is the
ability to model both static integration aspects, concerning query execution,
and dynamic ones, concerning data updates and their propagation among the
various data sources. Unfortunately, as far as we know, no formal proposals for
logically modeling mediator architectures from both a static and a dynamic
point of view have yet been developed. In this paper, we extend the framework for
amalgamated knowledge bases, presented by Subrahmanian, to deal with dynamic
aspects. The language we propose is based on the Active U-Datalog language, and
extends it with annotated logic and amalgamation concepts. We model the sources
of information and the mediator (also called supervisor) as Active U-Datalog
deductive databases, thus modeling queries, transactions, and active rules,
interpreted according to the PARK semantics. By using active rules, the system
can efficiently perform update propagation among different databases. The
result is a logical environment, integrating active and deductive rules, to
perform queries and update propagation in a heterogeneous mediated framework.
Comment: Other Keywords: Deductive databases; Heterogeneous databases; Active
rules; Update
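As a rough illustration of the mediator idea only (not of Active U-Datalog or the PARK semantics, which are far richer), the sketch below wires source databases to a supervisor whose active rules fire on inserts and propagate updates between sources; every class, rule, and fact name is an assumption made for the example.

```python
# Simplified mediator: sources hold facts; active (event-action) rules at the
# supervisor propagate updates among them.
class Source:
    def __init__(self, name):
        self.name, self.facts = name, set()

class Mediator:
    def __init__(self, sources):
        self.sources = {s.name: s for s in sources}
        self.rules = []  # (event predicate, action) pairs

    def on_insert(self, predicate, action):
        self.rules.append((predicate, action))

    def insert(self, source, fact):
        self.sources[source].facts.add(fact)
        for predicate, action in self.rules:   # fire matching active rules
            if fact[0] == predicate:
                action(self, fact)

    def query(self, predicate):
        # Amalgamation, naively: union the matching facts of every source.
        return {f for s in self.sources.values() for f in s.facts if f[0] == predicate}

db1, db2 = Source("db1"), Source("db2")
med = Mediator([db1, db2])
# Active rule: any "price" fact inserted into db1 is mirrored into db2.
med.on_insert("price", lambda m, f: m.sources["db2"].facts.add(f))
med.insert("db1", ("price", "widget", 10))
print(med.query("price"))
```

The paper's contribution is precisely what this sketch lacks: a declarative, logically grounded account of such rules, with queries, transactions, and update propagation given a single formal semantics.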
Construction of a taxonomy for requirements engineering commercial-off-the-shelf components
This article presents a procedure for constructing a taxonomy of COTS products in the field of Requirements Engineering (RE). The taxonomy and the information obtained bring substantial benefits to the selection of systems and tools that help RE-related actors simplify and facilitate their work. The taxonomy is built by means of a goal-oriented methodology inspired by GBRAM (Goal-Based Requirements Analysis Method), called GBTCM (Goal-Based Taxonomy Construction Method), which provides a guide to analyzing sources of information and modeling requirements and domains, as well as gathering and organizing knowledge about any segment of the COTS market. GBTCM aims to promote the use of standards and the reuse of requirements in order to support different processes of selection and integration of components.
A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration
In practical data integration systems, it is common for the data sources
being integrated to provide conflicting information about the same entity.
Consequently, a major challenge for data integration is to derive the most
complete and accurate integrated records from diverse and sometimes conflicting
sources. We term this challenge the truth finding problem. We observe that some
sources are generally more reliable than others, and therefore a good model of
source quality is the key to solving the truth finding problem. In this work,
we propose a probabilistic graphical model that can automatically infer true
records and source quality without any supervision. In contrast to previous
methods, our principled approach leverages a generative process of two types of
errors (false positive and false negative) by modeling two different aspects of
source quality. In so doing, ours is also the first approach designed to merge
multi-valued attribute types. Our method is scalable, due to an efficient
sampling-based inference algorithm that needs very few iterations in practice
and enjoys linear time complexity, with an even faster incremental variant.
Experiments on two real world datasets show that our new method outperforms
existing state-of-the-art approaches to the truth finding problem.
Comment: VLDB201
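To make the truth finding setting concrete, the sketch below alternates between weighted voting on claims and re-estimating each source's quality from its agreement with the current vote. It is a deliberately simplified stand-in for the paper's unsupervised probabilistic graphical model (it collapses the two error types into a single accuracy score); the prior, the smoothing, the iteration cap, and all names are assumptions.

```python
# Iterative truth finding: source quality weights the vote, the vote
# re-estimates source quality.
def truth_find(claims, iters=10):
    """claims: dict mapping source -> {entity: claimed value}."""
    quality = {s: 0.8 for s in claims}            # optimistic prior per source
    truth = {}
    for _ in range(iters):
        # Vote step: each source's claim counts with its current quality.
        votes = {}
        for s, kv in claims.items():
            for e, v in kv.items():
                votes.setdefault(e, {}).setdefault(v, 0.0)
                votes[e][v] += quality[s]
        truth = {e: max(vs, key=vs.get) for e, vs in votes.items()}
        # Quality step: smoothed fraction of a source's claims matching the vote.
        for s, kv in claims.items():
            hits = sum(truth[e] == v for e, v in kv.items())
            quality[s] = (hits + 1) / (len(kv) + 2)
    return truth, quality

claims = {"src_a": {"x": 1, "y": 2}, "src_b": {"x": 1, "y": 3}, "src_c": {"y": 3}}
print(truth_find(claims))
```

The paper goes further by modeling false positives and false negatives as distinct error processes, which is what lets it handle multi-valued attributes where a source may assert some correct values and omit or invent others.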
Estimation of COVID-19 spread curves integrating global data and borrowing information
Currently, novel coronavirus disease 2019 (COVID-19) is a major threat to
global health. The rapid spread of the virus has created a pandemic, and
countries all over the world are struggling with a surge in COVID-19 infected
cases. There are no drugs or other therapeutics approved by the US Food and
Drug Administration to prevent or treat COVID-19, and what information exists
on the disease is limited and scattered. This motivates the use of data
integration, combining data from diverse sources and eliciting useful
information with a unified view of them. In this paper, we propose a Bayesian
hierarchical model that integrates global data for real-time prediction of
infection trajectory for multiple countries. Because the proposed model takes
advantage of borrowing information across multiple countries, it outperforms an
existing individual country-based model. Because a fully Bayesian approach is
adopted, the model provides a powerful predictive tool endowed with uncertainty
quantification. Additionally, a joint variable selection technique has been
integrated into the proposed modeling scheme, aimed at identifying possible
country-level risk factors for severe disease due to COVID-19.
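The borrowing-information idea can be approximated outside a full Bayesian sampler by fitting a growth curve per country and shrinking each fit toward the pooled mean. The sketch below is a crude empirical-Bayes stand-in for the paper's hierarchical model, not its method; the logistic curve form, the shrinkage weight, the toy data, and all names are assumptions.

```python
# Partial pooling by shrinkage: per-country logistic fits pulled toward
# their global mean, mimicking how a hierarchy borrows strength.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    # K: final size, r: growth rate, t0: inflection time.
    return K / (1 + np.exp(-r * (t - t0)))

def fit_with_pooling(series, shrink=0.3):
    """series: dict country -> (t, cumulative_cases). Returns shrunk (K, r, t0)."""
    fits = {}
    for c, (t, y) in series.items():
        p, _ = curve_fit(logistic, t, y,
                         p0=[max(y) * 1.5, 0.2, np.median(t)], maxfev=10_000)
        fits[c] = np.array(p)
    pooled = np.mean(list(fits.values()), axis=0)   # global mean of parameters
    return {c: (1 - shrink) * p + shrink * pooled for c, p in fits.items()}

# Toy usage: two synthetic countries, one with observation noise.
t = np.arange(30)
toy = {"A": (t, logistic(t, 1000, 0.30, 15)
                 + np.random.default_rng(0).normal(0, 5, 30)),
       "B": (t, logistic(t, 800, 0.25, 18))}
print(fit_with_pooling(toy))
```

A country with short or noisy data gets pulled strongly toward the global curve, which is the intuition behind the paper's gain over fitting each country in isolation; the fully Bayesian hierarchy additionally yields calibrated uncertainty around each trajectory.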