535,012 research outputs found
Joint morphological-lexical language modeling for processing morphologically rich languages with application to dialectal Arabic
Language modeling for an inflected language
such as Arabic poses new challenges for speech recognition and
machine translation due to its rich morphology. Rich morphology
results in large increases in out-of-vocabulary (OOV) rate and
poor language model parameter estimation in the absence of large
quantities of data. In this study, we present a joint
morphological-lexical language model (JMLLM) that takes
advantage of Arabic morphology. JMLLM combines
morphological segments with the underlying lexical items and
additional available information sources regarding
morphological segments and lexical items in a single joint model.
Joint representation and modeling of morphological and lexical
items reduces the OOV rate and provides smooth probability
estimates while keeping the predictive power of whole words.
Speech recognition and machine translation experiments in
dialectal Arabic show improvements over word-based and morpheme-based
trigram language models. We also show that as the
tightness of integration between different information sources
increases, speech recognition and machine translation performance both
improve.
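The core idea, scoring word sequences with both whole-word and morpheme-level statistics, can be sketched compactly. The code below is an illustrative simplification, not the JMLLM itself: the `segment` stub stands in for a real Arabic morphological analyzer, and the interpolation weight `lam`, the smoothing, and all names are assumptions.

```python
# Hedged sketch: interpolating word-level and morpheme-level trigram scores.
from collections import defaultdict

class Trigram:
    def __init__(self):
        self.tri = defaultdict(int)   # (w1, w2, w3) -> count
        self.bi = defaultdict(int)    # (w1, w2) -> count

    def train(self, tokens):
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.tri[(a, b, c)] += 1
            self.bi[(a, b)] += 1

    def prob(self, a, b, c, alpha=0.1, vocab=10_000):
        # Add-alpha smoothing keeps unseen trigrams from zeroing the product.
        return (self.tri[(a, b, c)] + alpha) / (self.bi[(a, b)] + alpha * vocab)

def segment(word):
    # Placeholder segmenter: a real system would use a morphological
    # analyzer; here we naively split off a two-character "prefix".
    return [word[:2], word[2:]] if len(word) > 4 else [word]

def joint_score(words, word_lm, morph_lm, lam=0.5):
    """Interpolate the word-based and morpheme-based trigram scores."""
    morphs = [m for w in words for m in segment(w)]
    p_word = p_morph = 1.0
    for ngram in zip(words, words[1:], words[2:]):
        p_word *= word_lm.prob(*ngram)
    for ngram in zip(morphs, morphs[1:], morphs[2:]):
        p_morph *= morph_lm.prob(*ngram)
    return lam * p_word + (1 - lam) * p_morph
```

Even this naive interpolation shows why morpheme statistics help: an OOV word can still receive a nonzero, smoothly estimated score through its segments.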
Geostatistical data integration in complex reservoirs
One of the most challenging issues in reservoir modeling is to integrate information coming from different sources at disparate scales and precision. The primary data are borehole measurements, but in most cases these are too sparse to construct accurate reservoir models, so they have to be supplemented with other, secondary data. The secondary data for reservoir modeling can be static data such as seismic data, or dynamic data such as production history, well test data or time-lapse seismic data. Several algorithms for integrating different types of data have been developed. A novel method for data integration based on the permanence of ratio hypothesis was proposed by Journel in 2002. The premise of the permanence of ratio hypothesis is to assess the information from each data source separately and then merge the information, accounting for the redundancy between the information sources. The redundancy between the information from different sources is accounted for using parameters (tau or nu parameters, Krishnan, 2004). The primary goal of this thesis is to derive a practical expression for the tau parameters and demonstrate the procedure for calibrating these parameters using the available data.

This thesis presents two new algorithms for data integration in reservoir modeling that overcome some of the limitations of current methods. First, we present an extension to the direct sampling based multiple-point statistics method, together with a methodology for integrating secondary soft data in that framework. The algorithm is based on direct pattern search through an ensemble of realizations. We show that the proposed methodology is suitable for modeling complex channelized reservoirs and reduces the uncertainty associated with production performance due to integration of secondary data. We subsequently present the permanence of ratio hypothesis for data integration in detail, giving analytical equations for calculating the redundancy factor for discrete or continuous variable modeling, and showing how this factor can be inferred from available data under different scenarios. We implement the method to model a carbonate reservoir in the Gulf of Mexico and show that it performs better than the traditional geostatistical framework in which primary hard and secondary soft data are used.
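The permanence-of-ratio combination itself is compact. Below is a minimal sketch of the tau model in its usual formulation (Journel 2002; Krishnan 2004), where each probability is converted to a distance ratio and the sources multiply in an update of the prior; the function name, the default tau values, and the example probabilities are illustrative assumptions.

```python
# Tau-model combination under the permanence of ratio hypothesis.
# p0: prior P(A); ps: single-source conditionals P(A|D_i); taus: redundancy weights.
def tau_combine(p0, ps, taus=None):
    taus = taus if taus is not None else [1.0] * len(ps)  # tau=1: strict permanence of ratios
    x0 = (1 - p0) / p0                  # prior "distance" to the event
    x = x0
    for p_i, tau_i in zip(ps, taus):
        x_i = (1 - p_i) / p_i           # distance given source i alone
        x *= (x_i / x0) ** tau_i        # each source updates the prior ratio
    return 1.0 / (1.0 + x)              # combined P(A | all sources)

# Example: prior 0.3, seismic suggests 0.6, well test suggests 0.7,
# with the second source treated as partly redundant (tau < 1).
print(tau_combine(0.3, [0.6, 0.7], taus=[1.0, 0.8]))
```

Calibrating the taus from data, as the thesis sets out to do, amounts to choosing these exponents so that the combined probabilities honor the observed joint behavior of the sources.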
Challenges in Modeling Geospatial Provenance
The surge in availability of geospatial data sources, the increased use of crowdsourced maps and the advent of geospatial mashups have brought us to an era where geospatial information is delivered to users after integration from diverse sources. Understanding the provenance of geospatial data is crucial for assessing its quality and deciding whether or not to trust the information. In this paper we describe user requirements for modeling geospatial provenance.
Memory-Based Learning: Using Similarity for Smoothing
This paper analyses the relation between the use of similarity in
Memory-Based Learning and the notion of backed-off smoothing in statistical
language modeling. We show that the two approaches are closely related, and we
argue that feature weighting methods in the Memory-Based paradigm can offer the
advantage of automatically specifying a suitable domain-specific hierarchy
between most specific and most general conditioning information without the
need for a large number of parameters. We report two applications of this
approach: PP-attachment and POS-tagging. Our method achieves state-of-the-art
performance in both domains, and allows the easy integration of diverse
information sources, such as rich lexical representations.
Comment: 8 pages, uses aclap.sty, To appear in Proc. ACL/EACL 9
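The connection the paper draws can be made concrete with a few lines of memory-based (k-nearest-neighbour) classification using information-gain feature weighting, in the spirit of IB1-IG. This is a hedged sketch, not the authors' implementation; all names and the toy example are assumptions.

```python
# Memory-based classification with information-gain feature weights.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(X, y, f):
    # H(y) minus the weighted entropy of y after splitting on feature f.
    by_value = {}
    for xi, yi in zip(X, y):
        by_value.setdefault(xi[f], []).append(yi)
    rem = sum(len(part) / len(y) * entropy(part) for part in by_value.values())
    return entropy(y) - rem

def classify(X, y, query, k=3):
    weights = [info_gain(X, y, f) for f in range(len(query))]
    # Weighted overlap metric: distance grows by w_f when feature f mismatches,
    # so informative features dominate the neighbourhood, like backing off
    # from the most specific to the most general conditioning information.
    dist = lambda xi: sum(w for w, a, b in zip(weights, xi, query) if a != b)
    nearest = sorted(zip(X, y), key=lambda p: dist(p[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy PP-attachment-style usage with (preposition, noun-tag) features.
X = [("on", "NN"), ("on", "VB"), ("in", "NN")]
y = ["noun-attach", "verb-attach", "noun-attach"]
print(classify(X, y, ("on", "NN"), k=1))
```

The feature weights play the role of the back-off hierarchy: exact matches on high-gain features are preferred, and the classifier degrades gracefully to more general evidence, without tuning a large parameter set.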
Introducing Dynamic Behavior in Amalgamated Knowledge Bases
The problem of integrating knowledge from multiple and heterogeneous sources
is a fundamental issue in current information systems. In order to cope with
this problem, the concept of mediator has been introduced as a software
component providing intermediate services, linking data resources and
application programs, and making transparent the heterogeneity of the
underlying systems. In designing a mediator architecture, we believe that an
important aspect is the definition of a formal framework by which one is able
to model integration in a declarative style. To this end, the use
of a logical approach seems very promising. Another important aspect is the
ability to model both static integration aspects, concerning query execution,
and dynamic ones, concerning data updates and their propagation among the
various data sources. Unfortunately, as far as we know, no formal proposals for
logically modeling mediator architectures from both a static and a dynamic
point of view have yet been developed. In this paper, we extend the framework for
amalgamated knowledge bases, presented by Subrahmanian, to deal with dynamic
aspects. The language we propose is based on the Active U-Datalog language, and
extends it with annotated logic and amalgamation concepts. We model the sources
of information and the mediator (also called supervisor) as Active U-Datalog
deductive databases, thus modeling queries, transactions, and active rules,
interpreted according to the PARK semantics. By using active rules, the system
can efficiently perform update propagation among different databases. The
result is a logical environment, integrating active and deductive rules, to
perform queries and update propagation in a heterogeneous mediated framework.
Comment: Other Keywords: Deductive databases; Heterogeneous databases; Active
rules; Update
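As a rough illustration of the mediator idea only (not of Active U-Datalog or the PARK semantics, which are far richer), the sketch below wires source databases to a supervisor whose active rules fire on inserts and propagate updates between sources; every class, rule, and fact name is an assumption made for the example.

```python
# Simplified mediator: sources hold facts; active (event-action) rules at the
# supervisor propagate updates among them.
class Source:
    def __init__(self, name):
        self.name, self.facts = name, set()

class Mediator:
    def __init__(self, sources):
        self.sources = {s.name: s for s in sources}
        self.rules = []  # (event predicate, action) pairs

    def on_insert(self, predicate, action):
        self.rules.append((predicate, action))

    def insert(self, source, fact):
        self.sources[source].facts.add(fact)
        for predicate, action in self.rules:   # fire matching active rules
            if fact[0] == predicate:
                action(self, fact)

    def query(self, predicate):
        # Amalgamation, naively: union the matching facts of every source.
        return {f for s in self.sources.values() for f in s.facts if f[0] == predicate}

db1, db2 = Source("db1"), Source("db2")
med = Mediator([db1, db2])
# Active rule: any "price" fact inserted into db1 is mirrored into db2.
med.on_insert("price", lambda m, f: m.sources["db2"].facts.add(f))
med.insert("db1", ("price", "widget", 10))
print(med.query("price"))
```

The paper's contribution is precisely what this sketch lacks: a declarative, logically grounded account of such rules, with queries, transactions, and update propagation given a single formal semantics.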
Construction of a taxonomy for requirements engineering commercial-off-the-shelf components
This article presents a procedure for constructing a taxonomy of COTS products in the field of Requirements Engineering (RE). The taxonomy and the information obtained bring substantial benefits to the selection of systems and tools that help RE-related actors simplify and facilitate their work. The taxonomy is built by means of a goal-oriented methodology inspired by GBRAM (Goal-Based Requirements Analysis Method), called GBTCM (Goal-Based Taxonomy Construction Method), which provides a guide to analyzing sources of information and modeling requirements and domains, as well as gathering and organizing knowledge about any segment of the COTS market. GBTCM aims to promote the use of standards and the reuse of requirements in order to support different processes of selection and integration of components.
A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration
In practical data integration systems, it is common for the data sources
being integrated to provide conflicting information about the same entity.
Consequently, a major challenge for data integration is to derive the most
complete and accurate integrated records from diverse and sometimes conflicting
sources. We term this challenge the truth finding problem. We observe that some
sources are generally more reliable than others, and therefore a good model of
source quality is the key to solving the truth finding problem. In this work,
we propose a probabilistic graphical model that can automatically infer true
records and source quality without any supervision. In contrast to previous
methods, our principled approach leverages a generative process of two types of
errors (false positive and false negative) by modeling two different aspects of
source quality. In so doing, ours is also the first approach designed to merge
multi-valued attribute types. Our method is scalable, due to an efficient
sampling-based inference algorithm that needs very few iterations in practice
and enjoys linear time complexity, with an even faster incremental variant.
Experiments on two real world datasets show that our new method outperforms
existing state-of-the-art approaches to the truth finding problem.
Comment: VLDB201
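To make the truth finding setting concrete, the sketch below alternates between weighted voting on claims and re-estimating each source's quality from its agreement with the current vote. It is a deliberately simplified stand-in for the paper's unsupervised probabilistic graphical model (it collapses the two error types into a single accuracy score); the prior, the smoothing, the iteration cap, and all names are assumptions.

```python
# Iterative truth finding: source quality weights the vote, the vote
# re-estimates source quality.
def truth_find(claims, iters=10):
    """claims: dict mapping source -> {entity: claimed value}."""
    quality = {s: 0.8 for s in claims}            # optimistic prior per source
    truth = {}
    for _ in range(iters):
        # Vote step: each source's claim counts with its current quality.
        votes = {}
        for s, kv in claims.items():
            for e, v in kv.items():
                votes.setdefault(e, {}).setdefault(v, 0.0)
                votes[e][v] += quality[s]
        truth = {e: max(vs, key=vs.get) for e, vs in votes.items()}
        # Quality step: smoothed fraction of a source's claims matching the vote.
        for s, kv in claims.items():
            hits = sum(truth[e] == v for e, v in kv.items())
            quality[s] = (hits + 1) / (len(kv) + 2)
    return truth, quality

claims = {"src_a": {"x": 1, "y": 2}, "src_b": {"x": 1, "y": 3}, "src_c": {"y": 3}}
print(truth_find(claims))
```

The paper goes further by modeling false positives and false negatives as distinct error processes, which is what lets it handle multi-valued attributes where a source may assert some correct values and omit or invent others.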
Estimation of COVID-19 spread curves integrating global data and borrowing information
Currently, novel coronavirus disease 2019 (COVID-19) is a major threat to
global health. The rapid spread of the virus has created a pandemic, and
countries all over the world are struggling with a surge in COVID-19 infected
cases. There are no drugs or other therapeutics approved by the US Food and
Drug Administration to prevent or treat COVID-19, and what information exists
on the disease is limited and scattered. This motivates the use of data
integration, combining data from diverse sources and eliciting useful
information with a unified view of them. In this paper, we propose a Bayesian
hierarchical model that integrates global data for real-time prediction of
infection trajectory for multiple countries. Because the proposed model takes
advantage of borrowing information across multiple countries, it outperforms an
existing individual country-based model. Because a fully Bayesian approach is
adopted, the model provides a powerful predictive tool endowed with uncertainty
quantification. Additionally, a joint variable selection technique has been
integrated into the proposed modeling scheme, aimed at identifying possible
country-level risk factors for severe disease due to COVID-19.
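The borrowing-information idea can be approximated outside a full Bayesian sampler by fitting a growth curve per country and shrinking each fit toward the pooled mean. The sketch below is a crude empirical-Bayes stand-in for the paper's hierarchical model, not its method; the logistic curve form, the shrinkage weight, the toy data, and all names are assumptions.

```python
# Partial pooling by shrinkage: per-country logistic fits pulled toward
# their global mean, mimicking how a hierarchy borrows strength.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, K, r, t0):
    # K: final size, r: growth rate, t0: inflection time.
    return K / (1 + np.exp(-r * (t - t0)))

def fit_with_pooling(series, shrink=0.3):
    """series: dict country -> (t, cumulative_cases). Returns shrunk (K, r, t0)."""
    fits = {}
    for c, (t, y) in series.items():
        p, _ = curve_fit(logistic, t, y,
                         p0=[max(y) * 1.5, 0.2, np.median(t)], maxfev=10_000)
        fits[c] = np.array(p)
    pooled = np.mean(list(fits.values()), axis=0)   # global mean of parameters
    return {c: (1 - shrink) * p + shrink * pooled for c, p in fits.items()}

# Toy usage: two synthetic countries, one with observation noise.
t = np.arange(30)
toy = {"A": (t, logistic(t, 1000, 0.30, 15)
                 + np.random.default_rng(0).normal(0, 5, 30)),
       "B": (t, logistic(t, 800, 0.25, 18))}
print(fit_with_pooling(toy))
```

A country with short or noisy data gets pulled strongly toward the global curve, which is the intuition behind the paper's gain over fitting each country in isolation; the fully Bayesian hierarchy additionally yields calibrated uncertainty around each trajectory.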