5 research outputs found

    Improved HMM Alignment Models for Languages with Scarce Resources

    Get PDF
    We introduce improvements to statistical word alignment based on the Hidden Markov Model. One improvement incorporates syntac-tic knowledge. Results on the workshop data show that alignment performance exceeds that of a state-of-the art system based on more com-plex models, resulting in over a 5.5 % absolute reduction in error on Romanian-English.

    A Bayesian model for joint word alignment and part-of-speech transfer

    Get PDF
    Current methods for word alignment require considerable amounts of parallel text to deliver accurate results, a requirement which is met only for a small minority of the world's approximately 7,000 languages. We show that by jointly performing word alignment and annotation transfer in a novel Bayesian model, alignment accuracy can be improved for language pairs where annotations are available for only one of the languages---a finding which could facilitate the study and processing of a vast number of low-resource languages. We also present an evaluation where our method is used to perform single-source and multi-source part-of-speech transfer with 22 translations of the same text in four different languages. This allows us to quantify the considerable variation in accuracy depending on the specific source text(s) used, even with different translations into the same language.Non peer reviewe

    The Kinome of Pacific Oyster Crassostrea gigas, Its Expression during Development and in Response to Environmental Factors

    Get PDF
    Oysters play an important role in estuarine and coastal marine habitats, where the majority of humans live. In these ecosystems, environmental degradation is substantial, and oysters must cope with highly dynamic and stressful environmental constraints during their lives in the intertidal zone. The availability of the genome sequence of the Pacific oyster Crassostrea gigas represents a unique opportunity for a comprehensive assessment of the signal transduction pathways that the species has developed to deal with this unique habitat. We performed an in silico analysis to identify, annotate and classify protein kinases in C. gigas, according to their kinase domain taxonomy classification, and compared with kinome already described in other animal species. The C. gigas kinome consists of 371 protein kinases, making it closely related to the sea urchin kinome, which has 353 protein kinases. The absence of gene redundancy in some groups of the C. gigas kinome may simplify functional studies of protein kinases. Through data mining of transcriptomes in C. gigas, we identified part of the kinome which may be central during development and may play a role in response to various environmental factors. Overall, this work contributes to a better understanding of key sensing pathways that may be central for adaptation to a highly dynamic marine environment

    Constrained word alignment models for statistical machine translation

    Get PDF
    Word alignment is a fundamental and crucial component in Statistical Machine Translation (SMT) systems. Despite the enormous progress made in the past two decades, this task remains an active research topic simply because the quality of word alignment is still far from optimal. Most state-of-the-art word alignment models are grounded on statistical learning theory treating word alignment as a general sequence alignment problem, where many linguistically motivated insights are not incorporated. In this thesis, we propose new word alignment models with linguistically motivated constraints in a bid to improve the quality of word alignment for Phrase-Based SMT systems (PB-SMT). We start the exploration with an investigation into segmentation constraints for word alignment by proposing a novel algorithm, namely word packing, which is motivated by the fact that one concept expressed by one word in one language can frequently surface as a compound or collocation in another language. Our algorithm takes advantage of the interaction between segmentation and alignment, starting with some segmentation for both the source and target language and updating the segmentation with respect to the word alignment results using state-of-the-art word alignment models; thereafter a refined word alignment can be obtained based on the updated segmentation. In this process, the updated segmentation acts as a hard constraint on the word alignment models and reduces the complexity of the alignment models by generating more 1-to-1 correspondences through word packing. Experimental results show that this algorithm can lead to statistically significant improvements over the state-of-the-art word alignment models. Given that word packing imposes "hard" segmentation constraints on the word aligner, which is prone to introducing noise, we propose two new word alignment models using syntactic dependencies as soft constraints. The first model is a syntactically enhanced discriminative word alignment model, where we use a set of feature functions to express the syntactic dependency information encoded in both source and target languages. One the one hand, this model enjoys great flexibility in its capacity to incorporate multiple features; on the other hand, this model is designed to facilitate model tuning for different objective functions. Experimental results show that using syntactic constraints can improve the performance of the discriminative word alignment model, which also leads to better PB-SMT performance compared to using state-of-the-art word alignment models. The second model is a syntactically constrained generative word alignment model, where we add in a syntactic coherence model over the target phrases in the context of HMM word-to-phrase alignment. The advantages of our model are that (i) the addition of the syntactic coherence model preserves the efficient parameter estimation procedures; and (ii) the flexibility of the model can be increased so that it can be tuned according to different objective functions. Experimental results show that tuning this model properly leads to a significant gain in MT performance over the state-of-the-art

    Unsupervised Graph-Based Similarity Learning Using Heterogeneous Features.

    Full text link
    Relational data refers to data that contains explicit relations among objects. Nowadays, relational data are universal and have a broad appeal in many different application domains. The problem of estimating similarity between objects is a core requirement for many standard Machine Learning (ML), Natural Language Processing (NLP) and Information Retrieval (IR) problems such as clustering, classiffication, word sense disambiguation, etc. Traditional machine learning approaches represent the data using simple, concise representations such as feature vectors. While this works very well for homogeneous data, i.e, data with a single feature type such as text, it does not exploit the availability of dfferent feature types fully. For example, scientic publications have text, citations, authorship information, venue information. Each of the features can be used for estimating similarity. Representing such objects has been a key issue in efficient mining (Getoor and Taskar, 2007). In this thesis, we propose natural representations for relational data using multiple, connected layers of graphs; one for each feature type. Also, we propose novel algorithms for estimating similarity using multiple heterogeneous features. Also, we present novel algorithms for tasks like topic detection and music recommendation using the estimated similarity measure. We demonstrate superior performance of the proposed algorithms (root mean squared error of 24.81 on the Yahoo! KDD Music recommendation data set and classiffication accuracy of 88% on the ACL Anthology Network data set) over many of the state of the art algorithms, such as Latent Semantic Analysis (LSA), Multiple Kernel Learning (MKL) and spectral clustering and baselines on large, standard data sets.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/89824/1/mpradeep_1.pd
    corecore