
    Facilitating translation using source language paraphrase lattices

    For resource-limited language pairs, coverage of the test set by the parallel corpus is an important factor that affects translation quality in two respects: 1) out-of-vocabulary words; 2) the same information in an input sentence can be expressed in different ways, while current phrase-based SMT systems cannot automatically select an alternative way to transfer the same information. Therefore, given limited data, in order to facilitate translation from the input side, this paper proposes a novel method to reduce the translation difficulty using source-side lattice-based paraphrases. We utilise the original phrases from the input sentence and the corresponding paraphrases to build a lattice with estimated weights for each edge to improve translation quality. Compared to the baseline system, our method achieves relative improvements of 7.07%, 6.78% and 3.63% in terms of BLEU score on small-, medium- and large-scale English-to-Chinese translation tasks respectively. The results show that the proposed method is effective not only for resource-limited language pairs, but also, to some extent, for resource-sufficient pairs.
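
As a rough illustration of the lattice idea in the abstract above, the following sketch builds a small word lattice from an input sentence and a paraphrase table, then lists the sentences it encodes. Everything concrete here is an assumption made for illustration: the function names `build_lattice` and `enumerate_paths`, the toy paraphrase table, the weight of 1.0 on original words, and the choice to place a paraphrase's weight on its first edge. The abstract does not say how the edge weights are estimated.

```python
from typing import Dict, List, Tuple

# One lattice edge: (start_node, end_node, word, weight).
Edge = Tuple[int, int, str, float]


def build_lattice(tokens: List[str],
                  paraphrases: Dict[str, List[Tuple[str, float]]]) -> List[Edge]:
    """Build a word lattice holding the original sentence plus paraphrase paths.

    Nodes 0..len(tokens) lie on the original path. Each multi-word paraphrase
    gets fresh intermediate nodes and rejoins the original path at the end of
    the phrase it replaces.
    """
    n = len(tokens)
    # Original path; the weight 1.0 is an illustrative assumption.
    edges: List[Edge] = [(i, i + 1, tok, 1.0) for i, tok in enumerate(tokens)]
    next_node = n + 1
    for start in range(n):
        for end in range(start + 1, n + 1):
            phrase = " ".join(tokens[start:end])
            for para, weight in paraphrases.get(phrase, []):
                words = para.split()
                if not words:
                    continue
                prev = start
                for k, word in enumerate(words):
                    if k == len(words) - 1:
                        nxt = end            # rejoin the original path
                    else:
                        nxt = next_node      # fresh intermediate node
                        next_node += 1
                    # Assumption: the paraphrase weight sits on its first edge.
                    edges.append((prev, nxt, word, weight if k == 0 else 1.0))
                    prev = nxt
    return edges


def enumerate_paths(edges: List[Edge], start: int,
                    goal: int) -> List[Tuple[str, float]]:
    """Return every sentence encoded between start and goal with its path weight."""
    outgoing: Dict[int, List[Tuple[int, str, float]]] = {}
    for s, e, word, weight in edges:
        outgoing.setdefault(s, []).append((e, word, weight))
    results: List[Tuple[str, float]] = []

    def walk(node: int, words: List[str], score: float) -> None:
        if node == goal:
            results.append((" ".join(words), score))
            return
        for e, word, weight in outgoing.get(node, []):
            walk(e, words + [word], score * weight)

    walk(start, [], 1.0)
    return results


if __name__ == "__main__":
    tokens = "the film was great".split()
    table = {"film": [("movie", 0.6)], "was great": [("was very good", 0.3)]}
    for sentence, score in enumerate_paths(build_lattice(tokens, table), 0, len(tokens)):
        print(sentence, score)
```

In a real system the lattice would be passed to a decoder that accepts lattice input, and the weights would come from a trained paraphrase model rather than a hand-written table.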

    Improved phrase-based SMT with syntactic reordering patterns learned from lattice scoring

    In this paper, we present a novel approach to incorporate source-side syntactic reordering patterns into phrase-based SMT. The main contribution of this work is to use the lattice scoring approach to exploit and utilize reordering information that is favoured by the baseline PBSMT system. By referring to the parse trees of the training corpus, we represent the observed reorderings with source-side syntactic patterns. The extracted patterns are then used to convert the parsed inputs into word lattices, which contain both the original source sentences and their potential reorderings. Weights of the word lattices are estimated from the observations of the syntactic reordering patterns in the training corpus. Finally, the PBSMT system is tuned and tested on the generated word lattices to show the benefits of adding potential source-side reorderings in the inputs. We confirmed the effectiveness of our proposed method on a medium-sized corpus for a Chinese-English machine translation task. Our method outperformed the baseline system by 1.67% relative on a randomly selected testset and 8.56% relative on the NIST 2008 testset in terms of BLEU score.

    Incorporating source-language paraphrases into phrase-based SMT with confusion networks

    To increase the model coverage, source-language paraphrases have been utilized to boost SMT system performance. Previous work showed that word lattices constructed from paraphrases are able to reduce out-of-vocabulary words and to express inputs in different ways for better translation quality. However, such a word-lattice-based method suffers from two problems: 1) path duplications in word lattices decrease the capacities for potential paraphrases; 2) lattice decoding in SMT dramatically increases the search space and results in poor time efficiency. Therefore, in this paper, we adopt word confusion networks as the input structure to carry source-language paraphrase information. Similar to previous work, we use word lattices to build word confusion networks for merging of duplicated paths and faster decoding. Experiments are carried out on small-, medium- and large-scale English–Chinese translation tasks, and we show that compared with the word-lattice-based method, the decoding time on three tasks is reduced significantly (up to 79%) while comparable translation quality is obtained on the large-scale task.
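
To make the confusion network structure mentioned in the abstract above more concrete, here is a minimal sketch of the data structure only. It does not reproduce the paper's lattice-to-network conversion. The slot contents, the probabilities, the `<eps>` marker name, and the function names `best_path` and `count_paths` are all assumptions chosen for illustration.

```python
from math import prod
from typing import List, Tuple

EPS = "<eps>"  # empty-word marker: choosing it skips the slot

# A confusion network is a linear sequence of slots; each slot lists
# alternative words with a probability.
Slot = List[Tuple[str, float]]


def best_path(network: List[Slot]) -> Tuple[List[str], float]:
    """Pick the most probable alternative in every slot independently."""
    words: List[str] = []
    score = 1.0
    for slot in network:
        word, prob = max(slot, key=lambda alt: alt[1])
        score *= prob
        if word != EPS:
            words.append(word)
    return words, score


def count_paths(network: List[Slot]) -> int:
    """Number of distinct slot-by-slot choices the network encodes."""
    return prod(len(slot) for slot in network)


if __name__ == "__main__":
    network: List[Slot] = [
        [("the", 1.0)],
        [("film", 0.6), ("movie", 0.4)],
        [("was", 1.0)],
        [("very", 0.3), (EPS, 0.7)],
        [("great", 0.7), ("good", 0.3)],
    ]
    print(best_path(network))
    print(count_paths(network))
```

Because every path passes through every slot in order, alternatives that a lattice would store on separate, partly duplicated paths share one slot here. The trade-off is that a confusion network can also encode word combinations that were not present in the lattice it was built from.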

    Source-side syntactic reordering patterns with functional words for improved phrase-based SMT

    Inspired by previous source-side syntactic reordering methods for SMT, this paper focuses on using automatically learned syntactic reordering patterns with functional words which indicate structural reorderings between the source and target language. This approach takes advantage of phrase alignments and source-side parse trees for pattern extraction, and then filters out those patterns without functional words. Word lattices transformed by the generated patterns are fed into PBSMT systems to incorporate potential reorderings from the inputs. Experiments are carried out on a medium-sized corpus for a Chinese–English SMT task. The proposed method outperforms the baseline system by 1.38% relative on a randomly selected testset and 10.45% relative on the NIST 2008 testset in terms of BLEU score. Furthermore, a system with just 61.88% of the patterns filtered by functional words obtains performance comparable to the unfiltered one on the randomly selected testset, and achieves a 1.74% relative improvement on the NIST 2008 testset.

    Bayesian hierarchical modeling of colorectal and breast cancer data in Missouri

    Field of study: Statistics. Dr. Dongchu Sun, Thesis Supervisor. Includes vita. "May 2018." Data on cancer in the United States is collected through cancer registries. The Missouri Cancer Registry and Research Center (MCR-ARC) maintains a statewide cancer surveillance system and participates in research in support of the prevention of cancer and the reduction of the cancer burden among Missouri residents. We applied Bayesian hierarchical models to colorectal cancer (CRC) and breast cancer related data collected by the MCR-ARC. In the first project, CRC incidence and mortality rates in Missouri were studied with emphasis on different groups of people categorized by age, gender and county at diagnosis. The incidence and mortality data were aggregated into different spatial regions due to data confidentiality requirements, which was identified as a misaligned-region problem in the multivariate disease mapping literature. The Marginally and Conditionally CAR models were built to address the problem. Later on, colorectal cancer screening (CRCS) prevalences were analyzed because of the importance of screening to the early detection of CRC. We applied small area estimation techniques to produce county-level CRCS prevalences from the state-level Behavioral Risk Factor Surveillance System (BRFSS) data. The last two projects focused on breast cancer related data. One is a breast cancer survival analysis in Missouri with emphasis on detecting the spatial variation of survival time among counties, after accounting for differences in demographics and cancer stage. The other studies the disparities in breast cancer treatment delay with respect to patient race, age, stage of cancer, county at diagnosis and year of diagnosis. Includes bibliographical references (pages 172-179).