30 research outputs found

    Sentence Alignment using MR and GA

    This paper presents two new approaches to aligning English-Arabic sentences in bilingual parallel corpora, based on mathematical regression (MR) and genetic algorithm (GA) classifiers. A feature vector is extracted from the text pair under consideration. This vector contains text features such as length, punctuation score, and cognate score. One set of manually prepared data was used to train the MR and GA models, and another set was used for testing. Both MR and GA outperform the length-based approach. Moreover, these new approaches are valid for any language pair and are quite flexible, since the feature vector may contain more, fewer, or different features than the ones used in the current research, such as a lexical matching feature or Hanzi characters in Japanese-Chinese texts.

    A Hybrid Accurate Alignment method for large Persian-English corpus construction based on statistical analysis and Lexicon/Persian Word net

    A bilingual corpus is a very important knowledge source and an essential requirement for many natural language processing (NLP) applications that involve two languages. For some languages, such as Persian, the lack of such resources is especially significant. Several applications, including statistical and example-based machine translation, need bilingual corpora in which large amounts of text from two different languages have been aligned at the sentence or phrase level. To meet this requirement, this paper proposes an accurate hybrid sentence alignment method for constructing an English-Persian parallel corpus. As the first step, the proposed method uses statistical length-based analysis to filter candidates. Punctuation marks are used as a guiding feature to reduce complexity and increase accuracy. Finally, the proposed method uses lexical knowledge to produce the final output. In the lexical analysis phase, a bilingual dictionary and a Persian semantic net (FarsNet) are used to calculate the extended semantic similarity. Experiments showed that expanding words with their synonyms through extended semantic similarity improves the accuracy of the sentence alignment process. The proposed matching scheme also uses a semantic-load-based approach, which treats the verb as the pivot and main part of a sentence, to increase accuracy. The experimental results were promising, and the generated parallel corpus can serve as an effective knowledge source for researchers working on the Persian language.

    Masadennin (The Little Prince in Bamana)

    The launch of an online concordance with parallel texts of The Little Prince in Bamana, French, and English is reported. Two working modes are available: parallel text output and interlinearized Bamana text output. In the parallel text mode, the following search options are available: entire word form; words starting with, ending in, or containing a given character sequence; and exact phrase search. The search options in the interlinearized mode also include part of speech, gloss, and lemma, but not exact phrase search. The parallel texts in the corpus are aligned by paragraph, and a simple algorithm is suggested for locating the corresponding sentence in the French or English text based on the position of the Bamana sentence. Automated sentence alignment based on so-called anchor items (the most frequent words) is briefly discussed for future applications.

    Query-by-Example Keyword Spotting

    This thesis examines modern approaches to keyword spotting and spoken term detection in speech data. It first introduces the problem and describes the theoretical basis of the detection methods. The data representations used for experiments and evaluation are then introduced. Keyword spotting methods in which the query is provided as an audio example (Query-by-Example) are presented. The evaluation methods and scoring techniques are described, followed by experiments on the datasets and a discussion of the results. Next, modern approaches for improving the detection system are proposed and implemented, and the improved system is evaluated and its results discussed. Finally, the work is assessed and future directions for the system's development are suggested. The appendix contains a user manual for the implemented scripts.

    Evaluating the Serious and Violent Offender Reentry Initiative within and across the New England Region

    The reentry of inmates into the community is a hotly debated topic in society today. A descriptive study was conducted to analyze what mental health services were being provided to inmates who participated in the Serious and Violent Offender Reentry Initiative (SVORI). Barriers to implementing comprehensive mental health services were also examined, along with trends in mental health services within and across the New England Region. The data came from three program evaluation surveys completed by SVORI Project Directors in 2003, 2005, and 2006. The results indicate that the mental health services provided to SVORI participants varied by state, with no comparable data and no consistent definition of mental health services. Inadequate referrals by facility staff were most often reported as a factor limiting participant enrollment in SVORI. These results have implications for the counseling field, as well as for society in general.

    The Wooster Voice (Wooster, OH), 1982-05-14

    The College updates its decade-old investment framework. Students request to be paid every two weeks rather than once per month. Students respond to a previous letter to the editor in which a woman named Elizabeth Koreman disparaged modern feminists. Voice staff rate the various pizza restaurants in Wooster. https://openworks.wooster.edu/voice1981-1990/1286/thumbnail.jp