    Learning from Noisy Data in Statistical Machine Translation

    This thesis develops methods that reduce the negative effects of noisy data in SMT systems and thereby improve system performance. The problem is addressed at two stages of the learning process: during preprocessing and during modeling. For preprocessing, two methods are developed that improve the statistical models by raising the quality of the training data. For modeling, several ways of weighting data according to their usefulness are presented. First, the effect of removing false positives from the parallel corpus is shown. A parallel corpus consists of a text in two languages in which each sentence in one language is paired with the corresponding sentence in the other language. It is assumed that both language versions contain the same number of sentences. False positives in this sense are sentence pairs that are paired in the parallel corpus but are not translations of each other. To detect them, a small, error-free parallel corpus (clean corpus) is assumed to be available. Using various lexical features, false positives are reliably filtered out before the modeling phase. An important lexical feature here is the bilingual lexicon generated from the clean corpus. Several heuristics are implemented in the extraction of this bilingual lexicon, and they lead to improved performance. Next, we consider the problem of extracting the most useful parts of the training data. To do so, we rank the data by their relevance to the target domain. This is done under the assumption that a good, representative tuning data set exists. Because such tuning data are typically of limited size, word similarities are used to extend the coverage of the tuning data.
The word similarities used in the previous step are crucial to the quality of the method. For this reason, the thesis presents several automatic methods for deriving such word similarities from monolingual and bilingual corpora. Interestingly, this is possible even with limited data, because monolingual data, which are available in large quantities, can also be used to determine word similarity. For bilingual data, which are often available only in limited amounts, additional language pairs that share at least one language with the given language pair can also be used. In the modeling step, we address the problem of noisy data by weighting the training data according to the quality of the corpus. We use statistical significance measures to find the less reliable sequences and reduce their weight. As in the previous approaches, word similarities are used to handle the problem of limited data. A further problem arises, however, once the absolute counts are replaced by weighted counts. This thesis develops techniques for smoothing the probabilities in this situation. The size of the training data becomes problematic when working with corpora of considerable volume. Two main difficulties arise: long training times and limited main memory. For the training-time problem, an algorithm is developed that distributes the computationally expensive calculations across multiple processors with shared memory. For the memory problem, special data structures and external-memory algorithms are used. This allows efficient training of extremely large models on hardware with limited memory.
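
The lexicon-based filtering of false positives described in this abstract can be pictured with a small sketch. It is an illustration under stated assumptions, not the method of the thesis: the coverage score, the threshold of 0.5, and the names `lexical_coverage` and `filter_pairs` are hypothetical, and the bilingual lexicon is assumed to be a plain mapping from source words to sets of target words.

```python
def lexical_coverage(src, tgt, lexicon):
    """Fraction of source words with at least one lexicon translation in the target.

    `lexicon` maps a source word to a set of target words (assumed format).
    """
    src_words = src.lower().split()
    tgt_words = set(tgt.lower().split())
    if not src_words:
        return 0.0
    hits = sum(1 for w in src_words if lexicon.get(w, set()) & tgt_words)
    return hits / len(src_words)


def filter_pairs(pairs, lexicon, threshold=0.5):
    """Keep sentence pairs whose coverage reaches the (hypothetical) threshold."""
    return [(s, t) for s, t in pairs if lexical_coverage(s, t, lexicon) >= threshold]


# Toy lexicon, standing in for one extracted from a clean corpus.
lexicon = {"das": {"the"}, "haus": {"house"}, "ist": {"is"}, "klein": {"small"}}
pairs = [
    ("das haus ist klein", "the house is small"),    # true translation
    ("das haus ist klein", "we went to the market"), # misaligned pair
]
kept = filter_pairs(pairs, lexicon)
```

A single coverage feature is the simplest possible choice; the abstract mentions several lexical features, which would be combined in a real filter.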

    Corruption: Where Does It Stand? - A Point of View

    Economic theory holds firmly that corruption seriously harms a country's economic growth and development. Developing countries, known for their weak institutions, suffer greatly from the negative effects of corruption. This paper attempts to explain the mechanism by which corrupt practices harm the state budget as well as public and private investment. Keywords: corruption, developing countries, private and public investment

    Improving In-Domain Data Selection for Small In-Domain Sets

    Finding sufficient in-domain text data for language modeling is a recurrent challenge. Several methods have been proposed that use a small amount of in-domain data to select the parts of out-of-domain text that most closely resemble it. Including this new “near-domain” data in training can potentially lead to better language model performance, while reducing training resources relative to incorporating all data. One popular, state-of-the-art selection process based on cross-entropy scores makes use of in-domain and out-of-domain language models. In order to compensate for the limited availability of the in-domain data required for this method, we introduce enhancements to two of its steps. Firstly, we improve the procedure for drawing the out-of-domain sample data used for selection. Secondly, we use word associations to extend the underlying vocabulary of the sample language models used for scoring. These enhancements are applied to selecting text for language modeling of talks given in a technical subject area. Besides comparing perplexity, we judge the resulting language models by their performance in automatic speech recognition and machine translation tasks. We evaluate our method in different contexts. We show that it yields consistent improvements of up to 2% absolute reduction in word error rate and 0.3 BLEU points. We achieve these improvements even given a much smaller in-domain set.
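
As an illustration of the baseline idea of selection by cross-entropy scores, the following sketch ranks candidate sentences by the difference between their cross-entropy under an in-domain model and under a model of an out-of-domain sample. It is a minimal sketch, not the paper's procedure: add-one smoothed unigram models stand in for real language models, the threshold of 0 is an assumption, and the helper names are hypothetical. The enhancements proposed in the abstract (sample drawing, vocabulary extension by word associations) are not shown.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab):
    """Add-one smoothed unigram model over a fixed vocabulary plus <unk>."""
    counts = Counter(
        w if w in vocab else "<unk>" for s in sentences for w in s.split()
    )
    total = sum(counts.values())
    size = len(vocab) + 1  # +1 for the <unk> token
    return lambda w: (counts[w if w in vocab else "<unk>"] + 1) / (total + size)


def cross_entropy(prob, sentence):
    """Per-word cross-entropy (bits) of a non-empty sentence under a model."""
    words = sentence.split()
    return -sum(math.log2(prob(w)) for w in words) / len(words)


def select(in_domain, out_sample, candidates, threshold=0.0):
    """Return candidates with H_in(s) - H_out(s) below the threshold, best first."""
    vocab = {w for s in in_domain + out_sample for w in s.split()}
    p_in = train_unigram(in_domain, vocab)
    p_out = train_unigram(out_sample, vocab)
    scored = [
        (cross_entropy(p_in, s) - cross_entropy(p_out, s), s)
        for s in candidates
        if s.split()  # skip empty sentences
    ]
    return [s for score, s in sorted(scored) if score < threshold]


in_domain = ["the neural network learns weights", "the network uses gradient descent"]
out_sample = ["the cat sat on the mat", "the dog chased the cat"]
selected = select(in_domain, out_sample, ["the network learns", "the cat sat"])
```

A lower score means a sentence looks more like the in-domain data than like the general out-of-domain sample, which is why selection keeps the lowest-scoring sentences.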

    German-Arabic Speech-to-Speech Translation for Psychiatric Diagnosis

    In this paper we present the Arabic-related natural language processing components of our German–Arabic speech-to-speech translation system, which is being deployed for interpretation during psychiatric diagnostic interviews. For this purpose we have built a pipelined speech-to-speech translation system consisting of automatic speech recognition, machine translation, text post-processing, and speech synthesis systems. We have implemented two pipelines, from German to Arabic and vice versa, to conduct interpreted two-way dialogues between psychiatrists and potential patients. All systems in our pipeline have been realized as all-neural end-to-end systems, using different architectures suitable for the different components. The speech recognition systems use an encoder/decoder + attention architecture, the machine translation system is based on the Transformer architecture, the post-processing for Arabic employs a sequence tagger for diacritization, and for the speech synthesis systems we use Tacotron 2 for generating spectrograms and WaveGlow as a vocoder. The speech translation is deployed in a server-based speech translation application that implements turn-based translation between a German-speaking psychiatrist administering the Mini-International Neuropsychiatric Interview (M.I.N.I.) and an Arabic-speaking person answering the interview. As this is a very specific domain, in addition to the linguistic challenges posed by translating between Arabic and German, we also focus in this paper on the methods we implemented for adapting our speech-to-speech translation system to the domain of this psychiatric interview.

    The KIT Translation Systems for IWSLT 2013

    In this paper, we present the KIT systems participating in all three official directions, namely English→German, German→English, and English→French, in the translation tasks of the IWSLT 2013 machine translation evaluation. Additionally, we present the results for our submissions to the optional directions English→Chinese and English→Arabic. We used phrase-based translation systems to generate the translations. This year, we focused on adapting the systems towards ASR input. Furthermore, we investigated different reordering models as well as an extended discriminative word lexicon. Finally, we added a data selection approach for domain adaptation.

    Eccentric Exercise in Treatment of Patellar Tendinopathy in High Level Basketball Players. A Randomized Clinical Trial.

    Chronic patellar tendinopathy is a common pathology in the sporting population. To date, there is no agreed-upon protocol as the treatment of choice. Eccentric exercises have been used with satisfactory outcomes (3). The purpose of this trial was to compare the effects of two eccentric exercise protocols.

    Combined Spoken Language Translation

    EU-BRIDGE is a European research project aimed at developing innovative speech translation technology. One of the collaborative efforts within EU-BRIDGE is to produce joint submissions of up to four different partners to the evaluation campaign at the 2014 International Workshop on Spoken Language Translation (IWSLT). We submitted combined translations to the German→English spoken language translation (SLT) track as well as to the German→English, English→German and English→French machine translation (MT) tracks. In this paper, we present the techniques which were applied by the different individual translation systems of RWTH Aachen University, the University of Edinburgh, Karlsruhe Institute of Technology, and Fondazione Bruno Kessler. We then show the combination approach developed at RWTH Aachen University which combined the individual systems. The consensus translations yield empirical gains of up to 2.3 points in BLEU and 1.2 points in TER compared to the best individual system.

    Metabolomics-based profiling with chemometric approach to identify bioactive compounds in Salacca zalacca fruits extracts and in silico molecular docking

    Salak (Salacca zalacca), commonly known as snake fruit, has been extensively studied throughout Southeast Asia for its antioxidative and antidiabetic metabolites. However, many metabolites remain unidentified because of their very low abundance and natural variation, and need to be explored further. Tandem mass spectrometry (MS/MS) now facilitates the tentative identification of unknown compounds in crude herbal extracts. This study describes the metabolite profiling of hydroalcoholic extracts of S. zalacca analysed by LC-QTOF-MS/MS. The 60% ethanolic extract exhibited the highest α-glucosidase inhibition and ferric reducing antioxidant power activities, with an IC50 of 15.94 mg/mL and 78.13 µg AAE/g, respectively. Multivariate data analysis (MVDA) using an orthogonal partial least-squares (OPLS) algorithm was conducted to correlate the α-glucosidase inhibition activity with the LC-QTOF-MS data. A total of 4 compounds were reported for the first time in this fruit and identified based on their molecular mass and fragment ions. LC-QTOF-MS analysis indicated the presence of carexane I, 5-phenoxytetrazol-1-yl)-2,3,5,6-hexahydrofurofuran-3-ethylurea, 3-acetylphenoxy)-N-[(2)-1-amino-4-methyl-1-oxopentan-2-yl]-4,5-dihydroxycyclohexene-1-carboxamide and ethyl 4-[5-methyl-2-oxo-1′,2′,5′,6′,7′,7′a-hexahydro-1H-spiro[indole-3,3′-pyrrolizine]-2′-ylamido]benzoate. Molecular docking of these compounds with the α-glucosidase enzyme was performed to confirm their antidiabetic potential. These bioactive compounds could be suggested as α-glucosidase inhibitors and functional food additives.

    Modulation of metabolic alterations of obese diabetic rats upon treatment with Salacca zalacca fruits extract using ¹H NMR-based metabolomics

    The fruit of salak (Salacca zalacca) is traditionally used and commercialized as an antidiabetic agent. However, scientific evidence to support this traditional use is lacking. This research aimed to evaluate the metabolic changes of obese-diabetic (OBDC) rats treated with S. zalacca fruit using a proton nuclear magnetic resonance (¹H NMR)-based metabolomics approach. This research presents the first report on the in vivo antidiabetic effect of S. zalacca fruit extract using this approach. The results indicated that the administration of 400 mg/kg bw of 60% ethanolic S. zalacca extract for 6 weeks significantly decreased the blood glucose level and normalized the blood lipid profile of the OBDC rats. The potential biomarkers in urine were 2-oxoglutarate, alanine, leucine, succinate, 3-hydroxybutyrate, taurine, betaine, allantoin, acetate, dimethylamine, creatine, creatinine, glucose, phenylacetylglycine, and hippurate. Based on the urinary metabolite profiles of the treated rats, the 60% ethanolic extract could not fully reverse the metabolic complications of the diabetic rats. The extract of S. zalacca fruit was able to decrease ketone bodies such as 3-hydroxybutyrate and acetoacetate. It also improved energy metabolism, involving glucose, acetate, lactate, 2-hydroxybutyrate, 2-oxoglutarate, citrate, and succinate. Moreover, it decreased metabolites from gut microflora, including choline. The extract had a significant effect on amino acid metabolism, metabolites from gut microflora, bile acid metabolism and creatine. The results further support the traditional claims for S. zalacca fruits in the management of diabetes. This finding might be valuable in understanding the molecular mechanism and pharmacological properties of this medicinal plant for managing diabetes mellitus.