10,613 research outputs found
Automated DNA Motif Discovery
Ensembl's human non-coding and protein coding genes are used to automatically
find DNA pattern motifs. The Backus-Naur form (BNF) grammar for regular
expressions (RE) is used by genetic programming to ensure the generated strings
are legal. The evolved motif suggests the presence of Thymine followed by one
or more Adenines etc. early in transcripts indicate a non-protein coding gene.
Keywords: pseudogene, short and microRNAs, non-coding transcripts, systems
biology, machine learning, Bioinformatics, motif, regular expression, strongly
typed genetic programming, context-free grammar.Comment: 12 pages, 2 figure
Identifying the machine translation error types with the greatest impact on post-editing effort
Translation Environment Tools make translators' work easier by providing them with term lists, translation memories and machine translation output. Ideally, such tools automatically predict whether it is more effortful to post-edit than to translate from scratch, and determine whether or not to provide translators with machine translation output. Current machine translation quality estimation systems heavily rely on automatic metrics, even though they do not accurately capture actual post-editing effort. In addition, these systems do not take translator experience into account, even though novices' translation processes are different from those of professional translators. In this paper, we report on the impact of machine translation errors on various types of post-editing effort indicators, for professional translators as well as student translators. We compare the impact of MT quality on a product effort indicator (HTER) with that on various process effort indicators. The translation and post-editing process of student translators and professional translators was logged with a combination of keystroke logging and eye-tracking, and the MT output was analyzed with a fine-grained translation quality assessment approach. We find that most post-editing effort indicators (product as well as process) are influenced by machine translation quality, but that different error types affect different post-editing effort indicators, confirming that a more fine-grained MT quality analysis is needed to correctly estimate actual post-editing effort. Coherence, meaning shifts, and structural issues are shown to be good indicators of post-editing effort. The additional impact of experience on these interactions between MT quality and post-editing effort is smaller than expected
Ontologies and Information Extraction
This report argues that, even in the simplest cases, IE is an ontology-driven
process. It is not a mere text filtering method based on simple pattern
matching and keywords, because the extracted pieces of texts are interpreted
with respect to a predefined partial domain model. This report shows that
depending on the nature and the depth of the interpretation to be done for
extracting the information, more or less knowledge must be involved. This
report is mainly illustrated in biology, a domain in which there are critical
needs for content-based exploration of the scientific literature and which
becomes a major application domain for IE
A syntactified direct translation model with linear-time decoding
Recent syntactic extensions of statistical translation models work with a synchronous context-free or tree-substitution grammar extracted from an automatically parsed parallel corpus. The decoders accompanying these extensions typically exceed quadratic time complexity. This paper extends the Direct Translation Model 2 (DTM2) with syntax while maintaining linear-time decoding. We employ a linear-time parsing algorithm based on an eager, incremental interpretation of Combinatory Categorial Grammar
(CCG). As every input word is processed, the local parsing decisions resolve ambiguity eagerly, by selecting a single
supertagâoperator pair for extending the dependency parse incrementally. Alongside translation features extracted from
the derived parse tree, we explore syntactic features extracted from the incremental derivation process. Our empirical experiments show that our model significantly
outperforms the state-of-the art DTM2 system
- âŠ