
    Integrating optical character recognition and machine translation of historical documents

    Machine Translation (MT) plays a critical role in expanding capacity in the translation industry. However, many valuable documents, including digital documents, are encoded in formats that are inaccessible to machine processing (e.g., historical or legal documents). Such documents must be passed through a process of Optical Character Recognition (OCR) to render the text suitable for MT. No matter how good the OCR is, this process introduces recognition errors, which often render MT ineffective. In this paper, we propose a new OCR-to-MT framework that adds an OCR error-correction module to enhance the overall quality of translation. Experiments show that our correction system, which combines language-modeling and translation methods, outperforms the baseline system by nearly 30% relative improvement.
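The abstract only names the ingredients of the correction module, so the following is a minimal illustrative sketch of one common way to combine them: a noisy-channel post-corrector that generates candidate fixes for known OCR confusions and lets a language model select among them. The confusion table, the toy unigram frequencies, and all function names are invented for illustration; they are not the paper's actual components.

```python
# Illustrative OCR post-correction sketch (not the paper's system):
# undo common OCR character confusions, then let a toy "language
# model" (unigram frequency table) pick the most plausible candidate.

OCR_CONFUSIONS = {"0": "o", "1": "l", "rn": "m", "vv": "w"}

# Stand-in for a real language model trained on clean text.
LM_FREQ = {"modern": 9, "modem": 2, "world": 8, "word": 5}

def candidates(token):
    """Generate correction candidates by undoing one OCR confusion."""
    cands = {token}
    for wrong, right in OCR_CONFUSIONS.items():
        if wrong in token:
            cands.add(token.replace(wrong, right))
    return cands

def correct(token):
    """Select the candidate the language model scores highest."""
    return max(candidates(token), key=lambda c: LM_FREQ.get(c, 0))

def correct_line(line):
    """Correct a whitespace-tokenized line before handing it to MT."""
    return " ".join(correct(t) for t in line.split())
```

For example, `correct_line("m0dern w0rld")` maps the OCR digit-for-letter confusions back to `"modern world"`; a real system would use a full statistical LM and translation-derived scores instead of the unigram table.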

    Theorizing EFL Teachers’ Perspectives and Rationales on Providing Corrective Feedback

    Researchers condemn teachers by saying that tradition, rather than research findings, drives their practice, while teachers condemn researchers by saying that their research findings are universal generalizations that fail in practice. To turn mutual distrust into mutual trust, this data-driven study aims at theorizing practice rather than enlightening practice through theory-driven research. Theoretical sampling of twenty EFL teachers' perspectives on corrective feedback, together with the rigorous coding schemes of grounded theory, yielded a set of context-sensitive corrective-feedback techniques: direct feedback; indirect feedback, such as recasts, providing an alternative, asking other students, pausing before the error, providing the rule, using the correct structure, and showing surprise; feedback through other language skills, including writing and listening; and no correction, on cognitive, affective, and information-processing grounds. Moreover, the analysis uncovered a set of specifications on when, where, and why to use these techniques. Not only do the findings help practitioners gain insights and improve how they provide feedback, but they also help researchers modify their hypotheses before testing them through quantitative research aimed at generalization.

    Holaaa!! Writin like u talk is kewl but kinda hard 4 NLP

    We present work in progress aiming to build tools for the normalization of User-Generated Content (UGC). As we will see, the task requires revisiting the initial steps of NLP processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics, and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews, and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language-processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging, and finally we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker.
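The "selector on top of a pre-existing spell-checker" strategy can be sketched roughly as follows, under stated assumptions: a toy edit-distance spell-checker proposes in-lexicon candidates, an abbreviation table handles texting-style forms, and a frequency-based selector picks among candidates. The abbreviation table, lexicon, and frequencies below are invented for illustration and are not the authors' resources.

```python
# Illustrative UGC normalization sketch (not the paper's system):
# expand texting-style abbreviations, otherwise generate
# edit-distance-1 candidates and select the best in-lexicon form.

# Hypothetical Spanish texting abbreviations and toy lexicon.
ABBREVIATIONS = {"q": "que", "xq": "porque", "tb": "también"}
LEXICON_FREQ = {"hola": 10, "que": 9, "bueno": 6, "también": 5}

def edits1(word):
    """All strings one deletion, substitution, or insertion away."""
    letters = "abcdefghijklmnopqrstuvwxyzáéíóúñ"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    subs = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    return deletes | subs | inserts

def normalize(token):
    """Expand known abbreviations, else select the best lexicon form."""
    if token in ABBREVIATIONS:
        return ABBREVIATIONS[token]
    if token in LEXICON_FREQ:
        return token  # already a correct form
    cands = edits1(token) & LEXICON_FREQ.keys()
    return max(cands, key=LEXICON_FREQ.get) if cands else token
```

Here `normalize("xq")` yields `"porque"` via the abbreviation table, while `normalize("holq")` yields `"hola"` via the spell-checker candidates plus the frequency selector; forms the sketch cannot resolve are passed through unchanged.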