69 research outputs found

    Hybrid model of post-processing techniques for Arabic optical character recognition

    Optical character recognition (OCR) is used to extract text contained in an image. One of the stages in OCR is post-processing, which corrects errors in the OCR output text. The OCR multiple-outputs approach consists of three processes: differentiation, alignment, and voting. Existing differentiation techniques suffer from the loss of important features because they use N versions of the input images. Alignment techniques in the literature, in turn, are based on approximation, and the voting process is not context-aware. These drawbacks lead to a high error rate in OCR. This research proposed three improved techniques of differentiation, alignment, and voting to overcome the identified drawbacks. These techniques were then combined into a hybrid model that can recognize optical characters in the Arabic language. Each of the proposed techniques was separately evaluated against three other relevant existing techniques. The performance measurements used in this study were Word Error Rate (WER), Character Error Rate (CER), and Non-word Error Rate (NWER). Experimental results showed a relative decrease in error rate on all measurements for the evaluated techniques. Similarly, the hybrid model obtained WER, CER, and NWER that were lower by 30.35%, 52.42%, and 47.86% respectively when compared to the three relevant existing models. This study contributes to the OCR domain, as the proposed hybrid model of post-processing techniques could facilitate the automatic recognition of Arabic text and hence lead to better information retrieval.
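    For readers unfamiliar with the metrics named in this abstract, the following is a minimal sketch of how CER and WER are commonly computed, assuming the usual definition (Levenshtein edit distance divided by the reference length). The function names are illustrative, and the abstract does not state that the authors computed their scores this way.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

    Because both functions operate on Unicode strings, the same sketch applies to Arabic text. NWER is omitted because it would require a lexicon, which the abstract does not describe.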

    Action Categorisation in Multimodal Instructions

    We present an explorative study for the (semi-)automatic categorisation of actions in Dutch multimodal first aid instructions, where the actions needed to successfully execute the procedure in question are presented verbally and in pictures. We start with the categorisation of verbalised actions and expect that this will later facilitate the identification of those actions in the pictures, which is known to be hard. Comparisons of, and user-based experimentation with, the verbal and visual representations will allow us to determine the effectiveness of picture-text combinations and will eventually support the automatic generation of multimodal documents. We used Natural Language Processing tools to identify and categorise 2,388 verbs in a corpus of 78 multimodal instructions (MIs). We show that the main action structure of an instruction can be retrieved through verb identification using the Alpino parser followed by a manual selection operation. The selected main action verbs were subsequently generalised and categorised with the use of Cornetto, a lexical resource that combines a Dutch Wordnet and a Dutch Reference Lexicon. Results show that these tools are useful but also have limitations, which make human intervention essential to guide an accurate categorisation of actions in multimodal instructions.

    The communicative theory of Terminology (CTT) applied to the development of a corpus-based specialised dictionary of the ceramics industry

    This PhD dissertation is the result of an ongoing project aimed at the creation of a bilingual (Spanish-English; English-Spanish), corpus-based, specialised active dictionary of the ceramic and tile industry, with the Communicative Theory of Terminology (CTT) as its mainstay. In line with the grounding principles of the CTT, this research has departed from a corpus-based approach (the corpus was compiled ad hoc) in which terms have been analysed in vivo and characterised from the natural "habitat" in which they occur in specialised communication and discourse. In this light, the dissertation argues that the study of terms, made possible by the activity of compiling and describing them known as terminography, may be complemented by the wider projection of specialised lexicography for the compilation and elaboration of LSP, user-oriented and user-friendly quality products in the form of dictionaries. This specialised lexicographical dimension has implied the need to renew the concept of speciality language dictionaries applied to the ceramic industry and has given way to the creation of a (prospective) active dictionary in this field with a marked emphasis on context, the natural associations of terms (collocations), and the communicative nature of terminology. Accordingly, the importance of pragmatic aspects in a work of this sort has made it necessary to undertake an in-depth review and analysis of the socio-economic context of the research in order to be able to establish and address the specific terminological needs that the ceramic industrial discourse community may have. On the basis of this theoretical framework, the method of study followed for the development of the prospective dictionary has comprised 8 broad stages: the stage of work preparation and corpus compilation, the elaboration of the field diagram, the stage of documentary corpus management, term extraction, data processing, revision and normalisation and, finally, the edition stage. The method relies heavily on software for corpus exploitation (WordSmith Tools 4.0), for the creation of the terminological database (TermStar XV), and for the generation of the final entries (GENDIC). Two main types of results have been presented: those obtained through work in progress in the different stages of the method, and final ones strictly speaking, that is, 4,000 English-Spanish entries in their final format (as they will appear in the prospective dictionary) belonging to the letters A, B, N, O, U and V of a complete dictionary which will include a total of 26,000 entries.

    Proceedings of the 9th Dutch-Belgian Information Retrieval Workshop


    Interactive Teaching of Corpus Linguistics at MA Studies of English Language and Linguistics: Theoretical, Methodological and Practical Aspects

    This thesis deals with corpus linguistics from the point of view of the theoretical, methodological and practical aspects of designing a course in corpus linguistics that is based on the principles of interactive teaching and learning. The main aim of the research is to formulate principles for designing interactive courses that are aligned with curricula and syllabi, but that are also pedagogically efficient and student-oriented. The second aim is to design, in accordance with these principles, a course that is optimally suited to introducing corpus linguistics and interactive teaching methods to MA studies of English language and linguistics in Serbia. The third aim is to test the initial hypothesis that a course designed in accordance with interactive principles of course design and delivered as a hybrid course will be superior to a course with the same content delivered entirely through traditional teaching methods.

    A galaxy of wor(l)ds: the translation of fictive vernacular in the Star Wars transmedia narrative in Brazil

    Doctoral thesis, Universidade Federal de Santa Catarina, Centro de Comunicação e Expressão, Programa de Pós-Graduação em Inglês: Estudos Linguísticos e Literários, Florianópolis, 2020. Abstract: With the recent change in the publication scenario of materials from the Star Wars saga in Brazil (which began with the change of intellectual property holder in 2012), the franchise has become a transmedia narrative in the country. In view of this context, the present research aims to describe the translation practices adopted to deal with Star Wars materials. Considering that a transmedia narrative is a composite whole formed by narrative expansion across multiple instalments in different media platforms, the present research ultimately aims to investigate the adopted translation practices and their impact on the wholeness of the transmedia narrative in Brazil. The investigation of translation practices focuses on the language-based narrative device called Fictive Vernacular, a concept developed in this thesis. Descriptive Translation Studies offered the theoretical foundations to analyse the selected pairs of source and translated texts. Corpus-based Translation Studies provided the theoretical and methodological procedures and tools to conduct the data analysis, for which purpose a computerised parallel corpus was created. The parallel corpus is composed of aligned pairs of source and target books, comics and films (only the verbal components of the last two are included in the parallel corpus). It comprises two pairs per medium, adding up to six instalments and twelve texts in total. Analysis reveals two main tendencies in the practices of translating the Fictive Vernacular in the corpus. The first tendency involves imprinting the makeup of source fictive items onto the target texts. The second concerns drawing on the resources of the target language to render fictive items, even at the expense of occasionally nullifying their world-building function.