131 research outputs found

    Classifier-Based Text Simplification for Improved Machine Translation

    Full text link
    Machine Translation is one of the research fields of Computational Linguistics. The objective of many MT Researchers is to develop an MT System that produce good quality and high accuracy output translations and which also covers maximum language pairs. As internet and Globalization is increasing day by day, we need a way that improves the quality of translation. For this reason, we have developed a Classifier based Text Simplification Model for English-Hindi Machine Translation Systems. We have used support vector machines and Na\"ive Bayes Classifier to develop this model. We have also evaluated the performance of these classifiers.Comment: In Proceedings of International Conference on Advances in Computer Engineering and Applications 201

    Cross-view Embeddings for Information Retrieval

    Full text link
    In this dissertation, we deal with the cross-view tasks related to information retrieval using embedding methods. We study existing methodologies and propose new methods to overcome their limitations. We formally introduce the concept of mixed-script IR, which deals with the challenges faced by an IR system when a language is written in different scripts because of various technological and sociological factors. Mixed-script terms are represented by a small and finite feature space comprised of character n-grams. We propose the cross-view autoencoder (CAE) to model such terms in an abstract space and CAE provides the state-of-the-art performance. We study a wide variety of models for cross-language information retrieval (CLIR) and propose a model based on compositional neural networks (XCNN) which overcomes the limitations of the existing methods and achieves the best results for many CLIR tasks such as ad-hoc retrieval, parallel sentence retrieval and cross-language plagiarism detection. We empirically test the proposed models for these tasks on publicly available datasets and present the results with analyses. In this dissertation, we also explore an effective method to incorporate contextual similarity for lexical selection in machine translation. Concretely, we investigate a feature based on context available in source sentence calculated using deep autoencoders. The proposed feature exhibits statistically significant improvements over the strong baselines for English-to-Spanish and English-to-Hindi translation tasks. Finally, we explore the the methods to evaluate the quality of autoencoder generated representations of text data and analyse its architectural properties. For this, we propose two metrics based on reconstruction capabilities of the autoencoders: structure preservation index (SPI) and similarity accumulation index (SAI). We also introduce a concept of critical bottleneck dimensionality (CBD) below which the structural information is lost and present analyses linking CBD and language perplexity.En esta disertación estudiamos problemas de vistas-múltiples relacionados con la recuperación de información utilizando técnicas de representación en espacios de baja dimensionalidad. Estudiamos las técnicas existentes y proponemos nuevas técnicas para solventar algunas de las limitaciones existentes. Presentamos formalmente el concepto de recuperación de información con escritura mixta, el cual trata las dificultades de los sistemas de recuperación de información cuando los textos contienen escrituras en distintos alfabetos debido a razones tecnológicas y socioculturales. Las palabras en escritura mixta son representadas en un espacio de características finito y reducido, compuesto por n-gramas de caracteres. Proponemos los auto-codificadores de vistas-múltiples (CAE, por sus siglas en inglés) para modelar dichas palabras en un espacio abstracto, y esta técnica produce resultados de vanguardia. En este sentido, estudiamos varios modelos para la recuperación de información entre lenguas diferentes (CLIR, por sus siglas en inglés) y proponemos un modelo basado en redes neuronales composicionales (XCNN, por sus siglas en inglés), el cual supera las limitaciones de los métodos existentes. El método de XCNN propuesto produce mejores resultados en diferentes tareas de CLIR tales como la recuperación de información ad-hoc, la identificación de oraciones equivalentes en lenguas distintas y la detección de plagio entre lenguas diferentes. Para tal efecto, realizamos pruebas experimentales para dichas tareas sobre conjuntos de datos disponibles públicamente, presentando los resultados y análisis correspondientes. En esta disertación, también exploramos un método eficiente para utilizar similitud semántica de contextos en el proceso de selección léxica en traducción automática. Específicamente, proponemos características extraídas de los contextos disponibles en las oraciones fuentes mediante el uso de auto-codificadores. El uso de las características propuestas demuestra mejoras estadísticamente significativas sobre sistemas de traducción robustos para las tareas de traducción entre inglés y español, e inglés e hindú. Finalmente, exploramos métodos para evaluar la calidad de las representaciones de datos de texto generadas por los auto-codificadores, a la vez que analizamos las propiedades de sus arquitecturas. Como resultado, proponemos dos nuevas métricas para cuantificar la calidad de las reconstrucciones generadas por los auto-codificadores: el índice de preservación de estructura (SPI, por sus siglas en inglés) y el índice de acumulación de similitud (SAI, por sus siglas en inglés). También presentamos el concepto de dimensión crítica de cuello de botella (CBD, por sus siglas en inglés), por debajo de la cual la información estructural se deteriora. Mostramos que, interesantemente, la CBD está relacionada con la perplejidad de la lengua.En aquesta dissertació estudiem els problemes de vistes-múltiples relacionats amb la recuperació d'informació utilitzant tècniques de representació en espais de baixa dimensionalitat. Estudiem les tècniques existents i en proposem unes de noves per solucionar algunes de les limitacions existents. Presentem formalment el concepte de recuperació d'informació amb escriptura mixta, el qual tracta les dificultats dels sistemes de recuperació d'informació quan els textos contenen escriptures en diferents alfabets per motius tecnològics i socioculturals. Les paraules en escriptura mixta són representades en un espai de característiques finit i reduït, composat per n-grames de caràcters. Proposem els auto-codificadors de vistes-múltiples (CAE, per les seves sigles en anglès) per modelar aquestes paraules en un espai abstracte, i aquesta tècnica produeix resultats d'avantguarda. En aquest sentit, estudiem diversos models per a la recuperació d'informació entre llengües diferents (CLIR , per les sevas sigles en anglès) i proposem un model basat en xarxes neuronals composicionals (XCNN, per les sevas sigles en anglès), el qual supera les limitacions dels mètodes existents. El mètode de XCNN proposat produeix millors resultats en diferents tasques de CLIR com ara la recuperació d'informació ad-hoc, la identificació d'oracions equivalents en llengües diferents, i la detecció de plagi entre llengües diferents. Per a tal efecte, realitzem proves experimentals per aquestes tasques sobre conjunts de dades disponibles públicament, presentant els resultats i anàlisis corresponents. En aquesta dissertació, també explorem un mètode eficient per utilitzar similitud semàntica de contextos en el procés de selecció lèxica en traducció automàtica. Específicament, proposem característiques extretes dels contextos disponibles a les oracions fonts mitjançant l'ús d'auto-codificadors. L'ús de les característiques proposades demostra millores estadísticament significatives sobre sistemes de traducció robustos per a les tasques de traducció entre anglès i espanyol, i anglès i hindú. Finalment, explorem mètodes per avaluar la qualitat de les representacions de dades de text generades pels auto-codificadors, alhora que analitzem les propietats de les seves arquitectures. Com a resultat, proposem dues noves mètriques per quantificar la qualitat de les reconstruccions generades pels auto-codificadors: l'índex de preservació d'estructura (SCI, per les seves sigles en anglès) i l'índex d'acumulació de similitud (SAI, per les seves sigles en anglès). També presentem el concepte de dimensió crítica de coll d'ampolla (CBD, per les seves sigles en anglès), per sota de la qual la informació estructural es deteriora. Mostrem que, de manera interessant, la CBD està relacionada amb la perplexitat de la llengua.Gupta, PA. (2017). Cross-view Embeddings for Information Retrieval [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/78457TESI

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Full text link
    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012Palanci

    Advanced document data extraction techniques to improve supply chain performance

    Get PDF
    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies.The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model is a generator network that is implemented with the help of the Faster R-CNN model and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bonding boxes. For text extraction from the bounding box, a novel data extraction framework consisting of various processes including XML processing in case of existing OCR engine, bounding box pre-processing, text clean up, OCR error correction, spell check, type check, pattern-based matching, and finally, a learning mechanism for automatizing future data extraction was designed. Whichever fields the system can extract successfully are provided in key-value format.The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks and later, a rule-based engine is used to extract relevant data. While the system’s methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with text and metadata identified. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing their clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation. This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with the process of performing similar tasks manually using excel sheets and Business Intelligence (BI) tools.The intention behind the development of this novel methodology was twofold. First, to test and develop a novel solution that does not depend on any specific OCR technology. Second, to increase the information extraction accuracy factor over that of existing methodologies. Finally, it evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information

    Multilingual sentiment analysis in social media.

    Get PDF
    252 p.This thesis addresses the task of analysing sentiment in messages coming from social media. The ultimate goal was to develop a Sentiment Analysis system for Basque. However, because of the socio-linguistic reality of the Basque language a tool providing only analysis for Basque would not be enough for a real world application. Thus, we set out to develop a multilingual system, including Basque, English, French and Spanish.The thesis addresses the following challenges to build such a system:- Analysing methods for creating Sentiment lexicons, suitable for less resourced languages.- Analysis of social media (specifically Twitter): Tweets pose several challenges in order to understand and extract opinions from such messages. Language identification and microtext normalization are addressed.- Research the state of the art in polarity classification, and develop a supervised classifier that is tested against well known social media benchmarks.- Develop a social media monitor capable of analysing sentiment with respect to specific events, products or organizations
    • …
    corecore