    Online handwriting recognition using an approach combining support vector machines and hidden Markov models

    Handwriting recognition is one of the leading applications of pattern recognition and machine learning. Despite some limitations, handwriting recognition systems are used as an input method for many electronic devices and help automate many manual tasks that require processing handwriting images. In general, a handwriting recognition system comprises three functional components: preprocessing, recognition and post-processing. Improvements have been made within each component. However, to open further avenues for its applications, specific improvements are needed in the recognition capability of the system. The Hidden Markov Model (HMM) has been the dominant recognition method in both offline and online handwriting recognition systems. However, the use of Gaussian observation densities in HMMs, together with a representational model for word modeling, often does not lead to good classification. Hybrids of Neural Networks (NN) and HMMs later improved word recognition by taking advantage of the discriminative property of NNs and the representational capability of HMMs. However, the use of NNs does not optimize recognition capability, because the Empirical Risk Minimization (ERM) principle used in their training leads to poor generalization. In this thesis, we focus on improving the recognition capability of a cursive online handwritten word recognition system by using an emerging method in machine learning, the support vector machine (SVM). We first evaluated SVMs in an isolated character recognition environment using the IRONOFF and UNIPEN character databases. The SVM, through its use of the principle of structural risk minimization (SRM), allowed simultaneous optimization of the representational and discriminative capability of the character recognizer. We finally demonstrate the various practical issues in using SVMs within a hybrid setting with HMMs. 
In addition, we tested the hybrid system on the IRONOFF word database and obtained favourable results. Our work concerns handwriting recognition, one of the favoured application domains of pattern recognition and learning algorithms. For online handwriting, the applications cover all input devices that let a user communicate transparently with information systems. In this setting, our work contributes a new architecture for recognizing handwritten words without style constraints. It belongs to the family of hybrid local/global approaches, in which the segmentation/recognition paradigm is resolved by the complementarity of a discriminative recognizer acting at the character level and a model-based system supervising the global level. We chose Support Vector Machines (SVM) for the character classifier and dynamic programming algorithms derived from Hidden Markov Model (HMM) modeling. This SVM/HMM combination is unique in the field of handwriting recognition. Experiments were carried out first on isolated character recognition and then on the IRONOFF database of cursive words. They showed the superiority of the SVM approaches over the solutions based on convolutional neural networks (Time Delay Neural Network) that we had developed previously, and their good behaviour in word recognition.
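
The abstract describes character-level classifier scores being combined with dynamic programming derived from HMM modeling. The following is only a minimal sketch of that general idea, not the thesis implementation: it assumes the character classifier's outputs have already been converted to log-probabilities per segment, and the two character classes, the transition table and the function name `viterbi` are hypothetical values chosen for illustration.

```python
import math


def viterbi(emissions, log_trans, log_init):
    """Return the best label sequence and its log-score.

    emissions: list of dicts, one per segment, mapping label -> log-score
               (a stand-in for calibrated character-classifier outputs).
    log_trans: dict mapping (prev_label, label) -> log transition probability.
    log_init:  dict mapping label -> log initial probability.
    """
    labels = list(log_init)
    # best[t][label] = (score of the best path ending in label at step t, backpointer)
    best = [{lab: (log_init[lab] + emissions[0][lab], None) for lab in labels}]
    for t in range(1, len(emissions)):
        column = {}
        for lab in labels:
            # Pick the predecessor that maximizes path score plus transition score.
            prev, score = max(
                ((p, best[t - 1][p][0] + log_trans[(p, lab)]) for p in labels),
                key=lambda item: item[1],
            )
            column[lab] = (score + emissions[t][lab], prev)
        best.append(column)
    # Backtrack from the best final label.
    last = max(labels, key=lambda lab: best[-1][lab][0])
    total = best[-1][last][0]
    path = [last]
    for t in range(len(emissions) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    path.reverse()
    return path, total


# Toy example with two hypothetical character classes and three segments.
log = math.log
log_init = {"a": log(0.5), "b": log(0.5)}
log_trans = {
    ("a", "a"): log(0.1), ("a", "b"): log(0.9),
    ("b", "a"): log(0.9), ("b", "b"): log(0.1),
}
emissions = [
    {"a": log(0.6), "b": log(0.4)},
    {"a": log(0.55), "b": log(0.45)},
    {"a": log(0.6), "b": log(0.4)},
]
path, score = viterbi(emissions, log_trans, log_init)
print(path)  # ['a', 'b', 'a']
```

In this toy case a purely local decision would label every segment "a", while the sequence-level search prefers "a", "b", "a" because the assumed transition model penalizes repeated labels. That is the kind of complementarity between local and global levels the abstract refers to.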

    Text-based Sentiment Analysis and Music Emotion Recognition

    Nowadays, with the expansion of social media, large amounts of user-generated texts like tweets, blog posts or product reviews are shared online. Sentiment polarity analysis of such texts has become highly attractive and is utilized in recommender systems, market predictions, business intelligence and more. We also witness deep learning techniques becoming top performers on those types of tasks. There are, however, several problems that need to be solved for efficient use of deep neural networks on text mining and text polarity analysis. First of all, deep neural networks are data hungry. They need to be fed with datasets that are large, cleaned and preprocessed, and properly labeled. Second, the modern natural language processing concept of word embeddings as a dense and distributed text feature representation solves the sparsity and dimensionality problems of the traditional bag-of-words model. Still, there are various uncertainties regarding the use of word vectors: should they be generated from the same dataset that is used to train the model, or is it better to source them from big and popular collections that work as generic text feature representations? Third, it is not easy for practitioners to find a simple and highly effective deep learning setup for various document lengths and types. Recurrent neural networks are weak with longer texts, and optimal convolution-pooling combinations are not easily conceived. It is thus convenient to have generic neural network architectures that are effective and can adapt to various texts, encapsulating much of the design complexity. This thesis addresses the above problems to provide methodological and practical insights for utilizing neural networks on sentiment analysis of texts and achieving state-of-the-art results. Regarding the first problem, the effectiveness of various crowdsourcing alternatives is explored and two medium-sized and emotion-labeled song datasets are created utilizing social tags. 
One of the research interests of Telecom Italia was the exploration of relations between music emotional stimulation and driving style. Consequently, a context-aware music recommender system that aims to enhance driving comfort and safety was also designed. To address the second problem, a series of experiments with large text collections of various contents and domains was conducted. Word embeddings with different parameters were evaluated, and the results revealed that their quality is influenced (mostly but not only) by the size of the texts they were created from. When working with small text datasets, it is thus important to source word features from popular and generic word embedding collections. Regarding the third problem, a series of experiments involving convolutional and max-pooling neural layers was conducted. Various patterns relating text properties and network parameters to optimal classification accuracy were observed. Combining convolutions of words, bigrams, and trigrams with regional max-pooling layers in a couple of stacks produced the best results. The derived architecture achieves competitive performance on sentiment polarity analysis of movie, business and product reviews. Given that labeled data are becoming the bottleneck of current deep learning systems, a future research direction could be the exploration of various data programming possibilities for constructing even bigger labeled datasets. Investigation of feature-level or decision-level ensemble techniques in the context of deep neural networks could also be fruitful. Different feature types usually represent complementary characteristics of data. Combining word embeddings and traditional text features, or utilizing recurrent networks on document splits and then aggregating the predictions, could further increase the prediction accuracy of such models.
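
To make the combination of word, bigram and trigram convolutions with regional max-pooling more concrete, here is a minimal pure-Python sketch of a single stack. It is an illustration under stated assumptions rather than the thesis architecture: the embeddings and kernel weights are hand-set toy values (a real model learns them and adds nonlinearities and further stacks), and the function names `conv1d`, `regional_max_pool` and `ngram_features` are hypothetical.

```python
def conv1d(embeddings, kernel):
    """Slide a kernel spanning len(kernel) consecutive words over the text.

    Returns one activation per window (no padding), computed as the sum of
    element-wise products between the kernel and the word vectors in the window.
    """
    width = len(kernel)
    activations = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        activations.append(
            sum(w * x
                for k_vec, e_vec in zip(kernel, window)
                for w, x in zip(k_vec, e_vec))
        )
    return activations


def regional_max_pool(values, region):
    """Take the maximum over consecutive, non-overlapping regions."""
    return [max(values[i:i + region]) for i in range(0, len(values), region)]


def ngram_features(embeddings, kernels, region):
    """Concatenate pooled activations of kernels with different widths."""
    features = []
    for kernel in kernels:
        features.extend(regional_max_pool(conv1d(embeddings, kernel), region))
    return features


# Four words with 2-dimensional toy embeddings.
embeddings = [[1, 0], [0, 1], [1, 1], [2, 0]]
kernels = [
    [[1, 1]],                  # width 1: single words
    [[1, 0], [0, 1]],          # width 2: bigrams
    [[1, 0], [1, 0], [1, 0]],  # width 3: trigrams
]
print(ngram_features(embeddings, kernels, region=2))  # [1, 2, 2, 1, 3]
```

Unlike global max-pooling, which keeps one value per kernel, regional pooling keeps one value per text region, so some positional information survives into the next layer.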

    Identifying human phenotype terms in text using a machine learning approach

    Master's thesis, Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2017. Every day, a large amount of biomedical information is created in the form of scientific articles, books and images. Because human language is unstructured (text with a low level of organization), automatic information extraction methods are needed to convert this information into a machine-readable form and to automate the process. Information extraction systems have improved over the years, becoming increasingly effective. The extracted information can then be inserted into databases so that it is easily accessible and searchable, and so that links can be created between different types of information. Natural Language Processing (NLP) is an area of computer science that deals with human language. Its goal is to extract meaning from unstructured text automatically, using a computer. It uses a set of techniques such as tokenization, stemming, lemmatization and part-of-speech tagging to break text down and make it machine-readable. NLP has several applications, including coreference resolution, machine translation, Named-Entity Recognition (NER) and part-of-speech tagging. Machine learning methods play a very important role in information extraction; they have been developed and improved over the years, becoming increasingly powerful. These methods can be divided into two types: unsupervised learning and supervised learning. Unsupervised learning methods, such as clustering, do not need an annotated training set, which is an advantage because such a set can be hard to find. 
These methods can be used to find patterns in the data, which can be useful when the characteristics of the data are unknown. Supervised learning methods, in turn, use an annotated training set containing examples of input and output data, from which a model can be built that is able to classify an unannotated data set. Some of the most common supervised learning methods are Conditional Random Fields (CRFs), Support Vector Machines (SVMs) and Decision Trees. CRFs, which are used in this thesis, are probabilistic models commonly used in NER systems. These models have advantages over other models: they relax the independence assumptions imposed on Hidden Markov Models (HMMs) and avoid the bias problems present in SVMs. NER is a method that consists of identifying entities in unstructured text. NER systems can be divided into three strands: machine learning methods, dictionary-based methods and methods based on hand-written rules. Nowadays, most NER systems use machine learning methods. Approaches that use only machine learning are flexible but need large amounts of data and may not produce accurate results. Dictionary-based methods remove the need for large amounts of data and can obtain good results. However, these methods are limiting because they cannot identify entities that are not in the dictionary. Finally, methods that use hand-written rules can produce high-quality results. Although they do not have as many limitations as dictionary-based methods, they have the disadvantage of requiring a large amount of time and manual work to obtain good results. 
The goal of this thesis is the development of an NER system, IHP (Identifying Human Phenotypes), for the automatic identification of entities represented in the Human Phenotype Ontology (HPO). The HPO is an ontology that aims to provide a standardized vocabulary for phenotypic abnormalities found in human diseases. IHP uses machine learning methods for entity identification and a combination of dictionary-based and rule-based methods for validating the identified entities. IHP uses two benchmarking tools specific to this ontology, presented in earlier work (Groza T, 2015): the Gold Standard Corpora (GSC), which consists of a set of abstracts with their HPO term annotations, and the Test Suites (TS), which consist of a set of specific tests divided into different categories. These tools are intended to test different properties of annotators. While the GSC tests annotators in a general way, assessing their ability to identify entities in free text, the TS are composed of tests that assess the possible linguistic variations that HPO entities can have. Groza et al. also present the results of the Bio-LarK CR annotator, which is used as the baseline for the IHP results. IHP uses IBEnt (Identification of Biological Entities) as its base NER system, modified to accept HPO entities. This system uses Stanford CoreNLP together with CRFs, in the form of StanfordNER and CRFSuite, to create a model from a training set. This model can then be evaluated on a test set. Creating a model requires selecting a set of features that suits the data set used. StanfordNER and CRFSuite have different feature sets. 
For StanfordNER, an existing list of features was used, applying an algorithm to select the features that bring the greatest benefit. For CRFSuite, a set of features (linguistic, morphological, orthographic, lexical, contextual and others) was created based on previous work in biomedical NER. This feature set was tested and selected manually according to performance. In addition to the features, a set of post-processing rules was developed to search for linguistic patterns, also using word lists and stop words, in order to remove incorrectly identified entities, identify entities that had been missed and combine adjacent entities. The results for IHP were obtained using the StanfordNER and CRFSuite classifiers. With StanfordNER, IHP reaches an F-measure of 0.63498 on the GSC and 0.86916 on the TS. With CRFSuite, it reaches an F-measure of 0.64009 on the GSC and 0.89556 on the TS. Compared with the reference annotator Bio-LarK CR, these results show a performance increase on the GSC, suggesting that IHP is better than Bio-LarK CR at handling real-world situations. It shows, however, a decrease on the TS, indicating a lower ability to handle complex linguistic structures that may occur. Nevertheless, despite the decrease on the TS, the linguistic structures evaluated by these tests occur naturally in free text (such as the GSC abstracts), suggesting that the GSC results are more meaningful than the TS results. During the development of the thesis, some problems were identified in the GSC: the annotation of superclass/subclass entities, the number of times an entity is annotated, and common errors. Because of these inconsistencies, IHP has the potential to perform better on the GSC. 
To test this possibility, a test was carried out that consists of removing False Positives that appear both in the GSC annotations and in the HPO database. These False Positives, being present in both the GSC and the HPO, should probably be considered correctly annotated, yet the GSC does not identify them as entities. These tests show that IHP has the potential to reach a performance of 0.816, a considerable increase of about 0.18 over the results obtained. From the analysis of these inconsistencies in the GSC, a new version, GSC+, was created. GSC+ provides a more consistent annotation of the documents, attempting to annotate as many entities as possible. Relative to the GSC, 881 entities were added to GSC+ and 4 entities were modified. The performance of IHP on GSC+ is considerably higher than on the GSC, reaching an F-measure of 0.863. This difference in performance is due to the fact that GSC+ tries to identify as many entities as possible: many entities that were previously considered wrong are now considered correct. Named-Entity Recognition (NER) is an important Natural Language Processing task that can be used in Information Extraction systems to automatically identify and extract entities in unstructured text. NER is commonly used to identify biological entities such as proteins, genes and chemical compounds found in scientific articles. The Human Phenotype Ontology (HPO) is an ontology that provides a standardized vocabulary for phenotypic abnormalities found in human diseases. This article presents the Identifying Human Phenotypes (IHP) system, tuned to recognize HPO entities in unstructured text. IHP uses IBEnt (Identification of Biological Entities) as the base NER system. It uses Stanford CoreNLP for text processing and applies Conditional Random Fields (CRFs) for the identification of entities. 
IHP uses a rich feature set containing linguistic, orthographic, morphologic, lexical and context features created for the machine learning-based classifier. However, the main novelty of IHP is its validation step, based on a set of carefully crafted hand-written rules, such as the negative connotation analysis, which combined with a dictionary can filter out incorrectly identified entities, find missing entities and combine adjacent entities. The performance of IHP was evaluated using the recently published HPO Gold Standardized Corpora (GSC) and Test Suites (TS), where the system Bio-LarK CR obtained the best F-measure of 0.56 and 0.95 in the GSC and TS, respectively. Using StanfordNER, IHP achieved an F-measure of 0.646 for the GSC and 0.869 for the TS. Using CRFSuite, it achieved an F-measure of 0.648 for the GSC and 0.895 for the TS. Due to inconsistencies found in the GSC, an extended version of the GSC, the GSC+, was created, adding 881 entities and modifying 4 entities. IHP achieved an F-measure of 0.863 on GSC+. Both the GSC+ and the IHP system are publicly available at: https://github.com/lasigeBioTM/IHP
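
The results above are reported as F-measures, and the move from GSC to GSC+ changes which predicted entities count as correct. The sketch below illustrates that effect on made-up data. It assumes the balanced F-measure (F1) and exact matching of (start, end) character spans; the abstract states neither the matching criterion nor the weighting, so both are assumptions, and the function name `f_measure` and all spans are hypothetical.

```python
def f_measure(gold, predicted):
    """Return (precision, recall, F1) for two collections of entity spans.

    A predicted span counts as correct only if it matches a gold span exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for one abstract, as (start, end) offsets.
gold = {(0, 5), (10, 20), (30, 35)}
predicted = {(0, 5), (10, 20), (40, 45), (50, 55)}
print(f_measure(gold, predicted))

# If the gold standard is extended with an entity that was missing from it,
# the same predictions score higher: a former false positive becomes correct.
gold_plus = gold | {(40, 45)}
print(f_measure(gold_plus, predicted))
```

With these toy spans the F1 rises from 4/7 to 0.75 although the predictions are unchanged, which mirrors the reasoning given for the higher score on GSC+.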

    Advanced document data extraction techniques to improve supply chain performance

    In this thesis, a novel machine learning technique to extract text-based information from scanned images has been developed. This information extraction is performed in the context of scanned invoices and bills used in financial transactions. These financial transactions contain a considerable amount of data that must be extracted, refined, and stored digitally before it can be used for analysis. Converting this data into a digital format is often a time-consuming process. Automation and data optimisation show promise as methods for reducing the time required and the cost of Supply Chain Management (SCM) processes, especially Supplier Invoice Management (SIM), Financial Supply Chain Management (FSCM) and Supply Chain procurement processes. This thesis uses a cross-disciplinary approach involving Computer Science and Operational Management to explore the benefit of automated invoice data extraction in business and its impact on SCM. The study adopts a multimethod approach based on empirical research, surveys, and interviews performed on selected companies. The expert system developed in this thesis focuses on two distinct areas of research: Text/Object Detection and Text Extraction. For Text/Object Detection, the Faster R-CNN model was analysed. While this model yields outstanding results in terms of object detection, it is limited by poor performance when image quality is low. The Generative Adversarial Network (GAN) model is proposed in response to this limitation. The GAN model combines a generator network, implemented with the help of the Faster R-CNN model, and a discriminator that relies on PatchGAN. The output of the GAN model is text data with bounding boxes. 
For text extraction from the bounding boxes, a novel data extraction framework was designed. It consists of several processes: XML processing when an existing OCR engine is used, bounding box pre-processing, text clean-up, OCR error correction, spell check, type check, pattern-based matching, and finally a learning mechanism for automating future data extraction. Whichever fields the system can extract successfully are provided in key-value format. The efficiency of the proposed system was validated using existing datasets such as SROIE and VATI. Real-time data was validated using invoices that were collected by two companies that provide invoice automation services in various countries. Currently, these scanned invoices are sent to an OCR system such as OmniPage, Tesseract, or ABBYY FRE to extract text blocks, and a rule-based engine is later used to extract relevant data. While the system's methodology is robust, the companies surveyed were not satisfied with its accuracy. Thus, they sought out new, optimized solutions. To confirm the results, the engines were used to return XML-based files with the identified text and metadata. The output XML data was then fed into this new system for information extraction. This system uses the existing OCR engine and a novel, self-adaptive, learning-based OCR engine. This new engine is based on the GAN model for better text identification. Experiments were conducted on various invoice formats to further test and refine its extraction capabilities. For cost optimisation and the analysis of spend classification, additional data were provided by another company in London that holds expertise in reducing its clients' procurement costs. This data was fed into our system to get a deeper level of spend classification and categorisation. 
This helped the company to reduce its reliance on human effort and allowed for greater efficiency in comparison with performing similar tasks manually using Excel spreadsheets and Business Intelligence (BI) tools. The intention behind the development of this novel methodology was twofold: first, to test and develop a novel solution that does not depend on any specific OCR technology; second, to increase information extraction accuracy over that of existing methodologies. Finally, the thesis evaluates the real-world need for the system and the impact it would have on SCM. This newly developed method is generic and can extract text from any given invoice, making it a valuable tool for optimizing SCM. In addition, the system uses a template-matching approach to ensure the quality of the extracted information.
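
The pattern-based matching and type-check steps that produce key-value output can be illustrated with a short sketch. This is not the framework from the thesis: the field names, the regular expressions, the day-first date format and the sample invoice text are all assumptions made for the example, and the input is taken to be plain text already returned by an OCR engine.

```python
import re
from datetime import datetime

# Hypothetical patterns for three invoice fields.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\bDate\s*[:\-]?\s*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
    "total": re.compile(r"\bTotal\b\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE),
}


def extract_fields(text):
    """Return the fields that match a pattern and pass a type check."""
    fields = {}
    for key, pattern in PATTERNS.items():
        match = pattern.search(text)
        if not match:
            continue
        value = match.group(1)
        try:
            if key == "date":
                # Type check: must be a real calendar date (day-first assumed).
                value = datetime.strptime(value, "%d/%m/%Y").date().isoformat()
            elif key == "total":
                # Type check: must parse as a number once separators are removed.
                value = float(value.replace(",", ""))
        except ValueError:
            continue  # Drop fields that fail validation instead of guessing.
        fields[key] = value
    return fields


sample = (
    "ACME Ltd\n"
    "Invoice No: INV-2041\n"
    "Date: 12/03/2021\n"
    "Subtotal: 1,000.00\n"
    "Total: 1,200.00\n"
)
print(extract_fields(sample))
```

Only fields that both match and validate appear in the result, which corresponds to the statement that whichever fields can be extracted successfully are returned in key-value form. The word boundary before "Total" keeps the "Subtotal" line from being picked up by mistake.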

    Text Extraction From Natural Scene: Methodology And Application

    With the popularity of the Internet and smart mobile devices, there is an increasing demand for techniques and applications of image/video-based analytics and information retrieval. Most of these applications can benefit from text information extraction in natural scenes. However, scene text extraction is a challenging problem because of the cluttered backgrounds of natural scenes and the many patterns of scene text itself. To solve these problems, this dissertation proposes a framework for scene text extraction. Scene text extraction in our framework is divided into two components, detection and recognition. Scene text detection finds the regions containing text in camera-captured images/videos. Text layout analysis based on gradient and color analysis is performed to extract candidate text strings from the cluttered background of a natural scene. Then text structural analysis is performed to design effective text structural features for distinguishing text from non-text outliers among the candidate text strings. Scene text recognition transforms image-based text in the detected regions into readable text codes. The most basic and significant step in text recognition is scene text character (STC) prediction, which is multi-class classification among a set of text character categories. We design robust and discriminative feature representations for STC structure by integrating multiple feature descriptors, coding/pooling schemes, and learning models. Experimental results on benchmark datasets demonstrate the effectiveness and robustness of our proposed framework, which obtains better performance than previously published methods. 
Our proposed scene text extraction framework is applied to four scenarios: 1) reading printed labels on grocery packages for hand-held object recognition; 2) combining with car detection to localize license plates in camera-captured natural scene images; 3) reading indicative signage for assistive navigation in indoor environments; and 4) combining with object tracking to perform scene text extraction in video-based natural scenes. The proposed prototype systems and associated evaluation results show that our framework is able to solve the challenges in real applications.