847 research outputs found

    Text segmentation for analysing different languages

    Over the past several years, researchers have applied different methods of text segmentation. Text segmentation is defined as a method of splitting a document into smaller segments, each assumed to carry its own relevant meaning. These segments can be classified as tags, words, sentences, topics, phrases or any other information unit. This study first reviews the types of text segmentation methods used in different types of documentation, and then discusses the various reasons for applying them in opinion mining. The main contribution of this study is a summary of research papers from the past 10 years that applied text segmentation as their main approach to text analysis. Results show that word segmentation has been successfully and widely used for processing different languages.
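
    As a toy illustration (not the authors' implementation), splitting a document into word or sentence segments can be sketched in a few lines of Python; the `segment` helper and its regular expressions are assumptions made here purely for demonstration:

```python
import re

def segment(text: str, unit: str = "word") -> list[str]:
    """Split text into segments of the requested unit type."""
    if unit == "word":
        # Unicode-aware word matching; hyphenated forms are kept whole
        return re.findall(r"\w+(?:-\w+)*", text)
    if unit == "sentence":
        # Naive split on terminal punctuation followed by whitespace
        return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    raise ValueError(f"unknown unit: {unit}")

print(segment("Text segmentation splits documents. Each segment carries meaning!",
              unit="sentence"))
# ['Text segmentation splits documents.', 'Each segment carries meaning!']
```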

    Computer analysis of composite documents with non-uniform background

    The motivation behind most applications of off-line text recognition is to convert data from conventional media into electronic media. Such applications include bank cheques, security documents and form processing. In this dissertation a document analysis system is presented to transfer gray level composite documents with complex backgrounds and poor illumination into an electronic format that is suitable for efficient storage, retrieval and interpretation. The preprocessing stage of the document analysis system requires the conversion of a paper-based document to a digital bit-map representation after optical scanning, followed by thresholding, skew detection, page segmentation and Optical Character Recognition (OCR). The system as a whole operates in a pipeline fashion where each stage passes its output to the next stage. The success of each stage ensures that the system as a whole operates without failures that would reduce the character recognition rate. In designing this document analysis system, a new local bi-level threshold selection technique was developed for gray level composite document images with non-uniform background. The algorithm uses statistical and textural feature measures to obtain a feature vector for each pixel from a window of size (2n + 1) × (2n + 1), where n ≄ 1. These features provide a local understanding of pixels from their neighbourhoods, making it easier to classify each pixel into its proper class. A Multi-Layer Perceptron Neural Network is then used to classify each pixel value in the image. The results of thresholding are then passed to the block segmentation stage. The block segmentation technique developed is a feature-based method that uses a Neural Network classifier to automatically segment and classify the image contents into text and halftone images. Finally, the text blocks are passed to a Character Recognition (CR) system to convert characters into an editable text format, and the recognition results were compared to those obtained from a commercial OCR. The OCR system implemented uses pixel distributions as features extracted from different zones of the characters, and a correlation classifier to recognize the characters. For the application of cheque processing, this system was used to read the special numerals of the optical barcode found on bank cheques. The OCR system uses a fuzzy descriptive feature extraction method with a correlation classifier to recognize these special numerals, which identify the banking institution and provide personal information about the account holder. The new local thresholding scheme was tested on a variety of composite document images with complex backgrounds. The results were very good compared to the results from commercial OCR software. The proposed thresholding technique is not limited to a specific application. It can be used on a variety of document images with complex backgrounds and can be implemented in any document analysis system provided that sufficient training is performed. Dept. of Electrical and Computer Engineering. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2004 .A445. Source: Dissertation Abstracts International, Volume: 66-02, Section: B, page: 1061. Advisers: Maher Sid-Ahmed; Majid Ahmadi. Thesis (Ph.D.)--University of Windsor (Canada), 2004.
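
    A minimal sketch of the per-pixel classification idea the abstract describes is given below, with scikit-learn's MLPClassifier standing in for the dissertation's Multi-Layer Perceptron and four simple window statistics (mean, standard deviation, centre value, local range) standing in for its statistical and textural feature measures; the toy image and labels are fabricated for demonstration only:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def window_features(img: np.ndarray, r: int = 1) -> np.ndarray:
    """Per-pixel features from a (2r+1) x (2r+1) neighbourhood:
    mean, standard deviation, centre value, and local range."""
    padded = np.pad(img.astype(float), r, mode="edge")
    h, w = img.shape
    feats = np.zeros((h * w, 4))
    for i in range(h):
        for j in range(w):
            win = padded[i:i + 2 * r + 1, j:j + 2 * r + 1]
            feats[i * w + j] = (win.mean(), win.std(), img[i, j],
                                win.max() - win.min())
    return feats

# Toy data: dark pixels play the role of "text", bright pixels "background"
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32))
labels = (img < 128).astype(int).ravel()

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=800, random_state=0)
clf.fit(window_features(img), labels)
binary = clf.predict(window_features(img)).reshape(img.shape)  # bi-level result
```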

    USING EIGENVALUES AND SPATIAL FEATURES FOR BUILDING AN IRAQI LICENSE PLATE DETECTOR AND RECOGNIZER

    This paper presents a system for the license plate recognition problem for Iraqi cars. The system depends on edges and position to locate the license plate and detect the characters within it. For character recognition, eigenvalues of prepared templates are used to identify each character, with template matching used to recognize the governorate. Euclidean distance is used to take the decision. Experiments show that recognition success was high and precise.
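
    The abstract does not specify how the eigenvalue signatures are built, so the sketch below is one plausible reading: describe each square glyph image by its largest eigenvalue magnitudes and match against templates by Euclidean distance. The `eigen_signature` and `recognise` helpers and the random toy glyphs are assumptions, not the paper's method:

```python
import numpy as np

def eigen_signature(glyph: np.ndarray, k: int = 5) -> np.ndarray:
    """Describe a square glyph image by its k largest eigenvalue magnitudes."""
    g = glyph.astype(float)
    g /= np.linalg.norm(g) + 1e-9            # crude scale normalisation
    return np.sort(np.abs(np.linalg.eigvals(g)))[::-1][:k]

def recognise(glyph: np.ndarray, templates: dict) -> str:
    """Pick the template whose signature is nearest in Euclidean distance."""
    sig = eigen_signature(glyph)
    return min(templates, key=lambda c: np.linalg.norm(sig - templates[c]))

# Toy templates: random 24x24 binary glyphs standing in for prepared characters
rng = np.random.default_rng(1)
glyphs = {c: rng.integers(0, 2, (24, 24)) for c in "0123456789"}
templates = {c: eigen_signature(g) for c, g in glyphs.items()}
print(recognise(glyphs["3"], templates))    # -> '3'
```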

    Text segmentation techniques: A critical review

    Text segmentation is widely used for processing text. It is a method of splitting a document into smaller parts, usually called segments. Each segment has its own relevant meaning. These segments are categorized as words, sentences, topics, phrases or any other information unit, depending on the task of the text analysis. This study presents the various reasons for using text segmentation in different analysis approaches. We categorize the types of documents and languages used. The main contribution of this study is a summary of 50 research papers, illustrating a decade (January 2007 – January 2017) of research that applied text segmentation as its main approach for analysing text. Results reveal the popularity of text segmentation across different languages. Besides that, the "word" appears to be the most practical and usable segment, as it is a smaller unit than the phrase, sentence or line.
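
    For languages written without spaces, one classic word segmentation algorithm covered in surveys of this kind is greedy forward maximal matching against a lexicon. The sketch below is illustrative only (the review discusses many methods); the `max_match` helper and the toy lexicon are our own:

```python
def max_match(text: str, lexicon: set[str], max_len: int = 6) -> list[str]:
    """Greedy forward maximal matching: at each position take the longest
    lexicon entry; fall back to a single character when nothing matches."""
    out, i = [], 0
    while i < len(text):
        for L in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + L] in lexicon or L == 1:
                out.append(text[i:i + L])
                i += L
                break
    return out

print(max_match("thecatsat", {"the", "cat", "sat", "at"}))
# ['the', 'cat', 'sat']
```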

    Classification of Camellia (Theaceae) Species Using Leaf Architecture Variations and Pattern Recognition Techniques

    Leaf characters have been successfully utilized to classify Camellia (Theaceae) species; however, leaf characters combined with supervised pattern recognition techniques have not been previously explored. We present results of using leaf morphological and venation characters of 93 species from five sections of the genus Camellia to assess the effectiveness of several supervised pattern recognition techniques for classification and compare their accuracy. A clustering approach, Learning Vector Quantization neural networks (LVQ-ANN), Dynamic Architecture for Artificial Neural Networks (DAN2), and C-support vector machines (SVM) are used to discriminate the 93 species from five sections of the genus Camellia (11 in sect. Furfuracea, 16 in sect. Paracamellia, 12 in sect. Tuberculata, 34 in sect. Camellia, and 20 in sect. Theopsis). DAN2 and SVM show excellent classification results for the genus Camellia, with DAN2 reaching accuracies of 97.92% and 91.11% on the training and testing data sets respectively. The RBF-SVM results of 97.92% and 97.78% for training and testing offer the best classification accuracy. A hierarchical dendrogram based on leaf architecture data confirms the morphological classification of the five sections as previously proposed. The overall results suggest that leaf architecture-based data analysis using supervised pattern recognition techniques, especially the DAN2 and SVM discrimination methods, is excellent for identification of Camellia species.
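
    A minimal scikit-learn sketch of the RBF-SVM setup the abstract reports is shown below; the synthetic features stand in for the real leaf morphology and venation measurements, which are not available here, and the hyperparameters are illustrative guesses, not the paper's:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for leaf morphology and venation measurements
X, y = make_classification(n_samples=300, n_features=12, n_informative=8,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale features, then fit a C-support vector machine with an RBF kernel
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10, gamma="scale"))
model.fit(X_tr, y_tr)
print(f"test accuracy: {model.score(X_te, y_te):.3f}")
```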

    Data-efficient methods for information extraction

    Structured knowledge representation systems such as knowledge bases or knowledge graphs provide insights regarding entities and the relationships among these entities in the real world. Such knowledge representation systems can be employed in various natural language processing applications such as semantic search, question answering and text summarization. It is infeasible and inefficient to populate these knowledge representation systems manually. In this work, we develop methods to automatically extract named entities and relationships among entities from plain text; our methods can therefore be used either to complete existing incomplete knowledge representation systems or to create a new structured knowledge representation system from scratch. Unlike mainstream supervised methods for information extraction, our methods focus on the low-data scenario and do not require a large amount of annotated data. In the first part of the thesis, we focus on the problem of named entity recognition. We participated in the Bacteria Biotope 2019 shared task, which consists of recognizing and normalizing biomedical entity mentions. Our linguistically informed named entity recognition system consists of a deep learning based model which can extract both nested and flat entities; the model employs several linguistic features and auxiliary training objectives to enable efficient learning in data-scarce scenarios. Our entity normalization system employs string matching, fuzzy search and semantic search to link the extracted named entities to biomedical databases. Our named entity recognition and entity normalization system achieved the lowest slot error rate of 0.715 and ranked first in the shared task. We also participated in two shared tasks, Adverse Drug Effect Span Detection (English) and Profession Span Detection (Spanish); both tasks collect data from the social media platform Twitter. We developed a named entity recognition model which improves the input representation by stacking heterogeneous embeddings from diverse domains; our empirical results demonstrate complementary learning from these heterogeneous embeddings. Our submission ranked 3rd in both shared tasks.
    In the second part of the thesis, we explored synthetic data augmentation strategies to address low-resource information extraction in specialized domains. Specifically, we adapted backtranslation to the token-level task of named entity recognition and the sentence-level task of relation extraction. We demonstrate that backtranslation can generate linguistically diverse and grammatically coherent synthetic sentences and serves as a competitive augmentation strategy for the tasks of named entity recognition and relation extraction. In most real-world relation extraction tasks, annotated data is not available; however, quite often a large unannotated text corpus is. Bootstrapping methods for relation extraction can operate on such a large corpus as they require only a handful of seed instances. However, bootstrapping methods tend to accumulate noise over time (known as semantic drift), and this phenomenon has a drastic negative impact on the final precision of the extractions. We develop two methods to constrain the bootstrapping process to minimise semantic drift for relation extraction; our methods leverage graph theory and pre-trained language models to explicitly identify and remove noisy extraction patterns. We report experimental results on the TACRED dataset for four relations. In the last part of the thesis, we demonstrate the application of domain adaptation to the challenging task of multilingual acronym extraction. Our experiments demonstrate that domain adaptation can improve acronym extraction within scientific and legal domains in six languages, including low-resource languages such as Persian and Vietnamese.
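
    To make the semantic-drift problem concrete, here is a toy bootstrapping loop over (entity, pattern, entity) triples with a crude support-based pattern filter. The thesis's actual methods use graph theory and pre-trained language models; this sketch, including the `bootstrap` helper and the miniature corpus, is our own simplified illustration:

```python
from collections import Counter

def bootstrap(corpus, seeds, rounds=3, min_support=2):
    """Toy bootstrapping loop over (entity1, pattern, entity2) triples.
    A pattern is trusted only once several already-known pairs support it,
    a crude filter that damps semantic drift."""
    known, trusted = set(seeds), set()
    for _ in range(rounds):
        support = Counter(p for e1, p, e2 in corpus if (e1, e2) in known)
        trusted |= {p for p, c in support.items() if c >= min_support}
        known |= {(e1, e2) for e1, p, e2 in corpus if p in trusted}
    return known, trusted

corpus = [("Paris", "is the capital of", "France"),
          ("Berlin", "is the capital of", "Germany"),
          ("Rome", "is the capital of", "Italy"),
          ("Rome", "is a city in", "Italy")]
pairs, patterns = bootstrap(corpus, seeds={("Paris", "France"),
                                           ("Berlin", "Germany")})
print(patterns)   # only the well-supported "is the capital of" survives
print(pairs)      # the seed pairs plus ('Rome', 'Italy')
```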

    Theoretical results on a weightless neural classifier and application to computational linguistics

    WiSARD is an n-tuple classifier, historically used in pattern recognition tasks on black-and-white images. Unfortunately, it was not commonly used in other tasks because of its inability to cope with large volumes of data, owing to its sensitivity to the learned content. Recently, the bleaching technique was conceived as an improvement to the n-tuple classifier architecture, as a means of curbing WiSARD's sensitivity. Since then, the range of applications built with this learning system has grown. Because it frequently involves very large corpora, multilingual part-of-speech tagging fits into this group of applications. This thesis improves mWANN-Tagger, a weightless part-of-speech tagger proposed in 2012. It shows that research on multilingual tagging with WiSARD was intensified through the use of quantitative linguistics and that a universal parameter configuration was found for mWANN-Tagger. Analyses and experiments with the Universal Dependencies (UD) treebanks show that mWANN-Tagger has the potential to outperform state-of-the-art taggers given a better word representation. The thesis also evaluates the advantages of bleaching over the traditional model through the theoretical framework of VC theory. The VC dimensions of both were calculated, establishing that an n-tuple classifier, whether plain WiSARD or with bleaching, that has N RAM nodes addressed by binary n-tuples has a VC dimension of exactly N(2^n − 1) + 1. A parallel was then drawn between the two models, from which it was deduced that the bleaching technique is an improvement to the n-tuple method that causes no harm to its learning capacity.
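
    A schematic Python sketch of a single WiSARD discriminator with bleaching follows. It is simplified from the n-tuple classifier literature (input mapping and addressing details vary across implementations) and the two-class toy problem is fabricated; it is not the thesis's mWANN-Tagger code:

```python
import numpy as np

class Discriminator:
    """One WiSARD discriminator: the binary input is split by a pseudo-random
    mapping into n-tuples, each addressing one RAM node. Training counts how
    often each address is seen, so a bleaching threshold b can be applied."""

    def __init__(self, n_bits: int, tuple_size: int, seed: int = 0):
        rng = np.random.default_rng(seed)          # same seed => mapping is
        self.mapping = rng.permutation(n_bits)     # shared across classes
        self.tuple_size = tuple_size
        self.rams = [dict() for _ in range(n_bits // tuple_size)]

    def _addresses(self, x: np.ndarray):
        bits, n = x[self.mapping], self.tuple_size
        for i in range(len(self.rams)):
            yield i, tuple(bits[i * n:(i + 1) * n])

    def train(self, x: np.ndarray) -> None:
        for i, addr in self._addresses(x):
            self.rams[i][addr] = self.rams[i].get(addr, 0) + 1

    def respond(self, x: np.ndarray, b: int = 0) -> int:
        # bleaching: a RAM fires only if its access counter exceeds b
        return sum(1 for i, addr in self._addresses(x)
                   if self.rams[i].get(addr, 0) > b)

# Two-class toy problem: dense vs sparse 64-bit patterns
rng = np.random.default_rng(42)
d_a, d_b = Discriminator(64, 4), Discriminator(64, 4)
for x in (rng.random((50, 64)) < 0.8).astype(int):
    d_a.train(x)
for x in (rng.random((50, 64)) < 0.2).astype(int):
    d_b.train(x)
probe = (rng.random(64) < 0.8).astype(int)
print("A" if d_a.respond(probe, b=1) >= d_b.respond(probe, b=1) else "B")
```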

    Detection of Hate-Speech Tweets Based on Deep Learning: A Review

    Cybercrime, cyberbullying, and hate speech have all increased alongside the use of the internet and social media. Hate speech respects no organizational or individual boundaries. It affects many people in diverse ways and can be harsh, offensive, or discriminatory, targeting a person's gender, race, political opinions, religion, nationality, skin color, disability, ethnicity, sexual orientation, or status as an immigrant. Authorities and academics are investigating new methods for identifying hate speech on social media platforms like Facebook and Twitter. This study adds to the ongoing discussion about creating safer digital spaces while balancing the limitation of hate speech against the protection of freedom of speech. Partnerships between researchers, platform developers, and communities are crucial in creating efficient and ethical content moderation systems on Twitter and other social media sites. For this reason, multiple methodologies, models, and algorithms are employed. This study presents a thorough analysis of hate speech detection across numerous research publications. Each article has been examined in depth, including an evaluation of the algorithms or methodologies used, the datasets, the classification techniques, and the findings achieved. In addition, all the examined papers are discussed comprehensively, with an explicit focus on the use of deep learning techniques to detect hate speech.
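
    As a generic baseline (not taken from any of the reviewed papers), a neural text classifier for this task can be sketched in PyTorch with a bag-of-embeddings model; the tiny vocabulary and four labelled examples are fabricated, and a real system would use a large annotated corpus with pretrained embeddings or a transformer encoder:

```python
import torch
import torch.nn as nn

# Toy labelled set (0 = neutral, 1 = hateful), for demonstration only
vocab = {"<unk>": 0, "i": 1, "hate": 2, "love": 3, "you": 4, "them": 5, "all": 6}
data = [("i love you", 0), ("i hate them all", 1), ("love all", 0), ("hate you", 1)]

def encode(text: str) -> torch.Tensor:
    return torch.tensor([vocab.get(w, 0) for w in text.split()])

class BagClassifier(nn.Module):
    """Mean of token embeddings followed by a linear layer."""
    def __init__(self, vocab_size: int, dim: int = 16):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, dim)  # default mode: mean
        self.out = nn.Linear(dim, 2)

    def forward(self, tokens, offsets):
        return self.out(self.emb(tokens, offsets))

model = BagClassifier(len(vocab))
opt = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.cat([encode(s) for s, _ in data])
offsets = torch.tensor([0] + [len(encode(s)) for s, _ in data[:-1]]).cumsum(0)
labels = torch.tensor([y for _, y in data])
for _ in range(200):
    opt.zero_grad()
    loss_fn(model(tokens, offsets), labels).backward()
    opt.step()

print(model(encode("hate them"), torch.tensor([0])).argmax().item())  # 1 = hateful
```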

    The Analysis of Student Performance Using Data Mining

    This paper presents a study of data mining in the education sector to model the performance of students enrolled in university. Two data mining tasks were applied. First, a descriptive task based on the K-means algorithm was used to group the students into several clusters. Second, a classification task applied two classification techniques, decision tree and Naïve Bayes, to predict dropout due to poor performance in a student's first four semesters. The academic data collected during the admission process of those students were used to train and test the models, which were assessed using a cross-validation technique. Experimental results show that dropout prediction improves, and student performance can be monitored, when data from previous academic enrollment are added.
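
    A minimal scikit-learn sketch of the two-step pipeline the abstract describes follows; the synthetic features stand in for the real admission and first-four-semester records, which are not available here:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for student academic records; label 1 = later dropout
X, y = make_classification(n_samples=400, n_features=8, weights=[0.8],
                           random_state=0)

# Descriptive step: group students into performance clusters with K-means
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X_aug = np.column_stack([X, clusters])   # cluster id as an extra feature

# Predictive step: cross-validated dropout classifiers
for clf in (DecisionTreeClassifier(random_state=0), GaussianNB()):
    scores = cross_val_score(clf, X_aug, y, cv=5)
    print(type(clf).__name__, round(scores.mean(), 3))
```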
    • 

    corecore