161,698 research outputs found

    Intelligent Search and Automatic Document Classification and Cataloging Based on Ontology Approach

    This paper presents an approach to developing an intelligent search system and automatic document classification and cataloging tools for a metadata-based CASE system. The described method combines the advantages of the ontology approach with those of the traditional keyword-based approach. The method provides powerful intelligent capabilities and can be integrated with existing document search systems.
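    The combination of ontology-based and keyword-based search described above can be sketched as query expansion over a concept ontology followed by plain keyword scoring. The ontology, documents, and weights below are invented for illustration; the paper's actual CASE-system metadata is not shown.

```python
# Illustrative sketch only: an invented toy ontology mapping each
# concept to related terms. Real ontologies are far richer.
ONTOLOGY = {
    "car": {"automobile", "vehicle"},
    "doc": {"document", "file", "record"},
}

def expand_query(terms):
    """Expand query terms with ontology-related terms at a reduced weight."""
    weights = {t: 1.0 for t in terms}
    for t in terms:
        for related in ONTOLOGY.get(t, ()):
            weights.setdefault(related, 0.5)  # expansion terms count half
    return weights

def score(doc_tokens, weights):
    """Keyword score: sum of weights of query terms present in the document."""
    tokens = set(doc_tokens)
    return sum(w for term, w in weights.items() if term in tokens)

def search(query_terms, corpus):
    """Rank documents by the combined ontology + keyword score."""
    weights = expand_query(query_terms)
    ranked = sorted(corpus.items(), key=lambda kv: score(kv[1], weights),
                    reverse=True)
    return [name for name, _ in ranked]
```

    A document mentioning only "automobile" is still found for the query "car", which is the advantage the abstract claims over pure keyword matching.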

    An Intelligent System For Arabic Text Categorization

    Text categorization (classification) is the process of classifying documents into a predefined set of categories based on their content. In this paper, an intelligent Arabic text categorization system is presented. Machine learning algorithms are used in this system, and several algorithms for stemming and feature selection are tried. Moreover, the document is represented using several term weighting schemes, and finally the k-nearest neighbor and Rocchio classifiers are used for the classification process. Experiments are performed over a self-collected data corpus, and the results show that the suggested hybrid of statistical and light stemmers is the most suitable stemming algorithm for Arabic. The results also show that a hybrid of document frequency and information gain is the preferable feature selection criterion, and normalized tf-idf is the best weighting scheme. Finally, the Rocchio classifier outperforms the k-nearest neighbor classifier in the classification process. The experimental results illustrate that the proposed model is efficient and achieves a generalization accuracy of about 98%.
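    The core of the winning configuration — normalized tf-idf vectors classified by Rocchio (class centroids) — can be sketched as follows. The tiny English corpus is invented for illustration; the paper works on an Arabic corpus with stemming and feature selection, which are omitted here.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Length-normalized tf-idf vector (as a dict) for each token list,
    plus a closure to vectorize new documents with the same idf table."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))
    idf = {t: math.log(n / df[t]) for t in df}

    def vectorize(tokens):
        tf = Counter(tokens)
        v = {t: tf[t] * idf.get(t, 0.0) for t in tf}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        return {t: w / norm for t, w in v.items()}

    return [vectorize(d) for d in docs], vectorize

def cosine(a, b):
    """Dot product of two sparse (dict) vectors; inputs are unit-normalized."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def rocchio_train(vectors, labels):
    """Rocchio: the centroid of the tf-idf vectors of each class."""
    sums, counts = {}, Counter(labels)
    for v, y in zip(vectors, labels):
        c = sums.setdefault(y, Counter())
        for t, w in v.items():
            c[t] += w
    return {y: {t: w / counts[y] for t, w in c.items()} for y, c in sums.items()}

def rocchio_predict(vector, centroids):
    """Assign the class whose centroid is most cosine-similar."""
    return max(centroids, key=lambda y: cosine(vector, centroids[y]))
```

    A k-nearest-neighbor classifier would instead compare the query vector against every training vector, which is what Rocchio's single centroid per class avoids.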

    Multi One-Class Incremental SVM for Document Stream Digitization

    Inside the DIGIDOC project (ANR-10-CORD-0020), CONTenus et INTeractions (CONTINT), our approach was applied to several scenarios of classification of image streams which can correspond to real cases in digitization projects. Most of the time, the processing of documents is considered a well-defined task: the classes (also called concepts) are defined and known before the processing starts. But in real industrial document-processing workflows, the concepts may frequently change over time. In the context of document stream processing, the information and content included in the digitized pages can evolve over time, as can the user's judgment of what he wants to do with the resulting classification. The goal of this application is to create a learning module for stream-based document image classification (especially dedicated to a digitization process with a huge volume of data) that adapts to different situations in intelligent scanning tasks: adding, extending, contracting, splitting, or merging classes in an online mode of streaming data processing.
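    The "multi one-class" idea — one incremental one-class model per known concept, with the ability to add a class mid-stream and to reject pages that match no known class — can be sketched as below. A running centroid with a distance threshold stands in for the incremental one-class SVM; class names, vectors, and the threshold are invented for the example.

```python
import math

class OneClassCentroid:
    """Incrementally updated mean of the feature vectors seen for one class."""
    def __init__(self):
        self.n = 0
        self.mean = None

    def partial_fit(self, vec):
        if self.mean is None:
            self.mean = list(vec)
        else:
            for i, x in enumerate(vec):
                self.mean[i] += (x - self.mean[i]) / (self.n + 1)
        self.n += 1

    def distance(self, vec):
        return math.dist(self.mean, vec)

class StreamClassifier:
    """Pool of one-class models; pages too far from every class are rejected,
    so the operator can spawn a new class online."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.models = {}

    def partial_fit(self, label, vec):
        # Adding a previously unseen label creates a new class on the fly.
        self.models.setdefault(label, OneClassCentroid()).partial_fit(vec)

    def predict(self, vec):
        """Best matching known class, or None for an unknown (new) concept."""
        scored = [(m.distance(vec), y) for y, m in self.models.items()]
        if not scored:
            return None
        d, y = min(scored)
        return y if d <= self.threshold else None
```

    The rejection path is what distinguishes this from an ordinary multi-class classifier: a conventional classifier must assign one of the known classes, whereas a pool of one-class models can report "none of the above" and trigger class creation.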

    Detecting and Monitoring Hate Speech in Twitter

    Social media are sensors in the real world that can be used to measure the pulse of societies. However, the massive and unfiltered feed of messages posted on social media is a phenomenon that nowadays raises social alarms, especially when these messages contain hate speech targeted at a specific individual or group. In this context, governments and non-governmental organizations (NGOs) are concerned about the possible negative impact that these messages can have on individuals or on society. In this paper, we present HaterNet, an intelligent system currently being used by the Spanish National Office Against Hate Crimes of the Spanish State Secretariat for Security that identifies and monitors the evolution of hate speech on Twitter. The contributions of this research are fourfold: (1) It introduces the first intelligent system that monitors and visualizes, using social network analysis techniques, hate speech in social media. (2) It introduces a novel public dataset on hate speech in Spanish consisting of 6,000 expert-labeled tweets. (3) It compares several classification approaches based on different document representation strategies and text classification models. (4) The best approach consists of a combination of an LSTM+MLP neural network that takes as input the tweet's word, emoji, and expression-token embeddings enriched by tf-idf, and obtains an area under the curve (AUC) of 0.828 on our dataset, outperforming previous methods presented in the literature.

    The work by Quijano-Sanchez was supported by the Spanish Ministry of Science and Innovation grant FJCI-2016-28855. The research of Liberatore was supported by the Government of Spain, grant MTM2015-65803-R, and by the European Union's Horizon 2020 Research and Innovation Programme, under the Marie Sklodowska-Curie grant agreement No. 691161 (GEOSAFE). All the financial support is gratefully acknowledged.
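    The "embeddings enriched by tf-idf" input representation can be illustrated as a tf-idf-weighted average of token embeddings. The two-dimensional embeddings and idf values below are invented for the example; the actual system feeds such representations into an LSTM+MLP network, which is omitted here.

```python
from collections import Counter

# Toy word vectors and idf values, invented for illustration only.
EMBEDDINGS = {"odio": [1.0, 0.0], "amor": [0.0, 1.0]}
IDF = {"odio": 1.0, "amor": 2.0}

def tweet_vector(tokens):
    """tf-idf-weighted average of the token embeddings of one tweet."""
    tf = Counter(tokens)
    total = 0.0
    acc = [0.0, 0.0]
    for tok, count in tf.items():
        if tok not in EMBEDDINGS:
            continue  # out-of-vocabulary tokens are skipped in this sketch
        w = count * IDF[tok]  # tf-idf weight of this token
        total += w
        for i, x in enumerate(EMBEDDINGS[tok]):
            acc[i] += w * x
    return [a / total for a in acc] if total else acc
```

    Weighting by tf-idf lets rare, discriminative tokens dominate the tweet representation instead of frequent function words.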

    A knowledge engineering framework for intelligent retrieval of legal case studies

    Juris-Data is one of the largest case-study bases in France. The case studies are indexed by a legal classification elaborated by the Juris-Data Group. Knowledge engineering was used to design an intelligent interface for information retrieval based on this classification. The aim of the system is to help users find the case study which is most relevant to their own. The approach is potentially very useful, but to standardize it for other legal document bases, it is necessary to extract a legal classification of the primary documents. Thus, a methodology for the construction of these classifications was designed, together with a framework for index construction. The project led to the implementation of a legal case-study system based on the accumulated experimentation and the methodologies designed. It consists of a set of computerized tools which support the life cycle of legal documents, from their processing by legal experts to their consultation by clients.

    Deep Learning Architectures for Novel Problems

    With convolutional neural networks revolutionizing the computer vision field, it is important to extend the capabilities of neural-based systems to dynamic and unrestricted data such as graphs. Doing so not only expands the applications of such systems but also provides more insight into improving neural-based systems. Currently, most implementations of graph neural networks are based on vertex filtering on fixed adjacency matrices. Although important for many applications, vertex filtering restricts the applications to vertex-focused graphs and cannot be efficiently extended to edge-focused graphs such as social networks. Applications of current systems are mostly limited to images and document references. Beyond the graph applications, this work also explores the use of convolutional neural networks for intelligent character recognition in a novel way. Most systems define Intelligent Character Recognition as either a recurrent classification problem or image classification. This achieves great performance in a limited environment but does not generalize well to real-world applications. This work defines Intelligent Character Recognition as a segmentation problem, which we show provides many benefits. The goal of this work was to explore alternatives to current graph neural network implementations as well as new applications of such systems. This work also focused on improving Intelligent Character Recognition techniques on isolated words using deep learning techniques. Due to the contrast between these two contributions, this document is divided into Part I, focusing on the graph work, and Part II, focusing on the intelligent character recognition work.
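    The "vertex filtering on a fixed adjacency matrix" that the abstract describes can be sketched as a single GCN-style propagation step, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), shown here on a toy three-node chain graph. The graph, features, and weights are invented for the example.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph-convolution (vertex-filtering) step with symmetric
    normalization over the self-loop-augmented adjacency matrix."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops: A + I
    d = A_hat.sum(axis=1)                      # degrees of A_hat
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D^(-1/2)
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)  # ReLU
```

    The filter is tied to one fixed adjacency matrix A, which is exactly the restriction the text raises: the learned operation does not transfer to graphs whose structure, particularly edge-focused structure, differs from A.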

    Incident detection using data from social media

    This is an accepted manuscript of an article published by IEEE in the 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC) on 15/03/2018, available online: https://ieeexplore.ieee.org/document/8317967/citations#citations. The accepted version of the publication may differ from the final published version. © 2017 IEEE. Due to the rapid growth of population in the last 20 years, an increased number of instances of heavy recurrent traffic congestion has been observed in cities around the world. This rise in traffic has led to greater numbers of traffic incidents and subsequent growth of non-recurrent congestion. Existing incident detection techniques are limited to the use of sensors in the transportation network. In this paper, we analyze the potential of Twitter for supporting real-time incident detection in the United Kingdom (UK). We present a methodology for retrieving, processing, and classifying public tweets by combining Natural Language Processing (NLP) techniques with a Support Vector Machine (SVM) algorithm for text classification. Our approach can detect traffic-related tweets with an accuracy of 88.27%.
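    The classification step — text features fed to a linear SVM — can be sketched with bag-of-words features and hinge-loss subgradient descent, a minimal stand-in for an off-the-shelf SVM implementation. The vocabulary and tweets below are invented, and the paper's NLP preprocessing is omitted.

```python
# Invented toy vocabulary for the example.
VOCAB = ["crash", "delay", "road", "happy", "sunny"]

def featurize(tokens):
    """Bag-of-words count vector over the fixed vocabulary."""
    return [tokens.count(w) for w in VOCAB]

def train_linear_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """Minimize hinge loss + L2 penalty with plain SGD (labels in {+1, -1})."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:    # misclassified or inside the margin
                w = [wj - lr * (lam * wj - yi * xj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:             # only the regularizer acts
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, tokens):
    """+1 = incident-related tweet, -1 = not."""
    s = sum(wj * xj for wj, xj in zip(w, featurize(tokens))) + b
    return 1 if s >= 0 else -1
```

    In practice the features would come from the NLP pipeline (tokenization, normalization, tf-idf) rather than raw counts over a five-word vocabulary.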

    Document Classification in Support of Automated Metadata Extraction from Heterogeneous Collections

    A number of federal agencies, universities, laboratories, and companies are placing their documents online and making them searchable via metadata fields such as author, title, and publishing organization. To enable this, every document in the collection must be catalogued using the metadata fields. Though time-consuming, the task of identifying metadata fields by inspecting the document is easy for a human. The visual cues in the formatting of the document, along with accumulated knowledge and intelligence, make it easy for a human to identify various metadata fields. Even with the best possible automated procedures, numerous sources of error exist, including some that cannot be controlled, such as scanned documents with text obscured by smudges, signatures, or stamps. A commercially viable process for metadata extraction must remain robust in the presence of these external sources of error as well as in the face of the uncertainty that accompanies any attempt to automate intelligent behavior. While extraction accuracy and completeness must be the primary goal of an extraction system, the ability to detect and report questionable results is equally important for a production-quality system, since it promotes confidence in the system. We have developed and demonstrated a novel system for extracting metadata. First, a document is examined in an attempt to recognize it as an instance of a known document layout. Then a template, a scripted description of how to associate blocks of text in the layout with metadata fields, is applied to the document to extract the metadata. The extraction is validated after post-processing to evaluate the quality of the extraction and, if necessary, to flag untrusted extractions for human review. The success or failure of the template approach is directly tied to document classification, which is the ability to match the document to the proper template correctly and consistently.
Document classification in our system is implemented as a module which applies every template available in the system to a document to find candidate templates that extract any data at all. The candidate templates are evaluated by a validation module to select the best-performing template. This method is called post hoc classification. Post hoc classification is not only effective at selecting the correct class, but it also excels at minimizing false positives. It is, however, very sensitive to changes in the template collection and to poorly written templates. While this dissertation examines the evolution and all the major components of an automated metadata extraction system, the primary focus is on the problem of document classification. The main thrust of my research has been investigating alternative methods of document classification to replace or supplement post hoc classification. I experimented with machine learning techniques as an additional input factor for the post hoc classification script or the final validation script.
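The post hoc classification loop described above can be sketched as follows: every template is applied to the document, and a validation score (here, simply the number of fields extracted) selects the best candidate. The regex-based templates and sample text are invented for illustration; the dissertation's real templates are scripted layout descriptions, and its validation module is far richer than a field count.

```python
import re

# Invented toy templates: each maps metadata field names to line patterns.
TEMPLATES = {
    "tech_report": {
        "title":  r"^Title:\s*(.+)$",
        "author": r"^Author:\s*(.+)$",
        "org":    r"^Organization:\s*(.+)$",
    },
    "memo": {
        "to":      r"^To:\s*(.+)$",
        "subject": r"^Subject:\s*(.+)$",
    },
}

def apply_template(template, text):
    """Extract whichever metadata fields the template's patterns match."""
    fields = {}
    for name, pattern in template.items():
        m = re.search(pattern, text, re.MULTILINE)
        if m:
            fields[name] = m.group(1).strip()
    return fields

def post_hoc_classify(text):
    """Try every template; keep the one whose extraction validates best."""
    best_name, best_fields = None, {}
    for name, template in TEMPLATES.items():
        fields = apply_template(template, text)
        if len(fields) > len(best_fields):   # crude stand-in for validation
            best_name, best_fields = name, fields
    return best_name, best_fields
```

This also makes the stated weakness concrete: a new, overly permissive template added to `TEMPLATES` could out-score the correct one, which is why the dissertation investigates machine-learning classifiers to supplement the validation step.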