    Intelligent Fusion of Structural and Citation-Based Evidence for Text Classification

    This paper investigates how citation-based information and structural content (e.g., title, abstract) can be combined to improve the classification of text documents into predefined categories. We evaluate eight similarity measures, five derived from the citation structure of the collection and three derived from the structural content, and determine how they can be fused to improve classification effectiveness. To discover the best fusion framework, we apply Genetic Programming (GP) techniques. Our empirical experiments using documents from the ACM digital library and the ACM classification scheme show that we can discover similarity functions that work better than any single source of evidence in isolation, and whose combined performance through simple majority voting is comparable to that of Support Vector Machine classifiers.
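
    As a rough illustration of the voting stage, the sketch below classifies each test document with a k-nearest-neighbor rule under each similarity measure separately and then takes a simple majority vote across the per-measure predictions. The similarity matrices and label array are hypothetical inputs; the GP-evolved fusion functions themselves are not reproduced here.

```python
import numpy as np

def knn_label(sim_row, labels, k=5):
    # Majority label among the k training documents most similar to one test document.
    top = np.argsort(sim_row)[::-1][:k]
    values, counts = np.unique(labels[top], return_counts=True)
    return values[np.argmax(counts)]

def majority_vote(sim_matrices, labels, k=5):
    # sim_matrices: one hypothetical (n_test, n_train) similarity matrix per
    # evidence source; labels: np.array of training-document categories.
    per_measure = np.array([[knn_label(row, labels, k) for row in sims]
                            for sims in sim_matrices])   # (n_measures, n_test)
    fused = []
    for column in per_measure.T:                         # votes for one test document
        values, counts = np.unique(column, return_counts=True)
        fused.append(values[np.argmax(counts)])
    return np.array(fused)
```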

    Automated hierarchical classification of scanned documents using convolutional neural network and regular expression

    This research proposes automated hierarchical classification of scanned documents whose content consists of unstructured text and special patterns (specific, short strings), using a convolutional neural network (CNN) and a regular expression method (REM). The research data are digital correspondence documents in PDF image format from Pusat Data Teknologi dan Informasi (Technology and Information Data Center). The document hierarchy covers type of letter, type of manuscript letter, origin of letter, and subject of letter. The research method consists of preprocessing, classification, and storage to a database. Preprocessing covers text extraction using Tesseract optical character recognition (OCR) and the formation of word document vectors with Word2Vec. The hierarchical classification uses a CNN to classify 5 types of letters and regular expressions to classify 4 types of manuscript letters, 15 origins of letters, and 25 subjects of letters. The classified documents are stored in the Hive database in a Hadoop big data architecture. The data set comprises 5,200 documents: 4,000 for training, 1,000 for testing, and 200 for classification prediction. In the trial on the 200 new documents, 188 were classified correctly and 12 incorrectly, giving an accuracy of 94% for the automated hierarchical classification. As future work, content-based search over the classified scanned documents can be developed.
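
    The regex stage of such a hierarchy can be sketched as below. The patterns are hypothetical stand-ins: the paper's actual regular expressions for the Indonesian correspondence documents are not given in the abstract.

```python
import re

# Hypothetical patterns standing in for the paper's actual rules.
ORIGIN_PATTERNS = {
    "ministry": re.compile(r"\bkementerian\b", re.IGNORECASE),
    "university": re.compile(r"\buniversitas\b", re.IGNORECASE),
    "regional government": re.compile(r"\bpemerintah (provinsi|kota)\b", re.IGNORECASE),
}

def classify_origin(ocr_text, patterns=ORIGIN_PATTERNS, default="unknown"):
    # After the CNN assigns the letter type, short specific string patterns
    # resolve the lower hierarchy levels (origin, subject) from the OCR text.
    for label, pattern in patterns.items():
        if pattern.search(ocr_text):
            return label
    return default

print(classify_origin("Nomor 12/2020, Kementerian Keuangan"))  # -> "ministry"
```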

    Human Document Classification Using Bags of Words

    Humans are remarkably adept at classifying text documents into categories. For instance, while reading a news story, we are rapidly able to assess whether it belongs to the domain of finance, politics, or sports. Automating this task would have applications for content-based search or filtering of digital documents. To this end, it is interesting to investigate the nature of the information humans use to classify documents. Here we report experimental results suggesting that this information might, in fact, be quite simple. Using a progressive-revealing paradigm, we measured classification performance as a function of the number of words shown. We found that subjects are able to achieve similar classification accuracy with or without syntactic information across a range of passage sizes. These results have implications for models of human text understanding and also allow us to estimate what level of performance we can expect, in principle, from a system that does not require a prior step of complex natural language processing.

    Audio Content-Based Music Retrieval

    The rapidly growing corpus of digital audio material requires novel retrieval strategies for exploring large music collections. Traditional retrieval strategies rely on metadata that describe the actual audio content in words. When such textual descriptions are not available, one requires content-based retrieval strategies that utilize only the raw audio material. In this contribution, we discuss content-based retrieval strategies that follow the query-by-example paradigm: given an audio query, the task is to retrieve from a music collection all documents that are somehow similar or related to the query. Such strategies can be loosely classified according to their "specificity", which refers to the degree of similarity between the query and the database documents: high specificity refers to a strict notion of similarity, whereas low specificity refers to a rather vague one. Furthermore, we introduce a second classification principle based on "granularity", which distinguishes between fragment-level and document-level retrieval. Using a classification scheme based on specificity and granularity, we identify various classes of retrieval scenarios, comprising "audio identification", "audio matching", and "version identification". For these three important classes, we give an overview of representative state-of-the-art approaches, which also illustrate the sometimes subtle but crucial differences between the retrieval scenarios. Finally, we give an outlook on a user-oriented retrieval system that combines the various retrieval strategies in a unified framework.
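
    A minimal sketch of the "audio matching" scenario (medium specificity, fragment-level), assuming the librosa library: chroma features of a query fragment are slid over a database recording and compared by cosine similarity. Real systems add tempo invariance, feature smoothing, and indexing on top of this.

```python
import numpy as np
import librosa

def chroma(path, hop=2048):
    # 12-dimensional chroma features, one L2-normalized column per frame.
    y, sr = librosa.load(path, sr=22050)
    C = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
    return librosa.util.normalize(C, norm=2, axis=0)

def match(query_path, db_path):
    # Slide the query over the database document; return the best frame offset
    # and its average cosine distance (lower means a better match).
    Q, D = chroma(query_path), chroma(db_path)
    n = Q.shape[1]
    costs = [1.0 - float(np.mean(np.sum(Q * D[:, i:i + n], axis=0)))
             for i in range(D.shape[1] - n + 1)]
    best = int(np.argmin(costs))
    return best, costs[best]
```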

    Intelligent Web Crawling using Semantic Signatures

    The quantity of text added to the web in digital form continues to grow, and the quest for tools that can process this huge amount of data to retrieve material of interest is ongoing. Moreover, observing these large volumes of data over a period of time is a tedious task for any human being. Text mining is very helpful for these kinds of tasks: it is the process of observing patterns in text data using sophisticated statistical measures, both quantitatively and qualitatively. Using text mining techniques together with the internet and its technologies, we have developed a tool that retrieves documents concerning topics of interest using novel and sensitive classification tools. This thesis presents an intelligent web crawler, named Intel-Crawl, which identifies web pages of interest without the user's guidance or monitoring. Documents of interest are logged (by URL or file name). The package uses automatically generated semantic signatures to identify documents with content of interest. The tool also produces a vector that quantifies a document's content based on the semantic signatures, providing a rich and sensitive characterization of the document's content. Documents are classified according to content and presented to the user for further analysis and investigation. Intel-Crawl may be applied to any area of interest. It is likely to be very useful in areas such as law enforcement, intelligence gathering, and monitoring changes in web site contents over time. It is well suited to scrutinizing the web activity of a large collection of web pages with similar content. The utility of Intel-Crawl is demonstrated in various situations using different parameters and classification techniques.
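
    The signature-scoring idea can be gestured at as follows, with a hand-made weighted term list standing in for Intel-Crawl's automatically generated semantic signatures (which the abstract does not spell out):

```python
import re
from collections import Counter

# Hypothetical terms and weights standing in for a generated semantic signature.
SIGNATURE = {"forensics": 3.0, "malware": 2.5, "intrusion": 2.0, "exploit": 1.5}

def signature_vector(text, signature=SIGNATURE):
    # Quantify a page's content against the signature: one weighted term
    # count per signature entry, akin to the document content vector above.
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    return {term: words[term] * weight for term, weight in signature.items()}

def is_relevant(text, threshold=5.0):
    # Log a page (by URL or file name) when its total score passes the threshold.
    return sum(signature_vector(text).values()) >= threshold
```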

    Multi Faceted Text Classification using Supervised Machine Learning Models

    In recent years, document management tasks (known as information retrieval) have grown considerably due to the availability of digital documents everywhere. Automatic methods for extracting document information have become prominent for organizing information and knowledge discovery. Text classification is one such solution, in which natural language text is assigned to one or more predefined categories based on its content. In my research, the classification of text is mainly focused on sentiment label classification. The idea proposed for sentiment analysis is multi-class classification of online movie reviews. Many research papers discuss classifying sentiment as either positive or negative, but in this approach user reviews are classified by sentiment into multiple classes: positive, negative, neutral, very positive, and very negative. This classification task would help businesses map user reviews to the star ratings that users assign manually. This paper also proposes a better classification approach with a multi-tier prediction model. The goal of this research is to provide a better understanding of classification for sentiment analysis by applying different preprocessing techniques and selecting suitable features such as bag of words, stemming, stop-word removal, and POS tagging. These features are adjusted to fit machine learning text classification algorithms such as Naïve Bayes, SVM, and SGD on frameworks such as WEKA, SVMLight, and scikit-learn.
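
    A small illustration of the multi-class setup, assuming scikit-learn and toy reviews in place of the real movie-review corpus: bag-of-words features with stop-word removal feed an SGD classifier over the five sentiment classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

# Toy reviews only; the paper trains on a real online movie-review corpus.
reviews = ["an absolute masterpiece", "dull and far too long",
           "it was fine, nothing special", "truly awful acting",
           "one of the best films this year", "not bad, quite enjoyable"]
targets = [4, 1, 2, 0, 4, 3]

# Bag-of-words features with stop-word removal feeding an SGD classifier,
# two of the feature/algorithm choices named in the abstract.
model = make_pipeline(TfidfVectorizer(stop_words="english"),
                      SGDClassifier(random_state=0))
model.fit(reviews, targets)
print(LABELS[model.predict(["one of the best films I have seen"])[0]])
```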

    Feature selection methods in Persian sentiment analysis

    With the enormous growth of digital content on the internet, various types of online reviews, such as product and movie reviews, present a wealth of subjective information that can be very helpful for potential users. Sentiment analysis aims to use automated tools to detect subjective information in reviews. To date, little research has been conducted on feature selection in sentiment analysis, and work on Persian sentiment analysis is especially rare. This paper considers the problem of sentiment classification using different feature selection methods for online customer reviews in the Persian language. Three challenges of Persian text are its wide variety of declensional suffixes, inconsistent word spacing, and many informal or colloquial words. We address these challenges by proposing a model for sentiment classification of Persian review documents. The proposed model is based on stemming and feature selection and employs the Naive Bayes algorithm for classification. We evaluate the performance of the model on a collection of cellphone reviews, where the results show the effectiveness of the proposed approach.
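
    A minimal sketch of a feature-selection-plus-Naive-Bayes pipeline in this spirit, using scikit-learn's chi-squared selection and toy English stand-ins for the stemmed Persian reviews (the paper's Persian-specific stemming and its particular selection methods are not reproduced):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins; the paper works on stemmed Persian cellphone reviews.
docs = ["battery life is great", "screen broke after a week",
        "fast and reliable phone", "terrible battery, very slow",
        "excellent camera quality", "worst purchase this year"]
labels = [1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# Chi-squared feature selection keeps only the k most class-discriminative
# terms before the Naive Bayes classifier sees the data.
model = make_pipeline(CountVectorizer(),
                      SelectKBest(chi2, k=10),
                      MultinomialNB())
model.fit(docs, labels)
print(model.predict(["great phone, great battery"]))
```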

    Content-Based Quality Estimation for Automatic Subject Indexing of Short Texts under Precision and Recall Constraints

    Semantic annotations have to satisfy quality constraints to be useful for digital libraries, which is particularly challenging on large and diverse datasets. Confidence scores of multi-label classification methods typically refer only to the relevance of particular subjects, disregarding indicators of insufficient content representation at the document level. Therefore, we propose a novel approach that detects documents, rather than concepts, where quality criteria are met. Our approach uses a deep, multi-layered regression architecture comprising a variety of content-based indicators. We evaluated multiple configurations using text collections from law and economics, where the available content is restricted to very short texts. Notably, we demonstrate that the proposed quality estimation technique can determine subsets of previously unseen data where considerable gains in document-level recall can be achieved while upholding precision at the same time. Hence, the approach effectively performs a filtering that ensures high data quality standards in operative information retrieval systems. (Comment: authors' manuscript, submitted to the TPDL 2018 conference, 12 pages.)
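
    A toy sketch of the idea, assuming scikit-learn: a small multi-layer regressor maps content-based indicators of each short text to a predicted document-level quality score, and only documents clearing a threshold are passed on. The two indicators, the vocabulary, and all numbers are illustrative, not the paper's.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

VOCAB = {"tax", "contract", "court", "inflation", "market"}  # toy indexing vocabulary

def indicators(text):
    # Illustrative content-based indicators: token count and the fraction of
    # tokens covered by the indexing vocabulary.
    tokens = text.lower().split()
    covered = sum(t in VOCAB for t in tokens)
    return [len(tokens), covered / max(len(tokens), 1)]

texts = ["court ruling on contract tax", "misc.", "inflation and market outlook",
         "note", "tax court contract market inflation"]
quality = [0.9, 0.1, 0.8, 0.2, 0.95]  # observed document-level recall (toy numbers)

X, y = np.array([indicators(t) for t in texts]), np.array(quality)
reg = MLPRegressor(hidden_layer_sizes=(8, 8), max_iter=5000, random_state=0).fit(X, y)

# Keep only documents whose predicted quality clears the threshold.
accepted = [t for t, keep in zip(texts, reg.predict(X) >= 0.5) if keep]
print(accepted)
```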

    Optimizing the performance of a server-based classification for a large business document flow

    The document categorization problem in the case of a large business document flow is considered. Textual and visual embeddings were employed for classification: textual embeddings were extracted via Tesseract OCR, and the Viola-Jones method was applied to generate visual embeddings. This paper describes the performance optimization technology for the implemented classification algorithm. Servers with Intel CPUs were used for algorithm execution. For the single-threaded implementation, high-level and low-level optimizations were performed: high-level optimization was based on parametrizing the recognition algorithms and reusing intermediate data, while low-level optimization was carried out via compiler tools enabling an extended set of SIMD instructions. Parallelization with several multithreaded applications on multiple servers is also described. The proposed solution was tested on our own test data sets of business documents. The proposed method can be applied in modern information systems to analyze the content of a large flow of digital document images.
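
    The parallelization side of such a system can be gestured at with a short sketch; the per-document pipeline below is a placeholder, whereas the paper's version is optimized native code using SIMD instructions.

```python
from multiprocessing import Pool

def classify_page(path):
    # Stand-in for the per-image pipeline (OCR, embedding extraction,
    # classification); returns a hypothetical label for illustration.
    return (path, "invoice")

def classify_flow(paths, workers=8):
    # Spread the document flow over a pool of worker processes, mirroring the
    # multi-process, multi-server parallelization the paper describes.
    with Pool(processes=workers) as pool:
        return pool.map(classify_page, paths)

if __name__ == "__main__":
    print(classify_flow(["doc_%03d.pdf" % i for i in range(4)], workers=2))
```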