
    Text Categorization based on Associative Classification

    Text mining is an emerging technology that can be used to augment existing data in corporate databases by making unstructured text data available for analysis. The enormous increase in online documents, mostly due to the expanding internet, has renewed interest in automated document classification and data mining. The demand for text classification to aid the analysis and management of text is increasing. Text is cheap, but information, in the form of knowing which classes a text belongs to, is expensive. Text classification is the process of classifying documents into predefined categories based on their content. Automatic classification of text can provide this information at low cost, but the classifiers themselves must be built with expensive human effort, or trained from texts which have themselves been manually classified. Both classification and association rule mining are indispensable to practical applications. For association rule mining, the target of discovery is not predetermined, while for classification rule mining there is one and only one predetermined target. Thus, great savings and conveniences to the user could result if the two mining techniques could somehow be integrated. In this paper, such an integrated framework, called associative classification, is used for text categorization. The algorithm presented here for text classification uses words as features to derive a feature set from preclassified text documents. The Naïve Bayes classifier is then applied to the derived features for the final classification.
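    The paper's own algorithm derives the feature set with association-rule mining; the minimal sketch below only illustrates the final step the abstract names, word features from pre-classified documents fed to a multinomial Naïve Bayes classifier via scikit-learn. The toy documents, labels, and class names are invented for illustration.

```python
# Minimal sketch (not the paper's code): word features from pre-classified
# documents feeding a multinomial Naive Bayes classifier via scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Illustrative pre-classified documents (assumed, not from the paper).
train_docs = [
    "stock prices rose after the earnings report",
    "the central bank raised interest rates",
    "the team won the championship game",
    "the striker scored twice in the final",
]
train_labels = ["finance", "finance", "sports", "sports"]

# Words as features -> Naive Bayes on the derived feature set.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(train_docs, train_labels)

print(classifier.predict(["interest rates and stock markets"]))  # ['finance']
```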

    Intelligent Web Crawling using Semantic Signatures

    The quantity of text that is added to the web in digital form continues to grow, and the quest for tools that can process this huge amount of data to retrieve the data of our interest is an ongoing process. Moreover, observing these large volumes of data over a period of time is a tedious task for any human being. Text mining is very helpful in performing these kinds of tasks. Text mining is a process of observing patterns in text data using sophisticated statistical measures, both quantitatively and qualitatively. Using these text mining techniques and the power of the internet and its technologies, we have developed a tool that retrieves documents concerning topics of interest and utilizes novel and sensitive classification tools. This thesis presents an intelligent web crawler, named Intel-Crawl. This tool identifies web pages of interest without the user's guidance or monitoring. Documents of interest are logged (by URL or file name). This package uses automatically generated semantic signatures to identify documents with content of interest. The tool also produces a vector that is a quantification of a document's content based on the semantic signatures. This provides a rich and sensitive characterization of the document's content. Documents are classified according to content and presented to the user for further analysis and investigation. Intel-Crawl may be applied to any area of interest. It is likely to be very useful in areas such as law enforcement, intelligence gathering, and monitoring changes in web site contents over time. It is well suited for scrutinizing the web activity of a large collection of web pages pertaining to similar content. The utility of Intel-Crawl is demonstrated in various situations using different parameters and classification techniques.
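    The thesis does not reproduce Intel-Crawl's code here; the following is a hypothetical sketch of the general idea of scoring a fetched page against semantic signatures, treated here as weighted term lists, to quantify its content as a vector and flag pages of interest. The signature names, terms, weights, and threshold are all assumptions, not the thesis's actual signatures.

```python
# Hypothetical sketch of signature-based scoring (not the Intel-Crawl code):
# each "semantic signature" is a weighted term list; a page is quantified as a
# vector of signature scores and kept if any score passes a threshold.
import re
from collections import Counter

SIGNATURES = {  # assumed example signatures, not from the thesis
    "chemistry": {"reaction": 2.0, "compound": 1.5, "solvent": 1.0},
    "finance":   {"market": 2.0, "equity": 1.5, "dividend": 1.0},
}

def signature_vector(text: str) -> dict[str, float]:
    tokens = Counter(re.findall(r"[a-z]+", text.lower()))
    return {
        name: sum(weight * tokens[term] for term, weight in terms.items())
        for name, terms in SIGNATURES.items()
    }

def is_of_interest(text: str, threshold: float = 3.0) -> bool:
    return max(signature_vector(text).values()) >= threshold

page = "The market rallied as equity prices and dividend yields improved."
print(signature_vector(page), is_of_interest(page))
```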

    Automatic office document classification and information extraction

    TEXPROS (TEXt PROcessing System) is a document processing system (DPS) designed to support and assist office workers in their daily work of dealing with information and document management. In this thesis, document classification and information extraction, two of the major functional capabilities in TEXPROS, are investigated. Based on the nature of its content, a document is divided into structured and unstructured (i.e., free-text) parts. The conceptual and content structures are introduced to capture the semantics of the structured and unstructured parts of the document, respectively. The document is classified and information is extracted based on analyses of the conceptual and content structures. In our approach, the layout structure of a document is used to assist the analyses of its conceptual and content structures. By nested segmentation of a document, the layout structure of the document is represented by an ordered labeled tree structure, called the Layout Structure Tree (L-S-Tree). A sample-based classification mechanism is adopted in our approach for classifying documents. A set of pre-classified documents is stored in a document sample base in the form of sample trees. In the layout analysis, approximate tree matching is used to match the L-S-Tree of a document to be classified against the sample trees. The layout similarities between the document and the sample documents are evaluated based on the edit distance between the L-S-Tree of the document and the sample trees. The document samples whose layout structure is similar to that of the document are chosen for the conceptual analysis of the document. In the conceptual analysis, based on the mapping between the document and the document samples found during the layout analysis, the conceptual similarities between the document and the sample documents are evaluated based on the degree of conceptual closeness. The document sample whose conceptual structure is most similar to that of the document is chosen for extracting information. Extracting information from the structured part of the document is based on the layout locations of key terms appearing in the document and on string pattern matching. Based on the information extracted from the structured part of the document, the type of the document is identified. In the content analysis of the document, bottom-up and top-down analyses of the free text are combined to extract information from the unstructured part of the document. In the bottom-up analysis, the sentences of the free text are classified into those which are relevant or irrelevant to the extraction. The sentence classification is based on the semantic relationship between the phrases in the sentences and the attribute names in the corresponding content structure, by consulting the thesaurus. Then the thematic roles of the phrases in each relevant sentence are identified based on syntactic analysis and heuristic thematic analysis. In the top-down analysis, the appropriate content structure is identified based on the document type identified in the conceptual analysis. The information is then extracted from the unstructured part of the document by evaluating the restrictions specified in the corresponding content structure, based on the result of the bottom-up analysis. The information extracted from the structured and unstructured parts of the document is stored in the form of a frame-like structure (frame instance) in the database for information retrieval in TEXPROS.
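    As a rough illustration of the layout-matching step, the sketch below computes a simplified, constrained edit distance between small ordered labeled trees standing in for L-S-Trees and sample trees; it is not the approximate tree matching algorithm used in TEXPROS, and the toy trees, labels, and unit costs are assumptions.

```python
# Simplified sketch (not the TEXPROS algorithm): a constrained edit distance
# between ordered, labeled layout trees, used to rank sample trees by
# layout similarity. A node is (label, [children]).

def tree_size(t):
    return 1 + sum(tree_size(c) for c in t[1])

def tree_dist(a, b):
    """Cost of relabeling the roots plus an optimal alignment of the
    two child sequences (insertions/deletions cost the subtree size)."""
    root_cost = 0 if a[0] == b[0] else 1
    xs, ys = a[1], b[1]
    # Sequence edit distance over the children, with subtree-sized gaps.
    d = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        d[i][0] = d[i - 1][0] + tree_size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        d[0][j] = d[0][j - 1] + tree_size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            d[i][j] = min(
                d[i - 1][j] + tree_size(xs[i - 1]),                  # delete child subtree
                d[i][j - 1] + tree_size(ys[j - 1]),                  # insert child subtree
                d[i - 1][j - 1] + tree_dist(xs[i - 1], ys[j - 1]),   # match / relabel
            )
    return root_cost + d[len(xs)][len(ys)]

# Tiny illustrative layout trees (page -> blocks -> lines); labels are assumed.
doc    = ("page", [("block", [("line", []), ("line", [])]), ("block", [])])
sample = ("page", [("block", [("line", [])]), ("block", [])])
print(tree_dist(doc, sample))  # smaller distance = more similar layout
```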

    Adaptive Algorithms for Automated Processing of Document Images

    Large-scale document digitization projects continue to motivate interesting document understanding technologies such as script and language identification, page classification, segmentation, and enhancement. Typically, however, solutions are still limited to narrow domains or regular formats such as books, forms, articles, or letters, and operate best on clean documents scanned in a controlled environment. More general collections of heterogeneous documents challenge the basic assumptions of state-of-the-art technology regarding quality, script, content, and layout. Our work explores the use of adaptive algorithms for the automated analysis of noisy and complex document collections. We first propose, implement, and evaluate an adaptive clutter detection and removal technique for complex binary documents. Our distance-transform-based technique aims to remove irregular and independent unwanted foreground content while leaving text content untouched. The novelty of this approach is in its determination of the best approximation to the clutter-content boundary in the presence of text-like structures. Second, we describe a page segmentation technique called Voronoi++ for complex layouts, which builds upon the state-of-the-art method proposed by Kise [Kise1999]. Our approach does not assume structured text zones and is designed to handle multilingual text in both handwritten and printed form. Voronoi++ is a dynamically adaptive and contextually aware approach that considers components' separation features combined with Docstrum [O'Gorman1993] based angular and neighborhood features to form provisional zone hypotheses. These provisional zones are then verified based on the context built from local separation and high-level content features. Finally, our research proposes a generic model to segment and recognize characters for any complex syllabic or non-syllabic script using font models. This concept is based on the fact that font files contain all the information necessary to render text, and thus a model for how to decompose it. Instead of script-specific routines, this work is a step towards a generic character segmentation and recognition scheme for both Latin and non-Latin scripts.
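    As an illustration of the distance-transform idea only (not the clutter-removal algorithm described in the work), the sketch below estimates per-component stroke thickness from the distance transform of a binary page and discards components that are too thick to be text; the threshold and the toy page are assumptions.

```python
# Illustrative sketch only (not the paper's algorithm): use the distance
# transform of a binary page to estimate stroke thickness per connected
# component and drop components that look too thick to be text.
import numpy as np
from scipy import ndimage

def remove_clutter(binary_page: np.ndarray, max_stroke: float = 4.0) -> np.ndarray:
    """binary_page: bool array, True = foreground (ink)."""
    # Distance to background: the peak value inside a component is roughly
    # half its stroke width.
    dist = ndimage.distance_transform_edt(binary_page)
    labels, n = ndimage.label(binary_page)
    # Maximum distance value per component.
    peak = np.asarray(ndimage.maximum(dist, labels, index=np.arange(1, n + 1)))
    keep = np.zeros(n + 1, dtype=bool)
    keep[1:] = peak <= max_stroke          # thin strokes are assumed to be text
    return binary_page & keep[labels]

# Toy page: a thin "text" stroke plus a thick blob of clutter.
page = np.zeros((40, 40), dtype=bool)
page[5, 5:25] = True                       # 1-px-thick line (kept)
page[20:35, 20:35] = True                  # 15x15 blob (removed)
print(remove_clutter(page).sum())          # only the thin stroke survives
```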

    Improving Document Representation Using Retrofitting

    Data-driven learning of document vectors that capture the linkage between documents is of immense importance in natural language processing (NLP). These document vectors can, in turn, be used for tasks like information retrieval, document classification, and clustering. Inherently, documents are linked together in the form of hyperlinks or citations in the case of web pages or academic papers, respectively. Methods like PV-DM or PV-DBOW try to capture the semantic representation of a document using only the text information; they ignore the network information altogether while learning the representation. Similarly, methods developed for network representation learning, like node2vec or DeepWalk, capture the linkage information between documents but ignore the text information altogether. In this thesis, we propose a method based on retrofitting, originally introduced for refining word embeddings using a semantic lexicon, which incorporates both the text and the network information while learning the document representation. We also analyze the optimum weight for adding network information that yields the best embedding. Our experimental results show that our method improves the classification score by 4%, and we also introduce a new dataset containing both network and content information.
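    The thesis builds on retrofitting (Faruqui et al., 2015); as a rough sketch under that assumption, the code below applies the standard retrofitting update to document vectors, pulling each text-derived embedding toward its linked neighbours, with alpha and beta as tunable text-versus-network weights. The toy vectors and link structure are invented for illustration.

```python
# Sketch of the standard retrofitting update (Faruqui et al., 2015), adapted
# here to document vectors: pull each text-derived vector toward the vectors
# of its linked (cited / hyperlinked) neighbours. Weights are illustrative.
import numpy as np

def retrofit(text_vecs: dict, links: dict, alpha: float = 1.0,
             beta: float = 1.0, iters: int = 10) -> dict:
    """text_vecs: doc_id -> text-only embedding; links: doc_id -> neighbour ids.
    alpha keeps a vector near its original text embedding, beta pulls it
    toward its network neighbours."""
    new = {d: v.copy() for d, v in text_vecs.items()}
    for _ in range(iters):
        for d, neighbours in links.items():
            nbrs = [n for n in neighbours if n in new]
            if not nbrs:
                continue
            new[d] = (alpha * text_vecs[d] + beta * sum(new[n] for n in nbrs)) \
                     / (alpha + beta * len(nbrs))
    return new

# Toy example: three documents, where document "a" cites "b" and "c".
vecs = {k: np.random.rand(5) for k in "abc"}
print(retrofit(vecs, {"a": ["b", "c"], "b": ["a"], "c": ["a"]})["a"])
```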

    Multidimensional Analysis Data To Create A Decision Support System Dedicated To The University Environment

    Our objective is to make proposals for the design of a quality decision-support information system (SIS/SID) that meets the needs of the university's different stakeholders. We focus in particular on the modeling of document resources, which is poorly captured by the concept of data marts in the current tools on the market. Documents are often deposited on an organization's information system without classification, without indexing, and without any information on their content, their purpose, or their technical requirements and uses. Describing a document's properties is a constraining step that depends on the author and on the organization's documentary practices, and few users fill in the properties of the documents they deposit on an information system. It then becomes naturally more difficult to retrieve this missing information, which usually takes the form of empty fields; it is also necessary that the input fields provided be adequate and appropriately organized, arranged, and explained. Indeed, it often happens, for example on an organization's intranet, that the deposit areas are not conducive to providing relevant information on the properties of the uploaded documents. In the best case, the documents are managed by their own systems, accessible through their own search engines or through federated search engines. This is why we try to answer the question: how can a set of metadata be produced that is specific to multidimensional databases oriented toward decision support in universities?
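    Purely as a hypothetical illustration of the kind of descriptive properties the text says are usually missing when documents are deposited, the sketch below defines a minimal metadata record; the field names and example values are assumptions, not a proposed standard or the authors' model.

```python
# Hypothetical sketch only: a minimal metadata record of the kind the text
# argues is usually missing when documents are deposited without classification
# or indexing. Field names and values are illustrative assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass
class DocumentMetadata:
    title: str
    author: str
    subject: str          # what the document is about
    document_type: str    # report, syllabus, minutes, ...
    purpose: str          # why it was produced / intended use
    deposited_on: date

record = DocumentMetadata(
    title="Annual enrollment report",
    author="Registrar's office",
    subject="student enrollment",
    document_type="report",
    purpose="decision support",
    deposited_on=date(2024, 9, 1),
)
print(record.document_type)
```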

    HSAS: Hindi Subjectivity Analysis System

    With the development of Web 2.0, there is an abundance of documents expressing users' opinions, attitudes, and sentiments in textual form. This user-generated textual content is an important source of information for sound decision making by organizations and the government. Textual information can be categorized into two types: facts and opinions. Subjectivity analysis is the automatic extraction of subjective information from the opinions posted by users, dividing the content into subjective and objective sentences. Most of the work in subjectivity analysis exists for English language data, but with the introduction of the UTF-8 Unicode standard, Hindi language content on the web is growing very rapidly. In this paper, the Hindi Subjectivity Analysis System (HSAS) is proposed. It explores two different methods of generating a subjectivity lexicon using the resources available in English, and their comparative evaluation in performing subjectivity analysis at the sentence level. The first method uses the English OpinionFinder subjectivity lexicon. The second method uses a small seed word list of Hindi and expands it to generate a subjectivity lexicon. Different evaluation strategies are used to validate the lexicon. We achieved 71.4% agreement with human annotators and ~80% classification accuracy on a parallel English-Hindi dataset. Extensive simulations conducted on the test dataset confirm the validity of the suggested method.
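    HSAS's lexicons are not reproduced here; the sketch below only illustrates the general idea of sentence-level, lexicon-based subjectivity tagging, with a tiny placeholder Hindi lexicon and an assumed clue-count threshold rather than the paper's actual resources.

```python
# Minimal sketch of sentence-level, lexicon-based subjectivity tagging (not the
# HSAS implementation): a sentence is marked subjective if it contains at least
# `min_clues` terms from the subjectivity lexicon. The tiny lexicon below is an
# illustrative placeholder, not the lexicon built in the paper.
SUBJECTIVITY_LEXICON = {"अच्छा", "बुरा", "सुंदर", "खराब"}  # good, bad, beautiful, poor

def is_subjective(sentence: str, min_clues: int = 1) -> bool:
    tokens = sentence.split()
    return sum(tok in SUBJECTIVITY_LEXICON for tok in tokens) >= min_clues

print(is_subjective("खाना बहुत अच्छा है"))          # True: contains a clue word
print(is_subjective("दिल्ली भारत की राजधानी है"))   # False: purely factual
```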

    Application of a bilingual faceted terminological system for specialized information retrieval

    The central aim of this poster is to describe the development of a virtual resources library with specialized contents in the field of cardiovascular diseases, organized according to a faceted structure of categories. We understand facets as the classes of the different categories of one specific subject field. The information in our e-library is classified following two main criteria. On the one hand, it is classified by the subject(s) of the documents, ordered by an onomasiological structure; for instance, we sorted the cardiovascular diseases into more specific subfields such as cardiovascular abnormalities (vascular malformations, heart defects, etc.), heart diseases (arrhythmias, heart failure, heart neoplasms, etc.), and vascular diseases (hypertension, stroke, etc.). On the other hand, we take into consideration the characteristics and attributes of the most common types of information in the biomedical areas: the classification was made depending on the form, the structure, or the type of content of the document, giving classes such as reference documents (atlases, books, clinical practice guidelines, databases, dictionaries and glossaries, etc.); information about congresses and other meetings; and health and medical portals or associations. In addition, all the different categories and classes (subjects as well as document types) have been labelled in two languages, Spanish and English, to allow retrieval of the information no matter which language is used. Without a doubt, one of the most important aspects of the quality of information resources is the accuracy of content retrieval, avoiding both silence and documentary noise; the faceted classification methodology gives us the possibility of dealing with more than one category at a time, thereby increasing the relevance of the search results.
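    As a minimal sketch of the bilingual faceted scheme described above, the code below attaches English/Spanish facet values to a resource and retrieves it from either language. The data model, field names, URL, and Spanish renderings are assumptions for illustration, not the project's actual implementation.

```python
# Minimal sketch of a bilingual faceted scheme: each facet value carries an
# English and a Spanish label, and a resource may be tagged with values from
# several facets at once. Structure and names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class FacetValue:
    facet: str      # e.g. "subject" or "document_type"
    label_en: str
    label_es: str

@dataclass
class Resource:
    title: str
    url: str
    tags: set[FacetValue] = field(default_factory=set)

HEART_DISEASES = FacetValue("subject", "heart diseases", "enfermedades cardíacas")
GUIDELINE = FacetValue("document_type", "clinical practice guideline",
                       "guía de práctica clínica")

doc = Resource("Heart failure clinical guideline", "https://example.org/guide",
               {HEART_DISEASES, GUIDELINE})

# Faceted retrieval: a label in either language reaches the same resources.
def matches(resource: Resource, query: str) -> bool:
    return any(query in (t.label_en, t.label_es) for t in resource.tags)

print(matches(doc, "enfermedades cardíacas"))       # True
print(matches(doc, "clinical practice guideline"))  # True
```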