13 research outputs found

    Automatic Semantic Header Generator

    As the amount of information and the number of Internet users grow, the problem of indexing and retrieval of electronic information resources becomes more critical. The existing search systems tend to generate misses and false hits due to the fact that they attempt to match the specified search terms without context in the target information resource. The COncordia INdexing and DIscovery system is an indexing system. It is a powerful means of helping users locate documents, software, and other types of data among large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. The Semantic Header, which is proposed by Desai [11], contains the semantic contents of information resources. It provides a useful tool in searching for a document based on a number of commonly used criteria. The information from the semantic header could be used by the search system to help locate appropriate documents with minimum effort. This paper introduces an automatic tool for the extraction and storage of some of the meta-information in a Semantic Header and an automatic text classification scheme

    Automatic Semantic Header Generator

    As the mounds of information and the number of Internet users grow, the problem of indexing and retrieving of electronic information resources becomes more critical. The existing search systems tend to generate misses and false hits due to the fact that they attempt to match the specified search terms without proper context in the target information resource. In environments that contain many different types of data, content indexing requires type-specific processing to extract indexing information effectively. The COncordia INdexing and DIscovery (Cindi) system is a system devised to support the registration of indexing meta-data for information resources and provide a convenient system for search and discovery. The Semantic Header, containing the semantic contents of information resources stored in the Cindi system, provides a useful tool to facilitate the searching for documents based on a number of commonly used criteria. This paper presents an automatic tool for the extraction and storage of some of the meta-information in a Semantic Header and the classification scheme used for generating the subject headings

    Automatic semantic header generator

    As the amount of information and the number of Internet users grow, the problem of indexing and retrieval of electronic information resources becomes more critical. The existing search systems tend to generate misses and false hits due to the fact that they attempt to match the specified search terms without context in the target information resource. The COncordia INdexing and DIscovery system is an indexing system. It is a powerful means of helping users locate documents, software, and other types of data among large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. The Semantic Header, which is proposed by Desai (11), contains the semantic contents of information resources. It provides a useful tool in searching for a document based on a number of commonly used criteria. The information from the semantic header could be used by the search system to help locate appropriate documents with minimum effort. This thesis introduces an automatic tool for the extraction and storage of some of the meta-information in a Semantic Header and an automatic text classification scheme

    Automatic semantic header generator for PDF documents

    The Concordia INdexing and DIscovery system (CINDI) is an information discovery and retrieval system that enables a reader to discover resources from a bibliographic database. It uses a metadata description called a semantic header to describe an information resource; its content includes the title, author name, subject and sub-subject, etc. The Automatic Semantic Header Generator (ASHG) is used to generate a draft version of the semantic header from a resource automatically. The existing system can deal with four document formats: HTML, TEXT, LATEX, and RTF. Because PDF is easy to use and portable across platforms, more and more people use it for document exchange and for reading online or in print, and more documents are published in PDF format. This thesis presents the design and implementation of an extension to the existing ASHG that extracts the semantic header from a PDF document automatically. First, the PDF document is converted to a plain text file using Xpdf, an open-source tool. Xpdf has been modified to improve the results of the conversion. To test the accuracy of the ASHG, 500 articles, all from the computer science field, are used in an experiment to generate the semantic header; the results are 80% accurate. However, the results reveal that the subject classification (about 41%) is the weakest point of ASHG and requires further work
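
    The convert-then-extract flow described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the ASHG implementation: it calls the stock `pdftotext` command that ships with Xpdf (the thesis used a modified Xpdf), and the `SemanticHeaderDraft` class, the `draft_header` function, and the heuristics inside it (first line is the title, second line lists the authors, the abstract follows a line reading "Abstract") are hypothetical.

```python
import re
import subprocess
from dataclasses import dataclass, field


@dataclass
class SemanticHeaderDraft:
    """Hypothetical container for a few semantic header fields."""
    title: str = ""
    authors: list[str] = field(default_factory=list)
    abstract: str = ""


def pdf_to_text(pdf_path: str) -> str:
    """Convert a PDF to plain text with the stock pdftotext tool.

    Passing "-" as the output file makes pdftotext write to stdout.
    Requires pdftotext to be installed and on the PATH.
    """
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def draft_header(text: str) -> SemanticHeaderDraft:
    """Build a draft header using naive, assumed heuristics."""
    lines = [ln.strip() for ln in text.splitlines()]
    non_empty = [ln for ln in lines if ln]

    # Assumption: the first non-empty line is the title.
    title = non_empty[0] if non_empty else ""

    # Assumption: the second non-empty line lists the authors,
    # separated by commas or the word "and".
    authors: list[str] = []
    if len(non_empty) > 1:
        parts = re.split(r",|\band\b", non_empty[1])
        authors = [p.strip() for p in parts if p.strip()]

    # Assumption: the abstract is the paragraph that follows a line
    # consisting only of the word "Abstract".
    abstract_lines: list[str] = []
    in_abstract = False
    for ln in lines:
        if not in_abstract:
            in_abstract = ln.lower() == "abstract"
            continue
        if not ln:
            if abstract_lines:
                break  # blank line ends the abstract paragraph
            continue
        abstract_lines.append(ln)

    return SemanticHeaderDraft(title, authors, " ".join(abstract_lines))
```

    With `pdftotext` installed, `draft_header(pdf_to_text("paper.pdf"))` would return a draft for a hypothetical file `paper.pdf`. Subject classification, which the abstract identifies as the weakest step, is deliberately left out of this sketch.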

    Porting the automatic semantic header generator to the web

    The Concordia INdexing and DIscovery system (CINDI) is an indexing system. It enables a user to index and discover information resources from the CINDI virtual library. An information resource is described using meta-data called a Semantic Header. The Automatic Semantic Header Generator (ASHG) is an automatic tool for the extraction and storage of some of the meta-information in a Semantic Header and an automatic text classification scheme. This major report describes how to port the ASHG to a web server on Linux. It is part of the work to develop a Web-based CINDI system. In the web-based ASHG system, MySQL is employed for storing the ASHG's thesaurus, Apache is used as the web server, and the PHP scripting language is used to create the user interface. All functions of ASHG are developed in C++ and Perl. The meta-data extraction functions of the existing ASHG are ported, and the main algorithms of ASHG are adapted to the web-based system. We redesigned and implemented the ASHG's architecture, the web-based user interface, the ASHG's thesaurus, the programs used to build and maintain the thesaurus, and all interfaces between the main algorithms and the database in the web-based ASHG system. Finally, the ASHG system has been integrated with the Web-based CINDI system

    Extracting Semantics of Documents Using Semantic Header Generator

    Accurate representation of electronic information on the Internet provides a solid foundation for precise information retrieval. However, the existing search systems tend to generate misses and false hits because they attempt to match the specified search terms without context in the target information resource. It is clear that using traditional keyword-based methods for representing the semantics of information items has become a major obstacle to high precision. In this paper, we propose the notion of a Semantic Header, which explicitly marks the logical structure of a document, to replace keyword indexing in extracting the meanings of information resources. The information from the Semantic Header could be used by the search system to help locate appropriate documents with minimum effort. We also introduce an automatic tool, called the Automatic Semantic Header Generator (ASHG), used for generating the meta-information for some significant fields of the Semantic Header

    CINDI System

    The advent of the Web has highlighted the importance of information discovery and retrieval, as it has become a daily task for most users of the Internet. Search engines have made information search tasks much easier; however, they retrieve links to documents based on term frequency, location of terms, link analysis, popularity, date of publication, length of the document, and proximity of query terms. The CINDI System is a digital library (repository) for research papers in the domain of computer science. The CINDI project aims to improve the discovery and search experience by targeting information to that required by academics and professionals in the field of Computer Science. This paper describes the CINDI system and its components and our experience with both the push mechanism and the pull mechanism available in CINDI

    Extraction of semantic header from RTF documents

    The problem of indexing and retrieval of electronic information resources becomes more critical as the amount of information and the number of Internet users continue to grow. The Semantic Header, proposed by Desai [3], is a portion of each document that contains the meta-information for each publicly accessible resource on the Internet. The Semantic Header for document-like Internet resources is a powerful means of helping users locate documents and other types of data among large repositories. In environments that contain many different types of data, content indexing requires type-specific processing to extract information effectively. In this project, which is a part of the ASHG (Automatic Semantic Header Generator) system, we present a model for type-specific information extraction that automatically extracts the meta-information from RTF (Rich Text Format) documents and stores it in a Semantic Header, which will be used as an index for the document. This provides a useful tool for searching for a document based on a number of commonly used criteria. The information from the Semantic Header could be used by the search system to help locate appropriate documents with minimum effort

    Enhancement and integration for CINDI system

    The CINDI system is an assembly of inter-related subsystems, working together as a digital library for academic documents in the field of computer science. These subsystems include the CINDI Robot, which downloads scientific documents including theses, technical reports, FAQs, academic papers and discussion groups; the CINDI Conference system; and the CINDI Registration and Upload subsystem, where authors upload academic documents. In addition, there is the Gleaning subsystem, which converts the non-PDF documents to PDF format and filters out the documents that are more appropriate; the Automatic Semantic Header Generator, which locates information about the author, title, keywords, subject and abstract from the documents; and the CINDI Search subsystem, which enables users to search for resources in the CINDI repository. This thesis is based on the techniques that were used for the integration of subsystems, which includes porting the Document Converter from the Windows platform to Linux. Enhancements were made to the Registration and Upload subsystem to allow multiple file uploads, and improvements were made to the Graphical User Interface. The CINDI Search subsystem was redesigned to improve functionality, and its interface was made more user-friendly. We have also developed an Annotation subsystem allowing users to make comments on documents in the CINDI repository

    ConfSys3: An Online Academic Conference System

    As an important component of the Concordia INdexing and DIscovering system (CINDI), the Conference System (ConfSys) aims to provide useful functionality and services to help both organizers and participants in any role in an academic conference or eJournal. All processes of auction, debate, decision, final version upload and so on associated with such events and issues are supported and facilitated by ConfSys. After more than ten years of development and improvement based on practical academic conference management experience, the second generation of ConfSys (ConfSys2) not only possesses many strong and practical features, such as user-group management, a privilege system, a context-sensitive help system, and a smart daemon and database maintenance mechanism, but is also able to support multi-series academic conferences. The experience with ConfSys2 pointed to some of its shortcomings, which in turn pointed to the need for additional conference management features and for incorporating the management of eJournals. This has resulted in the third generation of ConfSys, ConfSys3. In this version, we have focused on flexibility, extensibility and customization. ConfSys3 is based on the same platform as its previous version: Tomcat, Java/JSP and MySQL. In addition to the interface improvement, it adds many new useful features such as automatic email management, automated verification of uploaded files, special features needed for eJournal management, a concurrent track feature and associate editors, and a major upgrade that makes it possible for organizers to customize their conferences. Hence, ConfSys3 extends the advantages of ConfSys with better configuration to address specific requirements for supporting peer-review-based academic events in various domains