    Automatic Genre Classification in Web Pages Applied to Web Comments

    Automatic Web comment detection could significantly facilitate information retrieval systems, e.g., a focused Web crawler. In this paper, we propose a text genre classifier for Web text segments as an intermediate step for Web comment detection in Web pages. Different feature types and classifiers are analyzed for this purpose. We compare the two-level approach to state-of-the-art techniques operating on the whole Web page text and show that accuracy can be improved significantly. Finally, we illustrate the applicability for information retrieval systems by evaluating our approach on Web pages retrieved by a Web crawler.

    Large-Scale Pattern-Based Information Extraction from the World Wide Web

    Extracting information from text is the task of obtaining structured, machine-processable facts from information that is expressed in an unstructured manner. It thus allows systems to automatically aggregate information for further analysis, efficient retrieval, automatic validation, or appropriate visualization. This work explores the potential of using textual patterns for Information Extraction from the World Wide Web.

    A Framework to compare text annotators and its applications

    Texts in human languages have little logical structure and are inherently ambiguous. For this reason, the typical approach of Information Retrieval to text documents has been based on the Bag-of-words model, in which documents are analyzed only by the occurrence of terms, discarding any possible structure. A recently developing line of research, however, is devoted to adding structure to unstructured text by recognizing the topics contained in a text and annotating them. Topic annotators are systems that link a natural language document to the topics that are relevant for describing its content. These systems can be applied to many classic problems of Information Retrieval: the categorization of a document can be based on its topics; the clustering of a set of documents can be done using their topics to find similarities; for a search engine, it would be easier to find relevant pages if there were a way to know the topics that the query expresses and search for them in the cached web pages. In this thesis, we present a formal framework that describes the problems related to topic retrieval, the algorithms that solve those problems, and the way they can be benchmarked.
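The Bag-of-words model mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the thesis: the function name `bag_of_words` and the simple lowercase, alphanumeric tokenizer are assumptions chosen for illustration. The point it shows is that two sentences with different meanings collapse to the same representation once word order is discarded.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Return term counts for a text, ignoring word order.

    Hypothetical helper: tokenization is a plain lowercase
    alphanumeric split, which is an assumption of this sketch.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


# Two sentences with opposite meanings but identical term counts.
first = bag_of_words("The dog bites the man")
second = bag_of_words("The man bites the dog")

# The model cannot tell them apart, because structure is discarded.
print(first == second)  # True
```

This loss of structure is the limitation that topic annotation, as described in the abstract, tries to address.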

    The GNAT library for local and remote gene mention normalization

    Summary: Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named entity recognition, and normalization of gene and protein mentions in biomedical text. The library can be used as a component to be integrated with other text-mining systems, as a framework to add user-specific extensions, and as an efficient stand-alone application for the identification of gene and protein names for data analysis. On the BioCreative III test data, the current version of Gnat achieves a Tap-20 score of 0.1987.

    Social Tagging: Exploring the Image, the Tags, and the Game

    An increasing number of images are being uploaded, shared, and retrieved on the Web. These large image collections need to be properly stored, organized, and easily retrieved. Tags have a key role in image retrieval, but it is difficult for those who upload the images to also assign high-quality tags for potential future retrieval by others. Relying on professional keyword assignment is not a practical option for large image collections due to resource constraints. Although a number of content-based image retrieval systems have been launched, they have not demonstrated sufficient utility on large-scale image sources on the web, and are usually used as a supplement to existing text-based image retrieval systems. An alternative to professional image indexing can be social tagging, with two major types being photo-sharing networks and image labeling games. Here we analyze these applications to evaluate their usefulness from the semantic point of view. We also investigate whether social tagging behaviour can be managed. The findings of the study show that social tagging can generate a sizeable number of tags that can be classified as interpretive for an image, and that tagging behaviour is manageable and adjustable depending on tagging guidelines.

    Innovations in Discovery Systems: User Studies and the Bento Approach

    Over the past 30 years, library discovery services have evolved through expanded OPACs; federated search systems employing broadcast searching; Web-scale discovery systems (WSDS) that aggregate metadata and full-text content into a single integrated index; and, currently, hybrid bento-style systems that use federated techniques over WSDS, OPACs, and local information content. The bento systems partition search results into separate zoned screen displays grouped by content format type and/or local service results. Recent studies on Web-scale discovery systems have identified a number of user access issues centering on problems with blended result displays, problematic relevancy rankings of search results, full-text search problems, and the inability of WSDS to adequately provide access to local library services and resources. The concept of “full library discovery,” a phrase first coined by Lorcan Dempsey, has been introduced to refer to discovery approaches that move beyond the retrieval of collection materials to also include local information services and local content and links. The bento-based systems are an attempt to address the identified problems with WSDS and also provide discovery services that address user needs, in particular known item search and streamlined full-text access. This presentation will provide an analysis of the 38 libraries presently employing the bento approach and will look at identified user needs and search behaviors, as revealed in detailed search and clickthrough transaction log analyses. There is a clear need for an evidence-based analysis of user search behaviors in retrieval environments characterized by access to distributed information resources.