2,481 research outputs found

    Ontology Driven Web Extraction from Semi-structured and Unstructured Data for B2B Market Analysis

    The Market Blended Insight project has the objective of improving UK business-to-business marketing performance using semantic web technologies. In this project, we are implementing an ontology-driven web extraction and translation framework to supplement our backend triple store of UK companies, people and geographical information. It handles both semi-structured data and unstructured text on the web, annotating and then translating the extracted data according to the backend schema.
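    The abstract does not spell out the translation step, so the sketch below only illustrates the general idea: mapping an already-extracted record onto a triple-store schema. The namespace, class and property names (MBI, Company, name, postcode) are hypothetical placeholders, not the actual Market Blended Insight schema.

```python
# Minimal sketch: translating an extracted record into triples for a backend
# store. The namespace, class and property names below are hypothetical
# placeholders, not the actual Market Blended Insight schema.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

MBI = Namespace("http://example.org/mbi/schema#")   # assumed schema namespace

def record_to_triples(record: dict) -> Graph:
    """Map a dict extracted from a web page onto the backend schema."""
    g = Graph()
    g.bind("mbi", MBI)
    company = URIRef(f"http://example.org/mbi/company/{record['id']}")
    g.add((company, RDF.type, MBI.Company))
    g.add((company, MBI.name, Literal(record["name"])))
    if "postcode" in record:
        g.add((company, MBI.postcode, Literal(record["postcode"])))
    return g

if __name__ == "__main__":
    extracted = {"id": "12345", "name": "Acme Widgets Ltd", "postcode": "S1 2AB"}
    print(record_to_triples(extracted).serialize(format="turtle"))
```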

    Automatically assembling a full census of an academic field

    The composition of the scientific workforce shapes the direction of scientific research, directly through the selection of questions to investigate, and indirectly through its influence on the training of future scientists. In most fields, however, complete census information is difficult to obtain, complicating efforts to study workforce dynamics and the effects of policy. This is particularly true in computer science, which lacks a single, all-encompassing directory or professional organization. A full census of computer science would serve many purposes, not the least of which is a better understanding of the trends and causes of unequal representation in computing. Previous academic census efforts have relied on narrow or biased samples, or on professional society membership rolls. A full census can be constructed directly from online departmental faculty directories, but doing so by hand is prohibitively expensive and time-consuming. Here, we introduce a topical web crawler for automating the collection of faculty information from web-based department rosters, and demonstrate the resulting system on the 205 PhD-granting computer science departments in the U.S. and Canada. This method constructs a complete census of the field within a few minutes, and achieves over 99% precision and recall. We conclude by comparing the resulting 2017 census to a hand-curated 2011 census to quantify turnover and retention in computer science, in general and for female faculty in particular, demonstrating the types of analysis made possible by automated census construction.
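    As an illustration of the kind of topical crawler described, the sketch below does a breadth-first walk of a single department site and follows only links that look faculty-related. The seed URL, keyword list and page budget are assumptions for illustration, not the authors' actual system.

```python
# Sketch of a topical crawler for faculty rosters: breadth-first within one
# department site, following only links that look faculty-related. The seed
# URL and keyword list are illustrative assumptions, not the paper's system.
from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

FACULTY_HINTS = ("faculty", "people", "directory", "professor", "staff")

def looks_relevant(anchor_text: str, href: str) -> bool:
    blob = (anchor_text + " " + href).lower()
    return any(hint in blob for hint in FACULTY_HINTS)

def crawl_department(seed: str, max_pages: int = 50) -> list[str]:
    """Return pages reached from the seed by following faculty-looking links."""
    host = urlparse(seed).netloc
    queue, seen, fetched = deque([seed]), {seed}, []
    while queue and len(seen) <= max_pages:
        url = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        fetched.append(url)
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            nxt = urljoin(url, a["href"])
            if (urlparse(nxt).netloc == host and nxt not in seen
                    and looks_relevant(a.get_text(), nxt)):
                seen.add(nxt)
                queue.append(nxt)
    return fetched

if __name__ == "__main__":
    print(crawl_department("https://www.cs.example.edu/people/"))  # hypothetical seed
```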

    Fine Grained Approach for Domain Specific Seed URL Extraction

    Domain Specific Search Engines are expected to provide relevant search results. Availability of an enormous number of URLs across subdomains improves the relevance of domain specific search engines. Current methods for seed URL selection, however, are not systematic in ensuring representation of subdomains. We propose a fine grained approach for automatic extraction of seed URLs at the subdomain level using Wikipedia and Twitter as repositories. A SeedRel metric and a Diversity Index for seed URL relevance are proposed to measure subdomain coverage. We implemented our approach for the 'Security - Information and Cyber' domain and identified 34,007 seed URLs and 400,726 URLs across subdomains. The measured Diversity Index value of 2.10 confirms that all subdomains are represented; hence, a relevant 'Security Search Engine' can be built. Our approach also extracted more URLs (seed and child) than existing approaches for URL extraction.
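    The abstract does not give the formula behind the Diversity Index, so the sketch below assumes a Shannon-entropy-style index over the subdomain distribution of seed URLs; the paper's own SeedRel and Diversity Index definitions may differ.

```python
# Sketch of a Shannon-entropy-style diversity index over seed URLs grouped by
# subdomain. This is an assumed formulation for illustration; the paper's
# SeedRel metric and Diversity Index are defined in the paper itself.
import math
from collections import Counter

def diversity_index(seed_subdomains: list[str]) -> float:
    """H = -sum(p_i * ln p_i) over the subdomain distribution of seed URLs."""
    counts = Counter(seed_subdomains)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

if __name__ == "__main__":
    # Hypothetical subdomain labels for a 'Security - Information and Cyber' domain.
    seeds = (["network-security"] * 120 + ["cryptography"] * 90 +
             ["malware"] * 80 + ["forensics"] * 40 + ["iot-security"] * 30)
    print(round(diversity_index(seeds), 2))
```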

    Methodologies for the Automatic Location of Academic and Educational Texts on the Internet

    Traditionally, online databases of web resources have been compiled by a human editor, or through the submissions of authors or interested parties. Considerable resources are needed to maintain a constant level of input and relevance in the face of increasing material quantity and quality, and much of what is in databases is of an ephemeral nature. These pressures dictate that many databases stagnate after an initial period of enthusiastic data entry. The solution to this problem would seem to be the automatic harvesting of resources; however, this process necessitates the automatic classification of resources as ‘appropriate’ to a given database, a problem only solved by complex text content analysis. This paper outlines the component methodologies necessary to construct such an automated harvesting system, including a number of novel approaches. In particular, it looks at the specific problems of automatically identifying academic research work and Higher Education pedagogic materials. Where appropriate, experimental data is presented from searches in the field of Geography as well as the Earth and Environmental Sciences. In addition, appropriate software is reviewed where it exists, and future directions are outlined.
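    The classification of harvested pages as ‘appropriate’ can be illustrated with a minimal keyword-scoring filter, sketched below; the cue words and threshold are assumptions for illustration and stand in for the more complex text content analysis the paper discusses.

```python
# Minimal sketch of the 'is this page academic?' filtering step: a keyword
# score over the page text. Cue words and threshold are illustrative
# assumptions, not the classifiers discussed in the paper.
import re

ACADEMIC_CUES = {
    "abstract": 2.0, "references": 2.0, "doi": 1.5, "et al": 1.0,
    "methodology": 1.0, "university": 0.5, "journal": 1.0, "lecture": 1.0,
}

def academic_score(text: str) -> float:
    """Sum cue weights, normalised by document length in thousands of words."""
    lowered = text.lower()
    words = max(len(re.findall(r"\w+", lowered)), 1)
    raw = sum(w * lowered.count(cue) for cue, w in ACADEMIC_CUES.items())
    return raw / (words / 1000)

def is_academic(text: str, threshold: float = 5.0) -> bool:
    return academic_score(text) >= threshold

if __name__ == "__main__":
    sample = "Abstract. We study soil erosion ... References: Smith et al. 2001, DOI 10..."
    print(is_academic(sample))
```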

    Focused crawling of resources as a means of reducing search on the Web

    This paper examines the problem of creating a system for monitoring topical Web resources in a corporate environment. A classification of the basic resource-traversal (crawling) algorithms is proposed. A classification of site-ranking metrics is developed, based on the type of objects used to perform the evaluation. A preliminary assessment of the suitability of focused crawling for this task is carried out.
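    Focused (best-first) crawling can be sketched as a priority queue ordered by a topical relevance score, so the most promising links are fetched first; the scoring function, topic terms and seed URL below are illustrative assumptions, not the monitoring system described here.

```python
# Sketch of best-first (focused) crawling: a priority queue ordered by a
# topical relevance score, so the most promising pages are fetched first.
# The topic keywords and seed URL are illustrative assumptions.
import heapq
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

TOPIC_TERMS = ("monitoring", "crawler", "ranking", "corporate")

def relevance(text: str) -> float:
    lowered = text.lower()
    return float(sum(lowered.count(term) for term in TOPIC_TERMS))

def focused_crawl(seed: str, budget: int = 20) -> list[tuple[float, str]]:
    """Fetch up to `budget` pages, always expanding the highest-scored URL next."""
    frontier = [(0.0, seed)]          # min-heap of (-score, url)
    seen, visited = {seed}, []
    while frontier and len(visited) < budget:
        neg_score, url = heapq.heappop(frontier)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        visited.append((-neg_score, url))
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            nxt = urljoin(url, a["href"])
            if nxt not in seen:
                seen.add(nxt)
                # Score the link by its anchor text as a cheap relevance proxy.
                heapq.heappush(frontier, (-relevance(a.get_text()), nxt))
    return visited

if __name__ == "__main__":
    for score, url in focused_crawl("https://example.org/"):  # hypothetical seed
        print(score, url)
```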