
    Identifying Web Tables - Supporting a Neglected Type of Content on the Web

    The abundance of data on the Internet facilitates the improvement of extraction and processing tools. The trend in open data publishing encourages the adoption of structured formats such as CSV and RDF. However, there is still a plethora of unstructured data on the Web which we assume contains semantics. For this reason, we propose an approach to derive semantics from web tables, which remain the most popular publishing tool on the Web. The paper also discusses methods and services for unstructured data extraction and processing, as well as machine learning techniques to enhance such a workflow. The eventual result is a framework to process, publish, and visualize linked open data. The software enables table extraction from various open data sources in HTML format and automatic export to the RDF format, making the data linked. The paper also gives an evaluation of machine learning techniques in conjunction with string similarity functions to be applied in a table recognition task. (9 pages, 4 figures)
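    The abstract mentions string similarity functions applied to table recognition. As a minimal sketch of that idea (the concrete similarity measures and attribute vocabulary used in the paper are not reproduced here; Jaccard similarity over character bigrams is one common choice), a noisy table header can be matched against a list of known attribute names:

```python
def bigrams(s: str) -> set:
    """Character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over character bigrams, in [0, 1]."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return len(ba & bb) / len(ba | bb)

def match_header(header: str, known_attributes: list, threshold: float = 0.4):
    """Return the best-matching known attribute, or None below the threshold."""
    best = max(known_attributes, key=lambda attr: jaccard(header, attr))
    return best if jaccard(header, best) >= threshold else None

# A misspelled header still resolves to the intended attribute.
print(match_header("Popluation", ["population", "area", "country"]))
```

A threshold like 0.4 is an illustrative value; in practice it would be tuned on labeled header pairs.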

    MEDQUAL: Improving Medical Web Search over Time with Dynamic Credibility Heuristics

    Performing a search on the World Wide Web (WWW) and traversing the resulting links is an adventure in which one encounters both credible and incredible web pages. Search engines, such as Google, rely on macroscopic Web topology patterns, and even highly ranked 'authoritative' web sites may be a mixture of informed and uninformed opinions. Without credibility heuristics to guide the user in a maze of facts, assertions, and inferences, the Web remains an ineffective knowledge delivery platform. This report presents the design and implementation of a modular extension to the popular Google search engine, MEDQUAL, which provisions both URL- and content-based heuristic credibility rules to reorder raw Google rankings in the medical domain. MEDQUAL, a software system written in Java, starts with a bootstrap configuration file which loads basic heuristics in XML format. It then provides a subscription mechanism so users can join birds-of-a-feather specialty groups, for example Pediatrics, in order to load specialized heuristics as well. The platform features a coordination mechanism whereby information seekers can effectively become secondary authors, contributing additional credibility heuristics by consensus vote. MEDQUAL uses standard XML namespace conventions to divide opinion groups so that competing groups can be supported simultaneously. The net effect is a merger of basic and supplied heuristics so that the system continues to adapt and improve itself over time in response to changing web content, changing opinions, and new opinion groups. The key goal of leveraging the intelligence of a large-scale and diffuse WWW user community is met, and we conclude by discussing our plans to develop MEDQUAL further and evaluate it.
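    The core mechanism described, reordering raw search rankings with URL- and content-based credibility rules, can be sketched roughly as follows. The rule tables, weights, and result records below are hypothetical stand-ins; MEDQUAL's actual XML rule format, namespaces, and subscription groups are not reproduced:

```python
# Hypothetical heuristic rules: substring patterns mapped to score weights.
URL_RULES = {".gov": 2.0, ".edu": 1.5, "blogspot": -1.0}
CONTENT_RULES = {"peer-reviewed": 1.0, "miracle cure": -2.0}

def credibility_score(result: dict) -> float:
    """Sum the weights of every URL and content rule the result triggers."""
    score = 0.0
    for pattern, weight in URL_RULES.items():
        if pattern in result["url"]:
            score += weight
    text = result["snippet"].lower()
    for phrase, weight in CONTENT_RULES.items():
        if phrase in text:
            score += weight
    return score

def rerank(results: list) -> list:
    """Stable sort by credibility; original rank order breaks ties."""
    return sorted(results, key=lambda r: -credibility_score(r))

results = [
    {"url": "http://example.blogspot.com", "snippet": "Miracle cure found!"},
    {"url": "https://nih.gov/asthma", "snippet": "Peer-reviewed guidelines."},
]
for r in rerank(results):
    print(r["url"])
```

Because the sort is stable, results that no rule touches keep their original engine-supplied order, which matches the idea of adjusting rather than replacing the raw ranking.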

    A Taxonomy of Workflow Management Systems for Grid Computing

    With the advent of Grid and application technologies, scientists and engineers are building more and more complex applications to manage and process large data sets and execute scientific experiments on distributed resources. Such application scenarios require means for composing and executing complex workflows. Therefore, many efforts have been made towards the development of workflow management systems for Grid computing. In this paper, we propose a taxonomy that characterizes and classifies various approaches for building and executing workflows on Grids. We also survey several representative Grid workflow systems developed by various projects worldwide to demonstrate the comprehensiveness of the taxonomy. The taxonomy not only highlights the design and engineering similarities and differences of state-of-the-art Grid workflow systems, but also identifies the areas that need further research. (29 pages, 15 figures)
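    A taxonomy used to classify surveyed systems can be represented as facets with enumerated values, with each system described by the value it takes on each facet. The facets, values, and system names below are simplified illustrations, not the paper's actual taxonomy or survey subjects:

```python
# Hypothetical facets of a workflow-system taxonomy and example classifications.
TAXONOMY = {
    "structure": {"DAG", "non-DAG"},
    "architecture": {"centralized", "hierarchical", "decentralized"},
}

systems = {
    "SystemA": {"structure": "DAG", "architecture": "centralized"},
    "SystemB": {"structure": "non-DAG", "architecture": "decentralized"},
}

def group_by(facet: str) -> dict:
    """Group surveyed systems by the value they take on one taxonomy facet."""
    groups = {}
    for name, attrs in systems.items():
        groups.setdefault(attrs[facet], []).append(name)
    return groups

print(group_by("architecture"))
```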

    Representation and use of chemistry in the global electronic age.

    We present an overview of the current state of public semantic chemistry and propose new approaches at a strategic and a detailed level. We show by example how a model for a Chemical Semantic Web can be constructed using machine-processed data and information from journal articles. This manuscript addresses questions of robotic access to data and its automatic re-use, including the role of Open Access archival of data. This is a pre-refereed preprint allowed by the publisher's (Royal Soc. Chemistry) Green policy. The author's preferred manuscript is an HTML hyperdocument with ca. 20 links to images, some of which are JPEGs and some of which are SVG (scalable vector graphics), including animations. There are also links to molecules in CML, for which the Jmol viewer is recommended. We suggest that readers who wish to see the full glory of the manuscript download the zipped version and unpack it on their machine. We also supply PDF and DOC (Word) versions which obviously cannot show the animations, but which may be the best place to start, particularly for those more interested in the text.

    Data Transformation and Semantic Log Purging for Process Mining

    Existing process mining approaches are able to tolerate a certain degree of noise in the process log. However, processes that contain infrequent paths, multiple (nested) parallel branches, or have been changed in an ad-hoc manner still pose major challenges. For such cases, process mining typically returns "spaghetti models" that are hardly usable even as a starting point for process (re-)design. In this paper, we address these challenges by introducing data transformation and pre-processing steps that improve and ensure the quality of mined models for existing process mining approaches. We propose the concept of semantic log purging: the cleaning of logs based on domain-specific constraints, utilizing semantic knowledge that typically complements processes. Furthermore, we demonstrate the feasibility and effectiveness of the approach based on a case study in the higher education domain. We think that semantic log purging will enable process mining to yield better results, thus giving process (re-)designers a valuable tool.
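    The idea of semantic log purging, removing traces that violate domain-specific constraints before mining, can be sketched as a filter over an event log. The log, activity names, and the single ordering constraint below are hypothetical examples loosely inspired by the higher-education case study, not the paper's actual data or constraint language:

```python
# Hypothetical event log: each trace is an ordered list of activity names.
log = [
    ["apply", "enroll", "exam", "graduate"],
    ["apply", "graduate"],               # violates the domain constraint
    ["apply", "enroll", "withdraw"],
]

def satisfies(trace: list) -> bool:
    """Domain constraint: 'graduate' may only occur after 'enroll'."""
    if "graduate" not in trace:
        return True
    return "enroll" in trace and trace.index("enroll") < trace.index("graduate")

# Semantic purging keeps only traces consistent with the domain knowledge.
purged = [t for t in log if satisfies(t)]
print(len(purged))
```

A real system would take many such constraints (ordering, exclusion, cardinality) from domain experts and apply them jointly before handing the purged log to the mining algorithm.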

    A Techno-Social Approach for Achieving Online Readership Popularity

    Understanding what drives readership popularity in online interactive media has important implications for individual practitioners and net-enabled organizations. For instance, it helps generate a success "formula" for designing potentially popular websites in the increasingly competitive online world. So far, research in this area lacks a unified approach to guiding the design of online interactive media, as well as to predicting their successful adoption and use, from both technological and social orientations. Drawing upon the media success literature and related social cognition theories, we establish a techno-social model for achieving online readership popularity, accounting for the impacts of technology-dependent and media-embedded characteristics. The proposed model and hypotheses will be tested by a content analysis of 100+ very popular weblogs and a survey of 2000+ active weblog readers. This research carries significant value for sustaining community- and firm-based user networks, which have been recognized as an important source of social and knowledge capital.

    Lexically specific knowledge and individual differences in adult native speakers’ processing of the English passive

    This article provides experimental evidence for the role of lexically specific representations in the processing of passive sentences and for considerable education-related differences in comprehension of the passive construction. The experiment measured response time and decision accuracy of participants with high and low academic attainment using an online task that compared processing and comprehension of active and passive sentences containing verbs strongly associated with the passive and active constructions, as determined by collostructional analysis. As predicted by usage-based accounts, participants' performance was influenced by frequency (both groups processed actives faster than passives; the low academic attainment participants also made significantly more errors on passive sentences) and lexical specificity (i.e., processing of passives was slower with verbs strongly associated with the active). Contrary to proposals made by Dąbrowska and Street (2006), the results suggest that all participants have verb-specific as well as verb-general representations, but that the latter are not as entrenched in the participants with low academic attainment, resulting in less reliable performance. The results also show no evidence of a speed-accuracy trade-off, making alternative accounts of the results (e.g., those of two-stage processing models, such as Townsend & Bever, 2001) problematic.

    BlogForever D2.4: Weblog spider prototype and associated methodology

    The purpose of this document is to present the evaluation of different solutions for capturing blogs, the established methodology, and the developed blog spider prototype.
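    The first step of capturing a blog is typically fetching and parsing its syndication feed. As a minimal sketch (the deliverable's actual spider architecture is not reproduced; the inline RSS document below stands in for what an HTTP fetch of a blog's feed URL would return), post titles and links can be extracted with the standard library:

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed body, standing in for a fetched HTTP response.
RSS = """<rss version="2.0"><channel>
  <title>Example Blog</title>
  <item><title>First post</title><link>http://example.org/1</link></item>
  <item><title>Second post</title><link>http://example.org/2</link></item>
</channel></rss>"""

def parse_feed(xml_text: str) -> list:
    """Return (title, link) pairs for every <item> in the feed."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

for title, link in parse_feed(RSS):
    print(title, link)
```

A full spider would additionally discover the feed URL from the blog's HTML, fetch the linked post pages, and handle Atom feeds and encoding variations.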