
    Website Content Extraction Using Web Structure Analysis

    The Web is the largest data repository ever available in the history of humankind. Major efforts have been made to provide efficient access to relevant information within this huge repository of data. Although several techniques have been developed to address the problem of Web data extraction, their use is still not widespread, mostly because of the need for high human intervention and the low quality of the extraction results. This project takes a domain-oriented approach to Web data extraction and discusses its application to extracting news from Web sites. It uses an abstraction method to identify important sections in a web document. Relevant information is taken into account and highlighted in order to produce a focused web content output. Facts and data about the project were gathered from various sources such as the Internet and books. The methodology used is the Waterfall Model, which involves several phases: Planning, Analysis, Design and Implementation. The result of this project is a presentation and review of web content extraction as it is currently being developed, the goal being greater usability and ease of use for web users.
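    The structure-analysis idea described above can be sketched as a text-density heuristic: score each page block by how much visible text it carries versus how much of that text sits inside links, and keep the densest block as the main content. The HTML snippet, the tag lists, and the scoring weights below are illustrative assumptions, not the project's actual method.

```python
from html.parser import HTMLParser

BOILERPLATE_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class BlockTextExtractor(HTMLParser):
    """Collect visible text per <div>/<p> block, skipping tags that
    usually hold boilerplate, and count characters inside links."""
    def __init__(self):
        super().__init__()
        self.blocks = []          # list of (text, link_chars)
        self._skip = 0
        self._parts = []
        self._link_chars = 0
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link = True
        elif tag in ("div", "p") and self._skip == 0:
            self._flush()         # a new block starts: close the previous one

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self._skip:
            self._skip -= 1
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        data = data.strip()
        if data and self._skip == 0:
            self._parts.append(data)
            if self._in_link:
                self._link_chars += len(data)

    def _flush(self):
        text = " ".join(self._parts)
        if text:
            self.blocks.append((text, self._link_chars))
        self._parts, self._link_chars = [], 0

    def close(self):
        super().close()
        self._flush()             # close the last open block

def main_content(html: str) -> str:
    """Return the block with the best score: long text is rewarded,
    link-heavy text (typical of navigation) is penalised."""
    p = BlockTextExtractor()
    p.feed(html)
    p.close()
    if not p.blocks:
        return ""
    text, _ = max(p.blocks, key=lambda b: len(b[0]) - 2 * b[1])
    return text

page = """<html><body>
<nav><a href="/">Home</a> <a href="/news">News</a></nav>
<div>Researchers release a tool that identifies the main article body
of a news page automatically, ignoring navigation and footers.</div>
<footer><a href="/about">About</a></footer>
</body></html>"""
print(main_content(page))
```

    On this toy page the navigation and footer are dropped entirely and only the article block survives, which is the "focused web content output" the abstract aims at.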

    Interoperability and FAIRness through a novel combination of Web technologies

    Data in the life sciences are extremely diverse and are stored in a broad spectrum of repositories ranging from those designed for particular data types (such as KEGG for pathway data or UniProt for protein data) to those that are general-purpose (such as FigShare, Zenodo, Dataverse or EUDAT). These data have widely different levels of sensitivity and security considerations. For example, clinical observations about genetic mutations in patients are highly sensitive, while observations of species diversity are generally not. The lack of uniformity in data models from one repository to another, and in the richness and availability of metadata descriptions, makes integration and analysis of these data a manual, time-consuming task with no scalability. Here we explore a set of resource-oriented Web design patterns for data discovery, accessibility, transformation, and integration that can be implemented by any general- or special-purpose repository as a means to assist users in finding and reusing their data holdings. We show that by using off-the-shelf technologies, interoperability can be achieved at the level of an individual spreadsheet cell. We note that the behaviours of this architecture compare favourably to the desiderata defined by the FAIR Data Principles, and can therefore represent an exemplar implementation of those principles. The proposed interoperability design patterns may be used to improve discovery and integration of both new and legacy data, maximizing the utility of all scholarly outputs.
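    One off-the-shelf mechanism behind "interoperability at the level of an individual spreadsheet cell" is plain HTTP content negotiation: one identifier, many representations, chosen by the client's Accept header. The sketch below is a minimal illustration of that pattern only; the predicate IRI and the cell data are made-up assumptions, not the paper's actual vocabulary.

```python
import json

def serialize_cell(cell: dict, accept: str) -> tuple:
    """Pick a representation of one 'cell' resource from the client's
    Accept header; return (media_type, body)."""
    if "text/turtle" in accept:
        # Hypothetical predicate IRI; a real deployment would use a shared
        # vocabulary so cells from different repositories align semantically.
        body = f'<{cell["id"]}> <http://example.org/hasValue> "{cell["value"]}" .'
        return "text/turtle", body
    if "application/json" in accept:
        return "application/json", json.dumps(cell)
    return "text/plain", str(cell["value"])   # fallback representation

cell = {"id": "http://example.org/sheet1/A1", "value": "42"}
for accept in ("text/turtle", "application/json", "*/*"):
    media_type, body = serialize_cell(cell, accept)
    print(media_type, "->", body)
```

    Because the cell's identifier is a resolvable URI, a Linked Data client can ask for Turtle while a spreadsheet tool asks for JSON, and both address exactly the same resource.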

    Approach for Unwrapping the Unstructured to Structured Data: the Case of Classified Ads in HTML Format

    Data sources in various forms and formats are available on the Internet. Data can be semi-structured or unstructured. The research's objective is to develop an approach for unwrapping unstructured data available on the Internet into structured data / a database. The unstructured data used in this study are classified ads on Indonesian websites, in HTML format. An illustration was made to test the approach. The results of the test show an f-measure value of 99.13%.
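    A minimal sketch of the two pieces involved: a wrapper that turns semi-structured ad markup into structured records, and the f-measure used to score it against a gold standard. The ad markup, field names, and prices below are hypothetical, not taken from the study's dataset.

```python
import re

# Hypothetical listing markup; a real site's template would differ.
AD_PATTERN = re.compile(
    r'<li class="ad">\s*<span class="title">(?P<title>.*?)</span>\s*'
    r'<span class="price">(?P<price>.*?)</span>\s*</li>',
    re.S)

def unwrap(html: str) -> list:
    """Turn semi-structured ad markup into structured records (dicts)."""
    return [m.groupdict() for m in AD_PATTERN.finditer(html)]

def f_measure(extracted, expected) -> float:
    """Harmonic mean of precision and recall, the metric the study reports."""
    tp = len([r for r in extracted if r in expected])
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(expected) if expected else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

page = ('<ul><li class="ad"><span class="title">Used bicycle</span>'
        '<span class="price">Rp750.000</span></li>'
        '<li class="ad"><span class="title">Guitar</span>'
        '<span class="price">Rp1.200.000</span></li></ul>')
records = unwrap(page)
print(records)
print(f_measure(records, records))
```

    On this toy page both ads are recovered perfectly, so the f-measure is 1.0; on real, messier listings the score drops, which is why the study's 99.13% is the headline result.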

    Enhanced biomedical data extraction from scientific publications

    The field of scientific research is constantly expanding, with thousands of new articles being published every day. As online databases grow, so does the need for technologies capable of navigating and extracting key information from the stored publications. In the biomedical field, these articles lay the foundation for advancing our understanding of human health and improving medical practices. With such a vast amount of data available, it can be difficult for researchers to quickly and efficiently extract the information they need. The challenge is compounded by the fact that many existing tools are expensive, hard to learn and not compatible with all article types. To address this, a prototype was developed. It leverages the PubMed API to give researchers access to the information in numerous open-access articles. Features include tracking of keywords and high-frequency words, along with the possibility of extracting table content. The prototype is designed to streamline the process of extracting data from research articles, allowing researchers to more efficiently analyze and synthesize information from multiple sources. (Master's thesis in informatics, INF399)
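    Two of the features described above can be sketched briefly: building a query URL for PubMed's real E-utilities endpoint, and counting high-frequency words in an abstract after stop-word removal. The stop-word list and the sample abstract are illustrative assumptions; this is not the prototype's own code.

```python
from collections import Counter
from urllib.parse import urlencode
import re

# Real NCBI E-utilities search endpoint for PubMed.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
STOP = {"the", "of", "and", "in", "to", "a", "is", "for", "with"}  # toy list

def search_url(term: str, retmax: int = 20) -> str:
    """URL that asks PubMed for article IDs matching a query."""
    return EUTILS + "?" + urlencode(
        {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"})

def top_words(text: str, n: int = 3) -> list:
    """High-frequency words after stop-word removal."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    return Counter(words).most_common(n)

abstract = ("Gene expression profiling of tumour samples shows that "
            "expression of marker genes separates tumour subtypes.")
print(search_url("gene expression profiling"))
print(top_words(abstract))
```

    Fetching the URL (for example with `urllib.request`) returns a JSON list of PubMed IDs, which can then be fed to `efetch.fcgi` to retrieve the article records themselves.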

    The NASA Astrophysics Data System: Architecture

    The powerful discovery capabilities available in the ADS bibliographic services are possible thanks to the design of a flexible search and retrieval system based on a relational database model. Bibliographic records are stored as a corpus of structured documents containing fielded data and metadata, while discipline-specific knowledge is segregated in a set of files independent of the bibliographic data itself. The creation and management of links to both internal and external resources associated with each bibliography in the database is made possible by representing them as a set of document properties and their attributes. To improve global access to the ADS data holdings, a number of mirror sites have been created by cloning the database contents and software on a variety of hardware and software platforms. The procedures used to create and manage the database and its mirrors have been written as a set of scripts that can be run in either an interactive or unsupervised fashion. The ADS can be accessed at http://adswww.harvard.edu. Comment: 25 pages, 8 figures, 3 tables
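    The data model the abstract describes, records as documents of fielded data, with resource links represented as properties carrying attributes, can be sketched in miniature. The field names, bibcodes, and property names below are assumptions for illustration, not the actual ADS schema.

```python
# Toy corpus: each record has fielded data plus a property -> attributes map
# for its resource links. All entries here are hypothetical.
records = [
    {"bibcode": "1999Toy....01..001A",
     "fields": {"author": ["Eichler, D."], "year": 1999,
                "title": "Gamma-ray bursts"},
     "properties": {"FULLTEXT": {"url": "https://example.org/ft/1"},
                    "CITATIONS": {"count": 120}}},
    {"bibcode": "2000Toy....02..002B",
     "fields": {"author": ["Xu, L."], "year": 2000,
                "title": "Accretion disks"},
     "properties": {"ABSTRACT": {"url": "https://example.org/abs/2"}}},
]

def search(field: str, value) -> list:
    """Fielded retrieval: bibcodes whose field equals or contains the value."""
    hits = []
    for rec in records:
        v = rec["fields"].get(field)
        if v == value or (isinstance(v, list) and value in v):
            hits.append(rec["bibcode"])
    return hits

def links(bibcode: str) -> dict:
    """Resource links for one record, as property -> attributes."""
    for rec in records:
        if rec["bibcode"] == bibcode:
            return rec["properties"]
    return {}

print(search("year", 1999))
print(links("1999Toy....01..001A"))
```

    Keeping links as properties-with-attributes, rather than hard-coded URLs inside the record body, is what lets mirrors rewrite or extend link targets without touching the bibliographic data itself.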

    Abmash: Mashing Up Legacy Web Applications by Automated Imitation of Human Actions

    Many business web-based applications do not offer application programming interfaces (APIs) to enable other applications to access their data and functions in a programmatic manner. This makes their composition difficult (for instance, to synchronize data between two applications). To address this challenge, this paper presents Abmash, an approach to facilitate the integration of such legacy web applications by automatically imitating human interactions with them. By automatically interacting with the graphical user interface (GUI) of web applications, the system supports all forms of integration, including bi-directional interactions, and is able to interact with AJAX-based applications. Furthermore, the integration programs are easy to write since they deal with end-user, visual user-interface elements. The integration code is simple enough to be called a "mashup". Comment: Software: Practice and Experience (2013)
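    Abmash itself is a Java library; the Python sketch below only illustrates its core idea of addressing a GUI the way a human does, by visible text and visual position rather than by HTML ids or CSS selectors. The element data and the 10-pixel row tolerance are made-up assumptions.

```python
# Hypothetical rendered-page model: visible text plus on-screen coordinates.
elements = [
    {"text": "Username", "x": 10,  "y": 20, "tag": "label"},
    {"text": "",         "x": 120, "y": 20, "tag": "input"},
    {"text": "Login",    "x": 10,  "y": 60, "tag": "button"},
]

def find(keyword: str) -> dict:
    """Select an element by its visible text, like a human reading the page."""
    return next(e for e in elements if keyword.lower() in e["text"].lower())

def right_of(anchor: dict) -> dict:
    """The nearest element to the right on (roughly) the same row."""
    candidates = [e for e in elements
                  if e["x"] > anchor["x"] and abs(e["y"] - anchor["y"]) < 10]
    return min(candidates, key=lambda e: e["x"] - anchor["x"])

# "Type into the field right of the 'Username' label, then press 'Login'."
field = right_of(find("Username"))
print(field["tag"])
```

    Because the selection is phrased in human terms ("the field right of the Username label"), the integration script keeps working even if the page's internal ids or markup change, which is what makes such code robust against legacy applications.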

    KSNet-Approach to Knowledge Fusion from Distributed Sources

    The rapidity of the decision-making process is an important factor in many branches of human life (business, healthcare, industry, military applications, etc.). Since responsible persons make decisions using available knowledge, it is important for knowledge management systems to deliver necessary and timely information. Knowledge logistics is a new direction in knowledge management that addresses this need. The technology of knowledge fusion, based on the synergistic use of knowledge from multiple distributed sources, forms the basis for these activities. The paper presents an overview of the Knowledge Source Network configuration approach (KSNet-approach) to knowledge fusion, the multi-agent architecture, and a research prototype of the KSNet knowledge fusion system based on this approach.