
    Integrating information to bootstrap information extraction from web sites

    In this paper we propose a methodology to learn to extract domain-specific information from large repositories (e.g. the Web) with minimum user intervention. Learning is seeded by integrating information from structured sources (e.g. databases and digital libraries). Retrieved information is then used to bootstrap learning for simple Information Extraction (IE) methodologies, which in turn produce more annotation to train more complex IE engines. All the corpora for training the IE engines are produced automatically by integrating information from different sources such as available corpora and services (e.g. databases or digital libraries). User intervention is limited to providing an initial URL and adding information missed by the different modules when the computation has finished. The information added or deleted by the user can then be reused to provide further training, yielding more information (recall) and/or more precision. We are currently applying this methodology to mining web sites of Computer Science departments.
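The seed-and-bootstrap loop this abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's actual system: the "patterns" here are just single left-context words, and the corpus is a list of sentences.

```python
import re

def learn_patterns(corpus, seeds):
    """Learn simple left-context patterns: the word immediately
    preceding a known entity in any sentence of the corpus."""
    patterns = set()
    for sentence in corpus:
        for seed in seeds:
            m = re.search(r"(\w+)\s+" + re.escape(seed), sentence)
            if m:
                patterns.add(m.group(1))
    return patterns

def apply_patterns(corpus, patterns):
    """Extract candidate entities: the word following any learned pattern."""
    found = set()
    for sentence in corpus:
        for p in patterns:
            for m in re.finditer(re.escape(p) + r"\s+(\w+)", sentence):
                found.add(m.group(1))
    return found

def bootstrap(corpus, seeds, rounds=2):
    """Seed entities train patterns; patterns find new entities;
    the grown entity set trains the next round."""
    entities = set(seeds)
    for _ in range(rounds):
        patterns = learn_patterns(corpus, entities)
        entities |= apply_patterns(corpus, patterns)
    return entities
```

For example, seeding with the single name "Smith" over two sentences learns the pattern "Professor" and recovers "Jones" as a second entity.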

    Building Hyper View web sites

    In this report a framework for building "virtual" web sites using the HyperView system is presented. Virtual web sites are web sites that offer information extracted and integrated from other web sites on the fly. The HyperView system already supports the demand-driven integration of information from different semistructured information sources into a graph database. The problem we address here is to query the database and generate HTML pages from the results in response to HTTP requests received from the user. The returned HTML pages should hide the aspects of data extraction and integration and give the user the impression of a single, coherent web site. We first show how HyperViews composed of graph-transformation rules can be defined to generate HTML pages from the database; in this way, web sites for individual application schemata can be designed. In the second part we present a generic rule set that defines a web interface for HyperView graph databases with arbitrary schemata. This generic web interface can be customized for a particular application by annotating the database schema and choosing appropriate styles. The work presented in this report completes the HyperView approach in the sense that it closes the circle of extracting and integrating information from the web by publishing the integrated data back on the web. Our approach applies equally to the integration and generation of XML documents on the web.
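The core idea of generating HTML from a graph database on demand can be illustrated with a toy "view rule". The dict-based database, the node fields, and the page layout below are all invented for illustration; HyperView's actual rules are graph transformations, not Python functions.

```python
def render_person(db, person_id):
    """Toy analogue of a view rule: match a node pattern in the graph
    database (here a dict of nodes) and rewrite it into an HTML page
    fragment, with graph edges becoming hyperlinks."""
    node = db[person_id]
    links = "".join(
        f'<li><a href="{pid}.html">{db[pid]["name"]}</a></li>'
        for pid in node.get("knows", []))
    return f'<h1>{node["name"]}</h1><ul>{links}</ul>'
```

The generated links point at pages for neighbouring nodes, which would themselves be rendered by the same rule on request, giving the impression of a single coherent site.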

    SPOT: a web-based tool for using biological databases to prioritize SNPs after a genome-wide association study

    SPOT (http://spot.cgsmd.isi.edu), the SNP prioritization online tool, is a web site for integrating biological databases into the prioritization of single nucleotide polymorphisms (SNPs) for further study after a genome-wide association study (GWAS). Typically, the next step after a GWAS is to genotype the top signals in an independent replication sample. Investigators will often incorporate information from biological databases so that biologically relevant SNPs, such as those in genes related to the phenotype or with potentially non-neutral effects on gene expression such as splice sites, are given higher priority. We recently introduced the genomic information network (GIN) method for systematically implementing this kind of strategy. The SPOT web site allows users to upload a list of SNPs and GWAS P-values and returns a prioritized list of SNPs using the GIN method. Users can specify candidate genes or genomic regions with custom levels of prioritization. The results can be downloaded or viewed in the browser where users can interactively explore the details of each SNP, including graphical representations of the GIN method. For investigators interested in incorporating biological databases into a post-GWAS SNP selection strategy, the SPOT web tool is an easily implemented and flexible solution.
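A cartoon of this kind of re-ranking is easy to write down. The scoring rule below (dividing the P-value by a boost factor for SNPs in candidate genes) is purely illustrative and is not the actual GIN method.

```python
def prioritize_snps(snps, candidate_genes, boost=10.0):
    """Hypothetical post-GWAS re-ranking: sort SNPs by P-value, but
    treat a SNP in a user-specified candidate gene as if its P-value
    were `boost` times smaller. Field names are illustrative."""
    def score(snp):
        p = snp["pvalue"]
        if snp.get("gene") in candidate_genes:
            p /= boost
        return p
    return sorted(snps, key=score)
```

With this rule, a candidate-gene SNP at P = 1e-5 outranks a non-candidate SNP at P = 5e-6, which is exactly the kind of biology-aware reordering the abstract describes.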

    Insider threat resistant SQL-injection prevention in PHP

    Web sites are either static sites, programs, or databases. Very often they are a mixture of these three aspects integrating relational databases as a back-end. Web sites require configuration and programming attention to assure security, confidentiality, and trustworthiness of the published information. SQL-injection attacks rely on weak validation of textual input used to build database queries. Maliciously crafted input may threaten the confidentiality and the security policies of Web sites relying on a database to store and retrieve information. Furthermore, insiders may introduce malicious code in a Web application, code that, when triggered by some specific input, for example, would violate security policies. This paper presents an original approach that combines static analysis, dynamic analysis, and code reengineering to automatically protect applications written in PHP from both malicious input (outsider threats) and malicious code (insider threats) that carry SQL-injection attacks. The paper also reports preliminary results about experiments performed on an old SQL-injection prone version of phpBB (version 2.0.0, 37193 LOC of PHP version 4.2.2 code). Results show that our approach successfully improved phpBB-2.0.0 resistance to SQL-injection attacks.
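The weak-validation flaw at the heart of SQL injection is language-independent; the sketch below shows it in Python with an in-memory SQLite table rather than PHP, contrasting string-spliced queries with parameterized ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user_unsafe(name):
    # Vulnerable: user input is spliced directly into the SQL string
    return conn.execute(
        "SELECT role FROM users WHERE name = '%s'" % name).fetchall()

def find_user_safe(name):
    # Parameterized: the driver binds the value, so injection fails
    return conn.execute(
        "SELECT role FROM users WHERE name = ?", (name,)).fetchall()

payload = "x' OR '1'='1"
```

The classic payload turns the unsafe query into `WHERE name = 'x' OR '1'='1'`, which matches every row; the parameterized version looks for a user literally named `x' OR '1'='1` and matches nothing.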

    Integrating web services into data intensive web sites

    Designing web sites is a complex task. Ad-hoc rapid prototyping easily leads to unsatisfactory results, e.g. poor maintainability and extensibility. However, existing web design frameworks focus exclusively on data presentation: the development of specific functionalities is still achieved through low-level programming. In this paper we address this issue by describing our work on the integration of (semantic) web services into a web design framework, OntoWeaver. The resulting architecture, OntoWeaver-S, supports rapid prototyping of service centred data-intensive web sites, which allow access to remote web services. In particular, OntoWeaver-S is integrated with a comprehensive web service platform, IRS-II, for the specification, discovery, and execution of web services. Moreover, it employs a set of comprehensive site ontologies to model and represent all aspects of service-centred data-intensive web sites, and thus is able to offer high level support for the design and development process

    The Firegoose: two-way integration of diverse data from different bioinformatics web resources with desktop applications

    Background: Information resources on the World Wide Web play an indispensable role in modern biology. But integrating data from multiple sources is often encumbered by the need to reformat data files, convert between naming systems, or perform ongoing maintenance of local copies of public databases. Opportunities for new ways of combining and re-using data are arising as a result of the increasing use of web protocols to transmit structured data.
    Results: The Firegoose, an extension to the Mozilla Firefox web browser, enables data transfer between web sites and desktop tools. As a component of the Gaggle integration framework, Firegoose can also exchange data with Cytoscape, the R statistical package, MultiExperiment Viewer (MeV), and several other popular desktop software tools. Firegoose adds the capability to easily use local data to query KEGG, EMBL STRING, DAVID, and other widely used bioinformatics web sites. Query results from these web sites can be transferred to desktop tools for further analysis with a few clicks. Firegoose acquires data from the web by screen scraping, microformats, embedded XML, or web services. We define a microformat, which allows structured information compatible with the Gaggle to be embedded in HTML documents. We demonstrate the capabilities of this software by performing an analysis of the genes activated in the microbe Halobacterium salinarum NRC-1 in response to anaerobic environments. Starting with microarray data, we explore functions of differentially expressed genes by combining data from several public web resources and construct an integrated view of the cellular processes involved.
    Conclusion: The Firegoose incorporates Mozilla Firefox into the Gaggle environment and enables interactive sharing of data between diverse web resources and desktop software tools without maintaining local copies. Additional web sites can be incorporated easily into the framework using the scripting platform of the Firefox browser. Performing data integration in the browser allows the excellent search and navigation capabilities of the browser to be used in combination with powerful desktop tools.
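The microformat idea (structured data embedded in ordinary HTML via markup conventions) can be demonstrated with the standard-library HTML parser. The `class="gene"` convention below is an invented stand-in, not the actual Gaggle microformat.

```python
from html.parser import HTMLParser

class MicroformatParser(HTMLParser):
    """Collect the text of every element carrying class="gene".
    The class name is a hypothetical microformat convention."""
    def __init__(self):
        super().__init__()
        self.in_gene = False
        self.genes = []

    def handle_starttag(self, tag, attrs):
        if ("class", "gene") in attrs:
            self.in_gene = True

    def handle_endtag(self, tag):
        self.in_gene = False

    def handle_data(self, data):
        if self.in_gene and data.strip():
            self.genes.append(data.strip())

def extract_genes(html):
    parser = MicroformatParser()
    parser.feed(html)
    return parser.genes
```

A page listing genes this way remains a normal human-readable HTML document, while a tool like Firegoose can scrape the structured list out of it without a separate API.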

    A Query Integrator and Manager for the Query Web

    We introduce two concepts: the Query Web as a layer of interconnected queries over the document web and the semantic web, and a Query Web Integrator and Manager (QI) that enables the Query Web to evolve. QI permits users to write, save and reuse queries over any web accessible source, including other queries saved in other installations of QI. The saved queries may be in any language (e.g. SPARQL, XQuery); the only condition for interconnection is that the queries return their results in some form of XML. This condition allows queries to chain off each other, and to be written in whatever language is appropriate for the task. We illustrate the potential use of QI for several biomedical use cases, including ontology view generation using a combination of graph-based and logical approaches, value set generation for clinical data management, image annotation using terminology obtained from an ontology web service, ontology-driven brain imaging data integration, small-scale clinical data integration, and wider-scale clinical data integration. Such use cases illustrate the current range of applications of QI and lead us to speculate about the potential evolution from smaller groups of interconnected queries into a larger query network that layers over the document and semantic web. The resulting Query Web could greatly aid researchers and others who now have to manually navigate through multiple information sources in order to answer specific questions
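The chaining condition described here (any query language, as long as results come back as XML) can be sketched with two toy "queries" where the second consumes the XML output of the first. The element names and data are invented for illustration.

```python
import xml.etree.ElementTree as ET

def query_terms(ontology):
    """First 'query': returns its results as XML, the only
    interchange condition QI imposes. Data is hard-coded here."""
    root = ET.Element("results")
    for term in ontology:
        ET.SubElement(root, "term").text = term
    return ET.tostring(root, encoding="unicode")

def query_filter(xml_results, prefix):
    """Second query chains off the first: its input is the XML
    output of query_terms, and its output is again XML, so a
    third query could chain off it in turn."""
    root = ET.fromstring(xml_results)
    out = ET.Element("results")
    for t in root.findall("term"):
        if t.text.startswith(prefix):
            ET.SubElement(out, "term").text = t.text
    return ET.tostring(out, encoding="unicode")
```

Because both sides only agree on XML, the first query could just as well be SPARQL against an ontology service and the second an XQuery filter, which is the interoperability point the abstract makes.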

    Interoperability of Information Systems and Heterogenous Databases Using XML

    Interoperability of information systems is the most critical issue facing businesses that need to access information from multiple information systems on different environments and diverse platforms. Interoperability has been a basic requirement for modern information systems in a competitive and volatile business environment, particularly with the advent of distributed network systems and the growing relevance of inter-network communications. Our objective in this paper is to develop a comprehensive framework to facilitate interoperability among distributed and heterogeneous information systems and to develop prototype software to validate the application of XML in interoperability of information systems and databases.

    Heterogeneous Relational Databases for a Grid-enabled Analysis Environment

    Grid based systems require a database access mechanism that can provide seamless homogeneous access to the requested data through a virtual data access system, i.e. a system which can take care of tracking the data that is stored in geographically distributed heterogeneous databases. This system should provide an integrated view of the data that is stored in the different repositories by using a virtual data access mechanism, i.e. a mechanism which can hide the heterogeneity of the backend databases from the client applications. This paper focuses on accessing data stored in disparate relational databases through a web service interface, and exploits the features of a Data Warehouse and Data Marts. We present a middleware that enables applications to access data stored in geographically distributed relational databases without being aware of their physical locations and underlying schema. A web service interface is provided to enable applications to access this middleware in a language and platform independent way. A prototype implementation was created based on Clarens [4], Unity [7] and POOL [8]. This ability to access the data stored in the distributed relational databases transparently is likely to be a very powerful one for Grid users, especially the scientific community wishing to collate and analyze data distributed over the Grid
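The virtual-data-access idea (a catalog that tracks where each dataset lives and dispatches queries so clients never see the backend) can be sketched with two in-memory SQLite databases standing in for geographically distributed RDBMSs. The class and method names below are invented, not the Clarens/Unity/POOL APIs.

```python
import sqlite3

class VirtualCatalog:
    """Toy virtual data access layer: clients name a logical table;
    the catalog knows which backend holds it and routes the query."""
    def __init__(self):
        self.backends = {}
        self.location = {}  # logical table name -> backend key

    def register(self, key, conn, tables):
        self.backends[key] = conn
        for table in tables:
            self.location[table] = key

    def query(self, table, sql):
        # Clients never learn which backend answered
        return self.backends[self.location[table]].execute(sql).fetchall()

# Two "sites", each holding a different table
site_a = sqlite3.connect(":memory:")
site_a.execute("CREATE TABLE runs (id INTEGER)")
site_a.execute("INSERT INTO runs VALUES (1)")

site_b = sqlite3.connect(":memory:")
site_b.execute("CREATE TABLE events (id INTEGER)")
site_b.execute("INSERT INTO events VALUES (2)")

catalog = VirtualCatalog()
catalog.register("site_a", site_a, ["runs"])
catalog.register("site_b", site_b, ["events"])
```

In the paper's architecture this dispatch sits behind a web service interface and must additionally reconcile differing schemas; the sketch only shows the location-transparency half of the problem.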