684 research outputs found

    OpenAIRE's DOIBoost - Boosting CrossRef for Research

    Get PDF
    Research in information science and scholarly communication strongly relies on the availability of openly accessible datasets of scholarly entities metadata and, where possible, their relative payloads. Since such metadata information is scattered across diverse, freely accessible, online resources (e.g. CrossRef, ORCID), researchers in this domain are doomed to struggle with (meta)data integration problems, in order to produce custom datasets of often undocumented and rather obscure provenance. This practice leads to waste of time, duplication of efforts, and typically infringes open science best practices of transparency and reproducibility of science. In this article, we describe how to generate DOIBoost, a metadata collection that enriches CrossRef with inputs from Microsoft Academic Graph, ORCID, and Unpaywall for the purpose of supporting high-quality and robust research experiments, saving times to researchers and enabling their comparison. To this aim, we describe the dataset value and its schema, analyse its actual content, and share the software Toolkit and experimental workflow required to reproduce it. The DOIBoost dataset and Software Toolkit are made openly available via Zenodo.org. DOIBoost will become an input source to the OpenAIRE information graph

    Scientific Computing Meets Big Data Technology: An Astronomy Use Case

    Full text link
    Scientific analyses commonly compose multiple single-process programs into a dataflow. An end-to-end dataflow of single-process programs is known as a many-task application. Typically, tools from the HPC software stack are used to parallelize these analyses. In this work, we investigate an alternate approach that uses Apache Spark -- a modern big data platform -- to parallelize many-task applications. We present Kira, a flexible and distributed astronomy image processing toolkit using Apache Spark. We then use the Kira toolkit to implement a Source Extractor application for astronomy images, called Kira SE. With Kira SE as the use case, we study the programming flexibility, dataflow richness, scheduling capacity and performance of Apache Spark running on the EC2 cloud. By exploiting data locality, Kira SE achieves a 2.5x speedup over an equivalent C program when analyzing a 1TB dataset using 512 cores on the Amazon EC2 cloud. Furthermore, we show that by leveraging software originally designed for big data infrastructure, Kira SE achieves competitive performance to the C implementation running on the NERSC Edison supercomputer. Our experience with Kira indicates that emerging Big Data platforms such as Apache Spark are a performant alternative for many-task scientific applications

    Machine Learning on Large Databases: Transforming Hidden Markov Models to SQL Statements

    Get PDF
    Machine Learning is a research field with substantial relevance for many applications in different areas. Because of technical improvements in sensor technology, its value for real life applications has even increased within the last years. Nowadays, it is possible to gather massive amounts of data at any time with comparatively little costs. While this availability of data could be used to develop complex models, its implementation is often narrowed because of limitations in computing power. In order to overcome performance problems, developers have several options, such as improving their hardware, optimizing their code, or use parallelization techniques like the MapReduce framework. Anyhow, these options might be too cost intensive, not suitable, or even too time expensive to learn and realize. Following the premise that developers usually are not SQL experts we would like to discuss another approach in this paper: using transparent database support for Big Data Analytics. Our aim is to automatically transform Machine Learning algorithms to parallel SQL database systems. In this paper, we especially show how a Hidden Markov Model, given in the analytics language R, can be transformed to a sequence of SQL statements. These SQL statements will be the basis for a (inter-operator and intra-operator) parallel execution on parallel DBMS as a second step of our research, not being part of this paper

    Scalable Data Integration for Linked Data

    Get PDF
    Linked Data describes an extensive set of structured but heterogeneous datasources where entities are connected by formal semantic descriptions. In thevision of the Semantic Web, these semantic links are extended towards theWorld Wide Web to provide as much machine-readable data as possible forsearch queries. The resulting connections allow an automatic evaluation to findnew insights into the data. Identifying these semantic connections betweentwo data sources with automatic approaches is called link discovery. We derivecommon requirements and a generic link discovery workflow based on similaritiesbetween entity properties and associated properties of ontology concepts. Mostof the existing link discovery approaches disregard the fact that in times ofBig Data, an increasing volume of data sources poses new demands on linkdiscovery. In particular, the problem of complex and time-consuming linkdetermination escalates with an increasing number of intersecting data sources.To overcome the restriction of pairwise linking of entities, holistic clusteringapproaches are needed to link equivalent entities of multiple data sources toconstruct integrated knowledge bases. In this context, the focus on efficiencyand scalability is essential. For example, reusing existing links or backgroundinformation can help to avoid redundant calculations. However, when dealingwith multiple data sources, additional data quality problems must also be dealtwith. This dissertation addresses these comprehensive challenges by designingholistic linking and clustering approaches that enable reuse of existing links.Unlike previous systems, we execute the complete data integration workflowvia a distributed processing system. At first, the LinkLion portal will beintroduced to provide existing links for new applications. These links act asa basis for a physical data integration process to create a unified representationfor equivalent entities from many data sources. We then propose a holisticclustering approach to form consolidated clusters for same real-world entitiesfrom many different sources. At the same time, we exploit the semantic typeof entities to improve the quality of the result. The process identifies errorsin existing links and can find numerous additional links. Additionally, theentity clustering has to react to the high dynamics of the data. In particular,this requires scalable approaches for continuously growing data sources withmany entities as well as additional new sources. Previous entity clusteringapproaches are mostly static, focusing on the one-time linking and clustering ofentities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that supports the continuous addition of newentities and data sources. To cope with the ever-increasing number of LinkedData sources, efficient and scalable methods based on distributed processingsystems are required. Thus we propose distributed holistic approaches to linkmany data sources based on a clustering of entities that represent the samereal-world object. The implementation is realized on Apache Flink. In contrastto previous approaches, we utilize efficiency-enhancing optimizations for bothdistributed static and dynamic clustering. An extensive comparative evaluationof the proposed approaches with various distributed clustering strategies showshigh effectiveness for datasets from multiple domains as well as scalability on amulti-machine Apache Flink cluster
    • …
    corecore