163 research outputs found

    Large Scale Data Mining to Improve Usability of Data: An Intelligent Archive Testbed

    Research in certain scientific disciplines, including Earth science, particle physics, and astrophysics, continually faces the challenge that the volume of data needed to perform valid scientific research can at times overwhelm even a sizable research community. The desire to improve utilization of this data gave rise to the Intelligent Archives project, which seeks to make data archives active participants in a knowledge-building system capable of discovering events or patterns that represent new information or knowledge. Data mining can automatically discover patterns and events, but it is generally viewed as unsuited for large-scale use in disciplines like Earth science that routinely involve very high data volumes. Dozens of research projects have shown promising uses of data mining in Earth science, but all of these are based on experiments with data subsets of a few gigabytes or less, rather than the terabytes or petabytes typically encountered in operational systems. To bridge this gap, the Intelligent Archives project is establishing a testbed with the goal of demonstrating the use of data mining techniques in an operationally relevant environment. This paper discusses the goals of the testbed and the design choices surrounding critical issues that arose during testbed implementation.

    Active Data: A Data-Centric Approach to Data Life-Cycle Management

    Data-intensive science offers new opportunities for innovation and discoveries, provided that large datasets can be handled efficiently. Data management for data-intensive science applications is challenging: it requires support for complex data life cycles, coordination across multiple sites, fault tolerance, and scalability to tens of sites and petabytes of data. In this paper, we argue that data management for data-intensive science applications requires a fundamentally different approach from the current ad hoc, task-centric one. We propose Active Data, a novel paradigm for data life cycle management. Active Data follows two principles: it is data-centric and event-driven. We report on the Active Data programming model and its preliminary implementation, and discuss the benefits and limitations of the approach on recognized challenging data-intensive science use cases.

    A Review on the Role of Nano-Communication in Future Healthcare Systems: A Big Data Analytics Perspective

    This paper presents a first-time review of the open literature focused on the significance of big data generated within nano-sensors and nano-communication networks intended for future healthcare and biomedical applications. It is aimed towards the development of modern smart healthcare systems enabled with P4 (predictive, preventive, personalized, and participatory) capabilities to perform diagnostics, monitoring, and treatment. The analytical capabilities that can be produced from the substantial amount of data gathered in such networks will aid in exploiting the practical intelligence and learning capabilities that could be further integrated with conventional medical and health data, leading to more efficient decision making. We also propose a big data analytics framework for gathering, from healthcare big data, the intelligence required by future smart healthcare to address relevant problems and exploit possible opportunities in future applications. Finally, the open challenges and future directions for researchers in the evolving healthcare domain are presented.

    GRAPH BASED WORD SENSE DISAMBIGUATION FOR CLINICAL ABBREVIATIONS USING APACHE SPARK

    Identification of the correct sense for an ambiguous word is one of the major challenges for language processing in all domains. Word Sense Disambiguation is the task of identifying the correct sense of an ambiguous word by referencing the surrounding context of the word. Like narrative documents, clinical documents suffer from ambiguity issues that hinder automatic extraction of the correct sense from the document. In this project, we propose a graph-based solution, based on an algorithm originally implemented by Osmar R. Zaïane et al., for word sense disambiguation specifically focusing on clinical text. The algorithm uses the UMLS Metathesaurus as its source of knowledge. As an enhancement to the existing implementation of the algorithm, this project uses Apache Spark, a big data technology, for cluster-based distributed processing and performance optimization.

    Big data analytics for large-scale wireless networks: Challenges and opportunities

    © 2019 Association for Computing Machinery. The wide proliferation of various wireless communication systems and wireless devices has led to the arrival of the big data era in large-scale wireless networks. Big data in large-scale wireless networks has the key features of wide variety, high volume, real-time velocity, and huge value, leading to unique research challenges that differ from those of existing computing systems. In this article, we present a survey of the state-of-the-art big data analytics (BDA) approaches for large-scale wireless networks. In particular, we categorize the life cycle of BDA into four consecutive stages: Data Acquisition, Data Preprocessing, Data Storage, and Data Analytics. We then present a detailed survey of the technical solutions to the challenges in BDA for large-scale wireless networks according to each stage in the life cycle of BDA. Moreover, we discuss the open research issues and outline the future directions in this promising area.

    A machine learning approach to the detection of ghosting and scattered light artifacts in dark energy survey images

    Astronomical images are often plagued by unwanted artifacts that arise from a number of sources including imperfect optics, faulty image sensors, cosmic ray hits, and even airplanes and artificial satellites. Spurious reflections (known as “ghosts”) and the scattering of light off the surfaces of a camera and/or telescope are particularly difficult to avoid. Detecting ghosts and scattered light efficiently in large cosmological surveys that will acquire petabytes of data can be a daunting task. In this paper, we use data from the Dark Energy Survey to develop, train, and validate a machine learning model to detect ghosts and scattered light using convolutional neural networks. The model architecture and training procedure are discussed in detail, and the performance on the training and validation sets is presented. Testing is performed on data and results are compared with those from a ray-tracing algorithm. As a proof of principle, we have shown that our method is promising for the Rubin Observatory and beyond.