Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies into an optimized
solution for a specific real-world problem, big data systems are no exception.
As far as the storage aspect of a big data system is concerned, the primary
facet is the storage infrastructure, and NoSQL seems to be the right technology
to fulfill its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents a
feature and use case analysis and comparison of the four main data models,
namely document-oriented, key-value, graph, and wide-column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings the second facet of big data storage, big data file
formats, into the picture. The second half of the paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage, and their challenges and future prospects are
also discussed.
requirements and use cases
In this report, we introduce our initial vision of the Corporate Semantic Web
as the next step in the broad field of Semantic Web research. We identify
requirements of the corporate environment and gaps between current approaches
to tackle problems facing ontology engineering, semantic collaboration, and
semantic search. Each of these pillars will yield innovative methods and tools
during the project runtime until 2013. Corporate ontology engineering will
improve the facilitation of agile ontology engineering to lessen the costs of
ontology development and, especially, maintenance. Corporate semantic
collaboration focuses on the human-centered aspects of knowledge management in
corporate contexts. Corporate semantic search sits at the highest
application level of the three research areas and, as such, is
representative of applications working on and with the appropriately
represented and delivered background knowledge. We propose an initial layout
for an integrative architecture of a Corporate Semantic Web provided by these
three core pillars
Web technologies for environmental big data
Recent evolutions in computing science and web technology provide the environmental community with continuously expanding resources for data collection and analysis that pose unprecedented challenges to the design of analysis methods, workflows, and interaction with data sets. In the light of the recent UK Research Council funded Environmental Virtual Observatory pilot project, this paper gives an overview of currently available implementations related to web-based technologies for processing large and heterogeneous datasets and discusses their relevance within the context of environmental data processing, simulation and prediction. We found that the processing of the simple datasets used in the pilot proved to be relatively straightforward using a combination of R, RPy2, PyWPS and PostgreSQL. However, the use of NoSQL databases and more versatile frameworks, such as OGC standard-based implementations, may provide a wider and more flexible set of features that particularly facilitate working with larger volumes and more heterogeneous data sources.
The National COVID Cohort Collaborative (N3C): Rationale, design, infrastructure, and deployment
OBJECTIVE: Coronavirus disease 2019 (COVID-19) poses societal challenges that require expeditious data and knowledge sharing. Though organizational clinical data are abundant, these are largely inaccessible to outside researchers. Statistical, machine learning, and causal analyses are most successful with large-scale data beyond what is available in any given organization. Here, we introduce the National COVID Cohort Collaborative (N3C), an open science community focused on analyzing patient-level data from many centers.
MATERIALS AND METHODS: The Clinical and Translational Science Award Program and scientific community created N3C to overcome technical, regulatory, policy, and governance barriers to sharing and harmonizing individual-level clinical data. We developed solutions to extract, aggregate, and harmonize data across organizations and data models, and created a secure data enclave to enable efficient, transparent, and reproducible collaborative analytics.
RESULTS: Organized in inclusive workstreams, we created legal agreements and governance for organizations and researchers; data extraction scripts to identify and ingest positive, negative, and possible COVID-19 cases; a data quality assurance and harmonization pipeline to create a single harmonized dataset; population of the secure data enclave with data, machine learning, and statistical analytics tools; dissemination mechanisms; and a synthetic data pilot to democratize data access.
CONCLUSIONS: The N3C has demonstrated that a multisite collaborative learning health network can overcome barriers to rapidly build a scalable infrastructure incorporating multiorganizational clinical data for COVID-19 analytics. We expect this effort to save lives by enabling rapid collaboration among clinicians, researchers, and data scientists to identify treatments and specialized care and thereby reduce the immediate and long-term impacts of COVID-19
Towards a unified methodology for supporting the integration of data sources for use in web applications
Organisations are making increasing use of web applications and web-based systems as an integral part of providing services. Examples include personalised dynamic user content on a website, social media plug-ins or web-based mapping tools. For these types of applications to be fully functional and of maximum use to the user, they require the integration of data from multiple sources. This thesis focuses on improving this integration process for web applications with multiple sources of data.
Integration of data from multiple sources is problematic for many reasons. Current integration methods tend to be domain specific and application specific. They are often complex, have compatibility issues with different technologies, lack maturity, are difficult to re-use, and do not accommodate new and emerging models and integration technologies. Technologies to achieve integration, such as brokers and translators, do exist, but because of their domain specificity they cannot be used as a generic solution for achieving the integration outcomes required for successful web application development. Because of these difficulties with integration and the wide variety of integration approaches, there is a need to provide assistance to the developer in selecting the integration approach most appropriate to their needs.
This thesis proposes GIWeb, a unified top-down data integration methodology instantiated with a framework that will aid developers in their integration process. It will act as a conceptual structure to support the chosen technical approach. The framework will assist in the integration of data sources to support web application builders. The thesis presents the rationale for the need for the framework based on an examination of the range of applications, associated data sources and the range of potential solutions. The framework is evaluated using four case studies