4,565 research outputs found
The OTree: multidimensional indexing with efficient data sampling for HPC
Spatial big data is considered an essential trend in future scientific and business applications. Indeed, research instruments, medical devices, and social networks generate hundreds of petabytes of spatial data per year. However, many authors have pointed out that the lack of specialized frameworks for multidimensional Big Data is limiting possible applications and precluding many scientific breakthroughs. Paramount in achieving High-Performance Data Analytics is to optimize and reduce the I/O operations required to analyze large data sets. To do so, we need to organize and index the data according to its multidimensional attributes. At the same time, to enable fast and interactive exploratory analysis, it is vital to generate approximate representations of large datasets efficiently. In this paper, we propose the Outlook Tree (or OTree), a novel Multidimensional Indexing with efficient data Sampling (MIS) algorithm. The OTree enables exploratory analysis of large multidimensional datasets with arbitrary precision, a vital missing feature in current distributed data management solutions. Our algorithm reduces the indexing overhead and achieves high performance even for write-intensive HPC applications. Indeed, we use the OTree to store the scientific results of a study on the efficiency of drug inhalers. Then we compare the OTree implementation on Apache Cassandra, named Qbeast, with PostgreSQL and plain storage. Lastly, we demonstrate that our proposal delivers better performance and scalability.Peer ReviewedPostprint (author's final draft
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies for optimized
solution to a specific real world problem, big data system are not an exception
to any such rule. As far as the storage aspect of any big data system is
concerned, the primary facet in this regard is a storage infrastructure and
NoSQL seems to be the right technology that fulfills its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents
feature and use case analysis and comparison of the four main data models
namely document oriented, key value, graph and wide column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings forth second facet of big data storage, big data file
formats, into picture. The second half of the research paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage and its challenges and future prospects have
also been discussed
Hypermedia-based discovery for source selection using low-cost linked data interfaces
Evaluating federated Linked Data queries requires consulting multiple sources on the Web. Before a client can execute queries, it must discover data sources, and determine which ones are relevant. Federated query execution research focuses on the actual execution, while data source discovery is often marginally discussed-even though it has a strong impact on selecting sources that contribute to the query results. Therefore, the authors introduce a discovery approach for Linked Data interfaces based on hypermedia links and controls, and apply it to federated query execution with Triple Pattern Fragments. In addition, the authors identify quantitative metrics to evaluate this discovery approach. This article describes generic evaluation measures and results for their concrete approach. With low-cost data summaries as seed, interfaces to eight large real-world datasets can discover each other within 7 minutes. Hypermedia-based client-side querying shows a promising gain of up to 50% in execution time, but demands algorithms that visit a higher number of interfaces to improve result completeness
Sensor Search Techniques for Sensing as a Service Architecture for The Internet of Things
The Internet of Things (IoT) is part of the Internet of the future and will
comprise billions of intelligent communicating "things" or Internet Connected
Objects (ICO) which will have sensing, actuating, and data processing
capabilities. Each ICO will have one or more embedded sensors that will capture
potentially enormous amounts of data. The sensors and related data streams can
be clustered physically or virtually, which raises the challenge of searching
and selecting the right sensors for a query in an efficient and effective way.
This paper proposes a context-aware sensor search, selection and ranking model,
called CASSARAM, to address the challenge of efficiently selecting a subset of
relevant sensors out of a large set of sensors with similar functionality and
capabilities. CASSARAM takes into account user preferences and considers a
broad range of sensor characteristics, such as reliability, accuracy, location,
battery life, and many more. The paper highlights the importance of sensor
search, selection and ranking for the IoT, identifies important characteristics
of both sensors and data capture processes, and discusses how semantic and
quantitative reasoning can be combined together. This work also addresses
challenges such as efficient distributed sensor search and
relational-expression based filtering. CASSARAM testing and performance
evaluation results are presented and discussed.Comment: IEEE sensors Journal, 2013. arXiv admin note: text overlap with
arXiv:1303.244
When Things Matter: A Data-Centric View of the Internet of Things
With the recent advances in radio-frequency identification (RFID), low-cost
wireless sensor devices, and Web technologies, the Internet of Things (IoT)
approach has gained momentum in connecting everyday objects to the Internet and
facilitating machine-to-human and machine-to-machine communication with the
physical world. While IoT offers the capability to connect and integrate both
digital and physical entities, enabling a whole new class of applications and
services, several significant challenges need to be addressed before these
applications and services can be fully realized. A fundamental challenge
centers around managing IoT data, typically produced in dynamic and volatile
environments, which is not only extremely large in scale and volume, but also
noisy, and continuous. This article surveys the main techniques and
state-of-the-art research efforts in IoT from data-centric perspectives,
including data stream processing, data storage models, complex event
processing, and searching in IoT. Open research issues for IoT data management
are also discussed
Exploring user and system requirements of linked data visualization through a visual dashboard approach
One of the open problems in SemanticWeb research is which tools should be provided to users to explore linked data. This is even more urgent now that massive amount of linked data is being released by governments worldwide. The development of single dedicated visualization applications is increasing, but the problem of exploring unknown linked data to gain a good understanding of what is contained is still open. An effective generic solution must take into account the user’s point of view, their tasks and interaction, as well as the system’s capabilities and the technical constraints the technology imposes. This paper is a first step in understanding the implications of both, user and system by evaluating our dashboard-based approach. Though we observe a high user acceptance of the dashboard approach, our paper also highlights technical challenges arising out of complexities involving current infrastructure that need to be addressed while visualising linked data. In light of the findings, guidelines for the development of linked data visualization (and manipulation) are provided
Issues in digital preservation: towards a new research agenda
Digital Preservation has evolved into a specialized, interdisciplinary research discipline of its own, seeing significant increases in terms of research capacity, results, but also challenges. However, with this specialization and subsequent formation of a dedicated subgroup of researchers active in this field, limitations of the challenges addressed can be observed. Digital preservation research may seem to react to problems arising, fixing problems that exist now, rather than proactively researching new solutions that may be applicable only after a few years of maturing. Recognising the benefits of bringing together researchers and practitioners with various professional backgrounds related to digital preservation, a seminar was organized in Schloss Dagstuhl, at the Leibniz Center for Informatics (18-23 July 2010), with the aim of addressing the current digital preservation challenges, with a specific focus on the automation aspects in this field. The main goal of the seminar was to outline some research challenges in digital preservation, providing a number of "research questions" that could be immediately tackled, e.g. in Doctoral Thesis. The seminar intended also to highlight the need for the digital preservation community to reach out to IT research and other research communities outside the immediate digital preservation domain, in order to jointly develop solutions
Old Techniques for New Join Algorithms: A Case Study in RDF Processing
Recently there has been significant interest around designing specialized RDF
engines, as traditional query processing mechanisms incur orders of magnitude
performance gaps on many RDF workloads. At the same time researchers have
released new worst-case optimal join algorithms which can be asymptotically
better than the join algorithms in traditional engines. In this paper we apply
worst-case optimal join algorithms to a standard RDF workload, the LUBM
benchmark, for the first time. We do so using two worst-case optimal engines:
(1) LogicBlox, a commercial database engine, and (2) EmptyHeaded, our prototype
research engine with enhanced worst-case optimal join algorithms. We show that
without any added optimizations both LogicBlox and EmptyHeaded outperform two
state-of-the-art specialized RDF engines, RDF-3X and TripleBit, by up to 6x on
cyclic join queries-the queries where traditional optimizers are suboptimal. On
the remaining, less complex queries in the LUBM benchmark, we show that three
classic query optimization techniques enable EmptyHeaded to compete with RDF
engines, even when there is no asymptotic advantage to the worst-case optimal
approach. We validate that our design has merit as EmptyHeaded outperforms
MonetDB by three orders of magnitude and LogicBlox by two orders of magnitude,
while remaining within an order of magnitude of RDF-3X and TripleBit
- …