Big Data solutions for law enforcement
Big Data, the data too large and complex for most current information infrastructure to store and analyze, has changed every sector in government and industry.
Today’s sensors and devices produce an overwhelming amount of information that is often unstructured, and the solutions developed to handle Big Data now allow us to track more information and run more complex analytics, gaining a level of insight once thought impossible.
The dominant Big Data solution is the Apache Hadoop ecosystem which provides an open source platform for reliable, scalable, distributed computing on commodity hardware. Hadoop has exploded in the private sector and is the back end to many of the leading Web 2.0 companies and services. Hadoop also has a growing footprint in government, with numerous Hadoop clusters run by the Departments of Defense and Energy, as well as smaller deployments by other agencies.
One sector currently exploring Hadoop is law enforcement. Big Data analysis has already been highly effective in law enforcement and can make police departments more effective, accountable, efficient, and proactive. As Hadoop continues to spread through law enforcement agencies, it has the potential to permanently change the way policing is practiced and administered.
Building data warehouses in the era of big data: an approach for scalable and flexible big data warehouses
During the last few years, the concept of Big Data Warehousing has gained significant attention from the scientific community, highlighting the need to redesign the traditional Data Warehouse (DW), due to its limitations, in order to achieve characteristics relevant in Big Data contexts (e.g., scalability on commodity hardware, real-time performance, and flexible storage). The state of the art in Big Data Warehousing reflects the young age of the concept, as well as ambiguity and the lack of common approaches for building Big Data Warehouses (BDWs). Consequently, an approach for designing and implementing these complex systems is of major relevance to business analytics researchers and practitioners. This tutorial targets the design and implementation of BDWs, presenting a general approach that researchers and practitioners can follow in their Big Data Warehousing projects, and exploring several demonstration cases focusing on system design and data modelling in areas like smart cities, retail, finance, and manufacturing, among others.
High-level visualization over big linked data
The Linked Open Data (LOD) Cloud is continuously expanding, and the number of complex and large sources is rising. Understanding an unknown source at a glance is a critical task for LOD users, but it can be facilitated by visualization or exploration tools. H-BOLD (High-level visualization over Big Open Linked Data) is a tool that allows users with no a priori knowledge of the domain and no SPARQL skills to start navigating and exploring Big Linked Data. Users can start from a high-level visualization and then focus on an element of interest to incrementally explore the source, as well as perform visual queries on certain classes of interest. At the moment, 32 Big Linked Data sources (with more than 500,000 triples) exposing a SPARQL endpoint can be explored using H-BOLD.
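The class-level overview that such tools start from can be sketched in plain Python: given a set of RDF-style triples, count the instances of each class, which is the kind of aggregate a high-level visualization is built on. The triples and class names below are invented for illustration; this is not H-BOLD's actual implementation.

```python
from collections import Counter

RDF_TYPE = "rdf:type"

# Toy triples (subject, predicate, object); invented for illustration.
triples = [
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:acme", RDF_TYPE, "ex:Organization"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
]

def class_overview(triples):
    """Count instances per class: the aggregate behind a high-level view."""
    return Counter(obj for s, p, obj in triples if p == RDF_TYPE)

overview = class_overview(triples)  # e.g. ex:Person -> 2
```

In a real deployment the same aggregate would be obtained with a SPARQL `GROUP BY` over `rdf:type` against the source's endpoint rather than over in-memory triples.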
TOOL FOR INTERACTIVE VISUAL ANALYSIS OF LARGE HIERARCHICAL DATA STRUCTURES
In the Big Data era, data visualization and exploration systems, as means for data perception and manipulation, are facing major challenges. One of the challenges for modern visualization systems is to ensure adequate visual presentation and interaction. Within this paper, we therefore present a tool for interactive visualization of data with a hierarchical structure. It is a general-purpose tool that uses a graph-based approach, but its main focus is the visual analysis of concept lattices generated as the output of the Formal Concept Analysis algorithm. As the data grow, a concept lattice can become complex and hard to visualize and analyse. To address this issue, the tool provides functionalities important for the exploration of large concept lattices. Its usage is demonstrated by visualizing concept lattices generated from data available on Canada's open data portal, where it can be used to explore the usage of tags within datasets.
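A minimal sketch of what Formal Concept Analysis produces: from a binary object-attribute context, every formal concept is a pair (extent, intent) in which each half determines the other. The toy context below is invented; real FCA tools use far more efficient algorithms than this brute-force enumeration.

```python
from itertools import combinations

# Toy formal context: each object maps to the attributes it has (invented).
context = {
    "duck": {"flies", "swims"},
    "swan": {"flies", "swims"},
    "dog": {"runs"},
}
objects = set(context)
attributes = set().union(*context.values())

def common_attrs(objs):
    """Attributes shared by all given objects (all attributes if none)."""
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

def common_objs(attrs):
    """Objects possessing every attribute in attrs."""
    return {o for o in objects if attrs <= context[o]}

def concepts():
    """Brute-force all formal concepts (extent, intent) of the context."""
    found = set()
    for r in range(len(objects) + 1):
        for objs in combinations(sorted(objects), r):
            intent = common_attrs(objs)          # close upward to attributes
            extent = common_objs(intent)         # close back down to objects
            found.add((frozenset(extent), frozenset(intent)))
    return found

lattice = concepts()  # 4 concepts for this context, ordered by extent inclusion
```

The partial order by extent inclusion over these pairs is exactly the concept lattice that the visualization tool renders as a graph.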
Developing Bottom-Up, Integrated Omics Methodologies for Big Data Biomarker Discovery
Indiana University-Purdue University Indianapolis (IUPUI)
The availability of highly-distributed computing complements the proliferation of next-generation sequencing (NGS) and genome-wide association study (GWAS) datasets. These datasets are often complex, poorly annotated, or require complex domain knowledge to manage sensibly. They provide a rare, multi-dimensional omics (proteomics, transcriptomics, and genomics) view of a single sample or patient.
Previously, biologists assumed a strict adherence to the central dogma: replication, transcription, and translation. Recent studies in genomics and proteomics emphasize that this is not the case. We must employ big-data methodologies not only to understand the biogenesis of these molecules, but also their disruption in disease states. The Cancer Genome Atlas (TCGA) provides high-dimensional patient data and illustrates the trends that occur in expression profiles and their alteration in many complex disease states.
I will ultimately create a bottom-up multi-omics approach to observe biological systems using big data techniques. I hypothesize that big data and systems biology approaches can be applied to public datasets to identify important subsets of genes in cancer phenotypes. By exploring these signatures, we can better understand the role of amplification and transcript alterations in cancer.
The Frictionless Data Package: data containerization for addressing big data challenges [poster]
Presented at AGU Ocean Sciences, 11-16 February 2018, Portland, OR.
At the Biological and Chemical Oceanography Data Management Office (BCO-DMO), Big Data challenges have been steadily increasing. The sizes of data submissions have grown as instrumentation improves, and complex data types can sometimes be stored across different repositories. This signals a paradigm shift: data and information that are meant to be tightly coupled, and have traditionally been stored under the same roof, are now distributed across repositories and data stores. For domain-specific repositories like BCO-DMO, a new mechanism for assembling data, metadata, and supporting documentation is needed.
Traditionally, data repositories have relied on a human's involvement throughout discovery and access workflows. This human could assess fitness for purpose by reading loosely coupled, unstructured information from web pages and documentation. Distributed storage was something that could be communicated in text that a human could read and understand. However, as machines play larger roles in the process of discovery and access of data, distributed resources must be described and packaged in ways that fit into machine automated workflows of discovery and access for assessing fitness for purpose by the end-user. Once machines have recommended a data resource as relevant to an investigator's needs, the data should be easy to integrate into that investigator's toolkits for analysis and visualization.
BCO-DMO is exploring the idea of data containerization: packaging data and related information for easier transport, interpretation, and use. Data containerization reduces the friction not only for data repositories trying to describe complex data resources, but also for end-users trying to access data with their own toolkits. In researching the landscape of data containerization, the Frictionless Data Package (http://frictionlessdata.io/) provides a number of valuable advantages over similar solutions. This presentation will focus on these advantages and on how the Frictionless Data Package addresses a number of real-world use cases for data discovery, access, analysis, and visualization in the age of Big Data.
NSF #1435578, NSF #163971
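At its core, a Data Package is a JSON descriptor (datapackage.json) that bundles data files with their schema and metadata so that machines can discover and interpret them. A minimal hand-built sketch follows; the dataset name, paths, and field names are invented, and the full descriptor format is defined by the specification at frictionlessdata.io.

```python
import json

# A minimal, hand-written datapackage.json descriptor (illustrative values).
descriptor = {
    "name": "ctd-casts-example",
    "title": "Example CTD cast data",
    "resources": [
        {
            "name": "casts",
            "path": "data/casts.csv",
            "format": "csv",
            "schema": {
                "fields": [
                    {"name": "station", "type": "string"},
                    {"name": "depth_m", "type": "number"},
                    {"name": "temperature_c", "type": "number"},
                ]
            },
        }
    ],
}

# Serialize for writing alongside the data files it describes.
datapackage_json = json.dumps(descriptor, indent=2)
```

Because the descriptor travels with (or points at) the data, a distributed set of resources can still be fetched and validated as one logical container by any tool that understands the format.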
On the enhancement of Big Data Pipelines through Data Preparation, Data Quality, and the distribution of Optimisation Problems
Nowadays, data are fundamental for companies, providing operational support by facilitating daily transactions. Data have also become the cornerstone of strategic decision-making in businesses, and numerous techniques exist to extract knowledge and value from them. For example, optimisation algorithms excel at supporting decision-making processes that improve the use of resources, time, and costs in an organisation. In the current industrial context, organisations usually rely on business processes to orchestrate their daily activities while collecting large amounts of information from heterogeneous sources. The support of Big Data technologies (which are based on distributed environments) is therefore required, given the volume, variety, and velocity of the data. To extract value from the data, a set of techniques or activities is then applied in an orderly way and at different stages. This set of techniques or activities, which facilitates the acquisition, preparation, and analysis of data, is known in the literature as a Big Data pipeline.
In this thesis, improvements to three stages of the Big Data pipeline are tackled: Data Preparation, Data Quality assessment, and Data Analysis. These improvements can be addressed from an individual perspective, focusing on each stage, or from a more complex and global perspective that implies coordinating these stages to create data workflows.
The first stage to improve is Data Preparation, by supporting the preparation of data with complex structures (i.e., data with various levels of nested structures, such as arrays). Shortcomings have been found in the literature and in current technologies for transforming complex data in a simple way. This thesis therefore aims to improve the Data Preparation stage through Domain-Specific Languages (DSLs). Specifically, two DSLs are proposed for different use cases: one is a general-purpose data transformation language, while the other is aimed at extracting event logs in a standard format for process mining algorithms.
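The kind of transformation such a DSL targets, unnesting records and exploding arrays into flat rows, can be sketched in plain Python. The record shape below is invented for illustration; the thesis's own DSLs are not reproduced here.

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted keys; explode lists into extra rows."""
    rows = [{}]
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            branches = flatten(value, prefix=name + ".")
        elif isinstance(value, list):
            # Exploding a list multiplies the rows: one branch per element.
            branches = []
            for item in value:
                if isinstance(item, dict):
                    branches.extend(flatten(item, prefix=name + "."))
                else:
                    branches.append({name: item})
        else:
            branches = [{name: value}]
        # Cross-join the branches for this key with the rows built so far.
        rows = [dict(r, **b) for r in rows for b in branches]
    return rows

order = {"id": 1, "customer": {"name": "Ada"},
         "items": [{"sku": "a"}, {"sku": "b"}]}
rows = flatten(order)  # two flat rows, one per exploded item
```

A DSL expresses the same unnesting declaratively (which paths to dot, which arrays to explode) instead of forcing users to write this traversal by hand.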
The second area for improvement is the assessment of Data Quality. Depending on the type of Data Analysis algorithm, poor-quality data can seriously skew the results; optimisation algorithms are a clear example, since insufficiently accurate or incomplete data can severely affect the search space. This thesis therefore formulates a methodology for modelling Data Quality rules adjusted to the context of use, together with a tool that automates their assessment, making it possible to discard data that do not meet the quality criteria defined by the organisation. In addition, the proposal includes a framework that helps select actions to improve the usability of the data.
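The rule-based assessment described above can be sketched as data: each quality rule pairs a name with a predicate over a record, and records violating any rule are discarded. The rules and record fields below are invented; the thesis's methodology and tooling are richer than this.

```python
# Each rule: (name, predicate over a record). Invented example rules.
rules = [
    ("id_present", lambda r: r.get("id") is not None),         # completeness
    ("age_in_range", lambda r: 0 <= r.get("age", -1) <= 120),  # accuracy
]

records = [
    {"id": 1, "age": 34},
    {"id": None, "age": 28},   # fails the completeness rule
    {"id": 3, "age": 180},     # fails the accuracy rule
]

def assess(record):
    """Return the names of the rules a record violates (empty = usable)."""
    return [name for name, check in rules if not check(record)]

usable = [r for r in records if not assess(r)]  # only the first record survives
```

Keeping rules as first-class data is what allows them to be adjusted to the context of use and their evaluation to be automated over an entire pipeline.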
The third and last proposal involves the Data Analysis stage. Here, this thesis faces the challenge of supporting optimisation problems in Big Data pipelines. There is a lack of methodological solutions for computing exhaustive optimisation problems (i.e., those that guarantee finding an optimal solution by exploring the whole search space) in distributed environments. Solving this type of problem in a Big Data context is computationally complex, and can be NP-complete, for two reasons. On the one hand, the search space can grow significantly as the amount of data processed by the optimisation algorithms increases; this challenge is addressed through a technique for generating and grouping problems with distributed data. On the other hand, processing optimisation problems with complex models and large search spaces in distributed environments is not trivial, so a proposal is presented for a particular case of this type of scenario.
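The partitioning idea behind distributing an exhaustive search can be sketched in plain Python: split the search space into independent chunks, scan each chunk separately (in a real pipeline each chunk would be a task on an engine such as Spark), and reduce the partial optima. A toy 0/1 knapsack instance, invented for illustration, makes it concrete.

```python
# Toy 0/1 knapsack instance (invented): (weight, value) pairs and a capacity.
items = [(3, 4), (4, 5), (2, 3), (5, 8)]
capacity = 7

def evaluate(mask):
    """Total value of the subset encoded by bitmask, or -1 if over capacity."""
    weight = sum(w for i, (w, v) in enumerate(items) if mask >> i & 1)
    value = sum(v for i, (w, v) in enumerate(items) if mask >> i & 1)
    return value if weight <= capacity else -1

def solve_chunk(start, stop):
    """Exhaustively scan one partition of the search space."""
    return max(range(start, stop), key=evaluate)

# Partition the 2**n candidate subsets into chunks, solve each independently,
# then reduce the partial winners, mirroring a map/reduce over workers.
n_chunks = 4
size = (2 ** len(items)) // n_chunks
partial = [solve_chunk(k * size, (k + 1) * size) for k in range(n_chunks)]
best = max(partial, key=evaluate)  # items 3 and 4: weight 7, value 11
```

Because every chunk is scanned, optimality is still guaranteed after the reduce step; the hard part in practice, which the thesis addresses, is generating and grouping the sub-problems when the data themselves are distributed.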
As a result, this thesis develops methodologies that have been published in scientific journals and conferences. The methodologies have been implemented in software tools integrated with the Apache Spark data processing engine, and the solutions have been validated through tests and use cases with real datasets.
Scholarly insight Spring 2018: a Data wrangler perspective
In the movie classic Back to the Future, a young Michael J. Fox is able to explore the past via a time machine developed by the slightly bizarre but eccentric Dr Brown. Unexpectedly, some small interventions during Fox's adventures changed the course of history a bit. In this fourth Scholarly Insight Report we have explored two innovative approaches to learning from OU data of the past, which we hope will in the future make a large difference in how we support our students and design and implement our teaching and learning practices. In Chapter 1, we provide an in-depth analysis of 50 thousand comments expressed by students through the Student Experience on a Module (SEAM) questionnaire. By analysing over 2.5 million words using big data approaches, our Scholarly insights indicate that not all student voices are heard. Furthermore, our big data analysis points to useful insights for exploring how student voices change over time, and for which particular modules emergent themes might arise.
In Chapter 2, we present our second innovative approach: a proof-of-concept of qualification pathways using graph approaches. By exploring existing data from one qualification (i.e., Psychology), we show that students make a range of pathway choices during their qualification, some of which are more successful than others. As highlighted in our previous Scholarly Insight Reports, getting data from a qualification perspective within the OU is a difficult and challenging process, and the proof-of-concept provided in Chapter 2 might offer a way forward to better understand and support the complex choices our students make.
In Chapter 3, we provide a slightly more practically-oriented and perhaps down-to-earth approach, focusing on the lessons learned with Analytics4Action. Over the last four years, nearly a hundred modules have made more active use of data and insights into module presentation to support their students. Chapter 3 describes several good practices from the LTI/TEL learning design team, as well as three innovative case studies which we hope will inspire you to try something new as well.
Working organically in various Faculty sub-group meetings and LTI units, and in a Google Doc with various key stakeholders in the Faculties, we hope that our Scholarly insights can help inform our staff, but also spark some ideas on how to further improve our module designs and qualification pathways. Of course, we are keen to hear what other topics require Scholarly insight. We hope that you see some potential in the two innovative approaches, and perhaps you might want to try some new ideas in your module. While a time machine has not really been invented yet, with the increasingly rich and fine-grained data about our students and our learning practices we are getting closer to understanding what really drives our students.