Scaling Up Concurrent Analytical Workloads on Multi-Core Servers
Today, an ever-increasing number of researchers, businesses, and data scientists collect and analyze massive amounts of data in database systems. The database system needs to process the resulting highly concurrent analytical workloads by exploiting modern multi-socket multi-core processor systems with non-uniform memory access (NUMA) architectures and increasing memory sizes. Conventional execution engines, however, are not designed for many cores, and neither scale nor perform efficiently on modern multi-core NUMA architectures. Firstly, their query-centric approach, where each query is optimized and evaluated independently, can result in unnecessary contention for hardware resources due to redundant work found across queries in highly concurrent workloads. Secondly, they are unaware of the non-uniform memory access costs and the underlying hardware topology, incurring unnecessarily expensive memory accesses and bandwidth saturation. In this thesis, we show how these scalability and performance impediments can be solved by exploiting sharing among concurrent queries and incorporating NUMA-aware adaptive task scheduling and data placement strategies in the execution engine. Regarding sharing, we identify and categorize state-of-the-art techniques for sharing data and work across concurrent queries at run-time into two categories: reactive sharing, which shares intermediate results across common query sub-plans, and proactive sharing, which builds a global query plan with shared operators to evaluate queries. We integrate the original research prototypes that introduce reactive and proactive sharing, perform a sensitivity analysis, and show how and when each technique benefits performance. Our most significant finding is that reactive and proactive sharing can be combined to exploit the advantages of both sharing techniques for highly concurrent analytical workloads. 
Regarding NUMA-awareness, we identify, implement, and compare various combinations of task scheduling and data placement strategies under a diverse set of highly concurrent analytical workloads. We develop a prototype based on a commercial main-memory column-store database system. Our most significant finding is that there is no single strategy for task scheduling and data placement that is best for all workloads. Specifically, inter-socket stealing of memory-intensive tasks can hurt overall performance, and unnecessary partitioning of data across sockets incurs overhead. For this reason, we implement algorithms that adapt task scheduling and data placement to the workload at run-time. Our experiments show that both sharing and NUMA-awareness can significantly improve the performance and scalability of highly concurrent analytical workloads on modern multi-core servers. Thus, we argue that sharing and NUMA-awareness are key factors for supporting faster processing of big data analytical applications, fully exploiting the hardware resources of modern multi-core servers, and providing a more responsive user experience.
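The adaptive stealing policy described in the abstract can be illustrated with a minimal sketch: per-socket run queues where an idle socket may steal CPU-bound tasks from remote queues but never memory-intensive ones, since stealing those would separate the task from its NUMA-local data. This is a simplified Python illustration, not the thesis prototype (which is built into a commercial column-store); all class and method names here are our own.

```python
from collections import deque

class Task:
    def __init__(self, name, memory_intensive=False):
        self.name = name
        self.memory_intensive = memory_intensive

class NUMAScheduler:
    """Per-socket run queues with a stealing policy that keeps
    memory-intensive tasks on their home socket."""
    def __init__(self, num_sockets):
        self.queues = [deque() for _ in range(num_sockets)]

    def submit(self, task, socket):
        self.queues[socket].append(task)

    def next_task(self, socket):
        # Prefer local work.
        if self.queues[socket]:
            return self.queues[socket].popleft()
        # Steal only CPU-bound tasks from remote sockets: stealing a
        # memory-intensive task would move it away from its data.
        for q in self.queues:
            for t in list(q):
                if not t.memory_intensive:
                    q.remove(t)
                    return t
        return None

# Hypothetical workload: one memory-bound scan, one CPU-bound aggregation.
sched = NUMAScheduler(num_sockets=2)
scan = Task("table-scan", memory_intensive=True)
agg = Task("aggregation")
sched.submit(scan, 0)
sched.submit(agg, 0)
stolen = sched.next_task(1)  # idle socket 1 steals only the CPU-bound task
```

Here socket 1 steals the aggregation but leaves the scan queued on socket 0, mirroring the finding that inter-socket stealing of memory-intensive tasks hurts performance.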
On the K-Mer Frequency Spectra of Organism Genome and Proteome Sequences with a Preliminary Machine Learning Assessment of Prime Predictability
A regular expression and region-specific filtering system for biological records at the National Center for Biotechnology Information database is integrated into an object-oriented sequence counting application, and a statistical software suite is designed and deployed to interpret the resulting k-mer frequencies, with a priority focus on nullomers. The proteome k-mer frequency spectra of ten model organisms and the genome k-mer frequency spectra of two bacteria and virus strains for the coding and non-coding regions are comparatively scrutinized. We observe that the naturally-evolved (NCBI/organism) and the artificially-biased (randomly-generated) sequences exhibit a clear deviation from the artificially-unbiased (randomly-generated) histogram distributions. Furthermore, a preliminary assessment of prime predictability is conducted on chronologically ordered NCBI genome snapshots over an 18-month period using an artificial neural network; three distinct supervised machine learning algorithms are used to train and test the system on customized NCBI data sets to forecast future prime states, revealing that, to a modest degree, it is feasible to make such predictions.
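The core computation the abstract describes, counting k-mer frequencies and identifying nullomers (k-mers over the alphabet that never occur in a sequence), can be sketched as follows. This is a minimal illustration under our own naming, not the application's actual code.

```python
from collections import Counter
from itertools import product

def kmer_spectrum(seq, k, alphabet="ACGT"):
    """Count k-mer frequencies in seq and report nullomers:
    k-mers over the alphabet that never occur."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    nullomers = ["".join(p) for p in product(alphabet, repeat=k)
                 if "".join(p) not in counts]
    return counts, nullomers

# Toy example: a short nucleotide sequence with k = 2.
counts, nullomers = kmer_spectrum("ACGTACGT", 2)
```

For real genome or proteome data the alphabet and k would differ (e.g. 20 amino acids for proteomes), and the nullomer set grows exponentially in k, which is why the study treats nullomers as a priority focus.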
Impact of Query Specification Mode and Problem Complexity on Query Specification Productivity of Novice Users of Database Systems
With the increased demand for the utilization of computerized information systems by business users, the need for investigating the impact of various user interfaces has been well recognized. It is usually assumed that providing the user with assistance in the usage of a system would significantly increase the user's productivity. There is, however, a dearth of systematic inquiry into this commonly held notion to verify its validity in a scientific fashion. The purpose of this study is to investigate the impact of system-provided user assistance and complexity level of the problem on novice users' productivity in specifying database queries. The study is theoretical in the sense that it presents an approach adopted from research in deductive database systems to attack problems concerning user interface design. It is empirical in that it conducts an experiment in a controlled laboratory setting to collect primary data for the testing of a series of hypotheses. The two independent variables are system-provided user assistance and problem complexity, while the dependent variable is the user's query specification productivity. Three measures are used as separate indicators of query specification productivity: number of syntactic errors, number of semantic errors, and time required for completing a query task. Due to the lack of a well-defined metric for user assistance, the study first presents a generic classification scheme for relational query specification. Based on this classification scheme, two quantitative metrics for measuring the amount of user assistance in terms of prompts and defaults were developed. The user assistance is operationally defined with these two metrics. Four findings emerge as significant results of the study. First, user assistance has a significant main effect on all of the three dependent measures at the 1 percent significance level.
Second, problem complexity also has a significant impact on the three productivity measures at the 1 percent significance level. Third, the interaction effect of user assistance and problem complexity on the number of semantic errors and the amount of time for completion is significant at the 1 percent level. Fourth, although this interaction effect on the number of syntactic errors is not significant at the 5 percent level, it is at the 10 percent level. More research is needed to permit a thorough understanding of the issue of user interface design. A list of topics is suggested for future research to confirm or to modify the findings of this study.
Gridfields: Model-Driven Data Transformation in the Physical Sciences
Scientists' ability to generate and store simulation results is outpacing their ability to analyze them via ad hoc programs. We observe that these programs exhibit an algebraic structure that can be used to facilitate reasoning and improve performance. In this dissertation, we present a formal data model that exposes this algebraic structure, then implement the model, evaluate it, and use it to express, optimize, and reason about data transformations in a variety of scientific domains.
Simulation results are defined over a logical grid structure that allows a continuous domain to be represented discretely in the computer. Existing approaches for manipulating these gridded datasets are incomplete. The performance of SQL queries that manipulate large numeric datasets is not competitive with that of specialized tools, and the up-front effort required to deploy a relational database makes them unpopular for dynamic scientific applications. Tools for processing multidimensional arrays can only capture regular, rectilinear grids. Visualization libraries accommodate arbitrary grids, but no algebra has been developed to simplify their use and afford optimization. Further, these libraries are data dependent—physical changes to data characteristics break user programs.
We adopt the grid as a first-class citizen, separating topology from geometry and separating structure from data. Our model is agnostic with respect to dimension, uniformly capturing, for example, particle trajectories (1-D), sea-surface temperatures (2-D), and blood flow in the heart (3-D). Equipped with data, a grid becomes a gridfield. We provide operators for constructing, transforming, and aggregating gridfields that admit algebraic laws useful for optimization. We implement the model by analyzing several candidate data structures and incorporating their best features. We then show how to deploy gridfields in practice by injecting the model as middleware between heterogeneous, ad hoc file formats and a popular visualization library.
In this dissertation, we define, develop, implement, evaluate and deploy a model of gridded datasets that accommodates a variety of complex grid structures and a variety of complex data products. We evaluate the applicability and performance of the model using datasets from oceanography, seismology, and medicine and conclude that our model-driven approach offers significant advantages over the status quo.
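The central idea, a grid equipped with data that supports algebraic operators, can be sketched in a few lines. This is only an illustrative toy under our own naming (the dissertation's actual operator set and data structures are richer): a gridfield over 0-cells with a restrict (selection) and an aggregate operator.

```python
class GridField:
    """A minimal gridfield sketch: a set of cells (here, 0-cells / nodes)
    with one data value bound to each cell."""
    def __init__(self, cells, data):
        assert len(cells) == len(data)
        self.cells, self.data = list(cells), list(data)

    def restrict(self, pred):
        """Keep only cells whose bound value satisfies pred
        (analogous to relational selection, applied over a grid)."""
        kept = [(c, v) for c, v in zip(self.cells, self.data) if pred(v)]
        cells, data = zip(*kept) if kept else ([], [])
        return GridField(cells, data)

    def aggregate(self, fn):
        """Reduce the bound data to a single value."""
        return fn(self.data)

# Hypothetical sea-surface temperatures bound to a 1-D grid of 4 nodes.
sst = GridField(range(4), [14.2, 15.1, 9.8, 16.0])
warm = sst.restrict(lambda t: t > 10.0)
```

Because restrict returns another gridfield, operators compose, which is what makes algebraic laws (e.g. pushing a restrict below an aggregate) available for optimization.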
NASA Administrative Data Base Management Systems
Various issues concerning administrative data base management systems are discussed, including the procurement and operation of several systems.
Designing and developing a prototype indigenous knowledge database and devising a knowledge management framework
Thesis (M. Tech.) - Central University of Technology, Free State, 2009
The purpose of the study was to design and develop a prototype Indigenous Knowledge (IK) database that will be productive within a Knowledge Management (KM) framework specifically focused on IK. The need to develop a prototype IK database that can help standardise the work being done in the field of IK within South Africa has been established in the Indigenous Knowledge Systems (IKS) policy, which stated that “common standards would enable the integration of widely scattered and distributed references on IKS in a retrievable form. This would act as a bridge between indigenous and other knowledge systems” (IKS policy, 2004:33). In particular within the indigenous people’s organizations, holders of IK, whether individually or collectively, have a claim that their knowledge should not be exploited for elitist purposes without direct benefit to their empowerment and the improvement of their livelihoods. Establishing guidelines and a modus operandi (KM framework) are important, especially when working with communities. Researchers go into communities to gather their knowledge and never return to the communities with their results. The communities feel enraged and wronged. Creating an IK network can curb such behaviour or at least inform researchers/organisations that this behaviour is damaging. The importance of IK is that IK provides the basis for problem-solving strategies for local communities, especially the poor, which can help reduce poverty. IK is a key element of the “social capital” of the poor; their main asset to invest in the struggle for survival, to produce food, to provide shelter, or to achieve control of their own lives. It is closely intertwined with their livelihoods.
Many aspects of KM and IK were discussed and a feasibility study for a KM framework was conducted to determine if any existing KM frameworks can work in an organisation that works with IK. Other factors that can influence IK are: guidelines for implementing a KM framework, information management, quality management, human factors/capital movement, leading role players in the field of IK, Intellectual Property Rights (IPR), ethics, guidelines for doing fieldwork, and a best plan for implementation.
At this point, the focus shifts from KM and IK to the prototype IK database and its technical design, moving to more hands-on development by examining the different data models and their underlying structures. A well-designed database facilitates data management and becomes a valuable generator of information. A poorly designed database is likely to become a breeding ground for redundant data. The conceptual design stage used data modelling to create an abstract database structure that represents real-world objects in the most authentic way possible. The tools used to design the database are platform-independent software; therefore the design can be implemented on many different platforms. An elementary prototype graphical user interface was designed in order to illustrate the database’s three main functions: adding new members, adding new IK records, and searching the IK database. The IK database design took cognisance of what is currently prevailing in South Africa and the rest of the world with respect to IK and database development. The development of the database was done in such a way as to establish a standard database design for IK systems in South Africa. The goal was to design and develop a database that can be disseminated to researchers/organisations working in the field of IK so that the use of a template database can assist work in the field. Consequently the work in the field will be collected in the same way and based on the same model. At a later stage, the databases could be interlinked and South Africa can have one large knowledge repository for IK.
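The three main functions the prototype interface exposes (adding new members, adding new IK records, and searching) can be sketched against a relational backend. This is a hypothetical minimal schema and API in Python with SQLite, not the thesis's actual design; all table, column, and function names here are our own, and the sample member and record are invented for illustration.

```python
import sqlite3

def init_db(conn):
    """Create two core tables: members (knowledge holders /
    contributors) and the IK records they contribute."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS member (
            id INTEGER PRIMARY KEY,
            name TEXT NOT NULL,
            community TEXT
        );
        CREATE TABLE IF NOT EXISTS ik_record (
            id INTEGER PRIMARY KEY,
            member_id INTEGER REFERENCES member(id),
            title TEXT NOT NULL,
            description TEXT
        );
    """)

def add_member(conn, name, community):
    cur = conn.execute(
        "INSERT INTO member (name, community) VALUES (?, ?)",
        (name, community))
    return cur.lastrowid

def add_record(conn, member_id, title, description):
    cur = conn.execute(
        "INSERT INTO ik_record (member_id, title, description) "
        "VALUES (?, ?, ?)", (member_id, title, description))
    return cur.lastrowid

def search(conn, term):
    """Simple keyword search over record titles and descriptions."""
    like = f"%{term}%"
    return conn.execute(
        "SELECT title FROM ik_record WHERE title LIKE ? OR description LIKE ?",
        (like, like)).fetchall()

# Illustrative usage with invented data.
conn = sqlite3.connect(":memory:")
init_db(conn)
holder = add_member(conn, "N. Mokoena", "Free State")
add_record(conn, holder, "Medicinal plants", "Traditional uses of aloe")
hits = search(conn, "aloe")
```

Linking every record to a member supports the study's concern that knowledge holders remain identifiable and credited, and a shared template schema like this is what would let separately collected databases be interlinked later.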