602,220 research outputs found
Recommended from our members
AGM, a dataflow database machine
In recent years, a number of database machines consisting of large numbers of parallel processing elements have been proposed. Unfortunately, one of the main limitations to parallelism in database processing is the I/O bandwidth of the underlying storage devices. One way to solve this problem is to use multiple parallel disk units. The main problem with this approach, however, is the lack of a computational model capable of utilizing the potential of any significant number of such devices.This paper presents a database model which is based on the principles of data-driven computation. According to this model, the database is represented as a network in which each node is conceptually an independent processing element, capable of communicating with other nodes by exchanging messages along the network arcs. To answer a query, one or more such messages, called tokens, are created and injected into the network. These then propagate asynchronously through the network in the search of results satisfying the given query.To investigate the performance of the proposed system, we have implemented the model on a simulated computer architecture. The results of the simulation ex-periments indicate that the model is capable of exploiting the potential I/O band-width of a large number of disk units as well as the computational power of the associated processing elements
Association Mining in Database Machine
Association rule is wildly used in most of the data mining technologies. Apriori algorithm is the fundamental association rule mining algorithm. FP-growth tree algorithm improves the performance by reduce the generation of the frequent item sets. Simplex algorithm is a advanced FP-growth algorithm by using bitmap structure with the simplex concept in geometry. The bitmap structure implementation is particular designed for storing the data in database machines to support parallel computing the association rule mining
Mining protein database using machine learning techniques
With a large amount of information relating to proteins accumulating in databases widely available online, it is of interest to apply machine learning techniques that, by extracting underlying statistical regularities in the data, make predictions about the functional and evolutionary characteristics of unseen proteins. Such predictions can help in achieving a reduction in the space over which experiment designers need to search in order to improve our understanding of the biochemical properties. Previously it has been suggested that an integration of features computable by comparing a pair of proteins can be achieved by an artificial neural network, hence predicting the degree to which they may be evolutionary related and homologous. We compiled two datasets of pairs of proteins, each pair being characterised by seven distinct features. We performed an exhaustive search through all possible combinations of features, for the problem of separating remote homologous from analogous pairs, we note that significant performance gain was obtained by the inclusion of sequence and structure information. We find that the use of a linear classifier was enough to discriminate a protein pair at the family level. However, at the superfamily level, to detect remote homologous pairs was a relatively harder problem. We find that the use of nonlinear classifiers achieve significantly higher accuracies. In this paper, we compare three different pattern classification methods on two problems formulated as detecting evolutionary and functional relationships between pairs of proteins, and from extensive cross validation and feature selection based studies quantify the average limits and uncertainties with which such predictions may be made. Feature selection points to a "knowledge gap" in currently available functional annotations. We demonstrate how the scheme may be employed in a framework to associate an individual protein with an existing family of evolutionarily related proteins
Recommended from our members
Nutrient Estimation from 24-Hour Food Recalls Using Machine Learning and Database Mapping: A Case Study with Lactose.
The Automated Self-Administered 24-Hour Dietary Assessment Tool (ASA24) is a free dietary recall system that outputs fewer nutrients than the Nutrition Data System for Research (NDSR). NDSR uses the Nutrition Coordinating Center (NCC) Food and Nutrient Database, both of which require a license. Manual lookup of ASA24 foods into NDSR is time-consuming but currently the only way to acquire NCC-exclusive nutrients. Using lactose as an example, we evaluated machine learning and database matching methods to estimate this NCC-exclusive nutrient from ASA24 reports. ASA24-reported foods were manually looked up into NDSR to obtain lactose estimates and split into training (n = 378) and test (n = 189) datasets. Nine machine learning models were developed to predict lactose from the nutrients common between ASA24 and the NCC database. Database matching algorithms were developed to match NCC foods to an ASA24 food using only nutrients ("Nutrient-Only") or the nutrient and food descriptions ("Nutrient + Text"). For both methods, the lactose values were compared to the manual curation. Among machine learning models, the XGB-Regressor model performed best on held-out test data (R2 = 0.33). For the database matching method, Nutrient + Text matching yielded the best lactose estimates (R2 = 0.76), a vast improvement over the status quo of no estimate. These results suggest that computational methods can successfully estimate an NCC-exclusive nutrient for foods reported in ASA24
The Gremlin Graph Traversal Machine and Language
Gremlin is a graph traversal machine and language designed, developed, and
distributed by the Apache TinkerPop project. Gremlin, as a graph traversal
machine, is composed of three interacting components: a graph , a traversal
, and a set of traversers . The traversers move about the graph
according to the instructions specified in the traversal, where the result of
the computation is the ultimate locations of all halted traversers. A Gremlin
machine can be executed over any supporting graph computing system such as an
OLTP graph database and/or an OLAP graph processor. Gremlin, as a graph
traversal language, is a functional language implemented in the user's native
programming language and is used to define the of a Gremlin machine.
This article provides a mathematical description of Gremlin and details its
automaton and functional properties. These properties enable Gremlin to
naturally support imperative and declarative querying, host language
agnosticism, user-defined domain specific languages, an extensible
compiler/optimizer, single- and multi-machine execution models, hybrid depth-
and breadth-first evaluation, as well as the existence of a Universal Gremlin
Machine and its respective entailments.Comment: To appear in the Proceedings of the 2015 ACM Database Programming
Languages Conferenc
Recommended from our members
Selective Laser Sintering Process Management Using a Relational Database
With more and more materials used in the Selective Laser Sintering (SLS) process, it is
becoming necessary to use a database to manage the process efficiently. In this paper, a
relational database for the SLS process is described. The database includes powdered material
data, sintering parameters, machine characteristics, mechanical properties and surface quality of
prototypes. Use ofthis database will make it is easy to store and retrieve processing information
and make decisions for planning the SLS. This paper will go on to describe how the database
can be extended to include other RP technologies.Mechanical Engineerin
Learning-based Analysis on the Exploitability of Security Vulnerabilities
The purpose of this thesis is to develop a tool that uses machine learning techniques to make predictions about whether or not a given vulnerability will be exploited. Such a tool could help organizations such as electric utilities to prioritize their security patching operations. Three different models, based on a deep neural network, a random forest, and a support vector machine respectively, are designed and implemented. Training data for these models is compiled from a variety of sources, including the National Vulnerability Database published by NIST and the Exploit Database published by Offensive Security. Extensive experiments are conducted, including testing the accuracy of each model, dynamically training the models on a rolling window of training data, and filtering the training data by various features. Of the chosen models, the deep neural network and the support vector machine show the highest accuracy (approximately 94% and 93%, respectively), and could be developed by future researchers into an effective tool for vulnerability analysis
Data Mining Using Relational Database Management Systems
Software packages providing a whole set of data mining and machine learning algorithms are attractive because they allow experimentation with many kinds of algorithms in an easy setup. However, these packages are often based on main-memory data structures, limiting the amount of data they can handle. In this paper we use a relational database as secondary storage in order to eliminate this limitation. Unlike existing approaches, which often focus on optimizing a single algorithm to work with a database backend, we propose a general approach, which provides a database interface for several algorithms at once. We have taken a popular machine learning software package, Weka, and added a relational storage manager as back-tier to the system. The extension is transparent to the algorithms implemented in Weka, since it is hidden behind Weka’s standard main-memory data structure interface. Furthermore, some general mining tasks are transfered into the database system to speed up execution. We tested the extended system, refered to as WekaDB, and our results show that it achieves a much higher scalability than Weka, while providing the same output and maintaining good computation time
- …