5 research outputs found
A Novel Integrated Framework to Ensure Better Data Quality in Big Data Analytics over Cloud Environment
With the advent of Big Data analytics, healthcare systems are increasingly adopting analytical services, which ultimately generate a massive load of highly unstructured data. Our review of existing systems finds that few solutions address the problems of data variety, data uncertainty, and data speed. It is important that error-free data arrive at the analytics stage. Existing systems offer single-purpose solutions aimed at a single platform. We therefore introduce an integrated framework capable of addressing all three problems in a single execution. Using synthetic healthcare big data, our investigation finds that the proposed system, which uses a deep learning architecture, offers better optimization of computational resources. The study outcome shows better response time and a higher accuracy rate than the existing optimization techniques that are widely practiced in the literature.
Learning Models over Relational Data using Sparse Tensors and Functional Dependencies
Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground of theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by synergizing key
tools from database theory such as schema information, query structure,
functional dependencies, recent advances in query evaluation algorithms, and
from linear algebra such as tensor and matrix operations, one can formulate
relational analytics problems and design efficient (query and data)
structure-aware algorithms to solve them.
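
As a rough illustration of why aggregates can stand in for an exported training set, the sketch below fits a ridge regression from the sums X^T X and X^T y, accumulated tuple by tuple over a join. This is a minimal sketch of a well-known property of ridge regression, not the AC/DC implementation; the toy relations `sales` and `items`, the helper `ridge_from_aggregates`, and the penalty `lam` are all hypothetical names chosen for the example.

```python
import numpy as np


def ridge_from_aggregates(sigma, c, n_rows, lam):
    """Solve (sigma / n + lam * I) theta = c / n for theta.

    sigma is the sum of outer products x x^T and c is the sum of y * x,
    both taken over all joined tuples. No design matrix is required.
    """
    d = sigma.shape[0]
    return np.linalg.solve(sigma / n_rows + lam * np.eye(d), c / n_rows)


# Hypothetical toy relations: Sales(store, item, units) and Items(item -> price).
sales = [(0, "a", 3.0), (0, "b", 1.0), (1, "a", 4.0), (1, "b", 2.0)]
items = {"a": 2.0, "b": 5.0}

# Accumulate the aggregates over the join Sales |x| Items, one tuple at a time.
d = 2  # features: intercept and price
sigma = np.zeros((d, d))
c = np.zeros(d)
n = 0
for _store, item, units in sales:
    x = np.array([1.0, items[item]])  # feature vector of the joined tuple
    sigma += np.outer(x, x)
    c += units * x
    n += 1

theta = ridge_from_aggregates(sigma, c, n, lam=0.1)
print(theta)
```

The loop here still visits every joined tuple; a structure-aware system would instead push these sums into the query plan so that the join result is never enumerated in full.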
This theoretical development informed the design and implementation of the
AC/DC system for structure-aware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as its
competitors and up to three orders of magnitude faster than its competitors
whenever they do not run out of memory, exceed the 24-hour timeout, or encounter
internal design limitations.
Comment: 61 pages, 9 figures, 2 tables
Efficient Solution of Minimum Cost Flow Problems for Large-scale Transportation Networks
With the rapid advance of information technology in the transportation industry, of which intermodal transportation is one of the most important subfields, the scale and dimensionality of problems and datasets are rising significantly. This trend raises the need for research on improving the efficiency, profitability, and competitiveness of intermodal transportation networks while exploiting the rich information in the big data related to these networks. This dissertation therefore investigates intermodal transportation network design problems, especially practical optimization problems, and develops more realistic and effective models and solution approaches to assist network operators and decision makers of the intermodal transportation system. It focuses on a novel strategy for solving the Minimum Cost Flow (MCF) problem in large-scale network design by adopting a divide-and-conquer policy during the optimization process. The main contribution is an agglomerative-clustering-based tiling strategy that significantly reduces the computation time and peak memory consumption of the MCF model for large-scale networks. The tiling strategy is supported by the region-division theorem and the ε-approximation region-division theorem, both proposed and proved in this dissertation. The region-division theorem gives a sufficient condition that guarantees exact consistency between the local MCF solution of each sub-network obtained by the tiling strategy and the global MCF solution of the whole network. The ε-approximation region-division theorem provides worst-case bounds, so that the optimal value of the practical approximate MCF solution stays close to that of the optimal solution. A series of experiments is performed to evaluate the utility of the proposed approach for solving large-scale MCF problems.
The results indicate that the proposed approach reduces both execution time and peak memory consumption in large-scale MCF problems under different circumstances.
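
For readers unfamiliar with the underlying problem, the sketch below solves a small Minimum Cost Flow instance with the textbook successive-shortest-paths method, using Bellman-Ford on the residual graph. It illustrates only the base MCF problem, not the dissertation's tiling strategy or its theorems; the function name `min_cost_flow`, the edge format, and the toy network are assumptions made for the example, and the code assumes no self-loops and no negative-cost cycles.

```python
def min_cost_flow(n, edges, source, sink, demand):
    """Send `demand` units from source to sink at minimum total cost.

    edges is a list of (u, v, capacity, cost) tuples over nodes 0..n-1.
    Returns the total cost, or None if the demand cannot be routed.
    """
    # Residual graph: each entry is [to, residual_capacity, cost, reverse_index].
    graph = [[] for _ in range(n)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    inf = float("inf")
    total_cost = 0
    remaining = demand
    while remaining > 0:
        # Bellman-Ford: cheapest residual path (reverse edges have negative cost).
        dist = [inf] * n
        dist[source] = 0
        prev = [None] * n  # (predecessor node, index of edge used)
        for _ in range(n - 1):
            updated = False
            for u in range(n):
                if dist[u] == inf:
                    continue
                for i, (v, cap, cost, _rev) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        updated = True
            if not updated:
                break
        if dist[sink] == inf:
            return None  # no augmenting path left: demand is infeasible

        # Find the bottleneck capacity along the path, then push flow.
        push = remaining
        v = sink
        while v != source:
            u, i = prev[v]
            push = min(push, graph[u][i][1])
            v = u
        v = sink
        while v != source:
            u, i = prev[v]
            graph[u][i][1] -= push
            graph[v][graph[u][i][3]][1] += push
            v = u
        total_cost += push * dist[sink]
        remaining -= push
    return total_cost


# Hypothetical 4-node network: (from, to, capacity, cost per unit).
toy_edges = [(0, 1, 2, 1), (0, 2, 1, 2), (1, 2, 1, 1), (1, 3, 1, 3), (2, 3, 2, 1)]
print(min_cost_flow(4, toy_edges, source=0, sink=3, demand=2))
```

Each augmentation rescans the whole residual graph, which is the kind of cost that grows quickly with network size and that a divide-and-conquer scheme aims to contain.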