    A new processing approach for reducing computational complexity in cloud-RAN mobile networks

    Cloud computing is considered one of the key drivers for the next generation of mobile networks (e.g. 5G). It coincides with a dramatic expansion in mobile networks, involving millions (or even billions) of subscribers and a growing number of current and future mobile applications (e.g. IoT). The Cloud Radio Access Network (C-RAN) architecture has been proposed as a way to use cloud computing as an efficient computing resource and so meet the requirements of future cellular networks. However, the computational complexity of obtaining the channel state information in a fully centralized C-RAN increases as the network is scaled up, because the channel information matrices grow. To tackle this problem of complexity and latency, the MapReduce framework and fast matrix algorithms are proposed. This paper presents two levels of complexity reduction in the process of estimating the channel information in cellular networks. In the first, the results illustrate that complexity can be reduced from O(N^3) to O((N/k)^3), where N is the total number of remote radio heads (RRHs) and k is the number of RRHs per group, by dividing the processing of RRHs into parallel groups and using the MapReduce parallel algorithm to process them. The second approach reduces the computational complexity from O((N/k)^3) to O((N/k)^2.807) using fast matrix inversion algorithms. The reduction in complexity and latency leads to a significant improvement in both the estimation time and the scalability of C-RAN networks.
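
The exponent 2.807 quoted in the abstract above equals log2(7), which is the exponent of Strassen's matrix multiplication. The abstract does not name the algorithm, so treating it as Strassen's is an assumption here. Fast inversion schemes of this kind are built on a fast multiplication kernel, and only that kernel is sketched below. The function name `strassen` and the `leaf` cutoff are illustrative choices, not taken from the paper.

```python
import math
import numpy as np

def strassen(a, b, leaf=64):
    """Multiply two square matrices whose size is a power of two.

    Uses 7 recursive products per level instead of 8, which gives the
    O(n^log2(7)) cost. Below the `leaf` size it falls back to NumPy.
    """
    n = a.shape[0]
    if n <= leaf:
        return a @ b
    h = n // 2
    a11, a12, a21, a22 = a[:h, :h], a[:h, h:], a[h:, :h], a[h:, h:]
    b11, b12, b21, b22 = b[:h, :h], b[:h, h:], b[h:, :h], b[h:, h:]
    # The seven Strassen products.
    m1 = strassen(a11 + a22, b11 + b22, leaf)
    m2 = strassen(a21 + a22, b11, leaf)
    m3 = strassen(a11, b12 - b22, leaf)
    m4 = strassen(a22, b21 - b11, leaf)
    m5 = strassen(a11 + a12, b22, leaf)
    m6 = strassen(a21 - a11, b11 + b12, leaf)
    m7 = strassen(a12 - a22, b21 + b22, leaf)
    # Reassemble the four quadrants of the product.
    c = np.empty((n, n), dtype=np.result_type(a, b))
    c[:h, :h] = m1 + m4 - m5 + m7
    c[:h, h:] = m3 + m5
    c[h:, :h] = m2 + m4
    c[h:, h:] = m1 - m2 + m3 + m6
    return c

rng = np.random.default_rng(0)
a = rng.normal(size=(128, 128))
b = rng.normal(size=(128, 128))
print("matches a @ b:", np.allclose(strassen(a, b, leaf=16), a @ b))
print("log2(7) =", math.log2(7))
```

In practice the cutoff matters: for small blocks an optimized cubic routine is usually faster, so the recursion only pays off for large matrices.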

    Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis

    Database theory and database practice are typically the domain of computer scientists who adopt what may be termed an algorithmic perspective on their data. This perspective is very different from the more statistical perspective adopted by statisticians, scientific computers, machine learners, and others who work on what may be broadly termed statistical data analysis. In this article, I will address fundamental aspects of this algorithmic-statistical disconnect, with an eye to bridging the gap between these two very different approaches. A concept that lies at the heart of this disconnect is that of statistical regularization, a notion that has to do with how robust the output of an algorithm is to the noise properties of the input data. Although it is nearly completely absent from computer science, which historically has taken the input data as given and modeled algorithms discretely, regularization in one form or another is central to nearly every application domain that applies algorithms to noisy data. By using several case studies, I will illustrate, both theoretically and empirically, the nonobvious fact that approximate computation, in and of itself, can implicitly lead to statistical regularization. This and other recent work suggests that, by exploiting in a more principled way the statistical properties implicit in worst-case algorithms, one can in many cases satisfy the bicriteria of having algorithms that are scalable to very large-scale databases and that also have good inferential or predictive properties. Comment: To appear in the Proceedings of the 2012 ACM Symposium on Principles of Database Systems (PODS 2012).
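
One familiar way to see approximate computation acting as a regularizer is early-stopped gradient descent on a least-squares problem. This is not one of the article's own case studies; it is a minimal sketch chosen for illustration, and the function name and synthetic data are assumptions. Started from zero with a small enough step, the iterates grow in norm toward the exact solution, so stopping early returns a shrunken estimate, much as an explicit norm penalty would.

```python
import numpy as np

def gradient_descent_path(x, y, steps):
    """Run gradient descent on 0.5 * ||x w - y||^2 from w = 0.

    Returns every iterate so that early-stopped (approximate) answers
    can be compared with the fully converged one.
    """
    # Step size 1 / (largest eigenvalue of x^T x) keeps the iteration stable.
    step = 1.0 / np.linalg.norm(x, 2) ** 2
    w = np.zeros(x.shape[1])
    path = [w.copy()]
    for _ in range(steps):
        w = w - step * (x.T @ (x @ w - y))
        path.append(w.copy())
    return path

# Synthetic noisy regression data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(50, 20))
y = x @ rng.normal(size=20) + rng.normal(size=50)

path = gradient_descent_path(x, y, steps=200)
norms = [float(np.linalg.norm(w)) for w in path]
w_exact = np.linalg.lstsq(x, y, rcond=None)[0]

print("norm after 5 steps:  ", norms[5])
print("norm after 200 steps:", norms[200])
print("norm of exact answer:", float(np.linalg.norm(w_exact)))
```

The number of iterations plays the role that a penalty weight plays in ridge regression: fewer steps mean a smaller-norm, more heavily regularized estimate.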

    Arabic semantic similarity approach for farmers’ complaints

    Semantic similarity is applied in many areas of Natural Language Processing, such as information retrieval, text classification, and plagiarism detection. Many researchers have used semantic similarity for English texts, but few have used it for Arabic because of the ambiguity of Arabic concepts in both sense and morphology. Therefore, the first contribution of this paper is a semantic similarity approach for Arabic sentences. The world currently faces the global problem of coronavirus disease. Under these circumstances and the imposition of distancing, it is difficult for farmers to meet agricultural experts in person to obtain advice and find suitable solutions for their agricultural complaints. In addition, most farmers still use traditional practices. Thus, our second contribution is helping farmers resolve their Arabic agricultural complaints using the proposed approach. Latent Semantic Analysis is applied to retrieve the stored problem that is semantically closest to a farmer’s complaint and to find the related solution for the farmer. Two weighting schemes are used for data representation in this approach: Term Frequency and Term Frequency-Inverse Document Frequency. The proposed model also classifies the large agricultural dataset and the submitted farmer complaint by crop type using a MapReduce Support Vector Machine, to improve the semantic similarity results. With Term Frequency-Inverse Document Frequency-based Latent Semantic Analysis, the proposed approach outperformed its counterparts, achieving an F-measure of 86.7%.
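
The retrieval step described above can be sketched with TF-IDF weighting, a truncated SVD, and cosine similarity. This is a minimal illustration under stated assumptions, not the authors' pipeline. The English toy complaints stand in for Arabic text, tokenization is a plain whitespace split, the IDF formula `log(n / df) + 1` is one common variant, and the names `build_lsa` and `most_similar` are hypothetical. The MapReduce SVM crop classifier is not shown.

```python
import numpy as np

def build_lsa(docs, rank):
    """Build a rank-limited latent space from whitespace-tokenized documents."""
    vocab = sorted({t for d in docs for t in d.split()})
    index = {t: i for i, t in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(docs)))        # term-by-document counts
    for j, d in enumerate(docs):
        for t in d.split():
            tf[index[t], j] += 1.0
    df = np.count_nonzero(tf, axis=1)             # document frequency per term
    idf = np.log(len(docs) / df) + 1.0            # assumed IDF variant
    weighted = tf * idf[:, None]                  # TF-IDF matrix
    u, _, _ = np.linalg.svd(weighted, full_matrices=False)
    basis = u[:, :rank]                           # top `rank` latent directions
    doc_vecs = basis.T @ weighted                 # documents in latent space
    return index, idf, basis, doc_vecs

def most_similar(query, model):
    """Return the index of the closest document and all cosine scores."""
    index, idf, basis, doc_vecs = model
    q = np.zeros(len(index))
    for t in query.split():
        if t in index:                            # unknown words are ignored
            q[index[t]] += 1.0
    q_vec = basis.T @ (q * idf)                   # fold the query into the space
    denom = np.linalg.norm(doc_vecs, axis=0) * np.linalg.norm(q_vec) + 1e-12
    scores = (doc_vecs.T @ q_vec) / denom
    return int(np.argmax(scores)), scores

# Hypothetical stored complaints (English placeholders for Arabic text).
complaints = [
    "tomato leaves yellow spots fungus",
    "wheat rust orange powder leaves",
    "tomato fruit rot water excess",
    "wheat stem weak wind lodging",
]
model = build_lsa(complaints, rank=2)
best, scores = most_similar("yellow spots on tomato leaves", model)
print("closest stored complaint:", complaints[best])
```

Replacing the TF-IDF weights with raw term counts (dropping the `idf` factor) gives the plain Term Frequency variant that the abstract compares against.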

    Reducing scientific computing algorithms to distributed computing frameworks

    Scientific computing uses computers and algorithms to solve problems in various sciences such as genetics, biology and chemistry. Often the goal is to model and simulate natural phenomena that would otherwise be very difficult to study in real environments. For example, it is possible to create a model of a solar storm or a meteorite impact and run computer simulations to assess the impact of the disaster on the environment. The more sophisticated and accurate the simulations are, the more computing power is required. It is often necessary to use a large number of computers, all working simultaneously on a single problem. This kind of computation is called parallel or distributed computing. However, creating distributed computing programs is complicated and requires much more time and resources, because the work done simultaneously on different computers has to be synchronized. A number of software frameworks have been created to simplify this process by automating part of the distributed programming. The goal of this research was to assess the suitability of such distributed computing frameworks for complex scientific computing algorithms. The results showed that existing frameworks are very different from each other and that none of them is suitable for all types of algorithms. Some frameworks are only suitable for simple algorithms; others are not suitable when the data does not fit into the computers' memory. Choosing the most appropriate distributed computing framework for an algorithm can be a very complex task, because it requires studying and applying the existing frameworks.
While searching for a solution to this problem, it was decided to create a Dynamic Algorithms Modelling Application (DAMA), which is able to simulate the implementation of an algorithm in different distributed computing frameworks. DAMA helps to estimate which distributed framework is the most appropriate for a given algorithm, without actually implementing the algorithm in any of the available frameworks. The main contribution of this study is simplifying the adoption of distributed computing frameworks for researchers who are not yet familiar with them. It should save significant time and resources, as it is not necessary to study and apply each of the available distributed computing frameworks one by one.

    Distributed Estimation and Inference for the Analysis of Big Biomedical Data

    This thesis focuses on developing and implementing new statistical methods to address some of the current difficulties encountered in the analysis of high-dimensional correlated biomedical data. Following the divide-and-conquer paradigm, I develop a theoretically sound and computationally tractable class of distributed statistical methods that are made accessible to practitioners through R statistical software. This thesis aims to establish a class of distributed statistical methods for regression analyses with very large outcome variables arising in many biomedical fields, such as in metabolomic or imaging research. The general distributed procedure divides data into blocks that are analyzed on a parallelized computational platform and combines these separate results via Hansen’s (1982) generalized method of moments. These new methods provide distributed and efficient statistical inference in many different regression settings. Computational efficiency is achieved by leveraging recent developments in large scale computing, such as the MapReduce paradigm on the Hadoop platform. In the first project presented in Chapter III, I develop a divide-and-conquer procedure implemented in a parallelized computational scheme for statistical estimation and inference of regression parameters with high-dimensional correlated responses. This project is motivated by an electroencephalography study whose goal is to determine the effect of iron deficiency on infant auditory recognition memory. The proposed method (published as Hector and Song (2020a)), the Distributed and Integrated Method of Moments (DIMM), divides responses into subvectors to be analyzed in parallel using pairwise composite likelihood, and combines results using an optimal one-step meta-estimator. 
In the second project presented in Chapter IV, I develop an extended theoretical framework of distributed estimation and inference to incorporate a broad range of classical statistical models and biomedical data types. To improve computational speed and meet data privacy demands, I propose to divide data by outcomes and subjects, leading to a doubly divide-and-conquer paradigm. I also address parameter heterogeneity explicitly for added flexibility. I establish a new theoretical framework for the analysis of a broad class of big data problems to facilitate valid statistical inference for biomedical researchers. Possible applications include genomic data, metabolomic data, longitudinal and spatial data, and many more. In the third project presented in Chapter V, I propose a distributed quadratic inference function framework to jointly estimate regression parameters from multiple potentially heterogeneous data sources with correlated vector outcomes. This project is motivated by the analysis of the association between smoking and metabolites in a large cohort study. The primary goal of this joint integrative analysis is to estimate covariate effects on all outcomes through a marginal regression model in a statistically and computationally efficient way. To overcome computational and modeling challenges arising from the high-dimensional likelihood of the correlated vector outcomes, I propose to analyze each data source using Qu et al.’s quadratic inference functions, and then to jointly reestimate parameters from each data source by accounting for correlation between data sources.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163220/1/ehector_1.pd
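
The divide-and-combine idea can be illustrated in its simplest form with ordinary least squares split by subjects. This sketch is a stand-in and not DIMM: the thesis divides responses, fits blocks with pairwise composite likelihood, and combines them through the generalized method of moments, none of which is implemented here. The function names and the simulated data are assumptions. For a linear model, weighting each block estimate by its information matrix reproduces the full-data estimate exactly, which makes the map and reduce steps easy to check.

```python
import numpy as np

def block_estimates(x, y, n_blocks):
    """Map step: fit least squares separately on each block of rows."""
    results = []
    for xb, yb in zip(np.array_split(x, n_blocks), np.array_split(y, n_blocks)):
        gram = xb.T @ xb                            # block information matrix
        beta = np.linalg.solve(gram, xb.T @ yb)     # block-level estimate
        results.append((beta, gram))
    return results

def combine(results):
    """Reduce step: information-weighted average of the block estimates."""
    total = sum(gram for _, gram in results)
    weighted = sum(gram @ beta for beta, gram in results)
    return np.linalg.solve(total, weighted)

# Simulated regression data (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(600, 5))
y = x @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=600)

beta_combined = combine(block_estimates(x, y, n_blocks=6))
beta_full = np.linalg.lstsq(x, y, rcond=None)[0]
print("max difference from full-data fit:",
      float(np.max(np.abs(beta_combined - beta_full))))
```

Each block needs only its own rows, so the map step can run on separate workers and only the small per-block summaries have to be sent to the reducer.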
