92 research outputs found

    Approximate Computation and Implicit Regularization for Very Large-scale Data Analysis

    Database theory and database practice are typically the domain of computer scientists who adopt what may be termed an algorithmic perspective on their data. This perspective is very different from the more statistical perspective adopted by statisticians, scientific computing researchers, machine learners, and others who work on what may be broadly termed statistical data analysis. In this article, I will address fundamental aspects of this algorithmic-statistical disconnect, with an eye to bridging the gap between these two very different approaches. A concept that lies at the heart of this disconnect is that of statistical regularization, a notion that has to do with how robust the output of an algorithm is to the noise properties of the input data. Although it is nearly completely absent from computer science, which historically has taken the input data as given and modeled algorithms discretely, regularization in one form or another is central to nearly every application domain that applies algorithms to noisy data. By using several case studies, I will illustrate, both theoretically and empirically, the nonobvious fact that approximate computation, in and of itself, can implicitly lead to statistical regularization. This and other recent work suggests that, by exploiting in a more principled way the statistical properties implicit in worst-case algorithms, one can in many cases satisfy the bicriteria of having algorithms that are scalable to very large-scale databases and that also have good inferential or predictive properties. Comment: To appear in the Proceedings of the 2012 ACM Symposium on Principles of Database Systems (PODS 2012).
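    As a toy illustration of the central claim (my own sketch, not one of the paper's case studies), the following numpy snippet sets up an ill-conditioned, noisy least-squares problem and compares the exact solution with an early-stopped gradient descent; the problem sizes and noise level are arbitrary. Computing only an approximate solution acts like regularization and recovers the underlying signal far more accurately.

```python
# Illustrative sketch only: early stopping as implicit regularization.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 50
U, _ = np.linalg.qr(rng.standard_normal((n, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
s = np.logspace(0, -4, d)                       # rapidly decaying spectrum: ill-conditioned
A = U @ np.diag(s) @ V.T
x_true = rng.standard_normal(d)
b = A @ x_true + 0.01 * rng.standard_normal(n)  # noisy observations

x_exact = np.linalg.lstsq(A, b, rcond=None)[0]  # exact least-squares solution chases the noise

x = np.zeros(d)                                 # early-stopped gradient descent
step = 1.0 / np.linalg.norm(A, 2) ** 2
for _ in range(200):                            # stop long before convergence
    x -= step * A.T @ (A @ x - b)

print("error of exact solve    :", np.linalg.norm(x_exact - x_true))
print("error of early stopping :", np.linalg.norm(x - x_true))
```

    The exact solve amplifies noise along directions with tiny singular values, whereas the truncated iteration never reaches them; that is the sense in which the approximation implicitly regularizes.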

    Algorithmic and Statistical Perspectives on Large-Scale Data Analysis

    In recent years, ideas from statistics and scientific computing have begun to interact in increasingly sophisticated and fruitful ways with ideas from computer science and the theory of algorithms to aid in the development of improved worst-case algorithms that are useful for large-scale scientific and Internet data analysis problems. In this chapter, I will describe two recent examples---one having to do with selecting good columns or features from a (DNA Single Nucleotide Polymorphism) data matrix, and the other having to do with selecting good clusters or communities from a data graph (representing a social or information network)---that drew on ideas from both areas and that may serve as a model for exploiting complementary algorithmic and statistical perspectives in order to solve applied large-scale data analysis problems. Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors, "Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201
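    One standard tool in this line of work is sampling columns according to their leverage scores, computed from the top right singular vectors. The sketch below is a simplified illustration on invented synthetic data, not the chapter's SNP analysis: it picks a few columns of a low-rank-plus-noise matrix this way and compares the projection error against the best rank-k approximation.

```python
# Simplified sketch: leverage-score column sampling on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((300, 40)) @ rng.standard_normal((40, 1000))
A += 0.1 * rng.standard_normal(A.shape)          # roughly rank-40 matrix plus noise
k, c = 10, 40                                    # target rank, number of columns to keep

U, s, Vt = np.linalg.svd(A, full_matrices=False)
lev = np.sum(Vt[:k] ** 2, axis=0)                # rank-k leverage score of each column
cols = rng.choice(A.shape[1], size=c, replace=False, p=lev / lev.sum())

C = A[:, cols]                                   # the sampled columns
proj = C @ np.linalg.pinv(C) @ A                 # project A onto their span
A_k = (U[:, :k] * s[:k]) @ Vt[:k]                # best rank-k approximation, for reference

print("||A - CC^+A||_F :", np.linalg.norm(A - proj))
print("||A - A_k||_F   :", np.linalg.norm(A - A_k))
```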

    Doctor of Philosophy

    Public health surveillance systems are crucial for the timely detection of and response to public health threats. Since the terrorist attacks of September 11, 2001, and the release of anthrax in the following month, there has been a heightened interest in public health surveillance. The years immediately following these attacks were met with increased awareness and funding from the federal government, which has significantly strengthened the United States' surveillance capabilities; however, despite these improvements, today's public health surveillance systems face substantial challenges. Problems with the current surveillance systems include: a) failure to leverage unstructured public health data for surveillance purposes; and b) lack of information integration and of the ability to leverage resources, applications, or other surveillance efforts, because systems are built on a centralized model. This research addresses these problems by focusing on the development and evaluation of new informatics methods to improve public health surveillance. To address the problems above, we first identified a current public health surveillance workflow that is affected by the problems described and offers the opportunity for enhancement through current informatics techniques. The 122 Mortality Surveillance for Pneumonia and Influenza was chosen as the primary use case for this dissertation work. The second step involved demonstrating the feasibility of using unstructured public health data, in this case death certificates. For this, we created and evaluated a pipeline, composed of a detection rule and a natural language processor, for the coding of death certificates and the identification of pneumonia and influenza cases. The second problem was addressed by presenting the rationale for creating a federated model that leverages grid technology concepts and tools for the sharing and epidemiological analysis of public health data. As a case study of this approach, a secured virtual organization was created in which users are able to access two grid data services, using death certificates from the Utah Department of Health, and two analytical grid services, MetaMap and R. A scientific workflow was created using the published services to replicate the mortality surveillance workflow. To validate these approaches and provide proofs of concept, a series of real-world scenarios was conducted.
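    To make the detection step concrete, a minimal sketch of what a rule over free-text cause-of-death fields can look like is shown below. The keywords and sample records are invented for illustration; the dissertation's actual pipeline couples a formal detection rule with MetaMap rather than a bare regular expression.

```python
# Toy detection rule for pneumonia and influenza (P&I) in cause-of-death text.
import re

PI_PATTERN = re.compile(r"\b(pneumonia|influenza|flu)\b", re.IGNORECASE)

def is_pi_case(cause_of_death_lines):
    """Flag a death certificate whose cause-of-death text mentions P&I."""
    return any(PI_PATTERN.search(line) for line in cause_of_death_lines)

records = [  # invented examples
    {"id": 1, "causes": ["Acute respiratory failure", "Bacterial pneumonia"]},
    {"id": 2, "causes": ["Myocardial infarction"]},
    {"id": 3, "causes": ["Influenza A with secondary pneumonia"]},
]
flagged = [r["id"] for r in records if is_pi_case(r["causes"])]
print("P&I cases to count for weekly mortality surveillance:", flagged)
```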

    Harnessing Evolution in-Materio as an Unconventional Computing Resource

    This thesis illustrates the use and development of physical conductive analogue systems for unconventional computing using the Evolution in-Materio (EiM) paradigm. EiM uses an Evolutionary Algorithm to configure and exploit a physical material (or medium) for computation. While EiM processors show promise, fundamental questions and scaling issues remain. Additionally, their development is hindered by slow manufacturing and physical experimentation. This work addressed these issues by implementing simulated models to speed up research efforts, followed by investigations of physically implemented novel in-materio devices. Initial work leveraged simulated conductive networks as single-substrate ‘monolithic’ EiM processors, performing classification by formulating the system as an optimisation problem solved using Differential Evolution. Different material properties and algorithm parameters were isolated and investigated; this explained the capabilities of the configurable parameters and showed that the ideal nanomaterial choice depends upon problem complexity. Subsequently, drawing from concepts in the wider Machine Learning field, several enhancements to monolithic EiM processors were proposed and investigated. These ensured more efficient use of training data, better placement of classification decision boundaries, an independently optimised readout layer, and a smoother search space. Finally, scalability and performance issues were addressed by constructing in-Materio Neural Networks (iM-NNs), in which several EiM processors were stacked in parallel and operated as physical realisations of hidden-layer neurons. Greater flexibility in system implementation was achieved by re-using a single physical substrate recursively as several virtual neurons, but this sacrificed faster parallelised execution. These novel iM-NNs were first implemented using simulated in-materio neurons and trained for classification as Extreme Learning Machines, which were found to outperform artificial networks of a similar size. Physical iM-NNs were then implemented using a Raspberry Pi, a custom hardware interface, and Lambda-diode-based physical in-materio neurons, which were trained successfully with neuroevolution. A more complex AutoEncoder structure was then proposed and implemented physically to perform dimensionality reduction on a handwritten-digits dataset, outperforming both Principal Component Analysis and artificial AutoEncoders. This work presents an approach to exploiting systems with interesting physical dynamics and leveraging them as a computational resource. Such systems could become low-power, high-speed, unconventional computing assets in the future.
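    The Extreme Learning Machine training mentioned above can be sketched in software as follows: a fixed random nonlinear layer stands in for the physical in-materio neurons, and only the linear readout is fitted, by least squares. This is a generic stand-in on a toy task, not the thesis's hardware, models, or datasets.

```python
# Software analogue of an ELM readout: fixed random hidden layer + least-squares readout.
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(400, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)      # toy two-class problem

n_hidden = 64
W = rng.standard_normal((2, n_hidden))         # random and never trained: the "material"
b = rng.standard_normal(n_hidden)
H = np.tanh(X @ W + b)                         # hidden-layer responses

beta, *_ = np.linalg.lstsq(H, y, rcond=None)   # train only the readout layer
accuracy = ((H @ beta > 0.5).astype(float) == y).mean()
print("training accuracy:", accuracy)
```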

    The Unprecedented Assimilation of Mobile Telephony in Ireland: a Phenomenon of the Celtic Tiger Era or a Result of Cultural Traits?

    Following its universal acceptance over the last three decades, the once elitist gadget of mobile telephony has become an indispensable, democratic tool of everyday communication. Controversially, this thesis illustrates that its levels of both adoption and usage did not develop in a similarly homogeneous pattern across selected OECD countries. In particular, the Irish performance is rather astonishing given the speed of adoption as well as the exceptionally high revenue figures achieved by the wireless operators. Consequently, this work determines a selection of factors that drove and encouraged both the adoption and usage of cellular telephony in Ireland. The Irish experience is examined in the light of Rogers' theory of the adoption and diffusion of innovation, and it demonstrates that domestic socio-economic factors such as the traditional Irish family structure helped the adoption process, as did the country's young demography following the launch of prepaid services. Similarly, historic events such as emigration and the policy of attracting overseas companies to settle in Ireland created the traits of a cosmopolite, open-economy society, whereas the civil war and governmental policies hindered an adequate rollout of the PSTN, which resulted in a migration towards cellular telephony. Significantly, by deploying a linear regression model, this thesis showed that Hofstede's cultural dimension of uncertainty avoidance correlates the most with mobile telephony adoption. Controversially, while this dimension is generally linked with Protestant cultures, the finding is rather contradictory when recalling Ireland's tradition of Catholicism, and it puts a long-cherished stereotype associated with Ireland into question. It was further demonstrated that the Irish benefited from their selection of the global TACS standard, which promised economies of scale and, subsequently, reasonably priced equipment. Owing to this selection, the incumbent established some form of international roaming, which was a novelty outside the NMT system sphere at the time. With regard to the exceptional revenue figures, which were seen as the result of a 'rip-off' policy by the wireless carriers, this thesis found proof that they were in fact a consequence of the Irish people's enthusiastic mobile phone usage rather than a product of over-charging. It was further demonstrated that the stereotype of the talkative Irish is rooted in their legacy of story-telling as well as being a consequence of the British suppression, when the mother tongue was used to both conserve and keep their culture alive. Following independence from their occupiers, this regained freedom can easily be observed in the extensive rates of speech and 'pirate' radio broadcasting. Altogether, it was shown that Irish society resonates most favourably with the adoption of an innovation such as mobile telephony, thereby underpinning the relevance of cultural and social factors in addition to traditional, solely economic and market-orientated models.
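    The regression step reported above has roughly the following shape; the country rows and adoption figures below are invented placeholders, not the thesis's OECD dataset, and the Hofstede scores are for illustration only.

```python
# Shape of the analysis only: regress adoption on Hofstede dimensions (invented data).
import numpy as np

dims = ["PDI", "IDV", "MAS", "UAI"]            # power distance, individualism,
                                               # masculinity, uncertainty avoidance
X = np.array([[28, 70, 68, 35],                # hypothetical country rows
              [35, 89, 66, 35],
              [68, 71, 43, 86],
              [31, 71,  5, 29],
              [57, 51, 42, 86],
              [40, 80, 62, 46]], dtype=float)
adoption = np.array([116., 121., 95., 108., 110., 118.])  # subscriptions per 100 people (invented)

design = np.column_stack([np.ones(len(X)), X])             # add an intercept column
coef, *_ = np.linalg.lstsq(design, adoption, rcond=None)
for name, c in zip(dims, coef[1:]):
    print(f"{name}: {c:+.2f}")
```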

    The ICT Landscape in Brazil, India and China

    The Information Society Unit at IPTS (European Commission) has been investigating the Information and Communication Technologies (ICT) sector and ICT R&D in Asia for several years. This research exercise led to three reports, written by national experts, on China, India and Taiwan, each including a dataset and a technical annex. This report offers a synthesis on three of the four BRIC countries (Brazil, India, Russia, China). For each of the three countries covered (Brazil, India, China), the report describes the ICT sector and gives a company-level assessment. It also analyses Indian ICT R&D strategies and assesses the innovation model. In 2010, BRIC countries accounted for 13% of global demand, with spending of about €328 billion on ICT (EITO, 2011). They are therefore becoming major players as producers of ICT goods and services. China has become the world's largest producer of ICT products (its ICT exports increased fourfold between 2004 and 2008). This impressive growth of the ICT market translates into R&D expenditure and output. Innovative capability in Asia has grown, and the catching-up dynamics are strong. Asian countries are increasingly present in the global ICT R&D landscape. JRC.J.3-Information Society

    Optimization of A Real Time Multi Mixed Make-To-Order Assembly Line to Reduce Positive Drift

    Assembly lines are critical for the realization of product manufacture. In recent times, there has been a shift from the make-to-stock (mass production) approach to a make-to-order (mass customization) approach, and this has placed a strong emphasis on product variety. Although variety can be added to a product at various phases of production, the literature shows that, by providing each functional module of the product with several variants, assembly lines offer the most cost-effective approach to achieving high product variety. However, there are certain challenges associated with using assembly lines to achieve product variety. One of these challenges is assembly line balancing: the search for an optimum assignment of tasks, subject to given precedence constraints, such that a pre-defined single- or multi-objective goal is met. These objectives include reducing the number of stations for a given cycle time or minimizing the cycle time for a given number of stations. Cycle time refers to the amount of time allotted to accomplish a certain process in an assembly process. In practice, the time actually taken at a station deviates from this prescribed cycle time, and this deviation is technically referred to as drift. Drift can be negative or positive. Negative drift represents the time span during which an assembly line is idle, due to work being finished ahead of the prescribed cycle time. Positive drift, meanwhile, represents the time span by which an assembly line exceeds the prescribed cycle time. The problems caused by drift, especially positive drift, are so significant that a research niche, known as Assembly Line Balancing Problems, is dedicated to their study. Various authors have proposed numerous solutions for solving assembly line balancing problems created by positive drift. However, there is very little information on optimizing multi-model make-to-order systems with real-time inputs so as to reduce the effects of positive drift. This study looks at how such a system can be optimized, using the case study of a water bottling plant. This is done by initially reviewing the literature in the field of assembly line balancing to isolate the research gap this study aims to fill. Secondly, the water bottling plant described in the case study is modelled using MATLAB/Simulink. Thirdly, different optimization methodologies are discussed and applied to the created model. Finally, the optimized model is tested and the results are analysed. The results of this study show that positive drift, which can be a major challenge in a real-time multi-mixed assembly line, can be reduced by the optimization of assembly lines. The results of this study can also be seen as an addition to the knowledge base of the broader research on mixed-model assembly line balancing.
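    The drift terminology can be illustrated with a small worked example (my own toy numbers, not the thesis's MATLAB/Simulink model): compare each station's actual processing time for one order against the prescribed cycle time and accumulate the overruns.

```python
# Toy drift calculation against a prescribed cycle time (illustrative numbers only).
cycle_time = 60.0                                   # seconds allotted per station
station_times = [55.0, 62.5, 60.0, 71.0, 48.0]      # hypothetical actual times for one order

for i, t in enumerate(station_times, start=1):
    drift = t - cycle_time
    if drift > 0:
        kind = "positive drift (overrun)"
    elif drift < 0:
        kind = "negative drift (idle time)"
    else:
        kind = "on cycle"
    print(f"station {i}: {t:5.1f}s -> drift {drift:+5.1f}s  {kind}")

total_positive_drift = sum(max(0.0, t - cycle_time) for t in station_times)
print("total positive drift to be minimised:", total_positive_drift, "seconds")
```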

    Topics in Matrix Sampling Algorithms

    We study three fundamental problems of Linear Algebra, lying at the heart of various Machine Learning applications, namely: 1) "Low-rank Column-based Matrix Approximation". We are given a matrix A and a target rank k. The goal is to select a subset of columns of A and, by using only these columns, compute a rank-k approximation to A that is as good as the rank-k approximation that would have been obtained by using all the columns; 2) "Coreset Construction in Least-Squares Regression". We are given a matrix A and a vector b. Consider the (over-constrained) least-squares problem of minimizing ||Ax-b|| over all vectors x in D. The domain D represents the constraints on the solution and can be arbitrary. The goal is to select a subset of the rows of A and b and, by using only these rows, find a solution vector that is as good as the solution vector that would have been obtained by using all the rows; 3) "Feature Selection in K-means Clustering". We are given a set of points described with respect to a large number of features. The goal is to select a subset of the features and, by using only this subset, obtain a k-partition of the points that is as good as the partition that would have been obtained by using all the features. We present novel algorithms for all three problems mentioned above. Our results can be viewed as follow-up research to a line of work known as "Matrix Sampling Algorithms". [Frieze, Kannan, Vempala, 1998] presented the first such algorithm, for the Low-rank Matrix Approximation problem. Since then, such algorithms have been developed for several other problems, e.g. Graph Sparsification and Linear Equation Solving. Our contributions to this line of research are: (i) improved algorithms for Low-rank Matrix Approximation and Regression; (ii) algorithms for a new problem domain (K-means Clustering). Comment: PhD Thesis, 150 pages
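    For a flavour of problem (2), the sketch below samples rows of a synthetic over-constrained least-squares problem by their leverage scores and solves only on that subset. This is a generic illustration of the coreset idea under invented data, not the thesis's algorithms or guarantees.

```python
# Illustrative row-sampling coreset for unconstrained least squares (synthetic data).
import numpy as np

rng = np.random.default_rng(3)
n, d = 5000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

Q, _ = np.linalg.qr(A)                         # thin QR: row norms of Q give leverage scores
lev = np.sum(Q ** 2, axis=1)
p = lev / lev.sum()

r = 200                                        # keep only r of the n rows
idx = rng.choice(n, size=r, replace=True, p=p)
w = 1.0 / np.sqrt(r * p[idx])                  # importance-sampling reweighting
x_coreset = np.linalg.lstsq(A[idx] * w[:, None], b[idx] * w, rcond=None)[0]
x_full = np.linalg.lstsq(A, b, rcond=None)[0]

print("residual, full solve   :", np.linalg.norm(A @ x_full - b))
print("residual, coreset solve:", np.linalg.norm(A @ x_coreset - b))
```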