    Algorithmic and Statistical Perspectives on Large-Scale Data Analysis

    In recent years, ideas from statistics and scientific computing have begun to interact in increasingly sophisticated and fruitful ways with ideas from computer science and the theory of algorithms to aid in the development of improved worst-case algorithms that are useful for large-scale scientific and Internet data analysis problems. In this chapter, I will describe two recent examples---one having to do with selecting good columns or features from a (DNA Single Nucleotide Polymorphism) data matrix, and the other having to do with selecting good clusters or communities from a data graph (representing a social or information network)---that drew on ideas from both areas and that may serve as a model for exploiting complementary algorithmic and statistical perspectives in order to solve applied large-scale data analysis problems.
    Comment: 33 pages. To appear in Uwe Naumann and Olaf Schenk, editors, "Combinatorial Scientific Computing," Chapman and Hall/CRC Press, 201

    An Analytical Performance Evaluation on Multiview Clustering Approaches

    Clustering is one of the many approaches within machine learning. Given a collection of data points, a clustering method assigns each point to a specific group. In theory, points in the same group should have similar attributes and characteristics, whereas points in different groups should have markedly different ones. Recent developments in information-collection technologies have made it possible to generate multiview data, that is, data collected from a variety of sources and analysed from a variety of perspectives. Conventional clustering algorithms are applied to a single view. Real-world data, however, are complicated and messy, and can be clustered in different ways depending on how they are interpreted. In recent years, Multiview Clustering (MVC) has attracted increasing attention because it aims to exploit the complementary and consensus information available across different views. Most currently available systems, however, support only the single-clustering scenario, in which only one clustering is available to partition the data. It is therefore necessary to investigate the multiview data format. This study centres on multiview clustering and on how well it performs compared with these other strategies.

    Learning Ideological Latent space in Twitter

    People are shifting from traditional news sources to online news at an incredibly fast rate. However, the technology behind online news consumption confines users to content that conforms to their own point of view. This has led to social phenomena such as polarization of viewpoints and intolerance towards opposing views. In this thesis we study information filter bubbles from a mathematical standpoint. We use data mining techniques to learn a liberal-conservative ideology space in Twitter and present a case study on how such a latent space can be used to tackle the filter bubble problem on social networks. We model the problem of learning liberal-conservative ideology as a constrained optimization problem. Using matrix factorization, we uncover an ideological latent space for the content consumption and social interaction habits of users in Twitter. We validate our model on a real-world Twitter dataset covering three controversial topics: "Obamacare", "gun control" and "abortion". Using the proposed technique we are able to separate users by their ideology with 95% purity. Our analysis shows that there is a very high correlation (0.8 - 0.9) between the ideology estimated using machine learning and the true ideology collected from various sources. Finally, we re-examine the learnt latent space and present a case study showcasing how this ideological latent space can be used to develop exploratory and interactive interfaces that can help in diffusing the information filter bubble. Our matrix-factorization-based model for learning an ideological latent space, along with the case studies, provides a theoretically solid as well as practical and interesting point of view on online polarization. Further, it provides a strong foundation and suggests several avenues for future work in multiple emerging interdisciplinary research areas, for instance, human-interpretable and explanatory machine learning, transparent recommendations, and a new field that we term Next Generation Social Networks.