18 research outputs found
Ranking of multidimensional drug profiling data by fractional-adjusted bi-partitional scores
Motivation: The recent development of high-throughput drug profiling (high content screening or HCS) provides a large amount of quantitative multidimensional data. Despite its potentials, it poses several challenges for academia and industry analysts alike. This is especially true for ranking the effectiveness of several drugs from many thousands of images directly. This paper introduces, for the first time, a new framework for automatically ordering the performance of drugs, called fractional adjusted bi-partitional score (FABS). This general strategy takes advantage of graph-based formulations and solutions and avoids many shortfalls of traditionally used methods in practice. We experimented with FABS framework by implementing it with a specific algorithm, a variant of normalized cut—normalized cut prime (FABS-NC′), producing a ranking of drugs. This algorithm is known to run in polynomial time and therefore can scale well in high-throughput applications
Archives of Data Science, Series A. Vol. 1,1: Special Issue: Selected Papers of the 3rd German-Polish Symposium on Data Analysis and Applications
The first volume of Archives of Data Science, Series A is a special issue of a selection of contributions which have been originally presented at the {\em 3rd Bilateral German-Polish Symposium on Data Analysis and Its Applications} (GPSDAA 2013). All selected papers fit into the emerging field of data science consisting of the mathematical sciences (computer science, mathematics, operations research, and statistics) and an application domain (e.g. marketing, biology, economics, engineering)
Recommended from our members
Applications and Advances in Similarity-based Machine Learning
Similarity-based machine learning methods differ from traditional machine learning methods in that they also use pairwise similarity relations between objects to infer the labels of unlabeled objects. A recent comparative study for classification problems by Baumann et al. [2019] demonstrated that similarity-based techniques have superior performance and robustness when compared to well-established machine learning techniques. Similarity-based machine learning methods benefit from two advantages that could explain superior their performance: They can make use of the pairwise relations between unlabeled objects, and they are robust due to the transitive property of pairwise similarities. A challenge for similarity-based machine learning methods on large datasets is that the number of pairwise similarity grows quadratically in the size of the dataset. For large datasets, it thus becomes practically impossible to compute all possible pairwise similarities. In 2016, Hochbaum and Baumann proposed the technique of sparse computation to address this growth by computing only those pairwise similarities that are relevant. Their proposed implementation of sparse computation is still difficult to scale to millions objects. This dissertation focuses on advancing the practical implementations of sparse computation to larger datasets and on two applications for which similarity-based machine learning was particularly effective. The applications that are studied here are cell identification in calcium-imaging movies and detecting aberrant linking behavior in directed networks. For sparse computation we present faster, geometric algorithms and a technique, named sparse-reduced computation, that combines sparse computation with compression. The geometric algorithms compute the exact same output as the original implementation of sparse computation, but identify the relevant pairwise similarities faster by using the concept of data shifting for identifying objects in the same or neighboring blocks. Empirical results on datasets with up to 10 million objects show a significant reduction in running time. Sparse-reduced computation combines sparse computation with a technique for compressing highly-similar or identical objects, enabling the use of similarity-based machine learning on massively-large datasets. The computational results demonstrate that sparse-reduced computation provides a significant reduction in running time with a minute loss in accuracy.A major problem facing neuroscientists today is cell identification in calcium-imaging movies. These movies are in-vivo recordings of thousands of neurons at cellular resolution. There is a great need for automated approaches to extract the activity of single neurons from these movies since manual post-processing takes tens of hours per dataset. We present the HNCcorr algorithm for cell identification in calcium-imaging movies. The name HNCcorr is derived from its use of the similarity-based Hochbaum's Normalized Cut (HNC) model with pairwise similarities derived from correlation. In HNCcorr, the task of cell detection is approached as a clustering problem. HNCcorr utilizes HNC to detect cells in these movies as coherent clusters of pixels that are highly distinct from the remaining pixels. HNCcorr guarantees, unlike existing methodologies for cell identification, a globally optimal solution to the underlying optimization problem. Of independent interest is a novel method, named similarity-squared, that we devised for measuring similarity between pixels. We provide an experimental study and demonstrate that HNCcorr is a top performer on the Neurofinder cell identification benchmark and that it improves over algorithms based on matrix factorization.The second application is detecting aberrant agents, such as fake news sources or spam websites, based on their link behavior in networks. Across contexts, a distinguishing characteristic between normal and aberrant agents is that normal agents rarely link to aberrant ones. We refer to this phenomenon as aberrant linking behavior. We present an Markov Random Fields (MRF) formulation, with links as the pairwise similarities, that detects aberrant agents based on aberrant linking behavior and any prior information (if given). This MRF formulation is solved optimally and in polynomial time. We compare the optimal solution for the MRF formulation to well-known algorithms based on random walks. In our empirical experiment with twenty-three different datasets, the MRF method outperforms the other detection algorithms. This work represents the first use of optimization methods for detecting aberrant agents as well as the first time that MRF is applied to directed graphs
Recommended from our members
Ranking of multidimensional drug profiling data by fractional-adjusted bi-partitional scores.
MotivationThe recent development of high-throughput drug profiling (high content screening or HCS) provides a large amount of quantitative multidimensional data. Despite its potentials, it poses several challenges for academia and industry analysts alike. This is especially true for ranking the effectiveness of several drugs from many thousands of images directly. This paper introduces, for the first time, a new framework for automatically ordering the performance of drugs, called fractional adjusted bi-partitional score (FABS). This general strategy takes advantage of graph-based formulations and solutions and avoids many shortfalls of traditionally used methods in practice. We experimented with FABS framework by implementing it with a specific algorithm, a variant of normalized cut-normalized cut prime (FABS-NC(')), producing a ranking of drugs. This algorithm is known to run in polynomial time and therefore can scale well in high-throughput applications.ResultsWe compare the performance of FABS-NC(') to other methods that could be used for drugs ranking. We devise two variants of the FABS algorithm: FABS-SVM that utilizes support vector machine (SVM) as black box, and FABS-Spectral that utilizes the eigenvector technique (spectral) as black box. We compare the performance of FABS-NC(') also to three other methods that have been previously considered: center ranking (Center), PCA ranking (PCA), and graph transition energy method (GTEM). The conclusion is encouraging: FABS-NC(') consistently outperforms all these five alternatives. FABS-SVM has the second best performance among these six methods, but is far behind FABS-NC('): In some cases FABS-NC(') produces over half correctly predicted ranking experiment trials than FABS-SVM.AvailabilityThe system and data for the evaluation reported here will be made available upon request to the authors after this manuscript is accepted for publication
Multivariate Analysis in Management, Engineering and the Sciences
Recently statistical knowledge has become an important requirement and occupies a prominent position in the exercise of various professions. In the real world, the processes have a large volume of data and are naturally multivariate and as such, require a proper treatment. For these conditions it is difficult or practically impossible to use methods of univariate statistics. The wide application of multivariate techniques and the need to spread them more fully in the academic and the business justify the creation of this book. The objective is to demonstrate interdisciplinary applications to identify patterns, trends, association sand dependencies, in the areas of Management, Engineering and Sciences. The book is addressed to both practicing professionals and researchers in the field
The Impact of Community Cohesion on Crime
Community cohesion generally acts to increase the safety of communities by increasing informal guardianship, and enhancing the work of formal crime prevention organisations. Understanding the dynamics of local social interactions is essential for community building. However, community cohesion is difficult to empirically quantify, because there are no obvious and direct indicators of community cohesion collected at population levels within official datasets. A potentially more promising alternative for estimating community cohesion is through the use of data from social media. Social media offers an opportunity for exploring networks of social interactions in a local community.
This research will use social media data to explore the impact of community cohesion on crime. Sentiment analysis of tweets can help to uncover patterns of community mood in different areas. Modelling of community engagement on Facebook is useful for understanding patterns of social interactions and the strength of social networks in local communities.
The central contribution of this thesis is the use of new metrics that estimate popularity, commitment and virality known as the PCV indicators for quantifying community cohesion on social media. These metrics, combined with diversity statistics constructed from “traditional” Census data, provide a better correlate of community cohesion and crime. To demonstrate the viability of this novel method for estimating the impact of community cohesion, a model of community engagement and burglary rates is constructed using Leeds community areas as an example. By examining the diversity of different community areas and strength of their social networks, from traditional and new data sources; it was found that stability and strong social media engagement in a local area are associated with lower burglary rates. The proposed new method can provide a better alternative for estimating community cohesion and its impact on crime. It is recommended that policy planning for resource allocation and community building needs to consider social structure and social networks in different communities
Projection-Based Clustering through Self-Organization and Swarm Intelligence
It covers aspects of unsupervised machine learning used for knowledge discovery in data science and introduces a data-driven approach to cluster analysis, the Databionic swarm (DBS). DBS consists of the 3D landscape visualization and clustering of data. The 3D landscape enables 3D printing of high-dimensional data structures. The clustering and number of clusters or an absence of cluster structure are verified by the 3D landscape at a glance. DBS is the first swarm-based technique that shows emergent properties while exploiting concepts of swarm intelligence, self-organization and the Nash equilibrium concept from game theory. It results in the elimination of a global objective function and the setting of parameters. By downloading the R package DBS can be applied to data drawn from diverse research fields and used even by non-professionals in the field of data mining
Projection-Based Clustering through Self-Organization and Swarm Intelligence: Combining Cluster Analysis with the Visualization of High-Dimensional Data
Cluster Analysis; Dimensionality Reduction; Swarm Intelligence; Visualization; Unsupervised Machine Learning; Data Science; Knowledge Discovery; 3D Printing; Self-Organization; Emergence; Game Theory; Advanced Analytics; High-Dimensional Data; Multivariate Data; Analysis of Structured Dat