1,972 research outputs found

    Event detection in location-based social networks

    With the advent of social networks and the rise of mobile technologies, users have become ubiquitous sensors capable of monitoring various real-world events in a crowd-sourced manner. Location-based social networks have proven to be faster than traditional media channels in reporting and geo-locating breaking news; for example, Osama Bin Laden’s death was first confirmed on Twitter even before the announcement from the communication department at the White House. However, the deluge of user-generated data on these networks requires intelligent systems capable of identifying and characterizing such events in a comprehensive manner. The data mining community coined the term event detection to refer to the task of uncovering emerging patterns in data streams. Nonetheless, most data mining techniques do not reproduce the underlying data generation process, which hampers their ability to self-adapt in fast-changing scenarios. Because of this, we propose a probabilistic machine learning approach to event detection which explicitly models the data generation process and enables reasoning about the discovered events. To set forth the differences between the two approaches, we present two techniques for the problem of event detection in Twitter: a data mining technique called Tweet-SCAN and a machine learning technique called Warble. We assess and compare both techniques on a dataset of tweets geo-located in the city of Barcelona during its annual festivities.
Last but not least, we present the algorithmic changes and data processing frameworks needed to scale up the proposed techniques to big data workloads. This work is partially supported by Obra Social “la Caixa”, by the Spanish Ministry of Science and Innovation under contract (TIN2015-65316), by the Severo Ochoa Program (SEV2015-0493), by SGR programs of the Catalan Government (2014-SGR-1051, 2014-SGR-118), Collectiveware (TIN2015-66863-C2-1-R) and BSC/UPC NVIDIA GPU Center of Excellence. We would also like to thank the reviewers for their constructive feedback. Peer Reviewed. Postprint (author's final draft)

    Hierarchical Change Point Detection on Dynamic Networks

    This paper studies change point detection on networks with community structures. It proposes a framework that can efficiently detect both local and global changes in networks. Importantly, it can clearly distinguish between the two types of changes. The framework design is generic, so several state-of-the-art change point detection algorithms can fit into it. Experiments on both synthetic and real-world networks show that this framework can accurately detect changes while achieving up to 800X speedup. Comment: 9 pages, ACM WebSci'1

    D-AREdevil: a novel approach for discovering disease-associated rare cell populations in mass cytometry data

    Background: Advances in single-cell technologies such as mass cytometry provide increasing resolution of the complexity of cellular samples, allowing researchers to investigate and understand cellular heterogeneity more deeply and possibly to discover previously undetectable rare cell populations. The identification of rare cell populations is of paramount importance for understanding the onset, progression and pathogenesis of many diseases. However, their identification remains challenging due to the ever-increasing dimensionality and throughput of the data generated. Aim: This study aimed to implement a straightforward approach that efficiently supports a data analyst in identifying disease-associated rare cell populations in large and complex biological samples, within reasonable limits of time and computational infrastructure. Methods: We proposed a novel computational framework called D-AREdevil (disease-associated rare cells detection) for cytometry datasets. The main characteristic of our computational framework is the combination of an anomaly detection algorithm (LOF or FiRE) that provides a continuous score for individual cells with one of the best-performing and fastest unsupervised clustering methods (FlowSOM). In our approach, the LOF score serves to select a set of candidate cells belonging to one or more subgroups of similar rare cell populations. Then, we tested these subgroups of rare cells for association with a patient group, disease type, clinical outcome or other characteristic of interest. Results: We reported in this study the properties and implementation of D-AREdevil and presented an evaluation of its performance and applications on three different testing datasets based on mass cytometry data.
We generated data mixed with one or more known rare cell populations at varying frequencies (below 1%) and tested the ability of our approach to identify those cells in order to bring them to the attention of the data analyst. This is a key step in the process of finding cell subgroups that are associated with a disease or outcome of interest when their existence is not previously known and has yet to be discovered. Conclusions: We proposed a novel computational framework with demonstrated good sensitivity and precision in detecting target rare cell populations present at very low frequencies in the total datasets (<1%).
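To make the candidate-selection step concrete, here is a minimal sketch of a local outlier factor (LOF) score used to flag unusual cells. It is not the D-AREdevil implementation: the function name, the toy two-marker data, the neighbourhood size `k=3` and the cut-off `1.5` are all assumptions made for illustration, and the sketch uses exactly `k` neighbours per point rather than the tie-inclusive neighbourhood of the original LOF definition.

```python
import math


def local_outlier_factor(points, k):
    """Return one LOF score per point; scores well above 1 suggest isolation.

    Simplified sketch: uses exactly k nearest neighbours (ties are cut
    arbitrarily) and assumes no point has k or more exact duplicates.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbours = []   # indices of the k nearest neighbours of each point
    k_distance = []   # distance from each point to its k-th nearest neighbour
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbours[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: average density of the neighbours relative to the point's own density.
    return [
        sum(lrd[j] for j in neighbours[i]) / (len(neighbours[i]) * lrd[i])
        for i in range(n)
    ]


# Toy data (hypothetical): nine "common" cells on a grid plus one distant cell.
cells = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
scores = local_outlier_factor(cells, k=3)

# Hypothetical cut-off; in practice it would be tuned to the dataset.
candidates = [i for i, s in enumerate(scores) if s > 1.5]
print(candidates)
```

In the workflow the abstract describes, cells selected this way would then be grouped by a clustering method and each subgroup tested for association with a patient group or outcome; those later steps are not sketched here.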

    Anti-fragile ICT Systems

    This book introduces a novel approach to the design and operation of large ICT systems. It views the technical solutions and their stakeholders as complex adaptive systems and argues that traditional risk analyses cannot predict all future incidents with major impacts. To avoid unacceptable events, it is necessary to establish and operate anti-fragile ICT systems that limit the impact of all incidents, and which learn from small-impact incidents how to function increasingly well in changing environments. The book applies four design principles and one operational principle to achieve anti-fragility for different classes of incidents. It discusses how systems can achieve high availability, prevent malware epidemics, and detect anomalies. Analyses of Netflix’s media streaming solution, Norwegian telecom infrastructures, e-government platforms, and Numenta’s anomaly detection software show that cloud computing is essential to achieving anti-fragility for classes of events with negative impacts.

    PIKS: A Technique to Identify Actionable Trends for Policy-Makers Through Open Healthcare Data

    With calls for increasing transparency, governments are releasing greater amounts of data in multiple domains including finance, education and healthcare. The efficient exploratory analysis of healthcare data constitutes a significant challenge. Key concerns in public health include the quick identification and analysis of trends, and the detection of outliers. This allows policies to be rapidly adapted to changing circumstances. We present an efficient outlier detection technique, termed PIKS (Pruned iterative-k means searchlight), which combines an iterative k-means algorithm with a pruned searchlight-based scan. We apply this technique to identify outliers in two publicly available healthcare datasets from the New York Statewide Planning and Research Cooperative System, and California's Office of Statewide Health Planning and Development. We compare our technique with three existing outlier detection techniques: auto-encoders, isolation forests and feature bagging. We identified outliers in conditions including suicide rates, immunity disorders, social admissions, cardiomyopathies, and pregnancy in the third trimester. We demonstrate that the PIKS technique produces results consistent with other techniques such as the auto-encoder. However, the auto-encoder needs to be trained, which requires several parameters to be tuned. In comparison, the PIKS technique has far fewer parameters to tune. This makes it advantageous for fast, "out-of-the-box" data exploration. The PIKS technique is scalable and can readily ingest new datasets. Hence, it can provide valuable, up-to-date insights to citizens, patients and policy-makers. We have made our code open source, and with the availability of open data, other researchers can easily reproduce and extend our work. This will help promote a deeper understanding of healthcare policies and public health issues.

    Matched filters for noisy induced subgraph detection

    First author draft. We consider the problem of finding the vertex correspondence between two graphs with different numbers of vertices, where the smaller graph is still potentially large. We propose a solution to this problem via a graph matching matched filter: padding the smaller graph in different ways and then using graph matching methods to align it to the larger network. Under a statistical model for correlated pairs of graphs, which yields a noisy copy of the small graph within the larger graph, the resulting optimization problem can be guaranteed to recover the true vertex correspondence between the networks, though there are currently no efficient algorithms for solving this problem. We consider an approach that exploits a partially known correspondence and show via varied simulations and applications to the Drosophila connectome that in practice this approach can achieve good performance. https://arxiv.org/abs/1803.02423

    Matched Filters for Noisy Induced Subgraph Detection

    The problem of finding the vertex correspondence between two noisy graphs with different numbers of vertices, where the smaller graph is still large, has many applications in social networks, neuroscience, and computer vision. We propose a solution to this problem via a graph matching matched filter: centering and padding the smaller adjacency matrix and applying graph matching methods to align it to the larger network. The centering and padding schemes can be incorporated into any algorithm that matches using adjacency matrices. Under a statistical model for correlated pairs of graphs, which yields a noisy copy of the small graph within the larger graph, the resulting optimization problem can be guaranteed to recover the true vertex correspondence between the networks. However, there are currently no efficient algorithms for solving this problem. To illustrate the possibilities and challenges of such problems, we use an algorithm that can exploit a partially known correspondence and show via varied simulations and applications to Drosophila and human connectomes that this approach can achieve good performance. Comment: 41 pages, 7 figures
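The centering-and-padding idea can be illustrated on a toy example. The sketch below rests on several assumptions that the abstract does not confirm: centering is taken to mean mapping off-diagonal 0/1 entries to -1/+1, padding means adding zero rows and columns up to the size of the larger graph, and the alignment is found by brute force over all permutations, which is only feasible for a handful of vertices. All function names are hypothetical.

```python
from itertools import permutations


def center(adj):
    """Map off-diagonal 0/1 entries to -1/+1; keep the diagonal at 0 (assumed scheme)."""
    n = len(adj)
    return [[0 if i == j else 2 * adj[i][j] - 1 for j in range(n)] for i in range(n)]


def pad(mat, size):
    """Zero-pad a square matrix to size x size."""
    m = len(mat)
    return [[mat[i][j] if i < m and j < m else 0 for j in range(size)]
            for i in range(size)]


def best_embedding(small_adj, large_adj):
    """Brute-force the vertex assignment that maximises edge agreement.

    Returns a tuple whose i-th entry is the large-graph vertex matched to
    small-graph vertex i. Exhaustive search is for illustration only.
    """
    m, n = len(small_adj), len(large_adj)
    small_c = pad(center(small_adj), n)
    large_c = center(large_adj)

    def score(perm):
        # Padded (zero) rows and columns contribute nothing to the objective.
        return sum(small_c[i][j] * large_c[perm[i]][perm[j]]
                   for i in range(n) for j in range(n))

    best = max(permutations(range(n)), key=score)
    return best[:m]


# Toy graphs (hypothetical): a triangle hidden on vertices 1, 3, 4 of a 5-vertex graph.
triangle = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
large = [[0, 0, 1, 0, 0],
         [0, 0, 0, 1, 1],
         [1, 0, 0, 0, 0],
         [0, 1, 0, 0, 1],
         [0, 1, 0, 1, 0]]

match = best_embedding(triangle, large)
print(sorted(match))
```

With centering, a matched non-edge is rewarded and a mismatched edge is penalised, so sparse regions of the larger graph do not look like free matches. The factorial cost of the exhaustive search is a reminder of why practical methods rely on approximate solvers and, as the abstract notes, on a partially known correspondence.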