Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will over- or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of over- and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a theoretically principled approach to evaluating over- and underfitting in models of network community structure and a realistic benchmark by which new methods may be evaluated and compared.
Comment: 22 pages, 13 figures, 3 tables
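The abstract's first finding, that different algorithms return different numbers and compositions of communities for the same input, can be made concrete with a small sketch. This is an illustrative comparison of two off-the-shelf NetworkX methods on a toy graph, not the paper's 16-algorithm evaluation; the choice of graph and methods is an assumption.

```python
# Illustrative sketch: two community detection methods applied to the
# same graph can disagree on the partition they return.
import networkx as nx
from networkx.algorithms import community

# Two 6-node cliques joined by a single bridge edge.
G = nx.barbell_graph(6, 0)

# Modularity-maximizing agglomerative method.
greedy = community.greedy_modularity_communities(G)
# Semi-synchronous label propagation.
labels = list(community.label_propagation_communities(G))

print("greedy modularity:", len(greedy), "communities")
print("label propagation:", len(labels), "communities")
```

Comparing partition sizes and memberships across such outputs is the simplest version of the behavioral clustering the paper performs at scale.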
A Data-Driven Approach for Modeling Agents
Agents are commonly built from simple rules derived from theories, hypotheses, and assumptions. This modeling premise makes limited use of real-world data and is challenged when modeling real-world systems because it lacks empirical grounding. At the same time, the last decade has witnessed the production and availability of large-scale data from various sensors that carry behavioral signals. These data sources have the potential to change the way we create agent-based models: from rule-driven to data-driven. Despite this opportunity, the literature has not offered a modeling approach for generating granular agent behaviors from data, leaving a gap in the literature.
This dissertation proposes a novel data-driven approach for modeling agents to bridge the research gap. The approach is composed of four detailed steps including data preparation, attribute model creation, behavior model creation, and integration. The connection between and within each step is established using data flow diagrams.
The practicality of the approach is demonstrated with a human mobility model that uses millions of location footprints collected from social media. In this model, the generation of movement behavior is tested with five machine learning/statistical modeling techniques covering a large number of model/data configurations. Results show that Random Forest-based learning is the most effective for the mobility use case. Furthermore, agent attribute values are obtained/generated with machine learning and translational assignment techniques.
The proposed approach is evaluated in two ways. First, the use case model is compared to another model developed using a state-of-the-art data-driven approach. The model's prediction performance is comparable to the state-of-the-art model, while the plausibility of its behaviors and model structure is found to be closer to the real world. This outcome indicates that the proposed approach produces realistic results. Second, a standard mobility dataset is used to drive the mobility model in place of social media data. Despite its small size, this dataset produced results resembling those of the primary use case, indicating that different datasets can be used with the proposed approach.
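The finding that Random Forest-based learning was most effective for the mobility use case can be sketched in miniature. This is a hypothetical illustration, not the dissertation's actual pipeline: the features, the synthetic movement rule, and the model settings are all assumptions.

```python
# Illustrative sketch: predict an agent's next location from simple
# spatiotemporal features with a Random Forest, using synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
hour = rng.integers(0, 24, n)      # hour of day
weekday = rng.integers(0, 7, n)    # day of week
current = rng.integers(0, 5, n)    # current location id
# Synthetic rule: at night agents return "home" (location 0),
# otherwise they stay where they are.
next_loc = np.where((hour < 6) | (hour > 21), 0, current)

X = np.column_stack([hour, weekday, current])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, next_loc)
print("train accuracy:", clf.score(X, next_loc))
```

In a real pipeline the labels would come from observed location footprints rather than a synthetic rule, and performance would be assessed on held-out trajectories.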
Factor validation and Rasch analysis of the individual recovery outcomes counter
Objective: The Individual Recovery Outcomes Counter is a 12-item personal recovery self-assessment tool for adults with mental health problems. Although widely used across Scotland, limited research into its psychometric properties has been conducted. We tested its measurement properties to ascertain the suitability of the tool for continued use in its present form.
Materials and methods: Anonymised data from the assessments of 1,743 adults using mental health services in Scotland were subject to tests based on principles of Rasch measurement theory, principal components analysis and confirmatory factor analysis.
Results: Rasch analysis revealed that the 6-point response structure of the Individual Recovery Outcomes Counter was problematic. Re-scoring on a 4-point scale revealed well-ordered items that measure a single, recovery-related construct, with acceptable fit statistics. Confirmatory factor analysis supported this. Scale items covered around 75% of the recovery continuum; individuals least far along the continuum were least well addressed.
Conclusions: A modified tool worked well for many, but not all, service users. The study suggests specific developments are required if the Individual Recovery Outcomes Counter is to maximise its utility for service users and provide meaningful data for service providers.
Implications for Rehabilitation:
- Agencies and services working with people with mental health problems aim to help them with their recovery.
- The Individual Recovery Outcomes Counter has been developed and is used widely in Scotland to help service users track their progress to recovery.
- Using a large sample of routinely collected data, we have demonstrated that a number of modifications are needed if the tool is to adequately measure recovery.
- This will involve consideration of the scoring system, item content and inclusion, and the theoretical basis of the tool.
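The re-scoring step described in the results (collapsing a 6-point response structure to 4 points) amounts to a simple recoding of item responses. The particular category merges shown here (1-2 to 1, 5-6 to 4) are an assumption for illustration, not the study's published recoding.

```python
# Hypothetical recoding of a 6-point item response scale to 4 points.
# The specific merges are assumed; a real recoding would follow the
# category thresholds identified by the Rasch analysis.
RECODE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 4, 6: 4}

def rescore(responses):
    """Map each 6-point item response onto the collapsed 4-point scale."""
    return [RECODE[r] for r in responses]

print(rescore([1, 2, 3, 4, 5, 6]))
```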
Enhancing Discrete Choice Models with Representation Learning
In discrete choice modeling (DCM), model misspecifications may lead to
limited predictability and biased parameter estimates. In this paper, we
propose a new approach for estimating choice models in which we divide the
systematic part of the utility specification into (i) a knowledge-driven part,
and (ii) a data-driven one, which learns a new representation from available
explanatory variables. Our formulation increases the predictive power of standard DCMs without sacrificing their interpretability. We show the
effectiveness of our formulation by augmenting the utility specification of the
Multinomial Logit (MNL) and the Nested Logit (NL) models with a new non-linear
representation arising from a Neural Network (NN), leading to new choice models
referred to as the Learning Multinomial Logit (L-MNL) and Learning Nested Logit
(L-NL) models. Using multiple publicly available datasets based on revealed and
stated preferences, we show that our models outperform the traditional ones,
both in terms of predictive performance and accuracy in parameter estimation.
All source code for the models is shared to promote open science.
Comment: 35 pages, 12 tables, 6 figures, +11 p. Appendix
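The core idea of the utility split described above can be sketched numerically: each alternative's systematic utility is the sum of a knowledge-driven linear term with interpretable coefficients and a data-driven term produced by a small neural network, with choice probabilities given by the usual logit formula. This is a minimal NumPy sketch, not the paper's implementation; the feature shapes and the fixed (untrained) network weights are illustrative assumptions.

```python
# Minimal sketch of an L-MNL-style utility: linear knowledge-driven part
# plus a learned non-linear representation, fed into MNL probabilities.
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_alts = 4, 3
X_k = rng.normal(size=(n_obs, n_alts, 2))  # knowledge-driven features
X_d = rng.normal(size=(n_obs, n_alts, 5))  # features fed to the NN term

beta = np.array([-0.8, 0.5])               # interpretable coefficients
W1 = rng.normal(size=(5, 8))               # hidden-layer weights (assumed fixed)
w2 = rng.normal(size=8)                    # output weights (assumed fixed)

def utility(X_k, X_d):
    knowledge = X_k @ beta                       # linear, interpretable part
    representation = np.tanh(X_d @ W1) @ w2      # learned non-linear part
    return knowledge + representation

V = utility(X_k, X_d)
P = np.exp(V - V.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)              # MNL choice probabilities
print(P.round(3))
```

In the actual models, beta and the network weights are estimated jointly by maximum likelihood, which is what preserves the interpretability of the knowledge-driven coefficients.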
Limits of Model Selection, Link Prediction, and Community Detection
Relational data have become increasingly ubiquitous. Networks are rich tools from graph theory that represent real-world interactions through a simple abstract structure of nodes and edges. Network analysis and modeling have gained wide attention from researchers in disciplines such as computer science, social science, biology, economics, electrical engineering, and physics. Network analysis is the study of network topology to answer application-based questions about the original real-world problem. For example, in social network analysis the questions concern how people interact in online social networks or collaboration networks, how diseases propagate or information flows through a network, or how to control a disease or food outbreak. In electric networks such as power grids, or in internet networks, the questions may concern vulnerability assessment in preparation for a power outage or internet blackout. In biological network analysis, the questions concern how different diseases are related to each other, which can be useful in discovering new symptoms of diseases and in developing new medicines. The importance of this interdisciplinary area of science clearly stems from its widespread applications, which involve scientists and researchers with a variety of backgrounds and interests.
Although networks are much simpler than the original complex systems, the interactions among nodes in real-world networks may seem random, and capturing patterns in these entities is not trivial. There are many open questions about inference on networks, which makes this topic very attractive to researchers in the field. In this dissertation we answer some of these questions along two lines of study: one focused on experimental analyses and one focused on theoretical limitations.
In Chapter 2 we look at community detection, a common graph mining task in network inference, which seeks an unsupervised decomposition of a network into groups based on statistical regularities in network connectivity. Although many such algorithms exist, community detection's No Free Lunch theorem implies that no algorithm can be optimal across all inputs. However, little is known in practice about how different algorithms over- or underfit to real networks, or how to reliably assess such behavior across algorithms. We present a broad investigation of over- and underfitting across 16 state-of-the-art community detection algorithms applied to a novel benchmark corpus of 572 structurally diverse real-world networks. We find that (i) algorithms vary widely in the number and composition of communities they find, given the same input; (ii) algorithms can be clustered into distinct high-level groups based on similarities of their outputs on real-world networks; (iii) algorithmic differences induce wide variation in accuracy on link-based learning tasks; and (iv) no algorithm is always the best at such tasks across all inputs. Finally, we quantify each algorithm's overall tendency to over- or underfit to network data using a theoretically principled diagnostic, and discuss the implications for future advances in community detection.
In Chapter 3 we investigate the link prediction problem, another important inference task in complex networks with a wide variety of applications. As observed in Chapter 2, differences among community detection algorithms induce wide variation in link prediction accuracy. Moreover, many link prediction techniques exist in the literature, yet there is still a lack of methodology for analyzing and comparing them. We provide a methodological overview of link prediction techniques and present new results on optimal link prediction and on transfer learning for link prediction. In the former, we investiga
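To make the link prediction task concrete, here is a hedged sketch of one standard baseline predictor, common-neighbors scoring: missing edges are ranked by how many neighbors the two endpoints share. This is included only as an illustration of the task; it is not the specific optimal or transfer-learning method the chapter develops.

```python
# Common-neighbors link prediction: score each non-edge by the number
# of neighbors its two endpoints share in an undirected graph.
from itertools import combinations

def common_neighbors_scores(adj):
    """adj: dict mapping node -> set of neighbors.
    Returns {(u, v): score} for every node pair without an edge."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores[(u, v)] = len(adj[u] & adj[v])
    return scores

# A 4-cycle: both missing diagonals share two neighbors.
adj = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(common_neighbors_scores(adj))
```

Evaluating such a predictor typically means hiding a fraction of true edges, scoring all non-edges, and measuring how highly the hidden edges rank.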