Evaluating Overfit and Underfit in Models of Network Community Structure
A common data mining task on networks is community detection, which seeks an
unsupervised decomposition of a network into structural groups based on
statistical regularities in the network's connectivity. Although many methods
exist, the No Free Lunch theorem for community detection implies that each
makes some kind of tradeoff, and no algorithm can be optimal on all inputs.
Thus, different algorithms will over- or underfit on different inputs, finding
more, fewer, or just different communities than is optimal, and evaluation
methods that use a metadata partition as a ground truth will produce misleading
conclusions about general accuracy. Here, we present a broad evaluation of over- and underfitting in community detection, comparing the behavior of 16
state-of-the-art community detection algorithms on a novel and structurally
diverse corpus of 406 real-world networks. We find that (i) algorithms vary
widely both in the number of communities they find and in their corresponding
composition, given the same input, (ii) algorithms can be clustered into
distinct high-level groups based on similarities of their outputs on real-world
networks, and (iii) these differences induce wide variation in accuracy on link
prediction and link description tasks. We introduce a new diagnostic for
evaluating overfitting and underfitting in practice, and use it to roughly
divide community detection methods into general and specialized learning
algorithms. Across methods and inputs, Bayesian techniques based on the
stochastic block model and a minimum description length approach to
regularization represent the best general learning approach, but can be
outperformed under specific circumstances. These results introduce both a theoretically principled approach to evaluating over- and underfitting in models of network community structure and a realistic benchmark by which new methods may be evaluated and compared.
Comment: 22 pages, 13 figures, 3 tables
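The abstract's first finding, that different algorithms return different numbers and compositions of communities for the same input, can be made concrete with a small sketch. This is an illustrative comparison of two off-the-shelf NetworkX methods on a toy graph, not the paper's 16-algorithm evaluation; the choice of graph and methods is an assumption.

```python
# Illustrative sketch: two community detection methods applied to the
# same graph can disagree on the partition they return.
import networkx as nx
from networkx.algorithms import community

# Two 6-node cliques joined by a single bridge edge.
G = nx.barbell_graph(6, 0)

# Modularity-maximizing agglomerative method.
greedy = community.greedy_modularity_communities(G)
# Semi-synchronous label propagation.
labels = list(community.label_propagation_communities(G))

print("greedy modularity:", len(greedy), "communities")
print("label propagation:", len(labels), "communities")
```

Comparing partition sizes and memberships across such outputs is the simplest version of the behavioral clustering the paper performs at scale.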
A Data-Driven Approach for Modeling Agents
Agents are commonly built from simple rules derived from theories, hypotheses, and assumptions. This modeling premise makes limited use of real-world data and is challenged when modeling real-world systems because it lacks empirical grounding. At the same time, the last decade has witnessed the production and availability of large-scale data from various sensors that carry behavioral signals. These data sources have the potential to change the way we create agent-based models: from rule-driven to data-driven. Despite this opportunity, the literature has not offered a modeling approach for generating granular agent behaviors from data, leaving a gap in the literature.
This dissertation proposes a novel data-driven approach for modeling agents to bridge the research gap. The approach is composed of four detailed steps including data preparation, attribute model creation, behavior model creation, and integration. The connection between and within each step is established using data flow diagrams.
The practicality of the approach is demonstrated with a human mobility model that uses millions of location footprints collected from social media. In this model, the generation of movement behavior is tested with five machine learning/statistical modeling techniques covering a large number of model/data configurations. Results show that Random Forest-based learning is the most effective for the mobility use case. Furthermore, agent attribute values are obtained/generated with machine learning and translational assignment techniques.
The proposed approach is evaluated in two ways. First, the use case model is compared to another model developed using a state-of-the-art data-driven approach. The model's prediction performance is comparable to the state-of-the-art model, while the plausibility of its behaviors and model structure is found to be closer to the real world. This outcome indicates that the proposed approach produces realistic results. Second, a standard mobility dataset is used to drive the mobility model in place of social media data. Despite its small size, this dataset produced results resembling those of the primary use case, indicating that different datasets can be used with the proposed approach.
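The finding that Random Forest-based learning was most effective for the mobility use case can be sketched in miniature. This is a hypothetical illustration, not the dissertation's actual pipeline: the features, the synthetic movement rule, and the model settings are all assumptions.

```python
# Illustrative sketch: predict an agent's next location from simple
# spatiotemporal features with a Random Forest, using synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
hour = rng.integers(0, 24, n)      # hour of day
weekday = rng.integers(0, 7, n)    # day of week
current = rng.integers(0, 5, n)    # current location id
# Synthetic rule: at night agents return "home" (location 0),
# otherwise they stay where they are.
next_loc = np.where((hour < 6) | (hour > 21), 0, current)

X = np.column_stack([hour, weekday, current])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, next_loc)
print("train accuracy:", clf.score(X, next_loc))
```

In a real pipeline the labels would come from observed location footprints rather than a synthetic rule, and performance would be assessed on held-out trajectories.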
Factor validation and Rasch analysis of the individual recovery outcomes counter
Objective: The Individual Recovery Outcomes Counter is a 12-item personal recovery self-assessment tool for adults with mental health problems. Although widely used across Scotland, limited research into its psychometric properties has been conducted. We tested its measurement properties to ascertain the suitability of the tool for continued use in its present form.
Materials and methods: Anonymised data from the assessments of 1,743 adults using mental health services in Scotland were subject to tests based on principles of Rasch measurement theory, principal components analysis and confirmatory factor analysis.
Results: Rasch analysis revealed that the 6-point response structure of the Individual Recovery Outcomes Counter was problematic. Re-scoring on a 4-point scale revealed well-ordered items that measure a single, recovery-related construct, with acceptable fit statistics. Confirmatory factor analysis supported this. Scale items covered around 75% of the recovery continuum; individuals least far along the continuum were least well addressed.
Conclusions: A modified tool worked well for many, but not all, service users. The study suggests specific developments are required if the Individual Recovery Outcomes Counter is to maximise its utility for service users and provide meaningful data for service providers.
Implications for Rehabilitation:
- Agencies and services working with people with mental health problems aim to help them with their recovery.
- The Individual Recovery Outcomes Counter has been developed and is used widely in Scotland to help service users track their progress to recovery.
- Using a large sample of routinely collected data, we have demonstrated that a number of modifications are needed if the tool is to adequately measure recovery.
- This will involve consideration of the scoring system, item content and inclusion, and the theoretical basis of the tool.
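The re-scoring step described in the results (collapsing a 6-point response structure to 4 points) amounts to a simple recoding of item responses. The particular category merges shown here (1-2 to 1, 5-6 to 4) are an assumption for illustration, not the study's published recoding.

```python
# Hypothetical recoding of a 6-point item response scale to 4 points.
# The specific merges are assumed; a real recoding would follow the
# category thresholds identified by the Rasch analysis.
RECODE = {1: 1, 2: 1, 3: 2, 4: 3, 5: 4, 6: 4}

def rescore(responses):
    """Map each 6-point item response onto the collapsed 4-point scale."""
    return [RECODE[r] for r in responses]

print(rescore([1, 2, 3, 4, 5, 6]))
```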
Enhancing Discrete Choice Models with Representation Learning
In discrete choice modeling (DCM), model misspecifications may lead to
limited predictability and biased parameter estimates. In this paper, we
propose a new approach for estimating choice models in which we divide the
systematic part of the utility specification into (i) a knowledge-driven part,
and (ii) a data-driven one, which learns a new representation from available
explanatory variables. Our formulation increases the predictive power of standard DCMs without sacrificing their interpretability. We show the
effectiveness of our formulation by augmenting the utility specification of the
Multinomial Logit (MNL) and the Nested Logit (NL) models with a new non-linear
representation arising from a Neural Network (NN), leading to new choice models
referred to as the Learning Multinomial Logit (L-MNL) and Learning Nested Logit
(L-NL) models. Using multiple publicly available datasets based on revealed and
stated preferences, we show that our models outperform the traditional ones,
both in terms of predictive performance and accuracy in parameter estimation.
All source code for the models is shared to promote open science.
Comment: 35 pages, 12 tables, 6 figures, +11 p. Appendix
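The core idea of the utility split described above can be sketched numerically: each alternative's systematic utility is the sum of a knowledge-driven linear term with interpretable coefficients and a data-driven term produced by a small neural network, with choice probabilities given by the usual logit formula. This is a minimal NumPy sketch, not the paper's implementation; the feature shapes and the fixed (untrained) network weights are illustrative assumptions.

```python
# Minimal sketch of an L-MNL-style utility: linear knowledge-driven part
# plus a learned non-linear representation, fed into MNL probabilities.
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_alts = 4, 3
X_k = rng.normal(size=(n_obs, n_alts, 2))  # knowledge-driven features
X_d = rng.normal(size=(n_obs, n_alts, 5))  # features fed to the NN term

beta = np.array([-0.8, 0.5])               # interpretable coefficients
W1 = rng.normal(size=(5, 8))               # hidden-layer weights (assumed fixed)
w2 = rng.normal(size=8)                    # output weights (assumed fixed)

def utility(X_k, X_d):
    knowledge = X_k @ beta                       # linear, interpretable part
    representation = np.tanh(X_d @ W1) @ w2      # learned non-linear part
    return knowledge + representation

V = utility(X_k, X_d)
P = np.exp(V - V.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)              # MNL choice probabilities
print(P.round(3))
```

In the actual models, beta and the network weights are estimated jointly by maximum likelihood, which is what preserves the interpretability of the knowledge-driven coefficients.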
Limits of Model Selection, Link Prediction, and Community Detection
Relational data have become increasingly ubiquitous. Networks are rich tools from graph theory that represent real-world interactions through a simple abstract structure of nodes and edges. Network analysis and modeling have gained wide attention from researchers in disciplines such as computer science, social science, biology, economics, electrical engineering, and physics. Network analysis is the study of network topology to answer application-based questions about the original real-world problem. For example, in social network analysis the questions concern how people interact in online social networks or collaboration networks, how diseases propagate or information flows through a network, or how to control a disease or food outbreak. In electric networks such as power grids, or in internet networks, the questions may concern vulnerability assessment in preparation for a power outage or internet blackout. In biological network analysis, the questions concern how different diseases are related to each other, which can be useful in discovering new symptoms of diseases and in developing new medicines. The importance of this interdisciplinary area of science clearly stems from its widespread applications, which involve scientists and researchers with a variety of backgrounds and interests.
Although networks are much simpler than the original complex systems, the interactions among nodes in real-world networks may seem random, and capturing patterns in these entities is not trivial. There are many open questions about inference on networks, which makes this topic very attractive to researchers in the field. In this dissertation we answer some of these questions along two lines of study: one focused on experimental analyses and one focused on theoretical limitations.
In Chapter 2 we look at community detection, a common graph mining task in network inference, which seeks an unsupervised decomposition of a network into groups based on statistical regularities in network connectivity. Although many such algorithms exist, community detection's No Free Lunch theorem implies that no algorithm can be optimal across all inputs. However, little is known in practice about how different algorithms over- or underfit to real networks, or how to reliably assess such behavior across algorithms. We present a broad investigation of over- and underfitting across 16 state-of-the-art community detection algorithms applied to a novel benchmark corpus of 572 structurally diverse real-world networks. We find that (i) algorithms vary widely in the number and composition of communities they find, given the same input; (ii) algorithms can be clustered into distinct high-level groups based on similarities of their outputs on real-world networks; (iii) algorithmic differences induce wide variation in accuracy on link-based learning tasks; and (iv) no algorithm is always the best at such tasks across all inputs. Finally, we quantify each algorithm's overall tendency to over- or underfit to network data using a theoretically principled diagnostic, and discuss the implications for future advances in community detection.
In Chapter 3 we investigate the link prediction problem, another important inference task in complex networks with a wide variety of applications. As observed in Chapter 2, differences among community detection algorithms induce wide variation in link prediction accuracy. Moreover, many link prediction techniques exist in the literature, yet there is still a lack of methodology for analyzing and comparing them. We provide a methodological overview of link prediction techniques and present new results on optimal link prediction and on transfer learning for link prediction. In the former, we investiga
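To make the link prediction task concrete, here is a hedged sketch of one standard baseline predictor, common-neighbors scoring: missing edges are ranked by how many neighbors the two endpoints share. This is included only as an illustration of the task; it is not the specific optimal or transfer-learning method the chapter develops.

```python
# Common-neighbors link prediction: score each non-edge by the number
# of neighbors its two endpoints share in an undirected graph.
from itertools import combinations

def common_neighbors_scores(adj):
    """adj: dict mapping node -> set of neighbors.
    Returns {(u, v): score} for every node pair without an edge."""
    scores = {}
    for u, v in combinations(sorted(adj), 2):
        if v not in adj[u]:
            scores[(u, v)] = len(adj[u] & adj[v])
    return scores

# A 4-cycle: both missing diagonals share two neighbors.
adj = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
print(common_neighbors_scores(adj))
```

Evaluating such a predictor typically means hiding a fraction of true edges, scoring all non-edges, and measuring how highly the hidden edges rank.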