Deep generative models for network data synthesis and monitoring
Measurement and monitoring are fundamental tasks in all networks, enabling the downstream management and optimization of the network. Although networks inherently produce abundant monitoring data, accessing and effectively measuring these data is another matter. The challenges are manifold. First, network monitoring data are inaccessible to external users, and it is hard to provide a high-fidelity dataset without leaking commercially sensitive information. Second, effective data collection covering a large-scale network system can be very expensive, given the growing size of networks, e.g., the number of cells in a radio network or the number of flows in an Internet Service Provider (ISP) network. Third, it is difficult to ensure fidelity and efficiency simultaneously in network monitoring, as the resources available in network elements to support measurement functions are too limited to implement sophisticated mechanisms. Finally, understanding and explaining the behavior of the network becomes challenging due to its size and complex structure. Various optimization-based solutions (e.g., compressive sensing) and data-driven solutions (e.g., deep learning) have been proposed for these challenges, but the fidelity and efficiency of existing methods cannot yet meet current network requirements.
The contributions made in this thesis significantly advance the state of the art in network measurement and monitoring techniques. Throughout the thesis, we leverage cutting-edge machine learning technology: deep generative modeling. First, we design and realize APPSHOT, an efficient city-scale network traffic sharing system built on a conditional generative model, which requires only open-source contextual data during inference (e.g., land use information and population distribution). Second, we develop GENDT, an efficient drive testing system based on a generative model, which combines graph neural networks, conditional generation, and quantified model uncertainty to enhance the efficiency of mobile drive testing. Third, we design and implement DISTILGAN, a high-fidelity, efficient, versatile, and real-time network telemetry system with latent GANs and spectral-temporal networks. Finally, we propose SPOTLIGHT, an accurate, explainable, and efficient anomaly detection system for the Open RAN (Radio Access Network). The lessons learned through this research are summarized, and interesting topics for future work in this domain are discussed. All proposed solutions have been evaluated on real-world datasets and applied to support different applications in real systems.
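The abstract does not detail the models, but the core idea behind conditional traffic synthesis, generating traffic measurements conditioned on publicly available contextual features, can be sketched briefly. The following is a minimal, hypothetical PyTorch sketch, not the actual APPSHOT architecture; the layer sizes, feature dimensions, and names are illustrative assumptions only.

```python
# Minimal sketch of a conditional generator for traffic synthesis.
# NOT the APPSHOT architecture: layer sizes and feature dimensions
# are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=32, context_dim=8, traffic_dim=24):
        super().__init__()
        # Context = public features (e.g., land use, population density).
        self.net = nn.Sequential(
            nn.Linear(noise_dim + context_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, traffic_dim),  # e.g., 24 hourly traffic volumes
        )

    def forward(self, noise, context):
        # Concatenating noise and context makes the generated traffic
        # depend only on (openly available) contextual data at inference.
        return self.net(torch.cat([noise, context], dim=-1))

# At inference time, only contextual data is needed:
gen = ConditionalGenerator()
context = torch.rand(16, 8)              # 16 city regions, 8 context features
noise = torch.randn(16, 32)
synthetic_traffic = gen(noise, context)  # shape: (16, 24)
```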
On the Generation of Realistic and Robust Counterfactual Explanations for Algorithmic Recourse
The recent widespread deployment of machine learning algorithms presents many new challenges. Machine learning algorithms are usually opaque and can be particularly difficult to interpret. When humans are involved, algorithmic and automated decisions can negatively impact people's lives. Therefore, end users would like to be insured against potential harm. One popular way to achieve this is to provide end users with access to algorithmic recourse, which gives those negatively affected by algorithmic decisions the opportunity to reverse unfavorable decisions, e.g., from a loan denial to a loan acceptance. In this thesis, we design recourse algorithms to meet various end user needs. First, we propose methods for the generation of realistic recourses. We use generative models to suggest recourses that are likely to occur under the data distribution. To this end, we shift the recourse action from the input space to the generative model's latent space, allowing us to generate counterfactuals that lie in regions with data support. Second, we observe that small changes to the recourses prescribed to end users are likely to invalidate the suggested recourse once it is noisily implemented in practice. Motivated by this observation, we design methods for the generation of robust recourses and for assessing the robustness of recourse algorithms to data deletion requests. Third, the lack of a commonly used codebase for counterfactual explanation and algorithmic recourse algorithms, together with the vast array of evaluation measures in the literature, makes it difficult to compare the performance of different algorithms. To solve this problem, we provide an open-source benchmarking library that streamlines the evaluation process and can be used for benchmarking, rapidly developing new methods, and setting up new experiments. In summary, our work contributes to a more reliable interaction between end users and machine-learned models by covering fundamental aspects of the recourse process, and suggests new solutions for generating realistic and robust counterfactual explanations for algorithmic recourse.
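The latent-space idea can be illustrated compactly: instead of perturbing the input directly, one optimizes a latent code of a pretrained autoencoder so that the decoded point flips the classifier's decision while staying close to the original input. The sketch below is a generic, hypothetical PyTorch illustration of latent-space counterfactual search, not the thesis's specific algorithm; `encoder`, `decoder`, and `classifier` stand in for pretrained models.

```python
# Generic sketch of latent-space counterfactual search (hypothetical,
# not the thesis's exact algorithm). `encoder`, `decoder`, and
# `classifier` are assumed to be pretrained models.
import torch

def latent_recourse(x, encoder, decoder, classifier,
                    target=1.0, steps=200, lr=0.05, dist_weight=0.1):
    z = encoder(x).detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        x_cf = decoder(z)
        # Push the classifier's output towards the favorable target class...
        loss_target = ((classifier(x_cf) - target) ** 2).mean()
        # ...while keeping the counterfactual close to the original input.
        loss_dist = dist_weight * torch.norm(x_cf - x)
        loss = loss_target + loss_dist
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Decoding from the latent space keeps x_cf in regions with data support.
    return decoder(z).detach()
```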
UMSL Bulletin 2023-2024
The 2023-2024 Bulletin and Course Catalog for the University of Missouri-St. Louis. https://irl.umsl.edu/bulletin/1088/thumbnail.jp
Quantifying Equity Risk Premia: Financial Economic Theory and High-Dimensional Statistical Methods
The overarching question of this dissertation is how to quantify the unobservable risk premium of a stock when its return distribution varies over time.
The first chapter, titled “Theory-based versus machine learning-implied stock risk premia”, starts with a comparison of two competing strands of the literature. The approach advocated by Martin and Wagner (2019) relies on financial economic theory to derive a closed-form approximation of conditional risk premia using information embedded in the prices of European options. The other approach, exemplified by the study of Gu et al. (2020), draws on the flexibility of machine learning methods and vast amounts of historical data to determine the unknown functional form. The goal of this study is to determine which of the two approaches produces more accurate measurements of stock risk premia. In addition, we present a novel hybrid approach that employs machine learning to overcome the approximation errors induced by the theory-based approach. We find that our hybrid approach is competitive, especially at longer investment horizons.
The second chapter, titled “The uncertainty principle in asset pricing”, introduces a representation of the conditional capital asset pricing model (CAPM) in which the betas and the equity premium are jointly characterized by the information embedded in option prices. A unique feature of our model is that its implied components represent valid measurements of their physical counterparts without the need for any further risk adjustment. Moreover, because the model’s time-varying parameters are directly observable, the model can be tested without any of the complications that typically arise from statistical estimation. One of the main empirical findings is that the well-known flat relationship between average predicted and realized excess returns of beta-sorted portfolios can be explained by the uncertainty governing market excess returns.
In the third chapter, titled “Multi-task learning in cross-sectional regressions”, we challenge the way in which cross-sectional regressions are used to test factor models with time-varying loadings. More specifically, we extend the procedure of Fama and MacBeth (1973) by systematically selecting stock characteristics using a combination of ℓ1- and ℓ2-regularization, known as the multi-task Lasso, and by addressing the bias induced by selection via repeated sample splitting. In the empirical part of this chapter, we apply our testing procedure to the option-implied CAPM from chapter two and find that, while variants of the momentum effect lead to a rejection of the model, the implied beta is by far the most important predictor of cross-sectional return variation.
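As a rough illustration of the selection step: the multi-task Lasso applies a shared ℓ1/ℓ2 penalty so that a characteristic is either selected jointly across all cross-sectional regressions or excluded from all of them. The sketch below uses scikit-learn's MultiTaskLasso on simulated data; it is a generic illustration of this selection mechanism, not the chapter's actual testing procedure, which additionally de-biases post-selection inference via repeated sample splitting.

```python
# Generic illustration of multi-task Lasso selection (not the chapter's
# full Fama-MacBeth testing procedure). Rows = stocks, columns =
# characteristics; each task = the cross-sectional regression of one month.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(0)
n_stocks, n_chars, n_months = 500, 20, 12

X = rng.standard_normal((n_stocks, n_chars))      # stock characteristics
beta = np.zeros((n_chars, n_months))
beta[:3, :] = 0.5                                 # only 3 informative ones
Y = X @ beta + 0.1 * rng.standard_normal((n_stocks, n_months))

# The joint l1/l2 penalty zeroes out a characteristic across ALL months
# at once, i.e., selection is shared across the cross-sectional regressions.
model = MultiTaskLasso(alpha=0.1).fit(X, Y)
selected = np.where(np.any(model.coef_ != 0, axis=0))[0]
print("selected characteristics:", selected)      # expected: [0 1 2]
```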
LIPIcs, Volume 251, ITCS 2023, Complete Volume
Simple Sorting Criteria Help Find the Causal Order in Additive Noise Models
Additive Noise Models (ANMs) encode a popular functional assumption that enables learning causal structure from observational data. Due to a lack of real-world data meeting the assumptions, synthetic ANM data are often used to evaluate causal discovery algorithms. Reisach et al. (2021) show that, for common simulation parameters, a variable ordering by increasing variance is closely aligned with a causal order, and introduce var-sortability to quantify the alignment. Here, we show that not only the variance, but also the fraction of a variable's variance explained by all others, as captured by the coefficient of determination R², tends to increase along the causal order. Simple baseline algorithms can use R²-sortability to match the performance of established methods. Since R²-sortability is invariant under data rescaling, these algorithms perform equally well on standardized or rescaled data, addressing a key limitation of algorithms exploiting var-sortability. We characterize and empirically assess R²-sortability for different simulation parameters. We show that all simulation parameters can affect R²-sortability and must be chosen deliberately to control the difficulty of the causal discovery task and the real-world plausibility of the simulated data. We provide an implementation of the sortability measures and sortability-based algorithms in our library CausalDisco (https://github.com/CausalDisco/CausalDisco).
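To make the criterion concrete, here is a minimal sketch of an R²-based ordering: regress each variable on all remaining variables and sort the variables by the resulting R². This is a plain NumPy/scikit-learn illustration of the idea, not the CausalDisco implementation.

```python
# Minimal sketch of an R^2-based sorting criterion (not the CausalDisco
# implementation): regress each variable on all others and order the
# variables by increasing coefficient of determination R^2.
import numpy as np
from sklearn.linear_model import LinearRegression

def r2_sort(X):
    """Order variables by the fraction of their variance explained
    by all other variables (low R^2 first)."""
    _, d = X.shape
    r2 = np.empty(d)
    for j in range(d):
        rest = np.delete(X, j, axis=1)
        r2[j] = LinearRegression().fit(rest, X[:, j]).score(rest, X[:, j])
    return np.argsort(r2)

# Toy ANM in which x0 and x1 cause x2; the criterion places the
# sources before the sink, regardless of how the data are rescaled.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(5000)
x1 = rng.standard_normal(5000)
x2 = x0 + x1 + rng.standard_normal(5000)
X = np.column_stack([x0, x1, x2])
print(r2_sort(X))                     # x2 comes last, e.g., [1 0 2]
print(r2_sort(X * [10.0, 0.1, 1.0]))  # same order after rescaling
```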
UMSL Bulletin 2022-2023
The 2022-2023 Bulletin and Course Catalog for the University of Missouri-St. Louis. https://irl.umsl.edu/bulletin/1087/thumbnail.jp
Effects of municipal smoke-free ordinances on secondhand smoke exposure in the Republic of Korea
Objective: To reduce premature deaths due to secondhand smoke (SHS) exposure among non-smokers, the Republic of Korea (ROK) adopted changes to the National Health Promotion Act, which allowed local governments to enact municipal ordinances to strengthen their authority to designate smoke-free areas and levy penalty fines. In this study, we examined national trends in SHS exposure after the introduction of these municipal ordinances at the city level in 2010.
Methods: We used interrupted time series analysis to assess whether the trends of SHS exposure in the workplace and at home, and the primary cigarette smoking rate, changed following the policy adjustment in the national legislation in ROK. Population-standardized data for selected variables were retrieved from a nationally representative survey dataset and used to study the policy action's effectiveness.
Results: Following the change in the legislation, SHS exposure in the workplace reversed course from an increasing (18% per year) trend prior to the introduction of these smoke-free ordinances to a decreasing (−10% per year) trend after adoption and enforcement of these laws (β2 = 0.18, p-value = 0.07; β3 = −0.10, p-value = 0.02). SHS exposure at home (β2 = 0.10, p-value = 0.09; β3 = −0.03, p-value = 0.14) and the primary cigarette smoking rate (β2 = 0.03, p-value = 0.10; β3 = 0.008, p-value = 0.15) showed no significant changes in the sampled period. Although analyses stratified by sex showed that the allowance of municipal ordinances resulted in reduced SHS exposure in the workplace for both males and females, they did not affect the primary cigarette smoking rate as much, especially among females.
Conclusion: Strengthening the role of local governments by giving them the authority to enact and enforce penalties on SHS exposure violations helped ROK to reduce SHS exposure in the workplace. However, smoking behaviors and related activities seemed to shift to less restrictive areas such as on the streets and in apartment hallways, negating some of the effects of these ordinances. Future studies should investigate how smoke-free policies beyond public places can further reduce SHS exposure in ROK.
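Coefficients like those reported above typically come from a segmented regression with pre-intervention trend, level-change, and slope-change terms. Below is a minimal, generic sketch of such an interrupted time series model in Python with statsmodels; the data are simulated for illustration, and the study's exact parameterization of β2 and β3 may differ.

```python
# Minimal sketch of a segmented (interrupted time series) regression:
#   y_t = b0 + b1*time + b2*post + b3*time_since_policy + e_t
# 'time' = pre-intervention trend, 'post' = level change at the
# intervention, 'time_since' = slope change after it.
# Illustrative simulated data only, not the study's dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

years = np.arange(2005, 2018)
policy_year = 2010
df = pd.DataFrame({
    "year": years,
    "time": years - years[0],
    "post": (years >= policy_year).astype(int),
    "time_since": np.clip(years - policy_year, 0, None),
})
rng = np.random.default_rng(1)
# Exposure rises before the policy, then declines afterwards.
df["shs_exposure"] = (30 + 1.8 * df["time"]
                      - 2.8 * df["time_since"]
                      + rng.normal(0, 1, len(df)))

model = smf.ols("shs_exposure ~ time + post + time_since", data=df).fit()
print(model.params)  # 'post' ~ level change, 'time_since' ~ slope change
```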
A First Course in Causal Inference
I developed these lecture notes based on my “Causal Inference” course at the University of California, Berkeley over the past seven years. Since half of the students were undergraduates, the lecture notes require only basic knowledge of probability theory, statistical inference, and linear and logistic regression.
Data- and expert-driven variable selection for predictive models in healthcare: towards increased interpretability in underdetermined machine learning problems
Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Concrete examples include electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional, heterogeneous patient cohort data, where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may easily overfit the training data. An important approach to overcoming this issue is feature selection, i.e., determining a subset of features that are informative with respect to the target variable. While potentially raising predictive performance, feature selection fosters model interpretability by identifying a small number of relevant model parameters, helping to better understand the underlying biological processes that lead to health issues.
Interpretability requires that feature selection be stable, i.e., small changes in the dataset should not lead to changes in the selected feature set. One concept to address instability is ensemble feature selection, i.e., repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating the results in a meta-model; a sketch of this idea follows below. This thesis presents two approaches for ensemble feature selection tailored to high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic-net-regularized models, UBayFS is a general framework for ensembles with the capability to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice.
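The data-driven half of this idea can be sketched compactly: repeat an elastic-net fit on random subsamples and keep the features selected in a large fraction of the repetitions. The following is a generic illustration of ensemble feature selection via repeated elastic net, in the spirit of RENT but not the authors' implementation; the threshold and hyperparameters are illustrative assumptions.

```python
# Generic sketch of ensemble feature selection via repeated elastic net
# (in the spirit of RENT, but not the authors' implementation).
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.utils import resample

def ensemble_select(X, y, n_repeats=50, subsample=0.8,
                    alpha=0.1, l1_ratio=0.5, freq_threshold=0.8):
    n, d = X.shape
    counts = np.zeros(d)
    for seed in range(n_repeats):
        Xs, ys = resample(X, y, n_samples=int(subsample * n),
                          replace=False, random_state=seed)
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio).fit(Xs, ys)
        counts += (model.coef_ != 0)      # selection frequency per feature
    # Stable features: selected in a large fraction of the repetitions.
    return np.where(counts / n_repeats >= freq_threshold)[0]

# Toy high-dimensional example: only the first 5 of 200 features matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((80, 200))
y = X[:, :5] @ np.ones(5) + 0.5 * rng.standard_normal(80)
print(ensemble_select(X, y))   # typically a subset close to [0 1 2 3 4]
```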
Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) acquired from multiple data sources, such as those mentioned above. Quantifying the importance of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to better use of human, technical, and financial resources if systematically integrated into the planning of patient treatment, by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings.
This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented feature selection approaches. Experimental evaluations demonstrate favorable properties in terms of predictive performance, stability, and interpretability of results, which carries a high potential for better data-driven decision support in clinical practice.