Model Reduction for the Kuramoto-Sakaguchi Model: analyzing the effect of non-entrained rogue oscillators
The Kuramoto-Sakaguchi model is a paradigmatic model of coupled oscillator systems that display collective behaviour. This thesis is concerned with better understanding the model through the construction of lower-dimensional reduced models that are more tractable for analysis. The role played by non-entrained rogue oscillators in the dynamics of the synchronized oscillators is highlighted. After reviewing traditional analysis via mean-field theory in the thermodynamic limit of infinitely many oscillators, we proceed to construct reduced models for finite-size systems, where we investigate how the effects of rogue oscillators should be incorporated. We first describe the rogue oscillators' effect via averaging, leading to a closed deterministic system that involves the synchronized oscillators only. We perform model reduction analysis on this system via the collective coordinate framework. It is demonstrated that including the effect of rogue oscillators is crucial for obtaining an accurate description of the system. A new non-linear ansatz is introduced which significantly improves the accuracy of the reduced system, both for finite-size systems and in the thermodynamic limit. We then analyze the fluctuations of the rogue oscillators' effect around its mean by constructing stochastic process approximations. It is demonstrated that utilizing an Ornstein-Uhlenbeck process leads to a stochastic reduced model that can capture the fluctuations exhibited in the full model. This thesis also adds to the mean-field theory of Kuramoto-like models by performing mean-field analysis on the Kuramoto-Sakaguchi model with a uniform intrinsic frequency distribution, which reveals that for a non-zero phase-offset parameter, the system exhibits an intricate transition to synchronization, with a first-order transition to partial synchronization followed by a second-order transition to global synchronization.
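For reference, the Kuramoto-Sakaguchi dynamics and the degree of synchronization can be illustrated by direct simulation. The sketch below (Python; all parameter values are chosen for illustration and are not taken from the thesis) integrates the model with Euler steps and tracks the Kuramoto order parameter r, which distinguishes the entrained cluster from the rogue oscillators.

```python
import numpy as np

# Kuramoto-Sakaguchi model: dtheta_i/dt = omega_i + (K/N) * sum_j sin(theta_j - theta_i - alpha)
# Illustrative parameters (not from the thesis): N oscillators, coupling K, phase offset alpha.
rng = np.random.default_rng(0)
N, K, alpha = 500, 2.0, 0.3
omega = rng.uniform(-1.0, 1.0, N)      # uniform intrinsic frequencies, as studied in the thesis
theta = rng.uniform(0, 2 * np.pi, N)   # random initial phases

dt, steps = 0.01, 5000
for _ in range(steps):
    # Mean-field form: r * exp(i*psi) = (1/N) * sum_j exp(i*theta_j), so the
    # coupling term becomes K * r * sin(psi - theta_i - alpha), an O(N) update.
    z = np.exp(1j * theta).mean()
    r, psi = np.abs(z), np.angle(z)
    theta += dt * (omega + K * r * np.sin(psi - theta - alpha))

print(f"order parameter r = {np.abs(np.exp(1j * theta).mean()):.3f}")
# Oscillators whose |omega_i| is too large to lock remain "rogue": their phases
# drift relative to psi, while the entrained cluster rotates rigidly.
```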
Statistical analysis of grouped text documents
The topic of this thesis is statistical models for the analysis of textual data, emphasizing contexts in which text samples are grouped.
When dealing with text data, the first issue is to process it, making it computationally and methodologically compatible with the mathematical and statistical methods produced and continually developed by the scientific community. Therefore, the thesis first reviews existing methods for analytically representing and processing textual datasets, including Vector Space Models, distributed representations of words and documents, and contextualized embeddings. This review standardizes a notation that, even within the same representation approach, is highly heterogeneous in the literature.
Two domains of application are then explored: social media and cultural tourism. Regarding the former, a study is proposed on self-presentation among diverse groups of individuals on the StockTwits platform, where finance and stock markets are the dominant topics. The proposed methodology integrated various types of data, both textual and categorical. This study revealed insights into how people present themselves online and found recurring behavioural patterns within groups of users.
Regarding the latter, the thesis delves into a study conducted as part of the "Data Science for Brescia - Arts and Cultural Places" project, in which a language model was trained to classify Italian-language online reviews into four distinct semantic areas related to cultural attractions in the Italian city of Brescia. The proposed model makes it possible to identify attractions in text documents even when they are not explicitly mentioned in the document metadata, opening up the possibility of expanding the database of these cultural attractions with new sources, such as social media platforms, forums, and other online spaces.
Lastly, the thesis presents a methodological study examining the group-specificity of words, analyzing various group-specificity estimators proposed in the literature. The study considered grouped text documents with both an outcome variable and a group variable. Its contribution is the proposal to model a corpus of documents as a multivariate distribution, enabling the simulation of text corpora with predefined characteristics. The simulation provided valuable insights into the relationship between groups of documents and words. Furthermore, all of the results can be freely explored through a web application, whose components are also described in this manuscript.
In conclusion, this thesis has been conceived as a collection of papers. It aimed to contribute to the field with both applications and methodological proposals, and each study presented here suggests paths for future research to address the challenges in the analysis of grouped textual data.
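As context for the vector-space representations the thesis reviews, the minimal sketch below (Python with scikit-learn; the toy corpus and group labels are invented for illustration) builds a TF-IDF document-term matrix for grouped documents, the kind of representation on which group-specificity estimators operate.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy grouped corpus (invented for illustration): each document has a group label.
docs = [
    "the castle and the museum are beautiful",
    "great exhibition at the museum today",
    "the stock rallied after strong earnings",
    "bearish on this ticker, selling my shares",
]
groups = ["tourism", "tourism", "finance", "finance"]

# Vector Space Model: documents become rows of a sparse TF-IDF matrix.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)          # shape: (n_docs, n_terms)
vocab = vectorizer.get_feature_names_out()

# A crude group-specificity signal: mean TF-IDF weight of each term per group.
for g in sorted(set(groups)):
    rows = [i for i, lab in enumerate(groups) if lab == g]
    mean_w = np.asarray(X[rows].mean(axis=0)).ravel()
    top = vocab[mean_w.argsort()[::-1][:3]]
    print(g, "->", list(top))
```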
LIPIcs, Volume 251, ITCS 2023, Complete Volume
Smart Gas Sensors: Materials, Technologies, Practical Applications, and Use of Machine Learning – A Review
The electronic nose, popularly known as the E-nose, combines gas sensor arrays (GSAs) with machine learning and has gained a strong foothold in gas sensing technology. The E-nose, designed to mimic the human olfactory system, is used for the detection and identification of various volatile compounds. The GSAs develop a unique signal fingerprint for each volatile compound, enabling pattern recognition using machine learning algorithms. The inexpensive, portable and non-invasive characteristics of the E-nose system have rendered it indispensable within the gas-sensing arena. As a result, E-noses have been widely employed in several applications in the areas of the food industry, health management, disease diagnosis, water and air quality control, and toxic gas leakage detection. This paper reviews the various sensor fabrication technologies of GSAs and highlights the main operational framework of the E-nose system. The paper details vital signal pre-processing techniques of feature extraction and feature selection, in addition to machine learning algorithms such as SVM, kNN, ANN, and Random Forests, for determining the type of gas and estimating its concentration in a competitive environment. The paper further explores the potential applications of E-noses for diagnosing diseases, monitoring air quality, assessing the quality of food samples and estimating concentrations of volatile organic compounds (VOCs) in air and in food samples. The review concludes with some challenges faced by E-noses, alternative ways to tackle them, and recommendations for potential future work on the further development and design enhancement of E-noses.
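To make the GSA pattern-recognition pipeline concrete, here is a minimal sketch in Python with scikit-learn; the sensor readings and gas labels are synthetic stand-ins, since the review does not fix a dataset, and SVM is just one of the classifier families it surveys.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic GSA data (illustration only): 300 samples, 8 sensors, 3 gases.
rng = np.random.default_rng(1)
n, n_sensors = 300, 8
gas = rng.integers(0, 3, n)                        # true gas identity
fingerprints = rng.normal(0, 1, (3, n_sensors))    # per-gas response pattern
X = fingerprints[gas] + rng.normal(0, 0.3, (n, n_sensors))  # noisy array readings

X_tr, X_te, y_tr, y_te = train_test_split(X, gas, test_size=0.25, random_state=0)

# Scaling + SVM: a common pattern-recognition back-end for E-nose systems.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```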
Universality of Poisson-Dirichlet law for log-correlated Gaussian fields via level set statistics
Many low temperature disordered systems are expected to exhibit Poisson-Dirichlet (PD) statistics. In this paper, we focus on the case when the underlying disorder is a logarithmically correlated Gaussian process $\phi_N$ on a box of side length $N$ in $\mathbb{Z}^d$. Canonical examples include branching random walk and $\ast$-scale invariant fields, with the central example being the two dimensional Gaussian free field (GFF), a universal scaling limit of a wide range of statistical mechanics models. The corresponding Gibbs measure, obtained by exponentiating $\beta$ (inverse temperature) times $\phi_N$, is a discrete version of the Gaussian multiplicative chaos (GMC) famously constructed by Kahane. In the low temperature or supercritical regime $\beta > \beta_c$, the GMC is expected to exhibit atomic behavior under suitable renormalization, dictated by the extremal statistics of $\phi_N$. Moreover, it is predicted, going back to a conjecture made in 2001 by Carpentier and Le Doussal, that the weights of this atomic GMC have a PD distribution. In a series of works, Biskup and Louidor carried out a comprehensive study of the near maxima of the 2D GFF, and established the conjectured PD behavior throughout the supercritical regime. In another direction, Ding, Roy and Zeitouni established universal behavior of the maximum for a general class of log-correlated Gaussian fields.
In this paper we continue this program simply under the assumption of log-correlation and nothing further. We prove that the GMC concentrates on an $O(1)$ neighborhood of the local extrema and that the PD prediction holds, in any dimension $d$, throughout the supercritical regime, significantly generalizing past results. Unlike for the 2D GFF, in the absence of any Markovian structure for general Gaussian fields, we develop and use as our key input a sharp estimate of the size of level sets, which could have other applications.
Comment: 78 pages; new title, explanations and references added
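For orientation, the objects in this abstract can be written out in notation standard in this literature. This is my reconstruction, since the formulas were stripped from the text; the paper's own normalizations may differ.

```latex
% Discrete GMC / Gibbs measure built from a log-correlated field \phi_N on V_N \subset \mathbb{Z}^d:
%   log-correlation:  \mathbb{E}[\phi_N(x)\,\phi_N(y)] = \log\frac{N}{|x-y|\vee 1} + O(1)
%   Gibbs weights:    one atom per site, normalized over the box V_N
\[
  \mu_{\beta,N}(x) \;=\; \frac{e^{\beta \phi_N(x)}}{\sum_{y\in V_N} e^{\beta \phi_N(y)}},
  \qquad \beta > \beta_c = \sqrt{2d} \quad \text{(supercritical / low temperature regime)}.
\]
% The Carpentier-Le Doussal prediction is that the suitably renormalized atomic
% weights of \mu_{\beta,N} converge to a Poisson-Dirichlet law PD(\beta_c/\beta).
```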
Reducing Computational and Statistical Complexity in Machine Learning Through Cardinality Sparsity
High-dimensional data has become ubiquitous across the sciences but causes
computational and statistical challenges. A common approach for dealing with
these challenges is sparsity. In this paper, we introduce a new concept of
sparsity, called cardinality sparsity. Broadly speaking, we call a tensor
sparse if it contains only a small number of unique values. We show that
cardinality sparsity can improve deep learning and tensor regression both
statistically and computationally. On the way, we generalize recent statistical
theories in those fields.
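As a toy illustration of the computational side of this idea (my own sketch, not code from the paper): if a weight matrix takes only k distinct values, it can be stored as a small codebook plus an index matrix, and a matrix-vector product needs only k multiplications per row, because entries sharing a value can have their x-coordinates summed first.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, k = 64, 128, 4                       # k unique values: "cardinality sparsity"
codebook = rng.normal(0, 1, k)             # the few distinct weight values
idx = rng.integers(0, k, (n, m))           # which value each entry takes
W = codebook[idx]                          # dense weight matrix (for reference)

x = rng.normal(0, 1, m)

# Grouped product: for each row, sum the x-entries sharing each codebook value,
# then take one multiplication per unique value instead of one per entry.
partial = np.zeros((n, k))
for v in range(k):
    partial[:, v] = np.where(idx == v, x, 0.0).sum(axis=1)
y_grouped = partial @ codebook

assert np.allclose(W @ x, y_grouped)       # same result as the dense product
print("max abs error:", np.abs(W @ x - y_grouped).max())
```

The memory saving is immediate (the index matrix needs only log2(k) bits per entry); the statistical benefits the abstract claims are theory results, not something this toy demonstrates.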
Data-assisted modeling of complex chemical and biological systems
Complex systems are abundant in chemistry and biology; they can be multiscale, possibly high-dimensional or stochastic, with nonlinear dynamics and interacting components. It is often nontrivial (and sometimes impossible) to determine and study the macroscopic quantities of interest and the equations they obey. One can only (judiciously or randomly) probe the system, gather observations and study trends. In this thesis, Machine Learning is used as a complement to traditional modeling and numerical methods to enable data-assisted (or data-driven) dynamical systems. As case studies, three complex systems are sourced from diverse fields: The first one is a high-dimensional computational neuroscience model of the Suprachiasmatic Nucleus of the human brain, where bifurcation analysis is performed by simply probing the system. Then, manifold learning is employed to discover a latent space of neuronal heterogeneity. Second, Machine Learning surrogate models are used to optimize dynamically operated catalytic reactors. An algorithmic pipeline is presented through which it is possible to program catalysts with active learning. Third, Machine Learning is employed to extract laws of Partial Differential Equations describing bacterial chemotaxis. It is demonstrated how Machine Learning manages to capture the rules of bacterial motility at the macroscopic level, starting from diverse data sources (including real-world experimental data). More importantly, a framework is constructed through which already existing, partial knowledge of the system can be exploited. These applications showcase how Machine Learning can be used synergistically with traditional simulations in different scenarios: (i) Equations are available but the overall system is so high-dimensional that efficiency and explainability suffer, (ii) Equations are available but lead to highly nonlinear black-box responses, (iii) Only data are available (of varying source and quality) and equations need to be discovered. For such data-assisted dynamical systems, we can perform fundamental tasks, such as integration, steady-state location, continuation and optimization. This work aims to unify traditional scientific computing and Machine Learning, in an efficient, data-economical, generalizable way, where both the physical system and the algorithm matter.
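One recurring tool here is manifold learning for discovering a low-dimensional latent space from high-dimensional simulation output. The sketch below (Python/scikit-learn, with synthetic trajectories standing in for the neuroscience model's output) shows the generic pattern using spectral embedding, one common choice alongside diffusion maps; all data and parameters are invented for illustration.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import SpectralEmbedding

# Synthetic stand-in for simulation output: each "neuron" is described by a
# high-dimensional feature vector that secretly depends on one latent parameter.
rng = np.random.default_rng(3)
latent = rng.uniform(0, 1, 200)                 # hidden heterogeneity parameter
t = np.linspace(0, 2 * np.pi, 50)
X = np.sin(np.outer(latent * 3 + 1, t))         # (200 neurons) x (50 features)
X += rng.normal(0, 0.05, X.shape)               # observation noise

# Manifold learning: recover a 1D latent coordinate from the 50-D observations.
emb = SpectralEmbedding(n_components=1, n_neighbors=10)
coord = emb.fit_transform(X).ravel()

# The recovered coordinate should be monotonically related to the true latent
# parameter (up to sign); check with a rank correlation.
rho, _ = spearmanr(coord, latent)
print("Spearman correlation:", abs(rho))
```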
The Application of Data Analytics Technologies for the Predictive Maintenance of Industrial Facilities in Internet of Things (IoT) Environments
In industrial production environments, the maintenance of equipment has a decisive influence on costs and on the plannability of production capacities. In particular, unplanned failures during production times cause high costs, unplanned downtimes and possibly additional collateral damage. Predictive Maintenance addresses this by trying to predict a possible failure and its cause early enough that prevention can be prepared and carried out in time. In order to predict malfunctions and failures, the industrial plant with its characteristics, as well as its wear and ageing processes, must be modelled. Such modelling can be done by replicating the plant's physical properties. However, this is very complex and requires enormous expert knowledge about the plant and about the wear and ageing processes of each individual component. Neural networks and machine learning make it possible to train such models using data and offer an alternative, especially when very complex and non-linear behaviour is evident.
In order for models to make predictions, as much data as possible is needed about the condition of a plant, its environment, and production planning. In Industrial Internet of Things (IIoT) environments, the amount of available data is constantly increasing. Intelligent sensors and highly interconnected production facilities produce a steady stream of data. The sheer volume of data, but also the steady stream in which it is transmitted, places high demands on data processing systems. If a participating system wants to perform live analyses on the incoming data streams, it must be able to process the incoming data at least as fast as the continuous data stream delivers it. If this is not the case, the system falls further and further behind in processing and thus in its analyses. This also applies to Predictive Maintenance systems, especially if they use complex and computationally intensive machine learning models. If sufficiently scalable hardware resources are available, this may not be a problem at first. However, if this is not the case, or if processing takes place on decentralised units with limited hardware resources (e.g. edge devices), the runtime behaviour and resource requirements of the type of neural network used can become an important criterion.
This thesis addresses Predictive Maintenance systems in IIoT environments using neural networks and Deep Learning, where the runtime behaviour and the resource requirements are relevant. The question is whether it is possible to achieve better runtimes with similar result quality using a new type of neural network. The focus is on reducing the complexity of the network and improving its parallelisability. Inspired by projects in which complexity was distributed to less complex neural subnetworks by upstream measures, two hypotheses emerged, which are presented in this thesis: a) the distribution of complexity into simpler subnetworks leads to faster processing overall, despite the overhead this creates, and b) if a neural cell has a deeper internal structure, this leads to a less complex network. Within the framework of a qualitative study, an overall impression of Predictive Maintenance applications in IIoT environments using neural networks was developed. Based on the findings, a novel model layout was developed, named the Sliced Long Short-Term Memory neural network (SlicedLSTM). The SlicedLSTM implements the assumptions made in the aforementioned hypotheses in its inner model architecture.
Within the framework of a quantitative study, the runtime behaviour of the SlicedLSTM was compared with that of a reference model in the form of laboratory tests. The study uses synthetically generated data from a NASA project to predict failures of modules of aircraft gas turbines. The dataset contains 1,414 multivariate time series with 104,897 samples of test data and 160,360 samples of training data.
As a result, it could be shown for the specific application and the data used that the SlicedLSTM delivers faster processing times with similar result accuracy and thus clearly outperforms the reference model in this respect. The hypotheses about the influence of complexity in the internal structure of the neural cells were confirmed by the study carried out in the context of this thesis.
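The SlicedLSTM's internal layout cannot be reproduced from this summary, but the kind of reference setup it is benchmarked against, an LSTM predicting failure from multivariate time series, can be sketched. Below is a minimal PyTorch baseline on synthetic data shaped like the NASA turbofan degradation setting; all names, dimensions and targets are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Synthetic stand-in for the NASA turbofan data: batches of multivariate time
# series (sensors over time) with a remaining-useful-life (RUL) target.
torch.manual_seed(0)
batch, seq_len, n_sensors = 32, 100, 24
x = torch.randn(batch, seq_len, n_sensors)
rul = torch.rand(batch, 1) * 300            # illustrative RUL values in cycles

class BaselineLSTM(nn.Module):
    """Plain LSTM regressor: the kind of reference model a SlicedLSTM-style
    decomposition into simpler subnetworks would be benchmarked against."""
    def __init__(self, n_in, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_in, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)               # (batch, seq_len, hidden)
        return self.head(out[:, -1, :])     # predict RUL from the last time step

model = BaselineLSTM(n_sensors)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(5):                       # a few steps just to show the loop
    opt.zero_grad()
    loss = loss_fn(model(x), rul)
    loss.backward()
    opt.step()
    print(f"step {step}: mse = {loss.item():.1f}")
```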
Streaming Euclidean Max-Cut: Dimension vs Data Reduction
Max-Cut is a fundamental problem that has been studied extensively in various settings. We design an algorithm for Euclidean Max-Cut, where the input is a set of points in $\mathbb{R}^d$, in the model of dynamic geometric streams, where the input $X \subseteq [\Delta]^d$ is presented as a sequence of point insertions and deletions. Previously, Frahling and Sohler [STOC 2005] designed a $(1+\epsilon)$-approximation algorithm for the low-dimensional regime, i.e., it uses space $\exp(d)$.
To tackle this problem in the high-dimensional regime, which is of growing interest, one must improve the dependence on the dimension $d$, ideally to space complexity $\mathrm{poly}(\epsilon^{-1} d \log\Delta)$. Lammersen, Sidiropoulos, and Sohler [WADS 2009] proved that Euclidean Max-Cut admits dimension reduction with target dimension $\mathrm{poly}(\epsilon^{-1})$. Combining this with the aforementioned algorithm that uses space exponential in the dimension, they obtain an algorithm whose overall space complexity is indeed polynomial in $d$, but unfortunately exponential in $\epsilon^{-1}$.
We devise an alternative approach of \emph{data reduction}, based on importance sampling, and achieve space bound $\mathrm{poly}(\epsilon^{-1} d \log\Delta)$, which is exponentially better (in $\epsilon$) than the dimension-reduction approach. To implement this scheme in the streaming model, we employ a randomly-shifted quadtree to construct a tree embedding. While this is a well-known method, a key feature of our algorithm is that the embedding's distortion affects only the space complexity, and the approximation ratio remains $1+\epsilon$.
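To illustrate the one concrete primitive named here, a randomly shifted quadtree used as a tree embedding, below is a small Python sketch (my own construction for intuition; the paper's actual algorithm layers importance sampling on top of this). Points in $[\Delta]^d$ receive cell ids at every scale after a single random shift, and the tree distance of two points is governed by the coarsest level at which their cells differ.

```python
import numpy as np

rng = np.random.default_rng(4)
Delta, d = 256, 2                     # grid size and dimension (illustrative)
levels = int(np.log2(Delta))          # quadtree depth
shift = rng.integers(0, Delta, d)     # one random shift shared by the whole tree

def cell_ids(p):
    """Cell id of point p at every level of the randomly shifted quadtree.
    Level l cells have side length 2**l; the shift randomizes cut positions."""
    q = np.asarray(p) + shift
    return [tuple(q // (2 ** l)) for l in range(1, levels + 1)]

def tree_distance(p1, p2):
    """Path length in the tree, where a level-l edge is weighted by the cell
    diameter at that level; the coarsest differing level dominates the sum."""
    dist = 0.0
    for l, (c1, c2) in enumerate(zip(cell_ids(p1), cell_ids(p2)), start=1):
        if c1 == c2:
            break
        dist += 2 * (2 ** l) * np.sqrt(d)   # diameter of a level-l cell
    return dist

p1, p2 = rng.integers(0, Delta, d), rng.integers(0, Delta, d)
print("Euclidean distance:", np.linalg.norm(p1 - p2))
print("tree embedding distance:", tree_distance(p1, p2))
# In expectation over the random shift, the tree distance overestimates the
# Euclidean distance by a factor polynomial in d and log(Delta); the point of
# the abstract is that this distortion enters only the space bound, not the
# approximation ratio.
```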