Fuzzy classification with distance-based depth prototypes: High-dimensional unsupervised and/or supervised problems
Supervised and unsupervised classification is crucial in many areas where diverse types of data sets are common, such as biology, medicine, or industry. A key consideration is that some units are more typical of the group they belong to than others; for this reason, fuzzy classification approaches are necessary. In this paper, a fuzzy supervised classification method based on the construction of prototypes is proposed. The method obtains the prototypes from an objective function that includes label information and a distance-based depth function. It works with any distance and can deal with data sets of a widely varying nature. It can further be applied to data sets where the use of the Euclidean distance is not suitable and to high-dimensional data (data sets in which the number of features p is larger than the number of observations n, often written as p >> n). In addition, the model can also cope with unsupervised classification, thus becoming an interesting alternative to other fuzzy clustering methods. With synthetic data sets along with high-dimensional real biomedical and industrial data sets, we demonstrate the good performance of the proposed supervised and unsupervised fuzzy procedures.
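The prototype idea can be illustrated with the standard fuzzy-membership rule that such distance-based methods build on: given a unit's distances to the group prototypes, its membership degrees are inversely related to distance. This is a minimal sketch of that rule (the fuzzy c-means membership update with fuzzifier m), not the paper's exact objective function; any distance can be plugged in.

```python
def fuzzy_memberships(dists, m=2.0):
    """Membership of one unit in each group, given its distances to the
    group prototypes, via the classic fuzzy c-means rule. dists is a
    list of non-negative distances, one per prototype; m > 1 is the
    fuzzifier (larger m gives softer memberships)."""
    # A unit that coincides with a prototype belongs crisply to it.
    if any(d == 0 for d in dists):
        return [1.0 if d == 0 else 0.0 for d in dists]
    exp = 2.0 / (m - 1.0)
    inv = [(1.0 / d) ** exp for d in dists]
    s = sum(inv)
    return [v / s for v in inv]

# A unit twice as far from prototype B as from A is more typical of A.
u = fuzzy_memberships([1.0, 2.0])  # -> [0.8, 0.2]
```

Because the rule consumes only a vector of distances, it applies unchanged to non-Euclidean dissimilarities, which is the setting the paper targets.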
Pandemic lifeworlds: A segmentation analysis of public responsiveness to official communication about Covid-19 in England
Pandemics such as Covid-19 pose tremendous public health communication challenges in promoting protective behaviours, vaccination, and educating the public about risks. Segmenting audiences based on attitudes and behaviours is a means to increase the precision and potential effectiveness of such communication. The present study reports on such an audience segmentation effort for the population of England, sponsored by the United Kingdom Health Security Agency (UKHSA) and involving a collaboration of market research and academic experts. A cross-sectional online survey was conducted between 4 and 24 January 2022 with 5525 respondents (5178 used in our analyses) in England using a market research opt-in panel. An additional 105 telephone interviews were conducted to sample persons without online or smartphone access. Respondents were quota sampled to be demographically representative. The primary analytic technique was k-means cluster analysis, supplemented with other techniques including multi-dimensional scaling and the use of respondent- as well as sample-standardized data when necessary to address differences in response set for some groups of respondents. Identified segments were profiled against demographic, behavioural self-report, attitudinal, and communication channel variables, with differences by segment tested for statistical significance. Seven segments were identified, including distinctly different groups of persons who tended toward a high level of compliance and several that were relatively low in compliance. The segments were characterized by distinctive patterns of demographics, attitudes, behaviours, trust in information sources, and preferred communication channels.
Segments were further validated by comparing the segmentation variable against a set of demographic variables as predictors of reported protective behaviours in the past two weeks and of vaccine refusal; the demographics together had about one-quarter the effect size of the single seven-level segment variable. With respect to managerial implications, different communication strategies are suggested for each segment, illustrating the advantages of rich segmentation descriptions for understanding public health communication audiences. Strengths and weaknesses of the methods used are discussed to help guide future efforts.
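The core segmentation step, k-means cluster analysis, can be sketched in a few lines. Below is a generic one-dimensional k-means on toy "attitude scores"; it illustrates the technique named in the abstract only, not the UKHSA pipeline, which also involved multi-dimensional scaling and standardisation.

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on 1-D values: repeatedly assign each
    point to its nearest centre, then move each centre to the mean of
    its assigned points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[j].append(p)
        # Keep an empty cluster's old centre rather than crashing.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# Two well-separated "attitude" groups recover two segment centres.
centers = kmeans_1d([1.0, 1.2, 0.9, 5.0, 5.1, 4.9], k=2)
```

Profiling the resulting segments against external variables, as the study does, is what turns the raw clusters into actionable audience descriptions.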
Online semi-supervised learning in non-stationary environments
Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and
balanced data, immediately or after some delay, to extract worthwhile knowledge from the
continuous and rapid data streams. However, in many real-world applications such as
Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer
Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of
Things sensors and real-time data on the Internet. Manual labelling of these data streams
is not practical due to time consumption and the need for domain expertise. Another
challenge is learning under Non-Stationary Environments (NSEs), which occurs due to
changes in the data distributions in a set of input variables and/or class labels. The problem
of Extreme Verification Latency (EVL) under NSEs is referred to as Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms
have no access to the true class labels directly when the concept evolves. Several approaches
exist that deal with NSE and EVL in isolation. However, few algorithms address both issues
simultaneously. This research directly responds to the ILNSE challenge by proposing two
novel algorithms: "Predictor for Streaming Data with Scarce Labels" (PSDSL) and the
Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label
scarcity issues in online machine learning.
The key capabilities of PSDSL include learning from a small amount of labelled data in an
incremental or online manner and being able to predict at any time. To achieve this,
PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it
continuously learns from incoming data and updates the model as new labelled or
unlabelled data becomes available over time. Furthermore, it can predict under NSE
conditions under the scarcity of class labels. PSDSL is built on top of the HDWM classifier,
which preserves the diversity of the classifiers. PSDSL and HDWM can intelligently switch
and adapt to the conditions. PSDSL switches between the learning states of self-learning,
micro-clustering and CGC, whichever approach is beneficial, based on the characteristics of
the data stream. HDWM makes use of "seed" learners of different types in an ensemble to
maintain its diversity. The ensembles are simply the combination of predictive models
grouped to improve the predictive performance of a single classifier.
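The weighted-majority idea underlying such ensembles can be sketched as a single predict-and-update step: members vote with their weights, and members that turn out to be wrong are penalised multiplicatively. This illustrates the classic Weighted Majority rule that HDWM builds on, not the thesis implementation; beta is the standard penalty factor.

```python
def wm_step(preds, weights, truth=None, beta=0.5):
    """One step of Weighted Majority voting.
    preds:   each ensemble member's prediction for the current instance
    weights: the members' current weights
    truth:   the true label, if it arrives (may be delayed or absent
             in a data stream)
    Returns the ensemble prediction and the (possibly updated) weights."""
    votes = {}
    for p, w in zip(preds, weights):
        votes[p] = votes.get(p, 0.0) + w
    y_hat = max(votes, key=votes.get)
    if truth is not None:
        # Multiplicatively down-weight members that predicted wrongly.
        weights = [w * (beta if p != truth else 1.0)
                   for p, w in zip(preds, weights)]
    return y_hat, weights

# Two members say "spam", one says "ham"; the truth is "ham", so the
# spam-voters lose half their weight for the next instance.
y, w = wm_step(["spam", "spam", "ham"], [1.0, 1.0, 1.0], truth="ham")
```

Keeping members of different types, as HDWM's seeding does, preserves diversity so that no single model family dominates after a concept drift.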
PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification
on benchmarks, NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than
existing approaches on most real-time data streams including randomised data instances.
PSDSL performed significantly better than "Static", i.e. a classifier that is not updated after it is
trained with the first examples in the data streams. When applied to MOA-generated data
streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC,
while SCARGC performed the same as the Static. PSDSL achieved better average prediction
accuracies in a short time than SCARGC.
The HDWM algorithm is evaluated on artificial and real-world data streams against existing
well-known approaches such as the Weighted Majority Algorithm (WMA) and the homogeneous
Dynamic Weighted Majority (DWM) algorithm. The results showed that HDWM performed significantly better than WMA
and DWM. Also, when recurring concept drifts were present, the predictive performance of
HDWM showed an improvement over DWM. In both drift and real-world streams,
significance tests and post hoc comparisons found significant differences between
algorithms; HDWM performed significantly better than DWM and WMA when applied to
MOA data streams and four real-world datasets: Electric, Spam, Sensor and Forest Cover. The
seeding mechanism and dynamic inclusion of new base learners in the HDWM algorithm
benefit from both forgetting and retaining models. The algorithm also provides
independence in selecting the optimal base classifier in its ensemble depending
on the problem.
A new approach, Envelope-Clustering, is introduced to resolve cluster overlap conflicts
during the cluster labelling process. In this process, PSDSL transforms the centroids'
information of micro-clusters into micro-instances and generates new clusters called
Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and
successfully guide the cluster labelling process after concept drifts in the absence of true
class labels. PSDSL has been evaluated on a real-world problem, keystroke dynamics, and
the results show that PSDSL achieved higher prediction accuracy (85.3%) than SCARGC
(81.6%), while the Static classifier (49.0%) degrades significantly due to changes in
the users' typing patterns. Furthermore, the predictive accuracies of SCARGC were found
to fluctuate widely (41.1% to 81.6%) depending on the value of the parameter "k"
(number of clusters), while PSDSL automatically determines the best value for this
parameter.
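The cluster-labelling idea can be shown in its simplest form: unlabelled micro-cluster centroids inherit the label of the nearest labelled centroid. This is only the baseline that Envelope-Clustering refines to resolve overlap conflicts; the code below is an illustrative sketch, not the thesis algorithm.

```python
def propagate_labels(labelled, unlabelled):
    """Nearest-centroid label propagation.
    labelled:   list of (centroid_tuple, label) pairs
    unlabelled: list of centroid tuples with no label yet
    Returns a dict mapping each unlabelled centroid to the label of
    its nearest labelled centroid (squared Euclidean distance)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    out = {}
    for c in unlabelled:
        nearest = min(labelled, key=lambda lc: sqdist(c, lc[0]))
        out[c] = nearest[1]
    return out

# After a drift, new micro-clusters pick up labels from the nearest
# previously labelled centroids.
labels = propagate_labels([((0.0, 0.0), "a"), ((10.0, 10.0), "b")],
                          [(1.0, 1.0), (9.0, 9.0)])
```

When clusters overlap, this nearest-neighbour rule becomes ambiguous, which is exactly the conflict the Envelope construction is designed to arbitrate.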
City Clustering Tool at iFood - Data-driven approach to design online experiments
Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics. One of the many ways innovation occurs in big tech companies is through A/B testing; in order to achieve reliable results, the design of these online experiments needs to be well thought out. Some business constraints might hinder key requirements of the design, such as the fact that some tests can't be done at the granularity of users and must be done at the granularity of cities, which might happen due to ethical and judicial constraints. In those cases, in order to make sure that the chosen sample is a good representation of the population, a scientific approach to city clustering is proposed so that the test cities together represent a bigger portion of the country, plus a best-matching-city function to choose the control cities.
With the assumption that the introduction of a city clustering tool would improve the consistency of city A/B testing design within the profitability department, the present document reports the descriptive details of the research, discovery, development and validation phases. Results show that new experiments done using said tool are more reliable than the ones done prior. Although results are positive, future steps are proposed, which include a better UI/UX in order to facilitate stakeholders' interaction with the tool.
iBUST: An intelligent behavioural trust model for securing industrial cyber-physical systems
To meet the demand of the world's growing population, smart manufacturing has accelerated the adoption of smart factories, where autonomous and cooperative instruments across all levels of production and logistics networks are integrated through a Cyber-Physical Production System (CPPS). However, these networks comprise various heterogeneous devices with varying computational power and memory capabilities. As a result, many secure communication protocols that demand considerably high computational power and memory cannot be employed verbatim on these networks, thereby leaving them more vulnerable to security threats and attacks than conventional networks. These threats can largely be tackled by employing a Trust Management Model (TMM) that exploits the behavioural patterns of nodes to identify their trust class. In this context, ML-based models are best suited due to their ability to capture hidden patterns in data, learning and improving pattern-detection accuracy over time to counteract threats of a dynamic nature, a capability absent from most conventional models. However, among the existing ML-based solutions for detecting attack patterns, many are computationally expensive, require a long training time, and need a considerably large amount of training data, which is seldom available. An aid to this is the association rule learning (ARL) paradigm, whose models are computationally inexpensive and do not require a long training time. Therefore, this paper proposes an ARL-based intelligent Behavioural Trust Model (iBUST) for securing the CPPS. For this intelligent TMM, a variant of Frequent Pattern Growth (FP-Growth), called the enhanced FP-Growth (EFP-Growth) algorithm, is developed by altering the internal data structures for faster execution and by developing a modified exponential decay function (MEDF) to automatically calculate minimum supports for adapting to trust evolution characteristics.
In addition, a new optimisation model for finding optimum parameter values in the MEDF and an algorithm for converting a 1D quantitative feature into a corresponding categorical feature are developed to facilitate the model. Afterwards, the trust class of an object is identified using the Naïve Bayes classifier. The proposed model is evaluated in a trust evolution-supported experimental environment against compared models on a benchmark dataset, where it outperforms its counterparts.
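The role of an exponential decay function for minimum support can be illustrated with a plain decay schedule: the support threshold starts high and decays towards a floor as trust evidence ages, so older behaviour needs less frequency to be mined. The functional form and all parameter names here (s_max, s_min, lam) are assumptions for illustration; the paper's MEDF is modified and its parameters are chosen by the proposed optimisation model.

```python
import math

def min_support(t, s_max=0.5, s_min=0.05, lam=0.1):
    """Illustrative exponential-decay schedule for a frequent-pattern
    miner's minimum support at time t: starts at s_max, decays at rate
    lam, and never drops below the floor s_min."""
    return s_min + (s_max - s_min) * math.exp(-lam * t)

# The threshold halves-ish over time, approaching the floor.
early, late = min_support(0), min_support(100)
```

Feeding such a time-dependent threshold to FP-Growth lets the mined rule set, and hence the trust classes, track how node behaviour evolves.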
Subgroup discovery for structured target concepts
The main object of study in this thesis is subgroup discovery, a theoretical framework for finding subgroups in data, i.e., named sub-populations whose behaviour with respect to a specified target concept is exceptional when compared to the rest of the dataset. This is a powerful tool that conveys crucial information to a human audience, but despite past advances it has been limited to simple target concepts. In this work we propose algorithms that bring this framework to novel application domains. We introduce the concept of representative subgroups, which we use not only to ensure the fairness of a sub-population with regard to a sensitive trait, such as race or gender, but also to go beyond known trends in the data. For entities with additional relational information that can be encoded as a graph, we introduce a novel measure of robust connectedness which improves on established alternative measures of density; we then provide a method that uses this measure to discover which named sub-populations are more well-connected. Our contributions within subgroup discovery culminate in the introduction of kernelised subgroup discovery: a novel framework that enables the discovery of subgroups on i.i.d. target concepts with virtually any kind of structure. Importantly, our framework additionally provides a concrete and efficient tool that works out of the box without any modification, apart from specifying the Gramian of a positive definite kernel. For use within kernelised subgroup discovery, but also in any other kind of kernel method, we additionally introduce a novel random walk graph kernel. Our kernel allows fine-tuning the alignment between the vertices of the two compared graphs during the count of the random walks, and we also propose meaningful structure-aware vertex labels to utilise this new capability.
With these contributions we thoroughly extend the applicability of subgroup discovery and ultimately re-define it as a kernel method.
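The flavour of subgroup discovery's quality measures can be conveyed by Weighted Relative Accuracy (WRAcc), a standard measure for binary target concepts: how much more frequent the target is inside the subgroup than in the whole data, weighted by subgroup size. This sketch covers only the classical binary case; the thesis generalises the target side of such measures via kernels.

```python
def wracc(n, n_pos, sg_n, sg_pos):
    """Weighted Relative Accuracy of a subgroup.
    n:      total number of instances
    n_pos:  instances with the target property overall
    sg_n:   instances covered by the subgroup description
    sg_pos: covered instances with the target property
    Positive values mean the target is over-represented in the
    subgroup; the size factor sg_n/n penalises tiny subgroups."""
    return (sg_n / n) * (sg_pos / sg_n - n_pos / n)

# A subgroup covering 20% of the data with 90% positives, against a
# 50% base rate, scores 0.2 * (0.9 - 0.5) = 0.08.
score = wracc(n=100, n_pos=50, sg_n=20, sg_pos=18)
```

A subgroup discovery search enumerates candidate descriptions and ranks them by such a measure, reporting the top-scoring named sub-populations.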
Evaluating machine learning models in non-standard settings: An overview and new findings
Estimating the generalization error (GE) of machine learning models is
fundamental, with resampling methods being the most common approach. However,
in non-standard settings, particularly those where observations are not
independently and identically distributed, resampling using simple random data
divisions may lead to biased GE estimates. This paper strives to present
well-grounded guidelines for GE estimation in various such non-standard
settings: clustered data, spatial data, unequal sampling probabilities, concept
drift, and hierarchically structured outcomes. Our overview combines
well-established methodologies with other existing methods that, to our
knowledge, have not been frequently considered in these particular settings. A
unifying principle among these techniques is that the test data used in each
iteration of the resampling procedure should reflect the new observations to
which the model will be applied, while the training data should be
representative of the entire data set used to obtain the final model. Beyond
providing an overview, we address literature gaps by conducting simulation
studies. These studies assess the necessity of using GE-estimation methods
tailored to the respective setting. Our findings corroborate the concern that
standard resampling methods often yield biased GE estimates in non-standard
settings, underscoring the importance of tailored GE estimation.
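The unifying principle, that test folds should mimic the new observations the model will face, translates for clustered data into splitting by whole groups rather than by individual observations. Below is a minimal group-wise k-fold split in that spirit; production code would use a vetted implementation such as scikit-learn's GroupKFold.

```python
from collections import defaultdict

def group_kfold(groups, k):
    """Assign observation indices to k test folds so that all
    observations sharing a group label land in the same fold, which
    prevents the within-cluster leakage that biases GE estimates.
    groups: one group label per observation."""
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    folds = [[] for _ in range(k)]
    # Greedy balancing: largest remaining group into the smallest fold.
    for members in sorted(by_group.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds

# Three clusters of two observations each -> three single-cluster folds.
folds = group_kfold(["a", "a", "b", "b", "c", "c"], k=3)
```

With a simple random split instead, observations from the same cluster would appear in both training and test data, making the estimated error optimistic for genuinely new clusters.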
Synthetic Aperture Radar (SAR) Meets Deep Learning
This reprint focuses on applications that combine synthetic aperture radar and deep learning technology. It aims to further promote the development of SAR image intelligent interpretation technology. A synthetic aperture radar (SAR) is an important active microwave imaging sensor whose all-day and all-weather working capacity gives it an important place in the remote sensing community. Since the United States launched the first SAR satellite, SAR has received much attention in the remote sensing community, e.g., in geological exploration, topographic mapping, disaster forecasting, and traffic monitoring. It is valuable and meaningful, therefore, to study SAR-based remote sensing applications. In recent years, deep learning, represented by convolutional neural networks, has driven significant progress in the computer vision community, e.g., in face recognition, driverless vehicles, and the Internet of Things (IoT). Deep learning enables computational models with multiple processing layers to learn data representations with multiple levels of abstraction, which can greatly improve the performance of various applications. This reprint provides a platform for researchers to handle the above significant challenges and present their innovative and cutting-edge research results when applying deep learning to SAR in various manuscript types, e.g., articles, letters, reviews and technical reports.
- …