
    Online semi-supervised learning in non-stationary environments

    Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and balanced data, immediately or after some delay, to extract worthwhile knowledge from continuous and rapid data streams. However, in many real-world applications such as Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of Things sensors and real-time sources on the Internet. Manual labelling of these data streams is not practical due to the time required and the need for domain expertise. Another challenge is learning under Non-Stationary Environments (NSEs), which occur due to changes in the data distribution of the input variables and/or class labels. The problem of Extreme Verification Latency (EVL) under NSEs is referred to as an Initially Labelled Non-Stationary Environment (ILNSE). This is challenging because the learning algorithms have no direct access to the true class labels when the concept evolves. Several approaches exist that deal with NSE and EVL in isolation; however, few algorithms address both issues simultaneously. This research directly responds to the ILNSE challenge by proposing two novel algorithms: the "Predictor for Streaming Data with Scarce Labels" (PSDSL) and the Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label scarcity issues in online machine learning. The key capabilities of PSDSL include learning from a small amount of labelled data in an incremental or online manner and the ability to predict at any time. To achieve this, PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it continuously learns from incoming data and updates the model as new labelled or unlabelled data becomes available over time. Furthermore, it can predict under NSE conditions even when class labels are scarce. PSDSL is built on top of the HDWM classifier, which preserves the diversity of its base classifiers. PSDSL and HDWM can intelligently switch strategies and adapt to the conditions of the stream. PSDSL switches between the learning states of self-learning, micro-clustering and CGC, whichever is most beneficial, based on the characteristics of the data stream. HDWM makes use of "seed" learners of different types in an ensemble to maintain its diversity. An ensemble is simply a combination of predictive models grouped together to improve on the predictive performance of a single classifier. PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification on benchmark NSE datasets, as well as on Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than existing approaches on most real-time data streams, including randomised data instances. PSDSL also performed significantly better than the 'Static' baseline, i.e. a classifier that is not updated after being trained on the first examples in the data stream. When applied to MOA-generated data streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC, while SCARGC performed the same as the Static baseline. PSDSL achieved better average prediction accuracies in a shorter time than SCARGC. The HDWM algorithm is evaluated on artificial and real-world data streams against existing well-known approaches such as the heterogeneous Weighted Majority Algorithm (WMA) and the homogeneous Dynamic Weighted Majority (DWM) algorithm. The results showed that HDWM performed significantly better than WMA and DWM. Also, when recurring concept drifts were present, the predictive performance of HDWM showed an improvement over DWM. In both drift and real-world streams, significance tests and post hoc comparisons found significant differences between the algorithms: HDWM performed significantly better than DWM and WMA when applied to MOA data streams and four real-world datasets (Electric, Spam, Sensor and Forest Cover). The seeding mechanism and the dynamic inclusion of new base learners in the HDWM algorithm benefit from both forgetting and retaining models. The algorithm also selects the optimal base classifier in its ensemble independently, depending on the problem. A new approach, Envelope-Clustering, is introduced to resolve cluster-overlap conflicts during the cluster labelling process. In this process, PSDSL transforms the centroids' information of micro-clusters into micro-instances and generates new clusters called Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and successfully guide the cluster labelling process after concept drift, in the absence of true class labels. PSDSL has also been evaluated on the real-world problem of keystroke dynamics; the results show that PSDSL achieved higher prediction accuracy (85.3%) than SCARGC (81.6%), while the Static baseline (49.0%) degraded significantly due to changes in the users' typing patterns. Furthermore, the predictive accuracy of SCARGC was found to fluctuate widely (from 41.1% to 81.6%) depending on the value of the parameter 'k' (the number of clusters), while PSDSL automatically determines the best value for this parameter.
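    As background to the ensemble side of this work, the sketch below shows a heterogeneous dynamic-weighted-majority ensemble in the general spirit of DWM/HDWM: "seed" learners of different types each predict, experts that err have their weights decayed, weak experts are periodically pruned, and a fresh seed is added when the ensemble itself errs. This is an illustrative sketch only, not the thesis' implementation; the decay factor beta, the pruning threshold theta, the update period and the particular scikit-learn base learners are assumptions made for the example.

        import numpy as np
        from sklearn.naive_bayes import GaussianNB
        from sklearn.linear_model import SGDClassifier, Perceptron

        class HeterogeneousDWM:
            """Weighted-majority ensemble seeded with base learners of different types."""

            def __init__(self, classes, beta=0.5, theta=0.01, period=50):
                self.classes = np.asarray(classes)
                self.beta, self.theta, self.period = beta, theta, period
                # Heterogeneous "seed" learners preserve the diversity of the ensemble.
                self.experts = [GaussianNB(), SGDClassifier(), Perceptron()]
                self.weights = np.ones(len(self.experts))
                self._t = 0

            def _vote(self, x):
                scores = {c: 0.0 for c in self.classes}
                for w, clf in zip(self.weights, self.experts):
                    try:
                        scores[clf.predict(x)[0]] += w
                    except Exception:                     # expert not fitted yet
                        pass
                return max(scores, key=scores.get)

            def predict(self, X):
                return np.array([self._vote(row.reshape(1, -1)) for row in np.asarray(X)])

            def partial_fit(self, x, y):
                """Process one labelled example; x has shape (1, n_features), y is a label."""
                self._t += 1
                ensemble_pred = self._vote(x)
                for i, clf in enumerate(self.experts):
                    try:
                        if clf.predict(x)[0] != y:
                            self.weights[i] *= self.beta  # penalise experts that erred
                    except Exception:
                        pass
                    clf.partial_fit(x, [y], classes=self.classes)
                if self._t % self.period == 0:
                    self.weights /= self.weights.max()
                    keep = self.weights >= self.theta     # prune persistently weak experts
                    self.experts = [e for e, k in zip(self.experts, keep) if k]
                    self.weights = self.weights[keep]
                    if ensemble_pred != y:                # add a fresh seed after an ensemble error
                        self.experts.append(GaussianNB())
                        self.weights = np.append(self.weights, 1.0)
                return self

    A stream loop would call partial_fit(x_i.reshape(1, -1), y_i) on each labelled example and predict on the unlabelled ones.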

    A correlation-based fuzzy cluster validity index with secondary options detector

    The optimal number of clusters is one of the main concerns when applying cluster analysis. Several cluster validity indexes have been introduced to address this problem. However, in some situations, there is more than one option that can be chosen as the final number of clusters. This aspect has been overlooked by most of the existing works in this area. In this study, we introduce a correlation-based fuzzy cluster validity index known as the Wiroonsri-Preedasawakul (WP) index. This index is defined based on the correlation between the actual distance between a pair of data points and the distance between the adjusted centroids with respect to that pair. We evaluate and compare the performance of our index with several existing indexes, including Xie-Beni, Pakhira-Bandyopadhyay-Maulik, Tang, Wu-Li, generalized C, and Kwon2. We conduct this evaluation on four types of datasets: artificial datasets, real-world datasets, simulated datasets with ranks, and image datasets, using the fuzzy c-means algorithm. Overall, the WP index outperforms most, if not all, of these indexes in terms of accurately detecting the optimal number of clusters and providing accurate secondary options. Moreover, our index remains effective even when the fuzziness parameter m is set to a large value. Our R package WPfuzzyCVIs used in this work is available at https://github.com/nwiroonsri/WPfuzzyCVIs.
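    To make the idea of a correlation-based fuzzy validity index concrete, the sketch below scores a fuzzy partition by correlating, over random pairs of points, the actual pairwise distance with the distance between membership-weighted centroid representatives of the same pair. This is a simplified stand-in for the general principle described above, not the published WP formula; the memberships and centroids are assumed to come from any fuzzy c-means run, and the pair-sampling scheme is an arbitrary choice for the example.

        import numpy as np

        def correlation_validity_score(X, memberships, centroids, n_pairs=2000, seed=0):
            """X: (n, d) data; memberships: (n, c) fuzzy memberships; centroids: (c, d)."""
            rng = np.random.default_rng(seed)
            n = X.shape[0]
            # Represent each point by its membership-weighted combination of centroids.
            adjusted = memberships @ centroids
            i = rng.integers(0, n, size=n_pairs)
            j = rng.integers(0, n, size=n_pairs)
            keep = i != j
            d_data = np.linalg.norm(X[i[keep]] - X[j[keep]], axis=1)
            d_adjusted = np.linalg.norm(adjusted[i[keep]] - adjusted[j[keep]], axis=1)
            # Higher correlation: the fuzzy partition reproduces the data geometry better.
            return np.corrcoef(d_data, d_adjusted)[0, 1]

    In use, one would run fuzzy c-means for a range of candidate cluster numbers, score each run, and read off the best and second-best scores as the primary and secondary options for the number of clusters.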

    Fuzzy spectral clustering methods for textual data

    Nowadays, the development of advanced information technologies has driven an increase in the production of textual data. This inevitable growth accentuates the need for new methods and tools able to analyse such data efficiently. Against this background, unsupervised classification techniques can play a key role in this process, since most of this data is not classified. Document clustering, which is used for identifying a partition of clusters in a corpus of documents, has proven to perform efficiently in the analysis of textual documents and has been extensively applied in different fields, from topic modelling to information retrieval tasks. Recently, spectral clustering methods have gained success in the field of text classification. These methods have gained popularity due to their solid theoretical foundations, which do not require any specific assumption on the global structure of the data. However, even though they prove to perform well in text classification problems, little has been done in the field of clustering. Moreover, depending on the type of documents analysed, it might often be the case that textual documents do not contain only information related to a single topic: indeed, there might be an overlap of contents characterizing different knowledge domains. Consequently, documents may contain information that is relevant to different areas of interest to some degree. The first part of this work critically analyses the main clustering algorithms used for text data, covering also the mathematical representation of documents and the pre-processing phase. Then, three novel fuzzy versions of spectral clustering algorithms for text data are introduced. The first one exploits the use of fuzzy K-medoids instead of K-means. The second one derives directly from the first one but is used in combination with Kernel and Set Similarity (KS2M), which takes into account the Jaccard index. Finally, in the third one, in order to enhance the clustering performance, a new similarity measure S∗ is proposed. This last one exploits the inherent sequential nature of text data by means of a weighted combination of the Spectrum string kernel function and a measure of set similarity. The second part of the thesis focuses on spectral bi-clustering algorithms for text mining tasks, which represent an interesting and partially unexplored field of research. In particular, two novel versions of fuzzy spectral bi-clustering algorithms are introduced. The two algorithms differ from each other in the approach followed for identifying the document and word partitions: the first one follows a simultaneous approach, while the second one follows a sequential approach. This difference also leads to differences in the choice of the number of clusters. The adequacy of all the proposed fuzzy (bi-)clustering methods is evaluated by experiments performed on both real and benchmark data sets.
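    The sketch below illustrates the generic pipeline such methods build on: a TF-IDF representation of the documents, a cosine-similarity affinity graph, a spectral embedding from the normalised graph Laplacian, and a plain fuzzy c-means run in the embedded space that yields soft document-to-cluster memberships. It is a minimal illustration of the standard recipe only; the thesis' specific contributions (fuzzy K-medoids, the KS2M kernel combination, the S∗ similarity and the bi-clustering variants) are not reproduced here.

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def spectral_embedding(affinity, k):
            """Row-normalised eigenvectors of the symmetric normalised graph Laplacian."""
            d = affinity.sum(axis=1)
            d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
            laplacian = np.eye(len(affinity)) - d_inv_sqrt @ affinity @ d_inv_sqrt
            _, eigvecs = np.linalg.eigh(laplacian)
            emb = eigvecs[:, :k]                         # eigenvectors of the k smallest eigenvalues
            return emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)

        def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
            """Plain fuzzy c-means; returns the (n, c) membership matrix and the centroids."""
            rng = np.random.default_rng(seed)
            u = rng.random((len(X), c))
            u /= u.sum(axis=1, keepdims=True)
            for _ in range(iters):
                um = u ** m
                centroids = (um.T @ X) / um.sum(axis=0)[:, None]
                dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
                u = 1.0 / dist ** (2.0 / (m - 1.0))
                u /= u.sum(axis=1, keepdims=True)
            return u, centroids

        docs = ["stock markets fell sharply", "the central bank raised interest rates",
                "the new football season starts today", "the team won the championship"]
        tfidf = TfidfVectorizer().fit_transform(docs).toarray()
        affinity = cosine_similarity(tfidf)              # cosine-similarity affinity graph
        embedding = spectral_embedding(affinity, k=2)
        memberships, _ = fuzzy_cmeans(embedding, c=2)
        print(np.round(memberships, 2))                  # soft document-to-topic memberships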

    Applications

    Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production and milling for quality control during manufacturing processes; in traffic and logistics for smart cities; and for mobile communications.

    A rigorous possibility approach for the geotechnical reliability assessment supported by external database and local experience

    This is the final version, available on open access from Elsevier via the DOI in this record. Data availability: data will be made available on request. Reliability analyses based on probability theory are widely applied in geotechnical engineering, and several analytical or numerical methods have been built upon the concept of failure occurrence. Nevertheless, common real-world geotechnical engineering problems deal with scarce or sparse information, where experimental data are not always available in sufficient quantity and quality to infer a reliable probability distribution function. This paper rigorously combines Fuzzy Clustering and Possibility Theory to derive a data-driven, quantitative reliability approach, in addition to fully probability-oriented assessments, when useful but heterogeneous sources of information are available. The proposed non-probabilistic approach is mathematically consistent with the failure probability when ideal random data are considered. Additionally, it provides a robust tool to account for epistemic uncertainties when data are uncertain, scarce, and sparse. The Average Cumulative Function transformation is used to obtain possibility distributions inferred from the fuzzy clustering of an indirect database. Target Reliability Index values, consistent with the prescribed values provided by Eurocode 0, are established. Moreover, a Degree of Understanding tier system based on the practitioner's local experience is also proposed. The proposed methodology is detailed and discussed for two numerical examples using national-scale databases, highlighting the potential benefits compared to traditional probabilistic approaches. Engineering and Physical Sciences Research Council (EPSRC).
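    For context on the probabilistic quantities the possibility-based approach is benchmarked against, the sketch below estimates a failure probability by Monte Carlo simulation of a limit-state margin g = R - E and converts it into a reliability index that is compared with a target value. The distributions, their parameters and the target index used here are illustrative assumptions, not values taken from the paper or from Eurocode 0.

        import numpy as np
        from scipy.stats import norm

        rng = np.random.default_rng(42)
        n = 1_000_000
        # Assumed random variables for the example: resistance R and load effect E.
        resistance = rng.lognormal(mean=np.log(120.0), sigma=0.10, size=n)
        load_effect = rng.normal(loc=80.0, scale=12.0, size=n)

        margin = resistance - load_effect        # limit-state margin g = R - E
        pf = np.mean(margin <= 0)                # Monte Carlo failure probability P(g <= 0)
        beta = -norm.ppf(max(pf, 1.0 / n))       # reliability index (guarded against pf = 0)

        target_beta = 3.8                        # illustrative target reliability index
        print(f"Pf = {pf:.2e}, beta = {beta:.2f}, meets target: {beta >= target_beta}")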

    Analysis on the Application of Machine-Learning Algorithms for District-Heating Networks' Characterization & Management

    359 p. This doctoral thesis studies the feasibility of applying machine-learning algorithms to the energy characterization of buildings in district-heating network environments. In particular, the dissertation focuses on the analysis of the following four main applications: (i) the identification and removal of demand outliers in buildings; (ii) the recognition of the main energy-demand patterns of buildings connected to the network; (iii) an interpretability/classification study of these energy patterns, with a descriptive analysis of the demand patterns; and (iv) the prediction of energy demand at daily and hourly resolution. The interest of the thesis arises from the current energy situation in the European Union, where buildings are responsible for more than 40% of total energy consumption. Modern district networks have been identified as efficient systems for supplying energy from production plants to final consumers/buildings thanks to their economies of scale. Moreover, because buildings are grouped within the same network, these systems enable the development and implementation of algorithms for energy management across the complete system.
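    As a hedged illustration of application (ii), the sketch below groups synthetic daily heat-demand profiles into typical patterns with k-means after shape-normalising each profile. The synthetic archetypes, the number of clusters and the normalisation step are choices made for this example, not the thesis' actual pipeline or dataset.

        import numpy as np
        from sklearn.cluster import KMeans

        rng = np.random.default_rng(0)
        hours = np.arange(24)
        # Two synthetic archetypes: residential morning/evening peaks vs. flat office-hours demand.
        residential = 5 + 3 * np.exp(-(hours - 7) ** 2 / 4) + 4 * np.exp(-(hours - 19) ** 2 / 6)
        office = 3 + 5 * ((hours >= 8) & (hours <= 17))
        profiles = np.vstack([
            residential + rng.normal(0, 0.5, size=(100, 24)),
            office + rng.normal(0, 0.5, size=(100, 24)),
        ])

        # Normalise each profile so clustering captures the shape, not the magnitude.
        z = (profiles - profiles.mean(axis=1, keepdims=True)) / profiles.std(axis=1, keepdims=True)

        km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(z)
        for c in range(2):
            members = profiles[km.labels_ == c]
            print(f"pattern {c}: {len(members)} buildings, peak at hour {members.mean(axis=0).argmax()}")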

    Technology and Management Applied in Construction Engineering Projects

    This book focuses on fundamental and applied research on construction project management, presenting both research papers and practice-oriented papers. The execution of construction projects is specific and particularly difficult because each implementation is a unique, complex, and dynamic process consisting of several interrelated subprocesses in which various participants in the investment process take part. Therefore, there is still a vital need to study, research, and draw conclusions about the engineering technology and management applied in construction projects. The unified research approach presented in this book is the result of many years of studies conducted by 35 experienced authors. The common subject of the research is the development of methods and tools for modeling multi-criteria processes in construction engineering.

    O impacto da inflação no S&P 500

    The main objective of this work is to study the impact of inflation on the S&P 500 index, to identify the method that best adapts to forecasting the S&P 500 index in its economic context, and therefore to identify the correlation that exists between the variables influenced by the phenomenon. Central banks' response to inflation is important and incisive: if the central bank takes an aggressive approach to controlling inflation by significantly raising interest rates, it can create downward pressure on the stock market, including the S&P 500. Although stocks can outperform inflation over the long term, periods of high inflation can negatively affect investment in financial markets. Additionally, inflation can lead to increases in interest rates, which can make fixed-income investments more attractive relative to stocks. In this study, the variables used were the S&P 500, the Consumer Price Index, Real Gross Domestic Product and the Unemployment Rate, all with a quarterly frequency and 60 observations from January 2015 to December 2022. Univariate time series models are used to analyse and predict the behaviour of a single time series over time, and they are important here since the aim is to verify the non-linear correlation that exists between the S&P 500 variable and the others. Therefore, it is necessary to study individually the best model to apply.
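    As an illustration of the univariate time-series modelling the study refers to, the sketch below fits an ARIMA model to a stand-in quarterly series, produces an out-of-sample forecast and reports its error. The synthetic series and the (1, 1, 1) order are placeholders; the study's actual data and selected model order are not reproduced here.

        import numpy as np
        import pandas as pd
        from statsmodels.tsa.arima.model import ARIMA

        rng = np.random.default_rng(1)
        # Stand-in for the 60 quarterly observations used in the study: a drifting random walk.
        index = pd.period_range("2015Q1", periods=60, freq="Q")
        sp500 = pd.Series(2000 * np.exp(np.cumsum(rng.normal(0.02, 0.05, size=60))), index=index)

        train, test = sp500[:-8], sp500[-8:]            # hold out the last two years
        model = ARIMA(train, order=(1, 1, 1)).fit()     # the order would normally be chosen via AIC/BIC
        forecast = model.forecast(steps=len(test))

        mape = np.mean(np.abs((test.values - forecast.values) / test.values)) * 100
        print(forecast.round(1))
        print(f"out-of-sample MAPE: {mape:.1f}%")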