649 research outputs found

    HiPart: Hierarchical Divisive Clustering Toolbox

    This paper presents the HiPart package, an open-source native Python library that provides efficient and interpretable implementations of divisive hierarchical clustering algorithms. HiPart supports interactive visualizations for manipulating the execution steps, allowing direct intervention in the clustering outcome. The package is well suited to Big Data applications, as the focus has been on the computational efficiency of the implemented clustering methodologies. The dependencies used are either Python built-in packages or well-maintained, stable external packages. The software is provided under the MIT license. The package's source code and documentation can be found at https://github.com/panagiotisanagnostou/HiPart

    A soft hierarchical algorithm for the clustering of multiple bioactive chemical compounds

    Most of the methods used to cluster chemical structures, such as Ward's, Group Average, K-means and Jarvis-Patrick, are known as hard or crisp because they partition a dataset into strictly disjoint subsets; they are therefore not suitable for clustering chemical structures that exhibit more than one activity. Although fuzzy clustering algorithms such as fuzzy c-means provide an inherent mechanism for clustering overlapping structures (objects), this potential, which comes from their fuzzy membership functions, has not been utilized effectively. In this work a fuzzy hierarchical algorithm is developed that benefits not only from the fuzzy clustering process but also from the multiple membership function of fuzzy clustering. The algorithm divides every cluster whose size is larger than a pre-determined threshold into two sub-clusters based on the membership values of each structure. A structure is assigned to one cluster if its membership value for that cluster is very high, or to both clusters if its two membership values are very similar. The performance of the algorithm is evaluated on two benchmark datasets and a large dataset of compound structures derived from the MDL MDDR database. The results show significant improvement in comparison to a similar implementation of the hard c-means algorithm.
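    The splitting rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each split is made by fuzzy c-means with two centres, that "very similar" means the two membership values differ by at most a tolerance `tol`, and that the size threshold is `max_size`. The function names, the parameters and the toy data are all hypothetical.

```python
from math import dist


def fuzzy_two_means(points, m=2.0, iters=100):
    """Fuzzy c-means with two centres; returns one (u0, u1) pair per point."""
    # Assumed deterministic start: the first point and the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: dist(p, c0))
    centers = [c0, c1]
    memberships = []
    for _ in range(iters):
        memberships = []
        for p in points:
            # Clamp distances so a point sitting on a centre does not divide by zero.
            d = [max(dist(p, c), 1e-12) for c in centers]
            w = [x ** (-2.0 / (m - 1.0)) for x in d]
            total = sum(w)
            memberships.append((w[0] / total, w[1] / total))
        new_centers = []
        for k in range(2):
            weights = [u[k] ** m for u in memberships]
            wsum = sum(weights)
            new_centers.append(tuple(
                sum(wt * p[j] for wt, p in zip(weights, points)) / wsum
                for j in range(len(points[0]))
            ))
        centers = new_centers
    return memberships


def soft_divisive(points, max_size=5, tol=0.2, m=2.0):
    """Recursively bisect clusters larger than max_size, allowing overlap."""
    if len(points) <= max_size:
        return [points]
    u = fuzzy_two_means(points, m)
    # A point joins both children when its two memberships are nearly equal.
    left = [p for p, (a, b) in zip(points, u) if a >= b or abs(a - b) <= tol]
    right = [p for p, (a, b) in zip(points, u) if b > a or abs(a - b) <= tol]
    # Stop if the split did not shrink the cluster (guards against endless recursion).
    if len(left) == len(points) or len(right) == len(points):
        return [points]
    return (soft_divisive(left, max_size, tol, m)
            + soft_divisive(right, max_size, tol, m))


# Toy data: two tight groups and one point midway between them.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (5.5, 5.5)]
clusters = soft_divisive(points, max_size=5, tol=0.2)
```

    In this toy run the midway point has almost equal membership in both sub-clusters, so it is kept in both, while every other point stays in a single sub-cluster. A hard bisecting scheme would have to place the midway point in exactly one of them.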

    Automating Large-Scale Simulation Calibration to Real-World Sensor Data

    Many key decisions and design policies are made using sophisticated computer simulations. However, these sophisticated computer simulations have several major problems. The two main issues are 1) gaps between the simulation model and the actual structure, and 2) limitations of the modeling engine's capabilities. This dissertation's goal is to address these simulation deficiencies by presenting a general automated process for tuning simulation inputs such that simulation output matches real world measured data. The automated process involves the following key components: 1) Identify a model that accurately estimates the real world simulation calibration target from measured sensor data; 2) Identify the key real world measurements that best estimate the simulation calibration target; 3) Construct a mapping from the most useful real world measurements to actual simulation outputs; 4) Build fast and effective simulation approximation models that predict simulation output using simulation input; 5) Build a relational model that captures inter-variable dependencies between simulation inputs and outputs; and finally 6) Use the relational model to estimate the simulation input variables from the mapped sensor data, and use either the simulation model or approximate simulation model to fine tune input simulation parameter estimates towards the calibration system. The work in this dissertation individually validates and completes five out of the six calibration components with respect to the residential energy domain. Step 1 is satisfied by identifying the best model for predicting next hour residential electrical consumption, the calibration target. Step 2 is completed by identifying the most important sensors for predicting residential electrical consumption, the real world measurements. While step 3 is completed by domain experts, step 4 is addressed by using techniques from the Big Data machine learning domain to build approximations for the EnergyPlus (E+) simulator. Step 5's solution leverages the same Big Data machine learning techniques to build a relational model that describes how the simulator's variables are probabilistically related. Finally, step 6 is partially demonstrated by using the relational model to estimate simulation parameters for E+ simulations with known ground truth simulation inputs.

    Methods for fast and reliable clustering


    Technical and Fundamental Features Analysis for Stock Market Prediction with Data Mining Methods

    Predicting stock prices is an essential objective in the financial world. Forecasting stock returns and their risk represents one of the most critical concerns of market decision makers. This thesis investigates stock price forecasting with three approaches from data mining and shows how different elements of the stock price can help to enhance the accuracy of the prediction. The first and second approaches capture many fundamental indicators from the stocks and use them as explanatory variables for stock price classification and forecasting. In the third approach, technical features from the candlestick representation of the share prices are extracted and used to enhance the accuracy of the forecasting. In each approach, different tools and techniques from data mining and machine learning are employed to justify why the forecasting is working. Furthermore, since the idea is to evaluate the potential of features in stock trend forecasting, we diversify our experiments using both technical and fundamental features. In the first approach, a three-stage methodology is developed. In the first stage, all features that may affect the risk and return of stocks are comprehensively investigated and identified. In the next stage, risk and return are predicted by applying data mining techniques to the given features. Finally, we develop a hybrid algorithm, based on some filters and function-based clustering, and re-predict the risk and return of stocks. In the second approach, instead of using single classifiers, a fusion model is proposed based on multiple diverse base classifiers that operate on a common input and a meta-classifier that learns from the base classifiers' outputs to obtain more precise stock return and risk predictions. A set of diversity methods, including Bagging, Boosting, and AdaBoost, is applied to create diversity in classifier combinations. Moreover, the number of base classifiers and the procedure for selecting them for fusion schemes are determined using a methodology based on dataset clustering and candidate classifiers' accuracy. Finally, in the third approach, a novel forecasting model for stock markets is presented, based on the wrapper ANFIS (Adaptive Neural Fuzzy Inference System) - ICA (Imperialist Competitive Algorithm) and technical analysis of Japanese Candlestick. Two approaches, Raw-based and Signal-based, are devised to extract the model's input variables, and buy and sell signals are considered as output variables. To illustrate the methodologies, Tehran Stock Exchange (TSE) data for the period from 2002 to 2012 are applied for the first and second approaches, while for the third approach, we used General Motors and Dow Jones indexes.
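    The fusion idea in the second approach, a meta-classifier that learns from the outputs of several base classifiers, can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's method: the base classifiers are one-feature threshold rules, the meta-classifier is a lookup table over base-output patterns, and both levels are trained on the same toy data (a practical stacking setup would normally train the meta-level on held-out predictions). All names and data are hypothetical.

```python
from collections import Counter, defaultdict


def fit_stump(X, y, feature):
    """Base classifier: best rule of the form row[feature] > t -> class 1."""
    best_t, best_acc = None, -1.0
    for t in sorted({row[feature] for row in X}):
        acc = sum(int(row[feature] > t) == label for row, label in zip(X, y)) / len(y)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return lambda row, t=best_t: int(row[feature] > t)


def fit_stacking(X, y):
    """Meta-classifier: majority label for each pattern of base-classifier outputs."""
    bases = [fit_stump(X, y, f) for f in range(len(X[0]))]
    votes = defaultdict(Counter)
    for row, label in zip(X, y):
        votes[tuple(b(row) for b in bases)][label] += 1
    table = {pattern: c.most_common(1)[0][0] for pattern, c in votes.items()}
    default = Counter(y).most_common(1)[0][0]  # fallback for unseen patterns

    def predict(row):
        return table.get(tuple(b(row) for b in bases), default)

    return predict


# Toy data: the label is 1 only when both features are high,
# which no single threshold rule can express on its own.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
predict = fit_stacking(X, y)
predictions = [predict(row) for row in X]
```

    Each threshold rule alone misclassifies one of the four toy rows, but the meta-level recovers the joint pattern from the two base outputs.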

    Rails Quality Data Modelling via Machine Learning-Based Paradigms


    Distribution of picophytoplankton communities from brackish to hypersaline waters in a South Australian coastal lagoon

    Background: Picophytoplankton (i.e. cyanobacteria and pico-eukaryotes) are abundant and ecologically critical components of the autotrophic communities in the pelagic realm. These micro-organisms have colonized a variety of extreme environments, including high-salinity waters. However, the distribution of these organisms along strong salinity gradients has barely been investigated. The abundance and community structure of cyanobacteria and pico-eukaryotes were investigated along a natural continuous salinity gradient (1.8% to 15.5%) using flow cytometry. Results: The highest picophytoplankton abundances were recorded under salinity conditions ranging between 8.0% and 11.0% (1.3 × 10^6 to 1.4 × 10^6 cells ml^-1). Two populations of picocyanobacteria (likely Synechococcus and Prochlorococcus) and 5 distinct populations of pico-eukaryotes were identified along the salinity gradient. The picophytoplankton cytometric richness decreased with salinity, and the most cytometrically diversified community (4 to 7 populations) was observed in the brackish-marine part of the lagoon (i.e. salinity below 3.5%). One population of pico-eukaryotes dominated the community throughout the salinity gradient and was responsible for the bloom observed between 8.0% and 11.0%. Finally, only this halotolerant population and Prochlorococcus-like picocyanobacteria were identified in hypersaline waters (i.e. above 14.0%). Salinity was identified as the main factor structuring the distribution of picophytoplankton along the lagoon. However, nutritive conditions, viral lysis and microzooplankton grazing are also suggested as potentially important players in controlling the abundance and diversity of picophytoplankton along the lagoon. Conclusions: The complex patterns described here represent the first observation of picophytoplankton dynamics along a continuous gradient where salinity increases from 1.8% to 15.5%. This result provides new insight into the distribution of pico-autotrophic organisms along strong salinity gradients and allows for a better understanding of the overall pelagic functioning in saline systems, which is critical for the management of these precious and climatically stressed ecosystems.

    Informational Paradigm, management of uncertainty and theoretical formalisms in the clustering framework: A review

    Fifty years have gone by since the publication of the first paper on clustering based on fuzzy set theory. In 1965, L.A. Zadeh published “Fuzzy Sets” [335]. After only one year, the first effects of this seminal paper began to emerge with the pioneering paper on clustering by Bellman, Kalaba and Zadeh [33], in which they proposed a prototype clustering algorithm based on fuzzy set theory.

    Ligand-based design of dopamine reuptake inhibitors: fuzzy relational clustering and 2-D and 3-D QSAR modeling

    As the three-dimensional structure of the dopamine transporter (DAT) remains undiscovered, any attempt to model the binding of drug-like ligands to this protein must necessarily include strategies that use ligand information. For flexible ligands that bind to the DAT, the identification of the binding conformation becomes an important but challenging task. In the first part of this work, the selection of a few representative structures as putative binding conformations from a large collection of conformations of a flexible GBR 12909 analogue was demonstrated by cluster analysis. Novel structure-based features that can be easily generalized to other molecules were developed and used for clustering. Since the feature space may or may not be Euclidean, a recently developed fuzzy relational clustering algorithm capable of handling such data was used. Both superposition-dependent and superposition-independent features were used, along with region-specific clustering that focused on separate pharmacophore elements in the molecule. Separate sets of representative structures were identified for the superposition-dependent and superposition-independent analyses. In the second part of this work, several QSAR models were developed for a series of analogues of methylphenidate (MP), another potent dopamine reuptake inhibitor. In a novel method, the Electrotopological-state (E-state) indices for atoms of the scaffold common to all 80 compounds were used to develop an effective test set spanning both the structure space and the activity space. The utility of E-state indices in modeling a series of analogues with a common scaffold was demonstrated. Several models were developed using various combinations of 2-D and 3-D descriptors in the Molconn-Z and MOE descriptor sets. The models derived from CoMFA descriptors were found to be the most predictive and explanatory. Progressive scrambling of all models indicated several stable models. The best models were used to predict the activity of the test set analogues and were found to produce reasonable residuals. Substitutions in the phenyl ring of MP, especially at the 3- and 4-positions, were found to be the most important for DAT-binding. It was predicted that for better DAT-binding the substituents at these positions should be relatively bulky, electron-rich atoms or groups.