486 research outputs found

    The multiple pheromone Ant clustering algorithm

    Ant Colony Optimisation algorithms mimic the way ants use pheromones for marking paths to important locations. Pheromone traces are followed and reinforced by other ants, but also evaporate over time. As a consequence, optimal paths attract more pheromone, whilst the less useful paths fade away. In the Multiple Pheromone Ant Clustering Algorithm (MPACA), ants detect features of objects represented as nodes within graph space. Each node has one or more ants assigned to each feature. Ants attempt to locate nodes with matching feature values, depositing pheromone traces on the way. This use of multiple pheromone values is a key innovation. Ants record other ant encounters, keeping a record of the features and colony membership of ants. The recorded values determine when ants should combine their features to look for conjunctions and whether they should merge into colonies. This ability to detect and deposit pheromone representative of feature combinations, and the resulting colony formation, renders the algorithm a powerful clustering tool. The MPACA operates as follows: (i) initially each node has ants assigned to each feature; (ii) ants roam the graph space searching for nodes with matching features; (iii) when departing matching nodes, ants deposit pheromones to inform other ants that the path goes to a node with the associated feature values; (iv) ant feature encounters are counted each time an ant arrives at a node; (v) if the feature encounters exceed a threshold value, feature combination occurs; (vi) a similar mechanism is used for colony merging. The model varies from traditional ACO in that: (i) a modified pheromone-driven movement mechanism is used; (ii) ants learn feature combinations and deposit multiple pheromone scents accordingly; (iii) ants merge into colonies, the basis of cluster formation. The MPACA is evaluated over synthetic and real-world datasets and its performance compares favourably with alternative approaches.
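
The pheromone bookkeeping described in this abstract (deposit on departure, evaporation over time, one scent per feature, threshold-triggered feature combination) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `PheromoneMap`, the function `should_combine`, the evaporation rate, and the feature labels are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict


class PheromoneMap:
    """Hypothetical store of pheromone levels, one 'scent' per (edge, feature) pair."""

    def __init__(self, evaporation_rate=0.1):
        self.evaporation_rate = evaporation_rate  # assumed fraction lost per time step
        self.levels = defaultdict(float)          # (edge, feature) -> pheromone amount

    def deposit(self, edge, feature, amount=1.0):
        # An ant leaving a matching node reinforces the edge for that feature only.
        self.levels[(edge, feature)] += amount

    def evaporate(self):
        # Every trace decays, so paths that are not reinforced fade away.
        for key in list(self.levels):
            self.levels[key] *= 1.0 - self.evaporation_rate

    def level(self, edge, feature):
        return self.levels.get((edge, feature), 0.0)


def should_combine(encounter_counts, feature, threshold):
    """Assumed rule: combine features once encounters exceed the threshold."""
    return encounter_counts.get(feature, 0) > threshold


pm = PheromoneMap(evaporation_rate=0.5)
pm.deposit(("a", "b"), "colour=red")
pm.deposit(("a", "b"), "colour=red")   # reinforced twice
pm.deposit(("a", "c"), "size=large")   # a different scent on a different edge
pm.evaporate()
```

After one evaporation step the twice-reinforced trace is still stronger than the single deposit, which is the mechanism by which useful paths accumulate pheromone while others fade.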

    Projection-Based Clustering through Self-Organization and Swarm Intelligence

    It covers aspects of unsupervised machine learning used for knowledge discovery in data science and introduces a data-driven approach to cluster analysis, the Databionic swarm (DBS). DBS consists of the 3D landscape visualization and clustering of data. The 3D landscape enables 3D printing of high-dimensional data structures. The clustering and number of clusters or an absence of cluster structure are verified by the 3D landscape at a glance. DBS is the first swarm-based technique that shows emergent properties while exploiting concepts of swarm intelligence, self-organization and the Nash equilibrium concept from game theory. It results in the elimination of a global objective function and the setting of parameters. Once the R package is downloaded, DBS can be applied to data drawn from diverse research fields, even by non-professionals in the field of data mining.

    Learning in Dynamic Data-Streams with a Scarcity of Labels

    Analysing data in real time is a natural and necessary progression from traditional data mining. However, real-time analysis presents challenges beyond those of batch analysis; along with strict time and memory constraints, change is a major consideration. In a dynamic stream there is an assumption that the underlying process generating the stream is non-stationary and that concepts within the stream will drift and change over time. Adopting a false assumption that a stream is stationary will result in non-adaptive models degrading and eventually becoming obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity of labels problem. This refers to the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available) or in which manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data-streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants (Ant Colony Stream Clustering (ACSC)) is proposed. This algorithm is shown to be faster and more accurate than comparative, peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and, crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to "sit on top" of these stream-clustering algorithms and can be used to observe and track change at the feature level in a data stream. This Feature Mask acts as an unsupervised feature selection method allowing high-dimensional streams to be clustered.
Finally, data-stream clustering is evaluated as an approach to one-class classification and a novel framework (named COCEL: Clustering and One class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and hugely reduces the number of required labels (typically less than 0.05% of the entire stream).

    Projection-Based Clustering through Self-Organization and Swarm Intelligence: Combining Cluster Analysis with the Visualization of High-Dimensional Data

    Cluster Analysis; Dimensionality Reduction; Swarm Intelligence; Visualization; Unsupervised Machine Learning; Data Science; Knowledge Discovery; 3D Printing; Self-Organization; Emergence; Game Theory; Advanced Analytics; High-Dimensional Data; Multivariate Data; Analysis of Structured Dat

    Temporal enhancer profiling of parallel lineages identifies AHR and GLIS1 as regulators of mesenchymal multipotency

    Temporal data on gene expression and context-specific open chromatin states can improve identification of key transcription factors (TFs) and the gene regulatory networks (GRNs) controlling cellular differentiation. However, their integration remains challenging. Here, we delineate a general approach for data-driven and unbiased identification of key TFs and dynamic GRNs, called EPIC-DREM. We generated time-series transcriptomic and epigenomic profiles during differentiation of the mouse multipotent bone marrow stromal cell line (ST2) toward adipocytes and osteoblasts. Using our novel approach we constructed time-resolved GRNs for both lineages and identified the shared TFs involved in both differentiation processes. To take an alternative approach to prioritizing the identified shared regulators, we mapped dynamic super-enhancers in both lineages and associated them with target genes with correlated expression profiles. The combination of the two approaches identified aryl hydrocarbon receptor (AHR) and Glis family zinc finger 1 (GLIS1) as mesenchymal key TFs controlled by dynamic cell type-specific super-enhancers that become repressed in both lineages. AHR and GLIS1 control differentiation-induced genes and their overexpression can inhibit the lineage commitment of the multipotent bone marrow-derived ST2 cells.

    Improving the hierarchical classification of protein functions With swarm intelligence

    This thesis investigates methods to improve the performance of hierarchical classification. In this thesis, hierarchical classification is a form of supervised learning in which the classes in a data set are arranged in a tree structure. As a base for our new methods we use the TDDC (top-down divide-and-conquer) approach for hierarchical classification, where each classifier is built only to discriminate between sibling classes. Firstly, we propose a swarm intelligence technique which varies the types of classifiers used at each divide within the TDDC tree. Our technique, PSO/ACO-CS (Particle Swarm Optimisation/Ant Colony Optimisation Classifier Selection), finds combinations of classifiers to be used in the TDDC tree using the global search ability of PSO/ACO. Secondly, we propose a technique that attempts to mitigate a major drawback of the TDDC approach. The drawback is that if at any point in the TDDC tree an example is misclassified it can never be correctly classified further down the TDDC tree. Our approach, PSO/ACO-RO (PSO/ACO-Recovery Optimisation), decides whether to redirect examples at a given classifier node using, again, the global search ability of PSO/ACO. Thirdly, we propose an ensemble-based technique, HEHRS (Hierarchical Ensembles of Hierarchical Rule Sets), which attempts to boost the accuracy at each classifier node in the TDDC tree by using information from classifiers (rule sets) in the rest of that tree. We use Particle Swarm Optimisation to weight the individual rules within each ensemble. We evaluate these three new methods on hierarchical bioinformatics data sets that we have created for this research. These data sets represent the real-world problem of protein function prediction. We find through extensive experimentation that the three proposed methods improve upon the baseline TDDC method to varying degrees.
Overall the HEHRS and PSO/ACO-CS-RO approaches are most effective, although they are associated with a higher computational cost.
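
The top-down divide-and-conquer scheme and its drawback can be made concrete with a small Python sketch. This is an illustration under stated assumptions, not the thesis's code: the `Node` class, the function `classify_top_down`, the class labels `A`, `A.1`, `A.2`, `B`, and the threshold "classifiers" on made-up features `f1` and `f2` are all hypothetical.

```python
class Node:
    """Hypothetical node in a class hierarchy; internal nodes hold a local classifier."""

    def __init__(self, label, children=None, classifier=None):
        self.label = label
        self.children = children or {}   # child label -> Node
        self.classifier = classifier     # callable: example -> child label


def classify_top_down(root, example):
    """Descend from the root, letting each local classifier pick among sibling classes."""
    path = []
    node = root
    while node.children:
        child_label = node.classifier(example)
        node = node.children[child_label]
        path.append(node.label)
    return path


# Toy hierarchy: root -> {A -> {A.1, A.2}, B}. The lambdas stand in for trained classifiers.
root = Node(
    "root",
    children={
        "A": Node(
            "A",
            children={"A.1": Node("A.1"), "A.2": Node("A.2")},
            classifier=lambda x: "A.1" if x["f2"] > 0 else "A.2",
        ),
        "B": Node("B"),
    },
    classifier=lambda x: "A" if x["f1"] > 0 else "B",
)

correct_path = classify_top_down(root, {"f1": 1, "f2": -1})   # ["A", "A.2"]
wrong_branch = classify_top_down(root, {"f1": -1, "f2": 1})   # ["B"]
```

In the second call the root sends the example to `B`, so no classifier under `A` is ever consulted. If the example truly belonged to `A.1`, that early error could not be undone further down the tree, which is the drawback the recovery technique in the abstract is aimed at.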

    An appraisal of the physical environmental quality of the Selebi Phikwe Ni-Cu mine area, South-Eastern

    This research project focused on the environmental impact of mining and smelting nickel-copper (Ni-Cu) at the Selebi Phikwe area, south-eastern Botswana. Physico-chemical properties, mineralogical identification and characterisation, and heavy metal concentrations were investigated for samples of tailings dump, soils, particulate air matter (PAM), Colophospermum mopane (mopane plant), and Imbrasia belina (phane caterpillar). Physico-chemical properties studied on tailings dump and soil samples included soil texture and colour, particle size distribution (PSD), pH, electrical conductivity (EC), cation exchange capacity (CEC) and descriptive petrography. Identification and characterisation of minerals contained in tailings dump, soil, and PAM samples were performed employing X-ray powder diffraction (XRPD) techniques which included clay size and heavy minerals fractionation. Heavy metal (cadmium, Cd; cobalt, Co; chromium, Cr; nickel, Ni and selenium, Se) concentrations in tailings dump, soils, PAM, mopane leaves and phane caterpillar were measured with a graphite furnace atomic absorption spectrometer (GFAAS), whereas the flame atomic absorption spectrometer (FAAS) measured copper, Cu; iron, Fe and zinc, Zn concentration levels. The clay and silt components made up as much as 50 wt % of the soil. Very acidic soils were located close to the smelter/concentrator plant, and both soil EC and CEC values were significantly low. Physical tests revealed albite, NaAlSi3O8; cristobalite, α-SiO2; chalcopyrite, CuFeS2; pyrrhotite, Fe1-xS; tremolite, Ca2Mg5Si8O22(OH)2; and pentlandite, (Fe,Ni)9S8; to be contained in tailings dump. Soil colour varied from pale yellow, reddish yellow to dark reddish brown.
The tailings dump comprised nickelbloedite, Na2Ni(SO4)2.4H2O; pyrrhotite; quartz, SiO2; pentlandite; malachite, Cu2CO3(OH)2; chalcopyrite; actinolite, Ca2(Mg,Fe)5Si8O22(OH)2; cristobalite; tremolite; kaolinite, Al2Si2O5(OH)4; mica and albite. The PAM consisted of quartz, SiO2; pyrrhotite; chalcopyrite, CuFeS2; albite, and djurleite, Cu31S16. Bulk soil samples consisted of actinolite, albite, quartz, microcline, KAlSi3O8; pyrrhotite, silicon sulphide, SiS; and cobalt oxide, CoO, whereas the < 2 μm fraction was made of kaolinite, smectite, Na0.3(Al,Mg)2Si4O10(OH)2.xH2O; anorthite, CaAl2Si2O8; illite, KAl2(Si3AlO10)(OH)2 and quartz. Djurleite polymorphs (Cu31S16 and Cu1.93S) were formed from secondary mineralisation of chalcopyrite and the SO2 released from concentration/smelting processes. Ambient temperature and an acidic milieu created favourable conditions for the formation of nickelbloedite and malachite from the primary ore minerals: pentlandite, chalcopyrite and pyrrhotite in tailings dump. Cobalt oxide and silicon sulphide identified in surface soils were indicative of environmental chemical alteration of mining waste deposited on surface soils. High concentrations of heavy metals recorded in different environmental media had affected the physical environmental quality at Selebi Phikwe. Heavy metals including Cd, Co, Cr, Cu, Fe, Ni, Se and Zn, which are deleterious to the environment and pose health hazards to human beings, were associated with these minerals. The heavy ions in solution might have contaminated water bodies around Selebi Phikwe. Consumption of stunted phane might pose a health hazard. In overcoming pollution problems at Selebi Phikwe, aspects of pollution management such as phytoremediation and phytomining, environmental desulphurisation, phytostabilisation, and biotechnology could be introduced as pollution control measures.

    Data Mining Techniques for Complex User-Generated Data

    Nowadays, the amount of collected information is continuously growing in a variety of different domains. Data mining techniques are powerful instruments to effectively analyze these large data collections and extract hidden and useful knowledge. A vast amount of User-Generated Data (UGD) is being created every day, such as user behavior, user-generated content, user exploitation of available services and user mobility in different domains. Some common critical issues arise for the UGD analysis process, such as the large dataset cardinality and dimensionality, the variable data distribution and inherent sparseness, and the heterogeneous data needed to model the different facets of the targeted domain. Consequently, the extraction of useful knowledge from such data collections is a challenging task, and proper data mining solutions should be devised for the problem under analysis. In this thesis work, we focus on the design and development of innovative solutions to support data mining activities over User-Generated Data characterised by different critical issues, via the integration of different data mining techniques in a unified framework. Real datasets coming from three example domains characterized by the above critical issues are considered as reference cases, i.e., health care, social network, and urban environment domains. Experimental results show the effectiveness of the proposed approaches to discover useful knowledge from different domains.

    GAC-MAC-SGA 2023 Sudbury Meeting: Abstracts, Volume 46
