56 research outputs found

    Discussion on density-based clustering methods applied for automated identification of airspace flows

    Air Traffic Management systems generate a huge amount of track data daily. Flight trajectories can be clustered to extract the main air traffic flows by means of unsupervised machine learning techniques. A well-known methodology for unsupervised extraction of air traffic flows follows a two-step process: the first step reduces the dimensionality of the track data, and the second step clusters the data with a density-based algorithm, DBSCAN. This paper explores advancements in density-based clustering such as OPTICS or HDBSCAN*, assessing them through quantitative and qualitative evaluations of the clustering solutions they produce. In addition, the paper proposes a hierarchical clustering algorithm for handling noise in this methodology, based on a recursive application of DBSCAN* (RDBSCAN*). The paper demonstrates the sensitivity of these algorithms to different hyper-parameters and recommends a specific setting for the main one, which is common to all methods. RDBSCAN* outperforms the other algorithms in terms of the density-based internal validity metric. Finally, the outcome of the clustering shows that the algorithm extracts the main clusters of the dataset effectively, connecting outliers to these main clusters.
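
    The clustering step of the two-step process described above can be illustrated with a minimal, pure-Python sketch of plain DBSCAN. It assumes the first step has already mapped each trajectory to a low-dimensional feature vector; the sample `tracks`, the `eps` radius, and the `min_pts` threshold are hypothetical values chosen for illustration, and the sketch is not the paper's RDBSCAN* algorithm.

```python
from math import dist

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbours)
    return labels

# Hypothetical reduced feature vectors: two dense groups and one stray track.
tracks = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(tracks, eps=1.5, min_pts=3)
print(labels)
```

    Points labelled `-1` are the noise that a noise-handling extension such as the one the paper proposes would then try to attach to a main cluster.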

    Efficient Computation of Multiple Density-Based Clustering Hierarchies

    HDBSCAN*, a state-of-the-art density-based hierarchical clustering method, produces a hierarchical organization of clusters in a dataset w.r.t. a parameter mpts. While the performance of HDBSCAN* is robust w.r.t. mpts in the sense that a small change in mpts typically leads to only a small or no change in the clustering structure, choosing a "good" mpts value can be challenging: depending on the data distribution, a high or low value for mpts may be more appropriate, and certain data clusters may reveal themselves at different values of mpts. To explore results for a range of mpts values, however, one has to run HDBSCAN* for each value in the range independently, which is computationally inefficient. In this paper, we propose an efficient approach to compute all HDBSCAN* hierarchies for a range of mpts values by replacing the graph used by HDBSCAN* with a much smaller graph that is guaranteed to contain the required information. An extensive experimental evaluation shows that with our approach one can obtain over one hundred hierarchies for the computational cost equivalent to running HDBSCAN* about 2 times. Comment: A short version of this paper appears at IEEE ICDM 2017. Corrected typos. Revised abstrac

    Spanish Corpora of tweets about COVID-19 vaccination for automatic stance detection

    The paper presents new annotated corpora for performing stance detection on Spanish Twitter data, most notably health-related tweets. The objectives of this research are threefold: (1) to develop a manually annotated benchmark corpus for emotion recognition taking into account different variants of Spanish in social posts; (2) to evaluate the efficiency of semi-supervised models for extending such corpus with unlabelled posts; and (3) to describe such short text corpora via specialised topic modelling. A corpus of 2,801 tweets about COVID-19 vaccination was annotated by three native speakers to be in favour (904), against (674) or neither (1,223) with a 0.725 Fleiss’ kappa score. Results show that the self-training method with SVM base estimator can alleviate annotation work while ensuring high model performance. The self-training model outperformed the other approaches and produced a corpus of 11,204 tweets with a macro averaged f1 score of 0.94. The combination of sentence-level deep learning embeddings and density-based clustering was applied to explore the contents of both corpora. Topic quality was measured in terms of the trustworthiness and the validation index. Agencia Estatal de Investigación | Ref. PID2020–113673RB-I00; Xunta de Galicia | Ref. ED431C2018/55; Fundação para a Ciência e a Tecnologia | Ref. UIDB/04469/2020; Funded for open access publication: Universidade de Vigo/CISU

    PhageWeb – Web Interface for Rapid Identification and Characterization of Prophages in Bacterial Genomes

    This study developed a computational tool with a graphical interface and a web-service that allows the identification of phage regions through homology search and gene clustering. It uses G+C content variation evaluation and tRNA prediction sites as evidence to reinforce the presence of prophages in indeterminate regions. Also, it performs the functional characterization of the prophage regions through data integration of biological databases. The performance of PhageWeb was compared to other available tools (PHASTER, Prophinder, and PhiSpy) using Sensitivity (Sn) and Positive Predictive Value (PPV) tests. As a reference for the tests, more than 80 manually annotated genomes were used. In the PhageWeb analysis, the Sn index was 86.1% and the PPV was approximately 87%, while the second best tool presented Sn and PPV values of 83.3% and 86.5%, respectively. These numbers allowed us to observe a greater precision in the regions identified by PhageWeb when compared to other prediction tools submitted to the same tests. Additionally, PhageWeb was much faster than the other computational alternatives, decreasing the processing time to approximately one-ninth of the time required by the second best software. PhageWeb is freely available at http://computationalbiology.ufpa.br/phageweb
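
    For reference, the two metrics named above follow their standard definitions, Sn = TP / (TP + FN) and PPV = TP / (TP + FP). The short sketch below computes both; the counts are hypothetical and are not taken from the PhageWeb evaluation.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true prophage regions that the tool recovered."""
    return tp / (tp + fn)

def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of predicted regions that are true prophage regions."""
    return tp / (tp + fp)

# Hypothetical counts for illustration only (not from the study).
tp, fn, fp = 90, 10, 15
sn = sensitivity(tp, fn)
ppv = positive_predictive_value(tp, fp)
print(f"Sn = {sn:.1%}, PPV = {ppv:.1%}")
```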

    NK Hybrid Genetic Algorithm for Clustering

    The NK hybrid genetic algorithm for clustering is proposed in this paper. In order to evaluate the solutions, the hybrid algorithm uses the NK clustering validation criterion 2 (NKCV2). NKCV2 uses information about the disposition of N small groups of objects. Each group is composed of K+1 objects of the dataset. Experimental results show that density-based regions can be identified by using NKCV2 with fixed small K. In NKCV2, the relationship between decision variables is known, which in turn allows us to apply gray box optimization. Mutation operators, a partition crossover, and a local search strategy are proposed, all using information about the relationship between decision variables. In partition crossover, the evaluation function is decomposed into q independent components; partition crossover then deterministically returns the best among 2^q possible offspring with computational complexity O(N). The NK hybrid genetic algorithm allows the detection of clusters with arbitrary shapes and the automatic estimation of the number of clusters. In the experiments, the NK hybrid genetic algorithm produced very good results when compared to another genetic algorithm approach and to state-of-the-art clustering algorithms. In Brazil, this research was partially funded by FAPESP (2015/06462-1, 2015/50122-0, and 2013/07375-0) and CNPq (303012/2015-3 and 304400/2014-9). In Spain, this research was partially funded by Ministerio de Economía y Competitividad (TIN2014-57341-R and TIN2017-88213-R) and by Ministerio de Educación Cultura y Deporte (CAS12/00274)
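
    The reason partition crossover can return the best of 2^q offspring without enumerating them is that, once the evaluation function splits into q independent components, each component can be inherited from whichever parent scores better on it. The sketch below illustrates that idea under stated assumptions: the score is maximized, `component_score` is a hypothetical stand-in that depends only on the variables in its component, and the toy scoring function is not NKCV2.

```python
def partition_crossover(parent_a, parent_b, components, component_score):
    """Build the best offspring by choosing, per component, the better parent.

    Assumes the total score is the sum of the component scores and that each
    component score depends only on the variables listed in that component.
    """
    child = list(parent_a)
    for comp in components:
        score_a = component_score(parent_a, comp)
        score_b = component_score(parent_b, comp)
        donor = parent_a if score_a >= score_b else parent_b
        for idx in comp:
            child[idx] = donor[idx]
    return child

# Toy separable score (hypothetical): number of positions matching a target.
target = [1, 1, 1, 0, 0, 0]

def matches(solution, comp):
    return sum(solution[i] == target[i] for i in comp)

parent_a = [1, 1, 0, 0, 1, 0]
parent_b = [0, 1, 1, 0, 0, 0]
components = [[0, 1], [2, 3], [4, 5]]  # q = 3, so 2^3 = 8 possible offspring

child = partition_crossover(parent_a, parent_b, components, matches)
print(child)
```

    Each variable is visited a constant number of times, so the cost grows linearly with the number of variables as long as each component score is linear in the component size.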