
    SOTXTSTREAM: Density-based self-organizing clustering of text streams

    A streaming data clustering algorithm is presented that builds upon the density-based self-organizing stream clustering algorithm SOSTREAM. Many density-based clustering algorithms are limited by their inability to identify clusters with heterogeneous density. SOSTREAM addresses this limitation through the use of local (nearest-neighbor-based) density determinations. Additionally, many stream clustering algorithms use a two-phase clustering approach: in the first phase, a micro-clustering solution is maintained online, while in the second phase, the micro-clustering solution is clustered offline to produce a macro solution. By performing self-organization techniques on micro-clusters in the online phase, SOSTREAM is able to maintain a macro clustering solution in a single phase. Leveraging concepts from SOSTREAM, a new density-based self-organizing text stream clustering algorithm, SOTXTSTREAM, is presented that addresses several shortcomings of SOSTREAM. Gains in clustering performance of this new algorithm are demonstrated on several real-world text stream datasets.
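
    The online micro-cluster update that SOSTREAM-style algorithms perform can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation; the names (MicroCluster, process_point), the parameters k and alpha, and the bootstrap rule are assumptions.

    # Illustrative sketch only: a minimal nearest-neighbour, self-organizing
    # micro-cluster update in the spirit of SOSTREAM/SOTXTSTREAM. All names
    # (MicroCluster, process_point, k, alpha) are hypothetical, not the
    # authors' implementation.
    import numpy as np

    class MicroCluster:
        def __init__(self, centroid):
            self.centroid = np.asarray(centroid, dtype=float)
            self.weight = 1.0

    def process_point(x, clusters, k=3, alpha=0.1):
        """Assign a streaming point to its nearest micro-cluster and
        self-organize the winner's k nearest neighbours toward it."""
        x = np.asarray(x, dtype=float)
        if len(clusters) < k + 1:
            clusters.append(MicroCluster(x))  # bootstrap until k+1 clusters exist
            return clusters
        dists = np.array([np.linalg.norm(x - c.centroid) for c in clusters])
        win = int(np.argmin(dists))
        winner = clusters[win]
        # Local density estimate: radius spanned by the winner's k nearest neighbours.
        nn = np.argsort([np.linalg.norm(winner.centroid - c.centroid)
                         for c in clusters])[1:k + 1]
        radius = max(np.linalg.norm(winner.centroid - clusters[i].centroid) for i in nn)
        if dists[win] <= radius:
            winner.centroid += alpha * (x - winner.centroid)  # absorb the point
            winner.weight += 1.0
            for i in nn:  # self-organization: pull neighbours toward the winner
                clusters[i].centroid += alpha * 0.5 * (winner.centroid - clusters[i].centroid)
        else:
            clusters.append(MicroCluster(x))  # start a new micro-cluster
        return clusters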

    Fast & Efficient Learning of Bayesian Networks from Data: Knowledge Discovery and Causality

    Structure learning is essential for Bayesian networks (BNs) because it uncovers causal relationships and enables knowledge discovery, prediction, inference, and decision-making under uncertainty. Two novel algorithms, FSBN and SSBN, both based on the PC algorithm, employ a local search strategy and conditional independence tests to learn the causal network structure from data. They incorporate d-separation to infer additional topological information, prioritize conditioning sets, and terminate the search early once no further tests are needed. FSBN achieves up to a 52% reduction in computation cost, while SSBN surpasses it with a 72% reduction for a 200-node network, demonstrating further efficiency gains due to its more intelligent search strategy. Experimental studies show that both algorithms match the induction quality of the PC algorithm while significantly reducing computation costs, offering interpretability and adaptability with a lower computational burden and making them valuable for various applications in big data analytics.
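
    The PC-style skeleton search that FSBN and SSBN build on can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes roughly Gaussian data so that a Fisher-z partial-correlation test can stand in for the conditional independence test, and it omits the d-separation inference and conditioning-set prioritization that give the two algorithms their speed-ups.

    # Minimal sketch (not the authors' FSBN/SSBN code) of the PC-style skeleton
    # search: edges are removed whenever a conditional independence test with a
    # conditioning set of growing size succeeds.
    from itertools import combinations
    import numpy as np
    from scipy import stats

    def ci_test(data, i, j, cond, alpha=0.05):
        """Fisher-z test of independence of columns i and j given cond."""
        idx = [i, j] + list(cond)
        corr = np.corrcoef(data[:, idx], rowvar=False)
        prec = np.linalg.pinv(corr)
        r = -prec[0, 1] / np.sqrt(prec[0, 0] * prec[1, 1])  # partial correlation
        r = np.clip(r, -0.9999, 0.9999)
        z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(cond) - 3)
        return 2 * (1 - stats.norm.cdf(abs(z))) > alpha  # True means independent

    def pc_skeleton(data, alpha=0.05):
        """Learn an undirected skeleton over the columns of a (n, p) data matrix."""
        p = data.shape[1]
        adj = {v: set(range(p)) - {v} for v in range(p)}
        level = 0
        while any(len(adj[v]) - 1 >= level for v in range(p)):
            for i in range(p):
                for j in list(adj[i]):
                    for cond in combinations(adj[i] - {j}, level):
                        if ci_test(data, i, j, cond, alpha):
                            adj[i].discard(j)
                            adj[j].discard(i)
                            break
            level += 1
        return adj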

    Quantifying soybean phenotypes using UAV imagery and machine learning, deep learning methods

    Crop breeding programs aim to introduce new cultivars with improved traits to help address the food crisis; food production will need to grow at roughly twice its current rate to feed the increasing population by 2050. Soybean is one of the major grain crops in the world, and the US alone contributes around 35 percent of world soybean production. To increase soybean production, breeders still rely on conventional breeding strategies, which are essentially a 'trial and error' process, and these constraints limit the expected progress of crop breeding programs. The goal of this study was to quantify the soybean phenotypes of plant lodging and pubescence color using UAV-based imagery and machine learning. Plant lodging and pubescence color are two of the most important phenotypes for soybean breeding programs; both are conventionally evaluated visually by breeders, which is time-consuming and subject to human error. Specifically, this study investigated the potential of unmanned aerial vehicle (UAV)-based imagery with machine learning for assessing lodging conditions, and with deep learning for assessing pubescence color, of soybean breeding lines.
    A UAV imaging system equipped with an RGB (red-green-blue) camera was used to collect imagery of 1,266 four-row plots in a soybean breeding field at the reproductive stage. Lodging and pubescence scores were visually assessed by experienced breeders. Lodging scores were grouped into four classes (non-lodging, moderate lodging, high lodging, and severe lodging), while pubescence color scores were grouped into three classes (gray, tawny, and segregating). UAV images were stitched into orthomosaics, and soybean plots were segmented using a grid method. Twelve image features were extracted from the collected images to assess the lodging score of each breeding line. Four models, i.e., extreme gradient boosting (XGBoost), random forest (RF), K-nearest neighbor (KNN), and artificial neural network (ANN), were evaluated for classifying the lodging classes, and five data pre-processing methods were used to treat the imbalanced dataset and improve classification accuracy. Results indicate that the pre-processing method SMOTE-ENN (Synthetic Minority Over-sampling combined with Edited Nearest Neighbor) consistently performed well for all four classifiers, achieving the highest overall accuracy (OA), the lowest misclassification, and higher F1-scores and Kappa coefficients, suggesting that SMOTE-ENN is an effective pre-processing method for imbalanced classification tasks. An overall accuracy of 96 percent was obtained using the SMOTE-ENN dataset and the ANN classifier. To classify soybean pubescence color, seven pre-trained deep learning models, i.e., DenseNet121, DenseNet169, DenseNet201, ResNet50, InceptionResNet-V2, Inception-V3, and EfficientNet, were used, with images of each plot fed into the models; the data were augmented using two rotational and two scaling factors to enlarge the dataset. Among the seven models, the ResNet50 and DenseNet121 classifiers achieved the highest overall accuracy of 88 percent, along with higher precision, recall, and F1-scores for all three pubescence color classes.
    In conclusion, the developed UAV-based high-throughput phenotyping system can gather image features to estimate and classify crucial soybean phenotypes, which will help breeders characterize phenotypic variation in breeding trials. RGB imagery-based classification could also be a cost-effective choice for breeders and associated researchers in identifying superior genotypes in plant breeding programs. Includes bibliographical references.
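
    The imbalance-handling step described above can be illustrated with the SMOTEENN implementation in imbalanced-learn together with one of the evaluated classifiers (random forest). This is a hedged sketch rather than the thesis pipeline; load_plot_features is a hypothetical loader standing in for the twelve image features per plot and their lodging labels.

    # Illustrative sketch, not the thesis code: SMOTE-ENN resampling of an
    # imbalanced lodging dataset followed by one of the evaluated classifiers.
    from imblearn.combine import SMOTEENN
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report, cohen_kappa_score

    # Hypothetical loader: X has shape (n_plots, 12), y holds the lodging class of each plot.
    X, y = load_plot_features()

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, test_size=0.2, random_state=0)

    # Oversample minority lodging classes with SMOTE, then clean noisy samples with ENN.
    X_res, y_res = SMOTEENN(random_state=0).fit_resample(X_tr, y_tr)

    clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_res, y_res)
    pred = clf.predict(X_te)
    print(classification_report(y_te, pred))
    print("Kappa:", cohen_kappa_score(y_te, pred))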

    Programming language identification using machine learning

    Many developer tools, such as code search and source code highlighting, need to know the programming language each file is written in. Generating statistics for code repositories also requires knowing the programming language of each source code file. The main aim of this project is to build a programming language classifier that can be used in the previously mentioned tasks, mainly by employing machine learning in combination with text classification techniques. We have developed a source code classifier that was trained and then tested on source code from the Rosetta project dataset [44]. We also measure metrics such as the time it takes to train the classifier, its accuracy, and the time to classify individual source files, and we assess the economic impact in this social context, evaluating why a tool like this makes sense for developers.
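
    A minimal sketch of the kind of text-classification pipeline described above, assuming character n-gram TF-IDF features and a linear SVM (the project's actual feature set and model may differ), could look like this; load_rosetta is a hypothetical loader for the Rosetta source files and their language labels.

    # Assumed approach, not the project's actual code: character n-gram TF-IDF
    # features of each source file feed a linear SVM language classifier.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Hypothetical loader: list of file contents and the language label of each file.
    files, languages = load_rosetta()

    X_tr, X_te, y_tr, y_te = train_test_split(files, languages, test_size=0.2, random_state=0)

    clf = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3), max_features=50000),
        LinearSVC(),
    )
    clf.fit(X_tr, y_tr)
    print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))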

    Multimodal Side-Tuning for Code Snippets Programming Language Recognition

    Automatically identifying the programming language of a snippet of source code is a task that still presents several difficulties. The number of programming languages, the amount of code published as open source, and the number of developers producing and publishing new source code are all continuously growing. There are many reasons why tools capable of recognizing the language of source code snippets are needed; for example, such tools find application in areas such as source code search, detection of possible vulnerabilities in code, syntax highlighting, or simply understanding the contents of software projects. This creates the need for datasets of code snippets properly aligned with their programming language. StackOverflow, a knowledge-sharing platform for developers, provides access to hundreds of thousands of source code snippets written in the languages most used by developers, making it the ideal place from which to extract snippets for the proposed task. This work devotes considerable attention to that problem, iterating on the chosen approach to arrive at a methodology that allowed an adequate dataset to be extracted. To solve the language identification task for the snippets extracted from StackOverflow, this work uses a multimodal approach (considering both textual and image representations of the snippets), examining the novel side-tuning technique (based on the incremental adaptation of a pre-trained neural network). The results obtained are comparable with the state of the art, and in some cases better, considering the difficulty of the task for source code snippets consisting of only a few lines of code.
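
    The core side-tuning idea can be sketched in PyTorch as follows. This is an assumed, simplified illustration rather than the thesis implementation: it shows a single image branch (a frozen pre-trained base blended with a small trainable side network through a learnable gate), whereas the work described here fuses textual and image representations of each snippet.

    # Sketch only (not the thesis implementation): side-tuning blends a frozen
    # pre-trained base with a small trainable side network via a learnable gate.
    import torch
    import torch.nn as nn
    from torchvision import models

    class SideTuned(nn.Module):
        def __init__(self, n_classes, feat_dim=512):
            super().__init__()
            self.base = models.resnet18(weights="DEFAULT")
            self.base.fc = nn.Identity()            # keep the 512-d features
            for p in self.base.parameters():
                p.requires_grad = False             # base stays frozen
            self.side = nn.Sequential(              # small trainable side network
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(16, feat_dim),
            )
            self.alpha = nn.Parameter(torch.zeros(1))   # learnable blending gate
            self.head = nn.Linear(feat_dim, n_classes)

        def forward(self, img):
            a = torch.sigmoid(self.alpha)
            fused = a * self.base(img) + (1 - a) * self.side(img)
            return self.head(fused)

    model = SideTuned(n_classes=21)                 # e.g. 21 target languages (assumed)
    logits = model(torch.randn(2, 3, 224, 224))     # e.g. rendered snippet images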

    Intelligent lighting : a machine learning perspective


    Three-dimensional image classification using hierarchical spatial decomposition: A study using retinal data

    This thesis describes research conducted in the field of image mining, especially volumetric image mining. The study investigates volumetric representation techniques based on hierarchical spatial decomposition to classify three-dimensional (3D) images. The aim was to investigate the effectiveness of hierarchical spatial decomposition coupled with regional homogeneity in the context of volumetric data representation. The proposed methods involve the following steps: (i) decomposition, (ii) representation, (iii) single feature vector generation and (iv) classifier generation. In the decomposition step, a given image (volume) is recursively decomposed until either homogeneous regions or a predefined maximum level is reached. For measuring regional homogeneity, different critical functions are proposed, each based on histograms of a given region. Once the image is decomposed, two representation methods are proposed: (i) representing the decomposition through the regions identified in it (region-based), or (ii) representing the entire decomposition (whole image-based). The first method is based on individual regions, whereby each decomposed sub-volume (region) is represented in terms of different statistical and histogram-based techniques, and feature vector generation techniques are used to convert the set of feature vectors for each sub-volume into a single feature vector. In the whole image-based representation method, a tree is used to represent each image: each node in the tree represents a region (sub-volume) using a single value, and each edge describes the difference between the node and its parent node. A frequent sub-tree mining technique was adapted to identify a set of frequent sub-graphs, and selected sub-graphs are then used to build a feature vector for each image. In both cases, a standard classifier generator is applied to the generated feature vectors to model and predict the class of each image. Evaluation was conducted with respect to retinal optical coherence tomography images in terms of identifying Age-related Macular Degeneration (AMD). Two types of evaluation were used: (i) classification performance evaluation and (ii) statistical significance testing using ANalysis Of VAriance (ANOVA). The evaluation revealed that the proposed methods were effective for classifying 3D retinal images; it is consequently argued that the approaches are generic.
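
    The decomposition step can be illustrated with a simple octree-style recursion over a 3D volume, splitting a region into eight octants until a histogram-based homogeneity test succeeds or a maximum depth is reached. This is a schematic sketch, not the thesis code; the homogeneity threshold and the per-leaf summary (mean intensity) are placeholder choices.

    # Illustrative sketch: recursive octree-style decomposition of a 3D volume
    # with a histogram-based regional homogeneity criterion.
    import numpy as np

    def is_homogeneous(region, bins=8, threshold=0.9):
        """Homogeneous if one histogram bin holds at least `threshold` of the voxels."""
        hist, _ = np.histogram(region, bins=bins, range=(0.0, 1.0))
        return hist.max() / max(region.size, 1) >= threshold

    def decompose(volume, max_depth=4, depth=0):
        """Return a list of (depth, mean_intensity) descriptors, one per leaf region."""
        if depth == max_depth or is_homogeneous(volume) or min(volume.shape) < 2:
            return [(depth, float(volume.mean()))]
        zs, ys, xs = (s // 2 for s in volume.shape)
        leaves = []
        for z in (slice(0, zs), slice(zs, None)):
            for y in (slice(0, ys), slice(ys, None)):
                for x in (slice(0, xs), slice(xs, None)):
                    leaves += decompose(volume[z, y, x], max_depth, depth + 1)
        return leaves

    # Example: decompose a random volume with intensities scaled to [0, 1].
    vol = np.random.rand(64, 64, 64)
    features = decompose(vol)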