
    Algorithms Comparison for Non-Requirements Classification using the Semantic Feature of Software Requirement Statements

    Noise in a Software Requirements Specification (SRS) is an irrelevant requirements statement or a non-requirements statement. It can confuse the reader and have negative repercussions in later stages of software development. This study proposes a classification model to detect the second type of noise, the non-requirements statement. The model is built on the semantic features of non-requirements statements. This research also compares five of the best supervised machine learning methods to date: support vector machine (SVM), naïve Bayes (NB), random forest (RF), k-nearest neighbor (kNN), and decision tree. The comparison aimed to determine which method produces the best non-requirements classification model, and it shows that the best model is produced by the SVM method, with an average accuracy of 0.96. The most significant features in this non-requirements classification model are the requirements/non-requirements statement label, statement id, normalized mean value, standard deviation value, similarity variant value, standard deviation normalization value, maximum normalized value, similarity variant normalization value, bad NN value, mean value, number of sentences, bad VB score, and project id.
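The five-way comparison described above can be sketched as a small harness that scores each classifier on held-out data. This is a minimal illustration only: the toy feature vectors and the two stand-in "models" (1-nearest-neighbour and a majority-class baseline) are assumptions, not the paper's actual SVM/NB/RF/kNN/decision-tree setup.

```python
# Hedged sketch: score several classifiers on the same held-out set and
# compare their accuracy, as in the study's method comparison.

def one_nn_predict(train, test_x):
    # 1-nearest-neighbour by squared Euclidean distance
    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(train, key=lambda xy: dist(xy[0], test_x))[1]

def majority_predict(train, _test_x):
    # baseline: always predict the most frequent training label
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def accuracy(predict, train, test):
    hits = sum(predict(train, x) == y for x, y in test)
    return hits / len(test)

# toy "requirement vs non-requirement" vectors: (mean, std) style features
train = [((0.9, 0.1), "req"), ((0.8, 0.2), "req"),
         ((0.1, 0.9), "noise"), ((0.2, 0.8), "noise")]
test = [((0.85, 0.15), "req"), ((0.15, 0.85), "noise")]

for name, clf in [("1-NN", one_nn_predict), ("majority", majority_predict)]:
    print(name, accuracy(clf, train, test))
```

A real replication would swap the stand-in predictors for trained scikit-learn estimators and average accuracy over cross-validation folds.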

    Noise Detection in a Software Requirements Specification Document Using Spectral Clustering

    The requirements engineering phase of software development results in an SRS (Software Requirements Specification) document. The use of a natural language approach in producing such documents has drawbacks that lead to seven common mistakes among engineers, formulated by Meyer as "the seven sins of the specifier". One of these common mistakes is noise. This study attempts to detect noise in software requirements with spectral clustering, a clustering algorithm that works on fewer dimensions than comparable methods. The resulting kappa coefficient is 0.4426, showing that the consistency between the noise predictions and the noise assessments made by three annotators is still low.
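The kappa coefficient reported above measures chance-corrected agreement between predicted and annotated noise labels. A minimal sketch of Cohen's kappa (the labels below are illustrative, not the study's data):

```python
# Hedged sketch: Cohen's kappa between two label sequences,
# the agreement measure the study reports (0.4426).

def cohens_kappa(a, b):
    n = len(a)
    labels = set(a) | set(b)
    p_o = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    p_e = sum((a.count(l) / n) * (b.count(l) / n)        # chance agreement
              for l in labels)
    return (p_o - p_e) / (1 - p_e)

pred  = ["noise", "noise", "ok", "ok", "ok",    "noise"]
truth = ["noise", "ok",    "ok", "ok", "noise", "noise"]
print(round(cohens_kappa(pred, truth), 4))  # → 0.3333
```

Values near 0 mean agreement barely above chance, which is why the study judges 0.4426 as still low.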

    The effect of different similarity distance measures in detecting outliers using single-linkage clustering algorithm for univariate circular biological data

    Clustering algorithms can be used to create an outlier detection procedure for univariate circular data. The circular distance between each pair of angular observations is used as the similarity measure for grouping observations appropriately. In this paper, we present a clustering-based procedure for detecting outliers in univariate circular biological data using various similarity distance measures. Three circular similarity distance measures (Satari distance, Di distance, and Chang-chien distance) were used to detect outliers with a single-linkage clustering algorithm. Satari distance and Di distance are two similarity measures that have similar formulas for univariate circular data. This study aims to develop and demonstrate the effectiveness of the proposed clustering-based procedure with various similarity distance measures for detecting outliers. The circular similarity distances SL-Satari/Di and SL-Chang were compared at various dendrogram cutting points. A clustering-based procedure using a single-linkage algorithm with various similarity distances is found to be a practical and promising approach for detecting outliers in univariate circular data, particularly biological data. According to the results, the SL-Satari/Di distance outperforms the SL-Chang distance under certain data conditions.
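The procedure above can be sketched end to end: cluster angles by single linkage under a circular distance, cut the dendrogram at a threshold, and flag tiny clusters as outliers. The distance used here is the generic arc distance, a stand-in assumption rather than the Satari/Di or Chang-chien formulas, and the cut height is arbitrary.

```python
import math

# Hedged sketch: single-linkage clustering over a circular distance, cut at a
# threshold so that singleton clusters are flagged as outliers.

def circ_dist(a, b):
    # generic arc distance between two angles in radians
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def single_linkage(points, cut):
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # closest pair of clusters (single linkage = min pairwise distance)
        i, j, d = min(
            ((i, j, min(circ_dist(p, q)
                        for p in clusters[i] for q in clusters[j]))
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[2])
        if d > cut:                    # dendrogram cut height reached
            break
        clusters[i] += clusters.pop(j)
    return clusters

angles = [0.1, 0.15, 0.2, 3.0]         # radians; 3.0 sits far from the rest
clusters = single_linkage(angles, cut=0.5)
outliers = [c for c in clusters if len(c) == 1]
print(outliers)  # → [[3.0]]
```

Varying `cut` reproduces the paper's idea of comparing dendrogram cutting points.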

    Outlier detection and robust normal-curvature estimation in mobile laser scanning 3D point cloud data

    This paper proposes two robust statistical techniques for outlier detection and for robust estimation of saliency features, such as surface normals and curvature, in laser scanning 3D point cloud data. One is based on a robust z-score and the other uses a Mahalanobis-type robust distance. The methods couple the ideas of point-to-plane orthogonal distance and local surface point consistency to get Maximum Consistency with Minimum Distance (MCMD). The methods estimate the best-fit plane based on the most probable outlier-free, and most consistent, point set in a local neighbourhood. The normal and curvature from the best-fit plane are then highly robust to noise and outliers. Experiments are performed to show the performance of the algorithms compared to several existing well-known methods (from computer vision, data mining, machine learning and statistics) using synthetic and real laser scanning datasets of complex (planar and non-planar) objects. Results for plane fitting, denoising, sharp feature preserving and segmentation are significantly improved, and the algorithms are demonstrated to be significantly faster, more accurate and more robust. Quantitatively, for a sample size of 50 with 20% outliers, the proposed MCMD_Z is approximately 5, 15 and 98 times faster than the existing methods uLSIF, RANSAC and RPCA, respectively. The proposed MCMD_MD method can tolerate 75% clustered outliers, whereas RPCA and RANSAC can only tolerate 47% and 64% outliers, respectively. In terms of outlier detection, for the same dataset, MCMD_Z has an accuracy of 99.72%, a 0.4% false positive rate and a 0% false negative rate; for RPCA, RANSAC and uLSIF, the accuracies are 97.05%, 47.06% and 94.54%, respectively, and their misclassification rates are higher than those of the proposed methods. The new methods have potential for local surface reconstruction, fitting, and other point cloud processing tasks.
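The robust z-score idea behind MCMD_Z can be sketched in isolation: score each point's orthogonal distance to the fitted plane using the median and MAD instead of the mean and standard deviation, so the score is not dragged by the very outliers it is hunting. The plane fit is omitted and the 2.5 cutoff is an illustrative assumption, not necessarily the paper's.

```python
# Hedged sketch of a robust z-score over point-to-plane distances.

def robust_z(values, cutoff=2.5):
    # (upper) median and median absolute deviation, both outlier-resistant
    med = sorted(values)[len(values) // 2]
    abs_dev = sorted(abs(v - med) for v in values)
    mad = abs_dev[len(values) // 2]
    scale = 1.4826 * mad          # MAD -> std-equivalent under normality
    return [abs(v - med) / scale > cutoff for v in values]

# point-to-plane distances for a mostly flat neighbourhood, one stray point
dists = [0.01, 0.02, 0.015, 0.012, 0.018, 0.9]
print(robust_z(dists))  # → [False, False, False, False, False, True]
```

A classical z-score on the same data would inflate the standard deviation with the 0.9 distance and could miss the outlier; that resistance is the point of the robust variant.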

    Outlier Detection of Time Series with A Novel Hybrid Method in Cloud Computing

    With the development of science and technology, cloud computing has attracted attention in many fields. Meanwhile, outlier detection for data mining in cloud computing plays an increasingly significant role across research domains, and a large body of work has been devoted to it, including distance-based, density-based and clustering-based outlier detection. However, the existing methods have high computation time. Therefore, an improved outlier detection algorithm with higher detection performance is presented. The proposed method, an improved spectral clustering algorithm (SKM++), is fit for handling outliers; pruning the data reduces computational complexity, and the distance-based Manhattan distance (distm) is combined with it to obtain an outlier score. Finally, the method confirms outliers by extreme-value analysis. This paper validates the presented method with experiments on real sensor-collected data and a comparison against existing approaches; the experimental results show that the proposed method outperforms them.
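The Manhattan-distance scoring step can be sketched on its own: each point is scored by the Manhattan distance to its nearest cluster centre, and extreme scores are flagged. The fixed toy centres and the threshold stand in for the paper's SKM++ clustering and extreme analysis, which are not reproduced here.

```python
# Hedged sketch: distance-based outlier scores using Manhattan distance
# to the nearest cluster centre.

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def outlier_scores(points, centres):
    return [min(manhattan(p, c) for c in centres) for p in points]

centres = [(0.0, 0.0), (5.0, 5.0)]       # stand-in for SKM++ output
points = [(0.1, 0.2), (4.9, 5.1), (9.0, 0.0)]
scores = outlier_scores(points, centres)
threshold = 2.0                          # stand-in for extreme analysis
print([s > threshold for s in scores])   # → [False, False, True]
```

The point far from both centres receives a large score and is the one confirmed as an outlier.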

    A federated learning framework for the next-generation machine learning systems

    Master's dissertation in Industrial Electronics and Computers Engineering (specialization in Embedded Systems and Computers). The end of Moore's Law, aligned with rising concerns about data privacy, is forcing machine learning (ML) to shift from the cloud to the deep edge, near the data source. In next-generation ML systems, inference and part of the training process will be performed right on the edge, while the cloud will be responsible for major ML model updates. This new computing paradigm, referred to by academia and industry researchers as federated learning, relieves the cloud and network infrastructure while increasing data privacy. Recent advances have made it possible to efficiently execute the inference pass of quantized artificial neural networks on Arm Cortex-M and RISC-V (RV32IMCXpulp) microcontroller units (MCUs). Nevertheless, training is still confined to the cloud, imposing the transfer of high volumes of private data over a network. To tackle this issue, this MSc thesis makes the first attempt to run decentralized training on Arm Cortex-M MCUs. To port part of the training process to the deep edge, L-SGD is proposed: a lightweight version of stochastic gradient descent optimized for maximum speed and minimal memory footprint on Arm Cortex-M MCUs. L-SGD is 16.35x faster than the TensorFlow solution while registering a memory footprint reduction of 13.72%, at the cost of a negligible accuracy drop of only 0.12%. To merge the local model updates returned by edge devices, this MSc thesis proposes R-FedAvg, an implementation of the FedAvg algorithm that reduces the impact of faulty model updates returned by malicious devices.
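The server-side aggregation the thesis builds on can be sketched as follows. FedAvg averages client updates weighted by each client's sample count; the robustness tweak shown here (dropping the update farthest from the average) is only a stand-in for R-FedAvg, whose exact rule the abstract does not give.

```python
# Hedged sketch of FedAvg aggregation with a crude robustness filter.

def fedavg(updates, counts):
    # weighted average of flat parameter vectors, weights = sample counts
    total = sum(counts)
    dim = len(updates[0])
    return [sum(u[i] * n for u, n in zip(updates, counts)) / total
            for i in range(dim)]

def trimmed_fedavg(updates, counts):
    # discard the single update farthest from the plain average, then re-average
    mean = fedavg(updates, counts)
    dist = [sum((u[i] - mean[i]) ** 2 for i in range(len(mean)))
            for u in updates]
    worst = dist.index(max(dist))
    kept = [(u, n) for k, (u, n) in enumerate(zip(updates, counts))
            if k != worst]
    return fedavg([u for u, _ in kept], [n for _, n in kept])

updates = [[1.0, 1.0], [1.2, 0.8], [9.0, -9.0]]   # last one looks malicious
counts = [100, 100, 100]
print(trimmed_fedavg(updates, counts))
```

With the deviant update dropped, the aggregate stays near the two honest clients instead of being pulled toward the malicious one.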

    Exploration of outliers in if-then rule-based knowledge bases

    The article presents methods for both clustering and outlier detection in complex data such as rule-based knowledge bases. What distinguishes this work from others is, first, the application of clustering algorithms to rules in domain knowledge bases and, second, the use of outlier detection algorithms to find unusual rules in those knowledge bases. The aim of the paper is to analyse four algorithms for outlier detection in rule-based knowledge bases: Local Outlier Factor (LOF), Connectivity-based Outlier Factor (COF), K-MEANS, and SMALL CLUSTERS. Outlier mining is an important topic nowadays: outliers among if-then rules are unusual rules, rare in comparison with the others, which should be examined by a domain expert as soon as possible. In the research, the authors use the outlier detection methods to find a given fraction of outliers among the rules (1%, 5%, 10%), while in small groups the number of outliers covers no more than 5% of the rule cluster. Subsequently, the authors analyse which of seven quality indices, computed over all rules and again after removing the selected outliers, improve the quality of the rule clusters. In the experimental stage, the authors use six different knowledge bases. The best results (cluster quality improved most often) are achieved by two outlier detection algorithms: LOF and COF.
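The LOF score the article applies to rules can be sketched compactly. Rules would first be encoded as numeric vectors; that encoding is glossed over here with plain 1-D points, and k=2 is an illustrative choice.

```python
# Hedged sketch of the Local Outlier Factor: a point in a sparse region
# relative to its neighbours gets a score well above 1.

def knn(points, i, k):
    # k nearest neighbours of point i as (distance, index) pairs
    d = sorted((abs(points[i] - points[j]), j)
               for j in range(len(points)) if j != i)
    return d[:k]

def lrd(points, i, k):
    # local reachability density of point i
    neigh = knn(points, i, k)
    # reachability distance of i from j = max(k-distance(j), dist(i, j))
    reach = [max(knn(points, j, k)[-1][0], dij) for dij, j in neigh]
    return len(neigh) / sum(reach)

def lof(points, i, k):
    neigh = knn(points, i, k)
    return sum(lrd(points, j, k) for _, j in neigh) / (len(neigh) * lrd(points, i, k))

points = [1.0, 1.1, 1.2, 1.3, 8.0]      # last "rule" is far from the cluster
scores = [round(lof(points, i, 2), 2) for i in range(len(points))]
print(scores)
```

Cluster members score near 1, while the isolated point scores far above it; the article's COF variant differs mainly in how the neighbourhood density is chained.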

    Towards Outlier Detection For Scattered Data and Mixed Attribute Data

    Detecting outliers, which are grossly different from or inconsistent with the remaining dataset, is a major challenge in real-world knowledge discovery and data mining (KDD) applications. The research work in this thesis starts with a critical review of the latest and most popular methodologies in the outlier detection area. Based on a series of performance evaluations of these algorithms, two major issues in outlier detection, namely the scattered data problem and the mixed attribute problem, are identified and then further addressed by the novel approaches proposed in this thesis. Based on our review and evaluation, it has been found that existing outlier detection methods are ineffective for many real-world scattered datasets, due to the implicit data patterns within these sparse datasets. In order to address this issue, we define a novel Local Distance-based Outlier Factor (LDOF) to measure the outlierness of objects in scattered datasets. LDOF uses the relative location of an object to its neighbours to determine the degree to which the object deviates from its neighbourhood. The characteristics of LDOF are theoretically analysed, including LDOF's lower bound, false-detection probabilities, and its parameter range tolerance. In order to facilitate parameter settings in real-world applications, we employ a top-n technique in the proposed outlier detection approach, where only the objects with the highest LDOF values are regarded as outliers. Compared to conventional approaches (such as top-n KNN and top-n LOF), our method, top-n LDOF, proved more effective for detecting outliers in scattered data. The parameter settings for LDOF are also more practical for real-world applications, since its performance is relatively stable over a large range of parameter values, as illustrated by experimental results on both real-world and synthetic datasets.
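The LDOF definition described above can be sketched directly: the average distance from a point to its k nearest neighbours, divided by the average pairwise distance inside that neighbourhood. Values well above 1 mean the point sits outside its own neighbourhood. The 1-D points and k=3 are illustrative assumptions.

```python
# Hedged sketch of the Local Distance-based Outlier Factor (LDOF).

def ldof(points, i, k):
    dists = sorted((abs(points[i] - points[j]), j)
                   for j in range(len(points)) if j != i)
    neigh = [j for _, j in dists[:k]]
    d_knn = sum(d for d, _ in dists[:k]) / k          # point-to-neighbour
    inner = [abs(points[a] - points[b])               # neighbour-to-neighbour
             for ai, a in enumerate(neigh) for b in neigh[ai + 1:]]
    d_inner = sum(inner) / len(inner)
    return d_knn / d_inner

points = [1.0, 1.2, 1.4, 1.6, 7.0]      # last point is scattered away
scores = [round(ldof(points, i, 3), 2) for i in range(len(points))]
print(scores)
```

In a top-n scheme, one would simply sort these scores and keep the n largest as outliers, which is what makes the parameter choice forgiving.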
    Secondly, for the mixed attribute problem, traditional outlier detection methods often fail to identify outliers effectively because they lack mechanisms to consider the interactions among the various types of attributes that may exist in real-world datasets. To address this issue in mixed attribute datasets, we propose a novel Pattern-based Outlier Detection approach (POD). A pattern in this thesis is defined as a mathematical representation that describes the majority of the observations in a dataset and captures the interactions among different types of attributes. POD is designed so that the more an object deviates from these patterns, the higher its outlier factor is. We simply use logistic regression to learn patterns and then formulate the outlier factor in mixed attribute datasets. For datasets in which outliers are randomly allocated among the normal data objects, distance-based methods such as LOF and KNN would not be effective. In contrast, because the outlierness definition proposed in POD integrates numeric and categorical attributes into a unified definition, the numeric attributes do not represent the final outlierness directly but contribute their anomaly through the categorical attributes. Therefore, POD can offer considerable performance improvement over those traditional methods. A series of experiments shows that the performance enhancement achieved by POD is statistically significant compared to several classic outlier detection methods. However, POD sometimes shows lower detection precision on some mixed attribute datasets, because it makes the strong assumption that the observed mixed attribute dataset is linearly separable in any subspace. This limitation stems from the linear classifier, logistic regression, used in the POD algorithm.
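The POD idea can be sketched as: fit a logistic model that predicts a categorical attribute from the numeric ones (the "pattern"), then score each object by how strongly it contradicts that prediction. The tiny hand-rolled gradient-descent fit, the toy data, and the single numeric feature are stand-in assumptions, not the thesis's actual setup.

```python
import math

# Hedged sketch of a pattern-based outlier factor via logistic regression.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, steps=2000, lr=0.5):
    # plain per-sample gradient descent on the log-loss
    w, b = 0.0, 0.0
    for _ in range(steps):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# numeric attribute x, categorical attribute y (0/1); pattern: large x -> y=1
xs = [0.1, 0.2, 0.3, 0.8, 0.9, 0.7, 0.85]
ys = [0,   0,   0,   1,   1,   1,   0]      # last object breaks the pattern
w, b = fit_logistic(xs, ys)
# outlier factor: probability mass the model puts on the class NOT observed
factors = [abs(y - sigmoid(w * x + b)) for x, y in zip(xs, ys)]
worst = factors.index(max(factors))
print(worst, round(factors[worst], 2))
```

The object whose categorical value contradicts its numeric attribute receives the largest factor, which is exactly the deviation-from-pattern notion POD formalises; the linear-separability limitation mentioned above is visible in the fact that a single linear score drives every prediction.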