    Valve Health Identification Using Sensors and Machine Learning Methods

    Predictive maintenance models attempt to identify developing issues with industrial equipment before they become critical. In this paper, we describe both supervised and unsupervised approaches to predictive maintenance for subsea valves in the oil and gas industry. The supervised approach is appropriate for valves with a long history of operation and manual assessments of valve state, while the unsupervised approach addresses the cold-start problem that arises when new valves, for which no operational history exists, come online. For the supervised prediction problem, we attempt to distinguish between healthy and unhealthy valve actuators using sensor data measuring hydraulic pressures and flows during valve opening and closing events. Unlike previous approaches that rely solely on raw sensor data, we derive frequency- and time-domain features, and experiment with a range of classification algorithms and different feature subsets. The best-performing models for the supervised approach were found to be AdaBoost and Random Forest ensembles. In the unsupervised approach, the goal is to detect abrupt changes in valve behaviour by comparing the sensor readings from consecutive opening or closing events. Our novel methodology compares the sequences of sensor readings captured during these events using the raw sensor readings as well as normalised and first-derivative versions of the sequences. We evaluate the effectiveness of a number of well-known time-series similarity measures and find that the discrete Fréchet distance and dynamic time warping lead to the best results, with the Bray-Curtis similarity measure giving only marginally poorer change detection while requiring considerably less computational effort.
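    As a rough illustration of the comparison the abstract describes (not the authors' code), the sketch below compares two hypothetical pressure traces from consecutive valve events under dynamic time warping and Bray-Curtis, on the raw, z-normalised, and first-derivative versions of the sequences; the discrete Fréchet distance is omitted for brevity, and the example traces are made up.

```python
# Illustrative sketch only: compare two hypothetical valve-event pressure
# traces with DTW and Bray-Curtis, on raw/normalised/derivative variants.
import numpy as np
from scipy.spatial.distance import braycurtis

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping on 1-D sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def znorm(x):
    return (x - x.mean()) / (x.std() + 1e-12)

# Made-up pressure traces from two consecutive closing events.
rng = np.random.default_rng(0)
event_a = np.sin(np.linspace(0, 3, 200)) + 0.05 * rng.standard_normal(200)
event_b = np.sin(np.linspace(0, 3, 200)) + 0.40  # drifted behaviour

for name, f in [("raw", lambda x: x),
                ("z-normalised", znorm),
                ("first derivative", np.diff)]:
    print(name, "DTW:", dtw_distance(f(event_a), f(event_b)),
          "Bray-Curtis:", braycurtis(f(event_a), f(event_b)))
```

    The trade-off the abstract reports is visible in the code's structure: the DTW dynamic program is quadratic in sequence length, while Bray-Curtis is a single linear pass.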

    Leveraging Software Clones for Software Comprehension: Techniques and Practice

    This thesis explores two topics in clone analysis: detection and application. The main contribution in clone detection is a new clone detector built on mtreelib, a library developed specifically for this work. The library implements a general metric tree, a data structure specialised in partitioning metric spaces in order to accelerate common queries such as range queries and nearest-neighbour queries. This structure is used to build a clone detector that approximates the Levenshtein distance with high accuracy, and a small benchmark is presented to support that accuracy. Further results on metrics and on incremental clone detection are also presented. Several applications of the new clone detector are then described. First, an original algorithm for reconstructing information missing from version-control repositories is proposed and tested on data from several large existing systems. Second, a qualitative and quantitative analysis of Firefox, based on nearest-neighbour queries, shows the quantity of change between versions and the link between release cycle types and the number of bugs, highlighting the difficulties of transitioning from a slow to a rapid release cycle. An analysis crossing the results of pattern traversal, flow analysis and clone detection is also presented, showing the usefulness of clones in identifying security flaws. Finally, two industrial experiences in using and deploying a different clone detection technology, CLAN, are reported along with developers' perspectives; these cover C/C++, Java and TTCN-3, a language never before explored in clone detection, whose clone population is shown to differ greatly from that of well-known languages such as C/C++ and Java. The thesis concludes with a summary of the findings and some perspectives for future research.
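    The abstract does not show mtreelib itself, so the sketch below uses a BK-tree, a simpler relative of the general metric tree, to index code fragments under Levenshtein distance; like the metric tree, it exploits the triangle inequality to prune range queries for near-clones. All fragments and thresholds are made up for illustration.

```python
# Illustrative stand-in for a metric-tree clone index (not mtreelib): a BK-tree
# over Levenshtein distance, answering range queries via the triangle inequality.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self):
        self.root = None  # node = (item, {edge_distance: child_node})

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node, children = self.root
        while True:
            d = levenshtein(item, node)
            if d in children:
                node, children = children[d]
            else:
                children[d] = (item, {})
                return

    def within(self, query, radius):
        """All indexed items within `radius` edits of `query`."""
        out, stack = [], [self.root] if self.root else []
        while stack:
            node, children = stack.pop()
            d = levenshtein(query, node)
            if d <= radius:
                out.append(node)
            # Triangle inequality: only edges with d - r <= dist <= d + r can lead to matches.
            stack.extend(c for dist, c in children.items()
                         if d - radius <= dist <= d + radius)
        return out

tree = BKTree()
for fragment in ["for(i=0;i<n;i++)", "for(j=0;j<n;j++)", "while(n--)"]:
    tree.add(fragment)
print(tree.within("for(i=0;i<m;i++)", 2))  # near-clones within 2 edits
```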

    Handling Imbalanced Classes: Feature Based Variance Ranking Techniques for Classification

    Obtaining good predictions in the presence of imbalanced classes poses a significant challenge in the data science community. Imbalanced class data is a term used to describe datasets in which the classes or groups are unequal in size. In most real-life datasets one class is always larger than the others and is called the majority class, while the smaller classes are called minority classes. Even when classification accuracy is very high, the number of correctly classified minority instances is usually very small compared to the total number of minority instances in the dataset, and more often than not the minority classes are what is being sought. This work is specifically concerned with techniques that improve classification performance by eliminating or reducing the negative effects of class imbalance. Real-life datasets have been found to contain various types of error in combination with class imbalance; while those errors are easily corrected, solutions to class imbalance have remained elusive. Machine learning (ML) techniques have previously been used to address class imbalance, but they have notable shortcomings: they mostly involve fine-tuning and changing algorithm parameters, a process that is not standardised because of the countless numbers of algorithms and parameters, and the results obtained from these unstandardised techniques are inconsistent and cannot be replicated with similar datasets and algorithms. We present a novel technique for dealing with imbalanced classes, called variance ranking feature selection, that enables machine learning algorithms to classify more of the minority classes during classification, hence reducing the negative effects of class imbalance. Our approach exploits an intrinsic property of the data, the variance, which measures how the data items are spread within the dataset's vector space. We demonstrate the selection of features at different performance thresholds, providing an opportunity for performance and feature significance to be assessed and correlated at different levels of prediction. In our evaluation we compare our feature selections with some of the best-known feature selection techniques using proximity distance comparisons, and verify all results on different datasets, both binary and multi-class, with varying degrees of class imbalance. In all the experiments, the results show a significant improvement over previous work on class imbalance.
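    A simplified reading of "variance ranking" is sketched below (the paper's exact ranking procedure may differ): rank features by their variance, keep the top-ranked ones, and train a classifier on an imbalanced dataset; the synthetic data and the choice of k = 5 are made up for illustration.

```python
# Simplified sketch of variance-based feature ranking on an imbalanced dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
y = (rng.random(1000) < 0.05).astype(int)   # ~5% minority class
X[y == 1, :3] += 2.0                        # plant signal in the first 3 features

variances = X.var(axis=0)
top_k = np.argsort(variances)[::-1][:5]     # keep the 5 highest-variance features
print("selected feature indices:", top_k)

X_tr, X_te, y_tr, y_te = train_test_split(X[:, top_k], y, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```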

    A Hybrid Templated-Based Composite Classification System

    An automatic target classification system contains a classifier that reads a feature as input and outputs a class label. Typically, the feature is a vector of real numbers; other features can be non-numeric, such as a string of symbols from an alphabet. One method of improving the performance of an automatic classification system is to combine two or more independent classifiers that are complementary in nature. Complementary classifiers are obtained by finding an optimal method for partitioning the problem space: for example, the individual classifiers may operate to identify specific objects, or they may operate on different features. We propose a design for a hybrid composite classification system that exploits both real-numbered and non-numeric features with a template matching classification scheme. This composite classification system is made up of two independent classification systems, which receive input from two separate sensors and are then combined using various fusion methods for the purpose of target identification. By using these two separate classifiers, we explore conditions that allow the two techniques to be complementary in nature, thus improving the overall performance of the classification system. We examine various fusion techniques in search of the one that generates the best results, and we investigate different parameter spaces and fusion rules on example problems to demonstrate our classification system. Our examples consider various application areas to further demonstrate the utility of the classifier. Optimal classifier performance is obtained using a mathematical framework that takes into account decision variables based on decision-maker preferences and/or engineering specifications, depending on the classification problem at hand.
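    The abstract does not name its specific fusion rules, so the sketch below applies three common decision-level rules (sum, product, max) to posterior estimates from two hypothetical classifiers, one operating on numeric features and one on template matching; the posterior values are made up.

```python
# Hypothetical illustration of decision-level fusion over two classifiers'
# class posteriors; the specific rules here are not taken from the abstract.
import numpy as np

def fuse(p1, p2, rule="sum"):
    """Combine two arrays of class posteriors, shape (n_samples, n_classes)."""
    if rule == "sum":
        scores = p1 + p2
    elif rule == "product":
        scores = p1 * p2
    elif rule == "max":
        scores = np.maximum(p1, p2)
    else:
        raise ValueError(rule)
    return scores.argmax(axis=1)  # fused class decision per sample

# Posteriors from, say, a numeric-feature classifier and a template matcher.
p_numeric  = np.array([[0.6, 0.4], [0.2, 0.8], [0.55, 0.45]])
p_template = np.array([[0.9, 0.1], [0.4, 0.6], [0.30, 0.70]])

for rule in ("sum", "product", "max"):
    print(rule, fuse(p_numeric, p_template, rule))
```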

    A Topic Modeling approach for Code Clone Detection

    This thesis describes the potential benefits of Latent Dirichlet Allocation (LDA) as a technique for code clone detection. The objective is to propose a language-independent, effective, and scalable approach for identifying similar code fragments in relatively large software systems. The main assumption is that the latent topic structure of software artifacts gives an indication of the presence of code clones: artifacts with similar topic distributions are hypothesised to contain duplicated code fragments. To test this hypothesis, an experimental investigation using multiple datasets from various application domains was conducted. In addition, CloneTM, an LDA-based working prototype for code clone detection, was developed. Results showed that, if calibrated properly, topic modeling can deliver satisfactory performance in capturing different types of code clones, showing particularly good performance in detecting Type III clones. CloneTM also achieved levels of performance comparable to existing practical tools that adopt different clone detection strategies.
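    The general idea, not CloneTM itself, can be sketched as follows: tokenise code fragments, fit an LDA model over their token counts, and flag fragment pairs whose topic distributions are close; the fragments, topic count, and distance threshold below are all made up for illustration.

```python
# Illustrative sketch of LDA-based clone candidacy (not CloneTM): fragments
# with nearby topic distributions are flagged as possible clones.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.spatial.distance import jensenshannon

fragments = [
    "int sum = 0; for (int i = 0; i < n; i++) sum += a[i];",
    "int total = 0; for (int j = 0; j < n; j++) total += b[j];",  # near-clone
    'FILE *f = fopen(path, "r"); fread(buf, 1, len, f); fclose(f);',
]

counts = CountVectorizer(token_pattern=r"\w+").fit_transform(fragments)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics = lda.fit_transform(counts)  # one topic distribution per fragment

for i in range(len(fragments)):
    for j in range(i + 1, len(fragments)):
        d = jensenshannon(topics[i], topics[j])
        print(f"fragments {i} and {j}: JS distance {d:.3f}",
              "-> clone candidate" if d < 0.2 else "")
```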

    Some Relationships Between Sequences and Their Kmer Profiles

    This paper explores k-mer profiles in bioinformatics and two of their applications: one as a model for the reads in genome assembly, the other as a convenient representation of DNA sequences. K-mer profiles are simply unordered collections of the fixed-length substrings (of length k) of DNA sequences; they resemble an idealised form of the input genome assemblers receive, and they have been used in the literature as a fast way to approximate the otherwise expensive edit distance. The obvious question is the choice of k. Using the theory of metric embeddings, de Bruijn assembly, and to some extent algebra, the familiar conclusion for genome assembly is recovered: k should be as large as permitted. The conclusion for edit distance approximation is more subtle: small k loses nice mathematical properties while retaining good computational ones, whereas large k has good mathematical properties (with a proper metric distortion) but becomes computationally unattractive due to the curse of dimensionality.
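    A minimal sketch of the representation follows: build k-mer count profiles for two DNA sequences and use a profile distance as a cheap proxy for edit distance. The sequences are made up, and the paper's analysis of the choice of k is not reproduced, though the output hints at it, since a single substitution disturbs up to k overlapping k-mers.

```python
# Minimal sketch: k-mer count profiles and an L1 profile distance as a
# fast stand-in for edit distance.
from collections import Counter

def kmer_profile(seq, k):
    """Unordered multiset of all length-k substrings of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def profile_distance(p, q):
    """L1 distance between two k-mer profiles (Counter returns 0 for absent keys)."""
    return sum(abs(p[key] - q[key]) for key in set(p) | set(q))

a = "ACGTACGTGACG"
b = "ACGTACCTGACG"  # one substitution relative to a
for k in (2, 4, 6):
    d = profile_distance(kmer_profile(a, k), kmer_profile(b, k))
    print(f"k={k}: profile distance {d}")
```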

    Similarity-based user identification across social networks

    In this thesis we study the identifiability of users across social networks, using a trainable combination of different similarity metrics. This application is becoming particularly interesting as the number and variety of social networks increase and the presence of individuals in multiple networks becomes commonplace. Motivated by the need to verify information that appears in social networks, as addressed by the research project REVEAL (REVEALing hidden concepts in Social Media), the presence of individuals in different networks provides an interesting opportunity: we can use information from one network to verify information that appears in another. In order to achieve this, we need to identify users across networks. We approach this problem with a combination of similarity measures that take into account the users' affiliation, location, professional interests and past experience, as stated in the different networks. We experimented with a variety of combination approaches, ranging from simple averaging to trained hybrid models. Our experiments show that, under certain conditions, identification is possible with sufficiently high accuracy to support the goal of verification.
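    In outline, the combination the abstract describes might look like the sketch below: compute per-attribute similarities (affiliation, location, interests), then score a candidate pair first by simple averaging and then with a trained combiner. The profiles, similarity functions, and tiny training set are all hypothetical, chosen only to make the sketch self-contained.

```python
# Hedged sketch of combining per-attribute similarities for cross-network
# user identification: simple averaging vs. a trained logistic-regression combiner.
import numpy as np
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def string_sim(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def features(pa, pb):
    return [string_sim(pa["affiliation"], pb["affiliation"]),
            string_sim(pa["location"], pb["location"]),
            jaccard(pa["interests"], pb["interests"])]

p1 = {"affiliation": "NCSR Demokritos", "location": "Athens, Greece",
      "interests": {"nlp", "social media", "ml"}}
p2 = {"affiliation": "N.C.S.R. Demokritos", "location": "Athens",
      "interests": {"nlp", "ml", "databases"}}

x = features(p1, p2)
print("simple average score:", np.mean(x))

# Trained combiner: weights learned from labelled matching/non-matching
# pairs (a tiny made-up training set, purely for illustration).
X_train = np.array([[0.9, 0.8, 0.7], [0.2, 0.1, 0.0],
                    [0.8, 0.9, 0.5], [0.1, 0.3, 0.1]])
y_train = np.array([1, 0, 1, 0])
clf = LogisticRegression().fit(X_train, y_train)
print("trained match probability:", clf.predict_proba([x])[0, 1])
```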