
    Improving the Automatic Text Classification Algorithm of Siav, a Case Study

    Siav is an IT service company that provides products for electronic document management, workflow management, and the preservation of digital documents. One of its projects is to create an automatic text classifier suitable for use in business contexts. The primary aim of this thesis is to improve the accuracy and confidence reliability of the current text classifier using neural networks. To this end, the baseline implementation is analysed, and several approaches drawing on linguistic preprocessing and neural networks are proposed to address its limitations. The proposed techniques are then implemented, and their performance is compared against that of the existing solution. Finally, observations are made regarding the proposed solution and its suitability for business use compared to the existing one.

    Online Tool Condition Monitoring Based on Parsimonious Ensemble+

    Accurate diagnosis of tool wear in the metal turning process remains an open challenge for both scientists and industrial practitioners because of inhomogeneities in workpiece material, nonstationary machining settings to suit production requirements, and nonlinear relations between measured variables and tool wear. Common methodologies for tool condition monitoring still rely on batch approaches, which cannot cope with the fast sampling rate of the metal cutting process. Furthermore, they require a retraining process to be completed from scratch when dealing with a new set of machining parameters. This paper presents an online tool condition monitoring approach based on Parsimonious Ensemble+, pENsemble+. The unique feature of pENsemble+ lies in its highly flexible principle, where both the ensemble structure and the base-classifier structure can automatically grow and shrink on the fly based on the characteristics of data streams. Moreover, an online feature selection scenario is integrated to actively sample relevant input attributes. The paper presents an advancement of a newly developed ensemble learning algorithm, pENsemble+, in which an online active learning scenario is incorporated to reduce operator labelling effort. An ensemble merging scenario is proposed, which allows reduction of ensemble complexity while retaining its diversity. Experimental studies utilising real-world manufacturing data streams and comparisons with well-known algorithms were carried out. Furthermore, the efficacy of pENsemble was examined using benchmark concept drift data streams. It has been found that pENsemble+ incurs low structural complexity and results in a significant reduction of operator labelling effort. Comment: this paper has been published by IEEE Transactions on Cybernetics.

    Investigating Labelless Drift Adaptation for Malware Detection

    The evolution of malware has long plagued machine learning-based detection systems, as malware authors develop innovative strategies to evade detection and chase profits. This induces concept drift as the test distribution diverges from the training distribution, causing performance decay that requires constant monitoring and adaptation. In this work, we analyze the adaptation strategy used by DroidEvolver, a state-of-the-art learning system that self-updates using pseudo-labels to avoid the high overhead associated with obtaining a new ground truth. After removing sources of experimental bias present in the original evaluation, we identify a number of flaws in the generation and integration of these pseudo-labels, leading to a rapid onset of performance degradation as the model poisons itself. We propose DroidEvolver++, a more robust variant of DroidEvolver, to address these issues and highlight the role of pseudo-labels in addressing concept drift. We test the tolerance of the adaptation strategy to different degrees of pseudo-label noise and propose the adoption of methods to ensure that only high-quality pseudo-labels are used for updates. Ultimately, we conclude that the use of pseudo-labeling remains a promising solution to limitations on labeling capacity, but great care must be taken when designing update mechanisms to avoid negative feedback loops and self-poisoning, which have catastrophic effects on performance.
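    As a rough illustration of restricting updates to high-quality pseudo-labels, the sketch below keeps only predictions whose confidence clears a threshold. It is not the DroidEvolver++ mechanism, which the abstract does not specify; the function name `select_pseudo_labels`, the probability input, and the default threshold of 0.9 are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def select_pseudo_labels(
    probs: Sequence[float], threshold: float = 0.9
) -> List[Tuple[int, int]]:
    """Return (sample_index, pseudo_label) pairs for confident predictions only.

    probs[i] is a hypothetical model's estimated probability that sample i
    is malware (label 1). The threshold value is an illustrative assumption.
    """
    selected: List[Tuple[int, int]] = []
    for i, p in enumerate(probs):
        # Confidence in the predicted class, whichever class that is.
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            selected.append((i, 1 if p >= 0.5 else 0))
    return selected


if __name__ == "__main__":
    # Samples 1 and 3 are too uncertain and are left out of the update set.
    print(select_pseudo_labels([0.97, 0.55, 0.02, 0.80]))
```

    Samples that fail the threshold are simply withheld from the update, which limits how much label noise the model can feed back into itself.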

    Online Metric-Weighted Linear Representations for Robust Visual Tracking

    In this paper, we propose a visual tracker based on a metric-weighted linear representation of appearance. In order to capture the interdependence of different feature dimensions, we develop two online distance metric learning methods using proximity comparison information and structured output learning. The learned metric is then incorporated into a linear representation of appearance. We show that online distance metric learning significantly improves the robustness of the tracker, especially on those sequences exhibiting drastic appearance changes. In order to bound growth in the number of training samples, we design a time-weighted reservoir sampling method. Moreover, we enable our tracker to automatically perform object identification during the process of object tracking, by introducing a collection of static template samples belonging to several object classes of interest. Object identification results for an entire video sequence are achieved by systematically combining the tracking information and visual recognition at each frame. Experimental results on challenging video sequences demonstrate the effectiveness of the method for both inter-frame tracking and object identification. Comment: 51 pages. Appearing in IEEE Transactions on Pattern Analysis and Machine Intelligence.
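    To show how reservoir sampling bounds the number of stored training samples, here is a minimal sketch of the standard uniform variant (often called Algorithm R). The paper's time-weighted scheme is not described in the abstract and is not reproduced here; the function name `reservoir_sample` and its parameters are assumptions for illustration.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, rng: random.Random) -> List[T]:
    """Keep at most k items from a stream of unknown length.

    This is plain uniform reservoir sampling: after n items have been seen,
    each one is in the reservoir with probability k / n. A time-weighted
    variant would bias retention toward recent items instead.
    """
    reservoir: List[T] = []
    for n, item in enumerate(stream):
        if n < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (n + 1).
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
    print(len(sample))  # memory stays bounded at k items
```

    Memory use stays fixed at `k` samples however long the video sequence runs, which is the property the abstract relies on.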

    Fame for sale: efficient detection of fake Twitter followers

    Fake followers are those Twitter accounts specifically created to inflate the number of followers of a target account. Fake followers are dangerous for the social platform and beyond, since they may alter concepts like popularity and influence in the Twittersphere, hence impacting the economy, politics, and society. In this paper, we contribute along different dimensions. First, we review some of the most relevant existing features and rules (proposed by Academia and Media) for anomalous Twitter account detection. Second, we create a baseline dataset of verified human and fake follower accounts. This baseline dataset is publicly available to the scientific community. Then, we exploit the baseline dataset to train a set of machine-learning classifiers built over the reviewed rules and features. Our results show that most of the rules proposed by Media provide unsatisfactory performance in revealing fake followers, while features proposed in the past by Academia for spam detection provide good results. Building on the most promising features, we revise the classifiers both in terms of reduction of overfitting and cost for gathering the data needed to compute the features. The final result is a novel Class A classifier, general enough to thwart overfitting, lightweight thanks to the usage of the less costly features, and still able to correctly classify more than 95% of the accounts of the original training set. We ultimately perform an information fusion-based sensitivity analysis to assess the global sensitivity of each of the features employed by the classifier. The findings reported in this paper, other than being supported by a thorough experimental methodology and interesting on their own, also pave the way for further investigation on the novel issue of fake Twitter followers.