
    Searching for Needles in the Cosmic Haystack

    Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must exploit parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation focuses on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six classification algorithms with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with the Synthetic Minority Oversampling TEchnique (SMOTE) were the most effective; they identified additional known pulsars not in the benchmark, with fewer false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution of RandomForest by 54% on average with less than a 2% average reduction in classification performance. Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks dominated by non-pulsar signals (>95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) achieved high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection.
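
    The SMOTE-plus-RandomForest combination reported as most effective is a standard imbalance-aware pipeline. Below is a minimal sketch using scikit-learn and imbalanced-learn; the synthetic data is a stand-in for the extracted DPG features, not the dissertation's benchmark.

```python
# Minimal sketch (not the dissertation's code): SMOTE oversampling feeding a
# RandomForest, the combination the abstract reports as most effective.
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the DPG feature matrix: ~1% minority (pulsar) class.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.99],
                           random_state=0)
clf = Pipeline([
    ("smote", SMOTE(random_state=0)),  # oversample the rare class at fit time
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
# Recall on the rare class is the metric the dissertation emphasizes.
print(cross_val_score(clf, X, y, cv=5, scoring="recall_macro").mean())
```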

    Ageing Analysis of Embedded SRAM on a Large-Scale Testbed Using Machine Learning

    Ageing detection and failure prediction are essential in many Internet of Things (IoT) deployments, which operate huge quantities of embedded devices unattended in the field for years. In this paper, we present a large-scale empirical analysis of natural SRAM wear-out using 154 boards from a general-purpose testbed. Starting from SRAM initialization bias, which each node can easily collect at startup, we apply various metrics for feature extraction and experiment with common machine learning methods to predict the age of operation for each node. Our findings indicate that even though ageing impacts are subtle, our indicators can estimate usage times well, with an R^2 score of 0.77 and a mean error of 24% using regressors, and with an F1 score above 0.6 for classifiers applying a six-month resolution.
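
    As a hedged illustration of the regression step, the sketch below predicts a board's operating age from features derived from SRAM startup bias. The feature set, model choice, and data here are invented stand-ins, not the paper's.

```python
# Illustrative sketch only: regress board age on hypothetical
# SRAM-startup-bias features; data and feature names are synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(154, 6))      # e.g. bias ratio, bit-flip entropy,
                                   # Hamming-weight statistics (assumed)
age = rng.uniform(0, 6, size=154)  # years of operation (synthetic labels)
X_tr, X_te, y_tr, y_te = train_test_split(X, age, random_state=0)
reg = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
print("R^2:", r2_score(y_te, reg.predict(X_te)))
```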

    Algorithms for Multiclass Classification and Regularized Regression


    Tiny Classifier Circuits: Evolving Accelerators for Tabular Data

    A typical machine learning (ML) development cycle for edge computing is to maximise the performance during model training and then minimise the memory/area footprint of the trained model for deployment on edge devices targeting CPUs, GPUs, microcontrollers, or custom hardware accelerators. This paper proposes a methodology for automatically generating predictor circuits for classification of tabular data with comparable prediction performance to conventional ML techniques while using substantially fewer hardware resources and power. The proposed methodology uses an evolutionary algorithm to search over the space of logic gates and automatically generates a classifier circuit with maximised training prediction accuracy. Classifier circuits are so tiny (i.e., consisting of no more than 300 logic gates) that they are called "Tiny Classifier" circuits, and can be implemented efficiently as an ASIC or on an FPGA. We empirically evaluate the automatic Tiny Classifier circuit generation methodology, or "Auto Tiny Classifiers", on a wide range of tabular datasets, and compare it against conventional ML techniques such as Amazon's AutoGluon, Google's TabNet, and a neural search over Multi-Layer Perceptrons. Despite Tiny Classifiers being constrained to a few hundred logic gates, we observe no statistically significant difference in prediction performance in comparison to the best-performing ML baseline. When synthesised as a silicon chip, Tiny Classifiers use 8-18x less area and 4-8x less power. When implemented as an ultra-low cost chip on a flexible substrate (i.e., FlexIC), they occupy 10-75x less area and consume 13-75x less power compared to the most hardware-efficient ML baseline. On an FPGA, Tiny Classifiers consume 3-11x fewer resources.
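
    To make the search concrete, here is a toy sketch of evolving a circuit over a small gate set with a simple (1+1) strategy. The gate set, circuit encoding, and evolutionary loop are illustrative assumptions, not the paper's algorithm, and the data is a synthetic binarized toy task.

```python
# Toy sketch: evolve a feed-forward logic-gate circuit whose last gate output
# classifies binarized tabular inputs. Encoding and loop are assumptions.
import random

GATES = {"AND": lambda a, b: a & b, "OR": lambda a, b: a | b,
         "XOR": lambda a, b: a ^ b, "NAND": lambda a, b: 1 - (a & b)}

def random_circuit(n_inputs, n_gates=20):
    # Gate i may read from the inputs or from any earlier gate's output.
    return [(random.choice(list(GATES)),
             random.randrange(n_inputs + i), random.randrange(n_inputs + i))
            for i in range(n_gates)]

def evaluate(circ, bits):
    vals = list(bits)
    for g, a, b in circ:
        vals.append(GATES[g](vals[a], vals[b]))
    return vals[-1]  # last gate output is the class prediction

def fitness(circ, X, y):
    return sum(evaluate(circ, x) == t for x, t in zip(X, y)) / len(y)

def mutate(circ, n_inputs):
    circ = list(circ)
    i = random.randrange(len(circ))
    circ[i] = (random.choice(list(GATES)),
               random.randrange(n_inputs + i), random.randrange(n_inputs + i))
    return circ

# (1+1) evolution on a toy task: learn XOR of the first two of 4 binary features.
X = [[random.randint(0, 1) for _ in range(4)] for _ in range(200)]
y = [x[0] ^ x[1] for x in X]
best = random_circuit(4)
best_fit = fitness(best, X, y)
for _ in range(2000):
    cand = mutate(best, 4)
    f = fitness(cand, X, y)
    if f >= best_fit:
        best, best_fit = cand, f
print("training accuracy:", best_fit)
```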

    Supervised text classification in media studies: using topic modeling and structural features to support BERT

    This thesis presents the development process of a supervised machine learning text classifier from the perspective of media studies, bridging the gap between qualitative and quantitative research. Due to the recent datafication of our social environment, there has been growing interest in combining the two methodologies: modern machine learning makes it possible to gain insights from datasets far too large for close reading alone, and supervised document classification offers a natural way to harness a media researcher's domain expertise for broad computational analysis. A neural-network-based document classifier is developed and used to compare how well different feature sets extracted from journalistic texts model framing tactics and topics of interest to media researchers. The datasets used in development were annotated as part of two media research projects. The first studies the ways in which the countermedia outlet MV-lehti reframes mainstream media articles; it comprises 37,185 MV-lehti articles annotated with three distinct framing tactics (Toivanen et al. 2021) that the classifier is to recognize automatically. The second centers on the alcohol policy debate in the mainstream media, for which a corpus of 33,902 articles was collected from Yle, Iltalehti, and STT news (the ongoing "Vallan virrat" research project); here the classifier's task is to identify articles containing discussion of alcohol policy. The aim of the work is to determine which textual features are best suited to each classification task and which yield the highest accuracy. As classification features, the thesis uses sentence-level contextual representations extracted from the finBERT language model, article-level topic distributions produced by topic models trained on the data, and the structural feature set of Toivanen et al. 2021, consisting of markup properties of the articles such as boldface, html tags, distances between tags, and image sizes. Preliminary experiments with a purely contextual classifier were promising but did not reach the required accuracy, motivating the question of whether combining feature sets improves performance. The hypothesis is plausible: BERT-based embeddings encode linguistic and distributional information from sequences a few sentences long, whereas a topic model carries broader structural information, so the two should complement each other in article-level classification; combining contextual information with topic models has recently yielded improvements in various text classification benchmarks and applications (Peinelt et al. 2020, Glazkova 2021). In this work, however, combining contextual embeddings with topic information produced only marginal improvements, and only in certain settings; in most cases the combination was detrimental to performance due to increased noise in the classification features. Nevertheless, various combinations of BERT-based embeddings, topics, and structural features outperformed purely BERT-based classification in many subtasks, and potential avenues for further improving classification accuracy are identified. Based on the experiments, automated frame analysis with neural classifiers is feasible, but classifier accuracy and interpretability still need improvement, and the classifiers are not yet accurate enough for conclusions requiring high certainty.
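
    A minimal sketch of the feature-combination idea, with hypothetical dimensions: a document vector is built by concatenating a BERT embedding, a topic distribution, and structural features, then fed to a simple classifier. The thesis uses a neural classifier; logistic regression merely keeps the sketch short.

```python
# Hedged sketch of combining contextual, topic, and structural features into
# one document vector. All shapes and data are synthetic stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

n = 200
bert_vecs = np.random.randn(n, 768)               # e.g. 768-dim finBERT [CLS]
topic_dists = np.random.dirichlet(np.ones(50), n) # per-article topic proportions
structural = np.random.randn(n, 10)               # tag distances, image sizes, ...

# Concatenate the three feature sets per article.
X = np.hstack([bert_vecs, topic_dists, structural])
y = np.random.randint(0, 2, n)                    # e.g. framing tactic present?
clf = LogisticRegression(max_iter=1000).fit(X, y)
```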

    Tree-structured multiclass probability estimators

    Nested dichotomies are a method of transforming a multiclass classification problem into a series of binary problems. A binary tree structure is constructed over the label space that recursively splits the set of classes into subsets, and a binary classification model learns to discriminate between the two subsets of classes at each node. Several distinct nested dichotomy structures can be combined in an ensemble for superior performance. In this thesis, we introduce two new methods for constructing more accurate nested dichotomies. Random-pair selection is a subset selection method that aims to group similar classes together in a non-deterministic fashion, making it easy to construct accurate ensembles. Multiple subset evaluation takes this and other subset selection methods further by evaluating several candidate splits and choosing the best-performing one. Finally, we also discuss the calibration of the probability estimates produced by nested dichotomies. We observe that nested dichotomies systematically produce under-confident predictions, even if the binary classifiers are well calibrated, especially when the number of classes is high. Furthermore, substantial performance gains can be made when probability calibration methods are also applied to the internal models.
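
    To illustrate how a nested dichotomy turns binary estimates into multiclass probabilities: a class's probability is the product of the branch probabilities along its path from the root. The tree and the fixed branch probabilities in this sketch are made up for demonstration.

```python
# Toy nested dichotomy over classes {a, b, c}: root splits {a} vs {b, c}.
# Each internal node's binary model returns P(left subset | x).
def class_probability(node, cls, x):
    if node["classes"] == {cls}:
        return 1.0  # reached the leaf for this class
    p_left = node["model"](x)
    if cls in node["left"]["classes"]:
        return p_left * class_probability(node["left"], cls, x)
    return (1 - p_left) * class_probability(node["right"], cls, x)

leaf = lambda s: {"classes": s}
tree = {"classes": {"a", "b", "c"}, "model": lambda x: 0.7,
        "left": leaf({"a"}),
        "right": {"classes": {"b", "c"}, "model": lambda x: 0.4,
                  "left": leaf({"b"}), "right": leaf({"c"})}}
for c in "abc":
    print(c, class_probability(tree, c, None))
# a = 0.7, b = 0.3 * 0.4 ~ 0.12, c = 0.3 * 0.6 ~ 0.18; the three sum to 1.
```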

    Machine learning-based detection and mapping of riverine litter utilizing Sentinel-2 imagery

    Despite the substantial contribution of rivers to the global marine litter problem, riverine litter has received inadequate attention. Our objective was therefore to detect riverine litter using medium-resolution multispectral satellite images and machine learning (ML), with the Tisza River (Hungary) as the study area. Very High Resolution (VHR) images from the Google Earth database were used to identify riverine litter spots (a blend of anthropogenic and natural material). These litter spots served as the basis for training and validating five supervised machine learning algorithms on Sentinel-2 images: Artificial Neural Network (ANN), Support Vector Classifier (SVC), Random Forest (RF), Naïve Bayes (NB), and Decision Tree (DT). To evaluate the generalization capability of the developed models, they were tested on larger unseen data under varying hydrological conditions and with different litter sizes. In addition, the best-performing model was used to investigate the spatio-temporal variation of riverine litter in the Middle Tisza. Almost all the developed models showed favorable metrics on the validation dataset (F1-score: SVC 0.94, ANN 0.93, RF 0.91, DT 0.90, NB 0.83); however, during testing they showed medium (F1-score: RF 0.69, SVC 0.62, ANN 0.62) to poor performance (F1-score: NB 0.48, DT 0.45). The ability of all models to detect litter was limited by the pixel size of the Sentinel-2 images. The spatio-temporal investigation showed that hydraulic structures (e.g., the Kisköre Dam) are the greatest litter accumulation spots. Although litter transport rates are highest during floods, the largest litter spot area upstream of the Kisköre Dam was observed at low stages in summer. This study represents a preliminary step toward automatic detection of riverine litter; further research incorporating a larger dataset with more representative small litter spots, as well as finer spatial resolution imagery, is necessary.
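
    A hedged sketch of the model-comparison step described above: training the five classifier families on per-pixel features (e.g., Sentinel-2 band values) and comparing validation F1 scores. The data below is synthetic, not the Tisza River dataset.

```python
# Illustrative comparison of the five classifier families on synthetic,
# imbalanced pixel data standing in for Sentinel-2 band features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=12, weights=[0.9],
                           random_state=0)  # litter pixels are the rare class
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)
models = {"ANN": MLPClassifier(max_iter=500), "SVC": SVC(),
          "RF": RandomForestClassifier(), "NB": GaussianNB(),
          "DT": DecisionTreeClassifier()}
for name, m in models.items():
    m.fit(X_tr, y_tr)
    print(name, round(f1_score(y_va, m.predict(X_va)), 2))
```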