    Combining Labelled and Unlabelled Data in the Design of Pattern Classification Systems

    There has been much interest in applying techniques that incorporate knowledge from unlabelled data into a supervised learning system, but less effort has been made to compare the effectiveness of different approaches on real-world problems and to analyse the behaviour of the learning system when using different amounts of unlabelled data. In this paper, an analysis of the performance of supervised methods augmented with unlabelled data and of some semi-supervised approaches using different ratios of labelled to unlabelled samples is presented. The experimental results show that, when supported by unlabelled samples, much less labelled data is generally required to build a classifier without compromising the classification performance. If only a very limited amount of labelled data is available, the results show high variability, and the performance of the final classifier depends more on how reliable the labelled data samples are than on the use of additional unlabelled data. Semi-supervised clustering utilising both labelled and unlabelled data has been shown to offer the most significant improvements when natural clusters are present in the considered problem.
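    As a hedged illustration of the kind of comparison the paper describes, the sketch below trains a purely supervised baseline and a self-training semi-supervised classifier at several labelled-to-unlabelled ratios. The dataset, base classifier, and ratios are illustrative assumptions, not the paper's actual experimental setup; scikit-learn's SelfTrainingClassifier stands in for the semi-supervised approaches studied.

        # Sketch: supervised baseline vs. self-training at different
        # labelled fractions (illustrative setup, not the paper's own).
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split
        from sklearn.semi_supervised import SelfTrainingClassifier

        X, y = make_classification(n_samples=2000, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        for labelled_fraction in (0.05, 0.1, 0.25, 0.5):
            rng = np.random.default_rng(0)
            unlabelled = rng.random(len(y_tr)) > labelled_fraction
            y_partial = y_tr.copy()
            y_partial[unlabelled] = -1  # -1 marks unlabelled samples

            baseline = LogisticRegression(max_iter=1000).fit(
                X_tr[~unlabelled], y_tr[~unlabelled])
            semi = SelfTrainingClassifier(
                LogisticRegression(max_iter=1000)).fit(X_tr, y_partial)

            print(f"{labelled_fraction:.2f} labelled: "
                  f"supervised={baseline.score(X_te, y_te):.3f} "
                  f"semi-supervised={semi.score(X_te, y_te):.3f}")

    Consistent with the abstract's finding, runs with very small labelled fractions tend to vary strongly with which samples happen to be labelled.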

    Automation and orchestration of hardware and firmware data mining using a smart data analytics platform

    Effective data mining is going to be important for differentiating and succeeding in the digital economy, especially with increased commoditization and a reduced barrier to entry for infrastructure devices like servers, storage and networking systems. There is a lot of telemetry data from manufacturing facilities and customers that can be used to drive an improved supportability experience and unmatched product quality and reliability for infrastructure devices like servers and storage devices. Currently, data mining of hardware, firmware and platform logs is a challenging task, as the domain knowledge is complex and the expertise of a large multinational organization is distributed across the world. With increasing complexity, and with data mining continuing to be a very time-consuming task that requires math/statistics skills, diverse programming and machine learning skills, and cross-domain knowledge, it is important to look at a next-generation analytics solution tailored to infrastructure vendors to improve supportability, quality, reliability, performance and security. In this publication we propose a smart, automated and generic data analytics platform that enables a 24/7 data mining solution using a built-in platform domain modeler, an expert system for analyzing hardware and firmware logs, and a policy manager that allows user-defined hypotheses to be verified round the clock based on policies and configurable triggers. This smart data analytics platform will help democratize data mining of hardware and firmware logs, help troubleshoot complex issues, improve the supportability experience, reliability and quality, and reduce warranty costs.
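    The policy-manager component lends itself to a small sketch: user-defined hypotheses are registered as policies and re-evaluated against incoming log records whenever a configurable trigger fires. Everything below is a hypothetical illustration of that idea; the class names, fields, and the ECC-error rule are not the platform's actual API.

        # Sketch: policies wrap user-defined hypotheses and are re-run on
        # configurable time triggers (hypothetical names, not the real API).
        import time
        from dataclasses import dataclass
        from typing import Callable, Iterable

        @dataclass
        class Policy:
            name: str
            hypothesis: Callable[[Iterable[dict]], bool]  # user-defined check
            interval_seconds: int                         # configurable trigger
            last_run: float = 0.0

        class PolicyManager:
            def __init__(self, policies):
                self.policies = policies

            def run_due_policies(self, log_records):
                """Evaluate every policy whose trigger interval has elapsed."""
                now = time.time()
                for policy in self.policies:
                    if now - policy.last_run >= policy.interval_seconds:
                        policy.last_run = now
                        if policy.hypothesis(log_records):
                            print(f"policy {policy.name}: hypothesis confirmed")

        # Hypothetical hypothesis: correctable memory errors exceed a threshold.
        manager = PolicyManager([
            Policy("ecc-error-rate",
                   lambda logs: sum(r.get("ecc_errors", 0) for r in logs) > 100,
                   interval_seconds=3600),
        ])
        manager.run_due_policies([{"ecc_errors": 150}])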

    Building nonlinear data models with self-organizing maps

    We study the extraction of nonlinear data models in high-dimensional spaces with modified self-organizing maps. Our algorithm maps a lower-dimensional lattice into a high-dimensional space without topology violations by tuning the neighborhood widths locally. The approach is based on a new principle exploiting the specific dynamical properties of the first-order phase transition induced by the noise of the data. The performance of the algorithm is demonstrated for one- and two-dimensional principal manifolds and for sparse data sets.
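    The ingredient the abstract highlights, a neighborhood width that can differ from node to node, is easy to show in a generic self-organizing-map update. The sketch below is a minimal one-dimensional illustration with a per-node sigma; the update rule, learning rate, and width values are standard SOM defaults, not the authors' exact algorithm, which adapts these widths during training.

        # Sketch: one SOM update step where each lattice node carries its
        # own neighborhood width sigma (generic illustration only; the
        # paper's algorithm tunes the widths to avoid topology violations).
        import numpy as np

        rng = np.random.default_rng(0)
        lattice = np.arange(20)               # 1-D lattice of 20 nodes
        weights = rng.normal(size=(20, 3))    # codebook vectors in R^3
        sigma = np.full(20, 3.0)              # per-node neighborhood widths

        def som_step(x, lr=0.1):
            winner = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighborhood centered on the winner, using its local width.
            h = np.exp(-(lattice - winner) ** 2 / (2.0 * sigma[winner] ** 2))
            weights[:] += lr * h[:, None] * (x - weights)

        for x in rng.normal(size=(1000, 3)):  # stream of data points
            som_step(x)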

    Effective Knowledge Representation Through Data Modelling Approaches

    Data modelling can be seen as a form of knowledge representation, in that the two share the same philosophical assumptions. In the data modelling process, recognising the philosophical background of human inquiry and the nature of knowledge is important for appreciating the problems involved, as different ontological views lead to different conceptions of data models. Recognising and incorporating different forms of organisational knowledge are also important in the data modelling process, since a data model is a formal representation of some subset of the knowledge that the organisation needs to carry out its business. This paper discusses the two distinct philosophical foundations for the effective representation of organisational knowledge.

    Designing a customer data model and defining customer master data in a Finnish SaaS company

    In this study, a logical customer data model is designed and the customer master data in that model is defined for a case company. During the process of defining customer data and the customer itself, a business glossary is compiled to give clear definitions of a customer and to unify the vocabulary across the case company. Defining the important vocabulary provides the basis for defining the customer data and customer master data. In addition, quality aspects are studied for ensuring high-quality customer data in the future. This study aims to understand what customer data is in the case company and to model it as a logical data model that unifies siloed operations, systems, and data. The case company is a Finnish Software as a Service company. It is in the middle of a merging process due to recent company acquisitions and wants to have common customer data and customer master data; currently it does not have master data defined. It is important to identify which data is critical to the business so that the case company can have one truth and development activities can be targeted in the right direction to ensure the greatest advantage. The research method of this study is design research. The empirical part of this study is done in two rounds of workshops. The first round analyses the current situation based on the processes and the different functions in the company that work with customer data; its outcome is the customer terminology with its definitions and the customer data model. The second round concentrates on iterative development of the terminology and customer data model, and on further identifying the development needs, restrictions, and possibilities of having a common customer data model and master data. After the workshops, the terminology and data model are developed with internal experts. Lastly, there is a review event where the participants can see and comment on the designed customer data model and the identified customer master data. After this study, the case company has a clear definition of what a customer is and how it should be modelled in a logical data model in the future to have one common customer data structure that unifies the case company. The case company also has the most important, necessary, common customer data, the customer master data, defined. The next step after this study is to plan the implementation of the customer data designs of this study, taking into account the quality principles that were defined in this study to support the sustainability of the designs.
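    As a hedged illustration of the distinction the study draws between shared master data and system-specific customer data, the toy sketch below separates a master-data core from data that individual systems may hold. All entities and fields are hypothetical; the thesis does not publish the case company's actual model.

        # Sketch: a toy logical customer model with a master-data core
        # (hypothetical entities and fields, not the company's real model).
        from dataclasses import dataclass, field

        @dataclass
        class GlossaryTerm:
            term: str
            definition: str        # agreed wording from the business glossary

        @dataclass
        class CustomerMasterData:
            customer_id: str       # the single identifier shared across systems
            legal_name: str
            business_id: str       # e.g. a Finnish Y-tunnus

        @dataclass
        class Customer:
            master: CustomerMasterData
            contacts: list = field(default_factory=list)       # system-specific
            subscriptions: list = field(default_factory=list)  # system-specific

        glossary = [GlossaryTerm(
            "customer", "an organisation with at least one agreement with us")]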

    Synchronic Curation for Assessing Reuse and Integration Fitness of Multiple Data Collections

    Data-driven applications often require using data integrated from different, large, and continuously updated collections. Each of these collections may present gaps or overlapping data, contain conflicting information, or complement the others. Thus, a curation need is to continuously assess whether data from multiple collections are fit for integration and reuse. To assess different large data collections at the same time, we present the Synchronic Curation (SC) framework. SC involves processing steps to map the different collections to a unifying data model that represents research problems in a scientific area. The data model, which includes the collections' provenance and a data dictionary, is implemented in a graph database where collections are continuously ingested and can be queried. SC has a collection analysis and comparison module to track updates and to identify gaps, changes, and irregularities within and across collections. Assessment results can be accessed interactively through a web-based interactive graph. In this paper we introduce SC as an interdisciplinary enterprise and illustrate its capabilities through its implementation in ASTRIAGraph, a space sustainability knowledge system.
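    A hedged sketch of the core idea, mapping records from several collections into one provenance-tagged graph and flagging disagreements, is shown below. networkx stands in for the graph database, and the node and field names are illustrative; the SC framework's actual schema and ASTRIAGraph's data model are not reproduced here.

        # Sketch: ingest two collections into one graph with provenance,
        # then flag conflicting assertions about the same object
        # (networkx stands in for the graph database; schema is illustrative).
        import networkx as nx

        graph = nx.MultiDiGraph()

        def ingest(collection_name, records):
            for record in records:
                obj = record["object_id"]
                graph.add_node(obj)
                # One edge per assertion, tagged with its source collection.
                graph.add_edge(obj, record["orbit_class"],
                               key=collection_name, provenance=collection_name)

        ingest("catalog_A", [{"object_id": "sat-1", "orbit_class": "LEO"}])
        ingest("catalog_B", [{"object_id": "sat-1", "orbit_class": "GEO"}])

        # Conflict check: the same object asserted with different values.
        for obj in ["sat-1"]:
            claims = {v: d["provenance"]
                      for _, v, d in graph.out_edges(obj, data=True)}
            if len(claims) > 1:
                print(f"{obj}: conflicting assertions {claims}")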

    A Resource-Aware and Time-Critical IoT Framework

    Internet of Things (IoT) systems produce a great amount of data but usually have insufficient resources to process it at the edge. Several time-critical IoT scenarios have emerged and created the challenge of supporting low-latency applications. At the same time, cloud computing has become a success at delivering computing as a service at an affordable price with great scalability and high reliability. We propose an intelligent resource allocation system that optimally selects the important IoT data streams to transfer to the cloud for processing. The optimization runs on utility functions computed by predictor algorithms that forecast future events with some probabilistic confidence based on a dynamically recalculated data model. We investigate ways of reducing the upload bandwidth of IoT video streams in particular and propose techniques to compute the corresponding utility functions. We built a prototype for a smart squash court and simulated multiple courts to measure the efficiency of dynamic allocation of network and cloud resources for event detection during squash games. By continuously adapting to the observed system state and maximizing the expected quality of detection within the resource constraints, our system can save up to 70% of the resources compared to the naive solution.
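    The selection step can be sketched as a budgeted utility maximization: each stream's utility comes from a predictor's event-probability forecast, and streams are admitted under an upload-bandwidth budget. The greedy utility-per-bandwidth rule and all the numbers below are illustrative assumptions; the paper's optimizer and utility functions may differ.

        # Sketch: choose which video streams to upload by expected detection
        # utility per unit of bandwidth, under a fixed upload budget
        # (greedy heuristic for illustration; not the paper's exact optimizer).
        from dataclasses import dataclass

        @dataclass
        class Stream:
            name: str
            event_probability: float   # predictor's forecast for this court
            bandwidth_mbps: float      # upload cost if selected

        def select_streams(streams, budget_mbps):
            chosen, used = [], 0.0
            ranked = sorted(streams,
                            key=lambda s: s.event_probability / s.bandwidth_mbps,
                            reverse=True)
            for stream in ranked:
                if used + stream.bandwidth_mbps <= budget_mbps:
                    chosen.append(stream.name)
                    used += stream.bandwidth_mbps
            return chosen

        courts = [Stream("court-1", 0.9, 4.0), Stream("court-2", 0.2, 4.0),
                  Stream("court-3", 0.6, 2.0)]
        print(select_streams(courts, budget_mbps=6.0))  # ['court-3', 'court-1']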

    Modeling Black Literature: Behind the Screen with the Black Bibliography Project

    The Black Bibliography Project (BBP) plans to produce a bibliographic database of printed works by Black writers from the eighteenth to the twenty-first centuries. With the support of the Beinecke Library and a grant from the Mellon Foundation, project co-PIs and codirectors Jacqueline Goldsby and Meredith McGill collaborated with a team of librarians from Yale to develop the data model for their database. Drawing on Beinecke’s James Weldon Johnson Memorial Collection for case studies, the team of librarians developed a Linked Data model for BBP in an instance of Wikibase and trained and supported a group of graduate student bibliographers in a pilot phase of data entry. This essay details our collaboration with the BBP codirectors and other contributing faculty, as well as our training of the graduate student bibliographers. It also explores how a project conceived as a scholarly intervention additionally became an intervention in the historic inequalities and gaps in cataloging description and access.