222 research outputs found

    Set2Box: Similarity Preserving Representation Learning of Sets

    Sets have been used for modeling various types of objects (e.g., a document as the set of keywords in it and a customer as the set of items that she has purchased). Measuring similarity (e.g., the Jaccard index) between sets has been a key building block of a wide range of applications, including plagiarism detection, recommendation, and graph compression. However, as sets have grown in number and size, the computational cost and storage required for set similarity computation have become substantial, and this has led to the development of hashing- and sketching-based solutions. In this work, we propose Set2Box, a learning-based approach for compressed representations of sets from which various similarity measures can be estimated accurately in constant time. The key idea is to represent sets as boxes to precisely capture overlaps of sets. Additionally, based on the proposed box quantization scheme, we design Set2Box+, which yields more concise yet more accurate box representations of sets. Through extensive experiments on 8 real-world datasets, we show that, compared to baseline approaches, Set2Box+ is (a) Accurate: achieving up to 40.8X smaller estimation error while requiring 60% fewer bits to encode sets, (b) Concise: yielding up to 96.8X more concise representations with similar estimation error, and (c) Versatile: enabling the estimation of four set-similarity measures from a single representation of each set. Comment: Accepted by ICDM 202
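The idea of estimating set similarity from box overlaps can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as hand-picked (lower corner, upper corner) pairs; the function names are hypothetical, and the learning and quantization steps of Set2Box are not reproduced here. The point is only that a Jaccard-style ratio can be read off two boxes at a cost that depends on the box dimension, not on the set sizes.

```python
from typing import Sequence, Set, Tuple

# A box is a (lower corner, upper corner) pair; this layout is an assumption
# made for illustration, not the representation used in the paper.
Box = Tuple[Sequence[float], Sequence[float]]


def jaccard(a: Set, b: Set) -> float:
    """Exact Jaccard index |A ∩ B| / |A ∪ B| of two sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; empty boxes have volume 0."""
    lo, hi = box
    volume = 1.0
    for l, h in zip(lo, hi):
        volume *= max(h - l, 0.0)
    return volume


def box_intersection(p: Box, q: Box) -> Box:
    """Intersection of two axis-aligned boxes (possibly empty)."""
    lo = [max(a, b) for a, b in zip(p[0], q[0])]
    hi = [min(a, b) for a, b in zip(p[1], q[1])]
    return lo, hi


def box_jaccard(p: Box, q: Box) -> float:
    """Jaccard-style ratio computed from box volumes instead of set sizes."""
    inter = box_volume(box_intersection(p, q))
    union = box_volume(p) + box_volume(q) - inter
    return inter / union if union > 0 else 0.0
```

For example, `box_jaccard(([0.0, 0.0], [2.0, 2.0]), ([1.0, 0.0], [3.0, 2.0]))` compares two boxes of volume 4 that overlap in a region of volume 2.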

    Performance Study of Cryptography based Dynamic Multi-Keyword Searchable Security Algorithm in Cloud Using CRSA /B+ Tree

    Today, cloud computing is a buzzword in the IT industry. The cloud, a shared pool of computing resources, allows access to needed resources on demand through the internet and web applications. Since data is outsourced to a third party, users need to maintain accountability for their data in the cloud. Hence, preserving the confidentiality of sensitive data in the cloud and securing it is a major concern. Many cryptographic techniques have been proposed by researchers to assure the confidentiality of the user's data in the cloud. The challenging task, however, is to provide secure search over this encrypted data so that the relevant data can be retrieved. Hence, we propose a system for secure search over encrypted data in the cloud that preserves its confidentiality. In our system, a novel approach is taken using the Commutative-RSA algorithm, a cryptographic technique in which dual encryption takes place, thus reducing the overall computation overhead. The search operation over the encrypted data is based on a tree search algorithm that supports multi-keyword search. Based on the relevance score, the most appropriate data is retrieved by the search operation. With this approach, no information is leaked when users search the encrypted data, and queries are handled efficiently. Finally, we demonstrate the effectiveness and efficiency of the proposed schemes through extensive experimental evaluation.
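The abstract does not spell out the Commutative-RSA construction, so the following is only a sketch of the textbook property the name suggests: two RSA key pairs that share one modulus can be applied in either order with the same result. The primes, exponents, and function names are illustrative assumptions, and the toy parameters are far too small to be secure. The modular inverse via `pow(e, -1, phi)` needs Python 3.8 or later.

```python
# Toy parameters (assumed for illustration only; not secure).
P, Q = 61, 53
N = P * Q                  # shared modulus
PHI = (P - 1) * (Q - 1)    # Euler's totient of N

# Two parties pick different public exponents over the same modulus.
E1, E2 = 17, 7
D1 = pow(E1, -1, PHI)      # private exponent matching E1
D2 = pow(E2, -1, PHI)      # private exponent matching E2


def encrypt(message: int, exponent: int) -> int:
    """Textbook RSA encryption: message^exponent mod N."""
    return pow(message, exponent, N)


def decrypt(ciphertext: int, exponent: int) -> int:
    """Textbook RSA decryption with the matching private exponent."""
    return pow(ciphertext, exponent, N)


def double_encrypt(message: int, first: int, second: int) -> int:
    """Apply two encryption layers in the given order."""
    return encrypt(encrypt(message, first), second)
```

Because (m^a)^b and (m^b)^a are equal modulo N, `double_encrypt(m, E1, E2)` and `double_encrypt(m, E2, E1)` give the same ciphertext, and the two layers can be removed in either order.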

    Sensors and Systems for Indoor Positioning

    This reprint collects the articles that appeared in the Sensors (MDPI) Special Issue “Sensors and Systems for Indoor Positioning”. The published original contributions focused on systems and technologies to enable indoor applications.

    Supplier Data Analysis and Utilization in Supply Chain Management : Case ABB Smart Power

    The subject of this thesis is the use of supplier data analysis for supply chain management at ABB Smart Power. The research problem is the lack of information on the supply chain. Suppliers and materials are divided into three categories based on their relationship to the supply chain. The main information to be learned from this data is material consumption, material spend, material volume, and the movement of materials in the supplier network. The purpose of this thesis is to build a supplier data analysis system that calculates and visualizes this information for the supply chain management team. The thesis excludes electronic components due to their volatility in the market. In addition, the k-nearest neighbours machine learning algorithm is tested for material price forecasting. The thesis focuses on the research question “how can the program Power BI be used for gathering, analysing, and utilizing the supplier data”. The proposed solution is to automate supplier data extraction from SAP ERP using Microsoft Excel with VBA programming, and to use Microsoft Power BI for data analysis, visualization, and machine learning to provide the information required to solve the research problem. The development process for the solution can be divided into four parts: data extraction, data analysis, visualization and utilization, and machine learning. The solution is built through multiple prototypes and developed based on theory, testing, and feedback. The end product is released to the Power BI Service ABB workspace for use. The thesis is divided into seven chapters according to the steps of the constructive research process. The first three chapters focus on the background, research question, theory, and technologies used in the thesis. The fourth chapter focuses on the research approach and process. The fifth chapter focuses on the development process of the solution construct and the sixth chapter on the results of this solution. The final chapter focuses on the conclusions, discussion, and future development of this research. The development and results of this thesis indicate that combining Microsoft Excel with VBA programming and Microsoft Power BI for data analysis is an efficient method for gathering, analysing, and utilizing supplier data. Through the data analysis capabilities of Power BI, the data can be analysed, calculated, and visualized efficiently. A machine learning implementation is possible in Power BI; however, using DAX programming caused technical problems that could not be solved during the thesis.

    A Statistical Approach to Topological Data Analysis

    Until very recently, topological data analysis and topological inference methods mostly relied on deterministic approaches. The major part of this habilitation thesis presents a statistical approach to such topological methods. We first develop model selection tools for selecting simplicial complexes in a given filtration. Next, we study the estimation of persistent homology on metric spaces. We also study a robust version of topological data analysis. Related to this last topic, we also investigate the problem of Wasserstein deconvolution. The second part of the habilitation thesis gathers our contributions in other fields of statistics, including a model selection method for Gaussian mixtures, an implementation of the slope heuristic for calibrating penalties, and a study of Breiman’s permutation importance measure in the context of random forests

    k-Means

    The k-means clustering algorithm (k-means for short) provides a method of finding structure in input examples. It is also called the Lloyd–Forgy algorithm as it was independently introduced by both Stuart Lloyd and Edward Forgy. k-means, like other algorithms you will study in this part of the book, is an unsupervised learning algorithm and, as such, does not require labels associated with input examples. Recall that unsupervised learning algorithms provide a way of discovering some inherent structure in the input examples. This is in contrast with supervised learning algorithms, which require input examples and associated labels so as to fit a hypothesis function that maps input examples to one or more output variables
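As a rough illustration of how k-means finds structure without labels, here is a minimal sketch of the standard Lloyd iteration in plain Python. It assumes points are tuples of floats and that the initial centroids are sampled from the data with a fixed seed; the function names and defaults are hypothetical and not taken from the chapter.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points: Sequence[Point], k: int, iterations: int = 100,
            seed: int = 0) -> Tuple[List[Point], List[int]]:
    """Alternate assignment and update steps until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(list(points), k)  # assumed initialization scheme
    labels: List[int] = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            new_centroids.append(mean(members) if members else centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels
```

Calling `k_means(points, k=2)` on two well-separated groups of points returns one centroid per group and a label per point; no labels are supplied as input, which is what makes the method unsupervised.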