8 research outputs found

    Using and extending itemsets in data mining : query approximation, dense itemsets, and tiles

    Frequent itemsets are one of the best-known concepts in data mining, and there is active research in itemset mining algorithms. An itemset is frequent in a database if its items co-occur in sufficiently many records. This thesis addresses two questions related to frequent itemsets. The first question is raised by a method for approximating logical queries by an inclusion-exclusion sum truncated to the terms corresponding to the frequent itemsets: how good are the approximations thereby obtained? The answer is twofold: in theory, the worst-case bound for the algorithm is very large, and a construction is given that shows the bound to be tight; but in practice, the approximations tend to be much closer to the correct answer than in the worst case. While some other algorithms based on frequent itemsets yield even better approximations, they are not as widely applicable. The second question concerns extending the definition of frequent itemsets to relax the requirement of perfect co-occurrence: highly correlated items may form an interesting set, even if they never co-occur in a single record. The problem is to formalize this idea in a way that still admits efficient mining algorithms. Two different approaches are used. First, dense itemsets are defined in a manner similar to the usual frequent itemsets and can be found using a modification of the original itemset mining algorithm. Second, tiles are defined in a different way so as to form a model for the whole data, unlike frequent and dense itemsets. A heuristic algorithm based on spectral properties of the data is given and some of its properties are explored.
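
The truncated inclusion-exclusion idea described in this abstract can be illustrated with a small sketch. This is not the thesis's implementation: the function names, the brute-force miner, the toy records, and the choice of a disjunction query ("records containing at least one of the query items") are all assumptions made for illustration. Terms whose itemset is not frequent are simply dropped from the sum, which is where the approximation error comes from.

```python
from itertools import combinations


def support(records, itemset):
    """Number of records that contain every item of `itemset`."""
    items = set(itemset)
    return sum(1 for record in records if items <= record)


def frequent_itemsets(records, items, min_support):
    """Brute-force miner: map each frequent itemset (frozenset) to its support."""
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(sorted(items), size):
            count = support(records, combo)
            if count >= min_support:
                result[frozenset(combo)] = count
    return result


def approx_disjunction_count(frequent, query_items):
    """Inclusion-exclusion estimate of how many records contain at least one
    query item, using only the terms whose itemset is frequent."""
    total = 0
    for size in range(1, len(query_items) + 1):
        sign = 1 if size % 2 == 1 else -1
        for combo in combinations(sorted(query_items), size):
            # Infrequent itemsets are unknown to the estimator and count as 0.
            total += sign * frequent.get(frozenset(combo), 0)
    return total


# Hypothetical toy database: each record is a set of items.
records = [
    {"a", "b"},
    {"a", "b", "c"},
    {"a"},
    {"b", "c"},
    {"c"},
    {"d"},
]
all_items = {"a", "b", "c", "d"}
query = ["a", "b", "c"]

exact = sum(1 for record in records if record & set(query))
coarse = approx_disjunction_count(frequent_itemsets(records, all_items, 3), query)
finer = approx_disjunction_count(frequent_itemsets(records, all_items, 2), query)
print(exact, coarse, finer)  # prints: 5 9 5
```

With a support threshold of 3 only the singletons are frequent, so the estimate overshoots; lowering the threshold to 2 brings in enough pairwise terms for the estimate to match the exact count on this toy data.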

    Unsupervised learning of relation detection patterns

    Information extraction is the area of natural language processing whose goal is to obtain structured data from the relevant information contained in textual fragments. Information extraction requires a significant amount of linguistic knowledge. The specificity of such knowledge is a drawback for the portability of systems, as a change of language, domain or style demands a costly human effort. Machine learning techniques have been applied for decades to overcome this portability bottleneck, progressively reducing the amount of human supervision involved. However, as the availability of large document collections increases, completely unsupervised approaches become necessary in order to mine the knowledge contained in them. The proposal of this thesis is to incorporate clustering techniques into pattern learning for information extraction, in order to further reduce the elements of supervision involved in the process. In particular, the work focuses on the problem of relation detection. Achieving this goal has required, first, considering the different strategies by which this combination could be carried out; second, developing or adapting clustering algorithms suitable to our needs; and third, devising pattern learning procedures that incorporate clustering information. By the end of this thesis, we had been able to develop and implement an approach for learning relation detection patterns which, using clustering techniques and minimal human supervision, is competitive with and even outperforms other comparable state-of-the-art approaches. Postprint (published version)

    STREAMING ALGORITHMS FOR MINING FREQUENT ITEMS

    The streaming model has supplied solutions for handling enormous data flows for over 20 years. The model works with sequential data access and imposes sublinear memory as its primary restriction. Although the majority of the algorithms are randomized and approximate, the field supports numerous applications, from handling network traffic to analyzing cosmology simulations and beyond. This thesis focuses on one of the most foundational and well-studied problems, finding heavy hitters, i.e. frequent items: 1. We challenge the long-standing complexity gap in finding heavy hitters with an L2 guarantee in the insertion-only stream and present the first optimal algorithm, with a space complexity of O(1) words and O(1) update time. Our result improves on the Count Sketch algorithm of Charikar et al. 2002 [39], whose space and time complexity is O(log n). 2. We consider the L2-heavy hitter problem in the interval query setting, which is rapidly emerging in the field. Compared to the well-known sliding window model, where an algorithm is required to report the function of interest computed over the last N updates, the interval query model provides query flexibility: at any moment t one can query the function value on any interval (t1,t2)⊆(t−N,t). We present the first L2-heavy hitter algorithm in that model and extend the result to the estimation of all streamable functions of a frequency vector. 3. We provide an experimental study of the recent space-optimal result on streaming quantiles by Karnin et al. 2016 [85]. The problem can be considered a generalization of heavy hitters. Additionally, we suggest several variations of the algorithm which improve the running time from O(1/ε) to O(log 1/ε), provide a twice better space vs. precision trade-off, and extend the algorithm to the case of weighted updates. 4. We establish the connection between finding "halos", i.e. dense areas, in cosmology N-body simulations and finding heavy hitters. We build the first halo finder and scale it up to handle data sets with up to 10^12 particles via GPU boosting, sampling and parallel I/O. We investigate its behavior and compare it to traditional in-memory halo finders. Our solution pushes the memory footprint from several terabytes down to less than a gigabyte, making the problem feasible for small servers and even desktops.
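
For readers unfamiliar with the baseline this abstract compares against, the following is a minimal sketch of a Count Sketch style frequency estimator. It illustrates only that baseline, not the thesis's own algorithms. The class name, the `rows` and `width` parameters, and the use of salted BLAKE2b digests as a stand-in for the pairwise-independent hash functions of the original construction are all assumptions made for illustration.

```python
import hashlib
from statistics import median


class CountSketch:
    """Toy Count Sketch: `rows` independent hashed counter arrays of size `width`."""

    def __init__(self, rows=5, width=64):
        self.rows = rows
        self.width = width
        self.table = [[0] * width for _ in range(rows)]

    def _bucket_and_sign(self, row, item):
        # One salted digest per row supplies both the bucket index and the sign.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        value = int.from_bytes(digest, "little")
        sign = 1 if value & 1 else -1
        bucket = (value >> 1) % self.width
        return bucket, sign

    def update(self, item, count=1):
        """Process `count` occurrences of `item` from the stream."""
        for row in range(self.rows):
            bucket, sign = self._bucket_and_sign(row, item)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        """Median over rows of the signed counter that `item` hashes to."""
        estimates = []
        for row in range(self.rows):
            bucket, sign = self._bucket_and_sign(row, item)
            estimates.append(sign * self.table[row][bucket])
        return median(estimates)


# Hypothetical stream: one heavy item plus 200 distinct light items.
sketch = CountSketch()
sketch.update("hot", 1000)
for i in range(200):
    sketch.update(f"noise-{i}")
print(sketch.estimate("hot"))
```

The estimate for the heavy item is perturbed only by the light items that collide with it in each row, and taking the median across rows damps the effect of an unlucky row. The memory used depends on `rows` and `width` rather than on the number of distinct items seen.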

    Efficient Computation of Frequent Itemsets In A Subcollection of Multiple Set Families

    Many applications need to deal with additive and multiplicative subcollections over a group of set families (databases). This paper presents two efficient algorithms for computing the frequent itemsets in these two types of subcollections, respectively. Let T be a given subcollection of set families of total size m whose elements are drawn from a domain of size n. We show that if T is an additive subcollection, we can compute all frequent itemsets in T in O(m 2^n/(pn) + log p) time on an EREW PRAM with 1 ≤ p ≤ m 2^n/n processors, at the cost of maintaining the occurrences of all itemsets in each individual set family. If T is a multiplicative subcollection, we can compute all itemsets in T in O(mk/p + min {(m′/p) 2^n, n 3^n log m′/p}) time on an EREW PRAM with 1 ≤ p ≤ min {m, 2^n} processors, where m′ = min {m, 2^n}. These represent improvements over direct computation of the frequent itemsets on the subcollection concerned.

    Algebraic Topology for Data Scientists

    This book gives a thorough introduction to topological data analysis (TDA), the application of algebraic topology to data science. Algebraic topology is traditionally a very specialized field of math, and most mathematicians have never been exposed to it, let alone data scientists, computer scientists, and analysts. I have three goals in writing this book. The first is to bring people up to speed who are missing a lot of the necessary background. I will describe the topics in point-set topology, abstract algebra, and homology theory needed for a good understanding of TDA. The second is to explain TDA and some current applications and techniques. Finally, I would like to answer some questions about more advanced topics such as cohomology, homotopy, obstruction theory, and Steenrod squares, and what they can tell us about data. It is hoped that readers will acquire the tools to start to think about these topics and where they might fit in. Comment: 322 pages, 69 figures, 5 tables

    Collected Papers (on Neutrosophic Theory and Applications), Volume VI

    This sixth volume of Collected Papers includes 74 papers comprising 974 pages on (theoretic and applied) neutrosophics, written between 2015 and 2021 by the author alone or in collaboration with the following 121 co-authors from 19 countries: Mohamed Abdel-Basset, Abdel Nasser H. Zaied, Abduallah Gamal, Amir Abdullah, Firoz Ahmad, Nadeem Ahmad, Ahmad Yusuf Adhami, Ahmed Aboelfetouh, Ahmed Mostafa Khalil, Shariful Alam, W. Alharbi, Ali Hassan, Mumtaz Ali, Amira S. Ashour, Asmaa Atef, Assia Bakali, Ayoub Bahnasse, A. A. Azzam, Willem K.M. Brauers, Bui Cong Cuong, Fausto Cavallaro, Ahmet Çevik, Robby I. Chandra, Kalaivani Chandran, Victor Chang, Chang Su Kim, Jyotir Moy Chatterjee, Victor Christianto, Chunxin Bo, Mihaela Colhon, Shyamal Dalapati, Arindam Dey, Dunqian Cao, Fahad Alsharari, Faruk Karaaslan, Aleksandra Fedajev, Daniela Gîfu, Hina Gulzar, Haitham A. El-Ghareeb, Masooma Raza Hashmi, Hewayda El-Ghawalby, Hoang Viet Long, Le Hoang Son, F. Nirmala Irudayam, Branislav Ivanov, S. Jafari, Jeong Gon Lee, Milena Jevtić, Sudan Jha, Junhui Kim, Ilanthenral Kandasamy, W.B. Vasantha Kandasamy, Darjan Karabašević, Songül Karabatak, Abdullah Kargın, M. Karthika, Ieva Meidute-Kavaliauskiene, Madad Khan, Majid Khan, Manju Khari, Kifayat Ullah, K. Kishore, Kul Hur, Santanu Kumar Patro, Prem Kumar Singh, Raghvendra Kumar, Tapan Kumar Roy, Malayalan Lathamaheswari, Luu Quoc Dat, T. Madhumathi, Tahir Mahmood, Mladjan Maksimovic, Gunasekaran Manogaran, Nivetha Martin, M. Kasi Mayan, Mai Mohamed, Mohamed Talea, Muhammad Akram, Muhammad Gulistan, Raja Muhammad Hashim, Muhammad Riaz, Muhammad Saeed, Rana Muhammad Zulqarnain, Nada A. Nabeeh, Deivanayagampillai Nagarajan, Xenia Negrea, Nguyen Xuan Thao, Jagan M. Obbineni, Angelo de Oliveira, M. Parimala, Gabrijela Popovic, Ishaani Priyadarshini, Yaser Saber, Mehmet Șahin, Said Broumi, A. A. Salama, M. Saleh, Ganeshsree Selvachandran, Dönüș Șengür, Shio Gai Quek, Songtao Shao, Dragiša Stanujkić, Surapati Pramanik, Swathi Sundari Sundaramoorthy, Mirela Teodorescu, Selçuk Topal, Muhammed Turhan, Alptekin Ulutaș, Luige Vlădăreanu, Victor Vlădăreanu, Ştefan Vlăduţescu, Dan Valeriu Voinea, Volkan Duran, Navneet Yadav, Yanhui Guo, Naveed Yaqoob, Yongquan Zhou, Young Bae Jun, Xiaohong Zhang, Xiao Long Xin, Edmundas Kazimieras Zavadskas