16 research outputs found

    Using and extending itemsets in data mining: query approximation, dense itemsets, and tiles

    Frequent itemsets are one of the best-known concepts in data mining, and itemset mining algorithms are an active area of research. An itemset is frequent in a database if its items co-occur in sufficiently many records. This thesis addresses two questions related to frequent itemsets. The first question is raised by a method for approximating logical queries by an inclusion-exclusion sum truncated to the terms corresponding to the frequent itemsets: how good are the approximations thereby obtained? The answer is twofold: in theory, the worst-case bound for the algorithm is very large, and a construction is given that shows the bound to be tight; in practice, however, the approximations tend to be much closer to the correct answer than the worst case suggests. While some other algorithms based on frequent itemsets yield even better approximations, they are not as widely applicable. The second question concerns extending the definition of frequent itemsets to relax the requirement of perfect co-occurrence: highly correlated items may form an interesting set even if they never co-occur in a single record. The problem is to formalize this idea in a way that still admits efficient mining algorithms. Two different approaches are used. First, dense itemsets are defined in a manner similar to the usual frequent itemsets and can be found using a modification of the original itemset mining algorithm. Second, tiles are defined differently, so that they form a model for the whole data, unlike frequent and dense itemsets. A heuristic algorithm based on spectral properties of the data is given and some of its properties are explored.
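
    As a concrete illustration of the notion of a frequent itemset, below is a minimal sketch of levelwise (Apriori-style) mining, assuming the data is a list of records, each given as a set of item identifiers; the function name, the toy data, and the minimum-support threshold are illustrative placeholders rather than anything taken from the thesis.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Levelwise (Apriori-style) search: an itemset is frequent if its items
    co-occur in at least min_support records."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {i for t in transactions for i in t}
    # Level 1: frequent single items.
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = set(level)
    k = 2
    while level:
        # Candidate k-itemsets: unions of frequent (k-1)-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Keep a candidate only if every (k-1)-subset is frequent (support is
        # anti-monotone) and it is itself frequent in the data.
        level = {c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support(c) >= min_support}
        frequent |= level
        k += 1
    return frequent

# Toy usage: itemsets whose items co-occur in at least 2 of the 4 records.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
print(sorted(tuple(sorted(s)) for s in frequent_itemsets(data, min_support=2)))
```

    The levelwise pruning works because support is anti-monotone: no superset of an infrequent itemset can be frequent, which is what keeps this kind of search feasible.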

    Nowcasting Well-Being With Retail Market Data

    The main purpose of this work is to introduce a novel approach for estimating the degree of human well-being in a flexible and immediate way. The data, provided by Unicoop Tirreno, contain sales information from 2007 to 2012 and are modelled as a bipartite customer-product graph. Each node of the graph is assigned a value, called the sophistication index, which shows on the one hand how basic or sophisticated a product is and, on the other hand, how much a customer tends to buy basic or sophisticated products. We evaluated the temporal evolution of the sophistication index for both customers and products, aggregating over both with various statistical measures. The performance of the new indicator was evaluated by comparing these trends with the trend of GDP (Gross Domestic Product) using the Pearson correlation coefficient.
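
    The abstract does not spell out how the sophistication index is computed, so the sketch below only illustrates one plausible scheme in the spirit of economic-complexity measures: scores are averaged back and forth over a bipartite customer-product matrix, and a yearly aggregate is then compared with a GDP series via the Pearson correlation coefficient. The matrix, the iteration count, and both time series are made-up placeholders, not the paper's data or algorithm.

```python
import numpy as np

# Illustrative 0/1 purchase matrix M: rows are customers, columns are products.
M = np.array([[1, 1, 1, 1],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)

def sophistication(M, iterations=10):
    """Alternate averaging over the bipartite graph: a customer's score is the
    mean score of the products they buy, and a product's score is the mean
    score of the customers who buy it (method-of-reflections style)."""
    cust = M.sum(axis=1)   # start: number of products each customer buys
    prod = M.sum(axis=0)   # start: number of customers buying each product
    for _ in range(iterations):
        cust_next = M @ prod / M.sum(axis=1)
        prod_next = M.T @ cust / M.sum(axis=0)
        cust, prod = cust_next, prod_next
    return cust, prod

cust_index, prod_index = sophistication(M)

# Compare a yearly aggregate of the customer index with a GDP series
# (both series here are invented) via the Pearson correlation coefficient.
yearly_index = np.array([1.00, 0.97, 0.93, 0.90, 0.92, 0.95])
gdp = np.array([100.0, 99.0, 96.0, 94.0, 95.0, 97.0])
r = np.corrcoef(yearly_index, gdp)[0, 1]
print(cust_index, prod_index, round(r, 3))
```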

    Tilastollisesti merkityksellisten riippuvuussääntöjen tehokas haku binääridatasta

    Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and invent cures to problems. Nowadays, large amounts of data are available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem - searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. A classical example is genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine - i.e. they should also hold in future data. This is an important distinction from the traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all, or represent only spurious connections that occur by chance. Therefore, the principal objective is to search for the rules with statistical significance measures. Another important objective is to search for only non-redundant rules, which express the real causes of dependence without any incidental extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither statistical dependency nor statistical significance is a monotonic property, which means that the traditional pruning techniques do not work. As a solution, we first derive the mathematical basis for pruning the search space with any well-behaved statistical significance measure. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measure, such as Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm scales well, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000-20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry about whether the discovered dependencies will hold in future data or whether the data still contains better, undiscovered dependencies.
    Searching for and analysing statistical dependencies is one of the central tasks of the empirical sciences. Statistical dependencies help us understand cause-and-effect relationships, for example which genes or lifestyle factors predispose to certain diseases and which protect against them. Such dependencies can be expressed conveniently as dependency rules of the form ABCD->E, where A, B, C and D are observed factors and E is a consequence that is statistically dependent on them. Vast amounts of data are nowadays available from almost every area of life. The problem is that not all possible dependencies can be examined with ordinary statistical tools or computer programs. For example, if the data contains 20 variables and each can take only two values (for instance, a gene is either present or absent in a sample), there are already over 20 million possible dependency rules. In practice the data often contains at least hundreds or even tens of thousands of variables, and examining all possible dependency rules is computationally infeasible. This research develops the efficient computational methods needed to search for the statistically most significant dependency rules in binary data, where each variable can take only two values. Besides genetics, such data occurs naturally in, for example, biology (the plant and animal species observed at different sites) and market research (so-called market-basket data, i.e. which products each customer has bought). If the data contains variables with more than two values, they can always be represented in binary form when needed. Compared with earlier data mining methods, the methods developed in this research are both more efficient and more reliable. Traditionally, dependencies in large data sets have been analysed with association rules, but association rules do not necessarily express any statistical dependency at all, or the dependency may be statistically insignificant (a product of chance). In addition, association rule search methods are inefficient at finding all significant dependencies. With the computer program developed in this research, the most significant dependencies can be found even in data sets containing tens of thousands of variables on an ordinary desktop computer. Almost any statistical measure, such as Fisher's exact test or the chi-squared measure, can be used as the criterion by which the statistical significance of a dependency is assessed.
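
    To make the central quantity concrete, here is a minimal sketch of scoring a single dependency rule X -> A on binary data with Fisher's exact test, which the abstract names as one of the supported measures; the toy records, the chosen rule, and the helper name are illustrative, and the thesis's actual search and pruning machinery over all candidate rules is not reproduced here.

```python
from scipy.stats import fisher_exact

def rule_p_value(records, antecedent, consequent):
    """p-value of the rule X -> A from the 2x2 contingency table whose rows are
    'X present / X absent' and whose columns are 'A present / A absent'."""
    x_a = x_not_a = notx_a = notx_not_a = 0
    for rec in records:
        has_x = antecedent <= rec          # all attributes of X are 1 in this record
        has_a = consequent in rec
        if has_x and has_a:
            x_a += 1
        elif has_x:
            x_not_a += 1
        elif has_a:
            notx_a += 1
        else:
            notx_not_a += 1
    table = [[x_a, x_not_a], [notx_a, notx_not_a]]
    _, p = fisher_exact(table, alternative="greater")  # test for positive dependence
    return p

# Toy binary data: each record is the set of attributes that take the value 1.
records = [{"x1", "x2", "a"}, {"x1", "x2", "a"}, {"x1", "a"},
           {"x2"}, {"x1"}, set(), {"x1", "x2", "a"}, {"a"}]
print(rule_p_value(records, antecedent={"x1", "x2"}, consequent="a"))
```

    In a full search, a score like this would be computed for an enormous number of candidate rules, which is why the pruning results described above are what make the problem tractable.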

    The ribosome builder: A software project to simulate the ribosome


    On utilising change over time in data mining

    Magdeburg, University, Faculty of Computer Science, doctoral dissertation, 2013, by Mirko Böttcher

    Text and Genre in Reconstruction

    In this broad-reaching, multi-disciplinary collection, leading scholars investigate how the digital medium has altered the way we read and write text. In doing so, it challenges the very notion of scholarship as it has traditionally been imagined. Incorporating scientific, socio-historical, materialist and theoretical approaches, this rich body of work explores topics ranging from how computers have affected our relationship to language, whether the book has become an obsolete object, the nature of online journalism, and the psychology of authorship. The essays offer a significant contribution to the growing debate on how digitization is shaping our collective identity, for better or worse. Text and Genre in Reconstruction will appeal to scholars in both the humanities and sciences and provides essential reading for anyone interested in the changing relationship between reader and text in the digital age

    The dawn - a study of the traditional love lyric of medieval Spain and Portugal.

    PhD thesis. The object of this study is to investigate the origins of the traditional lyric poetry of the Iberian Peninsula through an analysis of the poetry of dawn meeting. The formative influences on each of the three types of traditional poetry (the Mozarabic kharjas, the Galician cantigas and the Castilian villancicos) are examined and possible relationships are indicated. An introductory survey reviews the state of scholarship in the field of Spanish lyric poetry. Particular reference is made to the importance of the comparatively recent discovery of the kharjas, because their publication has occasioned a profound reappraisal of the origins of Romance vernacular poetry. A new dimension has been brought not only to the study of the medieval lyric of Spain and Portugal but also to considerations of the relevance of the Provençal lyric to the poetry of the Peninsula. The individuality of the traditional Iberian lyric is seen in its singularly consistent use of certain related themes, one of the most significant of these being the theme of lovers' meeting at dawn. Each type of lyric is viewed against its cultural background, and the many influences, both popular and learned, which contribute to its composition, to the development of its imagery and to its preservation are assessed. The treatment of the dawn theme and its associated imagery in each area of poetic composition is analysed both for continuity and for innovation and originality. Since religion, either Christian or pagan, is seen to be influential in the shaping of traditional poetry, religion as a theme of the poetry of meeting is reviewed in the concluding chapter. In its various aspects it is found to accord with many of the characteristics described in the previous chapters.

    An edition of a fifteenth-century Middle English temporale sermon cycle in MSS Lambeth Palace 392 and Cambridge University Library Additional 5338

    This edition comprises twenty-three Middle English Temporale sermons which are contained in two early fifteenth-century manuscripts, Lambeth Palace 392 (Lb) and Cambridge University Library Additional 5338 (Ad). The collection runs from 1 Advent to Easter, but is not fully represented in either manuscript; only ten of the sermons (3 Advent to the 5th Sunday after the octave of the Epiphany) are shared by Ad and Lb. These ten sermons are presented en face in the edition, and each manuscript has been edited separately. The choice of en face presentation was determined by the comparative brevity of the overlapping portion and by the distinctive character of both manuscripts. The AdLb series draws material from the Set I sermons of the English Wycliffite sermon cycle; the borrowings are largely limited to the translation of the gospel pericopes which preface most of the AdLb sermons, but one sermon, that for the octave of the Epiphany, takes over almost entirely the complete Wycliffite sermon for the corresponding occasion. The Notes record in detail that AdLb is a derivative compilation. But the Lollard interest of the series goes beyond these borrowings. While the collection is basically orthodox, the compiler has also added tendentious material, or changed the emphasis of the source, to create a hybrid of quite orthodox sentiments and popular Lollard belief. This combination appears to be characteristic of early fifteenth-century sermon and devotional texts. The handling of the source, which for most of the sermons is the Latin Sunday gospel collection of Nicholas de Aquevilla OFM, is reviewed extensively in the Notes and reveals the extent of the preacher's proto-Lollard interventions. The Introduction describes Lb and Ad, and discusses their inter-relation. An analysis of the language of both manuscripts reveals an anterior Norfolk copy of the series, which is at several removes from the original. I give a brief account of the preacher's ideology, which is also explored in detail in the Notes, and suggest some ways of approaching the sermons within a literary context. I survey the relationship between three sermons in AdLb and three in the fifteenth-century collection witnessed in MS Harley 2247 (H) and MS Royal 18 B XXV (R), which also draw on the sermons of Nicholas de Aquevilla. Part II contains the Notes to the sermons, which include the relevant text of the Latin source. There is a Select Glossary and a Bibliography.