112 research outputs found

    Finding banded patterns in large data set using segmentation


    Mining local staircase patterns in noisy data

    Most traditional biclustering algorithms identify biclusters with little or no overlap. In this paper, we introduce the problem of identifying staircases of biclusters. Such staircases may be indicative of causal relationships between columns and cannot easily be identified by existing biclustering algorithms. Our formalization relies on a scoring function based on the Minimum Description Length principle. Furthermore, we propose a first algorithm for identifying staircase biclusters, based on a combination of local search and constraint programming. Experiments show that the approach is promising.

    Patterns in permuted binary matrices

    Reorganizing a dataset so that its hidden structure can be observed is useful in any data analysis task. For example, detecting a regularity in a dataset helps us to interpret the data, compress the data, and explain the processes behind the data. We study datasets that come in the form of binary matrices (tables with 0s and 1s). Our goal is to develop automatic methods that bring out certain patterns by permuting the rows and columns. We concentrate on the following patterns in binary matrices: consecutive-ones (C1P), simultaneous consecutive-ones (SC1P), nestedness, k-nestedness, and bandedness. These patterns reflect specific types of interplay and variation between the rows and columns, such as continuity and hierarchies. Furthermore, their combinatorial properties are interlinked, which helps us to develop the theory of binary matrices and efficient algorithms. Indeed, we can detect all these patterns in a binary matrix efficiently, that is, in polynomial time in the size of the matrix. Since real-world datasets often contain noise and errors, we rarely witness perfect patterns. Therefore we also need to assess how far an input matrix is from a pattern: we count the number of flips (from 0s to 1s or vice versa) needed to bring out the perfect pattern in the matrix. Unfortunately, for most patterns it is an NP-complete problem to find the minimum distance to a matrix that has the perfect pattern, which means that the existence of a polynomial-time algorithm is unlikely. To find patterns in datasets with noise, we need methods that are noise-tolerant and work in practical time with large datasets. The theory of binary matrices gives rise to robust heuristics that have good performance with synthetic data and discover easily interpretable structures in real-world datasets: dialectal variation in the spoken Finnish language, division of European locations by the hierarchies found in mammal occurrences, and co-occurring groups in network data.
In addition to determining the distance from a dataset to a pattern, we need to determine whether the pattern is significant or a mere occurrence of random chance. To this end, we use significance testing: we deem a dataset significant if it appears exceptional when compared to datasets generated from a certain null hypothesis. After detecting a significant pattern in a dataset, it is up to domain experts to interpret the results in terms of the application.
Reordering a dataset reveals its internal structure. Electronic datasets are often large and the patterns they contain are initially unknown, so efficient computer programs are needed to find the patterns. Recognizing patterns helps describe the structure of, for example, mammal and dialect-word datasets and social networks. At best, this helps explain the real-world phenomena related to the datasets. Esa Junttila's doctoral dissertation in computer science, to be examined at the University of Helsinki, presents new automatic methods that identify regularities in datasets. The new methods are based on reordering the dataset, which brings out the pattern it contains. Here a dataset means tabular data that contains only ones and zeros. For example, the ones in a mammal distribution table indicate that a particular mammal lives in a particular region. The methods order the rows and columns of the table so that the pattern stands out as clearly as possible to people. Applied to the mammal data, the described methods can produce, for example, a hierarchy of mammals, groupings, or some other ordering. Theoretical analysis yields fast algorithms for pattern search that can handle thousands of rows and columns. The challenge is the methods' ability to tolerate errors: an existing pattern must be found even when the data quality is poor. Tailored statistical tests finally indicate the significance of the discovered pattern.
The doctoral candidate has used the described methods to search for patterns in, for example, genetic data, social networks, and occurrences of mammals, dialect words, and fossils. The regularities found strengthened the understanding of the internal structure of the studied datasets and encourage further research in the corresponding fields, such as ecology and paleontology. Esa Junttila will defend his doctoral dissertation in computer science, Patterns in Permuted Binary Matrices, at the Faculty of Science on 10 August 2011 at 12 noon. The public examination takes place in Hall 13 of the University Main Building (Fabianinkatu 33, 3rd floor). The opponent is Professor Matti Nykänen (University of Eastern Finland) and the custos is Professor Hannu Toivonen (University of Helsinki). Further information: Esa Junttila, telephone 040-8234987.
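
    Nestedness, one of the patterns named in the abstract above, can be illustrated with a minimal sketch. It assumes the common definition that a binary matrix is fully nested when the sets of 1-columns of its rows form a chain under inclusion; the function name `is_nested` is hypothetical and the thesis's own algorithms are not reproduced here.

```python
def is_nested(matrix):
    """Return True if the row supports form a chain under set inclusion.

    Assumed definition: for any two rows, the 1-columns of one row are a
    subset of the 1-columns of the other. The check is independent of the
    current row and column order.
    """
    # Support of a row = set of column indices holding a 1.
    supports = sorted(
        ({j for j, v in enumerate(row) if v} for row in matrix),
        key=len,
        reverse=True,
    )
    # After sorting by size, a chain exists iff each support contains the next.
    return all(small <= big for big, small in zip(supports, supports[1:]))
```

    Because the test only compares sets, a matrix whose rows and columns have been shuffled, such as `[[0, 1, 0], [1, 1, 1], [0, 1, 1]]`, is still recognized as nested, which reflects the polynomial-time detectability of perfect patterns mentioned in the abstract. Measuring the distance of a noisy matrix from the pattern is the hard part and is not attempted here.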

    Detecting and generating overlapping nested communities

    Nestedness has been observed in a variety of networks but has been primarily viewed in the context of bipartite networks. Numerous metrics quantify nestedness and some clustering methods identify fully nested parts of graphs, but all with similar limitations. Clustering approaches also fail to uncover the overlap between fully nested subgraphs, as they assign vertices to a single group only. In this paper, we look at the nestedness of a network through an auxiliary graph, in which a directed edge represents a nested relationship between the two corresponding vertices of the network. We present an algorithm that recovers this so-called community graph, and finds the overlapping fully nested subgraphs of a network. We also introduce an algorithm for generating graphs with such nested structure, given by a community graph. This algorithm can be used to test a nested community detection algorithm of this kind, and potentially to evaluate different metrics of nestedness as well. Finally, we evaluate our nested community detection algorithm on a large variety of networks, both bipartite and non-bipartite. We derive a new metric from the community graph to quantify the nestedness of both bipartite and non-bipartite networks.
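
    The auxiliary graph described above can be sketched as follows. The abstract does not define "nested relationship" precisely, so this example assumes that u is nested in v when every neighbour of u other than v is also a neighbour of v. The function name `nestedness_edges` and the adjacency-dict input are illustrative assumptions, not the paper's algorithm.

```python
def nestedness_edges(adj):
    """Build directed edges (u, v) meaning "u is nested in v".

    adj maps each vertex to the set of its neighbours (undirected graph,
    no self-loops). Assumed relation: N(u) minus {v} is a subset of N(v).
    """
    edges = set()
    for u in adj:
        for v in adj:
            if u != v and adj[u] - {v} <= adj[v]:
                edges.add((u, v))
    return edges
```

    This brute-force version compares all vertex pairs, so it is only suitable for small graphs; it works on bipartite and non-bipartite inputs alike because it only looks at neighbourhoods.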

    Small mammal responses to Amazonian forest islands are modulated by their forest dependence

    Hydroelectric dams have induced widespread loss, fragmentation and degradation of terrestrial habitats in lowland tropical forests. Yet their ecological impacts have been widely neglected, particularly in developing countries, which are currently earmarked for exponential hydropower development. Here we assess small mammal assemblage responses to Amazonian forest habitat insularization induced by the 28-year-old Balbina Hydroelectric Dam. We sampled small mammals on 25 forest islands (0.83–1466 ha) and four continuous forest sites in the mainland to assess the overall community structure and species-specific responses to forest insularization. We classified all species according to their degree of forest-dependency using a multi-scale approach, considering landscape, patch and local habitat characteristics. Based on 65,520 trap-nights, we recorded 884 individuals of at least 22 small mammal species. Species richness was best predicted by island area and isolation, with small islands ( 200 ha; 10.8 ± 1.3 species) and continuous forest sites (∞ ha; 12.5 ± 2.5 species) exhibited similarly high species richness. Forest-dependent species showed higher local extinction rates and were often either absent or persisted at low abundances on small islands, where non-forest-dependent species became hyper-abundant. Species capacity to use non-forest habitat matrices appears to dictate small mammal success in small isolated islands. We suggest that ecosystem functioning may be highly disrupted on small islands, which account for 62.7% of all 3546 islands in the Balbina Reservoir.

    Finding Banded Patterns in Data: The Banded Pattern Mining Algorithm


    Centralized and distributed cognitive task processing in the human connectome

    A key question in modern neuroscience is how cognitive changes in a human brain can be quantified and captured by functional connectomes (FC). A systematic approach to measure pairwise functional distance at different brain states is lacking. This would provide a straightforward way to quantify differences in cognitive processing across tasks; also, it would help in relating these differences in task-based FCs to the underlying structural network. Here we propose a framework, based on the concept of Jensen-Shannon divergence, to map the task-rest connectivity distance between tasks and resting-state FC. We show how this information-theoretical measure allows for quantifying connectivity changes in distributed and centralized processing in functional networks. We study resting-state and seven tasks from the Human Connectome Project dataset to obtain the most distant links across tasks. We investigate how these changes are associated with different functional brain networks, and use the proposed measure to infer changes in the information processing regimes. Furthermore, we show how the FC distance from resting state is shaped by structural connectivity, and to what extent this relationship depends on the task. This framework provides a well-grounded mathematical quantification of connectivity changes associated with cognitive processing in large-scale brain networks. Comment: 22 pages main, 6 pages supplementary, 6 figures, 5 supplementary figures, 1 table, 1 supplementary table. arXiv admin note: text overlap with arXiv:1710.0219
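
    For readers unfamiliar with the Jensen-Shannon divergence that the abstract above builds on, here is a minimal sketch of the standard definition for two discrete probability distributions. How the paper derives distributions from functional connectomes is not stated in the abstract, so the inputs here are simply assumed to be probability vectors of equal length; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 are skipped."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m_i > 0 whenever p_i > 0 or q_i > 0, so the KL terms are always defined.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

    With base-2 logarithms the value is symmetric in its arguments and bounded between 0 (identical distributions) and 1 (distributions with disjoint support), which is what makes it convenient as a distance-like measure between states.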