
    {MDL4BMF}: Minimum Description Length for Boolean Matrix Factorization

    Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the ‘model order selection problem’ of determining where fine-grained structure stops and noise starts, i.e., what the proper size of the factor matrices is. Boolean matrix factorization (BMF)—where data, factors, and matrix product are Boolean—has received increased attention from the data mining community in recent years. The technique has desirable properties, such as high interpretability and natural sparsity. However, so far no method for selecting the correct model order for BMF has been available. In this paper we propose to use the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits: it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general, making it applicable to any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.
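    The model order selection the abstract describes can be illustrated with a toy MDL score: the total description length of the two factor matrices plus the error matrix, each under a simple entropy-based (binomial) code. This is a minimal sketch under an assumed naive encoding, not the refined data-to-model encoding the paper develops; all names are illustrative.

    ```python
    import numpy as np

    def bits(n, k):
        """Bits needed to encode k ones among n cells under a simple binomial code."""
        if k == 0 or k == n:
            return 0.0
        p = k / n
        return -n * (p * np.log2(p) + (1 - p) * np.log2(1 - p))

    def description_length(D, B, C):
        """Total MDL cost of approximating Boolean D by the Boolean product of B and C.

        Naive encoding (illustrative): factor matrices B and C, plus the
        error matrix E marking cells where the approximation disagrees with D.
        """
        approx = (B @ C) > 0                  # Boolean matrix product
        E = D != approx                       # cells the factorization gets wrong
        cost = bits(B.size, int(B.sum()))     # encode factor matrix B
        cost += bits(C.size, int(C.sum()))    # encode factor matrix C
        cost += bits(E.size, int(E.sum()))    # encode the noise/error matrix
        return cost
    ```

    To select the model order, one would compute this score for factorizations of increasing rank and keep the rank that minimizes it: structure that compresses the data lowers the total cost, while factors that only fit noise raise it.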

    A network approach to topic models

    One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach which infers the latent topical structure of a collection of documents. Despite their success -- in particular that of the most widely used variant, Latent Dirichlet Allocation (LDA) -- and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, e.g. a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. Here we obtain a fresh view on the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. This is achieved by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods -- using a stochastic block model (SBM) with non-parametric priors -- we obtain a more versatile and principled framework for topic modeling (e.g., it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. More importantly, our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields. Code available at https://topsbm.github.io
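    The bipartite document-word representation at the heart of this approach is easy to sketch: one node per document, one per word, and an edge per word occurrence. The snippet below builds only the network (with a made-up corpus); actual SBM inference is done with dedicated tools such as graph-tool, which the linked code builds on.

    ```python
    from collections import Counter

    def bipartite_edges(docs):
        """Represent a corpus as a bipartite document-word graph.

        Each word occurrence contributes one edge between its document and
        the word node; occurrences are aggregated into edge multiplicities.
        """
        edges = Counter()
        for d, text in enumerate(docs):
            for w in text.lower().split():
                edges[(d, w)] += 1
        return edges

    # Illustrative three-document corpus
    corpus = ["the cat sat", "the dog sat", "stocks fell today"]
    edges = bipartite_edges(corpus)
    ```

    Running community detection on this graph clusters documents and words jointly: word communities play the role of topics, and document communities group texts with similar topic mixtures, with the number of groups inferred rather than fixed in advance.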

    Extending data mining techniques for frequent pattern discovery : trees, low-entropy sets, and crossmining

    The idea of frequent pattern discovery is to find frequently occurring events in large databases. Such data mining techniques can be useful in various domains. For instance, in recommendation and e-commerce systems frequently occurring product purchase combinations are essential in user preference modeling. In the ecological domain, patterns of frequently occurring groups of species can be used to reveal insight into species interaction dynamics. Over the past few years, most frequent pattern mining research has concentrated on efficiency (speed) of mining algorithms. However, it has been argued within the community that while efficiency of the mining task is no longer a bottleneck, there is still an urgent need for methods that derive compact, yet high-quality results with good application properties. The aim of this thesis is to address this need. The first part of the thesis discusses a new type of tree pattern class for expressing hierarchies of general and more specific attributes in unstructured binary data. The new pattern class is shown to have advantageous properties, and to discover relationships in data that cannot be expressed with the more traditional frequent itemset or association rule patterns alone. The second and third parts of the thesis discuss the use of entropy as a score measure for frequent pattern mining. A new pattern class is defined, low-entropy sets, which allows expressing more general types of occurrence structure than frequent itemsets do. The concept can also be easily applied to tree-structured patterns. Furthermore, by applying the minimum description length principle to pattern selection for low-entropy sets, it is shown experimentally that in most cases the collections of selected patterns are much smaller than those obtained with frequent itemsets. The fourth part of the thesis examines the idea of crossmining itemsets, that is, relating itemsets to numerical variables in a database of mixed data types.
    The problem is formally defined and turns out to be NP-hard, although it is approximately solvable within a constant factor of the optimum. Experiments show that the algorithm finds itemsets that convey structure in both the binary and the numerical parts of the data.
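    The contrast between frequent itemsets and low-entropy sets can be illustrated by scoring an attribute set with the empirical entropy of its joint occurrence patterns. This is a minimal sketch of the scoring idea with illustrative data; the thesis's actual definitions and selection procedure are more involved.

    ```python
    from collections import Counter
    from math import log2

    def set_entropy(rows, attrs):
        """Empirical entropy of the joint occurrence pattern of binary attributes.

        Low entropy means the attributes show strong co-occurrence structure of
        any kind (e.g. always equal, or complementary), whereas frequent itemsets
        only reward rows where all attributes are 1 simultaneously.
        """
        patterns = Counter(tuple(r[a] for a in attrs) for r in rows)
        n = len(rows)
        return -sum(c / n * log2(c / n) for c in patterns.values())
    ```

    For example, two attributes that are always equal (but often both 0) form a low-entropy set yet may never qualify as a frequent itemset, which is exactly the kind of structure the frequent itemset framework misses.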

    Use of data mining for investigation of crime patterns

    A great deal of research is being done to improve the utilization of crime data. This thesis deals with the design and implementation of a crime database and associated search methods to identify crime patterns in the database. The database was created in Microsoft SQL Server (back end). The user interface (front end) and the crime pattern identification software (middle tier) were implemented in ASP.NET. Such a web-based approach enables the user to utilize the database from anywhere at any time. A general ARFF file can also be generated for the user in a Windows-based format, enabling detailed analysis with other data mining software such as WEKA. Further, effective navigation was provided to make the software user-friendly.
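    ARFF, the input format WEKA consumes, is plain text, so exporting database rows to it is straightforward. A minimal sketch of such an export follows; the relation and field names are hypothetical, not taken from the thesis's schema.

    ```python
    def to_arff(relation, attributes, rows):
        """Serialize tabular data to ARFF, the WEKA input format.

        `attributes` is a list of (name, type) pairs, where the type is either
        a primitive like "NUMERIC" or a nominal value set like "{theft,assault}".
        """
        lines = [f"@RELATION {relation}", ""]
        for name, typ in attributes:
            lines.append(f"@ATTRIBUTE {name} {typ}")
        lines += ["", "@DATA"]
        for row in rows:
            lines.append(",".join(str(v) for v in row))
        return "\n".join(lines)
    ```

    The resulting string can be written to a `.arff` file and opened directly in WEKA for clustering or classification of the exported crime records.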

    Oyster Reef Habitat Restoration : a synopsis and synthesis of approaches; proceedings from the symposium, Williamsburg, Virginia, April 1995

    This volume has its origin in a symposium held in Williamsburg, VA in April 1995, though most of the chapters have been significantly revised in the interim. The primary purpose of the symposium was to bring together state fisheries managers involved in fisheries-directed oyster enhancement and research scientists to refine approaches for enhancing oyster populations and to better develop the rationale for restoring reef habitats. We could hardly have anticipated the degree to which this has been successful. In the interim between the symposium and the publication of this volume, the notion that oyster reefs are valuable habitats, both for oysters and for the other ecosystem services they provide, has been gaining wider acceptance.

    Table of Contents
    Introduction and Overview by Mark W. Luckenbach, Roger Mann and James A. Wesson
    Part I. Historical Perspectives
    Chapter 1 - The Evolution of the Chesapeake Oyster Reef System During the Holocene Epoch by William J. Hargis, Jr.
    Chapter 2 - The Morphology and Physical Oceanography of Unexploited Oyster Reefs in North America by Victor S. Kennedy and Lawrence P. Sanford
    Chapter 3 - Oyster Bottom: Surface Geomorphology and Twentieth Century Changes in the Maryland Chesapeake Bay by Gary F. Smith, Kelly N. Geenhawk and Dorothy L. Jensen
    Part II. Synopsis of Ongoing Efforts
    Chapter 4 - Resource Management Programs for the Eastern Oyster, Crassostrea virginica, in the U.S. Gulf of Mexico ... Past, Present and Future by Richard L. Leard, Ronald Dugas and Mark Benigan
    Chapter 5 - Oyster Habitat Restoration: A Response to Hurricane Andrew by William S. Perret, Ronald Dugas, John Roussel, Charles A. Wilson, and John Supan
    Chapter 6 - Oyster Restoration in Alabama by Richard K. Wallace, Kenneth Heck and Mark Van Hoose
    Chapter 7 - A History of Oyster Reef Restoration in North Carolina by Michael D. Marshall, Jeffrey E. French and Stephen W. Shelton
    Chapter 8 - Oyster Restoration Efforts in Virginia by James Wesson, Roger Mann and Mark Luckenbach
    Part III. Reef Morphology and Function - Questions of Scale
    Chapter 9 - South Carolina Intertidal Oyster Reef Studies: Design, Sampling and Focus for Evaluating Habitat Value and Function by Loren D. Coen, David M. Knott, Elizabeth L. Wenner, Nancy H. Hadley, Amy H. Ringwood and M. Yvonne Bobo
    Chapter 10 - Small-scale Patterns of Recruitment on a Constructed Intertidal Reef: The Role of Spatial Refugia by Ian K. Bartol and Roger Mann
    Chapter 11 - Perspectives on Induced Settlement and Metamorphosis as a Tool for Oyster Reef Enhancement by Stephen Coon and William K. Fitt
    Chapter 12 - Processes Controlling Local and Regional Patterns of Invertebrate Colonization: Applications to the Design of Artificial Oyster Habitat by Richard W. Osman and Robert B. Whitlatch
    Chapter 13 - Reefs as Metapopulations: Approaches for Restoring and Managing Spatially Fragmented Habitats by Robert B. Whitlatch and Richard W. Osman
    Chapter 14 - Application of Landscape Ecological Principles to Oyster Reef Habitat Restoration by David B. Eggleston
    Chapter 15 - Use of Oyster Reefs as a Habitat for Epibenthic Fish and Decapods by Martin H. Posey, Troy D. Alphin, Christopher M. Powell and Edward Townsend
    Chapter 16 - Are Three Dimensional Structure and Healthy Oyster Populations the Keys to an Ecologically Interesting and Important Fish Community? by Denise L. Breitburg
    Chapter 17 - Materials Processing by Oysters in Patches: Interactive Roles of Current Speed and Seston Composition by Deborah Harsh and Mark W. Luckenbach
    Chapter 18 - Oyster Reefs as Components in Estuarine Nutrient Cycling: Incidental or Controlling? by Richard F. Dame
    Part IV. Alternative Substrates
    Chapter 19 - Use of Dredged Material for Oyster Habitat Creation in Coastal Virginia by Walter I. Priest, III, Janet Nestlerode and Christopher W. Frye
    Chapter 20 - Alternatives to Clam and Oyster Shell as Cultch for Eastern Oysters by Haywood, E. L., III, T. M. Soniat and R. C. Broadhurst, III
    Chapter 21 - Dredged Material as a Substrate for Fisheries Habitat Establishment in Coastal Waters by Douglas Clarke, David Meyer, Allison Veishlow and Michael LaCroix
    Part V. Management Options and Economic Considerations
    Chapter 22 - Managing Around Oyster Diseases in Maryland and Maryland Oyster Roundtable Strategies by Kennedy T. Paynter
    Chapter 23 - Chesapeake Bay Oyster Reefs, Their Importance, Destruction and Guidelines for Restoring Them by William J. Hargis, Jr. and Dexter S. Haven
    Chapter 24 - Economics of Augmentation of Natural Production Using Remote Setting Techniques by John E. Supan, Charles A. Wilson and Kenneth J. Robert

    More than the sum of its parts – pattern mining, neural networks, and how they complement each other

    In this thesis we explore pattern mining and deep learning. Though often seen as orthogonal, these fields complement each other, and we propose to combine them to gain from each other's strengths. We first show how to efficiently discover succinct and non-redundant sets of patterns that provide insight into data beyond conjunctive statements. We leverage the interpretability of such patterns to unveil how and which information flows through neural networks, as well as what characterizes their decisions. Conversely, we show how to combine continuous optimization with pattern discovery, proposing a neural network that directly encodes discrete patterns, which allows us to apply pattern mining at a scale orders of magnitude larger than previously possible. Large neural networks are, however, exceedingly expensive to train, for which 'lottery tickets' – small, well-trainable sub-networks in randomly initialized neural networks – offer a remedy. We identify theoretical limitations of strong tickets and overcome them by equipping these tickets with the property of universal approximation. To analyze whether limitations in ticket sparsity are algorithmic or fundamental, we propose a framework to plant and hide lottery tickets. With novel ticket benchmarks we then conclude that the limitation is likely algorithmic, encouraging further developments for which our framework offers means to measure progress.
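    Lottery-ticket sub-networks are typically found by magnitude pruning: keep only the largest-magnitude weights and retrain the survivors from their initial values. A minimal one-shot sketch of the masking step follows; the thesis's strong tickets and planting/hiding framework go well beyond this.

    ```python
    import numpy as np

    def magnitude_mask(weights, sparsity):
        """One-shot magnitude pruning mask.

        Keeps the largest-magnitude (1 - sparsity) fraction of weights and
        zeroes the rest. Lottery-ticket experiments apply such a mask to the
        *initial* weights and train only the surviving sub-network.
        Ties at the threshold may prune slightly more than requested.
        """
        flat = np.abs(weights).ravel()
        k = int(len(flat) * sparsity)            # number of weights to remove
        if k == 0:
            return np.ones_like(weights, dtype=bool)
        thresh = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
        return np.abs(weights) > thresh
    ```

    Iterative variants repeat this prune-and-retrain cycle several times, which in practice finds sparser trainable tickets than a single one-shot pruning pass.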