Integration of genetic and genomics resources in einkorn wheat enables precision mapping of important traits
Einkorn wheat (Triticum monococcum) is an ancient grain crop and a close relative of the diploid progenitor (T. urartu) of polyploid wheat. It is the only diploid wheat species having both domesticated and wild forms and therefore provides an excellent system to identify domestication genes and genes for traits of interest to utilize in wheat improvement. Here, we leverage genomic advancements for einkorn wheat using an einkorn reference genome assembly combined with skim-sequencing of a large genetic population of 812 recombinant inbred lines (RILs) developed from a cross between a wild and a domesticated T. monococcum accession. We identify 15,919 crossover breakpoints delimited to a median and average interval of 114 Kbp and 219 Kbp, respectively. This high-resolution mapping resource enables us to perform fine-scale mapping of one qualitative (red coleoptile) and one quantitative (spikelet number per spike) trait, resulting in the identification of small physical intervals (400 Kb to 700 Kb) with a limited number of candidate genes. Furthermore, an important domestication locus for brittle rachis is also identified on chromosome 7A. This resource presents an exciting route to perform trait discovery in diploid wheat for agronomically important traits and their further deployment in einkorn as well as tetraploid pasta wheat and hexaploid bread wheat cultivars.
Fast algorithms for nearest neighbour search
The nearest neighbour problem is of practical significance in a number of fields: often we are interested in finding an object near to a given query object. The problem is old, and a large number of solutions have been proposed for it in the literature. However, even the most popular of these techniques have not been compared against each other. Moreover, many techniques, including the old and popular ones, can be implemented in a number of ways, and the different implementations of a technique have often not been thoroughly compared either. This research presents a detailed investigation of different implementations of two popular nearest neighbour search data structures, KDTrees and Metric Trees, and compares the implementations of each structure against each other. The best implementations of these structures are then compared against each other and against two other techniques, the Annulus Method and Cover Trees. The Annulus Method is an old technique that was rediscovered during the research for this thesis. Cover Trees are one of the most novel and promising data structures for nearest neighbour search proposed in the literature.
Acknowledgments. The continued support of the Department of Computer Science's Machine Learning group, and particularly of my supervisor Dr. Eibe Frank, is greatly appreciated; without it this thesis would not have been possible.
Nearly exact mining of frequent trees in large networks
Mining frequent patterns in a single network (graph) poses a number of challenges. Merely matching a single path pattern to a network under subgraph isomorphism is already NP-complete. Classical matching algorithms become intractable even for reasonably small patterns on networks that are large or have a high average degree. Based on recent advances in parameterized complexity theory, we propose a novel miner for rooted trees in networks. For a fixed parameter k (the maximal pattern size), the miner can mine all rooted trees with delay linear in the size of the network and only mildly exponential in k. This allows us to tractably mine rooted trees in large networks such as the WWW or social networks. We establish the practical applicability of our miner by presenting an experimental evaluation on both synthetic and real-world data. © 2013 The Author(s).
An empirical comparison of exact nearest neighbour algorithms
Abstract. Nearest neighbour search (NNS) is an old problem that is of practical importance in a number of fields. It involves finding, for a given point q, called the query, one or more points from a given set of points that are nearest to the query q. Since the initial inception of the problem a great number of algorithms and techniques have been proposed for its solution. However, it remains the case that many of the proposed algorithms have not been compared against each other on a wide variety of datasets. This research attempts to fill this gap to some extent by presenting a detailed empirical comparison of three prominent data structures for exact NNS: KD-Trees, Metric Trees, and Cover Trees. Our results suggest that there is generally little gain in using Metric Trees or Cover Trees instead of KD-Trees for the standard NNS problem.
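The exact NNS problem defined above can be illustrated with a minimal sketch that compares a linear scan against a simple KD-tree. This is not the implementation evaluated in the paper; the function names, the tuple-based tree layout, and the median split rule are all assumptions made for illustration.

```python
import math
import random

def brute_force_nn(points, q):
    """Exact nearest neighbour by scanning every point."""
    return min(points, key=lambda p: math.dist(p, q))

def build_kdtree(points, depth=0):
    """Build a KD-tree as nested tuples (point, axis, left, right).

    Hypothetical layout: the splitting axis cycles with depth and the
    median point along that axis becomes the node.
    """
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return (pts[mid], axis,
            build_kdtree(pts[:mid], depth + 1),
            build_kdtree(pts[mid + 1:], depth + 1))

def kdtree_nn(node, q, best=None):
    """Exact nearest neighbour: descend towards q, then backtrack only
    into subtrees that could still contain a closer point."""
    if node is None:
        return best
    point, axis, left, right = node
    if best is None or math.dist(point, q) < math.dist(best, q):
        best = point
    diff = q[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = kdtree_nn(near, q, best)
    # The far side can only help if the splitting plane is closer
    # to q than the best point found so far.
    if abs(diff) < math.dist(best, q):
        best = kdtree_nn(far, q, best)
    return best

if __name__ == "__main__":
    random.seed(0)
    data = [(random.random(), random.random()) for _ in range(1000)]
    tree = build_kdtree(data)
    query = (0.5, 0.5)
    print(kdtree_nn(tree, query), brute_force_nn(data, query))
```

Both functions return a point at the same minimal distance from the query; the KD-tree merely avoids visiting subtrees whose splitting plane is already farther away than the current best candidate.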
MIPS: A graph mining library
Many practical datasets (e.g., biological, social, or economic networks) can be elegantly represented with graphs. In the MiGraNT project we aim to develop a sound theoretical understanding of mining and learning with graphs. The MIgrant Prototype System (MIPS) is a library of effective algorithms based on this theory. This is an ongoing project, which aims to integrate a larger number of results. Here, we present the basic system and a first set of algorithms.
Principles and basic system. MIPS is written in C++ and benefits strongly from the meticulous use of C++ templates, which combine flexibility with efficiency. The library uses the C++ Boost libraries, especially the Boost Graph Library, to flexibly represent graphs of different types ((un)directed, (un)labeled, ...) with the same code. The documentation is Doxygen-based.
Frequent pattern mining. Mining frequent patterns is a data mining task often used in machine learning for feature generation. Depending on the application, homomorphism or subgraph isomorphism is the matching operator of preference, although the latter is more popular. Even for a simple path, subgraph isomorphism is NP-complete, and classical mining algorithms become intractable for patterns of a very modest size; a lot of work studies frequency counting of patterns with between 3 and 5 nodes. We use recent advances in fixed-parameter tractability to construct (randomized) algorithms capable of deciding subgraph isomorphism of a pattern in a network in $O(k^2 \log^2(k)\, m^w\, 2^k)$ time, with m the number of network edges, k the number of pattern vertices, and w the pattern treewidth. See [1] for (a part of) the relevant theory. Our algorithm can mine frequent trees up to size 17–18 and is, to the best of our knowledge, the first tractable tree pattern miner under subgraph isomorphism for large, dense networks. Currently, we are empirically studying the behavior for non-tree graphs.
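To give a feel for how a randomized, fixed-parameter tractable matcher can work, the sketch below uses colour coding (Alon, Yuster and Zwick) to decide whether a graph contains a simple path on k vertices. This is a different and simpler technique than the algorithm of [1], shown only as an illustration; the function names, the adjacency-dict input format, and the default number of trials are assumptions.

```python
import math
import random

def has_colorful_path(adj, colors, k):
    """Dynamic program over (end vertex, set of colours used).

    reach[v] holds bitmasks of colour sets of paths ending at v in which
    every vertex has a distinct colour; such paths are necessarily simple.
    """
    reach = {v: {1 << colors[v]} for v in adj}
    for _ in range(k - 1):
        nxt = {v: set() for v in adj}
        for u in adj:
            for mask in reach[u]:
                for v in adj[u]:
                    bit = 1 << colors[v]
                    if not mask & bit:  # colour of v not used yet
                        nxt[v].add(mask | bit)
        reach = nxt
    return any(reach[v] for v in adj)

def has_k_path(adj, k, trials=None, seed=0):
    """Randomized test for a simple path on k vertices.

    One-sided error: True is always correct; False may be wrong with a
    small probability that shrinks as the number of trials grows.
    """
    rng = random.Random(seed)
    if trials is None:
        # A fixed path becomes colourful with probability k!/k^k >= e^-k,
        # so on the order of e^k trials are used (assumed constant 3).
        trials = int(3 * math.e ** k) + 1
    for _ in range(trials):
        colors = {v: rng.randrange(k) for v in adj}
        if has_colorful_path(adj, colors, k):
            return True
    return False

if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
    star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    print(has_k_path(path, 4), has_k_path(star, 4))
```

Each trial costs time roughly proportional to 2^k times k times the number of edges, so the exponential part depends only on the pattern size k and not on the size of the network.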
Supervised learning. Although many libraries contain decision tree and random forest learning algorithms, MIPS includes a new implementation where the novelty lies in the aforementioned graph-based approach and exploitation of the templating mechanism. MIPS can efficiently learn from training data that does not fit in memory. These capabilities were successfully applied to the field of proteomics [2].
Future development. We are adding further components to the system, including algorithms to estimate the effective sample size of a set of networked (and hence non-independent) examples, kernel regression and decision tree learners for dependent examples, algorithms to learn dynamic models for time-evolving graphs, self-compiling graph algorithms, and algorithms that let the preceding components work on graph databases that do not fit in memory. We are also improving the documentation and the integration of the several components.
License. MIPS is GPLv3 licensed and available at https://dtai.cs.kuleuven.be/software/mips
References
[1] A. Kibriya and J. Ramon. Nearly exact mining of frequent trees in large networks. Data Mining and Knowledge Discovery, 27(3):478–504, November 2013.
[2] T. Fannes, E. Vandemarliere, L. Schietgat et al. Predicting tryptic cleavage from proteomics data using decision tree ensembles. Journal of Proteome Research, 12(5):2253–2259, April 2013.