Development of Copy Number Variation Detection Algorithms and Their Application to Genome Diversity Studies

Author: Shen Feichen
Publication date: 01/01/2019

Copy number variation (CNV) is an important class of variation that contributes to genome evolution and disease. CNVs that become fixed in a species give rise to segmental duplications, and already duplicated sequence is prone to subsequent gain and loss, leading to additional copy-number variation. Multiple methods exist for defining CNV based on high-throughput sequencing data, including analysis of mapped read depth. However, accurately assessing CNV can be computationally costly, and multi-mapping-based approaches may not specifically distinguish among paralogs or gene families. We present two rapid CNV estimation algorithms, QuicK-mer and fastCN, for second-generation short-read sequencing data. The QuicK-mer program is a paralog-sensitive CNV detector that relies on enumerating unique k-mers from a pre-tabulated reference genome. The latest version, QuicK-mer 2.0, uses a newly constructed k-mer counting core based on the DJB hash function and permits multithreaded CNV counting of a large input file. As a result, QuicK-mer 2.0 can produce copy-number profiles from a 10X coverage mammalian genome in less than 5 minutes. The second CNV estimator, fastCN, is based on sequence mapping and tolerates mismatches. The pipeline is built around the mrsFAST read mapper and does not use additional time compared to the mrsFAST mapping process. We validated the accuracy of both approaches with existing data on human paralogous regions from the 1000 Genomes Project. We also employed QuicK-mer to assess copy number variation on chimpanzee and human Y chromosomes. CNV has also been associated with phenotypic changes that occur during animal domestication. Large-scale CNVs were observed previously in cattle, pig, and chicken domestication. We assessed the role of CNV in dog domestication through a comparison of semi-feral village dogs and a global collection of wolves.
Our CNV selection scan uncovered many previously confirmed duplications and deletions but did not identify fixed variants that may have contributed to the initial domestication process. During this selection study, we uncovered CNVs that are errors in the existing canine reference assembly. We attempted to complement the current CanFam3.1 reference with the de novo genome assembly of a Great Dane breed dog named Zoey. PacBio long-read sequencing was conducted at 50x coverage with a median insert size of 7.8 kbp. The resulting assembly shows significant improvement, with a 20-fold increase in continuity and a two-thirds reduction in unplaced contigs. The Zoey Great Dane assembly closes 80% of CanFam3.1 gaps, where high GC content was the major culprit in the original assembly. Using unique k-mers assigned in these closed gaps, QuicK-mer found that many of these regions are fixed across dogs, while a small proportion show variability.
PhD, Human Genetics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/150064/1/feichens_1.pdf
Description of feichens_1.pdf: Restricted to UM users only
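
The unique k-mer counting idea described in the abstract above can be illustrated with a minimal Python sketch. It is not the QuicK-mer implementation: the function names, the toy reference, and the depth-ratio copy-number estimate are assumptions made for illustration, and reverse-complement k-mers, GC correction, and multithreading are ignored. Only the DJB hash recurrence (h = h * 33 + c, seeded with 5381) is a well-known published function.

```python
from collections import Counter

def djb2(text: str) -> int:
    """Classic DJB string hash (h = h * 33 + c), truncated to 32 bits."""
    h = 5381
    for ch in text:
        h = (h * 33 + ord(ch)) & 0xFFFFFFFF
    return h

def build_unique_kmer_table(reference: str, k: int) -> dict:
    """Index the k-mers that occur exactly once in the reference.

    Maps hash -> list of (kmer, position). The k-mer string is stored
    so that hash collisions can be resolved by exact comparison.
    """
    n = len(reference) - k + 1
    occurrences = Counter(reference[i:i + k] for i in range(n))
    table = {}
    for i in range(n):
        kmer = reference[i:i + k]
        if occurrences[kmer] == 1:
            table.setdefault(djb2(kmer), []).append((kmer, i))
    return table

def kmer_depth(reads, table, k, n_positions):
    """Count how often each unique reference k-mer is seen in the reads."""
    depth = [0] * n_positions
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            for stored, pos in table.get(djb2(kmer), []):
                if stored == kmer:
                    depth[pos] += 1
    return depth

def estimate_copy_number(depth, region, control, ploidy=2):
    """Hypothetical estimator: scale mean region depth by mean control depth.

    Both position collections should contain only unique k-mer positions.
    """
    def mean(positions):
        return sum(depth[i] for i in positions) / len(positions)
    return ploidy * mean(region) / mean(control)

# Toy data (assumed): every 4-mer of this reference is unique.
REFERENCE = "ACGTTGCAAGGCTA"
K = 4
table = build_unique_kmer_table(REFERENCE, K)
# Two full-length reads plus two extra reads covering the second half.
reads = [REFERENCE, REFERENCE, "GCAAGGCTA", "GCAAGGCTA"]
depth = kmer_depth(reads, table, K, len(REFERENCE) - K + 1)
copy_number = estimate_copy_number(depth, range(5, 11), range(0, 5))
```

In this toy example the second half of the reference is covered twice as deeply as the control half, so the estimate is twice the assumed ploidy. Restricting the table to k-mers that are unique in the reference is what lets the depth signal distinguish one paralog from another.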

Deep Blue Documents at the University of Michigan

A Graph Analytics Framework for Knowledge Discovery

Author: Shen Feichen

Title from PDF of title page, viewed on June 20, 2016
Dissertation advisor: Yugyung Lee
Vita
Includes bibliographical references (pages 203-222)
Thesis (Ph.D.)--School of Computing and Engineering. University of Missouri--Kansas City, 2016
In the current data movement, numerous efforts have been made to convert and normalize a large amount of traditionally structured and unstructured data into semi-structured data (e.g., RDF, OWL). With the increasing volume of semi-structured data coming into the big data community, data integration and knowledge discovery from heterogeneous domains have become important research problems. At the application level, detecting related concepts among ontologies shows great potential for knowledge discovery with big data. In an RDF graph, concepts represent entities and predicates indicate properties that connect different entities. It is crucial to determine how different concepts are related within a single ontology or across multiple ontologies by analyzing predicates in different knowledge bases. However, the world today is one of information explosion, and it is extremely difficult for researchers to find existing or potential predicates to link cross-domain concepts without any support from schema pattern analysis. Therefore, there is a need for a mechanism that performs predicate-oriented pattern analysis to partition heterogeneous ontologies into smaller, more closely related topics and to generate queries that discover cross-domain knowledge from each topic. In this work, we present such a model, which conducts predicate-oriented pattern analysis based on the close relationships among predicates and generates a similarity matrix. Based on this similarity matrix, we apply an innovative unsupervised learning algorithm to partition large data sets into smaller and closer topics that generate meaningful queries to fully discover knowledge over a set of interlinked data sources.
In this dissertation, we present a graph analytics framework that aims at providing semantic methods for analysis and pattern discovery from cross-domain graph data. Our contributions can be summarized as follows:
• The definition of predicate-oriented neighborhood measures to determine the neighborhood relationships among different RDF predicates of linked data across domains;
• The design of the global and local optimization of clustering and retrieval algorithms to maximize the knowledge discovery from large linked data: i) top-down clustering, called the Hierarchical Predicate oriented K-means Clustering; ii) bottom-up clustering, called the Predicate oriented Hierarchical Agglomerative Clustering; iii) automatic topic discovery and query generation, and context-aware topic path finding for a given source and target pair;
• The implementation of an interactive tool and endpoints for knowledge discovery and visualization from integrated query design and query processing across domains;
• Experimental evaluations conducted to validate the proposed methodologies of the framework using DBpedia, YAGO, and Bio2RDF datasets, and a comparison of the proposed methods with existing graph partition methods and topic discovery methods.
In this dissertation, we propose a framework called GraphKDD. GraphKDD is able to analyze and quantify the close relationships among predicates based on the Predicate Oriented Neighbor Pattern (PONP). Based on PONP, GraphKDD conducts a Hierarchical Predicate oriented K-Means clustering (HPKM) algorithm and a Predicate oriented Hierarchical Agglomerative clustering (PHAL) algorithm to partition graphs into semantically related sub-graphs. In addition, at the application level, GraphKDD can generate queries dynamically from topic discovery results and test reachability between source and target nodes.
We validate the proposed GraphKDD framework through comprehensive evaluations using DBpedia, YAGO, and Bio2RDF datasets.
Introduction -- Predicate oriented neighborhood patterns -- Unsupervised learning on PONP Association Measurement -- Query generation and topic aware link discovery -- The GraphKDD ontology learning framework -- Conclusion and future wor
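
The general pipeline in the abstract above (predicate neighborhoods, then a similarity matrix, then bottom-up clustering) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the dissertation's PONP measure or its PHAL algorithm: Jaccard overlap of neighborhoods, average linkage, the stopping threshold, and the toy triples are all stand-ins chosen for brevity.

```python
from itertools import combinations

def predicate_neighbors(triples):
    """For each predicate, collect the other predicates that touch the
    same subject or object node (an assumed notion of 'neighborhood')."""
    by_node = {}
    for s, p, o in triples:
        by_node.setdefault(s, set()).add(p)
        by_node.setdefault(o, set()).add(p)
    neighbors = {p: set() for _, p, _ in triples}
    for preds in by_node.values():
        for p in preds:
            neighbors[p] |= preds - {p}
    return neighbors

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def similarity_matrix(triples):
    """Pairwise Jaccard similarity of closed predicate neighborhoods."""
    nb = predicate_neighbors(triples)
    preds = sorted(nb)
    # Include each predicate in its own neighborhood so that predicates
    # sharing a node directly overlap.
    closed = {p: nb[p] | {p} for p in preds}
    sim = {(p, q): jaccard(closed[p], closed[q]) for p in preds for q in preds}
    return preds, sim

def agglomerate(preds, sim, threshold):
    """Average-linkage bottom-up clustering; stops when the best pair
    of clusters is less similar than the (assumed) threshold."""
    clusters = [{p} for p in preds]

    def linkage(a, b):
        return sum(sim[(p, q)] for p in a for q in b) / (len(a) * len(b))

    while len(clusters) > 1:
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) < threshold:
            break
        clusters[i] |= clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters

# Hypothetical triples from two unrelated domains.
TRIPLES = [
    ("alice", "worksAt", "acme"),
    ("alice", "livesIn", "paris"),
    ("aspirin", "treats", "headache"),
    ("aspirin", "interactsWith", "warfarin"),
]
preds, sim = similarity_matrix(TRIPLES)
topics = agglomerate(preds, sim, threshold=0.5)
```

With these toy triples the predicates separate into a "person" topic and a "drug" topic, because predicates within each domain share nodes while predicates across domains do not.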

University of Missouri: MOspace

Situation Aware Mobile Apps Framework

Author: Shen Feichen
Publication venue: University of Missouri--Kansas City

Title from PDF of title page, viewed on October 3, 2012
Thesis advisor: Yugyung Lee
Vita
Includes bibliographic references (p. 130-132)
Thesis (M.S.)--School of Computing and Engineering. University of Missouri--Kansas City, 2012
Mobile devices, such as smartphones and tablets, have become ubiquitous, with their adoption being driven by their immediacy and sensing capabilities. Applications, or apps, that run on portable computing devices have surged in popularity, with billions of downloads taking place. However, the growing number of mobile apps and the diversity of their users make it difficult to select the correct app to respond to evolving situations. To address this issue, it is vitally important to find an intelligent approach that provides situation awareness capabilities and an immediate response to changes. In this thesis, we have developed a semantic framework for mobile apps, named the Situation Awareness Mobile Apps Framework (SAMAF), to achieve the goal of dynamic and adaptive apps for automated composition, adaptation, and evolution of software systems responding to the mobile users' context and environmental changes. SAMAF is composed of two major components: i) a cloud-based service framework for mobile app development, deployment, and adaptation using a design of dynamic patterns for Service Oriented Architecture, and ii) an ontology-based context modeling and reasoning framework that is implemented based on Context Ontology modeling and Event Condition Action (ECA) rule-based inference to align the adaptation with the changes. The SAMAF framework has been evaluated by two kinds of experiments. One was conducted in real phone settings to obtain the running performance of mobile apps adapting to dynamic changes of the users' contexts.
The other was performed with a large number of mobile phone users simulated on JADE (Java Agent DEvelopment Framework), a multi-agent platform, to test adaptability, reasoning correctness, and scalability based on the communication and reasoning capabilities among different kinds of agents. Our results show that the proposed framework supports feasible, scalable, and adaptive responses to evolving contexts.
Introduction -- Related work -- SAMAF model -- SAMAF implementation -- Scenario illustration -- Conclusion and future wor
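
The Event Condition Action (ECA) style of rule mentioned in the abstract above can be illustrated with a small Python sketch. It shows the general ECA pattern only and is not SAMAF's rule engine or ontology model: the `Rule` class, the `dispatch` function, the event names, and the context keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, object]

@dataclass
class Rule:
    event: str                             # which event triggers the rule
    condition: Callable[[Context], bool]   # guard evaluated on the context
    action: Callable[[Context], str]       # adaptation to perform

def dispatch(event: str, context: Context, rules: List[Rule]) -> List[str]:
    """Run the action of every rule whose event matches and whose
    condition holds for the current context."""
    return [rule.action(context)
            for rule in rules
            if rule.event == event and rule.condition(context)]

# Hypothetical adaptation rules for a mobile user's context.
RULES = [
    Rule("location_changed",
         lambda c: c.get("location") == "hospital",
         lambda c: "enable_silent_mode"),
    Rule("battery_changed",
         lambda c: c.get("battery", 100) < 20,
         lambda c: "switch_to_low_power_app"),
]

low_battery = dispatch("battery_changed", {"battery": 15}, RULES)
full_battery = dispatch("battery_changed", {"battery": 80}, RULES)
at_hospital = dispatch("location_changed",
                       {"location": "hospital", "battery": 10}, RULES)
```

Separating the triggering event from the condition means a rule is only evaluated when a relevant change occurs, which keeps the adaptation logic declarative and easy to extend with new rules.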

University of Missouri: MOspace