37,838 research outputs found

    Foundational principles for large scale inference: Illustrations through correlation mining

    Full text link
    When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics the dataset is often variable-rich but sample-starved: a regime where the number nn of acquired samples (statistical replicates) is far fewer than the number pp of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data." Sample complexity however has received relatively less attention, especially in the setting when the sample size nn is fixed, and the dimension pp grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exa-scale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks

    An Overview of the Use of Neural Networks for Data Mining Tasks

    Get PDF
    In the recent years the area of data mining has experienced a considerable demand for technologies that extract knowledge from large and complex data sources. There is a substantial commercial interest as well as research investigations in the area that aim to develop new and improved approaches for extracting information, relationships, and patterns from datasets. Artificial Neural Networks (NN) are popular biologically inspired intelligent methodologies, whose classification, prediction and pattern recognition capabilities have been utilised successfully in many areas, including science, engineering, medicine, business, banking, telecommunication, and many other fields. This paper highlights from a data mining perspective the implementation of NN, using supervised and unsupervised learning, for pattern recognition, classification, prediction and cluster analysis, and focuses the discussion on their usage in bioinformatics and financial data analysis tasks

    Current status and future perspectives of bioinformatics in Tanzania

    Get PDF
    The main bottleneck in advancing genomics in present times is the lack of expertise in using bioinformatics tools and approaches for data mining in raw DNA sequences generated by modern high throughput technologies such as next generation sequencing. Although bioinformatics has been making major progress and contributing to the development in the rest of the world, it has still not yet fully integrated the tertiary education and research sector in Tanzania. This review aims to introduce a summary of recent achievements, trends and success stories of application of bioinformatics in biotechnology. The applications of bioinformatics in the fields such as molecular biology, biotechnology, medicine and agriculture, the global trend of bioinformatics, accessibility bioinformatics products in Tanzania, bioinformatics training initiatives in Tanzania, the future prospects of bioinformatics use in biotechnology globally and Tanzania in particular are reviewed. The paper is of interest and importance to rouse public awareness of the new opportunities that could be brought about by bioinformatics to address many research problems relevant to Tanzania and sub-Sahara Africa.Keywords: Bioinformatics, Biotechnology, Genomics, Tanzania

    From parasite genomes to one healthy world: Are we having fun yet?

    Get PDF
    In 1990, the Human Genome Sequencing Project was established. This laid the ground work for an explosion of sequence data that has since followed. As a result of this effort, the first complete genome of an animal, Caenorhabditis elegans was published in 1998. The sequence of Drosophila melanogaster was made available in March, 2000 and in the following year, working drafts of the human genome were generated with the completed sequence (92%) being released in 2003. Recent advancements and next-generation technologies have made sequencing common place and have infiltrated every aspect of biological research, including parasitology. To date, sequencing of 32 apicomplexa and 24 nematode genomes are either in progress or near completion, and over 600k nematode EST and 200k apicomplexa EST submissions fill the databases. However, the winds have shifted and efforts are now refocusing on how best to store, mine and apply these data to problem solving. Herein we tend not to summarize existing X-omics datasets or present new technological advances that promise future benefits. Rather, the information to follow condenses up-to-date-applications of existing technologies to problem solving as it relates to parasite research. Advancements in non-parasite systems are also presented with the proviso that applications to parasite research are in the making

    To share or not to share: Publication and quality assurance of research data outputs. A report commissioned by the Research Information Network

    No full text
    A study on current practices with respect to data creation, use, sharing and publication in eight research disciplines (systems biology, genomics, astronomy, chemical crystallography, rural economy and land use, classics, climate science and social and public health science). The study looked at data creation and care, motivations for sharing data, discovery, access and usability of datasets and quality assurance of data in each discipline
    corecore