160 research outputs found

    Examining the uses of shared data

    Background
 Many initiatives and repositories exist to encourage the sharing of research data, and thousands of microarray gene expression datasets are publicly available. Many studies reuse this data, but it is not well understood which datasets are reused and for what purpose.

 Materials and Methods
 We trained a machine-learning algorithm to automatically classify full-text gene expression microarray studies into two classes: those that generated original microarray data (n=900) and those that only reused data (n=250). We then compared the Medical Subject Heading (MeSH) terms of the two classes to identify MeSH topics that were over- or under-represented among publications with reused data.

 Results
 Studies on humans, mice, chordata, and invertebrates were equally likely to be conducted using original or shared microarray data, whereas shared data was used in a relatively high proportion of studies involving fungi (odds ratio (OR)=2.4), and a relatively low proportion involving rats, bacteria, viruses, plants, or genetically-altered or inbred animals (OR<0.05). Unsurprisingly, when we looked at Major MeSH terms to represent the primary purpose of the studies, statistical and computational methods clearly dominated. The only biomedical topics with a relatively high proportion of data reuse Major MeSH terms were Promoter Regions, Evolution, and Protein Interaction Mapping.
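 For readers unfamiliar with the statistic, an odds ratio such as those reported above can be computed from a 2x2 table of study counts. The sketch below is illustrative only: the function name, argument names, and counts are hypothetical and are not taken from the study.

```python
def odds_ratio(reuse_with_term, reuse_without_term,
               original_with_term, original_without_term):
    """Odds of a MeSH term among data-reuse studies divided by
    the odds of the same term among original-data studies."""
    odds_reuse = reuse_with_term / reuse_without_term
    odds_original = original_with_term / original_without_term
    return odds_reuse / odds_original


# Hypothetical counts, chosen only to show the arithmetic.
example = odds_ratio(30, 220, 50, 850)
print(f"odds ratio = {example:.2f}")
```

 A value above 1 would indicate that the term is relatively more common among studies that reuse data, and a value below 1 that it is relatively less common.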

 Discussion
 Identifying areas of particularly successful microarray data reuse—such as Saccharomyces cerevisiae datasets and studies of promoter regions and evolution—can highlight best practices to be used when developing research agendas, tools, standards, repositories, and communities in areas which have yet to receive major benefits from shared data.

    Unbalanced data processing using oversampling: machine Learning

    Deep learning (DL) algorithms now show good results on problems that share characteristics such as large amounts of data and high dimensionality. However, one of the main current challenges is the classification of high-dimensional databases with very few samples and high class imbalance. Biomedical databases of gene expression microarrays have exactly these characteristics: class imbalance, few samples, and high dimensionality. Class imbalance arises when the set of samples belonging to one class is much larger than the set of samples of the other class or classes. This problem has been identified as one of the main challenges for algorithms applied in the context of Big Data. The objective of this research is to study gene expression databases using conventional under- and oversampling methods for class balancing, such as random undersampling (RUS), random oversampling (ROS), and SMOTE. The databases were modified by increasing their imbalance and, in another case, by generating artificial noise.
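    The resampling methods named in this abstract can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' code: the function names are hypothetical, and the SMOTE-style helper only shows the interpolation step, omitting the nearest-neighbour search that the full method uses to pick the second sample.

```python
import random
from collections import Counter, defaultdict


def _group_by_class(samples, labels):
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    return by_class


def random_oversample(samples, labels, seed=0):
    """ROS: duplicate random samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


def random_undersample(samples, labels, seed=0):
    """RUS: randomly discard samples until every class matches the smallest."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        out_x.extend(rng.sample(xs, target))
        out_y.extend([y] * target)
    return out_x, out_y


def smote_point(a, b, rng):
    """SMOTE-style step: a random point on the segment between two
    minority-class samples a and b."""
    t = rng.random()
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]


# Toy data (made up): four majority samples and one minority sample.
X = [[0.0], [1.0], [2.0], [3.0], [10.0]]
y = [0, 0, 0, 0, 1]
print(Counter(random_oversample(X, y)[1]))
print(Counter(random_undersample(X, y)[1]))
```

    With very few samples, as in microarray data, undersampling discards scarce information while oversampling by duplication repeats the same minority points, which is one reason interpolation-based methods such as SMOTE are commonly compared against both.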

    A self-adaptive migration model genetic algorithm for data mining applications

    Data mining involves the nontrivial process of extracting knowledge or patterns from large databases. Genetic Algorithms are efficient and robust searching and optimization methods that are used in data mining. In this paper we propose a Self-Adaptive Migration Model GA (SAMGA), where the population size, the number of crossover points, and the mutation rate of each population are fixed adaptively. Further, the migration of individuals between populations is decided dynamically. This paper gives a mathematical schema analysis of the method, stating and showing that the algorithm exploits previously discovered knowledge for a more focused and concentrated search of heuristically high-yielding regions while simultaneously performing a highly explorative search on the other regions of the search space. The effective performance of the algorithm is then shown using standard testbed functions and a set of actual classification data mining problems. A Michigan-style approach was used to build the classifier, and the system was tested with machine learning databases including the Pima Indian Diabetes database, the Wisconsin Breast Cancer database, and a few others. The performance of our algorithm is better than that of the others. © 2007 Elsevier Inc. All rights reserved.
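    The general idea of a multi-population GA with migration and a self-adjusting mutation rate can be sketched as follows. This is not SAMGA: the abstract does not give its update rules, so the adaptation rule, the fixed ring-migration schedule, the sphere test function, and all names and parameter values below are assumptions made for illustration.

```python
import random


def sphere(x):
    """Test function to minimise; the global minimum is 0 at the origin."""
    return sum(v * v for v in x)


def evolve(pop, mutation_rate, rng):
    """One generation: keep the best individual, refill by crossover + mutation."""
    pop = sorted(pop, key=sphere)
    parents = pop[: max(2, len(pop) // 2)]
    children = [pop[0]]  # elitism: the best individual always survives
    while len(children) < len(pop):
        a, b = rng.sample(parents, 2)
        cut = rng.randrange(1, len(a))  # single-point crossover
        child = a[:cut] + b[cut:]
        child = [v + rng.gauss(0, 0.3) if rng.random() < mutation_rate else v
                 for v in child]
        children.append(child)
    return children


def island_ga(n_islands=3, pop_size=10, dims=4, generations=30,
              migrate_every=5, seed=0):
    rng = random.Random(seed)
    islands = [[[rng.uniform(-5, 5) for _ in range(dims)]
                for _ in range(pop_size)] for _ in range(n_islands)]
    rates = [0.2] * n_islands
    start_best = min(sphere(ind) for isl in islands for ind in isl)
    for gen in range(1, generations + 1):
        for i in range(n_islands):
            before = min(map(sphere, islands[i]))
            islands[i] = evolve(islands[i], rates[i], rng)
            after = min(map(sphere, islands[i]))
            # Assumed toy rule: mutate more when an island stalls, less otherwise.
            if after >= before:
                rates[i] = min(0.9, rates[i] * 1.2)
            else:
                rates[i] = max(0.05, rates[i] * 0.8)
        if gen % migrate_every == 0:
            # Ring migration: each island's best replaces the next island's worst.
            bests = [min(isl, key=sphere) for isl in islands]
            for i in range(n_islands):
                dest = islands[(i + 1) % n_islands]
                worst = max(range(len(dest)), key=lambda j: sphere(dest[j]))
                dest[worst] = list(bests[i])
    end_best = min(sphere(ind) for isl in islands for ind in isl)
    return start_best, end_best


start, end = island_ga()
print(f"best fitness: {start:.3f} -> {end:.3f}")
```

    Because each island keeps its best individual and migration only overwrites the worst one, the best fitness in this sketch can never get worse from one generation to the next.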

    Databases for Managing Genetic Resources Collections and Mapping Populations of Forage and Related Species

    Effective management of plant material used in crop improvement and underpinning research is greatly facilitated by a properly designed data structure accessible by all those working with the material. At IGER we have developed the Aberystwyth Genetic Resources Information System, AGRIS, for managing genetic resources acquired through collecting trips, seed exchange, breeding and transgenic programmes. Recently this has been complemented by MaPIS, a Mapping Populations Information System, which links with AGRIS and allows for storage and documentation of information about plant mapping populations, including pedigrees, status and physical locations of accessions and individual genotypes. IGER also maintains the European Central Crop Databases for Lolium species and Trifolium repens, and the UK National Inventory (UKNI) of all plant genetic resources conserved ex situ in the UK; by November 2004, the UKNI had contributed over 220,000 accessions to the 900,000 in the Europe-wide database EURISCO.

    Capturing genetic gains in productivity with heterosis

    The National Swine Registry (NSR) has two long-term goals: 1) to register purebred pigs and thus assure breed purity, and 2) to encourage genetic progress through performance testing - genetic selection programs. The continued production of purebred lines assures that a number of breeds are available to produce females and terminal cross pigs with high levels of heterosis. Purebred breeders, whose customers are commercial swine producers, have produced substantial rates of genetic progress by use of the STAGES program.

    Guilt By Genetic Association: The Fourth Amendment and the Search of Private Genetic Databases by Law Enforcement

    Over the course of 2018, a number of suspects in unsolved crimes have been identified through the use of GEDMatch, a public online genetic database. Law enforcement’s use of GEDMatch to identify suspects in cold cases likely does not constitute a search under the Fourth Amendment because the genetic information hosted on the website is publicly available. Transparency reports from direct-to-consumer (DTC) genetic testing providers like 23andMe and Ancestry suggest that federal and state officials may now be requesting access to private genetic databases as well. Whether law enforcement’s use of private DTC genetic databases to search for familial relatives of a suspect’s genetic profile constitutes a search within the meaning of the Fourth Amendment is far less clear. A strict application of the third-party doctrine suggests that individuals have no expectation of privacy in genetic information that they voluntarily disclose to third parties, including DTC providers. This Note, however, contends that the U.S. Supreme Court’s recent decision in Carpenter v. United States overwhelmingly supports the proposition that genetic information disclosed to third-party DTC providers is subject to Fourth Amendment protection. Approximately fifteen million individuals in the United States have already submitted their genetic information to DTC providers. The genetic information held by these providers can reveal a host of highly intimate details about consumers’ medical conditions, behavioral traits, genetic health risks, ethnic background, and familial relationships. Allowing law enforcement warrantless access to investigate third-party DTC genetic databases circumvents their consumers’ reasonable expectations of privacy by exposing this sensitive genetic information to law enforcement without any meaningful oversight. 
    Furthermore, individuals likely reasonably expect that they retain ownership over their uniquely personal genetic information despite their disclosure of that information to a third-party provider. This Note therefore asserts that the third-party doctrine does not permit law enforcement to conduct warrantless searches for suspects on private DTC genetics databases under the Fourth Amendment.

    Evaluating the accuracy of a functional SNP annotation system

    Many common and chronic diseases are influenced at some level by genetic variation. Research in population genetics, specifically in the area of single nucleotide polymorphisms (SNPs), is critical to understanding human genetic variation. A key element in assessing the role of a given SNP is determining whether the variation is likely to result in a change in function. The SNP Integration Tool (SNPit) is a comprehensive tool that integrates diverse, existing predictors of SNP functionality, providing the user with information for improved association study analysis. To evaluate the SNPit system, we developed an alternative gold standard to measure accuracy using sensitivity and specificity. The results of our evaluation demonstrated that our alternative gold standard produced encouraging results.
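    Sensitivity and specificity, the two accuracy measures mentioned above, can be computed by comparing predictions against a gold standard. The sketch below is a generic illustration, not SNPit code: the function name and the example labels are hypothetical, with True standing for "predicted or known to be functional".

```python
def sensitivity_specificity(predicted, actual):
    """Return (sensitivity, specificity) for paired boolean labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # true positives
    tn = sum(1 for p, a in pairs if not p and not a)  # true negatives
    fp = sum(1 for p, a in pairs if p and not a)      # false positives
    fn = sum(1 for p, a in pairs if not p and a)      # false negatives
    sensitivity = tp / (tp + fn)  # share of real positives that were found
    specificity = tn / (tn + fp)  # share of real negatives correctly rejected
    return sensitivity, specificity


# Made-up labels for five SNPs: predictor output vs. gold standard.
predicted = [True, True, False, False, True]
gold = [True, False, False, False, True]
sens, spec = sensitivity_specificity(predicted, gold)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

    Both measures depend on the quality of the gold standard itself, which is why the choice of gold standard matters when evaluating a tool of this kind.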