185 research outputs found

    Machine Learning in Automated Text Categorization

    Full text link
    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting in the manual definition of a classifier by domain experts) are a very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.Comment: Accepted for publication on ACM Computing Survey

    Discovery of Rare Variants via Sequencing: Implications for the Design of Complex Trait Association Studies

    Get PDF
    There is strong evidence that rare variants are involved in complex disease etiology. The first step in implicating rare variants in disease etiology is their identification through sequencing in both randomly ascertained samples (e.g., the 1,000 Genomes Project) and samples ascertained according to disease status. We investigated to what extent rare variants will be observed across the genome and in candidate genes in randomly ascertained samples, the magnitude of variant enrichment in diseased individuals, and biases that can occur due to how variants are discovered. Although sequencing cases can enrich for casual variants, when a gene or genes are not involved in disease etiology, limiting variant discovery to cases can lead to association studies with dramatically inflated false positive rates
    • …
    corecore