35 research outputs found

    Mistake-Driven Learning in Text Categorization

    Learning problems in the text processing domain often map the text to a space whose dimensions are the measured features of the text, e.g., its words. Three characteristic properties of this domain are (a) very high dimensionality, (b) both the learned concepts and the instances reside very sparsely in the feature space, and (c) a high variation in the number of active features in an instance. In this work we study three mistake-driven learning algorithms for a typical task of this nature -- text categorization. We argue that these algorithms -- which categorize documents by learning a linear separator in the feature space -- have a few properties that make them ideal for this domain. We then show that a quantum leap in performance is achieved when we further modify the algorithms to better address some of the specific characteristics of the domain. In particular, we demonstrate (1) how variation in document length can be tolerated by either normalizing feature weights or by using negative weights, (2) the positive effect of applying a threshold range in training, (3) alternatives in considering feature frequency, and (4) the benefits of discarding features while training. Overall, we present an algorithm, a variation of Littlestone's Winnow, which performs significantly better than any other algorithm tested on this task using a similar feature set. Comment: 9 pages, uses aclap.st
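
The abstract above mentions a Winnow variant that normalizes for document length and applies a threshold range in training. A minimal sketch of that general idea follows; it is a plain positive Winnow with a "thick" threshold, not the authors' exact algorithm. The function names, the parameter values (`theta`, `margin`, `alpha`, `beta`) and the choice to normalize by the number of active features are all assumptions made for illustration.

```python
from typing import Dict, Iterable, Set, Tuple

Example = Tuple[Set[str], bool]  # (active features of a document, is it in the category?)


def score(weights: Dict[str, float], features: Set[str]) -> float:
    """Average weight of the active features (length-normalized linear score).

    Features never seen in training count with the initial weight 1.0.
    """
    if not features:
        return 0.0
    return sum(weights.get(f, 1.0) for f in features) / len(features)


def train_winnow(
    examples: Iterable[Example],
    n_epochs: int = 5,
    theta: float = 1.0,   # decision threshold (assumed value)
    margin: float = 0.1,  # half-width of the threshold range (assumed value)
    alpha: float = 1.5,   # promotion factor, > 1 (assumed value)
    beta: float = 0.5,    # demotion factor, between 0 and 1 (assumed value)
) -> Dict[str, float]:
    """Mistake-driven training with multiplicative updates.

    A positive example triggers promotion unless it scores at least
    theta + margin; a negative example triggers demotion unless it scores
    at most theta - margin. Only the active features are updated.
    """
    data = list(examples)
    weights: Dict[str, float] = {}
    for _ in range(n_epochs):
        for features, label in data:
            if not features:
                continue
            for f in features:
                weights.setdefault(f, 1.0)  # every feature starts at weight 1.0
            s = score(weights, features)
            if label and s < theta + margin:
                for f in features:
                    weights[f] *= alpha  # promote
            elif not label and s > theta - margin:
                for f in features:
                    weights[f] *= beta   # demote
    return weights


def predict(weights: Dict[str, float], features: Set[str], theta: float = 1.0) -> bool:
    """Classify a document by comparing its normalized score to theta."""
    return score(weights, features) >= theta
```

Because updates touch only the features present in a misclassified document, the cost of each step depends on document length rather than on the size of the vocabulary, which suits the sparse, high-dimensional setting the abstract describes.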

    DNA Repair Biomarker for Lung Cancer Risk and its Correlation With Airway Cells Gene Expression.

    Background: Improving lung cancer risk assessment is required because current early-detection screening criteria miss most cases. We therefore examined the utility for lung cancer risk assessment of a DNA Repair score obtained from OGG1, MPG, and APE1 blood tests. In addition, we examined the relationship between the level of DNA repair and global gene expression. Methods: We conducted a blinded case-control study with 150 non-small cell lung cancer case patients and 143 control individuals. DNA Repair activity was measured in peripheral blood mononuclear cells, and the transcriptome of nasal and bronchial cells was determined by RNA sequencing. A combined DNA Repair score was formed using logistic regression, and its correlation with disease was assessed using cross-validation; correlation of expression to DNA Repair was analyzed using Gene Ontology enrichment. Results: DNA Repair score was lower in case patients than in control individuals, regardless of the case's disease stage. Individuals at the lowest tertile of DNA Repair score had an increased risk of lung cancer compared to individuals at the highest tertile, with an odds ratio (OR) of 7.2 (95% confidence interval [CI] = 3.0 to 17.5; P < .001), and independent of smoking. Receiver operating characteristic analysis yielded an area under the curve of 0.89 (95% CI = 0.82 to 0.93). Remarkably, low DNA Repair score correlated with a broad upregulation of gene expression of immune pathways in patients but not in control individuals. Conclusions: The DNA Repair score, previously shown to be a lung cancer risk factor in the Israeli population, was validated in this independent study as a mechanism-based cancer risk biomarker and can substantially improve current lung cancer risk prediction, assisting prevention and early detection by computed tomography scanning. This work was funded by grants from NIH/NCI/EDRN (#1 U01 CA111219), the Flight Attendant Medical Research Institute, Florida, the Mike Rosenbloom Foundation and Weizmann Institute of Science to ZL and TPE; and by grants from Cancer Research UK to BP and to the Cancer Research UK Cambridge Centre; and by a UK National Institute for Health Research Senior Fellowship to BP; and by the Cambridge Biomedical Research Centre and the Cancer Research UK Cambridge Centre to RCR. Volunteer participant recruitment through the Cambridge Bioresource was funded by the Cambridge Biomedical Research Centre
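
The abstract above reports an odds ratio with a 95% confidence interval. As a reminder of how such a figure is read, here is a minimal sketch of the textbook unadjusted calculation from a 2x2 table, using the log-odds (Woolf) interval. The study derived its estimate from logistic regression, so this sketch would not reproduce the reported 7.2 (3.0 to 17.5). The function name is hypothetical, and the counts in the usage note are invented placeholders, not study data.

```python
import math
from typing import Tuple


def odds_ratio_ci(
    exposed_cases: int,
    exposed_controls: int,
    unexposed_cases: int,
    unexposed_controls: int,
    z: float = 1.96,  # normal quantile for a two-sided 95% interval
) -> Tuple[float, float, float]:
    """Return (odds ratio, lower bound, upper bound) for a 2x2 table.

    "Exposed" would correspond here to the lowest tertile of the score and
    "unexposed" to the highest tertile. All four counts must be positive.
    """
    a, b, c, d = exposed_cases, exposed_controls, unexposed_cases, unexposed_controls
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

For example, with placeholder counts `odds_ratio_ci(20, 10, 10, 20)` the point estimate is 4.0, and the interval is asymmetric around it because it is symmetric on the log scale. An interval that excludes 1, as the reported one does, indicates an association at the chosen confidence level.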

    Ataxia-telangiectasia: Linkage analysis in highly inbred Arab and Druze families and differentiation from an ataxia-microcephaly-cataract syndrome

    Ataxia-telangiectasia (A-T) is a progressive autosomal recessive disease featuring neurodegeneration, immunodeficiency, chromosomal instability, radiation sensitivity and a highly increased proneness to cancer. A-T is ethnically widespread and genetically heterogeneous, as indicated by the existence of four complementation groups in this disease. Several "A-T-like" genetic diseases share various clinical and cellular characteristics with A-T. By using linkage analysis to study North American and Turkish A-T families, the ATA (A-T, complementation group A) gene has been mapped to chromosome 11q23. A number of Israeli Arab A-T patients coming from large, highly inbred families were assigned to group A. In one of these families, an additional autosomal recessive disease was identified, characterized by ataxia, hypotonia, microcephaly and bilateral congenital cataracts. In two patients with this syndrome, normal levels of serum immunoglobulins and alpha-fetoprotein, chromosomal stability in peripheral blood lymphocytes and skin fibroblasts, and normal cellular response to treatments with X-rays and the radiomimetic drug neocarzinostatin indicated that this disease does not share, with A-T, any additional features other than ataxia. These tests also showed that another patient in this family, who is also mentally retarded, is affected with both disorders. This conclusion was further supported by linkage analysis with 11q23 markers. Lod scores between A-T and these markers, cumulated over three large Arab families, were significant and confirmed the localization of the ATA gene to 11q23. However, another Druze family, unassigned to a specific complementation group, showed several recombinants between A-T and the same markers, leaving the localization of the A-T gene in this family open

    Association between translation efficiency and horizontal gene transfer within microbial communities

    Horizontal gene transfer (HGT) is a major force in microbial evolution. Previous studies have suggested that a variety of factors, including restricted recombination and toxicity of foreign gene products, may act as barriers to the successful integration of horizontally transferred genes. This study identifies an additional central barrier to HGT—the lack of co-adaptation between the codon usage of the transferred gene and the tRNA pool of the recipient organism. Analyzing the genomic sequences of more than 190 microorganisms and the HGT events that have occurred between them, we show that the number of genes that were horizontally transferred between organisms is positively correlated with the similarity between their tRNA pools. Those genes that are better adapted to the tRNA pools of the target genomes tend to undergo more frequent HGT. At the community (or environment) level, organisms that share a common ecological niche tend to have similar tRNA pools. These results remain significant after controlling for diverse ecological and evolutionary parameters. Our analysis demonstrates that there are bi-directional associations between the similarity in the tRNA pools of organisms and the number of HGT events occurring between them. Similar tRNA pools between a donor and a host tend to increase the probability that a horizontally acquired gene will become fixed in its new genome. Our results also suggest that frequent HGT may be a homogenizing force that increases the similarity in the tRNA pools of organisms within the same community

    Cost of resistance to trematodes in freshwater snail populations with low clonal diversity

    Background: The persistence of high genetic variability in natural populations garners considerable interest among ecologists and evolutionary biologists. One proposed hypothesis for the maintenance of high levels of genetic diversity relies on frequency-dependent selection imposed by parasites on host populations (Red Queen hypothesis). A complementary hypothesis suggests that a trade-off between fitness costs associated with tolerance to stress factors and fitness costs associated with resistance to parasites is responsible for the maintenance of host genetic diversity. Results: The present study investigated whether host resistance to parasites is traded off with tolerance to environmental stress factors (high/low temperatures, high salinity), by comparing populations of the freshwater snail Melanoides tuberculata with low vs. high clonal diversity. Since polyclonal populations were found to be more parasitized than populations with low clonal diversity, we expected them to be tolerant to environmental stress factors. We found that clonal diversity explained most of the variation in snail survival under high temperature, thereby suggesting that tolerance to high temperatures of clonally diverse populations is higher than that of populations with low clonal diversity. Conclusions: Our results suggest that resistance to parasites may come at a cost of reduced tolerance to certain environmental stress factors

    Multi-Document Keyphrase Extraction: Dataset, Baselines and Review

    Keyphrase extraction has been extensively researched within the single-document setting, with an abundance of methods, datasets and applications. In contrast, multi-document keyphrase extraction has been infrequently studied, despite its utility for describing sets of documents, and its use in summarization. Moreover, no prior dataset exists for multi-document keyphrase extraction, hindering the progress of the task. Recent advances in multi-text processing make the task an even more appealing challenge to pursue. To stimulate this pursuit, we present here the first dataset for the task, MK-DUC-01, which can serve as a new benchmark, and test multiple keyphrase extraction baselines on our data. In addition, we provide a brief, yet comprehensive, literature review of the task

    Data from: Clonal diversity driven by parasitism in a freshwater snail

    One explanation for the widespread abundance of sexual reproduction is the advantage that genetically diverse sexual lineages have under strong pressure from virulent coevolving parasites. Such parasites are believed to track common asexual host genotypes, resulting in negative frequency-dependent selection that counterbalances the population growth-rate advantage of asexuals in comparison with sexuals. In the face of genetically diverse asexual lineages, this advantage of sexual reproduction might be eroded, and instead sexual populations would be replaced by diverse assemblages of clonal lineages. We investigated whether parasite-mediated selection promotes clonal diversity in 22 natural populations of the freshwater snail Melanoides tuberculata. We found that infection prevalence explains the observed variation in the clonal diversity of M. tuberculata populations, while no such relationship was found between infection prevalence and male frequency. Clonal diversity and male frequency were independent of snail population density. Incorporating ecological factors such as presence/absence of fish, habitat geography and habitat type did not improve the predictive power of regression models. Approximately 11% of the clonal snail genotypes were shared among 2-4 populations, creating a web of 17 interconnected populations. Taken together, our study suggests that parasite-mediated selection coupled with host dispersal ecology promotes clonal diversity. This, in return, may erode the advantage of sexual reproduction in M. tuberculata populations