143 research outputs found

    Concept learning of text documents

    Concept learning of text documents can be viewed as the problem of acquiring the definition of a general category of documents. To define the category of a text document, a conjunction of keywords is usually used. These keywords should be few and comprehensible. A naïve method is to enumerate all combinations of keywords and extract the suitable ones. However, because the number of keyword combinations is enormous, exhaustive enumeration cannot in practice extract the most relevant keywords to describe the categories of documents. Many heuristic methods have been proposed, such as GA-based and immune-based algorithms. In this work, we introduce a pruning power technique and propose a robust enumeration-based concept learning algorithm. Experimental results show that the rules produced by our approach are more comprehensible and simpler than those produced by other methods.
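As a rough illustration of enumeration with pruning, the sketch below lists keyword conjunctions by increasing size and keeps those that match at least one positive document and no negative document. The function name `learn_conjunctions`, the toy documents, and the superset-skipping rule are assumptions made for this example; the abstract does not specify how its pruning power technique works.

```python
from itertools import combinations

def learn_conjunctions(positives, negatives, max_len=3):
    """Enumerate keyword conjunctions by increasing size and keep those that
    match at least one positive document and no negative document."""
    vocabulary = sorted(set().union(*positives))
    rules = []
    for size in range(1, max_len + 1):
        for combo in combinations(vocabulary, size):
            keywords = frozenset(combo)
            # Pruning (an assumption of this sketch): a superset of an accepted
            # rule is more specific and cannot cover more positives, so skip it.
            if any(rule <= keywords for rule in rules):
                continue
            # Reject conjunctions that also match a negative document.
            if any(keywords <= doc for doc in negatives):
                continue
            # Keep conjunctions supported by at least one positive document.
            if any(keywords <= doc for doc in positives):
                rules.append(keywords)
    return rules

# Hypothetical toy corpus: each document is a set of keywords.
positives = [{"goal", "match", "team"}, {"goal", "league", "team"}]
negatives = [{"market", "team", "stock"}, {"match", "election"}]
rules = learn_conjunctions(positives, negatives)
for rule in rules:
    print(" AND ".join(sorted(rule)))
```

Enumerating by size means the shortest consistent conjunctions are accepted first, which is what allows longer, more specific candidates to be skipped without testing them against the documents.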

    Finding short patterns to classify text documents

    Many classification methods have been proposed to find patterns in text documents. However, according to the principle of Occam's razor, "the explanation of any phenomenon should make as few assumptions as possible", short patterns are usually more explainable and meaningful for classifying text documents. In this paper, we propose a depth-first pattern generation algorithm, which finds short patterns in text documents more effectively than a breadth-first algorithm.
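A minimal sketch of depth-first pattern generation with a length bound follows. It is not the paper's algorithm: the function name `dfs_short_patterns`, the toy documents, and the stopping rules (stop lengthening a pattern once it excludes all negatives, and drop a branch with no positive support) are assumptions chosen to show the general idea.

```python
def dfs_short_patterns(positives, negatives, max_len=2):
    """Depth-first search for short keyword patterns that match some positive
    documents and no negative documents. Returns (pattern, positive_support)."""
    vocabulary = sorted(set().union(*positives))
    found = []

    def covers(pattern, docs):
        return [d for d in docs if pattern <= d]

    def extend(pattern, start, pos, neg):
        for i in range(start, len(vocabulary)):
            candidate = pattern | {vocabulary[i]}
            pos_hit = covers(candidate, pos)
            if not pos_hit:
                continue  # no positive support: prune this branch
            neg_hit = covers(candidate, neg)
            if not neg_hit:
                found.append((frozenset(candidate), len(pos_hit)))
                continue  # already discriminative: do not lengthen it
            if len(candidate) < max_len:
                # Only documents still matched need to be rechecked deeper down.
                extend(candidate, i + 1, pos_hit, neg_hit)

    extend(frozenset(), 0, positives, negatives)
    return found

# Hypothetical toy corpus: each document is a set of keywords.
positives = [{"ball", "coach", "goal"}, {"ball", "goal", "team"}]
negatives = [{"ball", "coach"}, {"goal", "team"}]
patterns = dfs_short_patterns(positives, negatives)
for pattern, support in patterns:
    print(sorted(pattern), support)
```

Because each recursive call passes down only the documents that the current pattern still matches, the search keeps one path in memory at a time rather than a whole level of candidates. This simple version does not guarantee that every reported pattern is minimal.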

    Finding coverage using incremental attribute combinations

    Coverage is a region of attribute (or feature) space that covers only positive samples. Finding coverage is the core problem in induction algorithms because coverage can be used as rules to describe positive samples. To reflect the characteristics of the training samples, large coverage that covers more positive samples is desirable. However, large coverage is difficult to find because the attribute space usually has very high dimensionality. Many heuristic methods, such as ID3, AQ and CN2, have been proposed to find large coverage. A robust algorithm has also been proposed to find the largest coverage, but its time and space complexity become costly when the dimensionality is high. To overcome this drawback, this paper proposes an algorithm that adopts incremental feature combinations to find the largest coverage effectively. In this algorithm, irrelevant coverage can be pruned away at early stages because potentially large coverage is found earlier. Experiments show that the space and time needed to find the largest coverage are significantly reduced.

    Finding rule groups to classify high dimensional gene expression datasets

    Microarray data provide quantitative information about the transcription profile of cells. Machine learning methodology has increasingly attracted bioinformatics researchers for analyzing microarray datasets, and some machine learning approaches are widely used to classify and mine biological datasets. However, many gene expression datasets have extremely high dimensionality, so traditional machine learning methods cannot be applied effectively and efficiently. This paper proposes a robust algorithm to find rule groups to classify gene expression datasets. Unlike most classification algorithms, which select dimensions (genes) heuristically to form rule groups that identify classes such as cancerous and normal tissues, our algorithm guarantees finding the best-k dimensions (genes), those most discriminative for classifying samples in different classes, to form rule groups for the classification of expression datasets. Our experiments show that the rule groups obtained by our algorithm have higher accuracy than those of other classification approaches.

    Keyword extraction for text categorization

    Text categorization (TC) is one of the main applications of machine learning. Many methods have been proposed, such as the Rocchio method, naive Bayes-based methods, and SVM-based text classification. These methods learn from labeled text documents and then construct a classifier, which can predict the category of a new text document. However, these methods do not give a description of each category. In the machine learning field, there are many concept learning algorithms, such as ID3 and CN2. This paper proposes a more robust algorithm to induce concepts from training examples, which is based on enumeration of all possible keyword combinations. Experimental results show that the rules produced by our approach have higher precision and greater simplicity than those of other methods.

    Evaluating the role of alcohol consumption in breast and ovarian cancer susceptibility using population-based cohort studies and two-sample Mendelian randomization analyses.

    Alcohol consumption is correlated positively with risk for breast cancer in observational studies, but observational studies are subject to reverse causation and confounding. The association with epithelial ovarian cancer (EOC) is unclear. We performed both observational Cox regression and two-sample Mendelian randomization (MR) analyses using data from various European cohort studies (observational) and publicly available cancer consortia (MR). These estimates were compared to World Cancer Research Fund (WCRF) findings. In our observational analyses, the multivariable-adjusted hazard ratios (HRs) for a one standard drink/day increase were 1.06 (95% confidence interval [CI]: 1.04, 1.08) for breast cancer and 1.00 (0.92, 1.08) for EOC, both of which were consistent with previous WCRF findings. MR odds ratios (ORs) per genetically predicted one standard drink/day increase, estimated via 34 SNPs using MR-PRESSO, were 1.00 (0.93, 1.08) for breast cancer and 0.95 (0.85, 1.06) for EOC. Stratification by EOC subtype or estrogen receptor status in breast cancers made no meaningful difference to the results. For breast cancer, the CIs for the genetically derived estimates include the point estimate from observational studies, so they are not inconsistent with a small increase in risk. Our data provide additional evidence that alcohol intake is unlikely to have anything other than a very small effect on risk of EOC.

    The causal relationship between gastro-oesophageal reflux disease and idiopathic pulmonary fibrosis: a bidirectional two-sample Mendelian randomisation study

    Background: Gastro-oesophageal reflux disease (GORD) is associated with idiopathic pulmonary fibrosis (IPF) in observational studies. It is not known if this association arises because GORD causes IPF, because IPF causes GORD, or because of confounding by factors, such as smoking, associated with both GORD and IPF. We used bidirectional Mendelian randomisation (MR), where genetic variants are used as instrumental variables to address issues of confounding and reverse causation, to examine how, if at all, GORD and IPF are causally related. Methods: A bidirectional two-sample MR was performed to estimate the causal effect of GORD on IPF risk and of IPF on GORD risk, using genetic data from the largest GORD (78 707 cases and 288 734 controls) and IPF (4125 cases and 20 464 controls) genome-wide association meta-analyses currently available. Results: GORD increased the risk of IPF, with an OR of 1.6 (95% CI 1.04–2.49; p=0.032). There was no evidence of a causal effect of IPF on the risk of GORD, with an OR of 0.999 (95% CI 0.997–1.000; p=0.245). Conclusions: We found that GORD increases the risk of IPF, but found no evidence that IPF increases the risk of GORD. GORD should be considered in future studies of IPF risk, and interest in it as a potential therapeutic target should be renewed. The mechanisms underlying the effect of GORD on IPF should also be investigated.
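As a reading aid, the reported odds ratio and confidence interval can be checked against the reported p-value. The sketch below back-calculates the standard error on the log-odds scale and a two-sided p-value, assuming a symmetric normal (Wald-type) interval for log(OR). That assumption and the helper name `p_from_or_ci` are introduced here for illustration; the abstract does not state how its interval or p-value were computed.

```python
import math
from statistics import NormalDist

def p_from_or_ci(odds_ratio, ci_low, ci_high, level=0.95):
    """Back-calculate SE(log OR), z and a two-sided p-value from an odds ratio
    and its confidence interval, assuming normality on the log scale."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% CI
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return se, z, p

# Values reported for the effect of GORD on IPF risk.
se, z, p = p_from_or_ci(1.6, 1.04, 2.49)
print(f"SE(log OR) = {se:.3f}, z = {z:.2f}, p = {p:.3f}")
```

For the GORD-to-IPF estimate, the back-calculated p-value lands slightly above the reported 0.032. A small gap of this kind is expected because the published OR and interval limits are rounded.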

    Transcriptomic analysis of mRNA expression and alternative splicing during mouse sex determination

    Mammalian sex determination hinges on sexually dimorphic transcriptional programs in developing fetal gonads. A comprehensive view of these programs is crucial for understanding the normal development of fetal testes and ovaries and the etiology of human disorders of sex development (DSDs), many of which remain unexplained. Using strand-specific RNA-sequencing, we characterized the mouse fetal gonadal transcriptome from 10.5 to 13.5 days post coitum, a key time window in sex determination and gonad development. Our dataset benefits from greater sensitivity, accuracy and dynamic range than microarray studies, allows the global dynamics and sex-specificity of gene expression to be assessed, and provides a window onto non-transcriptional events such as alternative splicing. Spliceomic analysis uncovered female-specific regulation of Lef1 splicing, which may contribute to the enhanced WNT signaling activity in XX gonads. We provide a user-friendly visualization tool for the complete transcriptomic and spliceomic dataset as a resource for the field.