134 research outputs found

    Advanced learning algorithms for cross-language patent retrieval and classification

    We study several machine learning algorithms for cross-language patent retrieval and classification. In contrast to most other studies applying machine learning to cross-language information retrieval, which mainly used learning techniques for monolingual sub-tasks, our learning algorithms exploit bilingual training documents and learn a semantic representation from them. We study Japanese-English cross-language patent retrieval using Kernel Canonical Correlation Analysis (KCCA), a method of correlating linear relationships between two variables in kernel-defined feature spaces. The results are quite encouraging and are significantly better than those obtained by other state-of-the-art methods. We also investigate learning algorithms for cross-language document classification. The learning algorithms are based on KCCA and Support Vector Machines (SVM). In particular, we study two ways of combining KCCA and SVM and find that one particular combination, called SVM_2k, achieves better results than other learning algorithms for either bilingual or monolingual test documents.
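
For orientation, one commonly used regularised form of the KCCA objective is sketched below. This is an assumption for illustration: the abstract does not say which variant the authors used. The symbols are not from the abstract: K_x and K_y stand for kernel matrices computed over the paired Japanese and English training documents, alpha and beta are the dual weight vectors being learned, and kappa is a regularisation parameter.

```latex
\rho = \max_{\alpha,\beta}
\frac{\alpha^{\top} K_x K_y \beta}
{\sqrt{\left(\alpha^{\top} K_x^{2} \alpha + \kappa\, \alpha^{\top} K_x \alpha\right)
       \left(\beta^{\top} K_y^{2} \beta + \kappa\, \beta^{\top} K_y \beta\right)}}
```

Under this reading, a query in one language and a document in the other are each projected into the shared space through their kernel evaluations against the training documents, and retrieval ranks documents by similarity in that space.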

    A comparison of massively parallel nucleotide sequencing with oligonucleotide microarrays for global transcription profiling

    Background: RNA-Seq exploits the rapid generation of gigabases of sequence data by Massively Parallel Nucleotide Sequencing, allowing for the mapping and digital quantification of whole transcriptomes. Whilst previous comparisons between RNA-Seq and microarrays have been performed at the level of gene expression, in this study we adopt a more fine-grained approach. Using RNA samples from a normal human breast epithelial cell line (MCF-10a) and a breast cancer cell line (MCF-7), we present a comprehensive comparison between RNA-Seq data generated on the Applied Biosystems SOLiD platform and data from Affymetrix Exon 1.0ST arrays. The use of Exon arrays makes it possible to assess the performance of RNA-Seq in two key areas: detection of expression at the granularity of individual exons, and discovery of transcription outside annotated loci. Results: We found a high degree of correspondence between the two platforms in terms of exon-level fold changes and detection. For example, over 80% of exons detected as expressed in RNA-Seq were also detected on the Exon array, and 91% of exons flagged as changing from Absent to Present on at least one platform had fold-changes in the same direction. The greatest detection correspondence was seen when the read count threshold at which to flag exons Absent in the SOLiD data was set to t < 1, suggesting that the background error rate is extremely low in RNA-Seq. We also found RNA-Seq more sensitive to detecting differentially expressed exons than the Exon array, reflecting the wider dynamic range achievable on the SOLiD platform.
In addition, we find significant evidence of novel protein coding regions outside known exons, 93% of which map to Exon array probesets, and are able to infer the presence of thousands of novel transcripts through the detection of previously unreported exon-exon junctions. Conclusions: By focusing on exon-level expression, we present the most fine-grained comparison between RNA-Seq and microarrays to date. Overall, our study demonstrates that data from a SOLiD RNA-Seq experiment are sufficient to generate results comparable to those produced from Affymetrix Exon arrays, even using only a single replicate from each platform, and when presented with a large genome.

    Open chromatin profiling identifies AP1 as a transcriptional regulator in oesophageal adenocarcinoma.

    Oesophageal adenocarcinoma (OAC) is one of the ten most prevalent forms of cancer and is showing a rapid increase in incidence and yet exhibits poor survival rates. Compared to many other common cancers, the molecular changes that occur in this disease are relatively poorly understood. However, genes encoding chromatin remodeling enzymes are frequently mutated in OAC. This is consistent with the emerging concept that cancer cells exhibit reprogramming of their chromatin environment which leads to subsequent changes in their transcriptional profile. Here, we have used ATAC-seq to interrogate the chromatin changes that occur in OAC using both cell lines and patient-derived material. We demonstrate that there are substantial changes in the regulatory chromatin environment in the cancer cells and using this data we have uncovered an important role for ETS and AP1 transcription factors in driving the changes in gene expression found in OAC cells. Our work received funding from the Wellcome Trust (https://wellcome.ac.uk/), the National Institute for Health Research (https://www.nihr.ac.uk/) and Cancer Research UK (http://www.cancerresearchuk.org/).

    GFI1 proteins regulate stem cell formation in the AGM

    In vertebrates, the first haematopoietic stem cells (HSCs) with multi-lineage and long-term repopulating potential arise in the AGM (aorta-gonad-mesonephros) region. These HSCs are generated from a rare and transient subset of endothelial cells, called haemogenic endothelium (HE), through an endothelial-to-haematopoietic transition (EHT). Here, we establish the absolute requirement of the transcriptional repressors GFI1 and GFI1B (growth factor independence 1 and 1B) in this unique trans-differentiation process. We first demonstrate that Gfi1 expression specifically defines the rare population of HE that generates emerging HSCs. We further establish that in the absence of GFI1 proteins, HSCs and haematopoietic progenitor cells are not produced in the AGM, revealing the critical requirement for GFI1 proteins in intra-embryonic EHT. Finally, we demonstrate that GFI1 proteins recruit the chromatin-modifying protein LSD1, a member of the CoREST repressive complex, to epigenetically silence the endothelial program in HE and allow the emergence of blood cells. We thank the staff at the Advanced Imaging, animal facility, Molecular Biology Core facilities and Flow Cytometry of CRUK Manchester Institute for technical support and Michael Lie-A-Ling and Elli Marinopoulou for initiating the DamID-PIP bioinformatics project. We thank members of the Stem Cell Biology group, the Stem Cell Haematopoiesis groups and Martin Gering for valuable advice and critical reading of the manuscript. Work in our laboratory is supported by the Leukaemia and Lymphoma Research Foundation (LLR), Cancer Research UK (CRUK) and the Biotechnology and Biological Sciences Research Council (BBSRC). SC is the recipient of an MRC senior fellowship (MR/J009202/1). This is the author accepted manuscript. The final version is available from NPG via http://dx.doi.org/10.1038/ncb327

    Using Prior Information from the Medical Literature in GWAS of Oral Cancer Identifies Novel Susceptibility Variant on Chromosome 4 - the AdAPT Method

    Background: Genome-wide association studies (GWAS) require large sample sizes to obtain adequate statistical power, but it may be possible to increase the power by incorporating complementary data. In this study we investigated the feasibility of automatically retrieving information from the medical literature and leveraging this information in GWAS. Methods: We developed a method that searches through PubMed abstracts for pre-assigned keywords and key concepts, and uses this information to assign prior probabilities of association for each single nucleotide polymorphism (SNP) with the phenotype of interest - the Adjusting Association Priors with Text (AdAPT) method. Association results from a GWAS can subsequently be ranked in the context of these priors using the Bayes False Discovery Probability (BFDP) framework. We initially tested AdAPT by comparing rankings of known susceptibility alleles in a previous lung cancer GWAS, and subsequently applied it in a two-phase GWAS of oral cancer. Results: Known lung cancer susceptibility SNPs were consistently ranked higher by AdAPT BFDPs than by p-values. In the oral cancer GWAS, we sought to replicate the top five SNPs as ranked by AdAPT BFDPs, of which rs991316, located in the ADH gene region of 4q23, displayed a statistically significant association with oral cancer risk in the replication phase (per-rare-allele log additive p-value [p(trend)] = 2.5 × 10^-3). The combined OR for having one additional rare allele was 0.83 (95% CI: 0.76-0.90), and this association was independent of previously identified susceptibility SNPs that are associated with overall UADT cancer in this gene region. We also investigated if rs991316 was associated with other cancers of the upper aerodigestive tract (UADT), but no additional association signal was found.
Conclusion: This study highlights the potential utility of systematically incorporating prior knowledge from the medical literature in genome-wide analyses using the AdAPT methodology. AdAPT is available online (url: http://services.gate.ac.uk/lld/gwas/service/config)
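
To make the role of the prior concrete, the following minimal sketch computes a BFDP from an effect estimate using Wakefield's approximate Bayes factor. It is an illustration under stated assumptions, not the AdAPT implementation. The function names, the prior standard deviation on the log odds ratio, and the two prior probabilities of association are all hypothetical. Only the odds ratio and confidence interval are taken from the abstract above.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Standard error of a log odds ratio recovered from its 95% CI bounds."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def approx_bayes_factor(beta_hat, se, prior_sd):
    """Wakefield's approximate Bayes factor for H0 (no effect) versus H1.

    beta_hat: estimated log odds ratio
    se:       its standard error
    prior_sd: standard deviation of the normal prior on the effect under H1
    """
    v = se ** 2
    w = prior_sd ** 2
    z_squared = (beta_hat / se) ** 2
    return math.sqrt((v + w) / v) * math.exp(-0.5 * z_squared * w / (v + w))


def bfdp(beta_hat, se, prior_assoc, prior_sd):
    """Bayes False Discovery Probability given a prior probability of association."""
    abf = approx_bayes_factor(beta_hat, se, prior_sd)
    prior_odds_null = (1.0 - prior_assoc) / prior_assoc
    weighted = abf * prior_odds_null
    return weighted / (1.0 + weighted)


# Effect estimate from the abstract: OR 0.83 (95% CI: 0.76-0.90).
beta_hat = math.log(0.83)
se = se_from_ci(0.76, 0.90)

# Assumption: 95% of true odds ratios lie within a factor of 1.5 of 1.
prior_sd = math.log(1.5) / 1.96

# Hypothetical priors: a flat genome-wide prior versus one raised by literature evidence.
bfdp_flat = bfdp(beta_hat, se, prior_assoc=1e-4, prior_sd=prior_sd)
bfdp_literature = bfdp(beta_hat, se, prior_assoc=1e-2, prior_sd=prior_sd)

print(f"BFDP with flat prior:       {bfdp_flat:.3f}")
print(f"BFDP with literature prior: {bfdp_literature:.3f}")
```

With the same association evidence, a SNP given a higher prior probability of association receives a lower BFDP and therefore ranks higher, which is the mechanism the abstract describes for re-ranking GWAS results.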

    Using KCCA for Japanese-English cross-language information retrieval and classification

    Kernel Canonical Correlation Analysis (KCCA) is a method of correlating linear relationships between two multidimensional variables in feature space. We applied KCCA to Japanese-English cross-language information retrieval and classification. The results were encouraging.

    Perceptron-like learning for ontology based information extraction

    Recent work on ontology-based Information Extraction (IE) has tried to make use of knowledge from the target ontology in order to improve semantic annotation results. However, very few approaches exploit the ontology structure itself, and those that do have some limitations. This paper introduces a hierarchical learning approach for IE, which uses the target ontology as an essential part of the extraction process by taking into account the relations between concepts. The approach is evaluated on the largest available semantically annotated corpus. The results clearly demonstrate the benefits of using knowledge from the ontology as input to the information extraction process. We also demonstrate the advantages of our approach over other state-of-the-art learning systems on a commonly used benchmark dataset.