
    Semi-supervised prediction of protein interaction sentences exploiting semantically encoded metrics

    Protein-protein interaction (PPI) identification is an integral component of many biomedical research and database curation tools. Automation of this task through classification is one of the key goals of text mining (TM). However, the labelled PPI corpora required to train classifiers are generally small. In order to overcome this sparsity in the training data, we propose a novel method of integrating corpora that do not contain relevance judgements. Our approach uses a semantic language model to gather word similarity from a large unlabelled corpus. This additional information is integrated into the sentence classification process using kernel transformations and has a re-weighting effect on the training features that leads to an 8% improvement in F-score over the baseline results. Furthermore, we discover that some words which are generally considered indicative of interactions are actually neutralised by this process.
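    The abstract does not specify the semantic language model or the exact kernel transformation, so the following is only a minimal sketch: word similarity is estimated from sentence co-occurrence in a (hypothetical) unlabelled corpus and used to smooth bag-of-words features before training an SVM, which has the described re-weighting effect on training features.

```python
# Illustrative sketch only, not the authors' pipeline: semantic smoothing of
# bag-of-words features with a word-similarity matrix learned from unlabelled text.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

# Hypothetical data: a tiny labelled PPI corpus and a larger unlabelled one.
labelled_sentences = ["protein alpha binds protein beta",
                      "gene xyz is expressed in liver"]
labels = [1, 0]
unlabelled_sentences = ["kinase cdk2 phosphorylates receptor egfr",
                        "enzyme trypsin interacts with its substrate",
                        "the sample was stored at low temperature"]

vec = CountVectorizer()
vec.fit(labelled_sentences + unlabelled_sentences)
X_lab = vec.transform(labelled_sentences).toarray().astype(float)
X_unlab = vec.transform(unlabelled_sentences).toarray().astype(float)

# Word-word similarity from the unlabelled corpus: cosine similarity of each
# word's sentence-occurrence profile stands in for the semantic language model.
word_profiles = normalize(X_unlab.T)          # one row per vocabulary word
S = word_profiles @ word_profiles.T           # |V| x |V| similarity matrix
np.fill_diagonal(S, 1.0)

# Kernel transformation: multiplying the features by S re-weights each training
# feature according to how strongly it is tied to semantically similar words.
X_lab_smoothed = X_lab @ S

clf = SVC(kernel="linear").fit(X_lab_smoothed, labels)
query = vec.transform(["protein gamma associates with protein delta"])
print(clf.predict(query.toarray().astype(float) @ S))
```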

    Deep MMT Transit Survey of the Open Cluster M37 IV: Limit on the Fraction of Stars With Planets as Small as 0.3 R_J

    We present the results of a deep (15 ≲ r ≲ 23), 20-night survey for transiting planets in the intermediate-age open cluster M37 (NGC 2099) using the Megacam wide-field mosaic CCD camera on the 6.5 m MMT. We do not detect any transiting planets among the ~1450 observed cluster members. We do, however, identify a ~1 R_J candidate planet transiting a ~0.8 M_sun Galactic field star with a period of 0.77 days. The source is faint (V = 19.85 mag) and has an expected velocity semi-amplitude of K ~ 220 m/s (M/M_J). We conduct Monte Carlo transit injection and recovery simulations to calculate the 95% confidence upper limit on the fraction of cluster members and field stars with planets as a function of planetary radius and orbital period. Assuming a uniform logarithmic distribution in orbital period, we find that < 1.1%, < 2.7%, and < 8.3% of cluster members have 1.0 R_J planets within the Extremely Hot Jupiter (EHJ, 0.4 < T < 1.0 day), Very Hot Jupiter (VHJ, 1.0 < T < 3.0 days), and Hot Jupiter (HJ, 3.0 < T < 5.0 days) period ranges, respectively. For 0.5 R_J planets the limits are < 3.2% and < 21% for the EHJ and VHJ period ranges, while for 0.35 R_J planets we can only place an upper limit of < 25% on the EHJ period range. For a sample of 7814 Galactic field stars, consisting primarily of FGKM dwarfs, we place 95% upper limits of < 0.3%, < 0.8%, and < 2.7% on the fraction of stars with 1.0 R_J EHJ, VHJ, and HJ planets, assuming the candidate planet is not genuine. If the candidate is genuine, the frequency of ~1.0 R_J planets in the EHJ period range is 0.002% < f_EHJ < 0.5% with 95% confidence. We place limits of < 1.4%, < 8.8%, and < 47% for 0.5 R_J planets, and a limit of < 16% on 0.3 R_J planets in the EHJ period range. This is the first transit survey to place limits on the fraction of stars with planets as small as Neptune.
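    The abstract does not spell out how a null detection is converted into a 95% upper limit, so here is a hedged sketch of the arithmetic under a simple binomial assumption: the injection-recovery simulations give an average detection efficiency per period/radius bin, and the largest planet fraction consistent with zero detections is the one for which a null result is still 5% probable. The star count and efficiency below are placeholders, not the survey's actual values.

```python
# Minimal sketch of a 95% upper limit from a null transit-survey result,
# assuming binomial statistics and one average recovery efficiency per bin.
def upper_limit(n_stars, recovery_eff, confidence=0.95):
    """Largest planet fraction f such that seeing zero detections among
    n_stars is still probable at the (1 - confidence) level:
    (1 - f * recovery_eff) ** n_stars >= 1 - confidence."""
    return (1.0 - (1.0 - confidence) ** (1.0 / n_stars)) / recovery_eff

# Hypothetical inputs: ~1450 cluster members and an assumed 20% chance of
# recovering a 1.0 R_J Extremely Hot Jupiter if present (geometric transit
# probability times pipeline recovery rate, taken from the injections).
print(f"f < {upper_limit(1450, 0.20):.3%}")
```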

    Enabling multi-level relevance feedback on PubMed by integrating rank learning into DBMS

    Background: Finding relevant articles from PubMed is challenging because it is hard to express the user's specific intention in the given query interface, and a keyword query typically retrieves a large number of results. Researchers have applied machine learning techniques to find relevant articles by ranking them according to a learned relevance function. However, the learning and ranking are usually done offline, without being integrated with the keyword queries, and users have to provide a large number of training documents to reach a reasonable learning accuracy. This paper proposes a novel multi-level relevance feedback system for PubMed, called RefMed, which supports both ad hoc keyword queries and multi-level relevance feedback in real time on PubMed. Results: RefMed supports multi-level relevance feedback by using RankSVM as the learning method, and thus achieves higher accuracy with less feedback. RefMed "tightly" integrates RankSVM into the RDBMS to support both keyword queries and multi-level relevance feedback in real time; the tight coupling of RankSVM and the DBMS substantially improves the processing time. An efficient parameter selection method for RankSVM is also proposed, which tunes the RankSVM parameter without performing validation. Thereby, RefMed achieves high learning accuracy in real time without a validation process. RefMed is accessible at http://dm.postech.ac.kr/refmed. Conclusions: RefMed is the first multi-level relevance feedback system for PubMed, and it achieves high accuracy with less feedback. It effectively learns an accurate relevance function from the user's feedback and efficiently processes that function to return relevant articles in real time.
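    The abstract names RankSVM but not its mechanics, so the sketch below shows only the standard pairwise RankSVM idea behind multi-level feedback: graded relevance judgements are converted into pairwise preferences and a linear SVM is trained on the difference vectors. It is not RefMed's in-DBMS implementation, and the feature vectors and grades are made up.

```python
# Illustrative pairwise RankSVM from graded (multi-level) relevance feedback.
import numpy as np
from itertools import combinations
from sklearn.svm import LinearSVC

# Hypothetical feedback: article feature vectors and user grades
# (3 = very relevant, 2 = somewhat, 1 = not relevant).
X = np.array([[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]])
grades = np.array([3, 2, 1, 1])

# Every pair of articles with different grades yields two training examples:
# the difference vector labelled by which article should rank higher.
pairs, prefs = [], []
for i, j in combinations(range(len(X)), 2):
    if grades[i] != grades[j]:
        hi, lo = (i, j) if grades[i] > grades[j] else (j, i)
        pairs.append(X[hi] - X[lo]); prefs.append(1)
        pairs.append(X[lo] - X[hi]); prefs.append(-1)

# No intercept: the ranking depends only on the weight vector.
rank_svm = LinearSVC(C=1.0, fit_intercept=False).fit(np.array(pairs), np.array(prefs))

# The learned weight vector is the relevance function: score new results by X @ w.
w = rank_svm.coef_.ravel()
print(np.argsort(-(X @ w)))   # article indices, most relevant first
```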

    Web Mining for Web Personalization

    Web personalization is the process of customizing a Web site to the needs of specific users, taking advantage of the knowledge acquired from the analysis of the user's navigational behavior (usage data) in correlation with other information collected in the Web context, namely structure, content, and user profile data. Due to the explosive growth of the Web, the domain of Web personalization has gained great momentum in both research and commercial areas. In this article we present a survey of the use of Web mining for Web personalization. More specifically, we introduce the modules that comprise a Web personalization system, emphasizing the Web usage mining module. A review of the most common methods that are used, as well as the technical issues that arise, is given, along with a brief overview of the most popular tools and applications available from software vendors. Moreover, the most important research initiatives in the Web usage mining and personalization areas are presented.
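    As a concrete illustration of the usage-mining module the survey emphasizes, here is a minimal, hypothetical sketch: pages are recommended according to how often they co-occur with the current page in past user sessions. The log data and scoring rule are placeholders, not taken from the article.

```python
# Toy usage-mining recommender: recommend pages that co-occur with the
# current page in past sessions (hypothetical clickstream data).
from collections import defaultdict
from itertools import combinations

sessions = [
    ["/home", "/products", "/products/laptops"],
    ["/home", "/products", "/support"],
    ["/products", "/products/laptops", "/checkout"],
]

# Count how often each pair of pages appears in the same session.
cooccur = defaultdict(int)
for session in sessions:
    for a, b in combinations(set(session), 2):
        cooccur[frozenset((a, b))] += 1

def recommend(current_page, k=2):
    """Pages most often visited in the same session as current_page."""
    scores = {}
    for pair, count in cooccur.items():
        if current_page in pair:
            (other,) = pair - {current_page}
            scores[other] = scores.get(other, 0) + count
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("/products"))
```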

    Notch signaling during human T cell development

    Notch signaling is critical during multiple stages of T cell development in both mouse and human. Evidence has emerged in recent years that this pathway may regulate T-lineage differentiation differently between the two species. Here, we review our current understanding of how Notch signaling is activated and used during human T cell development. First, we set the stage by describing the developmental steps that make up human T cell development, before describing the expression profiles of Notch receptors, ligands, and target genes during this process. To delineate stage-specific roles for Notch signaling during human T cell development, we then interpret the functional Notch studies that have been performed in light of these expression profiles and compare the findings with the pathway's suggested role in the mouse.

    Automated Home-Cage Behavioural Phenotyping of Mice

    Neurobehavioral analysis of mouse phenotypes requires the monitoring of mouse behavior over long periods of time. Here, we describe a trainable computer vision system enabling the automated analysis of complex mouse behaviors. We provide software and an extensive manually annotated video database used for training and testing the system. Our system performs on par with human scoring, as measured against ground-truth manual annotations of thousands of clips of freely behaving mice. As a validation of the system, we characterized the home-cage behaviors of two standard inbred and two non-standard mouse strains. From these data we were able to predict, in a blind test, the strain identity of individual animals with high accuracy. Our video-based software will complement existing sensor-based automated approaches and enable an adaptable, comprehensive, high-throughput, fine-grained, automated analysis of mouse behavior.
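    The abstract does not say how strain identity is predicted from the scored behaviors, so the following is only a plausible sketch: each animal is summarized by the fraction of time spent in each automatically scored behavior, and a standard classifier predicts the strain from that profile. The behavior categories, frequency values, strain names, and choice of classifier are all hypothetical.

```python
# Hypothetical strain prediction from per-animal behaviour frequency profiles.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

behaviours = ["eat", "drink", "groom", "hang", "rear", "rest", "walk"]

# Made-up per-animal behaviour frequencies (rows roughly sum to 1) and strains.
X = np.array([
    [0.10, 0.02, 0.15, 0.05, 0.10, 0.45, 0.13],
    [0.12, 0.03, 0.14, 0.04, 0.11, 0.43, 0.13],
    [0.06, 0.02, 0.25, 0.10, 0.08, 0.35, 0.14],
    [0.05, 0.02, 0.27, 0.11, 0.07, 0.34, 0.14],
])
strains = np.array(["strain_A", "strain_A", "strain_B", "strain_B"])

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, strains)

# "Blind test": predict the strain of a held-out animal from its profile alone.
print(clf.predict([[0.07, 0.02, 0.24, 0.10, 0.09, 0.34, 0.14]]))
```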

    Incorporating rich background knowledge for gene named entity classification and recognition

    Background: Gene named entity classification and recognition are crucial preliminary steps of text mining in the biomedical literature. Machine learning based methods have been used in this area with great success. In most state-of-the-art systems, elaborately designed lexical features, such as words, n-grams, and morphology patterns, have played a central part. However, this type of feature tends to cause extreme sparseness in the feature space. As a result, out-of-vocabulary (OOV) terms in the training data are not modeled well due to lack of information. Results: We propose a general framework for gene named entity representation, called feature coupling generalization (FCG). The basic idea is to generate higher-level features using term frequency and co-occurrence information of highly indicative features in a huge amount of unlabeled data. We examine its performance in a named entity classification task designed to remove non-gene entries from a large dictionary derived from online resources. The results show that the new features generated by FCG outperform lexical features by 5.97 F-score and by 10.85 for OOV terms. Within this framework each extension yields significant improvements, and the sparse lexical features can be transformed into a lower-dimensional and more informative representation. A forward maximum match method based on the refined dictionary produces an F-score of 86.2 on the BioCreative 2 GM test set. We then combined the dictionary with a conditional random field (CRF) based gene mention tagger, achieving an F-score of 89.05, which improves the performance of the CRF-based tagger by 4.46 with little impact on the efficiency of the recognition system. A demo of the NER system is available at http://202.118.75.18:8080/bioner.
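    The abstract describes the FCG idea only at a high level, so the sketch below is a rough, hedged illustration of it: a candidate term's sparse lexical features are replaced by its co-occurrence statistics against a few highly indicative context words, estimated from unlabeled text. The indicator words, sentences, and normalization are invented for illustration.

```python
# Rough sketch of feature coupling generalization (FCG): dense features from
# co-occurrence of a candidate term with indicative words in unlabeled text.
from collections import Counter

indicative = ["gene", "expression", "protein", "buffer"]   # hypothetical indicators

unlabelled_sentences = [
    "BRCA1 gene expression was reduced",
    "expression of the p53 protein increased",
    "samples were washed in PBS buffer",
    "the TP53 gene encodes a tumour suppressor protein",
]

def fcg_features(candidate):
    """Normalized co-occurrence counts of `candidate` with each indicator word."""
    counts = Counter()
    total = 0
    for sent in unlabelled_sentences:
        tokens = sent.lower().split()
        if candidate.lower() in tokens:
            total += 1
            for ind in indicative:
                if ind in tokens:
                    counts[ind] += 1
    return [counts[ind] / total if total else 0.0 for ind in indicative]

# These dense, low-dimensional vectors can then feed any classifier that decides
# whether a dictionary entry is a genuine gene name.
print(fcg_features("TP53"))   # co-occurs with "gene" and "protein"
print(fcg_features("PBS"))    # co-occurs only with "buffer"
```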

    Predicting mostly disordered proteins by using structure-unknown protein data

    BACKGROUND: Predicting intrinsically disordered proteins is important in structural biology because they are thought to carry out various cellular functions even though they have no stable three-dimensional structure. We know the structures of far more ordered proteins than disordered proteins, so the structural distribution of proteins in nature can be inferred to differ from that of proteins whose structures have been determined experimentally. We also know many more protein sequences than protein structures, and many of the known sequences can be expected to be those of disordered proteins. It would therefore be efficient to use the information in structure-unknown proteins in order to avoid training data sparseness. We propose a novel method for predicting which proteins are mostly disordered by using a spectral graph transducer (SGT) and training with a huge amount of structure-unknown sequences as well as structure-known sequences. RESULTS: When the proposed method was evaluated on data that included 82 disordered proteins and 526 ordered proteins, its sensitivity was 0.723 and its specificity was 0.977. It resulted in a Matthews correlation coefficient 0.202 points higher than that obtained using FoldIndex, 0.221 points higher than that obtained using the method based on plotting hydrophobicity against the number of contacts, and 0.07 points higher than that obtained using support vector machines (SVMs). To examine robustness against training data sparseness, we investigated the correlation between two results obtained when the method was trained on different datasets and tested on the same dataset. The correlation coefficient for the proposed method is 0.14 higher than that for the method using SVMs. When the proposed SGT-based method was compared with four per-residue predictors (VL3, GlobPlot, DISOPRED2, and IUPred (long)), its sensitivity was 0.834 for disordered proteins, which is 0.052–0.523 higher than that of the per-residue predictors, and its specificity was 0.991 for ordered proteins, which is 0.036–0.153 higher than that of the per-residue predictors. The proposed method was also evaluated on data that included 417 partially disordered proteins. It predicted the frequency of disordered proteins to be 1.95% for proteins with 5%–10% disordered sequences, 1.46% for proteins with 10%–20% disordered sequences, and 16.57% for proteins with 20%–40% disordered sequences. CONCLUSION: The proposed method, which utilizes the information in structure-unknown data, predicts disordered proteins more accurately than the other methods and is less affected by training data sparseness.
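    To make the transductive setup concrete, here is a hedged sketch in which scikit-learn's LabelSpreading stands in for the spectral graph transducer: structure-unknown sequences enter training as unlabelled graph nodes, and crude amino-acid composition vectors stand in for the paper's (unspecified here) features. Sequences and labels are toy examples.

```python
# Transductive sketch with LabelSpreading as a stand-in for the SGT:
# structure-unknown sequences participate in training as unlabelled nodes.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

AA = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Amino-acid composition vector, a crude hypothetical feature set."""
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AA]

# Toy data: 1 = mostly disordered, 0 = ordered, -1 = structure unknown.
sequences = ["MSEEDDFGGK", "ACDEFGHIKLMNPQRSTVWY", "SSEEDSGSKE", "MKVLAAGIVL"]
labels    = [1,            0,                      -1,           -1]

X = np.array([composition(s) for s in sequences])
model = LabelSpreading(kernel="knn", n_neighbors=2).fit(X, labels)

# Predicted labels for the structure-unknown sequences.
print(model.transduction_[2:])
```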

    Automated Retraining Methods for Document Classification and Their Parameter Tuning

    This paper addresses the problem of semi-supervised classification of document collections using retraining (also called self-training). A possible application is focused Web crawling, which may start with very few manually selected training documents but can be enhanced by automatically adding initially unlabeled, positively classified Web pages for retraining. Such an approach is not robust by itself and faces tuning problems regarding parameters such as the number of selected documents, the number of retraining iterations, and the ratio of positively and negatively classified samples used for retraining. The paper develops methods for automatically tuning these parameters, based on predicting the leave-one-out error for a retrained classifier and preventing the classifier from being diluted by selecting too many or too weakly classified documents for retraining. Our experiments with three different datasets confirm the practical viability of the approach.
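    The abstract names the parameters but not the loop they control, so here is a minimal self-training sketch showing where the number of retraining iterations and the number of documents added per iteration enter; the paper's leave-one-out-based tuning itself is not reproduced, and all data below are made up.

```python
# Minimal self-training (retraining) loop with the tuning-sensitive parameters
# made explicit: iterations and documents added per iteration.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labelled = ["machine learning for text", "deep neural networks",
            "recipe for apple pie", "football match results"]
y = np.array([1, 1, 0, 0])                       # 1 = on-topic, 0 = off-topic
unlabelled = ["support vector machines tutorial", "cooking pasta at home",
              "transformer models for NLP", "latest basketball scores"]

vec = TfidfVectorizer().fit(labelled + unlabelled)
X_lab = vec.transform(labelled).toarray()
X_unlab = vec.transform(unlabelled).toarray()
pool = list(range(len(unlabelled)))

n_iterations, per_iteration = 2, 1               # the parameters to be tuned
for _ in range(n_iterations):
    clf = LogisticRegression().fit(X_lab, y)
    if not pool:
        break
    probs = clf.predict_proba(X_unlab[pool])
    # Add only the most confidently classified documents, limiting dilution of
    # the classifier by weak retraining examples.
    order = np.argsort(-probs.max(axis=1))[:per_iteration]
    chosen = [pool[i] for i in order]
    X_lab = np.vstack([X_lab, X_unlab[chosen]])
    y = np.concatenate([y, probs[order].argmax(axis=1)])
    pool = [p for p in pool if p not in chosen]

final_clf = LogisticRegression().fit(X_lab, y)
print(final_clf.predict(vec.transform(["neural network classifier"]).toarray()))
```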