
    QiSampler: evaluation of scoring schemes for high-throughput datasets using a repetitive sampling strategy on gold standards

    Background: High-throughput biological experiments can produce a large amount of data showing little overlap with current knowledge. This may be a problem when evaluating alternative scoring mechanisms for such data according to a gold standard dataset because standard statistical tests may not be appropriate. Findings: To address this problem we have implemented the QiSampler tool that uses a repetitive sampling strategy to evaluate several scoring schemes or experimental parameters for any type of high-throughput data given a gold standard. We provide two example applications of the tool: selection of the best scoring scheme for a high-throughput protein-protein interaction dataset by comparison to a dataset derived from the literature, and evaluation of functional enrichment in a set of tumour-related differentially expressed genes from a thyroid microarray dataset. Conclusions: QiSampler is implemented as an open source R script and a web server, which can be accessed at http://cbdm.mdc-berlin.de/tools/sampler/.
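The abstract does not spell out the sampling procedure, so the following is only a minimal sketch of what a repetitive sampling comparison could look like. It assumes that each scoring scheme assigns a numeric score to every item, that each round draws a random subsample, and that a scheme is judged by the fraction of gold standard items among its top-scored items. The function names, parameters and the choice of Python are assumptions for illustration, not QiSampler's actual R implementation.

```python
import random
from statistics import mean


def sampled_precision(scores, gold, n_rounds=200, sample_size=20, top_k=5, seed=0):
    """Mean fraction of gold standard items among the top_k scored items,
    averaged over n_rounds random subsamples (hypothetical helper)."""
    rng = random.Random(seed)
    items = sorted(scores)  # fixed item order keeps runs reproducible
    per_round = []
    for _ in range(n_rounds):
        # Draw a subsample without replacement and rank it by score.
        sample = rng.sample(items, sample_size)
        top = sorted(sample, key=lambda item: scores[item], reverse=True)[:top_k]
        per_round.append(sum(item in gold for item in top) / top_k)
    return mean(per_round)


def compare_schemes(schemes, gold, **kwargs):
    """Apply the same sampling settings to every scoring scheme."""
    return {name: sampled_precision(s, gold, **kwargs) for name, s in schemes.items()}
```

A scheme whose sampled precision stays consistently higher across rounds would be preferred under these assumptions. Keeping the per-round values instead of only their mean would also allow the spread to be inspected.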

    PROMPT: a protein mapping and comparison tool

    BACKGROUND: Comparison of large protein datasets has become a standard task in bioinformatics. Typically researchers wish to know whether one group of proteins is enriched in certain annotation attributes or sequence properties compared to another group, and whether this enrichment is statistically significant. To conduct such comparisons it is often necessary to integrate molecular sequence data and experimental information from disparate, incompatible sources. While many specialized programs exist for comparisons of this kind in individual problem domains, such as expression data analysis, no generic software solution capable of addressing a wide spectrum of routine tasks in comparative proteomics is currently available. RESULTS: PROMPT is a comprehensive bioinformatics software environment which enables the user to compare arbitrary protein sequence sets, revealing statistically significant differences in their annotation features. It allows automatic retrieval and integration of data from a multitude of molecular biological databases as well as from a custom XML format. Similarity-based mapping of sequence IDs makes it possible to link experimental information obtained from different sources despite discrepancies in gene identifiers and minor sequence variation. PROMPT provides a full set of statistical procedures to address the following four use cases: i) comparison of the frequencies of categorical annotations between two sets, ii) enrichment of nominal features in one set with respect to another one, iii) comparison of numeric distributions, and iv) correlation of numeric variables. Analysis results can be visualized in the form of plots and spreadsheets and exported in various formats, including Microsoft Excel. CONCLUSION: PROMPT is a versatile, platform-independent, easily expandable, stand-alone application designed to be a practical workhorse in analysing and mining protein sequences and associated annotation. The availability of the Java Application Programming Interface and scripting capabilities on one hand, and the intuitive Graphical User Interface with context-sensitive help system on the other, make it equally accessible to professional bioinformaticians and biologically-oriented users. PROMPT is freely available for academic users from
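Use case ii above (enrichment of a nominal feature in one set relative to another) is commonly tested with a one-sided Fisher's exact test on a 2x2 table. The abstract does not name the test PROMPT uses, so the sketch below is an assumption, and the function name and argument layout are hypothetical.

```python
from math import comb


def enrichment_p_value(a, b, c, d):
    """One-sided Fisher's exact test for a 2x2 table (hypothetical helper).

    a: proteins in set 1 with the feature      b: set 1 without it
    c: proteins in set 2 with the feature      d: set 2 without it
    Returns P(X >= a) under the hypergeometric null of no enrichment.
    """
    total = a + b + c + d
    with_feature = a + c
    set1_size = a + b
    denom = comb(total, set1_size)
    upper = min(with_feature, set1_size)
    # Sum the probabilities of every table at least as extreme as the observed one.
    tail = sum(
        comb(with_feature, k) * comb(total - with_feature, set1_size - k)
        for k in range(a, upper + 1)
    )
    return tail / denom
```

For example, `enrichment_p_value(3, 1, 1, 3)` evaluates the case in which 3 of 4 proteins in the first set and 1 of 4 in the second carry the annotation. With many annotation terms tested at once, a multiple-testing correction would normally follow.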

    SProtP: A Web Server to Recognize Those Short-Lived Proteins Based on Sequence-Derived Features in Human Cells

    Protein turnover metabolism plays important roles in cell cycle progression, signal transduction, and differentiation. Proteins with short half-lives are involved in various regulatory processes. To better understand the regulation of cellular processes, it is important to study the key sequence-derived factors affecting short-lived protein degradation. Most protein half-lives are still unknown because of the difficulty of measuring them in human cells with traditional experimental methods. To investigate the molecular determinants that affect short-lived proteins, this work proposes a computational method that recognizes short-lived proteins in human cells from sequence-derived features. We systematically analyzed many features that may correlate with short-lived protein degradation and found that a large fraction of human proteins with signal peptides and transmembrane regions have short half-lives. Because short-lived proteins play pivotal roles in the control of various cellular processes, we constructed an SVM-based classifier to recognize them. Applying the SVM model to the human dataset, we achieved 80.8% average sensitivity and 79.8% average specificity on ten testing datasets (TE1-TE10). We also obtained average accuracies of 89.9%, 99% and 83.9% on the independent validation datasets iTE1, iTE2 and iTE3, respectively. The approach proposed in this paper provides a valuable alternative for recognizing short-lived proteins in human cells and is more accurate than the traditional N-end rule. Furthermore, the web server SProtP (http://reprod.njmu.edu.cn/sprotp) has been developed and is freely available for users.
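For readers unfamiliar with the reported metrics, here is a minimal sketch of how sensitivity and specificity are computed per test set and then averaged, as the abstract describes for TE1-TE10. The labels are assumed to be encoded as 1 (short-lived) and 0 (not short-lived), and the function names and data layout are illustrative assumptions, not the authors' code.

```python
from statistics import mean


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = short-lived."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity: share of true short-lived proteins that were recognized.
    # Specificity: share of other proteins that were correctly rejected.
    return tp / (tp + fn), tn / (tn + fp)


def average_over_test_sets(test_sets):
    """Average both metrics over several (y_true, y_pred) pairs."""
    pairs = [sensitivity_specificity(t, p) for t, p in test_sets]
    return mean(s for s, _ in pairs), mean(c for _, c in pairs)
```

Each test set must contain at least one positive and one negative example, otherwise the corresponding ratio is undefined.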

    Network Compression as a Quality Measure for Protein Interaction Networks

    With the advent of large-scale protein interaction studies, there is much debate about data quality. Can different noise levels in the measurements be assessed by analyzing network structure? Because proteomic regulation is inherently co-operative, modular and redundant, it is inherently compressible when represented as a network. Here we propose that network compression can be used to compare false positive and false negative noise levels in protein interaction networks. We validate this hypothesis by first confirming the detrimental effect of false positives and false negatives. Second, we show that gold standard networks are more compressible. Third, we show that compressibility correlates with co-expression, co-localization, and shared function. Fourth, we also observe correlation with better protein tagging methods, physiological expression in contrast to over-expression of tagged proteins, and smart pooling approaches for yeast two-hybrid screens. Overall, this new measure is a proxy for both sensitivity and specificity and gives complementary information to standard measures such as average degree and clustering coefficients

    Network-Based Prediction and Analysis of HIV Dependency Factors

    HIV Dependency Factors (HDFs) are a class of human proteins that are essential for HIV replication, but are not lethal to the host cell when silenced. Three previous genome-wide RNAi experiments identified HDF sets with little overlap. We combine data from these three studies with a human protein interaction network to predict new HDFs, using an intuitive algorithm called SinkSource and four other algorithms published in the literature. Our algorithm achieves high precision and recall upon cross validation, as do the other methods. A number of HDFs that we predict are known to interact with HIV proteins. They belong to multiple protein complexes and biological processes that are known to be manipulated by HIV. We also demonstrate that many predicted HDF genes show significantly different programs of expression in early response to SIV infection in two non-human primate species that differ in AIDS progression. Our results suggest that many HDFs are yet to be discovered and that they have potential value as prognostic markers to determine pathological outcome and the likelihood of AIDS development. More generally, if multiple genome-wide gene-level studies have been performed at independent labs to study the same biological system or phenomenon, our methodology is applicable to interpret these studies simultaneously in the context of molecular interaction networks and to ask if they reinforce or contradict each other

    The Drosophila speciation factor HMR localizes to genomic insulator sites

    Hybrid incompatibility between Drosophila melanogaster and D. simulans is caused by a lethal interaction of the proteins encoded by the Hmr and Lhr genes. In D. melanogaster the loss of HMR results in mitotic defects, an increase in transcription of transposable elements and a deregulation of heterochromatic genes. To better understand the molecular mechanisms that mediate HMR's function, we measured genome-wide localization of HMR in D. melanogaster tissue culture cells by chromatin immunoprecipitation. Interestingly, we find HMR localizing to genomic insulator sites that can be classified into two groups. One group belongs to gypsy insulators and another one borders HP1a bound regions at active genes. The transcription of the latter group genes is strongly affected in larvae and ovaries of Hmr mutant flies. Our data suggest a novel link between HMR and insulator proteins, a finding that implicates a potential role for genome organization in the formation of species

    Negated bio-events: Analysis and identification

    Background: Negation occurs frequently in scientific literature, especially in biomedical literature. It has previously been reported that around 13% of sentences found in biomedical research articles contain negation. Historically, the main motivation for identifying negated events has been to ensure their exclusion from lists of extracted interactions. However, recently, there has been a growing interest in negative results, which has resulted in negation detection being identified as a key challenge in biomedical relation extraction. In this article, we focus on the problem of identifying negated bio-events, given gold standard event annotations. Results: We have conducted a detailed analysis of three open access bio-event corpora containing negation information (i.e., GENIA Event, BioInfer and BioNLP'09 ST), and have identified the main types of negated bio-events. We have analysed the key aspects of a machine learning solution to the problem of detecting negated events, including selection of negation cues, feature engineering and the choice of learning algorithm. Combining the best solutions for each aspect of the problem, we propose a novel framework for the identification of negated bio-events. We have evaluated our system on each of the three open access corpora mentioned above. The performance of the system significantly surpasses the best results previously reported on the BioNLP'09 ST corpus, and achieves even better results on the GENIA Event and BioInfer corpora, both of which contain more varied and complex events. Conclusions: Recently, in the field of biomedical text mining, the development and enhancement of event-based systems has received significant interest. The ability to identify negated events is a key performance element for these systems. We have conducted the first detailed study on the analysis and identification of negated bio-events. Our proposed framework can be integrated with state-of-the-art event extraction systems. The resulting systems will be able to extract bio-events with attached polarities from textual documents, which can serve as the foundation for more elaborate systems that are able to detect mutually contradicting bio-events. © 2013 Nawaz et al.; licensee BioMed Central Ltd

    The geographic differences in technological exclusion in Poland

    The evolution into an information society is much slower in Poland than in the "old fifteen" member states of the European Union. As in other countries, this evolution is accompanied by some negative factors, the most important being different types of technological exclusion, especially the digital divide. Its essence is the distinction between people who own and use technologies that enable efficient information access and people who are not capable of using them or are unable to do so. The paper presents results on the formation of areas of technological exclusion across the territory of Poland during the period 1994-2008. The results come from a preliminary stage of the research, whose aim was to compare the availability of selected technologies associated with digital exclusion in each voivodship. The primary data source was the household budget survey (BGD) database, compiled annually and made available for a fee since 1993 by the Central Statistical Office (GUS). It contains data on more than 30 thousand households and about 100 thousand persons belonging to them for each surveyed year. Over the analysed period, an upward trend was observed in Poland for most of the technologies studied. When households were examined for ownership of a mobile phone, a personal computer and an internet connection, a relationship was visible between access to these technologies and the growth in the percentage of households equipped with each of them.