17 research outputs found

    On InChI and evaluating the quality of cross-reference links

    BACKGROUND: There are many databases of small molecules, each focused on different aspects of research and its applications. Some tasks require integrating information from several of them. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in the databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones. RESULTS: We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by unclear interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the rate of InChI inconsistencies in the manually curated links is very high (28.85% in the worst case). Even under a weaker definition of consistency, the measured inconsistency rates were generally very high. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links. CONCLUSIONS: We observed several problems with the InChI tools and with the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links when they are measured using InChI identifiers. However, inconsistency can be caused both by errors in the manually curated links and by the inherent limitations of the InChI method.
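The comparison described in this abstract can be illustrated with a minimal sketch. It assumes that an automatically generated link is simply a pair of entries with identical InChI strings, that inconsistency is the share of manual links not confirmed by such a match, and that completeness is the share of automatic links also present among the manual ones. These definitions, the function names, and the entry IDs are assumptions made for illustration and are not taken from the paper.

```python
def auto_links(db_a, db_b):
    """Return pairs (id_a, id_b) whose InChI strings are identical."""
    by_inchi = {}
    for id_b, inchi in db_b.items():
        by_inchi.setdefault(inchi, []).append(id_b)
    return {
        (id_a, id_b)
        for id_a, inchi in db_a.items()
        for id_b in by_inchi.get(inchi, [])
    }


def link_quality(manual, db_a, db_b):
    """Return (inconsistency, completeness) of manual links (assumed definitions)."""
    auto = auto_links(db_a, db_b)
    # Manual links whose two entries do not share an InChI string.
    unconfirmed = {link for link in manual if link not in auto}
    inconsistency = len(unconfirmed) / len(manual) if manual else 0.0
    # Share of InChI-derived links that curators also recorded.
    completeness = len(manual & auto) / len(auto) if auto else 1.0
    return inconsistency, completeness


WATER = "InChI=1S/H2O/h1H2"
METHANE = "InChI=1S/CH4/h1H4"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

# Hypothetical entry IDs in two hypothetical databases.
db_a = {"A1": WATER, "A2": METHANE, "A3": ETHANOL}
db_b = {"B1": WATER, "B2": ETHANOL, "B3": METHANE}
manual = {("A1", "B1"), ("A2", "B2")}  # the second link is wrong on purpose

inconsistency, completeness = link_quality(manual, db_a, db_b)
print(inconsistency, completeness)
```

In this toy data one of the two manual links is unconfirmed and only one of the three InChI-derived links was curated. Exact string equality is the simplest possible matching rule; a weaker notion of consistency, as mentioned in the abstract, would need a looser comparison that is not sketched here.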

    Protein Sequences Identification using NM-tree

    We have generalized a method for tandem mass spectra interpretation based on the parameterized Hausdorff distance d_HP. Instead of just peptides (short pieces of proteins), in this paper we describe the interpretation of whole protein sequences. For this purpose, we employ the recently introduced NM-tree to index the database of hypothetical mass spectra for exact or fast approximate search. The NM-tree combines the M-tree with the TriGen algorithm in a way that allows the retrieval precision to be controlled dynamically at query time. A scheme for protein sequence identification using the NM-tree is proposed.
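The abstract does not define the parameterized Hausdorff distance, so the sketch below shows only the classical (unparameterized) Hausdorff distance between two peak lists, as a rough idea of the kind of measure involved. Treating a spectrum as a plain list of peak masses, and the sample values themselves, are assumptions made for illustration.

```python
def directed_hausdorff(a, b):
    """Largest distance from a peak in `a` to its nearest peak in `b`."""
    return max(min(abs(x - y) for y in b) for x in a)


def hausdorff(a, b):
    """Classical symmetric Hausdorff distance between two peak lists."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Hypothetical peak masses of a query spectrum and a database spectrum.
query = [100.0, 200.0, 300.0]
candidate = [101.0, 205.0]

print(directed_hausdorff(query, candidate))  # query peak 300.0 has no close match
print(directed_hausdorff(candidate, query))
print(hausdorff(query, candidate))
```

The symmetric distance is dominated by the single unmatched peak, which shows how sensitive the classical form is to outliers.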

    Podobnostní vyhledávání v databázích proteinových struktur (Similarity Search in Protein Structure Databases)

    Proteins are among the most important biopolymers and perform a wide range of functions in living organisms. Their great functional diversity is made possible by their ability to fold into various 3D structures. Moreover, proteins with similar structures have been shown to often share other properties as well (e.g., biological function or evolutionary origin). This is why protein structures and methods for identifying their similarities are so widely studied. In this thesis, we introduce a system for similarity search in protein structure databases. Given a query structure, the system retrieves all database structures that are structurally similar to it. The system has several key components. We have introduced a novel similarity measure that assigns similarity scores to pairs of protein structures. We have designed a dedicated access method for this measure, based on the LAESA metric access method. The access method finds similar structures much faster than a sequential scan of the database. To achieve a further speedup, both the measure and the access method have been parallelized, resulting in almost linear speedup with respect to the number of available cores. The last component is a web user interface that accepts a query structure and presents a list of...
    Department of Software Engineering, Faculty of Mathematics and Physics
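The idea behind a LAESA-style access method can be sketched as pivot-based filtering: distances from a few pivots to every database object are precomputed, and at query time the triangle inequality gives a lower bound that rules out many objects without computing their real distance. The sketch below is a simplified range query only. The Euclidean distance stands in for the thesis's similarity measure, which is not reproduced here, and the class name, points, and pivot choice are all hypothetical.

```python
import math


class PivotIndex:
    """Simplified LAESA-style index: a table of pivot-to-object distances."""

    def __init__(self, objects, pivots, dist):
        self.objects = list(objects)
        self.pivots = list(pivots)
        self.dist = dist
        # Precomputed once at build time: one row of pivot distances per object.
        self.table = [[dist(p, o) for p in self.pivots] for o in self.objects]

    def range_query(self, query, radius):
        """Return (hits, number of real distance computations on objects)."""
        q_to_p = [self.dist(query, p) for p in self.pivots]
        hits, computed = [], 0
        for obj, row in zip(self.objects, self.table):
            # Triangle inequality: |d(q,p) - d(p,o)| <= d(q,o) for every pivot p.
            lower = max((abs(qp - po) for qp, po in zip(q_to_p, row)), default=0.0)
            if lower > radius:
                continue  # cannot be within the radius, skip the real distance
            computed += 1
            if self.dist(query, obj) <= radius:
                hits.append(obj)
        return hits, computed


# Hypothetical 2D points standing in for protein structures.
points = [(0, 0), (1, 0), (5, 5), (9, 9), (10, 10)]
index = PivotIndex(points, pivots=[(0, 0), (10, 10)], dist=math.dist)
hits, computed = index.range_query((0.5, 0), radius=1.0)
print(hits, computed)
```

The filtering is only valid when the distance satisfies the triangle inequality, which is why such methods are described as metric access methods. The saving matters most when a single distance computation is expensive, as is typically the case for structural comparisons.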

    Query languages for the Semantic web

    The idea of the Semantic Web brings new requirements, such as creating and storing metadata about documents and resources. The ability to search such metadata is also needed. Unfortunately, existing query languages for the Semantic Web are either too weak or have complicated syntax or semantics. The aim of this thesis is to compare existing Semantic Web query languages and to propose a new one with regard to its expressive power. The comparison juxtaposes their approaches to various querying issues, for example basic data selection, selection of data with a recursively defined structure, data creation, and the handling of blank nodes. Based on this comparison, the Tequila language is proposed. Tequila is based on named patterns and provides general recursion. The thesis also shows how to use Tequila and compares it with other query languages.

    Similarity Search in Protein Structure Databases

    Proteins are among the most important biopolymers and perform a wide range of functions in living organisms. Their great functional diversity is made possible by their ability to fold into various 3D structures. Moreover, proteins with similar structures have been shown to often share other properties as well (e.g., biological function or evolutionary origin). This is why protein structures and methods for identifying their similarities are so widely studied. In this thesis, we introduce a system for similarity search in protein structure databases. Given a query structure, the system retrieves all database structures that are similar to it. The system has several key components. We have introduced a novel similarity measure that assigns similarity scores to pairs of protein structures. We have designed a dedicated access method based on LAESA metric indexing that uses the proposed measure. The access method finds similar structures more efficiently than a sequential scan of the database. To achieve a further speedup, the measure and the access method have been parallelized, resulting in almost linear speedup with respect to the number of available cores. The last component is a web user interface that accepts a query structure and presents a list of...

    A comparison of approaches to accessing existing biological and chemical relational databases via SPARQL

    Current biological and chemical research is increasingly dependent on the reusability of previously acquired data, which typically come from various sources. Consequently, there is a growing need for database systems, and the databases stored in them, to be interoperable. One possible solution is to use systems based on Semantic Web technologies, namely the Resource Description Framework (RDF) to express data and the SPARQL query language to retrieve them. Many existing biological and chemical databases are stored in the form of a relational database (RDB). Converting a relational database into RDF and storing it in a native RDF database system may not be desirable in many cases. It may be necessary to preserve the original database form, and having two versions of the same data may not be convenient. A solution may be to use a system that maps the relational database to RDF. Such a system keeps the data in their original relational form and translates incoming SPARQL queries into equivalent SQL queries, which are evaluated by a relational database system. This review compares different RDB-to-RDF mapping systems, with a primary focus on those that can be used free of charge. In addition, it compares different approaches to expressing RDB-to-RDF mappings. The review shows that these systems represent a viable method providing sufficient performance. Their real-life performance is demonstrated on data and queries coming from the neXtProt project.
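The query translation described in this abstract can be illustrated with a toy sketch that turns one triple pattern of the form `?s <predicate> ?o` into SQL and runs it with Python's built-in sqlite3 module. The mapping format, the `protein` table, and the `http://example.org/name` predicate are hypothetical; real mapping systems, including those compared in the review, handle joins, IRI templates, datatypes, and far more of SPARQL than this.

```python
import sqlite3

# Hypothetical mapping: predicate IRI -> (table, subject column, object column).
MAPPING = {
    "http://example.org/name": ("protein", "id", "name"),
}


def triple_pattern_to_sql(predicate):
    """Translate the pattern `?s <predicate> ?o` into a SQL query string."""
    table, subj_col, obj_col = MAPPING[predicate]
    # A NULL cell yields no triple, so such rows are filtered out.
    return (
        f"SELECT {subj_col}, {obj_col} FROM {table} "
        f"WHERE {obj_col} IS NOT NULL ORDER BY {subj_col}"
    )


def run_pattern(conn, predicate):
    """Evaluate the translated query on the relational data, left in place."""
    return conn.execute(triple_pattern_to_sql(predicate)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO protein VALUES (?, ?)",
    [("P1", "insulin"), ("P2", None), ("P3", "albumin")],
)

rows = run_pattern(conn, "http://example.org/name")
print(rows)
```

The data never leave the relational database, which is the point made in the abstract: only the query is rewritten, so there is no second copy of the data to keep in sync.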