17 research outputs found

On InChI and evaluating the quality of cross-reference links

Authors: Jakub Galgonek, Jiří Vondrášek
Publication venue: Springer Nature
Publication date: 01/01/2014

BACKGROUND: There are many databases of small molecules, each focused on different aspects of research and its applications. Some tasks require integrating information from several of these databases. However, determining which entries from different databases represent the same compound is not straightforward. Integration can be based, for example, on automatically generated cross-reference links between entries. Another approach is to use the manually curated links stored directly in the databases. This study employs well-established InChI identifiers to measure the consistency and completeness of the manually curated links by comparing them with the automatically generated ones. RESULTS: We used two different tools to generate InChI identifiers and observed some ambiguities in their outputs. In part, these ambiguities were caused by unclear interpretation of the structural data used. InChI identifiers were used successfully to find duplicate entries in databases. We found that the InChI inconsistencies in the manually curated links are very high (28.85% in the worst case). Even under a weaker definition of consistency, the measured values were very high in general. The completeness of the manually curated links was also very poor (only 93.8% in the best case) compared with that of the automatically generated links. CONCLUSIONS: We observed several problems with the InChI tools and the files used as their inputs. There are large gaps in the consistency and completeness of manually curated links when they are measured using InChI identifiers. However, inconsistency can be caused both by errors in manually curated links and by the inherent limitations of the InChI method.
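The abstract mentions using InChI identifiers to find duplicate entries in a database. A minimal sketch of that idea, assuming each entry is already paired with an InChI string, is shown below. The entry IDs and the `find_duplicates` helper are hypothetical and are not taken from the paper; the InChI strings are the standard ones for ethanol and water.

```python
from collections import defaultdict

def find_duplicates(entries):
    """Group entry IDs by InChI and keep only groups with more than one ID."""
    by_inchi = defaultdict(list)
    for entry_id, inchi in entries:
        by_inchi[inchi].append(entry_id)
    return {inchi: ids for inchi, ids in by_inchi.items() if len(ids) > 1}

# Hypothetical database entries: (entry ID, standard InChI string)
entries = [
    ("DB-001", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),  # ethanol
    ("DB-002", "InChI=1S/H2O/h1H2"),                  # water
    ("DB-003", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),  # ethanol again
]

duplicates = find_duplicates(entries)
for inchi, ids in duplicates.items():
    print(inchi, ids)
```

Exact string matching only works if all identifiers were generated consistently; the abstract itself notes that different tools can produce ambiguous outputs.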

Springer - Publisher Connector

PubMed Central

SProt: sphere-based protein structure similarity algorithm

Authors: Jakub Galgonek, David Hoksza, Tomáš Skopal
Publication venue: BioMed Central
Publication date: 01/01/2011

Crossref

Springer - Publisher Connector

PubMed Central

Protein Sequences Identification using NM-tree

Authors: David Hoksza, Jakub Galgonek, Jakub Lokoč, Jiří Novák, Tomáš Skopal
Publication date: 05/03/2020

We have generalized a method for interpreting tandem mass spectra that is based on the parameterized Hausdorff distance d_HP. Instead of just peptides (short pieces of proteins), this paper describes the interpretation of whole protein sequences. For this purpose, we employ the recently introduced NM-tree to index the database of hypothetical mass spectra for exact or fast approximate search. The NM-tree combines the M-tree with the TriGen algorithm in a way that allows the retrieval precision to be controlled dynamically at query time. A scheme for identifying protein sequences using the NM-tree is proposed.
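For orientation, the sketch below computes the classic (unparameterized) Hausdorff distance between two lists of peak positions. It is not the parameterized variant d_HP used in the paper, whose definition the abstract does not give; the peak values and function names are illustrative assumptions.

```python
def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(abs(x - y) for y in b) for x in a)

def hausdorff(a, b):
    """Classic symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical peak positions (e.g. m/z values) of two spectra
observed = [100.0, 250.5, 400.0]
hypothetical = [100.2, 250.0, 399.0, 520.0]

distance = hausdorff(observed, hypothetical)
print(distance)
```

The unmatched peak at 520.0 dominates the result here, which shows how sensitive the classic form is to a single outlier.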

CiteSeerX

Podobnostní vyhledávání v databázích proteinových struktur (Similarity Search in Protein Structure Databases)

Author: Galgonek Jakub
Publication venue: Univerzita Karlova, Matematicko-fyzikální fakulta
Publication date: 01/01/2013

Proteins are among the most important biopolymers and perform a wide range of vital functions in living organisms. Their functional diversity stems from their ability to fold into a wide variety of 3D structures. Moreover, proteins that share a similar structure often share other properties as well (e.g., biological function or evolutionary origin), which is why protein structures and methods for identifying their similarities are so widely studied. In this thesis, we introduce a system for similarity search in protein structure databases. Given a query structure, the system retrieves all database structures that are structurally similar to it. The system consists of several key components. We introduce a novel similarity measure that assigns similarity scores to pairs of protein structures. We have designed a dedicated access method for this measure, based on the LAESA metric access method. The access method finds similar structures much more efficiently than a sequential scan of the database. To achieve further speedup, both the measure and the access method have been parallelized, resulting in almost linear speedup with respect to the number of available cores. The last component is a web user interface that accepts a query structure and presents a list of...

Department of Software Engineering, Faculty of Mathematics and Physics
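The abstract says the access method is based on LAESA and avoids a sequential scan. The sketch below shows only the basic pivot-table filtering idea behind LAESA-style methods, not the thesis implementation: distances from every object to a few pivots are precomputed, and the triangle inequality then lets a range query discard objects without computing their distance to the query. The Euclidean distance on 2D points stands in for the protein similarity measure, and the `PivotTable` class and all data are hypothetical.

```python
import math

def euclidean(a, b):
    """Stand-in metric; the thesis uses its own protein similarity measure."""
    return math.dist(a, b)

class PivotTable:
    def __init__(self, objects, pivots, dist):
        self.objects = objects
        self.pivots = pivots
        self.dist = dist
        # Precomputed distances: table[i][j] = dist(objects[i], pivots[j])
        self.table = [[dist(o, p) for p in pivots] for o in objects]

    def range_query(self, query, radius):
        """Return (matching objects, number of object distances computed)."""
        q_to_pivots = [self.dist(query, p) for p in self.pivots]
        results, computed = [], 0
        for obj, row in zip(self.objects, self.table):
            # Triangle inequality: |d(q,p) - d(o,p)| <= d(q,o) for every pivot p
            lower_bound = max(abs(qp - op) for qp, op in zip(q_to_pivots, row))
            if lower_bound > radius:
                continue  # discarded without computing d(q, o)
            computed += 1
            if self.dist(query, obj) <= radius:
                results.append(obj)
        return results, computed

objects = [(0, 0), (1, 1), (5, 5), (9, 9), (10, 0)]
index = PivotTable(objects, pivots=[(0, 0), (10, 10)], dist=euclidean)
hits, computed = index.range_query(query=(1, 0), radius=1.5)
print(hits, computed)
```

The filtering is only guaranteed to be exact when the distance satisfies the triangle inequality, so this sketch assumes a true metric.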

CU Digital Repository

Podobnostní vyhledávání v databázích proteinových struktur (Similarity Search in Protein Structure Databases)

Author: Galgonek Jakub
Publication venue: Univerzita Karlova, Matematicko-fyzikální fakulta
Publication date: 01/01/2012

Proteins are among the most important biopolymers and perform a wide range of vital functions in living organisms. Their functional diversity stems from their ability to fold into a wide variety of 3D structures. Moreover, proteins that share a similar structure often share other properties as well (e.g., biological function or evolutionary origin), which is why protein structures and methods for identifying their similarities are so widely studied. In this thesis, we introduce a system for similarity search in protein structure databases. Given a query structure, the system retrieves all database structures that are structurally similar to it. The system consists of several key components. We introduce a novel similarity measure that assigns similarity scores to pairs of protein structures. We have designed a dedicated access method for this measure, based on the LAESA metric access method. The access method finds similar structures much more efficiently than a sequential scan of the database. To achieve further speedup, both the measure and the access method have been parallelized, resulting in almost linear speedup with respect to the number of available cores. The last component is a web user interface that accepts a query structure and presents a list of...

Department of Software Engineering, Faculty of Mathematics and Physics

CU Digital Repository

Query languages for the Semantic web

Author: Galgonek Jakub
Publication date: 01/01/2008

The idea of the Semantic Web brings new requirements, such as creating and storing metadata about documents or resources. The ability to search such metadata is also needed. Unfortunately, existing query languages for the Semantic Web are either too weak or have complicated syntax or semantics. The aim of this thesis is to compare existing Semantic Web query languages and to propose a new one with regard to its expressive power. The comparison juxtaposes their approaches to various issues in querying, for example basic selection of data, the ability to select data with a recursively defined structure, creation of data, and the handling of blank nodes. On the basis of this comparison, the Tequila language is proposed. Tequila is based on named patterns and provides general recursion. The thesis also shows how to use the Tequila language and compares it with other query languages.

National Repository of Grey Literature

Query languages for the Semantic web

Author: Galgonek Jakub
Publication date: 01/01/2008

CiteSeerX

CU Digital Repository

National Repository of Grey Literature

Similarity Search in Protein Structure Databases

Author: Galgonek Jakub
Publication date: 01/01/2012

National Repository of Grey Literature

A comparison of approaches to accessing existing biological and chemical relational databases via SPARQL

Authors: Jakub Galgonek, Jiří Vondrášek
Publication venue: Springer Science and Business Media LLC
Publication date: 01/06/2023

Current biological and chemical research increasingly depends on the reusability of previously acquired data, which typically come from various sources. Consequently, there is a growing need for database systems, and the databases stored in them, to be interoperable. One possible solution is to use systems based on Semantic Web technologies, namely the Resource Description Framework (RDF) to express data and the SPARQL query language to retrieve them. Many existing biological and chemical databases are stored as relational databases (RDB). Converting a relational database into RDF form and storing it in a native RDF database system may not be desirable in many cases. It may be necessary to preserve the original database form, and maintaining two versions of the same data may not be convenient. A solution may be to use a system that maps the relational database to the RDF form. Such a system keeps the data in their original relational form and translates incoming SPARQL queries into equivalent SQL queries, which are evaluated by a relational database system. This review compares different RDB-to-RDF mapping systems, focusing primarily on those that can be used free of charge. In addition, it compares different approaches to expressing RDB-to-RDF mappings. The review shows that these systems are a viable approach that provides sufficient performance. Their real-life performance is demonstrated on data and queries from the neXtProt project.
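The translation step described in the abstract can be illustrated with a deliberately tiny sketch: a single SPARQL triple pattern of the form `?s <predicate> ?o` is rewritten into an SQL query over a relational table. The mapping format, IRIs, table, and `triple_pattern_to_sql` function are hypothetical and far simpler than the mapping systems the review compares; the example runs on an in-memory SQLite database.

```python
import sqlite3

# Hypothetical mapping: predicate IRI -> (table, key column, value column)
MAPPING = {
    "http://example.org/name": ("protein", "id", "name"),
    "http://example.org/mass": ("protein", "id", "mass"),
}
SUBJECT_PREFIX = "http://example.org/protein/"  # hypothetical subject IRI template

def triple_pattern_to_sql(predicate_iri):
    """Translate the pattern `?s <predicate_iri> ?o` into an SQL query."""
    # Identifiers come from the fixed mapping above, never from user input.
    table, key, value = MAPPING[predicate_iri]
    return (
        f"SELECT '{SUBJECT_PREFIX}' || {key} AS s, {value} AS o "
        f"FROM {table} WHERE {value} IS NOT NULL"
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT, mass REAL)")
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?)",
    [("P1", "alpha", 1200.5), ("P2", "beta", None)],  # made-up rows
)

sql = triple_pattern_to_sql("http://example.org/mass")
rows = conn.execute(sql).fetchall()
print(sql)
print(rows)
```

Rows with a NULL value produce no triple, which mirrors the usual convention that a missing relational value has no RDF counterpart.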

Directory of Open Access Journals