ProtNN: Fast and Accurate Nearest Neighbor Protein Function Prediction based on Graph Embedding in Structural and Topological Space
Studying the function of proteins is important for understanding the
molecular mechanisms of life. The number of publicly available protein
structures has grown extremely large. Still, the determination of
the function of a protein structure remains a difficult, costly, and time
consuming task. The difficulties are often due to the essential role of spatial
and topological structures in the determination of protein functions in living
cells. In this paper, we propose ProtNN, a novel approach for protein function
prediction. Given an unannotated protein structure and a set of annotated
proteins, ProtNN finds the nearest neighbor annotated structures based on
protein-graph pairwise similarities. Given a query protein, ProtNN finds the
nearest neighbor reference proteins based on a graph representation model and a
pairwise similarity between vector embedding of both query and reference
protein-graphs in structural and topological spaces. ProtNN assigns to the
query protein the function with the highest number of votes across the set of k
nearest neighbor reference proteins, where k is a user-defined parameter.
Experimental evaluation demonstrates that ProtNN is able to accurately classify
several datasets with an extremely fast runtime compared to state-of-the-art
approaches. We further show that ProtNN is able to scale up to the whole PDB
dataset in single-process mode with no parallelization, with a runtime
speedup on the order of thousands of times compared to state-of-the-art
approaches.
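The voting step described in this abstract can be illustrated with a minimal sketch. It assumes each protein graph has already been embedded as a numeric vector and uses Euclidean distance as the pairwise measure; the abstract does not specify either choice, and the function name `knn_predict` and the toy labels are hypothetical.

```python
import math
from collections import Counter


def knn_predict(query, references, k=3):
    """Return the majority label among the k references nearest to `query`.

    `references` is a list of (embedding_vector, function_label) pairs.
    Euclidean distance is an assumption made for this sketch only.
    """
    # Rank reference proteins by distance to the query embedding.
    ranked = sorted(references, key=lambda ref: math.dist(query, ref[0]))
    # Count the labels of the k nearest references and keep the top vote.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy reference set with made-up two-dimensional embeddings.
references = [
    ([0.0, 0.0], "kinase"),
    ([0.1, 0.0], "kinase"),
    ([0.2, 0.1], "kinase"),
    ([1.0, 1.0], "hydrolase"),
    ([0.9, 1.0], "hydrolase"),
]
```

Here `k` plays the role of the user-defined parameter mentioned in the abstract; larger values smooth the vote at the cost of mixing in more distant neighbors.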
Mining Representative Unsubstituted Graph Patterns Using Prior Similarity Matrix
One of the most powerful techniques to study protein structures is to look
for recurrent fragments (also called substructures or spatial motifs), then use
them as patterns to characterize the proteins under study. An emergent trend
consists in parsing protein three-dimensional (3D) structures into graphs of
amino acids. Hence, the search for recurrent spatial motifs is formulated as a
process of frequent subgraph discovery where each subgraph represents a spatial
motif. In this scope, several efficient approaches for frequent subgraph
discovery have been proposed in the literature. However, the set of discovered
frequent subgraphs is too large to be efficiently analyzed and explored in any
further process. In this paper, we propose a novel pattern selection approach
that shrinks the large number of discovered frequent subgraphs by selecting the
representative ones. Existing pattern selection approaches do not exploit the
domain knowledge. Yet, in our approach we incorporate the evolutionary
information of amino acids defined in the substitution matrices in order to
select the representative subgraphs. We show the effectiveness of our approach
on a number of real datasets. Our experimental results show that
our approach is able to considerably decrease the number of motifs while
enhancing their interestingness.
Towards an Efficient Discovery of the Topological Representative Subgraphs
With the emergence of graph databases, the task of frequent subgraph
discovery has been extensively addressed. Although the proposed approaches in
the literature have made this task feasible, the number of discovered frequent
subgraphs is still too high to be used efficiently in any further exploration.
Feature selection for graph data is a way to reduce the high number of frequent
subgraphs based on exact or approximate structural similarity. However, current
structural similarity strategies are not efficient enough in many real-world
applications. Moreover, the combinatorial nature of graphs makes structural
comparison computationally very costly. In order to select a smaller yet
structurally non-redundant set of subgraphs, we propose a novel approach that
mines the top-k topological representative subgraphs among the frequent ones.
Our approach detects hidden structural similarities that existing approaches
are unable to detect, such as those based on the density or the diameter of
the subgraph. In addition, it can be easily extended using any user-defined
structural or topological attributes depending on the sought properties.
Empirical studies on real and synthetic graph datasets show that our approach
is fast and scalable.
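The two topological attributes named in this abstract, density and diameter, can be computed as in the rough sketch below. It assumes each subgraph is a connected, undirected graph stored as an adjacency dictionary; the function names and the two-element descriptor are illustrative choices, not the authors' implementation.

```python
from collections import deque


def density(adj):
    """Edge density of an undirected graph given as {node: set(neighbors)}."""
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) // 2
    return 0.0 if n < 2 else 2 * m / (n * (n - 1))


def diameter(adj):
    """Longest shortest path, found by a breadth-first search from every node.

    Assumes the graph is connected.
    """
    def eccentricity(source):
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())

    return max(eccentricity(node) for node in adj)


def describe(adj):
    """Topological description vector (density, diameter) of a subgraph."""
    return (density(adj), float(diameter(adj)))


# Two toy subgraphs: a three-node path and a triangle.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
```

Subgraphs whose description vectors lie close together can then be treated as structurally similar without running an isomorphism test, and further attributes can be appended to the vector as needed.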
Towards an Efficient Discovery of Topological Representative Subgraphs
Pattern selection based on exact or approximate structural similarity is a way to reduce the high number of frequent subgraphs. However, current structural similarity strategies are not efficient in many real-world contexts. Moreover, the combinatorial nature of graphs makes exact or approximate isomorphism very costly. In this paper, we propose an approach that selects a subset of topological representative subgraphs among the frequent ones. The proposed approach avoids the costly exact or approximate isomorphism test by measuring overall structural similarity based on a set of considered topological attributes. It also detects hidden structural similarities (such as density, diameter, etc.) that existing approaches do not consider. In addition, the proposed approach is flexible and can be easily extended with user-defined attributes depending on the application. Experimental analyses on real and synthetic graph datasets show the effectiveness of our approach.
Discovery of Functional Motifs from the Interface Region of Oligomeric Proteins using Frequent Subgraph Mining
Modeling the interface region of a protein complex paves the way for understanding its dynamics and functionalities. Existing works model the interface region of a complex using different approaches, such as the residue composition at the interface region, the geometry of the interface residues, or the structural alignment of interface regions. These approaches are useful for ranking a set of docked conformations or for building scoring functions for protein-protein docking, but they do not provide a generic and scalable technique for the extraction of interface patterns leading to functional motif discovery. In this work, we model the interface region of a protein complex by graphs and extract interface patterns of the given complex in the form of frequent subgraphs. To achieve this, we develop a scalable algorithm for frequent subgraph mining. We show that a systematic review of the mined subgraphs provides an effective method for the discovery of functional motifs that exist along the interface region of a given protein complex.
From SIR to SEAIRD: a novel data-driven modeling approach based on the Grey-box System Theory to predict the dynamics of COVID-19
Common compartmental models for COVID-19 are based on a priori knowledge and
numerous assumptions. Additionally, they do not systematically incorporate
asymptomatic cases. Our study aimed at providing a framework for data-driven
approaches, by leveraging the strengths of the grey-box system theory or
grey-box identification, known for its robustness in problem solving under
partial, incomplete, or uncertain data. Empirical data on confirmed cases and
deaths, extracted from an open source repository, were used to develop the
SEAIRD compartment model. Adjustments were made to fit current knowledge on the
COVID-19 behavior. The model was implemented and solved using an Ordinary
Differential Equation solver and an optimization tool. A cross-validation
technique was applied, and the coefficient of determination was computed
in order to evaluate the goodness-of-fit of the model. Key
epidemiological parameters were finally estimated and we provided the rationale
for the construction of the SEAIRD model. When applied to Brazil's cases, SEAIRD
produced excellent agreement with the data, as measured by the coefficient of
determination. The probability of COVID-19 transmission was
generally high. On the basis of 20 days of modeling data, the
incidence rate of COVID-19 was as low as 3 infected cases per 100,000 exposed
persons in Brazil and France. Within the same time frame, the fatality rate of
COVID-19 was the highest in France (16.4%) followed by Brazil (6.9%), and the
lowest in Russia. SEAIRD represents an asset for modeling
infectious diseases in their dynamical stable phase, especially for new viruses
when pathophysiology knowledge is very limited.
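The goodness-of-fit measure named in this abstract, the coefficient of determination, follows the standard definition R^2 = 1 - SS_res / SS_tot. The sketch below shows that computation only; the function name and the sample series are illustrative and are not taken from the study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes both sequences have equal length and that the observed
    values are not all identical (otherwise SS_tot is zero).
    """
    mean_observed = sum(observed) / len(observed)
    # Residual sum of squares: model error against the data.
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the data around its mean.
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Made-up series for illustration, not data from the paper.
observed_cases = [1.0, 2.0, 3.0, 4.0]
perfect_fit = [1.0, 2.0, 3.0, 4.0]
mean_only_fit = [2.5, 2.5, 2.5, 2.5]
```

A value of 1 means the model reproduces the observations exactly, while a model that only predicts the mean of the data scores 0.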
Fouille de sous-graphes basée sur la topologie et la connaissance du domaine : application sur les structures 3D de protéines (Subgraph Mining Based on Topology and Domain Knowledge: Application to Protein 3D Structures)
This thesis lies at the intersection of two rapidly growing research fields, namely data mining and bioinformatics. With the emergence of graph data in the last few years, many efforts have been devoted to mining frequent subgraphs from graph databases. Yet, the number of discovered frequent subgraphs is usually exponential, mainly because of the combinatorial nature of graphs. Many frequent subgraphs are irrelevant because they are redundant or simply useless for the user. Besides, their high number may hinder or even prevent further exploration. Redundancy in frequent subgraphs is mainly caused by structural and/or semantic similarities, since most discovered subgraphs differ only slightly in structure and may convey similar or even identical meanings. In this thesis, we propose two approaches for selecting representative subgraphs among frequent ones in order to remove redundancy. Each of the proposed approaches addresses a specific type of redundancy. The first approach focuses on semantic redundancy, where similarity between subgraphs is measured based on the similarity between their nodes' labels, using prior domain knowledge. The second approach focuses on structural redundancy, where subgraphs are represented by a set of user-defined topological descriptors, and similarity between subgraphs is measured based on the distance between their corresponding topological descriptions. The main application data of this thesis are protein 3D structures. This choice is based on biological and computational reasons. From a biological perspective, proteins play crucial roles in almost every biological process. They are responsible for a variety of physiological functions. From a computational perspective, we are interested in mining complex data. Proteins are a perfect example of such data, as they are made of complex structures composed of interconnected amino acids, which themselves are composed of interconnected atoms.
Large numbers of protein structures are currently available in online databases, in computer-analyzable formats. Protein 3D structures can be transformed into graphs where amino acids are the graph nodes and their connections are the graph edges. This enables using graph mining techniques to study them. The biological importance of proteins, their complexity, and their availability in computer-analyzable formats made them ideal application data for this thesis.
Face Recognition in the Wild
Face recognition is one of the most important tasks in pattern recognition and computer vision. The most conventional way to perform face recognition is to compare a set of facial features extracted from a source image or a video frame with a reference image database of known faces. Such a classification takes the form of a prediction within a closed set of classes. However, a more realistic scenario, one that matches real-world face recognition applications, is to consider the possibility of encountering faces that do not belong to any of the training classes, i.e., an open-set classification. Such a constraint is very challenging for most existing face recognition systems, since they are based on closed-set classification methods that always assign a training label to novel unknown instances, even if they represent unseen faces that are not in the reference database. This results in a misclassification. In this paper, we introduce Face Recognition in the Wild (FRW), a novel face recognition system that (1) efficiently recognizes known faces from the reference database and (2) avoids misclassifying instances that represent unknown and unseen faces. FRW formulates this problem as a multi-class classification in an open-set context where the presence of instances from unknown classes is possible. Experimental results on the challenging Olivetti Faces benchmark dataset show the efficiency of our approach in open-set face recognition problems.
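The difference between closed-set and open-set prediction can be illustrated with a generic distance-threshold rejection rule. This is a common baseline and not necessarily how FRW works, since the abstract does not describe its mechanism; the feature vectors, the `threshold` value, and the name `open_set_predict` are all assumptions made for the sketch.

```python
import math

UNKNOWN = "unknown"


def open_set_predict(query, gallery, threshold):
    """Label `query` with its nearest gallery identity, or reject it.

    `gallery` is a list of (feature_vector, identity) pairs. A closed-set
    classifier would always return the nearest identity; here the query is
    rejected as UNKNOWN when even the nearest match is farther away than
    `threshold`.
    """
    best_label, best_distance = UNKNOWN, math.inf
    for features, identity in gallery:
        distance = math.dist(query, features)
        if distance < best_distance:
            best_label, best_distance = identity, distance
    return best_label if best_distance <= threshold else UNKNOWN


# Toy gallery with made-up two-dimensional face features.
gallery = [([0.0, 0.0], "alice"), ([1.0, 1.0], "bob")]
```

With a threshold of 0.5, a query near `alice` is recognized, while a query far from every gallery entry is returned as `unknown` instead of being forced into a known class.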