Mining sequential patterns for determination of protein characteristics
Proteins are significant biological macromolecules of polymeric nature
(polypeptides), which consist of amino acids and are the basic structural units of every cell.
They are built from 20+3 amino acids and are therefore represented
in biological databases as sequences formed from 23 different characters. Proteins
can be classified by their primary structure, secondary structure, function, etc.
One possible classification of proteins by function is based on their membership
in a certain cluster of orthologous groups (COG). This classification relies on
a prior comparison of proteins by the similarity of their primary structures,
which is most often a result of homology, i.e. their common (evolutionary) origin.
The COG database is obtained by comparing the known and predicted proteins encoded
in completely sequenced prokaryotic (archaeal and bacterial) genomes and
classifying them by orthology. The proteins are classified into 25 categories, which
can be arranged into three basic functional groups (proteins responsible for (1)
information storage and processing, (2) cellular processes and signaling, and (3)
metabolism) or into a group of poorly characterized proteins. Classifying proteins
by their membership in a COG category (or a KOG category, euKaryotic Orthologous
Groups, for eukaryotic organisms) is significant for a better understanding of biological
processes and various pathological conditions in humans and other organisms.
The dissertation proposes a model for classifying proteins into COG categories
based on amino acid n-grams (subsequences of length n). The data set contains
protein sequences from genomes of 8 different taxonomic classes [TKL97] of bacteria
(Aquificales, Bacteroidia, Chlamydiales, Chlorobia, Chloroflexia, Cytophagia,
Deinococci, Prochlorales) that have already been classified by COG category.
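As an illustration of the amino acid n-grams mentioned above, the following minimal Python sketch extracts all overlapping n-grams from a protein sequence. It is not taken from the dissertation: the function names ngrams and ngram_counts are hypothetical, and the choice of B, Z and X as the three extra symbols of the 23-character alphabet is an assumption, since the abstract does not name them.

```python
from collections import Counter

# 20 standard amino acid letters plus three extra symbols.
# ASSUMPTION: the abstract does not say which three extra symbols are used;
# B, Z and X are chosen here purely for illustration.
ALPHABET = set("ACDEFGHIKLMNPQRSTVWY") | set("BZX")


def ngrams(sequence, n):
    """Return all overlapping n-grams of a protein sequence, in order."""
    if n < 1:
        raise ValueError("n must be positive")
    unknown = set(sequence) - ALPHABET
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def ngram_counts(sequence, max_n=10):
    """Count every n-gram of length 1..max_n occurring in the sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(sequence, n))
    return counts
```

For example, ngrams("MKVL", 2) yields the three 2-grams MK, KV and VL. A sequence shorter than n simply yields no n-grams.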
A new method is presented, based on generalized systems of Boolean
equations, for separating the n-grams characteristic of the proteins in each
COG category. The method significantly reduces the number of
processed n-grams compared with previously used methods of n-gram analysis,
so that less memory and less time are needed to process the proteins.
Previously known methods for classifying proteins by functional category
compared each new protein (whose function was to be determined) with the set of all
proteins already classified by function, in order to find the
group containing the proteins most similar to it.
The advantage of the new method over these is that it avoids
sequence-to-sequence comparison and instead searches the protein for patterns
(n-grams of length up to 10) that are characteristic of a COG category.
The selected patterns are assigned to their COG category and describe
sequences of a certain length that have previously appeared in that COG category
only, and not in the proteins of other COG categories.
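The idea of patterns that occur in one COG category only can be sketched with plain set bookkeeping, as below. This is only an illustration of the selection criterion described above, not the Boolean-equation method of the dissertation. The function name exclusive_ngrams and the toy sequences are hypothetical.

```python
from collections import defaultdict


def exclusive_ngrams(proteins_by_category, max_n=10):
    """Map each category to the n-grams (length 1..max_n) seen in it only.

    proteins_by_category: dict mapping a category label to a list of
    protein sequences (strings) already assigned to that category.
    """
    seen_in = defaultdict(set)  # n-gram -> categories in which it occurs
    for category, sequences in proteins_by_category.items():
        for seq in sequences:
            for n in range(1, max_n + 1):
                for i in range(len(seq) - n + 1):
                    seen_in[seq[i:i + n]].add(category)

    result = {category: set() for category in proteins_by_category}
    for gram, categories in seen_in.items():
        if len(categories) == 1:  # keep n-grams unique to one category
            (only,) = categories
            result[only].add(gram)
    return result


# Toy data, invented for illustration only.
toy = {"J": ["MKV"], "K": ["MKA"]}
patterns = exclusive_ngrams(toy, max_n=2)
```

With this toy input, the n-grams M, K and MK occur in both categories and are dropped, so the first category keeps V and KV and the second keeps A and KA.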
Based on the proposed method, a predictor that determines the
COG category of a new protein was implemented. The minimal precision of the
prediction is one of the predictor's arguments. In the test phase the
predictor showed excellent results, with a maximal precision of 99% reached for some
proteins.
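The abstract does not say how the predictor uses its minimal-precision argument, so the following Python sketch is only one possible reading. It assumes that each category's characteristic n-grams are matched against the new protein, and that a category is reported only when its share of all matches reaches the threshold. The function predict_category, the parameter min_precision and the example patterns are all hypothetical.

```python
def predict_category(sequence, patterns, min_precision=0.9, max_n=10):
    """Return the best-matching category, or None if no category is confident.

    patterns: dict mapping a category label to its set of characteristic
    n-grams. ASSUMPTION: the score is the share of matched n-grams that
    belong to the winning category; the dissertation's own definition of
    precision may differ.
    """
    hits = {}
    for category, grams in patterns.items():
        count = 0
        for n in range(1, max_n + 1):
            for i in range(len(sequence) - n + 1):
                if sequence[i:i + n] in grams:
                    count += 1
        hits[category] = count

    total = sum(hits.values())
    if total == 0:
        return None  # no characteristic n-gram found at all
    best = max(hits, key=hits.get)
    share = hits[best] / total
    return best if share >= min_precision else None


# Hypothetical characteristic n-grams for two categories.
example_patterns = {"J": {"V", "KV"}, "K": {"A", "KA"}}
```

Under these assumptions, a sequence whose matches all come from one category is assigned to it. A sequence with evenly split matches, or with none, is left unclassified.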
Owing to its properties and relatively simple construction, the proposed method
can be applied in similar domains where the solution of a problem is based on n-gram
sequence analysis.
Normalization of health records in the Serbian language with the aim of smart health services realization
The development of information technology has increased its use in various spheres of human activity, including healthcare. Large volumes of data and reports, such as symptoms, medical history, and the doctor's observations of patients' health, are generated and stored in textual form. Electronic recording of patient data not only facilitates day-to-day work in hospitals, enables more efficient data management, and reduces material costs, but can also be used for further processing and for gaining knowledge to improve public health. Publicly available health data would contribute to the development of telemedicine, e-health, epidemic control, and smart healthcare within smart cities. This paper describes the importance of textual data normalization for smart healthcare services. An algorithm for normalizing medical data in Serbian is proposed in order to prepare the data for further processing (F1-score = 0.816), in this case within the smart health framework. Applying this algorithm yields, in addition to the normalized medical records, corpora of keywords and stop words specific to the medical domain, which can be used to improve the results of normalizing medical textual data.