PROTEIN SOLUBILITY CLASSIFICATION IN BIOMEDICAL CONCEPTS SPACE

Abstract

Proteins are an important part of every organism and perform many important functions, which depend largely on the protein's structure. Structure is therefore a frequent research subject: researchers isolate an individual protein and study its structural properties. The isolation process is strongly influenced by the protein's solubility, because a protein with low solubility is very difficult to isolate. Insoluble proteins are also the cause of some important diseases. For these reasons, researchers often want to know in advance which proteins are more likely to be highly soluble. Consequently, many methods have been developed that use supervised machine learning techniques to classify protein solubility. These methods classify proteins as soluble or insoluble and are used to predict the solubility of new instances. In this dissertation we propose a new method for protein solubility classification that uses text mining techniques to extract medical knowledge from the scientific literature and represent it as attributes. We call these biomedical concept attributes; they are a novelty in the field of protein solubility classification, since the methods used so far are limited to attributes that are mostly derived only from the protein's sequence. The dissertation thus makes several scientific contributions. We propose a method for extracting biomedical concept attributes from the scientific literature based on a protein's name or identification number. We further provide an original comparison of methods that use the new attributes with methods that use established attributes derived from the protein's sequence. As the dissertation shows, the new attributes improve the performance of protein solubility classification. An algorithm for implementing the best-performing classifier with biomedical concept attributes is also given.
The final contribution is new medical knowledge that indicates which groups of words and phrases from the scientific literature are most strongly associated with protein solubility. The dissertation comprises eight chapters, which present in detail the theoretical background of fields such as supervised machine learning, text mining, and protein structure and solubility. A substantial part of the dissertation describes protein databases that provide information on protein solubility, as well as the developed method and its comparison with previously used methods. An empirical comparison is carried out on twenty sequence-attribute datasets, to which we gradually add the new attributes while tracking their contribution to the performance of three commonly used classification methods.

Proteins are an essential part of every organism and each protein has its own function, which depends on the protein’s structure. The latter is an important research topic, and researchers often isolate proteins from complex mixtures to study their structures. The isolation process is in many ways influenced by the protein’s solubility, since insoluble proteins are usually harder to isolate than soluble ones. In addition, low protein solubility has been linked to different diseases. For these reasons, researchers often wish to identify which proteins are more likely to be soluble. As a result, several protein solubility classification algorithms have been proposed. Roughly speaking, these algorithms take a set of soluble and insoluble proteins as input, learn their differences and produce a classifier that can be used to predict solubility for new proteins. In this thesis we propose a new method for protein solubility classification, which uses text mining techniques to define protein attributes. This new method extracts biomedical knowledge from scientific literature and presents this knowledge in the form of so-called biomedical concept attributes.
These attributes present a novel approach to describing proteins in the classification process, since today’s state-of-the-art classification methods mostly use attributes derived from the protein’s sequence. To evaluate the new method, this thesis describes the classification scheme for an empirical study that measures the impact of the new attributes on protein solubility classification. In the study, the twenty most common sequence-derived attribute datasets are analysed, and five types of biomedical concept attributes are gradually added to them. We measure the performance of the classifiers obtained from these attribute datasets. As a result, this thesis introduces several original scientific contributions. First, an analysis of protein databases that contain information about protein solubility is performed. Second, the method for extracting biomedical concept attributes is presented. Next, an original comparison of methods that use biomedical concept attributes with those that use only sequence-derived attributes is performed. The thesis demonstrates that the new attributes increase the performance of some classifiers. Finally, it identifies types of words and word associations from the medical literature that are associated with protein solubility.
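
The evaluation scheme described above (a sequence-derived baseline, to which concept attributes are added before classifier performance is measured again) can be illustrated with a minimal Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the thesis: the amino-acid composition baseline, the four-term `CONCEPTS` vocabulary, the nearest-centroid classifier (a stand-in, since the abstract does not name the three classification methods), the leave-one-out protocol, and the invented toy records.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical concept vocabulary. A real one would come from a biomedical
# thesaurus or ontology, which the abstract does not name.
CONCEPTS = ("membrane", "aggregation", "cytoplasm", "inclusion body")


def sequence_attributes(sequence):
    """Amino-acid composition: fraction of each of the 20 standard residues."""
    counts = Counter(sequence.upper())
    total = max(len(sequence), 1)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def concept_attributes(text, concepts=CONCEPTS):
    """Occurrences of each concept term in the literature text of one protein."""
    lowered = text.lower()
    return [float(lowered.count(term)) for term in concepts]


def combined_attributes(sequence, text):
    """Sequence-derived attributes followed by biomedical concept attributes."""
    return sequence_attributes(sequence) + concept_attributes(text)


def nearest_centroid_predict(train_x, train_y, x):
    """Return the label whose class centroid is closest to x (squared Euclidean)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_x, train_y) if y == label]
        centroid = [sum(col) / len(rows) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def leave_one_out_accuracy(xs, ys):
    """Fraction of examples predicted correctly when each is held out in turn."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += nearest_centroid_predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)


# Invented toy records (sequence, literature snippet, label); not real data.
PROTEINS = [
    ("MKKLEDEEKR", "found in the cytoplasm", "soluble"),
    ("MEEKDDKRKE", "cytoplasm enzyme", "soluble"),
    ("MLLVVIIFFW", "membrane protein prone to aggregation", "insoluble"),
    ("MFFILLVVAW", "forms an inclusion body; aggregation reported", "insoluble"),
]

labels = [label for _, _, label in PROTEINS]
baseline = [sequence_attributes(seq) for seq, _, _ in PROTEINS]
extended = [combined_attributes(seq, text) for seq, text, _ in PROTEINS]

print("sequence only:", leave_one_out_accuracy(baseline, labels))
print("with concepts:", leave_one_out_accuracy(extended, labels))
```

The two printed accuracies only show the mechanics of comparing a baseline attribute set with an extended one. With four made-up records they say nothing about real proteins, and the thesis results should be taken from the thesis itself.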
