Proteins perform very important functions within organisms. Predicting thesefunctions is a major problem in biology. To address this issue, predictive models of functionalfamilies from the sequences of amino acids that form the proteins have been developed. TheDyliss team developed a machine learning algorithm, named Protomata-learner, that learnsweighted automata representing these families and the possible disjunctions between members.New sequences can be compared to these models and assigned a score to predict their belongingto the family.Despite good results, the sequence weighting strategy and the null-models in Protomata arerather basic. During my internship, I investigated alternative sequence weighting strategies andnull-models. Besides, the expressivity of Protomata leads to a great variability of scores and thechoice of the classification threshold was left to the user. So, I proposed a normalization of thescore, and a method to assess the significance of scores, to simplify the prediction. I implementedthese new strategies and compared them on several data sets. Preliminary results show a goodimprovement of the prediction power of the computed models
To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.