Discriminating between disease-causing missense mutations and neutral polymorphisms is a key challenge in current sequencing studies. It is therefore critical to be able to evaluate, fairly and without bias, the performance of the many in silico predictors of deleteriousness. However, current analyses of such tools and their combinations are liable to suffer from the effects of circularity, which occurs when predictors are evaluated on data that are not independent of the data used to build them, and which may lead to overly optimistic results. Circularity can first stem from the overlap between training and evaluation datasets, which may result in the well-studied phenomenon of overfitting: a tool that is too closely tailored to a given dataset is more likely than others to perform well on that set, but runs the risk of failing more heavily when classifying novel variants. Second, we find that circularity may result from an investigation bias in the way mutation databases are populated: in most cases, all the variants of the same protein are annotated with the same (neutral or pathogenic) status. Furthermore, proteins containing only deleterious SNVs comprise many more labeled variants than their counterparts containing only neutral SNVs. When this bias is ignored, we find that simply assigning a variant the same status as that of its closest variant on the genomic sequence outperforms all state-of-the-art tools. Given these barriers to valid assessment of the performance of deleteriousness prediction tools, we employ approaches that avoid circularity, and hence provide an independent evaluation of ten state-of-the-art tools and their combinations. Our detailed analysis provides scientists with critical insights to guide their choice of tool as well as the future development of new methods for deleteriousness prediction.
In particular, we demonstrate that the performance of FatHMM-W relies mostly on knowledge of the labels of neighboring variants, which may hinder its ability to annotate variants in the less explored regions of the genome. We also find that PolyPhen2 performs as well as or better than all other tools at discriminating between cases and controls in a novel autism-relevant dataset. Based on our findings about the mutation databases available for training deleteriousness prediction tools, we predict that retraining PolyPhen2 features on the Varibench dataset will yield even better performance, and we show that this holds for the autism-relevant dataset.
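
To make the closest-variant baseline mentioned above concrete, the following is a minimal sketch rather than the procedure actually used in the study. It assumes that "closest" means the smallest absolute difference in genomic position on the same chromosome, that the query variant itself is excluded (as in a leave-one-out evaluation), and that ties are broken in favor of the upstream neighbor. None of these choices is specified in the text. The function names and the toy variants are hypothetical and serve illustration only.

```python
from bisect import bisect_left
from collections import defaultdict


def build_index(labeled_variants):
    """Group (chrom, pos, label) triples by chromosome, sorted by position."""
    index = defaultdict(list)
    for chrom, pos, label in labeled_variants:
        index[chrom].append((pos, label))
    for entries in index.values():
        entries.sort()
    return index


def nearest_variant_label(index, chrom, pos):
    """Return the label of the closest labeled variant on the same chromosome.

    Variants located exactly at `pos` are skipped, so a variant never
    predicts its own label. Returns None when no other variant is available.
    """
    entries = index.get(chrom, [])
    # First entry whose position is >= pos ((pos,) sorts before (pos, label)).
    i = bisect_left(entries, (pos,))
    left = i - 1                      # closest entry strictly upstream
    right = i
    while right < len(entries) and entries[right][0] == pos:
        right += 1                    # skip entries at the query position
    candidates = []
    if left >= 0:
        candidates.append(entries[left])
    if right < len(entries):
        candidates.append(entries[right])
    if not candidates:
        return None
    # Assumption: on a distance tie, the upstream neighbor wins (listed first).
    return min(candidates, key=lambda e: abs(e[0] - pos))[1]


# Toy, hypothetical data: labels cluster by region, mimicking the
# investigation bias described in the text.
toy_variants = [
    ("chr1", 100, "pathogenic"),
    ("chr1", 120, "pathogenic"),
    ("chr1", 5000, "neutral"),
    ("chr1", 5030, "neutral"),
    ("chr2", 40, "neutral"),
]
toy_index = build_index(toy_variants)
prediction = nearest_variant_label(toy_index, "chr1", 5000)
```

Because labels in such a database cluster by protein, a baseline of this kind can score well without using any biological feature of the variant, which is why a predictor that is evaluated on neighboring labeled variants can appear more accurate than it is.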