Source code for "Previously Unidentified Duplicate Registrations of Clinical Trials: an Exploratory Analysis of Registry Data Worldwide" (under review).
This code was used to process the WHO International Clinical Trials Registry Platform (ICTRP) dataset retrieved in April 2015 (see related). It imports the XML data into a SQL database and applies a number of standardizations. It also groups records by the primary registry IDs they reference and computes text-based similarity scores on registration fields.
The README file included with the code provides detailed instructions on dependencies and on running the code.
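As a rough illustration of the grouping and similarity-scoring steps described above, the following Python sketch shows one way such steps could work. It is not taken from the deposited code: the function names, the record layout, the registry IDs, and the use of a union-find structure and `difflib.SequenceMatcher` are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def group_by_shared_ids(records):
    """Group records that reference at least one common registry ID.

    `records` maps a record key to the set of registry IDs it mentions
    (hypothetical layout, not the schema used by the deposited code).
    Records are linked transitively: if A shares an ID with B and B
    shares one with C, all three end up in the same group.
    """
    parent = {key: key for key in records}

    def find(key):
        # Walk up to the group representative, shortening the path as we go.
        while parent[key] != key:
            parent[key] = parent[parent[key]]
            key = parent[key]
        return key

    first_seen = {}  # registry ID -> first record key that mentioned it
    for key, ids in records.items():
        for registry_id in ids:
            if registry_id in first_seen:
                # Merge this record's group with the earlier record's group.
                parent[find(key)] = find(first_seen[registry_id])
            else:
                first_seen[registry_id] = key

    groups = {}
    for key in records:
        groups.setdefault(find(key), []).append(key)
    return sorted(sorted(group) for group in groups.values())


def field_similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two text fields.

    Case and whitespace are normalized first; SequenceMatcher is only a
    stand-in for whatever scoring method the deposited code uses.
    """
    def normalize(text):
        return " ".join(text.lower().split())

    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


# Made-up records and IDs, for illustration only.
example_records = {
    "rec1": {"NCT00000001", "ISRCTN00000001"},
    "rec2": {"ISRCTN00000001"},
    "rec3": {"NCT00000002"},
}
example_groups = group_by_shared_ids(example_records)
```

Here `example_groups` places `rec1` and `rec2` together because they share an ID, and leaves `rec3` on its own. Records within a group, or across groups, could then be compared field by field with `field_similarity` to flag likely duplicates for manual review.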