Automated metadata collection from the researcher CV Lattes Platform to aid IR ingest

Abstract

In 1999, the Brazilian National Council for Scientific and Technological Development (CNPq) launched the Lattes CV Platform, and all Brazilian higher education institutions (HEIs) require their researchers and staff to enter and update their publication metadata on the Platform. The Lattes CVs thus represent a rich source of metadata for Brazilian HEIs that need to identify which publications should be in their institutional repository (IR); the IR can be populated with this metadata, kept hidden from public view until the full-text file is ingested. Although these data are publicly accessible on the web and belong to the HEIs, their automated extraction has been restricted by the recent addition of a CAPTCHA to the Platform. To overcome this, we developed a proxy server (available at https://github.com/nitmateriais/cnpqwsproxy), based on the OpenResty platform, to share access to the Lattes SOAP services. The proxy lets the HEI manage which of its internal IP addresses can access the services and, by keeping local data caches, ensures that multiple applications from the same institution do not overload the CNPq servers. The retrieved data are in XML format and are processed by Python scripts with the aid of the lxml library and the XPath standard. Publication duplicates (i.e. identical metadata published in the different curricula of the different authors of the same paper) are detected by matching DOIs or by finding similar titles according to the Jaccard metric. Applying this solution, we retrieved the 1,166 curricula of researchers working at our HEI in 11 minutes, amounting to 573 MB of XML data that contain the metadata of 78,370 journal and proceedings papers. In this way, the specific objective of gaining direct and official access to the public metadata hosted on the Lattes Platform was attained.
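
The duplicate detection described above (DOI match, otherwise title similarity under the Jaccard metric) could look roughly like the following sketch. It is an illustration under stated assumptions, not the authors' code: the record layout (dictionaries with "title" and "doi" keys), the word-level tokenization, the accent stripping, the 0.8 threshold, and all function names are hypothetical choices made here, and the sample records are invented.

```python
import re
import unicodedata


def title_tokens(title):
    """Lowercase, strip accents and punctuation, and return the set of words."""
    decomposed = unicodedata.normalize("NFKD", title.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return set(re.findall(r"[a-z0-9]+", no_accents))


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_duplicate(rec_a, rec_b, threshold=0.8):
    """Assumed rule: if both records carry a DOI, the DOIs decide;
    otherwise fall back to title similarity against the threshold."""
    doi_a = (rec_a.get("doi") or "").strip().lower()
    doi_b = (rec_b.get("doi") or "").strip().lower()
    if doi_a and doi_b:
        return doi_a == doi_b
    similarity = jaccard(title_tokens(rec_a["title"]), title_tokens(rec_b["title"]))
    return similarity >= threshold


def deduplicate(records, threshold=0.8):
    """Keep the first occurrence of each publication (simple pairwise scan)."""
    unique = []
    for rec in records:
        if not any(is_duplicate(rec, kept, threshold) for kept in unique):
            unique.append(rec)
    return unique


# Invented sample records, as they might appear in two co-authors' curricula.
sample = [
    {"title": "Síntese de nanopartículas de prata", "doi": "10.1000/XYZ123"},
    {"title": "Sintese de Nanoparticulas de Prata.", "doi": "10.1000/xyz123"},
    {"title": "Synthesis of silver nanoparticles", "doi": None},
    {"title": "Synthesis of Silver Nanoparticles", "doi": ""},
]
unique_sample = deduplicate(sample)
```

The pairwise scan is quadratic in the number of records, which is acceptable for a sketch; with tens of thousands of papers, grouping records by DOI or by a normalized title key before comparing would likely be needed.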
