
    A sentence classification framework to identify geometric errors in radiation therapy from relevant literature

    The objective of systematic reviews is to address a research question by summarizing relevant studies following a detailed, comprehensive, and transparent plan and search protocol to reduce bias. Systematic reviews are very useful in the biomedical and healthcare domains; however, the data extraction phase of the systematic review process requires substantive expertise and is labour-intensive and time-consuming. The aim of this work is to partially automate the construction of systematic radiotherapy treatment literature reviews by summarizing the required data elements on geometric errors in radiotherapy from relevant literature using machine learning and natural language processing (NLP) approaches. The framework developed in this study first retrieves publications from PubMed following a set of rules defined by a domain expert. It then builds a training corpus by extracting sentences containing different types of geometric errors in radiotherapy from those publications using a sentence similarity measure. A support vector machine (SVM) classifier is trained on this corpus to extract, from new publications, the sentences that contain relevant geometric errors. To demonstrate the proposed approach, we used 60 publications on geometric errors in radiotherapy to automatically extract the sentences stating the mean and standard deviation of different types of errors between planned and executed radiotherapy. The experimental results show that the recall and precision of the proposed framework are 97% and 72%, respectively. These results show that the framework is able to extract almost all sentences containing the required geometric-error data.
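    The pipeline described above — weakly labelling a corpus with a sentence similarity measure, then training an SVM on it — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sentences, the TF-IDF cosine similarity, and the top-2 labelling rule are all illustrative stand-ins.

```python
# Hypothetical sketch of the framework: seed sentences describing geometric
# errors weakly label a corpus via cosine similarity, and an SVM is then
# trained to flag relevant sentences in new publications.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import LinearSVC

seed = ["The mean systematic setup error was 1.2 mm with a standard deviation of 0.8 mm."]
corpus = [
    "Random setup errors had a mean of 2.1 mm and a standard deviation of 1.4 mm.",
    "Patients were treated between 2005 and 2010 at a single centre.",
    "The systematic error mean was 0.9 mm with a standard deviation of 0.5 mm.",
    "Ethical approval was obtained from the institutional review board.",
]

vec = TfidfVectorizer().fit(seed + corpus)
sims = cosine_similarity(vec.transform(corpus), vec.transform(seed)).ravel()
# Weak labels: the two sentences most similar to the seed become positives
# (an illustrative stand-in for the paper's similarity thresholding).
labels = (sims >= sorted(sims)[-2]).astype(int)

clf = LinearSVC().fit(vec.transform(corpus), labels)
new = ["A mean setup error of 1.5 mm and a standard deviation of 1.0 mm were observed."]
print(clf.predict(vec.transform(new))[0])  # 1 = sentence reports a geometric error
```

    In practice the seed set would come from the domain expert's rules, and the classifier would be evaluated against manually extracted sentences, as the paper does with its 60 publications.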

    Automatic Data Extraction Utilizing Structural Similarity From A Set of Portable Document Format (PDF) Files

    Instead of storing data in databases, office workers who rely on computers often keep work-related data in document or report files that they can conveniently access with popular off-the-shelf software, such as Portable Document Format (PDF) files. Their workplaces may use databases, but the workers usually possess neither the privilege nor the proficiency to fully utilize them. Such workplaces typically have front-end systems, such as a Management Information System (MIS), from which workers obtain reports or documents containing their data. These documents are meant for immediate or presentational use, but workers often keep the files because the data inside may prove useful later. They can then manipulate and combine data from one or more report files to suit their work needs on the occasions when the MIS cannot fulfill those needs. To do this, workers need to extract data from the report files. However, the files also contain formatting and other content such as organization banners and signature placeholders. Extracting data from these files is not easy, and workers are often forced to rely on repeated copy-and-paste actions to get the data they want, which is not only tedious but also time-consuming and error-prone. Automatic data extraction is not new, and many solutions exist, but they typically require human guidance before the extraction can become truly automatic; they may also require expertise that makes workers hesitant to use them in the first place. A particular function of an MIS can produce many report files, each containing distinct data yet structurally similar. In this paper we demonstrate that, by targeting all PDF files that come from the same source and exploiting this similarity, it is possible to create a fully automatic data extraction system that requires no human guidance. First, a model is generated by analyzing a small sample of PDFs; the model is then used to extract data from all PDF files in the set. Our experiments show that the system can quickly achieve 100% accuracy with very few sample files. Occasionally, the data inside the PDFs are not sufficiently distinct from each other, resulting in accuracy below 100%, but such cases can be easily detected and fixed with slight human intervention. In these cases, eliminating human intervention entirely may not be possible, but the amount needed can be significantly reduced.
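    The core idea — learning a template from a few sample files and treating everything outside it as data — can be sketched without a PDF parser. In this hypothetical sketch, each page is already reduced to positioned text items (as a PDF text extractor would produce); text that is identical at the same position across all samples is treated as fixed boilerplate, and the remaining items are the data to extract. The page representation and example values are illustrative, not the paper's actual model.

```python
# Illustrative sketch (not the paper's implementation): pages from the same
# MIS report function are modelled as lists of (x, y, text) items.
def build_model(sample_pages):
    """Positions whose text is constant across all samples are boilerplate."""
    fixed = set()
    first = {(x, y): t for x, y, t in sample_pages[0]}
    for (x, y), text in first.items():
        if all({(px, py): pt for px, py, pt in page}.get((x, y)) == text
               for page in sample_pages[1:]):
            fixed.add((x, y))
    return fixed

def extract(page, fixed):
    """Return only the items at non-template positions, i.e. the data."""
    return {(x, y): t for x, y, t in page if (x, y) not in fixed}

samples = [
    [(0, 0, "Monthly Sales Report"), (0, 10, "Total:"), (20, 10, "1,250")],
    [(0, 0, "Monthly Sales Report"), (0, 10, "Total:"), (20, 10, "3,480")],
]
model = build_model(samples)
new_page = [(0, 0, "Monthly Sales Report"), (0, 10, "Total:"), (20, 10, "2,010")]
print(extract(new_page, model))  # only the varying value remains
```

    The cases the abstract mentions, where accuracy drops below 100%, correspond here to a data field that happens to hold the same value in every sample file: it would be misclassified as boilerplate until a human flags it or more samples are added.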