Presentation Authors: Anobel Odisho*, San Francisco, CA, Briton Park, Nicholas Altieri, William Murdoch, Berkeley, CA, Peter Carroll, Matthew Cooperberg, San Francisco, CA, Bin Yu, Berkeley, CA
Introduction: Detailed pathologic information for the estimated 1.74 million Americans diagnosed with cancer every year is locked away as unstructured free text, unavailable for use without manual abstraction. Our objective was to optimize NLP algorithms to extract detailed pathologic data from cancer pathology reports. Building on current pipelines, we developed feature-specific optimizations for information extraction from prostate cancer pathology reports and evaluated whether high-quality information extraction was possible with minimal training data.
Methods: We used a corpus of 3,232 free-text pathology reports from radical prostatectomy specimens at UCSF, each with detailed manual annotations for 20 data elements, such as Gleason scores, margin status, extracapsular extension, seminal vesicle invasion, tumor volume, and numbers of lymph nodes (positive and dissected). The full corpus was divided into training, validation, development test, and true test sets containing 65%, 15%, 10%, and 10% of the reports, respectively. We then built an NLP pipeline using NLTK and compared multiple machine learning methods using scikit-learn and PyTorch. We applied random forests, support vector machines, boosting, logistic regression, and convolutional networks to the full training dataset as well as to randomly selected subsets of 8, 16, 32, 64, and 128 reports.
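The split and the small training subsets described in the Methods can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say how reports were assigned to sets, so the shuffle, the fixed seed, the rounding rule, and the function names `split_corpus` and `training_subsets` are all assumptions. The resulting set sizes therefore need not match the training size reported in the Results.

```python
import random

def split_corpus(report_ids, fractions=(0.65, 0.15, 0.10, 0.10), seed=0):
    """Shuffle report IDs and cut them into four consecutive, disjoint splits.

    Hypothetical helper: the shuffle, seed, and rounding are assumptions.
    """
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)
    names = ("train", "validation", "dev_test", "test")
    splits, start = {}, 0
    for i, (name, frac) in enumerate(zip(names, fractions)):
        # The last split takes the remainder so every report is assigned exactly once.
        end = len(ids) if i == len(names) - 1 else start + round(frac * len(ids))
        splits[name] = ids[start:end]
        start = end
    return splits

def training_subsets(train_ids, sizes=(8, 16, 32, 64, 128), seed=0):
    """Draw one random subset of the training reports per size.

    Assumption: subsets are drawn independently rather than nested.
    """
    rng = random.Random(seed)
    return {n: rng.sample(train_ids, n) for n in sizes}

# Integer stand-ins for the 3,232 report identifiers.
splits = split_corpus(range(3232))
subsets = training_subsets(splits["train"])
```

Each model would then be fit once on `splits["train"]` and once on each entry of `subsets`, with the development test set held fixed for comparison.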
Results: We calculated the F1 score weighted by support (the number of true instances for each label) for each data field using the development test set. With the full training corpus (n = 2,066), convolutional networks performed best (mean weighted F1 0.968 across all 12 clinical data elements). However, when less annotated training data was available, boosting typically performed best. Moreover, with only 32 labeled reports we achieved a mean weighted F1 score of 0.91 across fields.
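To make the metric concrete, the support-weighted F1 for a single data field can be sketched in plain Python as below. The function name `weighted_f1` and the example margin-status labels are hypothetical and chosen only for illustration. For standard inputs this should agree with scikit-learn's `f1_score(y_true, y_pred, average="weighted")`, which is presumably what a scikit-learn pipeline would call.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-label F1 averaged with weights equal to each label's support."""
    support = Counter(y_true)  # number of true instances per label
    total = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += n_true * f1  # weight each label's F1 by its support
    return total / len(y_true)

# Made-up labels for one field (margin status) on four reports.
y_true = ["negative", "negative", "negative", "positive"]
y_pred = ["negative", "negative", "positive", "positive"]
score = weighted_f1(y_true, y_pred)
```

In this toy case the "negative" label has F1 0.8 with support 3 and the "positive" label has F1 2/3 with support 1, so the weighted score is 23/30, or about 0.767. The mean weighted F1 quoted above would then be the average of such per-field scores across the data elements.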
Conclusions: An NLP pipeline using both traditional statistical and machine learning-based methods can extract detailed prostate cancer pathology data from unstructured free-text pathology reports with high accuracy, even with small sets of training data.