Labelling text
Each note was read manually by a trained staff member, who classified the patient as requiring or not requiring support in each of two categories - physical and social wellbeing. These labels are used to train and then assess the NLP algorithm.
Processing text
Free-text notes need to be converted into a format that is interpretable by an algorithm - in our case, a vectorised numerical format. Two principal techniques are applied, using the NLTK toolkit in Python. See Figure 3.
- Lemmatisation: Words are reduced to a common base form, called a lemma. For example, "survive", "survival" and "survivorship" are grouped to the lemma "survive"; "good" and "better" group to "good". English stop words such as "the" and "and" are excluded.
- Vectorisation: Lemmas are weighted according to how important they are in the document. If "depression" is mentioned many times in the same note, it is given more weight (term frequency). Separately, if "depression" is mentioned across many other notes, its weight is reduced compared to a term mentioned in only a few notes, where its weight would be higher (inverse document frequency). This is the term frequency-inverse document frequency (TF-IDF) method.[2]
Natural language processing algorithm
An NLP algorithm is trained to interpret the processed notes and output which category the patient falls under - requiring or not requiring more support in each of two categories, physical and social health. We used a linear support vector machine (SVM), which excels at high-dimensional classification, implemented with scikit-learn in Python. An SVM is an example of supervised machine learning. See Figure 4.
The notes were split into a training and testing set in a 3:1 ratio. The model was trained on the training set, learning from the manual labels. It was then blinded to the labels of the testing set and generated predicted labels, which were compared with the manual labels to evaluate performance.
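The train/test procedure described above can be sketched with scikit-learn's `train_test_split` and `LinearSVC`. The features and labels here are synthetic stand-ins for the TF-IDF vectors and manual labels; only the 3:1 split and the fit/predict/compare workflow mirror the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic stand-ins: rows play the role of TF-IDF note vectors,
# y plays the role of the manual labels (1 = requires more support)
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = (X[:, 0] > 0.5).astype(int)

# 3:1 train/test split, i.e. test_size=0.25
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LinearSVC()             # linear support vector machine
clf.fit(X_train, y_train)     # learn from the labelled training set
y_pred = clf.predict(X_test)  # predict on the blinded testing set

# Compare predictions with the held-out labels
accuracy = (y_pred == y_test).mean()
```

The same fit/predict pattern applies once `X` is replaced by real TF-IDF vectors and `y` by the staff-assigned labels.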
Model evaluation
Of 365 patients (average age 51, 71% female), many required additional physical (54%) or psychosocial (33%) support. See Figure 5. Common physical symptoms causing distress included pain, nausea and fatigue; common social concerns pertained to inadequate family support, physical isolation, diet, exercise, sleep and mental health.
The NLP algorithm had a sensitivity of 90% and specificity of 50% across both categories for detecting patients requiring more care (F-score 0.68) in the test set. This represents a low false negative rate, so it performs well as a screen for patients who require more care. However, it is non-specific, so it cannot reliably be used to pinpoint increased patient needs. Nevertheless, it represents a promising pilot of NLP applied to clinical text notes, using freely available digital tools.
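For readers unfamiliar with these metrics, the sketch below shows how sensitivity, specificity and F-score are derived from a confusion matrix using scikit-learn. The label counts are invented for illustration; they are not the study's results.

```python
from sklearn.metrics import confusion_matrix, f1_score

# Invented labels: 1 = requires more care, 0 = does not
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 1, 1, 0]

# Unpack true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # true positive rate: low value = missed cases
specificity = tn / (tn + fp)  # true negative rate: low value = false alarms
f_score = f1_score(y_true, y_pred)  # harmonic mean of precision and recall
```

A high sensitivity with low specificity, as reported above, is the profile of a screening tool: it rarely misses a patient in need, at the cost of flagging some who are not.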
NLP in clinical medicine
NLP is still in its infancy in its application to clinical medicine, but the potential insights from mountains of underutilised clinical text data make the prospects enticing. Many optimisations are possible - for example, more sophisticated models such as neural word embeddings and transformer architectures (e.g. GloVe, ELMo or BERT), or additional text-processing techniques such as boundary detection, segmentation, semantic analysis and medicine-specific lemmatisation libraries (e.g. BioLemmatizer).[3]
There are limitations to NLP. Lemmatisation errors (e.g. "breathlessness" -> "breath") can occur, and text errors, inconsistent abbreviations and differing author styles reduce accuracy. While our performance was sound, it was aided by the homogeneity of our note structure, which may not hold for text from other clinical contexts. Maintaining patient privacy and confidentiality is always a crucial consideration. Model predictions are generally less accurate on new datasets than on the data they were trained on. Labelling data is also very laborious - however, there are unsupervised techniques that do not require labelling of all clinical data and demonstrate good performance.[4]