Cracking the challenge of unstructured medical text
Updated: Jun 3
When I was a little kid, I loved to take my dad's books and act as if I were reading them. My dad always knew I was only pretending. How? Because I would move my finger line by line, all the way to the bottom of the page, without realizing that the bottom contained only comments and footnotes - and as a "devoted" reader, I read those with my finger too.
Natural Language Processing (NLP) is a technology built to help computers understand human language. Many advances have been made in recent years as artificial intelligence research has intersected with NLP. Today, NLP is used for a wide range of applications like translation, voice assistants, document classification, and more.
At DigitalOwl, we have harnessed NLP's capabilities for the world of medical insurance, helping underwriters and claim analysts assess applicants' and insureds' medical records. With advanced algorithms, we can identify all the meaningful information in medical documents (medical conditions, dates, body parts, treatments, outcomes, etc.). Just as important, we can extract pertinent non-medical phrases that are critical to understanding the full context of the subject's medical history specifically for insurance purposes (return to work, ADLs, restrictions and limitations, etc.).
As pioneers in applying NLP to the insurance industry, we face many unique challenges that arise from combining NLP with medical information, such as the variety of writing styles across physicians and the sheer amount of information in each case.
Today, I want to focus on the fascinating solution we developed for understanding the context of words in a medical document: Analyzing the position of words in the document.
The meaning of the position of words in a sentence:
The order of words in a sentence matters. Different orders of the same words generate different meanings. The set of words I / Like / Do / Not / Why / Trips can have a positive or a negative meaning, depending on how you order them:
"Why do I not like trips?" -Vs.- "I do like trips, why not?"
Imagine that you come home after a long day at work, and your partner says "You seem to have gone through a hard day, you deserve a long rest," but the words get mixed up, and instead you hear "You seem to have gone through a long day of rest, you deserve hard work."
In virtually all NLP tasks, the text is analyzed as a sequence: every word gets a number. Just like me in the little story at the beginning of this article, the computer goes through the text line by line without considering the page structure at all.
The meaning of the position of words on the page:
To understand the text, mainstream NLP models index each word using a simple sequence. For example, the top left word is “1”, the next word to the right is “2”, and so forth, line by line.
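To make this concrete, here is a minimal Python sketch of that sequential indexing. The function name and the sample page are my own, for illustration only:

```python
# A minimal sketch of sequential word indexing: words are numbered
# left to right, line by line, with no awareness of page layout.

def index_words(lines):
    """Assign each word a single sequence number, reading line by line."""
    indexed = []
    position = 1
    for line in lines:
        for word in line.split():
            indexed.append((position, word))
            position += 1
    return indexed

# An illustrative two-line "page":
page = [
    "Existing conditions Non-existent conditions",
    "Hand fracture Anamnesis",
]
for pos, word in index_words(page):
    print(pos, word)
```

Notice that the two columns are flattened into one stream: "Anamnesis" simply becomes word number 7, with no trace of which list it came from.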
But this isn’t good enough. As humans, when we read a document, we not only scan the text from left to right, but our brain also directs us to "strategic" places on the page, searching for familiar patterns.
For example, in medical records, the date in one of the top corners of the page is usually the visit date (even when the text contains several other dates), and the name in the top right corner is often the hospital name.
That's why we've developed a unique model, which is aware of the locations of the words on the page. Let's say you have a page with two lists of medical findings:
As we mentioned, one way to process the words is to index them by sequence from left to right.
With this processing method, the model gets this input:
And in this way, how can the model possibly know whether Anamnesis (12) refers to an existing or a non-existent condition?
Our solution is to feed the model all of the information:
In this way, every word gets coordinates in space. The word “Hand” gets the coordinate (20, 14), and “Anamnesis” gets (28, 57). The model now receives the full structure of the page and can easily tell that Anamnesis is a non-existent condition.
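The idea can be sketched in a few lines of Python. The `PositionedWord` name, the interpretation of the second coordinate as the vertical position, and the boundary used to separate the two lists are assumptions for this example, not our actual model:

```python
# Illustrative sketch: each word carries page coordinates, so a model
# (or here, a simple rule) can tell which list a word belongs to.
# Names, coordinates, and the y-boundary are assumptions for this example.

from dataclasses import dataclass

@dataclass
class PositionedWord:
    text: str
    x: int  # horizontal position on the page
    y: int  # vertical position on the page

def section_of(word: PositionedWord, boundary_y: int = 35) -> str:
    """Assign a word to one of the two lists by its vertical position."""
    return "existing" if word.y < boundary_y else "non-existent"

words = [PositionedWord("Hand", 20, 14), PositionedWord("Anamnesis", 28, 57)]
for w in words:
    print(w.text, "->", section_of(w))
```

In the real model the coordinates are an input feature rather than a hard-coded rule, but the sketch shows why the (x, y) pair carries information that a flat sequence number throws away.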
Sometimes it is not just the context between words that is location-dependent, but also the role of each word. A page's text may mention many dates, but each page has only one printed visit date. This date is often written in the top right corner (as you can see in the following image).
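As a toy illustration of that heuristic, the snippet below picks the date-like token closest to the top-right corner of the page. The regex, the coordinate scale, and the function name are assumptions for this example, not our production logic:

```python
# Hypothetical sketch of the "visit date is printed near the top right"
# heuristic: among all date-like tokens, pick the one nearest that corner.
# The regex, coordinate scale, and scoring are illustrative assumptions.

import re

DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")

def find_visit_date(words, page_width=100):
    """words: list of (text, x, y) tuples; smaller y means higher on the page.

    Returns the date text nearest the top-right corner, or None.
    """
    candidates = [(x, y, text) for text, x, y in words if DATE_RE.fullmatch(text)]
    if not candidates:
        return None
    # Score = distance from the right edge plus distance from the top.
    best = min(candidates, key=lambda c: (page_width - c[0]) + c[1])
    return best[2]
```

A learned model would weigh location as one signal among many, but even this crude rule shows how knowing where a date sits on the page disambiguates its role.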
These capabilities make our NLP model more precise and faster.
Left Picture - The focus on finding the visit date.
Right Picture - The focus on finding medical conditions.
Of course, none of this makes the model rely on location alone, but it certainly helps assign a better meaning to each word.
If I asked you to find the doctor's name in a document, would you start from the top left corner and go line by line? Probably not. So why should the NLP model work this way?