Negation in clinical note text mining
Hi all,
I'm working with a large set of clinical notes and it seems like the clinicians are trained to spend half their time writing down what is NOT going on with the patient. So, in order to apply many text mining techniques I'm having to learn how to handle negation in context.
I've seen a brief dialog about this topic in which @mschmitz and @SvenVanPoucke discussed the issue https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Include-Negations-in-Dictionary-based-Sentiment-Approach/m-p/44266/highlight/true#M29247
And, I see that Martin added negation with a word window to the Dictionary-Based Sentiment operator in the Operator Toolbox. I think the way it was implemented is very flexible, and I look forward to using it when I focus on sentiment.
Right now, I'm attempting to 'tag' my corpus of documents as "Suicide-Mention" vs. "Suicide-Deny-Mention" as a way to make our document search a little better. It's difficult to write the regex or Lucene queries needed to reliably find suicide-related notes, so I want to preprocess and tag the notes for the clinicians using Python or RapidMiner's more sophisticated toolsets.
There are 2M documents in the corpus, each of which may be as short as 1-2 sentences or as long as several pages. They are typical unstructured text notes, although there are patterns in how the different clinicians discuss suicidality (deny or endorse).
My first pass at the task used regex inside of SQL Server and ran for three days to get through the 2M documents. The quality is being reviewed now, but I don't think it will be acceptable to the clinical director. Recall may not be high enough for field use with this approach.
There are some medical-note negation tools available (NegEx and pyConText), and several papers address the issue. I'm new to RapidMiner and would like to apply RM to this problem, so I thought I'd ask for advice on how folks here might address an issue like this.
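For context, the core NegEx idea (as I understand it) is just to look for a negation trigger within a small window before the target term. A rough sketch in Python of what I mean, with made-up trigger and term lists (the real NegEx lexicon is much larger):

import re

# Abbreviated, hypothetical trigger and term lists; NegEx's published lexicon is far larger.
NEG_TRIGGERS = {"no", "not", "denies", "denied", "without", "never"}
TERMS = {"suicide", "suicidal", "si"}
WINDOW = 5  # how many tokens back from the matched term to scan

def tag_sentence(sentence):
    tokens = re.findall(r"[a-z']+", sentence.lower())
    for i, tok in enumerate(tokens):
        if tok in TERMS:
            scope = tokens[max(0, i - WINDOW):i]
            if any(t in NEG_TRIGGERS for t in scope):
                return "Suicide-Deny-Mention"
            return "Suicide-Mention"
    return None

print(tag_sentence("Patient denies suicidal ideation."))    # Suicide-Deny-Mention
print(tag_sentence("Patient endorses suicidal ideation."))  # Suicide-Mention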
Thanks in advance for your help/advice...Steve
Answers
Indeed, sentences like "probably not a possible tumor" are what you find in the real world. I am very interested in your project. Could you send me a few example texts? Anyway, I would like to help you with this.
Cheers
Sven
Hi Sven,
One of the other important projects on my plate is the de-identification of our medical records, making them easier to share without requiring a Business Associate Agreement (BAA). I will check to see if we can create a sample dataset of suicide-related sentences (de-identified) as a community dataset. About how many sample sentences of each type do you think would be the minimum needed for meaningful analysis in a community dataset?
From the SQL regex project:
['Suicide-Deny-Mention'] => 195449 documents tagged
['Suicide-Mention'] => 28395 documents tagged
['Suicide-Unclear-Mention']=> 4231 documents tagged
I attached the SQL that I used for pattern matching in SQL Server. In it you can see the type of negation patterns I used:
where value like 'never' or value like '%n''t' or value like 'no' or value like 'not' or value like 'den[yi]%'
and the key terms in the source data:
where (AssessmentRaw like '%suicid%' or AssessmentRaw like '%[ /:,;-]si[ /:,;-]%' or AssessmentRaw like '%kill herself%' or AssessmentRaw like '%kill himself%' or AssessmentRaw like '%kill myself%' )
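For anyone following along, here is roughly the same matching translated to Python regexes (my quick translation of the LIKE patterns above, so treat it as a sketch rather than a verified equivalent):

import re

# Key-term patterns, loosely translated from the LIKE clauses above.
KEY_TERMS = re.compile(
    r"suicid|[ /:,;-]si[ /:,;-]|kill (?:her|him|my)self",
    re.IGNORECASE,
)

# Negation cues, loosely translated from the word-level LIKE clauses above.
NEGATION = re.compile(r"\b(?:never|no|not|den[yi]\w*)\b|n't\b", re.IGNORECASE)

def mentions_suicide(text):
    return KEY_TERMS.search(text) is not None

def is_negated(text):
    return NEGATION.search(text) is not None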
...Steve
De-identification.
Cheers
Sven
Hi,
I am definitely not a medical expert like you or Sven, but here is my data science view on it.
I think you can treat it in three different ways:
1. Manual keyword-based filtering. This essentially runs either through Lucene queries or something like Dictionary-Based Sentiment. It requires a list of either queries or keywords; based on that list you calculate a sum as a score. This can be used as a prefilter: afterwards, experts can read through the flagged list and tag false positives, and that can then be used as input for a machine learning algorithm.
2. One might try unsupervised methods on it, e.g. LDA topic modelling. The idea would be to find "suicide topics" (see the sketch after this list).
3. I had a client who used external data on medical terms. The idea would be to not just use the text itself but also, e.g., the Wikipedia entry for a disease. This can "enhance" the text. You can do something like: if the Wikipedia lookup includes "suicide", then score += 0.5. But I think this is the hardest approach of the three.
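To make point 2 concrete, here is a minimal LDA sketch with scikit-learn on a toy corpus (the variable names and parameters are placeholders, and real notes would need proper preprocessing):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in corpus; in practice this would be the de-identified notes.
notes = [
    "patient denies suicidal ideation and denies self harm",
    "patient endorses suicidal ideation with a plan",
    "medication refill requested, no acute concerns",
    "follow up on medication side effects",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(notes)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

# Print the top words per topic and look for a "suicide topic" by eye.
words = vec.get_feature_names_out()
for k, component in enumerate(lda.components_):
    top = [words[i] for i in component.argsort()[-5:][::-1]]
    print(f"topic {k}: {top}")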
Best,
Martin
Dortmund, Germany
Hi Steve,
can you perhaps retrieve the files or a database dump onto your computer? I think you will have performance issues when retrieving records from the DB and processing them with RapidMiner on the fly. The best-case scenario would be if you can load all the data into memory.
This is an important issue, because if you cannot move the data, you will have to bring the analysis to the data (as a procedure on the DB).
Best,
Sebastian
Hi Sven,
I saw that and played with the library a bit this weekend. It would take a bit of work to convert it to English, but it is very doable. There are some other approaches referenced in the literature (MIST and others), with pros and cons to each. I haven't settled on an approach yet, but the DEDUCE approach looks viable.
Whatever technique I use will include proofreading as the last pass, so the dataset will be small, maybe 600 notes. Even then, the risk/reward ratio may prevent us from publishing.
...Steve
Hi Martin,
Thanks for the suggestions; very thought-provoking.
1) Keyword-based filtering - so, do you think I should try the sentiment operator you provide in the toolbox to score for endorse vs. deny? I think we can use a regex filter to collect the documents that mention suicidality. Do you think this would be a better technique than using regex to determine endorse vs. deny?
I'm planning on hiring a temp/intern to follow a protocol and label a few thousand notes so we can then apply various supervised learning techniques to the rest of the corpus. I'm just hoping to give the human a decent set of pre-labeled notes.
2) I've played with LDA on and off over the last month as a way to identify key topics in our notes. The idea I'm pursuing is to collect the key terms in each note and then use D3 to show a network graph for a client, allowing her clinician to visualize how key topics relate to each other (or not) within her medical record. Word clouds are easy and fun to look at but don't convey much useful data; even when we played word clouds over time, it didn't drive enough clinical value. My hypothesis is that a graph (nodes/edges) might be a richer visualization tool. The challenge I haven't figured out (yet) is how to discover edges between nodes (keywords) that might actually have value/meaning to clinicians. I have some ideas for the edges (word2vec similarity scores, for one) but haven't decided yet how to proceed; a rough sketch of the word2vec idea follows at the end of this list.
I don't yet trust my skill with LDA to reliably find 'suicide topics'. Do you think this is a promising approach?
3) Hmm, sounds like fun and I would never have considered an approach like this. Sounds like it's beyond my current skillset ;>)
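Here's the kind of thing I mean for the word2vec edges, as a toy sketch with gensim (the corpus, keyword list, and threshold are all made up):

from itertools import combinations
from gensim.models import Word2Vec

# Toy tokenized "notes", repeated so the model has something to train on.
sentences = [
    ["patient", "denies", "suicidal", "ideation"],
    ["patient", "endorses", "suicidal", "ideation", "with", "plan"],
    ["medication", "refill", "requested"],
] * 50

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, seed=0)

# Propose an edge between two keywords when their cosine similarity
# clears a hand-picked threshold (0.2 here is arbitrary).
keywords = ["suicidal", "ideation", "medication", "plan"]
THRESHOLD = 0.2
edges = [
    (a, b, float(model.wv.similarity(a, b)))
    for a, b in combinations(keywords, 2)
    if model.wv.similarity(a, b) > THRESHOLD
]
print(edges)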
Thanks for your ideas Martin...Steve
Hi Sebastian,
Good point, not sure.
2M medical records come to 11 GB as an Azure Search index and 24 GB as a SQL table. I can create a VM with 192 GB of RAM, so I guess it depends largely on the algorithm we are using.
I'm considering ways of segmenting the data based on diagnosis or level of care, allowing our models to be based on much smaller subsets that are still large enough to have >100k records for training. Of course, suicidality cuts across all diagnoses and care levels. One of the data issues we are facing is how comparatively sparse the data is: <2% of our notes mention people endorsing suicidal ideation. In a typical year we lose maybe 35 people out of the >10k that we serve (and we are focused on driving that number to 0).
...Steve
Hi,
Honestly, if you have an intern doing the labeling, that's the easiest way to do it.
BR,
Martin
Dortmund, Germany
Hi,
I'm guessing that false positives are not much of an issue here. If you can generate a set of representative features, you could turn it into an anomaly detection problem with a rather conservative model (density < 5% are anomalies).
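As a minimal sketch of that framing with scikit-learn's IsolationForest, assuming you already have a feature matrix X for the notes (random data stands in here, and the 5% contamination value is just my threshold above, not a tuned number):

import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in feature matrix; in practice, e.g. tf-idf vectors of the notes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))

# "density < 5% are anomalies" -> contamination=0.05
clf = IsolationForest(contamination=0.05, random_state=0)
labels = clf.fit_predict(X)  # -1 = anomaly, 1 = inlier

print((labels == -1).sum(), "notes flagged for review")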
Having seen your SQL file I can understand why you want to move to RapidMiner!
If there are no disclosure problems (or the data can be anonymized), I think this is a good candidate for a Kaggle dataset (or a competition if you have some funds).
Best regards,
Sebastian
Hi Sebastian,
I agree regarding competition(s). I think a decent-sized mental health clinical dataset would benefit a lot of agencies that can't afford to hire the caliber of people who compete, and it would be fun/challenging for the competitors. I think we could find adequate prize money too.
Helping agencies dedicated to helping others is why I chose to join this org five years ago (a great decision on my part).
My guess is that I won't have an approved anonymized set until early next year, and even then I might only be allowed to share it with select people under an NDA - we'll see, not something we've done before. I'll definitely report back when I know more.
One thing I could use this forum's help with is to 'design' a dataset that maximizes the opportunities for text mining projects. Would you be interested in collaborating on this stage? I can disclose a lot about our data without disclosing PHI in order to optimize what we anonymize. For instance, I'm just starting to draft a protocol for a temp/intern to start reviewing and manually tagging a subset of our data. I could post that to this forum for feedback.
Thoughts?
...Steve
Hi,
Looks like an interesting project.
Sven
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2991497/