Document similarity of 2 Excel spreadsheets containing text
Hello,
I've posted before about text mining, but I think my question was too vague and didn't explain what I wanted to do very well. So I've gone away, watched some (a lot of!) tutorials and tried again.
So what I've done is read in 2 Excel spreadsheets: 1 containing relevant text, keywords etc. and 1 containing 504 references exported from medical databases. Both spreadsheets contain a title and an abstract, and in the spreadsheet of 504 references each reference is on its own row. The aim is to compare the 2 spreadsheets to find the references most relevant to the text in the 1st spreadsheet.
OK, so I've played around with this a lot and got a few things to work (Euclidean distance, cosine similarity etc.), but it's not quite doing what I wanted it to... I want it to re-order the 504 references according to how similar they are to the relevant text in the first Excel spreadsheet, ideally so that the most relevant references are at the top of the list and the least relevant are at the bottom... if that makes sense.
Also, just to clarify, I am no data scientist, so I don't actually know what the results mean when I run cosine similarity, Euclidean distance and so on. All I know is that I got it to work without any errors, which at the moment is a pretty good achievement for me.
Anyway, I've gone off topic. Can anyone help with what I'm aiming to do with the ranking of the documents?? Also, I don't know if you need to see what I've done so far?
Thank you so much
Answers
Hi @ekotas,
Here is a good community post about the basics of text mining. It also refers to a sample process within RapidMiner Studio >> Community Samples on how to see the similarity to each row. You could easily translate this into your use case.
https://community.rapidminer.com/t5/RapidMiner-Text-Analytics-Web/Term-Frequencies-and-TF-IDF-How-are-these-calculated/ta-p/46333
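If it helps to see the ranking idea outside RapidMiner, here is a minimal sketch in plain Python. It is only an illustration under assumptions: the toy texts and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `rank_references`) are made up for this example, and the weighting is a simple smoothed TF-IDF variant, not necessarily the exact formula RapidMiner uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document (TF times smoothed IDF)."""
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_references(relevant_text, references):
    """Return (score, reference) pairs, most similar reference first."""
    vectors = tfidf_vectors([relevant_text] + references)
    query, ref_vectors = vectors[0], vectors[1:]
    scored = [(cosine(query, vec), ref)
              for vec, ref in zip(ref_vectors, references)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy data standing in for the two spreadsheets.
relevant_text = "statin therapy for cholesterol reduction in adults"
references = [
    "Knee surgery outcomes among young athletes",
    "Statin therapy and cholesterol reduction: a randomised trial in adults",
    "Dietary cholesterol and heart disease risk",
]

ranking = rank_references(relevant_text, references)
for score, ref in ranking:
    print(f"{score:.3f}  {ref}")
```

On reading the numbers: with non-negative TF-IDF weights, cosine similarity lies between 0 and 1, and a value closer to 1 means the two texts share more of their weighted vocabulary. Euclidean distance runs the other way, so a smaller distance means more similar. For the ranking you describe, sorting by cosine similarity in descending order puts the most relevant references first.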
Hope this helps.
Cheers
Lovely stuff, looks very relevant. I'll give it a go, thank you!
Can you tell us which tutorials you watched and which were helpful? Where did you find them, and what were you missing? Background: we are always working on new stuff, so we are interested in what people are having issues with...
Cheers, K.
Hi,
I watched a series of 5 tutorials on YouTube, starting with this one: https://www.youtube.com/watch?v=hpvda_Rfg3s. They were really helpful. I also got some information from this forum and from other tutorials on YouTube, but I didn't find them as helpful as the series of 5. What I was missing was really the thing I wanted to do, which was to rank the references. Also, because I don't know much about cosine similarity and so on, I could have used some help with what the numbers in the output meant.
Great - thanks for the feedback!
hello,
I'm back with some more problems... I'm trying to classify my Excel spreadsheet to see if this shows anything interesting in the data... it might not.
Anyway, it's all set up but I'm getting an error... "Attributes do not match. The input ExampleSet does not match the training ExampleSet. Missing attribute: 'Abstract = ?' "
I have tried to work it out via the forums etc. but I can't figure it out. Can anyone help, please?
Thanks!
Hi @ekotas
Please share your RapidMiner process XML so that we can see what the error is in detail.
Cheers,
Hello,
I've attached the XML... hope I've done it right!
thanks
Hi @ekotas,
The topic seems interesting. Relevance surely also depends on factors other than word vectors (Author, Impact, Times Referred, etc.), but the text analysis is a good start. I would like to know more about the Excel files, at least which columns they have and what kind of data each contains.
Best,
Regards