Document similarity of 2 Excel spreadsheets containing text

ekotas · August 2018

Hello,

I've posted before about text mining, but I think my question was too vague and didn't explain very well what I wanted to do. So I've gone away, watched a lot of tutorials and tried again.

So what I've done is read in 2 Excel spreadsheets: one containing relevant text, keywords, etc., and one containing 504 references exported from medical databases. Both spreadsheets contain a title and an abstract, and in the second spreadsheet each of the 504 references is on its own row. The aim is to compare the 2 spreadsheets and find the references that are most relevant to the text in the first one.

OK, so I've played around with this a lot and got a few things to work (Euclidean distance, cosine similarity, etc.), but it's not quite doing what I wanted... I want it to re-order the 504 references by how similar they are to the relevant text in the first Excel spreadsheet, ideally so that the most relevant references are first in the list and the least relevant are at the bottom... if that makes sense.

Also, just to clarify, I am no data scientist, so I don't actually know what the results mean when I run cosine similarity, Euclidean distance and so on. All I know is I got it to work without any errors, which at the moment is a pretty good achievement for me.

Anyway, I've gone off topic. Can anyone help with what I'm aiming to do with the ranking of the documents? Also, do you need to see what I've done so far?

Thank you so much
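
For anyone reading along, the ranking described above can be sketched outside RapidMiner in a few lines of plain Python. This is only a minimal illustration, not the process from this thread: it assumes the relevant text has been joined into one string and each reference (title plus abstract) into another, weights words with a smoothed TF-IDF, and sorts the references by cosine similarity to the relevant text. The example texts and all function names are made up for the sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed IDF, so a term found in every document keeps a small weight.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either one is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_references(query_text, references):
    """Return (score, original_index, reference) tuples, most similar first."""
    vectors = tfidf_vectors([query_text] + references)
    query_vec, ref_vecs = vectors[0], vectors[1:]
    scored = [
        (cosine(query_vec, vec), index, ref)
        for index, (vec, ref) in enumerate(zip(ref_vecs, references))
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical stand-ins for the two spreadsheets.
query = "statin therapy for cardiovascular disease prevention"
references = [
    "Knee replacement surgery outcomes in elderly patients",
    "Statin therapy and prevention of cardiovascular disease in adults",
    "Cardiovascular risk factors in patients with diabetes",
]

for score, index, ref in rank_references(query, references):
    print(f"{score:.3f}  row {index}: {ref}")
```

The sort is the step that turns a column of similarity values into the "most relevant first" list: each reference keeps its original row index, so the ranked output can be matched back to the spreadsheet.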

Pavithra_Rao · August 2018

Hi @ekotas,

Here is a good community post about the basics of text mining. It also refers to a sample process within RapidMiner Studio >> Community samples that shows how to see the similarity to each row. You could easily translate this into your use case.

https://community.rapidminer.com/t5/RapidMiner-Text-Analytics-Web/Term-Frequencies-and-TF-IDF-How-are-these-calculated/ta-p/46333

Hope this helps.

Cheers

ekotas · August 2018

Lovely stuff, looks very relevant. I'll give it a go, thank you!

Knut-RM · August 2018

Can you tell us which tutorials you watched and which were helpful? Where did you find them, and what were you missing? Background: we are always working on new stuff, so we are interested in what people are having issues with...

Cheers, K.

ekotas · August 2018

Hi,

I watched a series of 5 tutorials on YouTube, starting with this one: https://www.youtube.com/watch?v=hpvda_Rfg3s. They were really helpful. I also got some information from this forum and from other tutorials on YouTube, but I didn't find them as helpful as the series of 5. What I was missing was really the thing I wanted to do, which was to rank the references. Also, because I don't know much about cosine similarity and so on, I could have used some help with what the numbers in the output meant.
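
On what the numbers mean, a small Python sketch may help. It is a generic illustration with made-up word counts, not output from RapidMiner. For word-count or TF-IDF vectors (which are never negative), cosine similarity runs from 0 (no words in common) to 1 (same mix of words), so higher means more similar. Euclidean distance works the other way round: 0 means identical vectors and larger means more different, and it also grows when one document is simply longer than the other.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if one is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Word counts over a hypothetical vocabulary ["heart", "statin", "knee"].
short_doc = [1, 1, 0]
long_doc = [2, 2, 0]   # same wording as short_doc, just twice as long
other_doc = [0, 0, 1]  # shares no words with the other two

for name, doc in [("long_doc", long_doc), ("other_doc", other_doc)]:
    print(
        name,
        "cosine:", round(cosine_similarity(short_doc, doc), 3),
        "euclidean:", round(euclidean_distance(short_doc, doc), 3),
    )
```

In this toy case the longer copy of the same text gets a cosine similarity of about 1 but a Euclidean distance well above 0, which is one reason cosine similarity is often preferred when documents differ in length.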

Knut-RM · August 2018

Great - thanks for the feedback!

ekotas · August 2018

Hello,

I'm back with some more problems... I'm trying to classify my Excel spreadsheet to see if this shows anything interesting in the data... it might not.

Anyway, it's all set up, but I'm getting an error: "Attributes do not match. The input ExampleSet does not match the training ExampleSet. Missing attribute: 'Abstract = ?' "

I have tried to work it out via the forums etc., but I can't. Can anyone help please?

Thanks!
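
The message itself says that the data given to the model is missing a column the model was trained on. What causes that in this particular process can't be read off the message alone, but the check behind it can be sketched in Python. This is a generic illustration, not RapidMiner code: apart from 'Abstract = ?', which is copied from the error, the column names and both helper functions are hypothetical, and filling a missing column with 0 is an assumption that only makes sense for count-like or indicator columns.

```python
def compare_attributes(training_attributes, input_attributes):
    """Return (missing, extra): columns the model expects but the input lacks, and vice versa."""
    training, incoming = set(training_attributes), set(input_attributes)
    return sorted(training - incoming), sorted(incoming - training)


def align_to_training(row, training_attributes, fill_value=0.0):
    """Rebuild one row with exactly the training columns, in training order.

    Columns absent from the row get fill_value (an assumption, see above);
    columns the model never saw are dropped.
    """
    return {name: row.get(name, fill_value) for name in training_attributes}


# Hypothetical column lists for the training data and the data being scored.
training_columns = ["Abstract = ?", "word_heart", "word_statin"]
input_columns = ["word_heart", "word_knee", "word_statin"]

missing, extra = compare_attributes(training_columns, input_columns)
print("Missing from input:", missing)
print("Unknown to the model:", extra)

row = {"word_heart": 2.0, "word_knee": 1.0, "word_statin": 1.0}
print(align_to_training(row, training_columns))
```

Listing the missing and extra columns this way is usually the quickest route to seeing where the training and scoring data started to differ.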

Pavithra_Rao · August 2018

Hi @ekotas

Please share your RapidMiner process XML so that we can see in detail what the error is.

Cheers,

ekotas · August 2018

Hello,

I've attached the XML... hope I've done it right!

Thanks

SGolbert · September 2018

Hi @ekotas,

The topic seems interesting. The relevance surely also depends on factors other than word vectors (Author, Impact, Times Referred, etc.), but the text analysis is a good start. I would like to know more about the Excel files, at least which columns they have and what kind of data each one holds.

Best regards

