Term frequency from Excel file

federica_gatto9 · June 2018

Hi everyone,

I have an Excel list with customer reviews and I would like to get the frequency of each word. I tried using Generate TFIDF directly, but it computes the frequency of the whole text of each example instead of the frequency of each word.

Since I also want to tokenize and remove stopwords, and these operators only support documents, I am not sure how to convert the Excel file into documents. With Process Documents from Data I get a word list, which still doesn't work, and with Extract Document I can only select one example, and in the end it still treats the text as a whole.

I hope I have explained my problem clearly!

Best regards,

Federica

Thomas_Ott · June 2018

@federica_gatto9 Please use a Read Excel operator to load the data, then Select Attributes to select the column with the text, then a Nominal to Text operator to convert it to the Text type that the Process Documents from Data operator can read. Then take the output from the EXA port of the Process Documents from Data operator.
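For readers who want to see what that chain of operators does conceptually, here is a minimal Python sketch, not RapidMiner code. It assumes the review column has already been loaded into a plain list of strings; the `reviews` list, the tiny `STOPWORDS` set, and the helper names are all hypothetical and only there for illustration. Each review is tokenized, stopwords are dropped, and the remaining words are counted per review and overall.

```python
import re
from collections import Counter

# Hypothetical stand-in for the text column loaded from the Excel file.
reviews = [
    "The delivery was fast and the product is great",
    "Great product, but the delivery was slow",
]

# Tiny illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"the", "was", "and", "is", "but", "a", "an"}

def tokenize(text):
    """Lowercase the text and split it on runs of non-letter characters."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def term_counts(text):
    """Count how often each non-stopword token occurs in one document."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)

# One Counter per review (one row per example), plus the overall totals.
per_review = [term_counts(r) for r in reviews]
overall = sum(per_review, Counter())

print(overall.most_common())
```

The point of the sketch is that tokenizing and stopword filtering happen per document before any counting, which is why those steps belong inside the document-processing step rather than after it.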

federica_gatto9 · June 2018

Hi,

The attribute I want to analyze is already set as text. I solved the problem: I had to put the tokenize and stopwords operators inside Process Documents from Data and not after it. Another question: how should the results be interpreted? For example, if a word has Min: 0 and Max: 0.864, what does 0.864 mean?

Thank you!

Federica

Thomas_Ott · June 2018

@federica_gatto9 I don't know what you're doing, so I can't help you interpret the output. It would be best to post a screenshot at the very least. Normally we'd ask you to post your XML process and some sample data.

federica_gatto9 · June 2018

Attached you can find a picture of the process and one of the results in the statistics window. The numbers (min, max, average) are what I cannot interpret. I hope the pictures help.

Best regards,

Federica

Telcontar120 · June 2018

It would be much better to post the actual process and data, since very little can be learned from the pictures. For example, you have an operator labeled "Generate TF-IDF" but I have no idea what it is or what it is doing, since generating the TF-IDF vector is automatically part of the output (if selected) from Process Documents.

But in general, the values you are seeing should be the values for the word vector calculations, presumably based on the TF-IDF method. You can read about it here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf

It is an adjusted frequency value and is always between 0 and 1. Generally, a higher value means that specific document is more relevant for that term, a lower value means it is less relevant, and a zero value means that document does not contain that term at all.
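To make the interpretation concrete, here is a small Python sketch of one common TF-IDF variant: relative term frequency times the logarithm of the inverse document frequency, with each document vector scaled to unit length. This is an assumption for illustration only; the thread does not confirm the exact formula RapidMiner uses, and the toy `docs` data and the `tfidf_vectors` helper are hypothetical. With this normalization every weight falls between 0 and 1, and the Min and Max shown for a word are simply the smallest and largest weight that word receives in any single document.

```python
import math
from collections import Counter

# Hypothetical, already tokenized and stopword-filtered reviews.
docs = [
    ["great", "product", "fast", "delivery"],
    ["great", "product", "slow", "delivery"],
    ["terrible", "support"],
]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Relative term frequency times log inverse document frequency.
        raw = {t: (counts[t] / len(doc)) * math.log(n / df[t]) for t in vocab}
        # Euclidean length of the raw vector, used for normalization.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: (v / norm if norm else 0.0) for t, v in raw.items()})
    return vectors

vectors = tfidf_vectors(docs)

# Per-word statistics across all documents, like the statistics view.
fast_column = [v["fast"] for v in vectors]
print("fast: min =", min(fast_column), "max =", max(fast_column))
```

In this toy data "fast" appears in only one review, so its minimum is 0 (the reviews that lack the word), and its maximum is the weight in the one review that contains it. A word that occurs in most reviews, such as "great" here, gets a lower weight in the same review because it is less distinctive.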
