Term frequency from Excel file
federica_gatto9
Member Posts: 7 Learner III
Hi everyone,
I have an Excel list of customer reviews and I would like to get the frequency of each word. I tried using Generate TFIDF directly, but it computes the frequency of the whole text in each example instead of the frequency of each word.
I also want to tokenize and remove stopwords, but these operators only accept documents, and I am not sure how to convert the Excel file into documents. With Process Documents from Data I get a word list, which still doesn't work, and with Extract Document I can only select one example, and in the end it still treats the text as a whole.
I hope I explained my problem well!
Best regards,
Federica
Answers
@federica_gatto9 Please use a Read Excel operator to load in the data, then Select Attributes to select the column with the text, then Nominal to Text operator to convert it to Text that the Process Documents from Data operator can read. Then output the EXA port on the Process Documents from Data operator.
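For readers who want to see what this pipeline computes, here is a rough Python sketch of the same idea outside RapidMiner: treat each review as its own document, tokenize it, drop stopwords, and count the remaining words. The `reviews` list, the `STOPWORDS` set, and the helper names are all made up for illustration; in practice the text column would be loaded from the Excel file first (for example with `pandas.read_excel`), and the stopword list would be much longer.

```python
import re
from collections import Counter

# Hypothetical stand-in for the review column of the Excel sheet.
reviews = [
    "The delivery was fast and the product is great",
    "Great product, but the delivery was slow",
    "Slow delivery and poor support",
]

# Tiny illustrative stopword list (an assumption, not RapidMiner's list).
STOPWORDS = {"the", "and", "was", "is", "but", "a", "an", "of"}


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter."""
    return [token for token in re.split(r"[^a-z]+", text.lower()) if token]


def term_counts(text):
    """Count the non-stopword tokens of a single review."""
    return Counter(t for t in tokenize(text) if t not in STOPWORDS)


# One Counter per review (per document), then the totals over all reviews.
per_review = [term_counts(review) for review in reviews]
total = sum(per_review, Counter())

for word, count in total.most_common():
    print(word, count)
```

The point to notice is that tokenizing and stopword filtering happen per review, before anything is counted, which is why those steps have to be applied to each document rather than to the finished table.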
Hi,
The attribute I want to analyze is already set as text. I solved the problem: I had to put Tokenize and the stopword filter inside Process Documents from Data, not after it. Another question: how are the results to be interpreted? For example, if a word has Min: 0 and Max: 0.864, what does 0.864 mean?
Thank you!
Federica
@federica_gatto9 I don't know what you're doing, so I can't help you interpret the output. It would be best to post a screenshot at the very least. Normally we'd ask you to post your XML process and some sample data.
Attached you can find a picture of the process and one of the results in the statistics window. The numbers (min, max, average) are what I cannot interpret. I hope the pictures help.
Best regards,
Federica
It would be much better to post the actual process and data, since very little can be learned from the pictures. For example, you have an operator labeled "Generate TF-IDF" but I have no idea what it is or what it is doing, since generating the TF-IDF vector is automatically part of the output (if selected) from Process Documents.
But in general, the values you are seeing should be the values for the word vector calculations, presumably based on the TF-IDF method. You can read about it here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf
It is an adjusted frequency value and is always between 0 and 1. Generally, a higher value means that the specific document is more relevant for that term, and a lower value means it is less relevant; a zero value means that the document does not contain that term at all.
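To make the numbers more concrete, here is a minimal Python sketch of one common TF-IDF variant: term frequency is the count divided by the document length, inverse document frequency is log(N / df), and each document vector is then scaled to unit Euclidean length, which is what keeps every value between 0 and 1. Whether this matches the exact formula RapidMiner uses is an assumption, and the `docs` data and function name are invented for the example.

```python
import math
from collections import Counter

# Hypothetical tokenized reviews (stopwords already removed).
docs = [
    ["delivery", "fast", "product", "great"],
    ["great", "product", "delivery", "slow"],
    ["slow", "delivery", "poor", "support"],
]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, scaled to unit length."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Raw weight: relative term frequency times log(N / df).
        raw = {
            t: (counts[t] / len(doc)) * math.log(n_docs / df[t])
            for t in vocab
        }
        # Scale the vector to Euclidean length 1 (skip all-zero vectors).
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({t: (v / norm if norm else 0.0) for t, v in raw.items()})
    return vectors


vectors = tfidf_vectors(docs)
for term, weight in vectors[0].items():
    print(f"{term}: {weight:.3f}")
```

Read this way, a per-word maximum in the statistics view is the largest weight that word reaches in any single document. Note that in this particular variant a word that occurs in every document (here "delivery") also gets a weight of 0, because log(N / df) is 0 for it.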
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts