TF-IDF and Aspect Grouping with RapidMiner
HeikoeWin786
Dear all,
I am new to RapidMiner and I have a few questions for which I am really seeking your kind support.
I have an airline dataset labelled with sentiment (pos, neg, and neutral).
I divided the dataset with a 75/25 split and performed the text processing (i.e. Nominal to Text, Data to Documents, Process Documents with tokenization and stopword removal).
Q1: However, in the word list output of the Process Documents operator, the neg, pos, and neutral columns all contain zero values. Is this normal or am I missing something?
Q2: I want to perform aspect categorization, i.e. I have 5 topics as aspect groups (e.g. flight, service, ...), the TF-IDF output contains the highest-frequency words, and I want to group those words under the 5 topics. After that, I will perform Naive Bayes classification to determine the sentiment for each aspect. Is there an efficient way I can do this in RapidMiner?
I am a real beginner in RapidMiner, so I am sorry if I am asking very basic questions, but I hope for your kind support in helping me learn this.
Thanks and regards,
Heikoe
Best Answer
kayman
"In the workflow, the pre-process --> TF-IDF: is that what the Process Documents operator does? The exa output provides the dataset that will be the input for NBC, and the word output from the operator is just for analysing the TF-IDF, correct? So, for NBC, we need to use the exa output of the Process Documents operator."
- Yes, but you can also use this word list as a filter for your unseen (new) data. It is not needed in the case of NBC, though some other models require it; it's always good practice.
"But is the performance matrix (confusion matrix) testing the training model or the test model? Because I see it shows 83% for training but 0% for test."
- You will have to ignore these figures, as they were created by a non-optimal workflow and are therefore not useful. I think in the example I provided the cross-validation gives you both the training and the test accuracy, while the single split only gives the test accuracy. The test accuracy is the most important one anyway, but it is always good to compare training and test accuracy to see whether there are major differences.
Answers
Now the wordlist will return multiple columns, one per label.
Dear Kayman, thanks for your answer. Yes, I do set my label column to the label role using the Set Role operator. I have attached the pre-processing steps I did. However, the output comes out as multiple columns, one for each tokenized word. Also, in the TF-IDF result with the word frequencies, the three columns that correspond to the label (negative, positive and neutral) contain only 0 values and nothing else. Any idea what I could have done wrong? Thanks a lot.
There are a few typical candidates, like the labeling and Nominal to Text, but those appear to be covered indeed.
BTW, you can skip a few parts in your process. There is no need to use the Multiply operators, as the Store operators also pass the content through, so Generate ID -> Store -> next operator. Not a big change, but it reduces the clutter a bit.
You can also use the [process documents from data] operator. This way you do not need to convert the data to documents first; you just feed your data to the example port of the operator. That saves you a few more blocks.
Next, validate your Process Documents operator. Ensure you create your vectors, otherwise your example output will be empty, but also ensure you are not pruning too much. When in doubt, try running without pruning first; it can take longer, but at least you can validate whether data is coming through.
The same goes for the processing parts inside the operator. In case of doubt, try the minimum first, so just tokenizing on spaces and nothing more. It will produce a lot of garbage, but at least you know it is doing something; then you keep improving up to the moment nothing gets through anymore and you have found your troublemaker.
The operator has 2 outputs. One is the example set, and if you use the vector option this becomes your bag of words with the TF-IDF score for each of them. No further processing is needed; this set is used for training as it is. I would suggest some pruning and further tuning in the operator itself to reduce the vector size, but apart from that it should work.
The other one is your word list, and if your labels are set correctly this will split the words by associated label. You used an attribute called confident_score for your label; are you sure this is the right attribute (the one stating positive, negative or neutral)?
I've attached a streamlined version of your process, hope it helps.
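If it helps to picture what the Process Documents operator produces, here is a rough scikit-learn sketch of the same TF-IDF idea; the texts, column names and pruning thresholds are made up for illustration only, not taken from your process.

```python
# Rough scikit-learn equivalent of Process Documents producing TF-IDF vectors.
# The review texts, labels, and pruning thresholds below are made up for
# illustration and are not taken from the actual process.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

df = pd.DataFrame({
    "customer_review": [
        "the flight was delayed and the service was poor",
        "great crew, comfortable seats, excellent service",
        "average flight, nothing special",
    ],
    "confident_score": ["negative", "positive", "neutral"],
})

# min_df / max_df play roughly the role of pruning inside Process Documents:
# drop very rare and very common tokens to keep the vector size manageable.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english",
                             min_df=1, max_df=0.95)
X = vectorizer.fit_transform(df["customer_review"])   # comparable to the exa output
wordlist = vectorizer.get_feature_names_out()          # comparable to the word list output

print(X.shape)        # (number of documents, number of kept tokens)
print(wordlist[:10])  # a few of the tokens that survived pruning
```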
Dear Kayman,
Thanks a lot for highlighting what I was missing. I have read more about what you advised and now I understand it somewhat.
Just one thing I would like to understand is the labeling of the dataset. The original dataset has a "recommended" field where the user gives "yes/no/empty"; I label the comment as positive for "yes", negative for "no", and neutral for "empty". However, my dataset is in Excel format and each comment is in an Excel row. Will this affect the pre-processing? I have attached a snapshot of the dataset for your kind reference.
Could you explain a bit more regarding this?
"The other one is your word list, and if your labels are set correctly this will split the words by associated label. You used an attribute called confident_score for your label; are you sure this is the right attribute (the one stating positive, negative or neutral)?"
Do I need to encode the sentiment score as numbers (positive = 3, neutral = 2, negative = 1) or can I keep it as text?
Really, Kayman, I am so thankful for your kind input here.
Regards,
Heikoe
However, the 6.4 returns 0 records for me and the performance matrix shows 83% for training but 0% for test. I am truly confused about what the mistake is here. Is it the dataset itself, the way I label, or am I missing any key steps?
It is a very nice discussion with you, and I would truly appreciate being able to learn from you so that I can explore further with confidence.
Thanks, Kayman.
The only important things are therefore:
1. Make confident_score a special attribute with the label role.
2. Make customer_review text, as nominals will be treated as metadata and therefore not vectorized by the [process documents] operator.
3. Tune pruning and filters inside the Process Documents operator to reduce the number of low-impact attributes (keywords).
But that's what you have, so it should be ok.
Thanks so much for your prompt reply. I just want to double-check that I understood correctly.
1. Make confident_score a special attribute with the label role --> this is configured in Set Role, with the column confident_score set as label, correct? i.e. target role = label.
2. Make customer_review text, as nominals will be treated as metadata and therefore not vectorized by the [process documents] operator --> this one I am not sure about; where do I have to change that to nominal? It is set to polynominal when I loaded the data. Is there any way I can change that to metadata so that it won't be vectorized?
3. Tune pruning and filters inside the Process Documents operator to reduce the number of low-impact attributes (keywords) --> yes, this one I got!
Thanks once again, Kayman!!
So in its simplest form it is: vector set -> shuffle (not a must, but always smart) -> split between train and validation -> train model (NBC) -> apply to test data -> check performance.
I'd suggest (based on my own experience) using some cross-validation as well, as this typically improves the results quite a bit.
Attached is a high-level framework showing both options (*untested).
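If it helps to see the same idea outside of RapidMiner, here is a minimal scikit-learn sketch of both options; the reviews and labels are made-up placeholders.

```python
# A minimal sketch of both options on top of TF-IDF vectors and Naive Bayes:
# a single shuffled train/test split and a cross-validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

reviews = [
    "flight delayed and service poor", "rude staff and lost luggage",
    "great crew and excellent service", "comfortable seats, smooth flight",
    "average flight, nothing special", "ok experience overall",
] * 5  # repeated so the splits have enough rows
labels = ["negative", "negative", "positive", "positive", "neutral", "neutral"] * 5

X = TfidfVectorizer().fit_transform(reviews)

# Option 1: shuffle, split into train/test, train NBC, apply, check performance.
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=42, shuffle=True)
model = MultinomialNB().fit(X_train, y_train)
print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Option 2: cross-validation, usually a more reliable estimate of performance.
print("cross-validation accuracy:",
      cross_val_score(MultinomialNB(), X, labels, cv=5).mean())
```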
You can also use the Nominal to Text operator and just select customer_review. I think you already did that before, so it should be ok already.
That's well explained, Kayman. A few questions:
1) "For training you only need your vector set (so the exa output of your Process Documents operator), as this contains the TF-IDF data the NBC needs to construct its mathematical magic."
- The dataset I merged in pre-processing is 75% training data with the pos/neg/neu values in confident_score and 25% test data with no value in confident_score, because I did the data split before the text processing, so the processed data is the combination of that 75% and 25%. Will this be an issue? Should I just put in the whole labelled dataset without the 75/25 split? If so, the whole dataset will have label values and I am not sure how NBC will detect the training and test datasets. The test dataset must have empty values in the label column, correct?
Thanks much!!
- While there is a label in your data, RapidMiner will ignore it for the prediction and only use it to compare its prediction (the outcome of NBC) with the provided label to check the accuracy.
So you pre-process 100% of the data first, then split the data (75/25 or 80/20, it doesn't matter that much).
- Training starts on the 75%; RapidMiner validates against the label whether the logic is correct, and the model is created.
- The test data uses this model, and the label, to determine whether the prediction was correct or not, and shows this in the accuracy part.
- If you are happy with the scores, you have a good model.
Now you can use this model with new unlabeled data and trust the predictions. The unlabeled data will always have to go through the same pre-processing flow as you used for training, but without labels this time, and you do not need to create vectors anymore as you already have a model. The only important thing is that your text field is 'reshaped' the same way as your training data, so tokenized, stopwords removed, etc. You can use the saved wordlist file from your training process as a filter (since the model isn't trained on any new words, they have no value anyway) by adding it to the wordlist input of your Process Documents operator.
As you have 25% of the data without a label, you cannot use that for training; this will be your 'unseen' data.
The remainder (with a valid label) will be your training data, and this will need to be split again into train (75~80%) and test (20~25%).
So workflow :
Training data -> Preprocess -> TFIDF -> Train and validate -> save model
Unseen data -> Preprocess -> apply model
Preprocessing should be the same for both
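Outside of RapidMiner, the same two flows would look roughly like the sketch below (scikit-learn, with a made-up file name and made-up reviews); the fitted vectorizer plays the role of the shared pre-processing plus wordlist.

```python
# Rough sketch of the two flows: train, validate and save the model, then load
# it and apply it to unseen, unlabeled reviews. File name, texts and labels
# are assumptions for illustration only.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training flow: training data -> preprocess -> TF-IDF -> train -> save model
train_reviews = ["flight delayed and service poor",
                 "great crew and excellent service",
                 "average flight, nothing special"]
train_labels = ["negative", "positive", "neutral"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_reviews)
model = MultinomialNB().fit(X_train, train_labels)
joblib.dump((vectorizer, model), "sentiment_model.joblib")

# Prediction flow: unseen data -> same preprocess -> apply model
vectorizer, model = joblib.load("sentiment_model.joblib")
unseen = ["the seats were uncomfortable and the flight was late"]
X_unseen = vectorizer.transform(unseen)   # transform only, never re-fit on unseen data
print(model.predict(X_unseen))
```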
Now this truly explained what I had been missing.
Sure, I will proceed with the advised workflow by tuning some processing parts.
Just two more questions:
1) In the workflow, the pre-process --> TF-IDF: is that what the Process Documents operator does? The exa output provides the dataset that will be the input for NBC, and the word output from the operator is just for analysing the TF-IDF, correct? So, for NBC, we need to use the exa output of the Process Documents operator.
2) But is the performance matrix (confusion matrix) testing the training model or the test model? Because I see it shows 83% for training but 0% for test.
Thanks much for all your kind explanation and patience with me here , Kayman!
Bravo, Kayman!! Now I have the complete picture of this whole process. I truly appreciate it.
Just one last question: the performance input is showing a connection error. Is it ok if the mod output from Apply Model (2) is not connected to the per input of Performance (2)?
Thanks a lot. I ran your suggested framework and it works!!
I got an accuracy of 75% and I plan to optimize this with subjectivity detection.
At least I now have an idea and can look into the optimization.
Really appreciated!
Stay safe and take care
In RapidMiner you can (in general) only connect ports of the same type with each other. So if your output port is labeled mod, it can only be attached to an input port labeled mod; anything else will give you an error message.
There are a few exceptions to this rule; for instance, the Store operators accept a lot of formats, but typically the input and output formats should be the same if you want to connect operators together.
Hello Kayman, I am sorry but I ran into one issue on which I would like to seek your advice.
I am trying to convert all the examples of the attribute of an Excel file that contains customer reviews into a list of sentences, which I will use as the input for the "Extract Sentiment" operator.
If possible, I want to split the reviews into sentences but keep the sentiment_score as assigned, and generate a sentence ID for each sentence.
I have tried my best to create the workflow, but it is throwing an error. I would truly appreciate it if you could advise me on what I am doing wrong here.
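Roughly what I am after, sketched in Python with pandas; the column names "customer_review" and "sentiment_score" are just my assumption of the Excel layout, and the sentence splitting is a simple split on punctuation.

```python
# Split each review into sentences, keep the sentiment of the parent review,
# and give every sentence its own id. Column names and texts are illustrative.
import pandas as pd

df = pd.DataFrame({
    "customer_review": [
        "The flight was late. The crew was friendly.",
        "Seats were uncomfortable. Food was bad. Landing was smooth.",
    ],
    "sentiment_score": ["neutral", "negative"],
})

# Split on sentence-ending punctuation followed by whitespace.
df["sentence"] = df["customer_review"].str.split(r"(?<=[.!?])\s+", regex=True)
sentences = df.explode("sentence").reset_index(drop=True)
sentences["sentence_id"] = sentences.index + 1
print(sentences[["sentence_id", "sentence", "sentiment_score"]])
```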
Thanks a lot for sharing all your know-how,
Heikoe
Hello Kayman, I re-read our conversation and I came across one point of confusion.
As you mentioned above: "The only important thing is that your text field is 'reshaped' the same way as your training data, so tokenized, stopwords removed, etc. You can use the saved wordlist file from your training process as a filter (since the model isn't trained on any new words, they have no value anyway) by adding it to the wordlist input of your Process Documents operator."
Could you kindly explain what it means to add the word file here?
After pre-processing I receive 2 outputs, one is exa and one is the word file, and I then use the exa output to run my NBC.
However, I believe this exa output contains the TF-IDF vectors as well, and the word file is a list of words from the TF-IDF. For running NBC on unseen data, do I need to input both the exa output and the wordlist?
thanks and regards,
Heikoe
Therefore the wordlist generated by your model can be used as a filter for unseen data, as this will reduce the number of attributes the process needs to take into consideration.
For unseen data you need to run the complete (and same) pre-processing flow (so generate a new example set). There is no need to create a new wordlist here.
Hello Kayman,
Thanks for your kind clarification here.
I understood that the unseen data can use the wordlist generated from the training dataset, as they are independent.
However, my confusion is where we use this wordlist and what its purpose is, because I didn't include the wordlist (i.e. the word output from Process Documents) in my NBC training; I only used the exa output from Process Documents as the input for my NBC training.
Pardon me if I am slow to understand here, because this is my first time trying to learn modelling concepts.
thanks so much for all your kind explanation,
regards.
Heikoe
You pre-process this text to create an example set, aka a vector set (bag of words), that will be used to create an NBC model.
Your training process also outputs this bag of words as a wordlist, or in other words, a list of all the words that are relevant for your model. Any word not in this list will not be used by the model.
Your new (unseen) data can contain new words; these were not in the training bag of words and are therefore not used by your model, so they are just redundant by default. Here is where your (saved) training wordlist comes in handy: in your prediction flow you can use it as an input (on the left side of the operator), so any token in your unseen data that is not in the wordlist will be ignored.
Using the wordlist therefore means you can simplify the pre-processing part of your prediction flow a bit, because you don't need additional filters anymore; the wordlist handles that part. But the tokenizing part needs to be exactly the same as in your training setup.
Retrieve (unseen data) --> Preprocess --> Apply Model (the model we saved during the training process).
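If it helps, here is a minimal sketch of that filtering idea in scikit-learn terms; the texts and labels are placeholders, and the fitted vectorizer's vocabulary plays the role of the saved wordlist.

```python
# The vocabulary learned during training is the "word list"; when unseen text
# is transformed with the same fitted vectorizer, any token not in that list
# is simply ignored. Texts and labels are made-up placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

train_reviews = ["flight delayed and service poor",
                 "great crew and excellent service"]
train_labels = ["negative", "positive"]

vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_reviews)
model = MultinomialNB().fit(X_train, train_labels)

print("word list:", list(vectorizer.get_feature_names_out()))

# "lost" and "luggage" were never seen during training, so they are dropped
# automatically; only "poor" and "service" contribute to the prediction.
X_unseen = vectorizer.transform(["lost luggage, poor service"])
print("prediction:", model.predict(X_unseen))
```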
Is that so? And is the wordlist added as an input in the pre-processing phase?
Sorry for taking so long to understand this; I couldn't visualize how to design the flow for unseen data.
And one more thing: when I run NBC with cross-validation on different datasets but receive the same performance result, is that typical? I expected different results since they are different models. Please correct me if my understanding is wrong here.
Thanks much,
Heikoe