[SOLVED] Join wordlists
kasper2304
Member Posts: 28 Contributor II
Hi
I was reading a paper yesterday that said it can sometimes be wise to do feature extraction separately on each class when doing text analysis. I did this by using two Process Documents from Files operators with the same setup on both, and then merging the resulting example sets. The results were very good!
So now my problem is that I want to merge the two wordlists in order to apply the merged wordlist to the entire corpus, but I simply cannot figure out how to do it. Any suggestions?
I can see that the same question was posted in another thread in January without any answer.
Best
Kasper
Answers
There is currently no way to combine the actual wordlist outputs (wor) of Process Documents. But you are probably trying to combine the example set outputs (exa), right? You have probably tried Append, which does not work because the two sets contain different attributes. Try a combination of Union and Replace Missing Values instead!
Btw, when referencing other posts, a link would be helpful.
Best regards,
Marius
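To illustrate what the Union plus Replace Missing Values combination achieves, here is a minimal Python sketch. It is not RapidMiner code: the function name `union_fill` and the toy rows are made up for illustration, and it assumes each example is a mapping from word attribute to weight, with missing weights replaced by zero.

```python
def union_fill(rows_a, rows_b, fill=0.0):
    """Stack two lists of {attribute: value} rows over the union of their attributes."""
    all_rows = rows_a + rows_b
    # Collect every attribute name seen in either set, in a stable sorted order.
    attributes = sorted({name for row in all_rows for name in row})
    # Rebuild each row over the full attribute list; an attribute a row
    # does not have gets the fill value (hypothetical choice: 0.0).
    merged = [{name: row.get(name, fill) for name in attributes} for row in all_rows]
    return attributes, merged


# Toy example sets with different word attributes (invented values).
positives = [{"good": 0.7, "great": 0.3}]
negatives = [{"bad": 0.9, "good": 0.1}]
attributes, merged = union_fill(positives, negatives)
```

A plain append fails here for the same reason it does in the process: the two sets do not share one attribute list until the union has been built.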
Well, I am actually trying to find a way to combine the actual wordlists, because I need the combined wordlist later when I build the corpus I want to apply my model to. I did combine the example sets as you suggest and trained a model on them, with very good results on my test set: precision and recall went from 60% to around 90% with a linear SVM, tf-idf, and a downsampled training set of 286 positives and 286 negatives. But if I cannot extract the exact same word vector from my entire corpus, then my new method is of no use. :/
But thinking about it, what I might actually want to do is to also create two example sets from my corpus and then merge them in the same way as I do with my training set. Am I right?
The link to the other post is below, as well as my setup for creating one training set from two Process Documents from Files operators.
http://rapid-i.com/rapidforum/index.php/topic,6086.0.html
I have merged word lists by doing some gymnastics, as follows:
1. Convert the word lists to example sets
2. Keep only the word attribute in the example sets
3. Append the example sets
4. Remove duplicates
5. Convert the word attribute to type text
6. Create a word vector from this using Process Documents from Data
Here's a simple example - hope it helps.
Regards,
Andrew
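The logic of the merge steps above can be sketched outside RapidMiner in plain Python. This is only an illustration under stated assumptions, not the RapidMiner process itself: the helper names `merge_wordlists` and `count_vector` are hypothetical, each wordlist is assumed to be a simple list of words, and raw term counts stand in for whatever vector weighting (for example tf-idf) the real process uses.

```python
def merge_wordlists(*wordlists):
    """Append several wordlists and remove duplicates, keeping first-seen order."""
    seen = set()
    merged = []
    for wordlist in wordlists:
        for word in wordlist:
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


def count_vector(tokens, vocabulary):
    """Build a term-count vector for one document over a fixed vocabulary."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokens:
        # Tokens outside the merged vocabulary are ignored.
        if token in index:
            vector[index[token]] += 1
    return vector


# Toy wordlists from the two per-class runs (invented words).
vocabulary = merge_wordlists(["good", "great"], ["bad", "good"])
vector = count_vector(["good", "bad", "bad", "unknown"], vocabulary)
```

Because the merged vocabulary is fixed before any new document is vectorised, every document in the full corpus ends up with the same attributes in the same order as the training set.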
Thanks a lot for the help guys!