generate a subset of wordlist based on a given weight table
winecoding
Member Posts: 6 Contributor II
I have generated a wordlist file by processing a document corpus. The following is a screenshot of part of the wordlist file. There are around 15,000 rows (15,000 different tokenized words). Based on a feature selection method, I already have a list of words that should be kept. This list contains only 500 words and is saved in the weight object. How can I join these two items, a wordlist and a weight table, to generate a short wordlist that has only 500 rows?
Answers
Hi,
It's relatively easy to filter the wordlist and get an example set containing only the words that fulfill a weight requirement. However, I don't know a way, other than Execute Script, to turn this back into a wordlist.
Could you explain why you need to do this, and why using Select by Weights on the resulting table is not an option?
~Martin
Dortmund, Germany
Hi Mschmitz,
Thanks for the reply.
The following is the current prediction script. The Retrieve operator (circled in red) retrieves the original wordlist, which has about 15,000 words, for instance. After it is combined with the stored weights, the example set passed to the Apply Model operator has a reduced size. However, I would like to reduce the original wordlist offline instead. For instance, I could build the reduced wordlist from the stored weight table before launching this prediction script. The wordlist passed in would then already be filtered, containing about 500 words based on the top weights, and I would not need to include the part circled in yellow at all.
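The offline reduction described above can be sketched outside RapidMiner as a plain join between the two tables. This is only a minimal illustration, assuming both objects have been exported to CSV; the column names (`word`, `attribute`, `weight`, and the occurrence counts) and the sample rows are hypothetical and would need to match the actual export.

```python
import csv
import io

def reduce_wordlist(wordlist_rows, weight_rows, top_n=500):
    """Keep only wordlist rows whose word is among the top_n weighted words."""
    # Rank the weight table by weight, highest first, and keep the top_n words.
    ranked = sorted(weight_rows, key=lambda r: float(r["weight"]), reverse=True)
    keep = {r["attribute"] for r in ranked[:top_n]}
    # Filter the wordlist while preserving its original row order.
    return [row for row in wordlist_rows if row["word"] in keep]

# Tiny in-memory stand-ins for the exported CSV files (hypothetical layout).
wordlist_csv = """word,total_occurrences,document_occurrences
wine,120,40
the,900,100
tannin,35,12
and,800,99
"""
weights_csv = """attribute,weight
wine,0.91
tannin,0.75
the,0.02
and,0.01
"""

wordlist = list(csv.DictReader(io.StringIO(wordlist_csv)))
weights = list(csv.DictReader(io.StringIO(weights_csv)))
reduced = reduce_wordlist(wordlist, weights, top_n=2)
print([row["word"] for row in reduced])
```

The result is still a table, not a RapidMiner WordList object, so it would have to be turned back into a wordlist inside RapidMiner.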
It's possible to make a wordlist from an example set containing 500 examples, each representing a word, as follows.
Here's an example
Heh, I hadn't thought about doing it this way but I think that works. You can then pass the weighted words back into a Process Documents to Data operator and then output the WordList for scoring. Sweet!
Hi Andrew,
I do not think this works as it should, since you would need to process the whole data set again. If you only throw in the list of attributes as an example set, you would not get proper normalization factors for TF/IDF.
~Martin
Dortmund, Germany
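The normalization concern can be illustrated with a small sketch. It uses a textbook inverse document frequency, log(N / df), which is an assumption here and may differ from the exact formula RapidMiner applies; the toy corpus is made up. When each selected word is fed in as its own pseudo-document, every word appears in exactly one document, so the document frequencies of the real corpus are lost.

```python
import math

def idf(term, documents):
    """Plain inverse document frequency: log(N / df)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df)

# A toy "real" corpus: each document is the set of its tokens.
corpus = [
    {"wine", "red", "tannin"},
    {"wine", "white"},
    {"wine", "red"},
    {"beer", "hops"},
]
# The reduced word list fed back in as one pseudo-document per word.
pseudo_docs = [{"wine"}, {"tannin"}]

# In the real corpus the common word gets a lower IDF than the rare one.
print(idf("wine", corpus), idf("tannin", corpus))
# In the pseudo-documents every word gets the same IDF.
print(idf("wine", pseudo_docs), idf("tannin", pseudo_docs))
```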
I don't know what the OP's use case is, but maybe they don't need TF-IDF; maybe they can use Binary Occurrences?
Thank you for your response, let me try your suggestions. I use binary occurrence.
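For binary occurrences, a document vector depends only on the document and the vocabulary, not on corpus-wide statistics. A minimal sketch, assuming simple whitespace tokenization (RapidMiner's tokenizer and any pruning would behave differently) and a hypothetical three-word vocabulary:

```python
def binary_occurrences(text, vocabulary):
    """Return 1 for each vocabulary word present in the text, else 0."""
    tokens = set(text.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]

# Hypothetical reduced vocabulary (the real one would hold about 500 words).
vocabulary = ["wine", "tannin", "oak"]
vector = binary_occurrences("This wine has soft tannin and more wine", vocabulary)
print(vector)  # [1, 1, 0]
```

Repeated words do not change the vector, and no document frequencies are needed at scoring time.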
Hello Martin
It's not exactly clear why the OP wants to do this, but the technique definitely works if you want to create a word list from an example set that was originally derived from a word list but has been reduced in some way. The last book chapter I wrote did this extensively.
regards
Andrew
Hi awchisholm,
Thank you for the reply. I just have one question regarding this approach.
I am saving the generated weight object to a CSV file, so I can keep the top 500 words and turn them into a text data file (each row represents a file) for RapidMiner to process. However, generating the wordlist object requires the example set to have class information, and the weight file itself has no label information. The original training process was built for a six-class categorization task. How can I resolve this discrepancy?