Problem Processing Data and Filter Stopwords for LDA
Hi,
I really need the community's help. I have already tried every solution suggested in other community posts about the Filter Stopwords operator, but nothing has worked so far. I have reviews from which I want to extract topics with LDA. I followed tutorials on how to pre-process the data, filter stopwords, etc., but unfortunately it does not seem to work. Despite transforming cases to lowercase, I still have words with capital letters in my output, and the stopwords I listed in the attached .txt file are not filtered out. The Replace Tokens operator does not seem to work either. Because the Filter Tokens by POS operator takes a lot of time, I used a sample of only 100 (the sampling can be enabled or disabled at any time). I also tried it without Filter Tokens by POS and with the whole data set, and it still does not work. I attached all my files and processes. Could you please help me with my process? Thank you so much!
I am not sure if this goes too far for one post, but can someone also tell me how to find the ideal number of topics for LDA?
Thank you, Larissa
Best Answer
jacobcybulski (Member, University Professor)
You have a number of issues in your process. If you do want to use Process Documents from Data to pre-process the text before LDA, you need to keep the text it generates (the "keep text" option). That also means LDA must then process the attribute "text" and not your original "Review". "Review" is polynominal and is not automatically of type text, so your intuition to use Nominal to Text was correct, but you need to apply it to "Review". Next, you cannot filter the tokens by POS, as you have not done any stemming and so no POS tags are present (you would need dictionary stemming to get these). Finally, all your stop words would be eliminated by the default English stop word filter anyway, so do you really need your custom list? Good luck!
Answers
One way to avoid this would be to order your replacements carefully, or to use regular expressions. Something like \bharry\b will only replace harry when it is a word on its own.
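To illustrate the word-boundary idea, here is a minimal sketch in Python. Python's `re` module is used only for illustration, and the sample sentence and the replacement word `potter` are made up; it is not the RapidMiner process itself.

```python
import re

# Made-up sample text: "harrys" contains "harry" as a substring.
text = "harry met harrys friend; harry left."

# A plain substring replacement also changes the "harry" inside "harrys".
naive = text.replace("harry", "potter")

# \b matches a word boundary, so only the standalone word is replaced.
bounded = re.sub(r"\bharry\b", "potter", text)

print(naive)
print(bounded)
```

The second result leaves "harrys" untouched, which is the behaviour you want when one replacement pattern is a substring of another token.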
For LDA there is no real need for POS filtering. In a traditional NLP flow it makes sense, but the power of LDA is that it 'sees' the relations between words, which reduces the need to normalise to an extreme level. Even filtering is just an option. I'd suggest doing just some basic cleaning to get rid of the most obvious dirt and letting LDA do the rest for you.
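As a rough idea of what "basic cleaning" could look like outside RapidMiner, here is a small Python sketch. The stop word set, the `basic_clean` helper, and the sample reviews are all hypothetical; in practice the stop words would come from your own .txt file. Note that the text is lowercased before the stop word check, so capitalised stop words are caught too.

```python
import re

# Hypothetical stop word list, for illustration only.
STOPWORDS = {"the", "a", "is", "and", "it", "was"}

def basic_clean(review, stopwords=STOPWORDS, min_len=2):
    """Lowercase, keep alphabetic tokens, drop stop words and very short tokens."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]

# Made-up sample reviews.
reviews = [
    "The room was CLEAN and the staff is friendly!",
    "It was a noisy Hotel.",
]
cleaned = [basic_clean(r) for r in reviews]
print(cleaned)
```

The resulting token lists are the kind of lightly cleaned input that could then be handed to an LDA implementation, without any POS filtering or stemming.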