SVD Performance on Large TF-IDF Matrices
All - I have 25K relatively short survey responses (most under 255 words) that I am trying to cluster into similar groups. My plan was to run the TF-IDF matrix through SVD and then cluster the reduced representation. Unfortunately the TF-IDF matrix is very large (25K x 140K). The term-document matrix alone took 15 minutes to build on my machine, and SVD locks up after a few minutes of processing. This is an educational application and I am considering running the SVD in the cloud with my 100 credits, but I fear that will not even come close to being enough. Has anyone got ideas, suggestions, or alternatives? Thanks.
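For a sense of scale: a dense 25,000 x 140,000 matrix of double-precision values would occupy roughly 28 GB, so any step that densifies the TF-IDF matrix will exhaust typical desktop RAM. Outside RapidMiner, the same pipeline can be sketched with scikit-learn's sparse-aware TruncatedSVD; the variable names and parameter values below are illustrative assumptions, not settings taken from this thread.

```python
# Sketch: sparse TF-IDF -> truncated SVD -> k-means clustering (scikit-learn, not RapidMiner).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# `responses` is assumed to be a list of the 25K survey texts.
vectorizer = TfidfVectorizer(stop_words="english", min_df=5)  # min_df already prunes rare terms
X = vectorizer.fit_transform(responses)                       # stays sparse

svd = TruncatedSVD(n_components=100, random_state=0)          # works directly on the sparse matrix
X_reduced = svd.fit_transform(X)

clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(X_reduced)
```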
Best Answer
Thomas_Ott, RapidMiner Certified Analyst, RapidMiner Certified Expert:
Has pruning been evaluated too? The pruning method parameter on the Process Documents from Data operator can do wonders for a large TF-IDF set.
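As a rough illustration of what document-frequency pruning does, sketched in scikit-learn rather than with the RapidMiner operator (the thresholds below are illustrative assumptions, not the operator's defaults):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Drop terms that appear in fewer than ~0.1% of documents or in more than half of them.
vectorizer = TfidfVectorizer(min_df=0.001, max_df=0.5)
X = vectorizer.fit_transform(responses)   # `responses` is the assumed list of survey texts
print(X.shape)                            # vocabulary is usually far smaller than the unpruned 140K terms
```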
Answers
It sounds like you should look at some text preprocessing to thin out your tokens. Did you filter by stopwords, by token length, by part of speech, etc.? Usually a raw wordlist can be reduced significantly using those methods. Look at your current wordlist and think about what you would like to drop. After that, if the matrix is still large, you might want to consider taking a sample and developing a wordlist from that, and then applying it to your larger dataset.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Brian - Thanks for the response. Below is a list (and attached is a snip) of the text processing I've done within the Process Documents operator so far; a rough Python sketch of the same chain follows the list:
- Transform Cases (lower case)
- Tokenize (non-letters)
- Filter Stopwords (English)
- Stem (Porter)
- Filter Tokens (by length, in characters)
- Generate n-Grams (max length = 2)
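An outside-RapidMiner equivalent of this chain, sketched with NLTK and plain Python rather than the operators themselves (the length bounds are illustrative assumptions):

```python
import re
from nltk.corpus import stopwords      # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text, min_len=3, max_len=25):
    # transform to lower case, then tokenize on non-letter characters
    tokens = re.split(r"[^a-z]+", text.lower())
    # filter English stopwords and empty strings, then stem (Porter)
    tokens = [stemmer.stem(t) for t in tokens if t and t not in STOPWORDS]
    # filter tokens by length (character count)
    tokens = [t for t in tokens if min_len <= len(t) <= max_len]
    # generate n-grams up to length 2 (unigrams plus bigrams)
    bigrams = ["_".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams
```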
You also mentioned part-of-speech filtering. I see the two operators that filter by POS Tags and by POS Ratio. Do you recommend one over the other, or have suggestions on the settings? The help for these operators is not completely clear to me. For example, for the POS ratio it says "min ratio of adjectives [verbs, nouns, etc.] for each token to be kept." Does that mean that if I set a 0.3 ratio for adjectives, no adjectives will be kept when they make up less than 30% of an individual document (or of the entire corpus)? And if the 0.3 is exceeded, then all of them will be kept, correct?
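For reference, the general idea of part-of-speech filtering, independent of either RapidMiner operator, can be sketched with NLTK; keeping only nouns and adjectives is just one illustrative choice:

```python
import nltk   # requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

KEEP_TAGS = ("NN", "JJ")   # Penn Treebank prefixes for nouns and adjectives (illustrative)

def pos_filter(text):
    tokens = nltk.word_tokenize(text.lower())
    tagged = nltk.pos_tag(tokens)          # list of (token, tag) pairs
    return [tok for tok, tag in tagged if tag.startswith(KEEP_TAGS)]

print(pos_filter("The survey responses were generally very positive about the new menu"))
```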
Additionally, I think I understand how to take a smaller sample and develop a word list from it as you suggested, but I don't know how to tell RapidMiner to apply that word list to a larger corpus. Can you walk me through that process?
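As I understand it, the analogous RapidMiner move is to connect the word list output from the sample run to the word list input of a second Process Documents from Data operator. The same idea can be sketched in scikit-learn; `sample_responses` and `all_responses` are assumed names, not from the thread:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Build the wordlist on a small sample first.
sample_vectorizer = TfidfVectorizer(stop_words="english", min_df=2)
sample_vectorizer.fit(sample_responses)
wordlist = sample_vectorizer.vocabulary_          # vocabulary learned from the sample

# Then apply exactly that wordlist when vectorizing the full 25K-document corpus.
full_vectorizer = TfidfVectorizer(vocabulary=wordlist)
X_full = full_vectorizer.fit_transform(all_responses)
```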
Thanks again for the response. It was very helpful.
I am glad my first comments were helpful. Here are a few additional comments in response to your questions:
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Awesome Brian! Thanks so much...that's exactly what I needed. I'll give it a whirl. Any idea what size matrix I need to be below so the SVD operator doesn't choke?
Thanks Again
That I don't know--you'll probably have to do some testing to find out. I'd be curious what you find out on that score, though, so if you can update the thread with your results it would be helpful!
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Will do. I can tell you I ran a 1K x 1.5K matrix locally on my Surface Pro 3 the other day and it choked. That might have been the RAM available on my Surface. I haven't tried the cloud with higher RAM because I only have access to one processor with my educational account. I fear it will take so long that I'll burn through all my credits and never complete. I'll let you know what happens either way.
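A quick back-of-envelope check can help gauge whether the matrix itself is the problem; this assumes dense double-precision storage and says nothing about RapidMiner's actual internal memory use:

```python
def dense_mb(rows, cols, bytes_per_value=8):
    """Memory for a dense matrix of 8-byte doubles, in megabytes."""
    return rows * cols * bytes_per_value / 1e6

print(dense_mb(25_000, 140_000))  # ~28,000 MB: far beyond a typical laptop if densified
print(dense_mb(1_000, 1_500))     # ~12 MB: the matrix itself is tiny at this size
```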
Thanks, I hadn't considered that. I'll give it a try.