"Text Mining / Too slow"
Hello,
I have a process that prepares text data: tokenize, transform to uppercase, filter stop words, stem (Snowball), then build TF-IDF vectors with pruning (below 3% and above 40%).
My data is an example set of about 800,000 rows, and I'm processing 3 text attributes.
The attributes:
- First one: has several words
- Second one: has none or 2-3 words
- Third one: has about 300 words
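For readers unfamiliar with the steps named in this post, here is a minimal Python sketch of the same kind of pipeline. It is not the RapidMiner implementation: the stop-word list is a tiny hypothetical one, stemming is left out (the standard library has no Snowball stemmer), and the weighting `tf * log(N / df)` is one common TF-IDF variant, assumed here for illustration. Pruning is read as dropping terms that occur in fewer than 3% or more than 40% of the documents.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (uppercase to match the case transform).
STOP_WORDS = {"THE", "A", "IS", "OF", "AND"}

def tokenize(text):
    """Uppercase the text, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[A-Z]+", text.upper()) if t not in STOP_WORDS]

def tfidf_with_pruning(docs, min_df=0.03, max_df=0.40):
    """Return one {term: weight} dict per document after percentual pruning."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Keep only terms whose document share lies within [min_df, max_df].
    vocab = {t for t, c in df.items() if min_df <= c / n <= max_df}
    vectors = []
    for tokens in tokenized:
        counts = Counter(t for t in tokens if t in vocab)
        total = sum(counts.values())
        vectors.append({t: (c / total) * math.log(n / df[t]) for t, c in counts.items()})
    return vectors

docs = ["apple banana", "apple cherry", "apple", "date"]
vectors = tfidf_with_pruning(docs)
```

In this toy input "apple" occurs in 75% of the documents, so it is pruned and the third document ends up with an empty vector.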
My machine has 15.5 GB of RAM, of which 12 GB are allocated to RapidMiner.
My process handled 20,000 rows in three and a half hours, so I estimate the full run would take roughly six days, which is not really acceptable.
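The estimate can be checked with a short linear extrapolation. This is only a sketch: it assumes throughput stays constant over the whole run, which may not hold if memory fills up. The function name is made up for illustration.

```python
def estimate_total_hours(rows_done, hours_taken, total_rows):
    """Extrapolate total runtime, assuming constant throughput."""
    rows_per_hour = rows_done / hours_taken
    return total_rows / rows_per_hour

hours = estimate_total_hours(20_000, 3.5, 800_000)
days = hours / 24
print(f"{hours:.0f} hours, about {days:.1f} days")
# prints: 140 hours, about 5.8 days
```

So the figures in the post work out to about 140 hours, a little under six days.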
1. Are there any ways to optimize a text processing process?
2. Does it seem to you that there is a problem in my process? (I followed the tutorials; there is nothing really special about it.)
3. Are there any benchmark studies on the speed of RapidMiner?
Thank you in advance,
Best regards,
Answers
One thing I noticed, though, is that big data sets can saturate all your resources. It did happen to me that the process crashed (that might just be because I have a very old machine with 2 GB and 4 cores).
Anyhow, to avoid that I found it much more useful to work on data coming from a DB. So what I normally do is:
1. Load my data into a DB table using RapidMiner.
2. Use the StreamDB operator to feed things into the parallelized "process documents".
3. Write my results to another DB table.
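Outside RapidMiner, the same idea (read from a table in batches, process, write to a second table) can be sketched in Python with SQLite. This is an illustration under assumptions, not the StreamDB operator: the table names `documents` and `results`, the column `body`, and the `process_text` stand-in are all hypothetical, and the parallel step is left out to keep the sketch short. The point is that only one batch is held in memory at a time.

```python
import sqlite3

def process_text(text):
    """Hypothetical stand-in for the real document processing."""
    return " ".join(text.upper().split())

def stream_process(conn, batch_size=1000):
    """Read rows batch by batch, process them, and write to a results table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (id INTEGER PRIMARY KEY, processed TEXT)"
    )
    last_id, written = -1, 0
    while True:
        # Keyset pagination: fetch the next batch after the last processed id.
        batch = conn.execute(
            "SELECT id, body FROM documents WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not batch:
            break
        conn.executemany(
            "INSERT INTO results (id, processed) VALUES (?, ?)",
            [(doc_id, process_text(body)) for doc_id, body in batch],
        )
        conn.commit()  # results are persisted batch by batch
        last_id = batch[-1][0]
        written += len(batch)
    return written

# Tiny in-memory demo with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO documents (body) VALUES (?)",
    [("some  text",), ("more text",), ("third",)],
)
written = stream_process(conn, batch_size=2)
```

Because each batch is committed before the next one is read, a crash partway through loses at most one batch of work.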
Hope this helps
Igor
I'm answering a bit late, but yes, I used the parallelization extension. (Thank you for your answer.)
However, only the "process documents" step can be parallelized. I would need a bit more than that in terms of parallelization. Are there any other solutions?
Alina
If you search the forum, there are some good tutorials on how to set up RM on AWS.