"Text Mining Generate n-grams giving me bad results"
First-time user of RapidMiner, so be gentle.
I have a file of support call notes that I'm trying to text mine to get the most used 2-word phrases. I've watched a couple of videos and read a couple of posts on how to do this, so I think I have everything set up correctly (but maybe not, since it's not working). Before adding Generate n-grams, the process returns single words just fine. After I add Generate n-grams with a max length of 2, the results look wrong. The screen caps below give a peek into my setup and results.
The Process:
https://photos.app.goo.gl/ue98yXSvkeMKbuzq9
The results:
https://photos.app.goo.gl/nJADtcHLDeMH2wMk9
Any help or direction would be greatly appreciated.
Answers
Hello @pbailey - welcome to the community. Don't worry...we are actually very gentle here!
So in general, the best way for us to help is for you to post your process XML and at least a little bit of your data (if it's sensitive, people use "dummy" data). This way we can actually run the process, tweak it, and share it with others. You can find instructions on how to do this here.
Looking at your images, I honestly think that it is working. Why do you think it isn't? My hunch is that you have a lot of "junk" tokens, like "aaacds" and "aaba", that you'll probably want to filter out in order to get better results. That's easy to do. Just use the "Filter Tokens (by Content)" operator. You may want to play around with the parameters and use the "matches" method with regular expressions. For example, you can write an expression that filters OUT any token that starts with the letters "aa". Regular expressions are VERY helpful in text mining.
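To make the idea concrete outside of RapidMiner, here is a minimal Python sketch of the same two steps: drop tokens that match a "junk" regular expression, then build and count 2-word phrases. Everything in it is an assumption for illustration, not RapidMiner's implementation: the pattern `aa.*`, the sample tokens, the helper names `drop_junk` and `bigrams`, and the underscore used to join the two words.

```python
import re
from collections import Counter

# Hypothetical junk pattern: any token that starts with "aa".
# The whole token must match, so a token that merely contains "aa" is kept.
JUNK = re.compile(r"aa.*")


def drop_junk(tokens):
    """Return only the tokens that do NOT fully match the junk pattern."""
    return [t for t in tokens if not JUNK.fullmatch(t)]


def bigrams(tokens):
    """Join each pair of adjacent tokens into one 2-word term."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


# Made-up sample tokens standing in for tokenized support call notes.
tokens = "aaacds reset user password aaba reset user password again".split()

clean = drop_junk(tokens)        # junk tokens removed first
counts = Counter(bigrams(clean))  # then count the 2-word phrases

print(counts.most_common(3))
```

Filtering before generating the pairs matters: if the junk tokens stay in, they end up inside 2-word terms such as "aaacds_reset" and crowd out the phrases you actually care about.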
Good luck!
Scott