
"Text Mining Generate n-Grams giving me bad results"

pbailey Member Posts: 1 Learner I
edited May 2019 in Help

First time user of RapidMiner so be gentle.     

 

I have a file of support call notes that I'm trying to text mine to get the most-used 2-word phrases. I've watched a couple of videos and read a couple of posts on how to do this, so I think I have everything set up correctly (but maybe not, since it's not working). Before using Generate n-Grams, the process returns single words just fine. After I add Generate n-Grams with a max length of 2, the results look bad. The screen caps below give a peek into my setup and results.

 

The Process:

 https://photos.app.goo.gl/ue98yXSvkeMKbuzq9

The results:

 https://photos.app.goo.gl/nJADtcHLDeMH2wMk9

Any help or direction would be greatly appreciated.

Answers

  • sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hello @pbailey - welcome to the community. Don't worry...we are actually very gentle here! 

     

    In general, the best way for us to help is for you to post your process XML and at least a little bit of your data (if it's sensitive, people use "dummy" data). That way we can actually run the process, tweak it, and share it with others. You can find instructions on how to do this here.

     

    Looking at your images, I honestly think that it is working. Why do you think it isn't? My hunch is that you have a lot of "junk" tokens, like "aaacds" and "aaba", that you'll probably want to filter out in order to get better results. That's easy to do: just use the "Filter Tokens (by Content)" operator. You may want to play around with the parameters and use the "matches" method with regular expressions. For example:

     

    Screen Shot 2018-11-02 at 10.19.30 AM.png

     

    This will filter OUT any token that starts with the letters "aa". Regular expressions are VERY helpful in text mining. :)
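
    For anyone who wants to see the same idea outside of RapidMiner, here is a rough Python sketch. It is not the RapidMiner process itself: the sample tokens, the pattern "aa.*", and the underscore used to join the 2-word phrases are all assumptions made up for illustration. It drops every token that fully matches the pattern and then counts adjacent word pairs.

```python
import re
from collections import Counter

# Hypothetical sample: tokens from one tokenized, lower-cased support note.
tokens = ["aaacds", "customer", "called", "aaba",
          "password", "reset", "customer", "called"]

# Assumed pattern: the whole token must match, so "aa.*" hits any token
# that starts with "aa".
junk = re.compile(r"aa.*")

# Filter OUT the junk tokens and keep everything else.
kept = [t for t in tokens if not junk.fullmatch(t)]

# Build 2-word phrases from adjacent kept tokens (the "_" joiner is an assumption).
bigrams = [f"{a}_{b}" for a, b in zip(kept, kept[1:])]

# Most frequent 2-word phrase with its count.
top = Counter(bigrams).most_common(1)
print(kept)
print(top)
```

    One thing to keep in mind: filtering before building the phrases means that words which were separated by a junk token end up next to each other, so some pairs in the output never appeared side by side in the original notes.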

     

    Good luck!

     

    Scott

     

     

     
