
text mining and words counting problem

PatrickHou Member Posts: 6 Learner III
edited November 2018 in Help

Hi 

 

I'm new to RapidMiner and I'm analyzing several txt documents. Let's say I have the 20 most frequent words and I want to know (and only know) how many times they show up in each document. Can someone give me some ideas?

 

I also have a problem: "united", "states" and "united_states" all appear in my result, but I can't just replace them because not every "united" is related to "united states". How can I pull out the "united_states" occurrences without also counting them under "united" and "states"?

 

Thanks 

Patrick

Answers

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    For your first question, use Process Documents, supply a specific wordlist (your 20 words), and then compute the word vector using Term Occurrences.

    For your second question, you can use Generate N-Grams after you Tokenize (and do other text preprocessing), which will give you a token for "united_states" that is separate from both "united" and "states".

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
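
To make the two suggestions above concrete, here is a minimal sketch of the same logic in plain Python rather than RapidMiner. It is only an illustration of what per-document term occurrences over a fixed wordlist with bigram tokens look like. The function names, the toy documents, and the simple regex tokenizer are all assumptions made for this example and are not part of any RapidMiner operator.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def add_bigrams(tokens):
    """Return the tokens plus underscore-joined bigrams such as 'united_states'."""
    bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams

def term_occurrences(documents, wordlist):
    """Count each wordlist term separately for every document."""
    table = {}
    for name, text in documents.items():
        counts = Counter(add_bigrams(tokenize(text)))
        # Keep only the terms of interest; missing terms count as 0.
        table[name] = {term: counts[term] for term in wordlist}
    return table

# Hypothetical toy documents, invented for illustration only.
documents = {
    "doc1.txt": "The United States and a united team. United States again.",
    "doc2.txt": "Several states stayed united.",
}
wordlist = ["united", "states", "united_states"]

for name, row in term_occurrences(documents, wordlist).items():
    print(name, row)
```

Each document gets its own row of counts, and "united_states" is counted as its own token alongside "united" and "states". Note that this simple tokenizer ignores sentence boundaries, so bigrams can span a period.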
  • PatrickHou Member Posts: 6 Learner III

    Thanks for the reply!

     

    I have already used Term Occurrences, but that gave me the overall occurrences of each word, and I want to know the word occurrences in each document (I have about 50 files).

     

    For the second question, does that mean those "united" and "states" counts are not related to "united_states"?

     

    Patrick

  • PatrickHou Member Posts: 6 Learner III

    I looked into the documents, and it seems that when I use the n-gram operator, all words are combined no matter whether they are related. I think that means I need a filter or pruning for those words? But how?

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You might want to post your process XML (see the instructions in the right sidebar), since the count should be generated for each document assuming each document is a separate entity in your input data.  Do you have the "create word vector" parameter checked?

    The single counts are not exclusive of the n-gram, but the exclusive uses can be easily calculated via subtraction.  So if there are 10 total occurrences of "united" and 6 occurrences of "united_states" then you know that 4 of the "united" occurrences were not associated with "united_states".

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
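
The subtraction described above can be sketched in a few lines of Python. This is an illustration only: the helper name `exclusive_count` is hypothetical, and the sketch assumes the single word appears exactly once in each occurrence of the n-gram, which holds for "united" in "united_states". The same calculation would be applied to each document's counts separately.

```python
def exclusive_count(counts, unigram, ngram):
    """Occurrences of `unigram` that are not part of `ngram`.

    Assumes `unigram` appears exactly once in each `ngram` occurrence.
    """
    return counts.get(unigram, 0) - counts.get(ngram, 0)

# Figures taken from the example in the post above.
counts = {"united": 10, "united_states": 6}

print(exclusive_count(counts, "united", "united_states"))  # 4
```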
  • PatrickHou Member Posts: 6 Learner III

    I found that the stopwords (dictionary) filter can do the trick: I manually add the words I don't need, inside Process Documents. For the small case I'm working on that's enough, but I'll keep looking for operators that may deal with this problem.

     

    Thank you.
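
For readers who want to see the dictionary approach outside RapidMiner, here is a rough Python sketch of the idea: tokens listed in a hand-maintained stopword set are dropped before counting. The token list, the stopword set, and the function name are invented for illustration and do not reflect how the RapidMiner operator is implemented.

```python
from collections import Counter

def count_without_stopwords(tokens, stopwords):
    """Count tokens, skipping any that appear in the custom stopword set."""
    return Counter(token for token in tokens if token not in stopwords)

# Hypothetical tokens after n-gram generation, including unwanted bigrams.
tokens = ["united", "states", "united_states", "a_united", "united", "united_team"]

# Hand-maintained list of tokens to discard, playing the role of the dictionary file.
custom_stopwords = {"a_united", "united_team"}

print(count_without_stopwords(tokens, custom_stopwords))
```

This works well when the list of unwanted tokens is short, as in the small case described above, but it has to be maintained by hand as the documents change.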

