
Best practices for text mining an academic text

yoram_schaffer Member Posts: 3 Learner III
edited December 2018 in Help

I have long, complex texts that I want to classify into categories such as psychology, history, etc.

Which processes would you recommend using, e.g. tokenization, n-grams, etc.?

Thank you

Answers

  • kypexin RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @yoram_schaffer

     

    Your question might seem a bit too general, as text categorization is a pretty big topic :)

    I might cite my own answer on the very same subject from another discussion some time ago:

     

    * https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/autotagging-and-autocategorizing-text-pieces/m-p/43717/highlight/true#M29049 

     

    Maybe this could help with some ideas for your problem as well. 

  • yoram_schaffer Member Posts: 3 Learner III

    Thank you @kypexin for taking the time to reply to me.

    I read your other reply thoroughly. Did you ever try using some of the other processes, such as stemming or part-of-speech (POS) tagging?

    The texts I'm analyzing are academic in nature, i.e. I'm not trying to analyze client behavior, nor am I trying to find a dependency between different factors (e.g. weather against purchase habits).

    My intention is to categorize texts according to the topic they are dealing with. The texts are usually 100-300 words.

    I understand this may be beyond your experience. Do you know of a resource that might be helpful here?

  • kypexin RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @yoram_schaffer

     

    Well, I did more or less the same task: categorizing site contents (that is, text data) into separate predefined categories. I used all the standard steps (such as tokenization and stemming) in my process; see screenshot #2. The one thing I didn't use was n-grams, as they would be pretty memory-consuming. Otherwise your problem looks VERY similar to mine, so I would recommend that you begin by re-creating the process setup as I described it and look at the results (believe me, it really works! :)). I think one crucial thing here is a good training set, meaning a corpus of manually categorized documents (and how hard this part is depends on how many unique categories and total documents you have).
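
    For readers who want to see the same steps outside RapidMiner, here is a minimal Python sketch of tokenization, stemming, optional n-grams and a classifier trained on manually labeled documents. It is not the process from the screenshot. The suffix-stripping stemmer, the choice of multinomial Naive Bayes, the four toy documents and every name in the code are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Rough suffix stripping; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def features(text, max_n=1):
    """Return stemmed tokens plus word n-grams up to length max_n."""
    stems = [crude_stem(t) for t in tokenize(text)]
    feats = list(stems)
    for n in range(2, max_n + 1):
        feats += ["_".join(stems[i:i + n]) for i in range(len(stems) - n + 1)]
    return feats


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self, max_n=1):
        self.max_n = max_n
        self.term_counts = defaultdict(Counter)  # label -> feature counts
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            feats = features(text, self.max_n)
            self.term_counts[label].update(feats)
            self.doc_counts[label] += 1
            self.vocab.update(feats)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        # Features never seen during training are ignored.
        feats = [f for f in features(text, self.max_n) if f in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            score += sum(math.log((counts[f] + 1) / denom) for f in feats)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set: a manually categorized corpus (made-up examples).
train_docs = [
    "Cognitive therapy changes behavior and memory in anxious patients",
    "The experiment measured memory and attention in children",
    "The empire collapsed after the war and the treaty of the kings",
    "Medieval kings signed treaties ending the long war",
]
train_labels = ["psychology", "psychology", "history", "history"]

model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict("Patients showed improved attention and memory")
print(prediction)

# Adding bigrams enlarges the vocabulary, which is where the memory cost comes from.
bigram_model = NaiveBayes(max_n=2).fit(train_docs, train_labels)
print(len(model.vocab), len(bigram_model.vocab))
```

    With a real corpus the same structure applies, but the vocabulary grows quickly once n-grams are switched on, so it is reasonable to start with single tokens and add n-grams only if accuracy calls for it.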

     

    In a more general sense, text mining is one of the most popular topics, so you can find a lot of posts on this forum if you search for 'text mining' and similar terms. Also look at the operator descriptions in the Text Mining extension for RapidMiner, as basically everything is built around it. A Google search for 'text mining rapidminer' also turns up quite a few different resources, including some tutorial videos.

     

     

  • yoram_schaffer Member Posts: 3 Learner III

    Thank you very much @kypexin!
    I will try the different settings, using your illustration as a source of inspiration. Yes, I have quite good samples, as I have been working on this for a long time (I actually started with RapidMiner after seeing, to my surprise, how limited Amazon ML is in terms of applying different processes).

    I will report back and share with the community once I have some insights about what brings better results, at least for academic texts.
