
"pdf tokenization (?)"

margkw Member Posts: 14 Contributor II
edited June 2019 in Help
Hello guys,
I am totally new here and to RapidMiner!
I have an assignment to finish, so I don't have much time to explore RapidMiner. I will post my question here and hope to find the answer. It might be trivial; I apologise for that.

I have several PDF files. I want to tokenize them, i.e. to see each distinct word and how many times it appears.
For example, let's assume that a PDF contains the word "process". I want to see how many times this word appears, and I want to do that for every word in the PDF file. Is tokenization what I need to do? If yes, how do I do it? If not, what do you propose?
Thank you in advance!

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Yes, it is. Just load the data with Read Documents from Files, connect it to Process Documents, inside Process Documents add the Tokenize operator, and finally connect the output ports of the Process Documents operator to the process output.

    To get the aforementioned operators, you have to install the Text Processing extension.

    Best, Marius
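
For readers who want to see what this kind of tokenization and counting amounts to outside RapidMiner, here is a minimal Python sketch. It is not how the Tokenize operator is implemented. It assumes the text has already been extracted from the PDF (that extraction step is not shown), and it splits on every non-letter character, which is only one possible tokenization rule. The names `tokenize`, `word_counts`, and `sample` are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase tokens on any run of non-letter characters."""
    # Assumption: only ASCII letters count as part of a word.
    return [t for t in re.split(r"[^A-Za-z]+", text.lower()) if t]


def word_counts(text):
    """Return a Counter mapping each token to its number of occurrences."""
    return Counter(tokenize(text))


# Hypothetical sample standing in for the text extracted from one PDF.
sample = "The process starts. Each process ends; the Process repeats."
counts = word_counts(sample)

print(counts["process"])  # 3
for word, n in counts.most_common():
    print(word, n)
```

To cover several files, the per-file counters can be added together, since `Counter` objects support `+`.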
  • margkw Member Posts: 14 Contributor II
    Thank you very much. I will try that out and get back to you if I have any problems. Many, many thanks! :)
  • margkw Member Posts: 14 Contributor II
    It's me again! How can I insert the Tokenize operator inside Process Documents?

    And what should the process output be?

    Sorry for the stupid questions. I am completely new to this.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi,

    These are very important concepts. They are rather easy to understand, but hard to explain here in text form. I would like to point you to the video tutorials on our website; there is one complete section about text processing.

    You'll find the link to the tutorials in the post linked in my signature.

    Happy Mining!
      -Marius
  • margkw Member Posts: 14 Contributor II
    THANKS! I will be back with more questions! :D