[SOLVED] Problem with tokenize

jose Member Posts: 16 Contributor II
edited November 2018 in Help
hello!

My question is this: as I understand it, the Tokenize operator splits sentences into single words. Is there a way to split the sentences into units of two words instead of one word, as the Tokenize operator normally does?

Answers

  • text_miner Member Posts: 11 Contributor II
    Hi Jose,

    Are you asking if you can have terms of more than one word/token?  If so, the answer is yes.  After you tokenize, use the Generate n-Grams (Terms) operator.  This will generate phrases of n sequential tokens.  Note: you will still have the single terms in your term-by-document matrix too.  For example, generating 2-grams you would have "heart", "attack", and "heart attack" in the matrix.
  • jose Member Posts: 16 Contributor II
    ok, perfect,  thanks
  • km Member Posts: 1 Learner III
    How can I have only 2-grams and not 1-grams? e.g. "Heart Attack" and not "Heart" + "Attack" in the matrix?
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi,

    I don't think there is a way to prevent the operator from generating the single terms in the table. You can, however, use a clever regex in Select Attributes to simply remove them.

    ~Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
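
For readers who want to see the idea outside RapidMiner, here is a minimal Python sketch of the two steps discussed above: generating n-grams after tokenizing, then using a regex to keep only the multi-word terms. It is not RapidMiner code. The function names are hypothetical, and the underscore separator between the words of an n-gram is an assumption, so check how the n-gram attributes are actually named in your own process before writing the regex for Select Attributes.

```python
import re

def tokenize(text):
    # Lowercase the text and split on anything that is not a letter a-z.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def generate_ngrams(tokens, max_n=2, sep="_"):
    # Collect all terms of length 1..max_n, so the single tokens stay in
    # the result alongside the longer phrases.
    # The separator "_" is an assumption made for this sketch.
    terms = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            terms.append(sep.join(tokens[i:i + n]))
    return terms

def keep_only_ngrams(terms, sep="_"):
    # Keep only terms that contain the separator, i.e. drop the 1-grams.
    pattern = re.compile(r".+" + re.escape(sep) + r".+")
    return [t for t in terms if pattern.fullmatch(t)]

terms = generate_ngrams(tokenize("Heart attack risk"))
bigrams_only = keep_only_ngrams(terms)
print(terms)
print(bigrams_only)
```

Under the same separator assumption, a pattern such as `.+_.+` is the kind of regex that would match only the multi-word attributes.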