How can I use document metadata in function expressions?

cc4699 Member Posts: 6 Contributor I
edited November 2018 in Help

Hi,

 

I have a very simple process with 3 main operators.

 

1) Get Page

2) Extract Content

3) Process Documents

 

Inside "Process Documents" there are two sub-processes: Tokenize and Aggregate Token Length.

 

Everything works fine in terms of creating the tokens (keywords) with their total occurrences. However, I'm trying to calculate the density of each word and include it as a custom attribute. I have the token_number metadata, which holds the number of keywords, but I cannot seem to access that information. How can I achieve this so that my example set result looks similar to this?

 

Thanks,

Capture.PNG

 

 

Answers

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    If you set the vector creation parameter for your "process documents" operator to term frequency, isn't that calculating what you are looking for?

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • cc4699 Member Posts: 6 Contributor I

    Thanks for the reply. Changing the vector creation parameter makes no difference to the result, whichever option I choose. I always get the same results: total occurrences and document occurrences.

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    It sounds like that is the output from the wordlist. If you also connect the exampleset output port and examine it, you should see each token as its own attribute, with a value based on the term frequency (the percentage of the document's tokens that the token represents).

     

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    You can also use the operator "Wordlist to Data" and then the resulting wordlist will be its own exampleset, at which point you can then use "Generate Attributes" and create whatever transformations/expressions you want with each token as its own example.

  • cc4699 Member Posts: 6 Contributor I

    That's correct, and now I do see the frequency for each term in a single row as you described, but now I'm missing the total occurrence information. I just cannot seem to find a way to merge both sets of data into one result.
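
For reference, the calculation being asked for (total occurrences and density side by side, one row per token) can be sketched outside RapidMiner in plain Python. This is only an illustration of the arithmetic, not a RapidMiner process: the function name `token_density` and the simple non-letter tokenizer are assumptions made for the example, and `token_number` just mirrors the metadata name mentioned in the question.

```python
import re
from collections import Counter


def token_density(text):
    """Return one (token, total_occurrences, density) row per distinct token.

    density = occurrences of the token / total number of tokens in the document.
    """
    # Assumed tokenizer: split on anything that is not a letter, then lowercase.
    tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]
    token_number = len(tokens)  # total tokens in the document
    counts = Counter(tokens)    # total occurrences per token
    # Both values end up in the same row, so no separate merge step is needed.
    return [(tok, n, n / token_number) for tok, n in counts.most_common()]


if __name__ == "__main__":
    for tok, n, density in token_density("the cat sat on the mat the end"):
        print(f"{tok}\t{n}\t{density:.3f}")
```

In RapidMiner terms, this corresponds roughly to having the total occurrences and the document's token count available in the same example set, then dividing one by the other in "Generate Attributes".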
