Polynominal Value Reduction
Hi
I would like to replicate a process I have done in Python/scikit-learn/R:
I am looking at advertising click-through rate (CTR) prediction: millions of rows and roughly five polynominal features, each with up to 1000 distinct values (e.g. feature = Website, Country, etc.).
Since the feature data is skewed (many values have very few instances in the data, and a few values dominate), I want to restrict each polynominal feature to the values that shift CTR significantly from the base CTR, and replace the long tail with a single "NA" category per feature.
Is there any way of doing this within RapidMiner?
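For reference, here is a minimal pandas sketch of the kind of reduction I mean; the column names and thresholds (MIN_COUNT, MIN_CTR_SHIFT) are hypothetical placeholders:

```python
import pandas as pd

# Hypothetical example: one polynominal feature ("website") and a click label.
df = pd.DataFrame({
    "website": ["a", "a", "a", "a", "b", "b", "c", "d"],
    "click":   [1,   0,   1,   0,   1,   1,   0,   0],
})

MIN_COUNT = 2          # assumed: minimum rows a value needs to be kept
MIN_CTR_SHIFT = 0.10   # assumed: minimum absolute CTR difference from base CTR

base_ctr = df["click"].mean()
stats = df.groupby("website")["click"].agg(["count", "mean"])

# Keep only values that are frequent enough AND move CTR away from the base rate.
keep = stats[(stats["count"] >= MIN_COUNT) &
             ((stats["mean"] - base_ctr).abs() >= MIN_CTR_SHIFT)].index

# Collapse everything else (the long tail) into a single "NA" category.
df["website_reduced"] = df["website"].where(df["website"].isin(keep), "NA")
print(df["website_reduced"].tolist())
# Note: "a" is collapsed too, because its CTR equals the base CTR exactly.
```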
Answers
As far as I understand the problem, I would do two things first:
- get a sample of your data (reduce rows to ~1%)
- apply the operator "NominalToBinominal"
Then analyse how sparse your data is.
For more advice, examples would be useful.
The feature data is just IDs: WebsiteID, AdID, etc. (e.g. google.com=1, yahoo.com=2, cnbc.com=3, ...), so there is no description of the website.
So yes, I want to do NominalToBinominal, but then (or at the same time, or beforehand) I want to filter out those binominals, e.g. certain websites, for which there is little training data.
(See e.g. http://www.kaggle.com/about/papers on click-through rate.)
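Outside RapidMiner, the sample-then-filter-then-encode order above can be sketched in pandas; the MIN_COUNT cutoff and the example IDs are assumptions:

```python
import pandas as pd

# Hypothetical ID-only feature, as described (no website descriptions, just IDs).
df = pd.DataFrame({"WebsiteID": [1, 1, 1, 2, 2, 3, 4, 5]})

# Step 1: sample rows first; on millions of rows this would be something like
# df.sample(frac=0.01, random_state=0). Here we keep all 8 rows.
sample = df

MIN_COUNT = 2  # assumed cutoff: IDs seen fewer times go into the tail

# Step 2: filter rare IDs into an "NA" bucket *before* dummy-encoding,
# so the binominal expansion stays small.
counts = sample["WebsiteID"].value_counts()
frequent = counts[counts >= MIN_COUNT].index
sample["WebsiteID"] = sample["WebsiteID"].where(
    sample["WebsiteID"].isin(frequent), "NA")

# Step 3: the NominalToBinominal equivalent: one indicator column per value.
binom = pd.get_dummies(sample["WebsiteID"], prefix="WebsiteID")
print(sorted(binom.columns))
```

Doing the frequency filter before the dummy expansion is the point: it keeps the number of indicator columns bounded by the surviving values plus one "NA" column.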