Stratified sampling with multiple strata
There is an operator called "Sample (Stratified)". As far as I can tell, it can handle only one stratification variable at a time, such as gender (girls vs. boys).
But how should I handle sampling with multiple strata, such as gender (girls/boys), location (area1/area2/area3) and nationality (locals/others)?
Best Answer
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
If you need to treat these as independent attributes and stratify across all 3 variables simultaneously, you will probably have to create a single new attribute using Generate Attributes (with if statements) that represents all the combinations: for example, male area 1 local, female area 1 local, etc. With 2 genders, 3 areas and 2 nationalities, you will have 12 possible values, and you can then compute the proportion of the total that each one should make up by multiplying the individual proportions (which assumes the three variables are independent of each other).
Once you have that, you will be able to use the new attribute when sampling to pull the appropriate number (or proportion) of each of the individual classes.
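To illustrate the idea outside RapidMiner, here is a minimal Python sketch of sampling on a combined stratum key. It is not a RapidMiner process: the function name `stratified_sample`, the column names and the toy data are all hypothetical, and it simply draws the same fraction from every combination.

```python
import random
from collections import defaultdict
from itertools import product

def stratified_sample(rows, keys, fraction, seed=0):
    """Draw the same fraction of rows from every combination of `keys`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        # Combined stratum label, e.g. "F / area1 / locals"
        stratum = " / ".join(str(row[k]) for k in keys)
        groups[stratum].append(row)
    sample = []
    for stratum in sorted(groups):
        members = groups[stratum]
        n = round(fraction * len(members))  # rows to draw from this stratum
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical toy data: 20 persons in each of the 2 x 3 x 2 = 12 combinations.
rows = [
    {"gender": g, "area": a, "nationality": n}
    for g, a, n in product(["F", "M"], ["area1", "area2", "area3"], ["locals", "others"])
    for _ in range(20)
]

sample = stratified_sample(rows, ["gender", "area", "nationality"], fraction=0.25)
```

Because every combination is sampled at the same rate, the sample keeps the joint distribution of the three attributes, not just each marginal distribution.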
Answers
"Generate Weight (Stratification)" works fine with multi-class labels and will assign weights to distribute the sum of weights equally across all classes. However, if you are trying to incorporate information from multiple attributes (as your example seems to suggest), that is much more complicated. But you can always generate your own weights using "Generate Attributes" and define them however you like, and then use "Set Role" to assign your weight variable.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
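The weighting idea mentioned above (spreading a total weight equally across classes) can be sketched in a few lines of Python. This is only an illustration of the principle, not the operator's actual implementation; `balancing_weights` and its `total_weight` argument are hypothetical names.

```python
from collections import Counter

def balancing_weights(labels, total_weight=1.0):
    """Give every class the same share of total_weight, split evenly among its rows."""
    counts = Counter(labels)
    per_class = total_weight / len(counts)  # weight budget for each class
    return [per_class / counts[label] for label in labels]

# Hypothetical example: 3 girls and 1 boy, total weight 4.
labels = ["girl", "girl", "girl", "boy"]
weights = balancing_weights(labels, total_weight=4.0)
```

Here each class receives a weight budget of 2, so the single boy carries weight 2 while each girl carries 2/3.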
Also, I should have mentioned, "Sample (Stratified)" is designed to ensure that you have the same class distribution across your samples, not to balance your classes. It does work with multiple classes but it doesn't do what you want.
If you want a pure sampling solution, you can actually use the normal "Sample" operator and activate the "balance data" parameter (an advanced parameter) and then specify the sample size (absolute or relative) for each class in a multi-class label. But you will only be able to downsample and you can't incorporate information from any other attribute--that's why I first mentioned the weighting alternative.
First, thank you for the fast reply.
To choose the right approach, I should perhaps describe the case:
1. This is a survey with a rather restricted budget and limited resources.
2. The population is 100,000 persons. I know the strata distributions, which the target population (to be surveyed) should also match. The gender distribution is F = 45%, M = 55%; the location distribution is area 1 = 20%, area 2 = 65%, area 3 = 15%; the nationality distribution is locals = 80%, others = 20%.
3. The target (to be surveyed) is around 1,500 persons. The response rate is expected to be 50%.
=> How could this kind of sampling frame be implemented in RM?
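As a rough worked example of the arithmetic, the following Python sketch splits the 1,500 target across the 12 combinations by multiplying the marginal percentages given above. It rests on two assumptions that are not stated in the thread: that gender, location and nationality are independent of each other, and that 1,500 is the number of completed responses wanted, so that a 50% response rate means inviting twice as many people. The function name `allocate` is hypothetical.

```python
from itertools import product

# Marginal shares in percent, as given in the post.
GENDER = {"F": 45, "M": 55}
AREA = {"area1": 20, "area2": 65, "area3": 15}
NATIONALITY = {"locals": 80, "others": 20}
RESPONSE_RATE_PCT = 50

def allocate(total, *marginals):
    """Split `total` across all stratum combinations, assuming independence."""
    scale = 100 ** len(marginals)
    weights = {}
    for combo in product(*(m.items() for m in marginals)):
        key = tuple(name for name, _ in combo)
        w = 1
        for _, pct in combo:
            w *= pct  # joint share = product of marginal shares
        weights[key] = w
    # Largest-remainder rounding, so the cell counts add up exactly to `total`.
    counts = {k: (w * total) // scale for k, w in weights.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(weights, key=lambda k: (weights[k] * total) % scale, reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts

completes = allocate(1500, GENDER, AREA, NATIONALITY)
# Invitations needed per cell (rounded up) if only half of those invited respond.
invites = {k: -(-v * 100 // RESPONSE_RATE_PCT) for k, v in completes.items()}
```

For instance, the cell (F, area1, locals) has a joint share of 0.45 x 0.20 x 0.80 = 7.2%, which is 108 of the 1,500. If the actual joint distribution in the population is known, it should be used instead of the multiplied marginals.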
Thanks,
this is what I was thinking of as a potential solution. Whether it is right or not, at least this explanation confirmed my thinking.