"stratified sampling (sample size: absolute)"
Hi,
I'm using the Sample (Stratified) operator to draw a sample of 6500 cases from my dataset, which has well over 150000 cases in total. However, even though I set the sample size (absolute) to 6500, I only get 4855 cases as a result of the process. Does anybody have an idea why this might be?
Thanks,
Lise
Answers
hi @lghansse - no that is rather odd. Normally if you ask for n=6500, you get n=6500. Are you doing anything else AFTER the sampling? Maybe Filter Examples?
Please share your XML so we can take a look.
Scott
Hi,
No I'm not doing anything afterwards. I've included my code below.
ok @lghansse sorry for the delay. You have a WHOLE LOT of stuff going on here. What is the rmx_toolkit? That does not look familiar to me. I cannot replicate your process without having this extension.
If I were to take a wild guess I would say that there are some metadata propagation errors going on and hence Sample (which is at the end of this 400 line process) does not know what to do.
Scott
Hi,
No problem. I guess (since I don't write the XML but work with the operators) that the rmx_toolkit is part of the Jackhammer extension, and more specifically the 'execute process' operator.
But, are you saying that if I try to store my list of contacts and sample from that stored list, it might solve the problem? Because in the steps before the sampling I'm just creating and cleaning my list.
Lise
ah...Jackhammer extension. That makes sense.
Yes try just storing and then retrieving the ExampleSet before sampling. That will "refresh" the metadata and may solve your problem.
And if @land is kind enough to get me a Jackhammer license key, I can see if I can replicate your process.
Scott
Hi,
I doubt that it has to do with the meta data. The attribute that is used for sampling is just generated two operators before. I'm missing the data to see what happens.
@lghansse Are you sure that there are more than 6500 examples before sampling? And what does the class distribution look like?
@sgenzer Scott, I'm absolutely fine with providing you a license. It probably makes sense if people ask questions here involving our extensions. They are of course also welcome to ask us directly if our operators are responsible.
Greetings,
Sebastian
Hi @land,
I'm very, very sure that there are more than 6500 examples in the dataset. What's more, if I make the sample size larger, the sample grows with it, but it never matches the requested sample size (if you want, I could share a screenshot with the results before and after sampling).
I want to thank you both in advance for the help, but also add that probably you will not be able to fully recreate my process since the data I use is protected and won't be accessible for you. I just shared my process to give a general overview of what I'm doing in the process (so I fully understand if you can't really help me any further).
Lise
Hi @lghansse, @land,
Shouldn't Sample throw a UserError if Sample Size > ExampleSet.size()?
BR,
Martin
Dortmund, Germany
Hi Martin,
Sample indeed does, Sample(Stratified) does not...Probably there's a logic, but it escapes me right now
I just tested and it cannot be because of the class balances or unused nominal values. This works pretty well.
@lghansse Did you insert a breakpoint before the operator and really check what is delivered to the sample?
Greetings,
Sebastian
Hi,
@sgenzer, I've just tried to store my results just before sampling, but that doesn't solve the issue. The outcome remains the same with or without storing.
@land, yes, I've inserted a breakpoint before. There are over 180,000 examples before sampling, so the size of that dataset really shouldn't be an issue. I played around with the label and I'm guessing it has something to do with how my label is built. However, I cannot think of any statistical reason why the distribution in my label results in a sample of fewer than 5000 examples when the dataset contains more than 30 times the number of examples I'm asking for. I've tried simplifying my label (e.g. using only postal codes or only age groups as the label) and even then I don't get the absolute value I'm asking for in my sample. In the first instance there was a small underestimation of the sample size, in the latter a small overestimation...
Lise
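One possible explanation for a sample that comes out smaller or larger than the requested absolute size, offered purely as a hypothesis and not as a description of how Sample (Stratified) is actually implemented: if a stratified sampler computes a quota for each label value separately and rounds each quota on its own, the rounded quotas need not add up to the requested total. With a label that has many small classes, most quotas round down to zero and the sample shrinks; with classes just large enough to round up, it grows. The Python sketch below illustrates the effect on made-up class counts. The function names and the data are hypothetical and are not taken from RapidMiner or from the process in this thread.

```python
import math


def naive_quotas(class_counts, n):
    """Round each class's proportional share independently (hypothetical scheme).

    The rounded quotas are not guaranteed to sum to n.
    """
    total = sum(class_counts.values())
    return {label: round(n * count / total) for label, count in class_counts.items()}


def largest_remainder_quotas(class_counts, n):
    """Proportional quotas that always sum to n (largest remainder method).

    Assumes n does not exceed the total number of examples.
    """
    total = sum(class_counts.values())
    exact = {label: n * count / total for label, count in class_counts.items()}
    # Start from the floor of every exact share ...
    quotas = {label: math.floor(share) for label, share in exact.items()}
    # ... then hand the missing units to the classes with the largest fractions.
    shortfall = n - sum(quotas.values())
    by_remainder = sorted(exact, key=lambda lbl: exact[lbl] - quotas[lbl], reverse=True)
    for label in by_remainder[:shortfall]:
        quotas[label] += 1
    return quotas


if __name__ == "__main__":
    # Made-up label: one dominant class plus 1000 rare classes of 10 examples each.
    counts = {f"rare_{i}": 10 for i in range(1000)}
    counts["common"] = 170_000

    requested = 6500
    print("requested:        ", requested)
    print("naive rounding:   ", sum(naive_quotas(counts, requested).values()))
    print("largest remainder:", sum(largest_remainder_quotas(counts, requested).values()))
```

If this is what is happening, counting the examples per label value at the breakpoint before sampling should show it: a label with very many rare values would lose the most, and coarser labels such as age groups would deviate only slightly.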