Balancing Data based on class
Hey folks,
I got a bit lost playing with the Sampling operators and I'm not getting anywhere. I have a record set of 150k entries with three classes; two of the classes are very small, fewer than 10k each. I would like to output a result with an equal number of examples from all three classes, so if I have 15k in total then I'll have 5k Class A, 5k Class B and 5k Class C. I will lose a lot of the largest class, but I want to compare all three classes in this way. Would anyone have any pointers? Thanks in advance.
Neil.
Best Answer
IngoRM Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
Hi Neil,
You can use the Sample operator for this with the "balance data" option activated. If you do this, you can specify the desired number of examples for each of your classes. Below is a small example process demonstrating this.
Hope this helps,
Ingo

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="UTF-8"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="34">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="sample" compatibility="9.2.001" expanded="true" height="82" name="Sample" width="90" x="179" y="34">
        <parameter key="sample" value="absolute"/>
        <parameter key="balance_data" value="true"/>
        <parameter key="sample_size" value="100"/>
        <parameter key="sample_ratio" value="0.1"/>
        <parameter key="sample_probability" value="0.1"/>
        <list key="sample_size_per_class">
          <parameter key="Yes" value="200"/>
          <parameter key="No" value="200"/>
        </list>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
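
For the three-class case in the question, the same Sample operator could be adapted roughly as sketched below. This is only a fragment under assumptions, not a tested process: the class names "Class A", "Class B" and "Class C" are placeholders taken from the question and would need to match the actual values of your label attribute, and 5000 is just the per-class size Neil mentioned. The requested size per class should not exceed the number of examples available in the smallest class.

```xml
<!-- Sketch only: replaces the Sample operator in the example process. -->
<!-- Class names are hypothetical placeholders; use your real label values. -->
<operator activated="true" class="sample" compatibility="9.2.001" expanded="true" name="Sample">
  <parameter key="sample" value="absolute"/>
  <parameter key="balance_data" value="true"/>
  <list key="sample_size_per_class">
    <!-- One entry per class: key = class value, value = number of examples to keep -->
    <parameter key="Class A" value="5000"/>
    <parameter key="Class B" value="5000"/>
    <parameter key="Class C" value="5000"/>
  </list>
  <!-- Set a local seed if you want the same sample on every run -->
  <parameter key="use_local_random_seed" value="true"/>
  <parameter key="local_random_seed" value="1992"/>
</operator>
```

With these settings the output would contain 15k examples in total, 5k per class, which discards most of the largest class as described in the question.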
Answers