I want to duplicate data in RapidMiner
Best Answers
lionelderkrikor (RapidMiner Certified Analyst, Member):
I think you can use the Sample operator and define the ratio you want to obtain between your two classes.
Regards,
Lionel
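As a rough illustration of the idea behind sampling with a chosen size per class, here is a minimal Python sketch. It is not RapidMiner code and does not reproduce the Sample operator's implementation; the helper `sample_per_class`, the `label` key and the class names `majority` / `minority` are all hypothetical and used only for the example.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key, size_per_class, seed=2001):
    """Draw a fixed number of rows from each class, without replacement.

    Hypothetical helper: it only mimics the idea of per-class sampling.
    """
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    sampled = []
    for label, wanted in size_per_class.items():
        pool = by_class[label]
        # Without replacement we can never return more rows than exist.
        sampled.extend(rng.sample(pool, min(wanted, len(pool))))
    return sampled


# Toy data (assumed): 6 majority rows and 2 minority rows.
rows = ([{"id": i, "label": "majority"} for i in range(6)]
        + [{"id": i, "label": "minority"} for i in range(6, 8)])

balanced = sample_per_class(rows, "label", {"majority": 2, "minority": 2})
```

Because this kind of sampling draws without replacement, it balances the classes by shrinking the larger class; it cannot create additional copies of the smaller one.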
varunm1 (Member):
@lionelderkrikor explained it well in his answer. You can also use the "SMOTE Upsampling" operator, which gives you data with an equal number of examples in each class. The sample process below uses the Titanic Training dataset, where the label has 349 "Yes" and 567 "No" examples. Once the SMOTE operator is applied, each class has 567 examples. To use the XML code below, copy it, go to View --> Show Panel --> XML, paste it into the XML window, and click the green tick mark; the process will then appear.
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="136">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="operator_toolbox:smote" compatibility="1.8.000" expanded="true" height="82" name="SMOTE Upsampling" width="90" x="313" y="136">
        <parameter key="number_of_neighbours" value="5"/>
        <parameter key="normalize" value="true"/>
        <parameter key="equalize_classes" value="true"/>
        <parameter key="upsampling_size" value="1000"/>
        <parameter key="auto_detect_minority_class" value="true"/>
        <parameter key="round_integers" value="true"/>
        <parameter key="nominal_change_rate" value="0.5"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="SMOTE Upsampling" to_port="exa"/>
      <connect from_op="SMOTE Upsampling" from_port="ups" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
In general, we work with the imbalanced data as it is rather than resampling it, because this is the kind of data we have to deal with in real-world settings. If sampling works well for your task, you can follow the process above.
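To make the SMOTE idea a bit more concrete, here is a simplified Python sketch of how synthetic minority examples can be generated by interpolating between a minority point and one of its nearest minority neighbours. It is only an illustration under stated assumptions, not the operator's actual implementation: it handles numeric attributes only, and the function name `smote_sketch`, the toy points and the assumed majority size are all made up for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=1992):
    """Create n_new synthetic points between minority points and their neighbours.

    Simplified illustration for numeric attributes only (assumption);
    needs at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Random position on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy data (assumed): 4 minority points, and a majority class of 10 examples.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
majority_size = 10

# Generate just enough synthetic points to match the majority class size.
new_points = smote_sketch(minority, majority_size - len(minority), k=3)
```

Unlike plain duplication, each new point here lies somewhere between two existing minority points, so the added examples are similar to the originals but not exact copies.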
Hope this helps.
Regards,
Varun
https://www.varunmandalapu.com/
Be Safe. Follow precautions and Maintain Social Distancing
Answers
I am having difficulty understanding. Can you provide an example of what you have and what you want to obtain?
Regards,
Lionel
Are you asking how to make copies of the same data to attach to different operators? If so, you can use the Multiply operator. It gives you as many copies of the same data as you need.
Thanks,
Varun
I have client data labeled as "Bad clients" and "Good clients". My problem is that I have a lot of "Good clients" and very few "Bad clients". I want to make copies of the "Bad clients" in order to increase prediction accuracy. I plan to build a credit scoring model by identifying predictors of "Bad clients".
Here is an example of my data:
att1  att2  att3  att4
M     No    Yes   Good client
M     Yes   Yes   Good client
M     Yes   No    Good client
M     Yes   No    Good client
M     No    No    Good client
M     No    Yes   Good client
M     Yes   No    Bad client
I want to obtain this, where the "Bad clients" are duplicated:
att1  att2  att3  att4
M     No    Yes   Good client
M     Yes   Yes   Good client
M     Yes   No    Good client
M     Yes   No    Good client
M     No    No    Good client
M     No    Yes   Good client
M     Yes   No    Bad client
M     Yes   No    Bad client
M     Yes   No    Bad client
M     Yes   No    Bad client
M     Yes   No    Bad client
As a result, I will have an approximately equal number of "Bad clients" and "Good clients".
Sorry for the inconvenience, I am new here =)
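The duplication described in this question can be sketched in a few lines of Python. This is only an illustration of the idea (random oversampling by copying minority rows), not a RapidMiner process; the helper name `oversample_by_duplication` and the tuple layout of the rows are assumptions made for the example, while the rows themselves are taken from the table above.

```python
import random
from collections import Counter


def oversample_by_duplication(rows, label_index, seed=2001):
    """Duplicate rows of the smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(row[label_index] for row in rows)
    target = max(counts.values())

    result = list(rows)
    for label, count in counts.items():
        pool = [row for row in rows if row[label_index] == label]
        # Draw the missing rows with replacement, i.e. as exact copies.
        result.extend(rng.choice(pool) for _ in range(target - count))
    return result


# Rows from the example table: att1, att2, att3, att4 (att4 is the label).
clients = [
    ("M", "No", "Yes", "Good client"),
    ("M", "Yes", "Yes", "Good client"),
    ("M", "Yes", "No", "Good client"),
    ("M", "Yes", "No", "Good client"),
    ("M", "No", "No", "Good client"),
    ("M", "No", "Yes", "Good client"),
    ("M", "Yes", "No", "Bad client"),
]

balanced = oversample_by_duplication(clients, label_index=3)
print(Counter(row[3] for row in balanced))
```

With the seven example rows, the single "Bad client" row is copied until both classes have six rows. If a model is later validated on held-out data, it is usually safer to duplicate only within the training portion, so that copies of the same row do not end up in both training and test sets.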