I want to duplicate data in RapidMiner

Adiletkgz · March 2019

Could you help me find an operator that duplicates rows belonging to a specific labeled class, so that I end up with more examples of that class?

lionelderkrikor · March 2019

I think you can use the Sample operator and set the ratio you want to obtain between your two classes.

Regards

Lionel.

varunm1 · March 2019

@lionelderkrikor covered this well in his answer. You can also use the "SMOTE Upsampling" operator (from the Operator Toolbox extension), which gives you an equal number of examples in each class. The sample below uses the Titanic Training dataset, where the label counts are Yes (349) and No (567). Once the SMOTE operator is applied, each class has 567 examples. The XML code is below. To use it, copy the code, go to View --> Show Panel --> XML, paste the code into the XML window, and click the green tick mark. The process will then appear on the canvas.

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="9.2.001" expanded="true" height="68" name="Retrieve Titanic Training" width="90" x="45" y="136">
        <parameter key="repository_entry" value="//Samples/data/Titanic Training"/>
      </operator>
      <operator activated="true" class="operator_toolbox:smote" compatibility="1.8.000" expanded="true" height="82" name="SMOTE Upsampling" width="90" x="313" y="136">
        <parameter key="number_of_neighbours" value="5"/>
        <parameter key="normalize" value="true"/>
        <parameter key="equalize_classes" value="true"/>
        <parameter key="upsampling_size" value="1000"/>
        <parameter key="auto_detect_minority_class" value="true"/>
        <parameter key="round_integers" value="true"/>
        <parameter key="nominal_change_rate" value="0.5"/>
        <parameter key="use_local_random_seed" value="false"/>
        <parameter key="local_random_seed" value="1992"/>
      </operator>
      <connect from_op="Retrieve Titanic Training" from_port="output" to_op="SMOTE Upsampling" to_port="exa"/>
      <connect from_op="SMOTE Upsampling" from_port="ups" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
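
To make the idea behind the process above more concrete, here is a minimal Python sketch of SMOTE-style interpolation, written outside RapidMiner. It is an illustration only, not the operator's actual implementation: the function name `smote_like_upsample`, the toy points, and the class sizes are all hypothetical, and it handles numeric attributes only (no normalization and no nominal attributes, which the operator's `normalize` and `nominal_change_rate` parameters address).

```python
import random


def smote_like_upsample(minority, n_new, k=5, seed=1992):
    """Create n_new synthetic points, each placed on the line segment between
    a minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical numeric minority-class points (not taken from the Titanic data).
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
majority_count = 7  # hypothetical size of the majority class

# Add just enough synthetic points to match the majority class.
new_points = smote_like_upsample(minority, majority_count - len(minority), k=2)
balanced_minority = minority + new_points
print(len(balanced_minority))  # 7
```

Unlike plain duplication, the added rows here are new points lying between existing minority examples, which is why SMOTE output contains values that never appeared in the original data.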

In general, we work with the imbalanced data as it is rather than resampling it, because this is the kind of data we have to deal with in real-world settings. If resampling works well for your use case, you can follow the process above.

Hope this helps.

lionelderkrikor · March 2019

Hi @Adiletkgz,

I am having difficulty understanding. Can you provide an example of what you have and what you want to obtain?

Regards,

Lionel

varunm1 · March 2019

Hi @Adiletkgz

Are you asking how to make copies of the same data so you can connect it to different operators? If so, you can use the Multiply operator. It gives you as many copies of the same data as you need.

Thanks,
Varun

Adiletkgz · March 2019

Hello friends!
I have client data in which each client is labeled as either a "Bad client" or a "Good client". My problem is that I have a lot of "Good clients" and very few "Bad clients". I want to make copies of the "Bad clients" in order to increase prediction accuracy. I plan to build a credit scoring model by identifying predictors of "Bad clients".
Here is an example of my data:
att1 att2 att3 att4
M No Yes Good client
M Yes Yes Good client
M Yes No Good client
M Yes No Good client
M No No Good client
M No Yes Good client
M Yes No Bad client

Adiletkgz · March 2019

I want to go from this data, where I have only a few "Bad clients":
att1 att2 att3 att4
M No Yes Good client
M Yes Yes Good client
M Yes No Good client
M Yes No Good client
M No No Good client
M No Yes Good client
M Yes No Bad client

to this, where the "Bad clients" are duplicated:
att1 att2 att3 att4
M No Yes Good client
M Yes Yes Good client
M Yes No Good client
M Yes No Good client
M No No Good client
M No Yes Good client
M Yes No Bad client
M Yes No Bad client
M Yes No Bad client
M Yes No Bad client
M Yes No Bad client

As a result, I will have approximately equal numbers of "Bad clients" and "Good clients".
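
The duplication described above can be sketched in a few lines of Python, outside RapidMiner. This is only an illustration of the idea, not a RapidMiner operator: the function name `oversample_by_duplication` is hypothetical, and the rows are the example data from this post with the label in the last column.

```python
from collections import Counter
from itertools import cycle


def oversample_by_duplication(rows, label_index=-1):
    """Repeat rows of the smaller classes until every class is as large
    as the largest one. Original rows are kept unchanged."""
    counts = Counter(row[label_index] for row in rows)
    target = max(counts.values())
    result = list(rows)
    for label, count in counts.items():
        members = [row for row in rows if row[label_index] == label]
        # cycle() walks the class rows repeatedly, so copies are spread evenly.
        source = cycle(members)
        result.extend(next(source) for _ in range(target - count))
    return result


# Example data from the post: att1, att2, att3, att4 (label).
clients = [
    ("M", "No", "Yes", "Good client"),
    ("M", "Yes", "Yes", "Good client"),
    ("M", "Yes", "No", "Good client"),
    ("M", "Yes", "No", "Good client"),
    ("M", "No", "No", "Good client"),
    ("M", "No", "Yes", "Good client"),
    ("M", "Yes", "No", "Bad client"),
]

balanced = oversample_by_duplication(clients)
print(Counter(row[-1] for row in balanced))  # 6 rows of each class
```

This sketch equalizes the classes exactly (6 and 6), whereas the table above stops at 5 "Bad clients". Stopping early would only require a smaller target count.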

Adiletkgz · March 2019

@lionelderkrikor @varunm1
Sorry for the inconvenience, I am new here =)
