How to generate data points for each row of data based on a frequency and average value?

caryknoop · December 2017

Suppose you have rows of data in the following form:

City     Population   Average Income
--------------------------------------------------
CityA    100,000      60,000
CityB    300,000      40,000
CityC    40,000       70,000

I would like to generate rows with data points based on a given (typically normal) distribution.

Thus, using the above example, we would generate 100,000 + 300,000 + 40,000 = 440,000 rows, each containing a hypothetical income drawn from the given (typically normal) income distribution of the city in question.

lionelderkrikor · December 2017

Hi @caryknoop,

If you didn't find an operator in RapidMiner that does what you want,

here is a process using the Execute Python operator (provided a Python environment is installed on your computer).

You just have to set the standard deviation for each city in the code:

<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.0.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_excel" compatibility="8.0.001" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
        <parameter key="excel_file" value="C:\Users\Lionel\Documents\Formations_DataScience\Rapidminer\Tests_Rapidminer\Generate_Income.xlsx"/>
        <parameter key="imported_cell_range" value="A1:D4"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="City.true.polynominal.attribute"/>
          <parameter key="1" value="Population.true.integer.attribute"/>
          <parameter key="2" value="Average.true.integer.attribute"/>
          <parameter key="3" value="Income.true.attribute_value.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Calculate Income" width="90" x="179" y="34">
        <parameter key="script" value="from numpy.random import normal&#10;&#10;# rm_main is a mandatory function;&#10;# the number of arguments has to match the number of input ports (can be none)&#10;def rm_main(data):&#10;&#10;  # Set the standard deviation for each city:&#10;  # the first element is for CityA, the second for CityB, etc.&#10;  std_deviation = [1, 2, 3]&#10;&#10;  # Remember the number of cities before the loop below appends rows&#10;  n_cities = len(data)&#10;&#10;  # Cumulative population marks where the block of rows for each city ends&#10;  data['pop_cumulative'] = data['Population'].cumsum()&#10;&#10;  for i in range(n_cities):&#10;&#10;    start = 0 if i == 0 else int(data.loc[i-1, 'pop_cumulative'])&#10;    end = int(data.loc[i, 'pop_cumulative'])&#10;&#10;    for j in range(start, end):&#10;&#10;      # Assigning to a row label that does not exist yet appends a new row&#10;      data.loc[j, 'Income'] = normal(data.loc[i, 'Average'], std_deviation[i])&#10;&#10;  # Drop the helper column before returning&#10;  del data['pop_cumulative']&#10;&#10;  # connect 1 output port to see the results&#10;  return data"/>
      </operator>
      <connect from_op="Read Excel" from_port="output" to_op="Calculate Income" to_port="input 1"/>
      <connect from_op="Calculate Income" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
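
The script in the process above appends rows one at a time, which may be slow for 440,000 rows. A vectorized variant of the same idea can be sketched as below; it is not part of the original process. It assumes the input columns are named City, Population and Average as in the example. The helper name `expand_rows`, the `seed` argument and the standard deviations are hypothetical placeholders.

```python
import numpy as np
import pandas as pd


def expand_rows(data, std_deviation, seed=None):
    """Return one row per inhabitant with a simulated income.

    data          -- DataFrame with columns City, Population, Average
    std_deviation -- one standard deviation per city, in row order (assumed values)
    seed          -- optional seed so the draw is reproducible
    """
    rng = np.random.default_rng(seed)
    counts = data["Population"].to_numpy(dtype=int)

    # Repeat each city's name, mean and standard deviation Population times
    city = np.repeat(data["City"].to_numpy(), counts)
    mean = np.repeat(data["Average"].to_numpy(dtype=float), counts)
    std = np.repeat(np.asarray(std_deviation, dtype=float), counts)

    # A single vectorized draw: element k uses mean[k] and std[k]
    income = rng.normal(mean, std)
    return pd.DataFrame({"City": city, "Income": income})


def rm_main(data):
    # Placeholder standard deviations for CityA, CityB, CityC; replace with real ones
    return expand_rows(data, std_deviation=[15000, 10000, 20000])
```

Unlike the row-by-row version, this output keeps the city name on every generated row. Note that a normal distribution can produce negative incomes when the standard deviation is large relative to the mean.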

Regards,

Lionel

NB: I used the attribute names from your example.

NB2: An example Excel file is attached.

earmijo · December 2017

To my knowledge, this cannot be done in RM. RM does not have any random number generators. Of course, it is trivial to do in R (or Python) and you can do it inside RM using the R extension.

lionelderkrikor · December 2017

Hi @caryknoop,

I think it can also be done using the Execute Script operator (which runs Groovy/Java code).

Here is a resource from @mschmitz about generating example sets:

How to Create Example Sets Using Groovy Script

I hope this helps.

Regards,

Lionel

Thomas_Ott · December 2017

I believe you can use the Generate Data by User Specification operator to do this. It has an editor like the one in the Generate Attributes operator, where you can create a function based on a standard deviation or something similar.

Telcontar120 · December 2017

Actually, I think "Generate Data" can be used to do what you want, along with "Generate Gaussian". You simply specify the number of examples you want and then the mean and standard deviation.

caryknoop · January 2018

I would love to use 'generate data' for this, but given that 'generate data' has no input connections, I cannot see how this could work based on input rows.
