"stratified sampling (sample size: absolute)"

lghansse · October 2018

Hi,

I'm using the Sample (Stratified) operator to draw a sample of 6500 cases from my dataset, which has well over 150000 cases in total. However, even though I set the sample size (absolute) to 6500, I only get 4855 cases as a result of the process. Does anybody have an idea why this might be?

Thanks,

Lise

sgenzer · October 2018

Hi @lghansse - no, that is rather odd. Normally, if you ask for n=6500, you get n=6500. Are you doing anything else AFTER the sampling? Maybe a Filter Examples?

Please share your XML so we can take a look.

Scott
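
To make the "ask for n, get n" expectation concrete, here is a minimal Python sketch of stratified sampling with an absolute sample size. It is not RapidMiner's implementation: the function name `stratified_sample`, the `(id, group)` row layout, and the largest-remainder rounding are all assumptions chosen for illustration. It only shows that proportional allocation per stratum can still return exactly n rows when the rounding leftovers are handed back to the strata.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(rows, key, n, seed=0):
    """Draw exactly n rows, allocating to each stratum in proportion to its size.

    Hypothetical helper for illustration only (not a RapidMiner API).
    """
    if n > len(rows):
        raise ValueError("sample size exceeds number of rows")

    # Group the rows by stratum value.
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    total = len(rows)
    # Ideal (fractional) quota per stratum, then round down.
    quotas = {k: n * len(v) / total for k, v in strata.items()}
    alloc = {k: int(q) for k, q in quotas.items()}

    # Rounding down loses a few slots; give them to the strata
    # with the largest fractional remainders so the total is exactly n.
    leftover = n - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1

    # Sample without replacement inside each stratum.
    rng = random.Random(seed)
    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, alloc[k]))
    return sample


if __name__ == "__main__":
    # Made-up data: 150000 rows in three unevenly sized strata.
    rows = (
        [(i, "a") for i in range(100000)]
        + [(i, "b") for i in range(100000, 145000)]
        + [(i, "c") for i in range(145000, 150000)]
    )
    sample = stratified_sample(rows, key=lambda r: r[1], n=6500)
    print(len(sample), Counter(group for _, group in sample))
```

If a sampler only rounded down or capped each stratum's quota, the result could come in under n, so comparing the per-stratum counts before and after sampling is a reasonable first check.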

lghansse · October 2018

Hi,

No, I'm not doing anything afterwards. I've included my process XML below.

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="jdbc_connectors:read_database" compatibility="8.1.000" expanded="true" height="68" name="Read DB" width="90" x="179" y="85">
        <parameter key="connection" value="%{dbconnection}"/>
        <parameter key="define_query" value="table name"/>
        <parameter key="table_name" value="civicrm_contact"/>
        <enumeration key="parameters"/>
      </operator>
      <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="exclude 1" width="90" x="313" y="85">
        <process expanded="true">
          <operator activated="true" class="rmx_toolkit:execute_process_advanced" compatibility="2.1.692" expanded="true" height="103" name="Get groups" width="90" x="447" y="238">
            <parameter key="process_location" value="//AIVL_OpenMinds/home/fw_operations_ongoing/civi_odbc/contact/civiodbc_wegduwbestand_AIgroups&amp;employees"/>
            <list key="macros"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="8.1.000" expanded="true" height="82" name="Select Attributes (19)" width="90" x="581" y="238">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="functie 4|functie 5|functie 6|contact_id|is_active_employee"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.1.000" expanded="true" height="82" name="Free Memory (10)" width="90" x="45" y="34"/>
          <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="Filter" width="90" x="179" y="34">
            <process expanded="true">
              <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter 1" width="90" x="45" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="is_deleted.ne.1"/>
                </list>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter 2" width="90" x="179" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="is_deceased.ne.1"/>
                </list>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter 3" width="90" x="313" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="do_not_email.ne.1"/>
                </list>
              </operator>
              <connect from_port="in 1" to_op="Filter 1" to_port="example set input"/>
              <connect from_op="Filter 1" from_port="example set output" to_op="Filter 2" to_port="example set input"/>
              <connect from_op="Filter 2" from_port="example set output" to_op="Filter 3" to_port="example set input"/>
              <connect from_op="Filter 3" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="8.1.000" expanded="true" height="82" name="Select Attributes (18)" width="90" x="313" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="hash|api_key|source|is_deceased|employer_id|primary_contact_id|houshold_name|deceased_date|job_title|preferred_mail_format|display_name|legal_identifier|do_not_sms|do_not_phone|do_not_trade|image_URL|preferred_language|formal_title|communication_style_id|nick_name|sort_name|preferred_communication_method|household_name|sic_code|user_unique_id|is_deleted|modified_date"/>
            <parameter key="invert_selection" value="true"/>
          </operator>
          <operator activated="true" class="rename" compatibility="8.1.000" expanded="true" height="82" name="Rename (2)" width="90" x="447" y="34">
            <parameter key="old_name" value="id"/>
            <parameter key="new_name" value="contact_id"/>
            <list key="rename_additional_attributes"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="contact sub_type" width="90" x="581" y="34">
            <parameter key="invert_filter" value="true"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="contact_sub_type.contains.recruiter"/>
              <parameter key="filters_entry_key" value="contact_sub_type.contains.recruiting"/>
            </list>
            <parameter key="filters_logic_and" value="false"/>
          </operator>
          <operator activated="true" class="concurrency:join" compatibility="8.1.000" expanded="true" height="82" name="Join (9)" width="90" x="715" y="34">
            <parameter key="join_type" value="left"/>
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="contact_id" value="contact_id"/>
            </list>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter 4" width="90" x="849" y="34">
            <parameter key="parameter_expression" value="([functie 4] == &quot;Personeelslid&quot; || [functie 5] == &quot;Personeelslid&quot; || [functie 6] == &quot;Personeelslid&quot;) &amp;&amp; (is_active_employee == &quot;1&quot;)"/>
            <parameter key="condition_class" value="expression"/>
            <parameter key="invert_filter" value="true"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="functie 4.equals.Personeelslid"/>
              <parameter key="filters_entry_key" value="functie 5.equals.Personeelslid"/>
              <parameter key="filters_entry_key" value="functie 6.equals.Personeelslid"/>
            </list>
            <parameter key="filters_logic_and" value="false"/>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.1.000" expanded="true" height="82" name="Free Memory (17)" width="90" x="983" y="34"/>
          <connect from_port="in 1" to_op="Free Memory (10)" to_port="through 1"/>
          <connect from_op="Get groups" from_port="output 1" to_op="Select Attributes (19)" to_port="example set input"/>
          <connect from_op="Select Attributes (19)" from_port="example set output" to_op="Join (9)" to_port="right"/>
          <connect from_op="Free Memory (10)" from_port="through 1" to_op="Filter" to_port="in 1"/>
          <connect from_op="Filter" from_port="out 1" to_op="Select Attributes (18)" to_port="example set input"/>
          <connect from_op="Select Attributes (18)" from_port="example set output" to_op="Rename (2)" to_port="example set input"/>
          <connect from_op="Rename (2)" from_port="example set output" to_op="contact sub_type" to_port="example set input"/>
          <connect from_op="contact sub_type" from_port="example set output" to_op="Join (9)" to_port="left"/>
          <connect from_op="Join (9)" from_port="join" to_op="Filter 4" to_port="example set input"/>
          <connect from_op="Filter 4" from_port="example set output" to_op="Free Memory (17)" to_port="through 1"/>
          <connect from_op="Free Memory (17)" from_port="through 1" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">Exclude some technical groups</description>
      </operator>
      <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="exclude 2" width="90" x="447" y="85">
        <process expanded="true">
          <operator activated="true" class="set_role" compatibility="8.1.000" expanded="true" height="82" name="Set Role (2)" width="90" x="447" y="34">
            <parameter key="attribute_name" value="contact_id"/>
            <parameter key="target_role" value="id"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="rmx_toolkit:cache" compatibility="2.1.692" expanded="true" height="82" name="activity_contact" width="90" x="45" y="187">
            <parameter key="cache_name" value="activity_contact"/>
            <enumeration key="cache_dependencies"/>
            <process expanded="true">
              <operator activated="true" class="jdbc_connectors:read_database" compatibility="8.1.000" expanded="true" height="68" name="Read activities (3)" width="90" x="112" y="34">
                <parameter key="connection" value="AIVLtestsplit1"/>
                <parameter key="define_query" value="table name"/>
                <parameter key="table_name" value="civicrm_activity"/>
                <enumeration key="parameters"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter Examples (12)" width="90" x="246" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="is_test.ne.1"/>
                  <parameter key="filters_entry_key" value="is_deleted.ne.1"/>
                </list>
              </operator>
              <operator activated="true" class="rename" compatibility="8.1.000" expanded="true" height="82" name="Rename (4)" width="90" x="380" y="34">
                <parameter key="old_name" value="id"/>
                <parameter key="new_name" value="activity_id"/>
                <list key="rename_additional_attributes"/>
              </operator>
              <operator activated="true" class="jdbc_connectors:read_database" compatibility="8.1.000" expanded="true" height="68" name="Read activity_contact (3)" width="90" x="112" y="238">
                <parameter key="connection" value="AIVLtestsplit1"/>
                <parameter key="define_query" value="table name"/>
                <parameter key="table_name" value="civicrm_activity_contact"/>
                <enumeration key="parameters"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter Examples (13)" width="90" x="246" y="238">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="record_type_id.eq.3"/>
                </list>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="8.1.000" expanded="true" height="82" name="Select Attributes (13)" width="90" x="380" y="238">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attributes" value="contact_id|activity_id"/>
              </operator>
              <operator activated="true" class="concurrency:join" compatibility="8.1.000" expanded="true" height="82" name="Join (28)" width="90" x="581" y="34">
                <parameter key="join_type" value="right"/>
                <parameter key="use_id_attribute_as_key" value="false"/>
                <list key="key_attributes">
                  <parameter key="activity_id" value="activity_id"/>
                </list>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter Examples (14)" width="90" x="715" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="is_test.ne.1"/>
                </list>
              </operator>
              <connect from_op="Read activities (3)" from_port="output" to_op="Filter Examples (12)" to_port="example set input"/>
              <connect from_op="Filter Examples (12)" from_port="example set output" to_op="Rename (4)" to_port="example set input"/>
              <connect from_op="Rename (4)" from_port="example set output" to_op="Join (28)" to_port="left"/>
              <connect from_op="Read activity_contact (3)" from_port="output" to_op="Filter Examples (13)" to_port="example set input"/>
              <connect from_op="Filter Examples (13)" from_port="example set output" to_op="Select Attributes (13)" to_port="example set input"/>
              <connect from_op="Select Attributes (13)" from_port="example set output" to_op="Join (28)" to_port="right"/>
              <connect from_op="Join (28)" from_port="join" to_op="Filter Examples (14)" to_port="example set input"/>
              <connect from_op="Filter Examples (14)" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="8.1.000" expanded="true" height="82" name="Select Attributes (20)" width="90" x="179" y="187">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="activity_date_time|activity_id|activity_type_id|contact_id|subject|campaign_id"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter actiepakket bestellers" width="90" x="313" y="187">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="subject.contains.Schrijfmarathon actiepakketten"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="8.1.000" expanded="true" height="82" name="Set Role (3)" width="90" x="447" y="187">
            <parameter key="attribute_name" value="contact_id"/>
            <parameter key="target_role" value="id"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="set_minus" compatibility="8.1.000" expanded="true" height="82" name="Set Minus" width="90" x="581" y="34"/>
          <operator activated="true" class="free_memory" compatibility="8.1.000" expanded="true" height="82" name="Free Memory (18)" width="90" x="782" y="34"/>
          <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="filter petitieondertekenaars" width="90" x="447" y="340">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="activity_type_id.eq.93"/>
            </list>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="petitions (4)" width="90" x="581" y="340">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="campaign_id.eq.1348"/>
              <parameter key="filters_entry_key" value="campaign_id.eq.1349"/>
              <parameter key="filters_entry_key" value="campaign_id.eq.1350"/>
              <parameter key="filters_entry_key" value="campaign_id.eq.1351"/>
              <parameter key="filters_entry_key" value="campaign_id.eq.1352"/>
              <parameter key="filters_entry_key" value="campaign_id.eq.1353"/>
              <parameter key="filters_entry_key" value="campaign_id.eq.1354"/>
              <parameter key="filters_entry_key" value="campaign_id.eq.1355"/>
              <parameter key="filters_entry_key" value="campaign_id.eq.1356"/>
              <parameter key="filters_entry_key" value="campaign_id.eq.1357"/>
            </list>
            <parameter key="filters_logic_and" value="false"/>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="petitions (5)" width="90" x="715" y="340">
            <list key="filters_list">
              <parameter key="filters_entry_key" value="activity_date_time.ge.12/08/2017 00:00:00 AM"/>
              <parameter key="filters_entry_key" value="activity_date_time.le.12/31/2017 00:00:00 PM"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="8.1.000" expanded="true" height="82" name="Set Role (4)" width="90" x="849" y="340">
            <parameter key="attribute_name" value="contact_id"/>
            <parameter key="target_role" value="id"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="set_minus" compatibility="8.1.000" expanded="true" height="82" name="Set Minus (2)" width="90" x="1050" y="34"/>
          <connect from_port="in 1" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Set Minus" to_port="example set input"/>
          <connect from_op="activity_contact" from_port="output 1" to_op="Select Attributes (20)" to_port="example set input"/>
          <connect from_op="Select Attributes (20)" from_port="example set output" to_op="Filter actiepakket bestellers" to_port="example set input"/>
          <connect from_op="Filter actiepakket bestellers" from_port="example set output" to_op="Set Role (3)" to_port="example set input"/>
          <connect from_op="Filter actiepakket bestellers" from_port="unmatched example set" to_op="filter petitieondertekenaars" to_port="example set input"/>
          <connect from_op="Set Role (3)" from_port="example set output" to_op="Set Minus" to_port="subtrahend"/>
          <connect from_op="Set Minus" from_port="example set output" to_op="Free Memory (18)" to_port="through 1"/>
          <connect from_op="Free Memory (18)" from_port="through 1" to_op="Set Minus (2)" to_port="example set input"/>
          <connect from_op="filter petitieondertekenaars" from_port="example set output" to_op="petitions (4)" to_port="example set input"/>
          <connect from_op="petitions (4)" from_port="example set output" to_op="petitions (5)" to_port="example set input"/>
          <connect from_op="petitions (5)" from_port="example set output" to_op="Set Role (4)" to_port="example set input"/>
          <connect from_op="Set Role (4)" from_port="example set output" to_op="Set Minus (2)" to_port="subtrahend"/>
          <connect from_op="Set Minus (2)" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" breakpoints="after" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="add information" width="90" x="581" y="85">
        <process expanded="true">
          <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="read DB" width="90" x="112" y="136">
            <process expanded="true">
              <operator activated="true" class="jdbc_connectors:read_database" compatibility="8.1.000" expanded="true" height="68" name="Read DB (2)" width="90" x="45" y="34">
                <parameter key="connection" value="AIVLtestsplit1"/>
                <parameter key="define_query" value="table name"/>
                <parameter key="table_name" value="civicrm_email"/>
                <enumeration key="parameters"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter 1 (2)" width="90" x="179" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="is_primary.eq.1"/>
                </list>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter 2 (2)" width="90" x="313" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="location_type_id.ne.11"/>
                </list>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="8.1.000" expanded="true" height="82" name="Select Attributes (11)" width="90" x="447" y="34">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attributes" value="email|contact_id"/>
              </operator>
              <connect from_op="Read DB (2)" from_port="output" to_op="Filter 1 (2)" to_port="example set input"/>
              <connect from_op="Filter 1 (2)" from_port="example set output" to_op="Filter 2 (2)" to_port="example set input"/>
              <connect from_op="Filter 2 (2)" from_port="example set output" to_op="Select Attributes (11)" to_port="example set input"/>
              <connect from_op="Select Attributes (11)" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="concurrency:join" compatibility="8.1.000" expanded="true" height="82" name="Join (27)" width="90" x="246" y="34">
            <parameter key="join_type" value="left"/>
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="contact_id" value="contact_id"/>
            </list>
          </operator>
          <operator activated="true" class="free_memory" compatibility="8.1.000" expanded="true" height="82" name="Free Memory (19)" width="90" x="380" y="34"/>
          <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="add groups" width="90" x="514" y="34">
            <process expanded="true">
              <operator activated="true" class="generate_attributes" compatibility="8.1.000" expanded="true" height="82" name="Generate Age (2)" width="90" x="45" y="34">
                <list key="function_descriptions">
                  <parameter key="contact_agenow" value="(date_diff(birth_date,date_now())/86400000)/365.25"/>
                </list>
              </operator>
              <operator activated="true" class="generate_copy" compatibility="8.1.000" expanded="true" height="82" name="Generate Copy (2)" width="90" x="179" y="34">
                <parameter key="attribute_name" value="contact_agenow"/>
                <parameter key="new_name" value="age_groups"/>
              </operator>
              <operator activated="true" class="discretize_by_user_specification" compatibility="8.1.000" expanded="true" height="103" name="Discretize (3)" width="90" x="313" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="age_groups"/>
                <list key="classes">
                  <parameter key="0-25" value="25.0"/>
                  <parameter key="26-35" value="35.0"/>
                  <parameter key="36-45" value="45.0"/>
                  <parameter key="46-55" value="55.0"/>
                  <parameter key="56-65" value="65.0"/>
                  <parameter key="66-75" value="75.0"/>
                  <parameter key="76-85" value="85.0"/>
                  <parameter key="86-95" value="95.0"/>
                  <parameter key="96-105" value="405.0"/>
                </list>
              </operator>
              <operator activated="true" class="free_memory" compatibility="8.1.000" expanded="true" height="82" name="Free Memory (20)" width="90" x="447" y="34"/>
              <connect from_port="in 1" to_op="Generate Age (2)" to_port="example set input"/>
              <connect from_op="Generate Age (2)" from_port="example set output" to_op="Generate Copy (2)" to_port="example set input"/>
              <connect from_op="Generate Copy (2)" from_port="example set output" to_op="Discretize (3)" to_port="example set input"/>
              <connect from_op="Discretize (3)" from_port="example set output" to_op="Free Memory (20)" to_port="through 1"/>
              <connect from_op="Free Memory (20)" from_port="through 1" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="add fin info" width="90" x="648" y="34">
            <process expanded="true">
              <operator activated="true" class="retrieve" compatibility="8.1.000" expanded="true" height="68" name="Retrieve tussenstap_acquisitionSM2018" width="90" x="246" y="136">
                <parameter key="repository_entry" value="tussenstap_acquisitionSM2018"/>
              </operator>
              <operator activated="true" class="concurrency:join" compatibility="8.1.000" expanded="true" height="82" name="Join (26)" width="90" x="514" y="34">
                <parameter key="join_type" value="left"/>
                <parameter key="use_id_attribute_as_key" value="false"/>
                <list key="key_attributes">
                  <parameter key="contact_id" value="contact_id"/>
                </list>
              </operator>
              <operator activated="true" class="free_memory" compatibility="8.1.000" expanded="true" height="82" name="Free Memory (22)" width="90" x="648" y="34"/>
              <connect from_port="in 1" to_op="Join (26)" to_port="left"/>
              <connect from_op="Retrieve tussenstap_acquisitionSM2018" from_port="output" to_op="Join (26)" to_port="right"/>
              <connect from_op="Join (26)" from_port="join" to_op="Free Memory (22)" to_port="through 1"/>
              <connect from_op="Free Memory (22)" from_port="through 1" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="read DB (2)" width="90" x="648" y="136">
            <process expanded="true">
              <operator activated="true" class="jdbc_connectors:read_database" compatibility="8.1.000" expanded="true" height="68" name="read address" width="90" x="45" y="34">
                <parameter key="connection" value="%{dbconnection}"/>
                <parameter key="define_query" value="table name"/>
                <parameter key="table_name" value="civicrm_address"/>
                <enumeration key="parameters"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter Examples (11)" width="90" x="179" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="is_primary.eq.1"/>
                </list>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="8.1.000" expanded="true" height="82" name="Select Attributes (9)" width="90" x="313" y="34">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attributes" value="contact_id|postal_code"/>
              </operator>
              <connect from_op="read address" from_port="output" to_op="Filter Examples (11)" to_port="example set input"/>
              <connect from_op="Filter Examples (11)" from_port="example set output" to_op="Select Attributes (9)" to_port="example set input"/>
              <connect from_op="Select Attributes (9)" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="concurrency:join" compatibility="8.1.000" expanded="true" height="82" name="Join (15)" width="90" x="782" y="34">
            <parameter key="join_type" value="left"/>
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="contact_id" value="contact_id"/>
            </list>
          </operator>
          <connect from_port="in 1" to_op="Join (27)" to_port="left"/>
          <connect from_op="read DB" from_port="out 1" to_op="Join (27)" to_port="right"/>
          <connect from_op="Join (27)" from_port="join" to_op="Free Memory (19)" to_port="through 1"/>
          <connect from_op="Free Memory (19)" from_port="through 1" to_op="add groups" to_port="in 1"/>
          <connect from_op="add groups" from_port="out 1" to_op="add fin info" to_port="in 1"/>
          <connect from_op="add fin info" from_port="out 1" to_op="Join (15)" to_port="left"/>
          <connect from_op="read DB (2)" from_port="out 1" to_op="Join (15)" to_port="right"/>
          <connect from_op="Join (15)" from_port="join" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="Subprocess (5)" width="90" x="715" y="85">
        <process expanded="true">
          <operator activated="true" class="subprocess" compatibility="8.1.000" expanded="true" height="82" name="processing" width="90" x="179" y="34">
            <process expanded="true">
              <operator activated="true" class="free_memory" compatibility="8.1.000" expanded="true" height="82" name="Free Memory (23)" width="90" x="45" y="34"/>
              <operator activated="true" class="numerical_to_polynominal" compatibility="8.1.000" expanded="true" height="82" name="Numerical to Polynominal (2)" width="90" x="179" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="gender_id"/>
              </operator>
              <operator activated="true" class="replace_missing_values" compatibility="8.1.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="313" y="34">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attributes" value="age_groups|gender_id"/>
                <parameter key="default" value="value"/>
                <list key="columns"/>
                <parameter key="replenishment_value" value="unknown"/>
              </operator>
              <operator activated="true" class="replace_missing_values" compatibility="8.1.000" expanded="true" height="103" name="Replace Missing Values (2)" width="90" x="447" y="34">
                <parameter key="attribute_filter_type" value="subset"/>
                <parameter key="attributes" value="gift_ever"/>
                <parameter key="default" value="value"/>
                <list key="columns"/>
                <parameter key="replenishment_value" value="N"/>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="8.1.000" expanded="true" height="103" name="Filter Examples (15)" width="90" x="648" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="email.is_not_missing."/>
                </list>
              </operator>
              <connect from_port="in 1" to_op="Free Memory (23)" to_port="through 1"/>
              <connect from_op="Free Memory (23)" from_port="through 1" to_op="Numerical to Polynominal (2)" to_port="example set input"/>
              <connect from_op="Numerical to Polynominal (2)" from_port="example set output" to_op="Replace Missing Values" to_port="example set input"/>
              <connect from_op="Replace Missing Values" from_port="example set output" to_op="Replace Missing Values (2)" to_port="example set input"/>
              <connect from_op="Replace Missing Values (2)" from_port="example set output" to_op="Filter Examples (15)" to_port="example set input"/>
              <connect from_op="Filter Examples (15)" from_port="example set output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="source_in 2" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="8.1.000" expanded="true" height="82" name="Generate Attributes (6)" width="90" x="313" y="34">
            <list key="function_descriptions">
              <parameter key="strat_label" value="concat(age_groups, gender_id,gift_ever, postal_code)"/>
            </list>
          </operator>
          <operator activated="true" class="set_role" compatibility="8.1.000" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
            <parameter key="attribute_name" value="strat_label"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" breakpoints="after" class="generate_weight_stratification" compatibility="8.1.000" expanded="true" height="82" name="Generate Weight (Stratification)" width="90" x="581" y="34">
            <parameter key="total_weight" value="1000.0"/>
          </operator>
          <operator activated="true" class="sample_stratified" compatibility="8.1.000" expanded="true" height="82" name="Sample (Stratified)" width="90" x="715" y="34">
            <parameter key="sample_size" value="6500"/>
          </operator>
          <connect from_port="in 1" to_op="processing" to_port="in 1"/>
          <connect from_op="processing" from_port="out 1" to_op="Generate Attributes (6)" to_port="example set input"/>
          <connect from_op="Generate Attributes (6)" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Generate Weight (Stratification)" to_port="example set input"/>
          <connect from_op="Generate Weight (Stratification)" from_port="example set output" to_op="Sample (Stratified)" to_port="example set input"/>
          <connect from_op="Sample (Stratified)" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
        <description align="center" color="transparent" colored="false" width="126">set label and get sample</description>
      </operator>
      <connect from_op="Read DB" from_port="output" to_op="exclude 1" to_port="in 1"/>
      <connect from_op="exclude 1" from_port="out 1" to_op="exclude 2" to_port="in 1"/>
      <connect from_op="exclude 2" from_port="out 1" to_op="add information" to_port="in 1"/>
      <connect from_op="add information" from_port="out 1" to_op="Subprocess (5)" to_port="in 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
    </process>
  </operator>
</process>

sgenzer · October 2018

ok @lghansse sorry for the delay. You have a WHOLE LOT of stuff going on here. What is the rmx_toolkit? That does not look familiar to me. I cannot replicate your process without having this extension.

If I were to take a wild guess, I would say that there are some metadata propagation errors going on, and hence Sample (which is at the end of this 400-line process) does not know what to do.

Scott

lghansse · October 2018

Hi,

No problem. I guess (since I don't write the XML myself, but work with the operators) that the rmx_toolkit is part of the Jackhammer extension, and more specifically the 'execute process' operator.

But, are you saying that if I try to store my list of contacts and sample from that stored list, it might solve the problem? Because in the steps before the sampling I'm just creating and cleaning my list.

Lise

sgenzer · October 2018

ah...Jackhammer extension. That makes sense.

Yes, try just storing and then retrieving the ExampleSet before sampling. That will "refresh" the metadata and may solve your problem.

And if @land is kind enough to get me a Jackhammer license key, I can see if I can replicate your process.

Scott

land · October 2018

Hi,

I doubt that it has to do with the metadata. The attribute that is used for sampling is generated just two operators before. Without the data I can't see what happens.

@lghansse Are you sure that there are more than 6500 examples before sampling? And what does the class distribution look like?

@sgenzer Scott, I'm absolutely fine with providing you a license. It probably makes sense if people ask questions here involving our extensions. Although they are of course welcome to ask us directly as well, if our operators are responsible.

Greetings,

Sebastian

lghansse · October 2018

Hi @land,

I'm very, very sure that there are more than 6500 examples in the dataset. What's more, if I make the sample size larger, the sample grows with it, but it never matches the requested sample size (if you want, I could share a screenshot of the results before and after sampling).

I want to thank you both in advance for the help, but I should add that you probably won't be able to fully recreate my process, since the data I use is protected and won't be accessible to you. I just shared my process to give a general overview of what I'm doing (so I fully understand if you can't really help me any further).

Lise

MartinLiebig · October 2018

Hi @lghansse, @land,

shouldn't Sample throw a UserError if Sample Size > ExampleSet.size()?

BR,

Martin

land · October 2018

Hi Martin,

Sample indeed does, Sample (Stratified) does not. There's probably a logic to it, but it escapes me right now.

I just tested and it cannot be because of the class balances or unused nominal values. This works pretty well.

@lghansse Did you insert a breakpoint before the sampling and really check what is delivered to the Sample operator?

Greetings,

Sebastian

lghansse · October 2018

Hi,

@sgenzer, I've tried storing my results right before sampling, but that doesn't solve the issue. The outcome remains the same with or without storing.

@land, yes, I've inserted a breakpoint before. There are over 180.000 examples before sampling, so the size of that dataset really shouldn't be an issue. I played around with the label and I'm guessing it has something to do with how my label is built. However, I cannot think of any statistical reason why the distribution in my label results in a sample of fewer than 5000 examples, when the dataset contains roughly 30 times as many examples as the sample I'm asking for. I've tried simplifying my label (e.g. using only postal codes or only age groups as the label) and even then I don't get the absolute value I'm asking for in my sample. In the first instance the sample was slightly smaller than requested, in the latter slightly larger...

Lise
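
One possible explanation, which nobody in this thread has confirmed: if a stratified sampler gives each stratum its proportional share of the requested size and rounds that share independently, the rounded shares need not add up to the requested total. With a label built from `concat(age_groups, gender_id, gift_ever, postal_code)` there may be thousands of tiny strata whose share rounds down to zero. The sketch below only illustrates that arithmetic. The function name `proportional_allocation`, the rounding rule, and the stratum sizes are all hypothetical, and this is not taken from RapidMiner's implementation of Sample (Stratified).

```python
def proportional_allocation(strata_sizes, requested):
    """Give each stratum its proportional share of `requested`,
    rounded independently (an ASSUMED scheme, not RapidMiner's code)."""
    total = sum(strata_sizes.values())
    return {
        label: min(size, round(requested * size / total))
        for label, size in strata_sizes.items()
    }

# Hypothetical data: three large strata plus 6000 tiny strata of 5 rows each,
# 180000 rows in total.
strata = {"big_a": 60000, "big_b": 50000, "big_c": 40000}
strata.update({f"tiny_{i}": 5 for i in range(6000)})

allocation = proportional_allocation(strata, 6500)
actual = sum(allocation.values())
# Each tiny stratum's share is about 0.18 rows, which rounds to 0,
# so the total falls short of the requested 6500.
print("requested 6500, got", actual)

# With a coarse label the rounding can go the other way:
# three equal strata each have a share of about 6.67, rounded up to 7.
coarse = proportional_allocation({"a": 100, "b": 100, "c": 100}, 20)
print("requested 20, got", sum(coarse.values()))
```

If something like this is at play, counting the examples per `strat_label` value before sampling would show how many strata are too small to contribute even one example at a ratio of 6500 out of 180.000.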
