how to parse and move vendor contact data into correct columns when some fields are missing

dataguy · March 2017

Dear Rapidminer Support,

I have vendor data which has missing cells because it was taken from a javascript parsed records page (knockout.js)?
The spreadsheet has multiple vendors for each manufacturers product. The issue was not discovered until after the website was taken offline.

The order of the data will always be the same columnwise (an eight column contact detail per vendor format):

vendor_name_1, vendor_address_1, vendor_phone_1, vendor_fax_1, vendor_contact_mobile_1, vendor_contact_email_1, etc.

When there is more than one vendor for that product (almost all most do), there is another repeat of the columns in the same order left to right:

vendor_name_2, vendor_address_2, vendor_phone_2, vendor_fax_2, vendor_contact_mobile_2, vendor_contact_email_2, etc.

At this point the sets of columns repeat as long as there are more vendors for the product on that row.

A "good" row will have all of the available data in the correct column:

Motion Distributors; 3231 Apex Drive; Dulles, Ohio 45321; (321) 542-6422(p); (321) 542-6428(f); (321) 542-6680(m); alan@motiondist.com; etc. etc.

A "bad" row will have (one or more) missing items for at least one vendor on the row, which of course effects everything to the right of that missing cell, so all of the data becomes "shifted".

Since some of the data in the cells are missing, the my problem is getting the data in each row back to the correct cell.

For example, if the vendor_fax number is missing, all of the cells to the right of that missing cell do not go into the correct column and are shifted.

To make things worse, because there are multiple vendors for the same product, the more missing cells per row, the more shifting occurs on that row.

Some manufacturer products have more than a dozen vendors, so would it be better to read the entire row into memory first?

Is there a way to fix this since each vendor data set will always be the same arrangement of eight columns, and I just need to know how many sets of eight columns to prepare before parsing the row?

Thanks in advance...

Thomas_Ott · March 2017

This sounds tricky, let me ask you this. Are all these vendors in different files? Or all in one file?

Let me see if I understand you correctly, you have a situation like this?

Vendor 1 + 7 more V1 columns + Vendor 2 + 7 more V2 columns + Vendor 3 + 7 more V2 columns

X X

dataguy · March 2017

Ok I am trying not to fanboy freak out right now because I am talking to the guy who made the YouTube videos I have been watching - - Rapidminer Royalty Thomas Ott - - your videos make sense to me and I am very grateful...

Ok I have regained my composure and am back to my usual analytical self again...

At this point I have consolidated all of the per mfg associated vendor data into one spreadsheet:

Each of the rows associated with a manufacturer will always have at least one vendor entry,

All of the vendor data is presented in the correct order (in spite of the missing data) no matter how many vendors are on one row.

The peculiar cases:

All of the vendors are usually crammed into the first vendor cell, but for some weird reason there may be four or five vendors in one cell, then there will be another cell to the right with even more vendors (I suspect typos with semicolons since the data is semicolon delimited), and

Vendor contacts with only one phone number will not have the (p), (m), or (f) suffix on the end of the phone number.

The order of the vendor data will always be the same, but null / missing data which would have been delimited by a semicolon is not there, so the spreadsheet columns are not being filled with the correct type of data for that column after any missing data thereafter for that row.

Attaching a whittled-down xml spreadsheet with a page of what we have versus a page of what we need to illustrate.

MartinLiebig · March 2017

Hi,

could you maybe post an excel/csv with 3 dummy examples? Kind of hard to imagine.

I guess there will be either a Regex or the Split operator involved...

Best,

Martin

dataguy · March 2017

Dear Martin,

Here is a revised xml spreadsheet version with three rows.

You mentioned regex - - is it possible to craft a regex statement for the cases when only one phone number is present and the phone suffix (p) is not used?

Do I need to make a subprocess for each regex variation / possibility?

Should I try to read the entire row in since I do not know in advance how many sets of eight columns will be necessary to parse into their correct columns?

Thanks again

MartinLiebig · March 2017

Hi,

i am getting closer . See attached process. I've transfered the xml to csv and it kind of works with a split. After this you need to go for a big rename and a De-Pivot. Only obstacle is that there is one address with an , - which is fairly annoying. Not sure if this can be handled in a nice way? Any pattern we can exploti to replace?

~Martin

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="7.4.000" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
        <parameter key="csv_file" value="C:\Users\Martin\Arbeit\Forum\data_vendor_list_snippet.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="ID_Count.true.integer.attribute"/>
          <parameter key="1" value="Mfg_Ref.true.real.attribute"/>
          <parameter key="2" value="Product_Ref.true.real.attribute"/>
          <parameter key="3" value="vendor_name_1.true.polynominal.attribute"/>
          <parameter key="4" value="vendor_contact_title_1.true.attribute_value.attribute"/>
          <parameter key="5" value="vendor_street_1.true.attribute_value.attribute"/>
          <parameter key="6" value="vendor_city_state_zip_1.true.attribute_value.attribute"/>
          <parameter key="7" value="vendor_contact_phone_1.true.attribute_value.attribute"/>
          <parameter key="8" value="vendor_contact_mobile_1.true.attribute_value.attribute"/>
          <parameter key="9" value="vendor_contact_fax_1.true.attribute_value.attribute"/>
          <parameter key="10" value="vendor_contact_email_1.true.attribute_value.attribute"/>
          <parameter key="11" value="vendor_name_2.true.attribute_value.attribute"/>
          <parameter key="12" value="vendor_contact_title_2.true.attribute_value.attribute"/>
          <parameter key="13" value="vendor_street_2.true.attribute_value.attribute"/>
          <parameter key="14" value="vendor_city_state_zip_2.true.attribute_value.attribute"/>
          <parameter key="15" value="vendor_contact_phone_2.true.attribute_value.attribute"/>
          <parameter key="16" value="vendor_contact_mobile_2.true.attribute_value.attribute"/>
          <parameter key="17" value="vendor_contact_fax_2.true.attribute_value.attribute"/>
          <parameter key="18" value="vendor_contact_email_2.true.attribute_value.attribute"/>
          <parameter key="19" value="vendor_name_3.true.attribute_value.attribute"/>
          <parameter key="20" value="vendor_contact_title_3.true.attribute_value.attribute"/>
          <parameter key="21" value="vendor_street_3.true.attribute_value.attribute"/>
          <parameter key="22" value="vendor_city_state_zip_3.true.attribute_value.attribute"/>
          <parameter key="23" value="vendor_contact_phone_3.true.attribute_value.attribute"/>
          <parameter key="24" value="vendor_contact_mobile_3.true.attribute_value.attribute"/>
          <parameter key="25" value="vendor_contact_fax_3.true.attribute_value.attribute"/>
          <parameter key="26" value="vendor_contact_email_3.true.attribute_value.attribute"/>
          <parameter key="27" value="vendor_name_4.true.attribute_value.attribute"/>
          <parameter key="28" value="vendor_contact_title_4.true.attribute_value.attribute"/>
          <parameter key="29" value="vendor_street_4.true.attribute_value.attribute"/>
          <parameter key="30" value="vendor_city_state_zip_4.true.attribute_value.attribute"/>
          <parameter key="31" value="vendor_contact_phone_4.true.attribute_value.attribute"/>
          <parameter key="32" value="vendor_contact_mobile_4.true.attribute_value.attribute"/>
          <parameter key="33" value="vendor_contact_fax_4.true.attribute_value.attribute"/>
          <parameter key="34" value="vendor_contact_email_4.true.attribute_value.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.4.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="ID_Count|Mfg_Ref|Product_Ref|vendor_name_1"/>
      </operator>
      <operator activated="true" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="380" y="34">
        <parameter key="replace_what" value="\(p\);"/>
        <parameter key="replace_by" value="\(p\)  "/>
      </operator>
      <operator activated="true" class="split" compatibility="7.4.000" expanded="true" height="82" name="Split" width="90" x="581" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="vendor_name_1"/>
        <parameter key="attributes" value="vendor_name_1"/>
        <parameter key="split_pattern" value=",|;"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Thomas_Ott · March 2017

Why not just do Read CSV, Select Attributes, and then Split. Split on the semi-colon, ";" and then all you have to do is use Rename operator?

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="read_csv" compatibility="7.4.000" expanded="true" height="68" name="Read CSV" width="90" x="112" y="34">
        <parameter key="csv_file" value="C:\Users\ThomasOtt\Desktop\data_vendor_list_snippet.csv"/>
        <parameter key="column_separators" value=","/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="ID_Count.true.integer.attribute"/>
          <parameter key="1" value="Mfg_Ref.true.real.attribute"/>
          <parameter key="2" value="Product_Ref.true.real.attribute"/>
          <parameter key="3" value="vendor_name_1.true.polynominal.attribute"/>
          <parameter key="4" value="vendor_contact_title_1.true.attribute_value.attribute"/>
          <parameter key="5" value="vendor_street_1.true.attribute_value.attribute"/>
          <parameter key="6" value="vendor_city_state_zip_1.true.attribute_value.attribute"/>
          <parameter key="7" value="vendor_contact_phone_1.true.attribute_value.attribute"/>
          <parameter key="8" value="vendor_contact_mobile_1.true.attribute_value.attribute"/>
          <parameter key="9" value="vendor_contact_fax_1.true.attribute_value.attribute"/>
          <parameter key="10" value="vendor_contact_email_1.true.attribute_value.attribute"/>
          <parameter key="11" value="vendor_name_2.true.attribute_value.attribute"/>
          <parameter key="12" value="vendor_contact_title_2.true.attribute_value.attribute"/>
          <parameter key="13" value="vendor_street_2.true.attribute_value.attribute"/>
          <parameter key="14" value="vendor_city_state_zip_2.true.attribute_value.attribute"/>
          <parameter key="15" value="vendor_contact_phone_2.true.attribute_value.attribute"/>
          <parameter key="16" value="vendor_contact_mobile_2.true.attribute_value.attribute"/>
          <parameter key="17" value="vendor_contact_fax_2.true.attribute_value.attribute"/>
          <parameter key="18" value="vendor_contact_email_2.true.attribute_value.attribute"/>
          <parameter key="19" value="vendor_name_3.true.attribute_value.attribute"/>
          <parameter key="20" value="vendor_contact_title_3.true.attribute_value.attribute"/>
          <parameter key="21" value="vendor_street_3.true.attribute_value.attribute"/>
          <parameter key="22" value="vendor_city_state_zip_3.true.attribute_value.attribute"/>
          <parameter key="23" value="vendor_contact_phone_3.true.attribute_value.attribute"/>
          <parameter key="24" value="vendor_contact_mobile_3.true.attribute_value.attribute"/>
          <parameter key="25" value="vendor_contact_fax_3.true.attribute_value.attribute"/>
          <parameter key="26" value="vendor_contact_email_3.true.attribute_value.attribute"/>
          <parameter key="27" value="vendor_name_4.true.attribute_value.attribute"/>
          <parameter key="28" value="vendor_contact_title_4.true.attribute_value.attribute"/>
          <parameter key="29" value="vendor_street_4.true.attribute_value.attribute"/>
          <parameter key="30" value="vendor_city_state_zip_4.true.attribute_value.attribute"/>
          <parameter key="31" value="vendor_contact_phone_4.true.attribute_value.attribute"/>
          <parameter key="32" value="vendor_contact_mobile_4.true.attribute_value.attribute"/>
          <parameter key="33" value="vendor_contact_fax_4.true.attribute_value.attribute"/>
          <parameter key="34" value="vendor_contact_email_4.true.attribute_value.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.4.000" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="ID_Count|Mfg_Ref|Product_Ref|vendor_name_1"/>
      </operator>
      <operator activated="true" class="split" compatibility="7.4.000" expanded="true" height="82" name="Split (2)" width="90" x="380" y="34">
        <parameter key="split_pattern" value=";"/>
      </operator>
      <operator activated="false" breakpoints="after" class="replace" compatibility="7.4.000" expanded="true" height="82" name="Replace" width="90" x="581" y="238">
        <parameter key="replace_what" value="\(p\);"/>
        <parameter key="replace_by" value="\(p\)  "/>
      </operator>
      <operator activated="false" class="split" compatibility="7.4.000" expanded="true" height="82" name="Split" width="90" x="447" y="289">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value="vendor_name_1"/>
        <parameter key="attributes" value="vendor_name_1"/>
        <parameter key="split_pattern" value=",|;"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Split (2)" to_port="example set input"/>
      <connect from_op="Split (2)" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

dataguy · March 2017

Dear Martin and Thomas,
Thank you both for getting back to me so quickly - - you guys are amazing - - I really appreciate the help.
Where this is failing is in the correct parsing into the correct column logic.
Fortunately, the semicolon is a unique delimiter.
Since there are eight columns each time, I have created a regex for each of the eight fields.
If you can point me to the proper operator to apply the regex in order to parse that "problem" semicolon-delimited concatenated cell into the correct respective column, I would be grateful.

So far I have tried to create a Process:
Retrieve >> Select Attributes >> Work on Subset >> Generate Extract
and a different process using:
"Generate Data by User Specification" with a loop and a document definition,
but I think I am misunderstanding the meaning of the term attribute or maybe my regex is incorrect?

vendor_name_out
/^.+$/

vendor_contact_title_out
/^.+$/

vendor_street_out
/^d+/s/

vendor_city_state_zip_out
/^.*,/s/d{5}$/

vendor_contact_phone_out
/^/(/d{3}/)/s/d{3}/-/d{4}/(p/)$/||/^/(/d{3}/)/s/d{3}/-/d{4}$/

vendor_contact_mobile_out
/^/(/d{3}/)/s/d{3}/-/d{4}/(m/)$/

vendor_contact_fax_out
/^/(/d{3}/)/s/d{3}/-/d{4}/(f/)$/

vendor_contact_email_out
/^/w+/@/w+./w+$/

I am not sure if using a double pipe to allow for the alternate case of single phone numbers without the (p) is allowed.
Here are both of my Process attempts in case one of them is worthwhile to pursue (attached):

At this point I would like to focus on fully parsing the vendor_name_1 field regardless of how many sets of columns of eight it takes to extract the data.

Hope that makes sense and that my newbie Process attempts make sense...
Thanks

dataguy · March 2017

BTW in the three rows of example data,

the top row is missing two pieces of data (two cells are empty) so two shifts over result,

the second row is missing one (one cell is empty) so one shift over result, and

the bottom row is missing none (no cells are empty) so no shifts occur.

Hope that helps,

Ricky

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

how to parse and move vendor contact data into correct columns when some fields are missing

Answers