Execute Python breaks column if text has commas

MarcoBarradas · March 2019

Hi, I need some help. I'm doing some crawling with Python (I already tried with RM, but I couldn't get what I wanted in an easy way).
The last column of the DataFrame (DF) returns a big chunk of text that describes the product. For some reason, when Execute Python creates the data set, it adds new rows and erases the data that was sent in the DF. I tried writing the data to a file from inside Execute Python, and the outcome is a file with 1 row and 5 columns, as expected.

Here is the process I'm using.

<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000">
  <context>
    <input/>
    <output/>
    <macros>
      <macro>
        <key>url</key>
        <value>https://www.liverpool.com.mx/tienda/pdp/consola-playstation-4-pro-1-tb/1059665339?s=play+station&amp;skuId=1059665339</value>
      </macro>
    </macros>
  </context>
  <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.1.000" expanded="true" height="82" name="Execute Python" width="90" x="179" y="34">
        <parameter key="script" value="import requests&#10;from bs4 import BeautifulSoup&#10;import pandas as pd&#10;&#10;def rm_main():&#10;    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}&#10;    columnas=['id','precio_n','precio_d','nombre','descripcion']&#10;    productos=pd.DataFrame(columns=columnas)   &#10;    session = requests.Session()&#10;    url='%{url}'&#10;    session.post(url,headers=headers)&#10;    content=session.get(url)&#10;    soup = BeautifulSoup(content.text,'html.parser')&#10;    precio_normal=soup.find(&quot;input&quot;,id=&quot;listPrice&quot;)&#10;    tipo=soup.find(&quot;a&quot;,_class=&quot;actual&quot;)&#10;    llave=soup.find(&quot;input&quot;,id=&quot;productId&quot;)&#10;    #productId&#10;    #gtmPrice&#10;    #productDisplayName&#10;    precio_descuento=soup.find(&quot;input&quot;,id=&quot;gtmPrice&quot;)&#10;    producto=soup.find(&quot;input&quot;,id=&quot;productDisplayName&quot;)&#10;    descripcion=soup.find(&quot;div&quot;,id=&quot;intro&quot;).find('p').get_text()&#10;    descripcion=descripcion.replace(',', '')&#10;    descripcion=descripcion.replace('', '')&#10;    #print(descripcion)&#10;    fila=[llave['value'],&#10;                          precio_normal['value'],&#10;                          precio_descuento['value'],&#10;                          producto['value'],&#10;                          descripcion&#10;                          ]&#10;    productos.loc[len(productos)]=fila&#10;    return productos"/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34">
        <list key="function_descriptions">
          <parameter key="Fecha" value="date_now()"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="date_to_nominal" compatibility="9.1.000" expanded="true" height="82" name="Date to Nominal" width="90" x="514" y="34">
        <parameter key="attribute_name" value="Fecha"/>
        <parameter key="date_format" value="yyyy/MM/dd hh:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="keep_old_attribute" value="false"/>
      </operator>
      <connect from_op="Execute Python" from_port="output 1" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Date to Nominal" to_port="example set input"/>
      <connect from_op="Date to Nominal" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

lionelderkrikor · March 2019

Hi @MarcoBarradas,

Very interesting problem!
To sum up: there is effectively a bug in RapidMiner, but there is a workaround (see the process at the end of this post).

In more detail:
I say there is a bug in RapidMiner because when the code is executed in a Python Jupyter Notebook, it works fine:

Image: https://us.v-cdn.net/6030995/uploads/editor/ng/zvh6kaxrqv9h.png

Maybe it is linked to the text attribute?

The (far-fetched) workaround:

1. I modified the Python script to generate the DF like this:

Image: https://us.v-cdn.net/6030995/uploads/editor/ig/smq058qrfem4.png

2. Then I used the Transpose operator:

Image: https://us.v-cdn.net/6030995/uploads/editor/pa/exskeoc9wmdl.png

3. I used the Generate Aggregation operator to concatenate the attributes associated with "description", which had been split for an unknown reason:

Image: https://us.v-cdn.net/6030995/uploads/editor/dx/x5szhky4zykn.png

4. Finally, I renamed the relevant attributes and removed the useless ones to obtain the final ExampleSet:

Image: https://us.v-cdn.net/6030995/uploads/editor/g2/2a2lne8f8z69.png

5. The process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000">
  <context>
    <input/>
    <output/>
    <macros>
      <macro>
        <key>url</key>
        <value>https://www.liverpool.com.mx/tienda/pdp/consola-playstation-4-pro-1-tb/1059665339?s=play+station&amp;skuId=1059665339</value>
      </macro>
    </macros>
  </context>
  <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="python_scripting:execute_python" compatibility="9.2.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="34">
        <parameter key="script" value="import requests&#10;from bs4 import BeautifulSoup&#10;import pandas as pd&#10;&#10;def rm_main():&#10;    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}&#10;    columnas=['id','precio_n','precio_d','nombre','descripcion']&#10;    productos=pd.DataFrame(columns=columnas)   &#10;    session = requests.Session()&#10;    url='%{url}'&#10;    session.post(url,headers=headers)&#10;    content=session.get(url)&#10;    soup = BeautifulSoup(content.text,'html.parser')&#10;    precio_normal=soup.find(&quot;input&quot;,id=&quot;listPrice&quot;)&#10;    tipo=soup.find(&quot;a&quot;,_class=&quot;actual&quot;)&#10;    llave=soup.find(&quot;input&quot;,id=&quot;productId&quot;)&#10;    #productId&#10;    #gtmPrice&#10;    #productDisplayName&#10;    precio_descuento=soup.find(&quot;input&quot;,id=&quot;gtmPrice&quot;)&#10;    producto=soup.find(&quot;input&quot;,id=&quot;productDisplayName&quot;)&#10;    descripcion=soup.find(&quot;div&quot;,id=&quot;intro&quot;).find('p').get_text()&#10;    descripcion=descripcion.replace(',', '')&#10;    descripcion=descripcion.replace('', '')&#10;    #print(descripcion)&#10;    fila=[llave['value'],&#10;                          precio_normal['value'],&#10;                          precio_descuento['value'],&#10;                          producto['value'],&#10;                          descripcion&#10;                          ]&#10;    productos = pd.DataFrame(data = fila)&#10;    &#10;    return productos"/>
        <parameter key="use_default_python" value="true"/>
        <parameter key="package_manager" value="conda (anaconda)"/>
      </operator>
      <operator activated="true" class="transpose" compatibility="9.2.000" expanded="true" height="82" name="Transpose" width="90" x="179" y="34"/>
      <operator activated="true" class="generate_aggregation" compatibility="9.2.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="313" y="34">
        <parameter key="attribute_name" value="description"/>
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value="att_5|att_6|att_7|att_8"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="false"/>
        <parameter key="include_special_attributes" value="false"/>
        <parameter key="aggregation_function" value="concatenation"/>
        <parameter key="concatenation_separator" value="&quot; &quot;"/>
        <parameter key="keep_all" value="true"/>
        <parameter key="ignore_missings" value="true"/>
        <parameter key="ignore_missing_attributes" value="false"/>
      </operator>
      <operator activated="true" class="rename" compatibility="9.2.000" expanded="true" height="82" name="Rename" width="90" x="447" y="34">
        <parameter key="old_name" value="att_1"/>
        <parameter key="new_name" value="Id"/>
        <list key="rename_additional_attributes">
          <parameter key="att_2" value="precio_n"/>
          <parameter key="att_3" value="precio_d"/>
          <parameter key="att_4" value="nombre"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34">
        <parameter key="attribute_filter_type" value="regular_expression"/>
        <parameter key="attribute" value=""/>
        <parameter key="attributes" value=""/>
        <parameter key="regular_expression" value="att_.*"/>
        <parameter key="use_except_expression" value="false"/>
        <parameter key="value_type" value="attribute_value"/>
        <parameter key="use_value_type_exception" value="false"/>
        <parameter key="except_value_type" value="time"/>
        <parameter key="block_type" value="attribute_block"/>
        <parameter key="use_block_type_exception" value="false"/>
        <parameter key="except_block_type" value="value_matrix_row_start"/>
        <parameter key="invert_selection" value="true"/>
        <parameter key="include_special_attributes" value="false"/>
      </operator>
      <operator activated="true" class="generate_attributes" compatibility="9.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="34">
        <list key="function_descriptions">
          <parameter key="Fecha" value="date_now()"/>
        </list>
        <parameter key="keep_all" value="true"/>
      </operator>
      <operator activated="true" class="date_to_nominal" compatibility="9.2.000" expanded="true" height="82" name="Date to Nominal" width="90" x="849" y="34">
        <parameter key="attribute_name" value="Fecha"/>
        <parameter key="date_format" value="yyyy/MM/dd hh:mm:ss"/>
        <parameter key="time_zone" value="SYSTEM"/>
        <parameter key="locale" value="English (United States)"/>
        <parameter key="keep_old_attribute" value="false"/>
      </operator>
      <connect from_op="Execute Python" from_port="output 1" to_op="Transpose" to_port="example set input"/>
      <connect from_op="Transpose" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/>
      <connect from_op="Generate Aggregation" from_port="example set output" to_op="Rename" to_port="example set input"/>
      <connect from_op="Rename" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
      <connect from_op="Generate Attributes" from_port="example set output" to_op="Date to Nominal" to_port="example set input"/>
      <connect from_op="Date to Nominal" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

6. Have fun with your future PlayStation 4!

Hope this helps,

Regards,

Lionel

MichaelKnopf · March 2019

Hi @MarcoBarradas, I can confirm that this is a bug in the operator code. I will create a ticket.

It is however not the length of the description that causes the issue, but the newline characters in it. Thus, a workaround might be to remove all line breaks from the text.

My understanding is that you are already trying that, but your script only looks for Windows-style line breaks (\r\n) and is missing Unix-style line breaks (\n), which are more common on the web.

For me changing the following lines did the trick:

    # descripcion=descripcion.replace('\r\n', '')
    descripcion=descripcion.replace('\r', '')
    descripcion=descripcion.replace('\n', '')

See Wikipedia for more info on the different line break styles.
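
The replacement lines above can also be wrapped in a small helper, so that every text field goes through the same cleanup. This is only a minimal sketch: the name `strip_line_breaks` and the sample string are made up for illustration, and it swaps each run of line breaks for a single space (rather than an empty string, as in the lines above) so that adjacent words are not glued together.

```python
import re

# Matches any run of carriage returns and line feeds (\r\n, \r or \n).
_LINE_BREAKS = re.compile(r"[\r\n]+")


def strip_line_breaks(text, replacement=" "):
    """Replace every run of line-break characters in `text` with `replacement`."""
    return _LINE_BREAKS.sub(replacement, text).strip()


# Made-up sample text that mixes Windows, Unix and old Mac line breaks.
descripcion = "First line\r\nsecond line\nthird line\r"
limpio = strip_line_breaks(descripcion)
print(limpio)
```

Inside `rm_main`, the same call could be applied to `descripcion` right after `get_text()`. Passing `replacement=''` reproduces the behaviour of the `replace` calls above.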

MarcoBarradas · March 2019

Great!!! It works, and yes, it seems to be a bug.
I'll need to make some changes, since sometimes the crawled page may not have that attribute and the number of rows may be dynamic, but your workaround works like a charm.
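
For the case where the crawled page lacks an element, one possible approach is to guard each lookup and collect the rows in a list. This is a rough sketch under a few assumptions: `attr_or_none`, `resultados` and `filas` are names made up for illustration, plain dicts stand in for the elements that `soup.find(...)` returns (both support `.get`), and the values are placeholders rather than data from the real page.

```python
def attr_or_none(tag, name="value"):
    """Return attribute `name` of a found element, or None if it is missing.

    `tag` is whatever soup.find(...) returned: an element, or None when
    nothing matched.
    """
    if tag is None:
        return None
    return tag.get(name)


# Plain dicts stand in for the elements returned by soup.find(...);
# the values are placeholders, not data from the real page.
resultados = [
    {"llave": {"value": "id-1"}, "precio_normal": {"value": "100"}},
    {"llave": {"value": "id-2"}, "precio_normal": None},  # price not found
]

# One list entry per crawled product, however many there are.
filas = []
for encontrado in resultados:
    filas.append([
        attr_or_none(encontrado["llave"]),
        attr_or_none(encontrado["precio_normal"]),
    ])

print(filas)
```

In `rm_main`, a list built this way could be passed to `pd.DataFrame(filas, columns=[...])` once at the end, so the number of rows does not need to be known in advance.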

sgenzer · March 2019

@MarcoBarradas can you pls be more specific about the bug? I'd like to push it internally but need more detail. I'm not a Python coder...

Scott

MarcoBarradas · March 2019

Hi @sgenzer, the bug is that RM changes the DataFrame when it converts it to an RM data set. This happens when one of the attributes has a lot of text. In my example the DataFrame has a shape of 1 example with 5 attributes. But once Execute Python ends, it returns 3 examples with 5 attributes, and it only returns information for the last attribute, the one that had a lot of text.
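
To make the symptom easier to picture for non-Python readers, here is a small illustration of how line breaks inside one text field can turn a single record into several. It is an assumption-laden sketch, not the operator's actual code: the thread does not say how Execute Python hands data to RapidMiner, and the row values are placeholders. It only shows that a reader which splits text line by line sees extra records, while a quoting-aware CSV reader keeps one.

```python
import csv
import io

# One record with 5 fields; the last field contains two line breaks.
# All values are placeholders.
row = ["id-1", "100", "90", "name", "line one\nline two\nline three"]

# Naive exchange: join with commas, then split the text into lines.
naive_text = ",".join(row)
naive_records = naive_text.split("\n")
print(len(naive_records))  # more than one "record" for a single row

# Quoting-aware exchange: the csv module quotes the multi-line field,
# so reading it back yields a single record with all five fields.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(parsed), len(parsed[0]))
```

Under that assumption, removing the line breaks before returning the DataFrame sidesteps the problem, which fits the workaround that worked in this thread.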
