Execute Python breaks column if text has commas
MarcoBarradas
Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, Member Posts: 272 Unicorn
Hi, I need some help. I'm doing some crawling with Python (I already tried with RM, but I couldn't get what I wanted in an easy way).
The last column of the DataFrame (DF) holds a big chunk of text that describes the product. For some reason, when Execute Python creates the ExampleSet, it creates new lines and erases the data that was sent in the DF. I tried writing the info to a file from inside Execute Python, and the outcome is a file with 1 row and 5 columns, as expected.
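A quick way to run the same kind of check in plain Python is sketched below. It is only an illustration: the row values are made up, the CSV is written to an in-memory buffer instead of a file, and nothing here reproduces what the Execute Python operator does internally.

```python
import io
import pandas as pd

# Hypothetical one-row frame shaped like the scraper's output;
# the description deliberately contains a Unix-style line break.
columnas = ['id', 'precio_n', 'precio_d', 'nombre', 'descripcion']
productos = pd.DataFrame(
    [['1059665339', '9999', '8999', 'Consola', 'First line\nSecond line']],
    columns=columnas,
)

# Round-trip through CSV in memory, as a stand-in for writing a file.
buffer = io.StringIO()
productos.to_csv(buffer, index=False)
raw_text = buffer.getvalue()
buffer.seek(0)
restored = pd.read_csv(buffer)

# A CSV-aware reader still sees 1 row and 5 columns...
print(restored.shape)
# ...although the raw text spans more physical lines than header + 1 row.
print(len(raw_text.splitlines()))

# Flag whether the text column contains any line breaks.
has_breaks = productos['descripcion'].str.contains(r'[\r\n]').any()
print(has_breaks)
```

If has_breaks comes back true, the description contains embedded line breaks, which is worth knowing before handing the frame to any other tool.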
Here is the process I'm using.
<?xml version="1.0" encoding="UTF-8"?><process version="9.1.000"> <context> <input/> <output/> <macros> <macro> <key>url</key> <value>https://www.liverpool.com.mx/tienda/pdp/consola-playstation-4-pro-1-tb/1059665339?s=play+station&skuId=1059665339</value> </macro> </macros> </context> <operator activated="true" class="process" compatibility="9.1.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="python_scripting:execute_python" compatibility="9.1.000" expanded="true" height="82" name="Execute Python" width="90" x="179" y="34"> <parameter key="script" value="import requests from bs4 import BeautifulSoup import pandas as pd def rm_main(): headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'} columnas=['id','precio_n','precio_d','nombre','descripcion'] productos=pd.DataFrame(columns=columnas) session = requests.Session() url='%{url}' session.post(url,headers=headers) content=session.get(url) soup = BeautifulSoup(content.text,'html.parser') precio_normal=soup.find("input",id="listPrice") tipo=soup.find("a",_class="actual") llave=soup.find("input",id="productId") #productId #gtmPrice #productDisplayName precio_descuento=soup.find("input",id="gtmPrice") producto=soup.find("input",id="productDisplayName") descripcion=soup.find("div",id="intro").find('p').get_text() descripcion=descripcion.replace(',', '') descripcion=descripcion.replace('', '') #print(descripcion) fila=[llave['value'], precio_normal['value'], precio_descuento['value'], producto['value'], descripcion ] productos.loc[len(productos)]=fila return productos"/> <parameter key="use_default_python" 
value="true"/> <parameter key="package_manager" value="conda (anaconda)"/> </operator> <operator activated="true" class="generate_attributes" compatibility="9.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="313" y="34"> <list key="function_descriptions"> <parameter key="Fecha" value="date_now()"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="date_to_nominal" compatibility="9.1.000" expanded="true" height="82" name="Date to Nominal" width="90" x="514" y="34"> <parameter key="attribute_name" value="Fecha"/> <parameter key="date_format" value="yyyy/MM/dd hh:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="locale" value="English (United States)"/> <parameter key="keep_old_attribute" value="false"/> </operator> <connect from_op="Execute Python" from_port="output 1" to_op="Generate Attributes" to_port="example set input"/> <connect from_op="Generate Attributes" from_port="example set output" to_op="Date to Nominal" to_port="example set input"/> <connect from_op="Date to Nominal" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
Best Answers
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @MarcoBarradas,
Very interesting problem!
To sum up: there is effectively a bug in RapidMiner, but there is a workaround (see the process at the end of this post).
In more detail:
I say there is a bug in RapidMiner because when the code is executed in a Python Jupyter Notebook, it works fine.
Maybe it is linked to the text attribute?
The (far-fetched) workaround:
1. I modified the Python script so that it builds the DataFrame directly from the list of values (productos = pd.DataFrame(data = fila), as in the process at the end of this post).
2. Then I used the Transpose operator.
3. I used the Generate Aggregation operator to concatenate the attributes associated with "description", which had been split for an unknown reason.
4. Finally, I renamed the relevant attributes and removed the useless ones to obtain the final ExampleSet.
5. The process :<?xml version="1.0" encoding="UTF-8"?><process version="9.2.000"> <context> <input/> <output/> <macros> <macro> <key>url</key> <value>https://www.liverpool.com.mx/tienda/pdp/consola-playstation-4-pro-1-tb/1059665339?s=play+station&skuId=1059665339</value> </macro> </macros> </context> <operator activated="true" class="process" compatibility="9.2.000" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="python_scripting:execute_python" compatibility="9.2.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="34"> <parameter key="script" value="import requests from bs4 import BeautifulSoup import pandas as pd def rm_main(): headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'} columnas=['id','precio_n','precio_d','nombre','descripcion'] productos=pd.DataFrame(columns=columnas) session = requests.Session() url='%{url}' session.post(url,headers=headers) content=session.get(url) soup = BeautifulSoup(content.text,'html.parser') precio_normal=soup.find("input",id="listPrice") tipo=soup.find("a",_class="actual") llave=soup.find("input",id="productId") #productId #gtmPrice #productDisplayName precio_descuento=soup.find("input",id="gtmPrice") producto=soup.find("input",id="productDisplayName") descripcion=soup.find("div",id="intro").find('p').get_text() descripcion=descripcion.replace(',', '') descripcion=descripcion.replace('', '') #print(descripcion) fila=[llave['value'], precio_normal['value'], precio_descuento['value'], producto['value'], descripcion ] productos = pd.DataFrame(data = fila) return productos"/> <parameter 
key="use_default_python" value="true"/> <parameter key="package_manager" value="conda (anaconda)"/> </operator> <operator activated="true" class="transpose" compatibility="9.2.000" expanded="true" height="82" name="Transpose" width="90" x="179" y="34"/> <operator activated="true" class="generate_aggregation" compatibility="9.2.000" expanded="true" height="82" name="Generate Aggregation" width="90" x="313" y="34"> <parameter key="attribute_name" value="description"/> <parameter key="attribute_filter_type" value="subset"/> <parameter key="attribute" value=""/> <parameter key="attributes" value="att_5|att_6|att_7|att_8"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="false"/> <parameter key="include_special_attributes" value="false"/> <parameter key="aggregation_function" value="concatenation"/> <parameter key="concatenation_separator" value="" ""/> <parameter key="keep_all" value="true"/> <parameter key="ignore_missings" value="true"/> <parameter key="ignore_missing_attributes" value="false"/> </operator> <operator activated="true" class="rename" compatibility="9.2.000" expanded="true" height="82" name="Rename" width="90" x="447" y="34"> <parameter key="old_name" value="att_1"/> <parameter key="new_name" value="Id"/> <list key="rename_additional_attributes"> <parameter key="att_2" value="precio_n"/> <parameter key="att_3" value="precio_d"/> <parameter key="att_4" value="nombre"/> </list> </operator> <operator activated="true" class="select_attributes" compatibility="9.2.000" expanded="true" height="82" name="Select Attributes" width="90" x="581" y="34"> <parameter 
key="attribute_filter_type" value="regular_expression"/> <parameter key="attribute" value=""/> <parameter key="attributes" value=""/> <parameter key="regular_expression" value="att_.*"/> <parameter key="use_except_expression" value="false"/> <parameter key="value_type" value="attribute_value"/> <parameter key="use_value_type_exception" value="false"/> <parameter key="except_value_type" value="time"/> <parameter key="block_type" value="attribute_block"/> <parameter key="use_block_type_exception" value="false"/> <parameter key="except_block_type" value="value_matrix_row_start"/> <parameter key="invert_selection" value="true"/> <parameter key="include_special_attributes" value="false"/> </operator> <operator activated="true" class="generate_attributes" compatibility="9.2.000" expanded="true" height="82" name="Generate Attributes" width="90" x="715" y="34"> <list key="function_descriptions"> <parameter key="Fecha" value="date_now()"/> </list> <parameter key="keep_all" value="true"/> </operator> <operator activated="true" class="date_to_nominal" compatibility="9.2.000" expanded="true" height="82" name="Date to Nominal" width="90" x="849" y="34"> <parameter key="attribute_name" value="Fecha"/> <parameter key="date_format" value="yyyy/MM/dd hh:mm:ss"/> <parameter key="time_zone" value="SYSTEM"/> <parameter key="locale" value="English (United States)"/> <parameter key="keep_old_attribute" value="false"/> </operator> <connect from_op="Execute Python" from_port="output 1" to_op="Transpose" to_port="example set input"/> <connect from_op="Transpose" from_port="example set output" to_op="Generate Aggregation" to_port="example set input"/> <connect from_op="Generate Aggregation" from_port="example set output" to_op="Rename" to_port="example set input"/> <connect from_op="Rename" from_port="example set output" to_op="Select Attributes" to_port="example set input"/> <connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set 
input"/> <connect from_op="Generate Attributes" from_port="example set output" to_op="Date to Nominal" to_port="example set input"/> <connect from_op="Date to Nominal" from_port="example set output" to_port="result 1"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> </process> </operator> </process>
6. Have fun with your future PlayStation 4!
Hope this helps,
Regards,
Lionel
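For readers who want to see what the modified script returns, here is a minimal pandas sketch of the shapes involved. The values in fila are made up, and the .T call only mirrors in pandas what the Transpose operator does in the process; it says nothing about how RapidMiner handles the data internally.

```python
import pandas as pd

columnas = ['id', 'precio_n', 'precio_d', 'nombre', 'descripcion']
# Hypothetical values standing in for the scraped row.
fila = ['1059665339', '9999', '8999', 'Consola', 'Some description']

# Passing a flat list gives one column (named 0) with one row per value.
as_column = pd.DataFrame(data=fila)
print(as_column.shape)

# Transposing turns that single column back into a single row.
as_row = as_column.T
as_row.columns = columnas
print(as_row.shape)

# For comparison: wrapping the list builds the same row directly.
direct = pd.DataFrame([fila], columns=columnas)
print(direct.shape)
```

The first frame is 5 rows by 1 column, and the other two are 1 row by 5 columns, which is why the workaround needs the Transpose step after Execute Python.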
MichaelKnopf Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 31 RM Data Scientist
Hi @MarcoBarradas, I can confirm that this is a bug in the operator code. I will create a ticket.
It is, however, not the length of the description that causes the issue, but the newline characters in it. Thus, a workaround might be to remove all line breaks from the text.
My understanding is that you are already trying that, but your script only looks for Windows-style line breaks (\r\n) and misses Unix-style line breaks (\n), which are more common on the web.
For me, changing the following lines did the trick:
# descripcion=descripcion.replace('\r\n', '')
descripcion=descripcion.replace('\r', '')
descripcion=descripcion.replace('\n', '')
See Wikipedia for more info on the different line break styles.
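The same idea can be wrapped in a small helper that covers all three line-break styles in one pass. This is a sketch, not the operator fix: strip_line_breaks is a hypothetical name, and replacing breaks with a space (rather than an empty string, as in the lines above) is an assumption made so that words on adjacent lines do not get glued together.

```python
import re

def strip_line_breaks(text, replacement=' '):
    """Replace each run of carriage returns and line feeds with `replacement`."""
    return re.sub(r'[\r\n]+', replacement, text).strip()

# Sample text mixing Windows, Unix and old Mac line breaks.
sample = 'Line one\r\nLine two\nLine three\rLine four'
cleaned = strip_line_breaks(sample)
print(cleaned)  # Line one Line two Line three Line four
```

In the script it would be used as descripcion = strip_line_breaks(descripcion); pass replacement='' to get the exact behaviour of the two replace calls.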
Answers
I'll need to make some changes, since sometimes the crawl may not return that attribute and the number of rows may be dynamic, but your workaround works like a charm.
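One possible way to cope with elements that are sometimes missing is a small guard around the lookup, sketched below. The helper name attr_or_default is hypothetical. It relies only on two documented BeautifulSoup behaviours: find returns None when nothing matches, and indexing a tag with a missing attribute raises KeyError. Plain dicts stand in for tags here so the snippet runs without bs4.

```python
def attr_or_default(tag, attr='value', default=''):
    """Return tag[attr], or `default` when the tag is None or lacks `attr`."""
    if tag is None:
        return default
    try:
        return tag[attr]
    except KeyError:
        return default

# Stand-ins for soup.find(...) results: a hit, a miss, and a tag without 'value'.
found = {'value': '1059665339'}
missing = None
no_value = {'id': 'listPrice'}

# Collect one dict per product so the number of rows can vary.
rows = []
rows.append({
    'id': attr_or_default(found),
    'precio_n': attr_or_default(missing, default='n/a'),
    'precio_d': attr_or_default(no_value, default='n/a'),
})
print(rows)
```

A list of dicts like rows can then be passed to pd.DataFrame(rows) once all pages have been crawled, however many products were found.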
Scott