[SOLVED] Extracting webpage content to CSV rows
Hi everyone.
An old and (very) rusty RapidMiner fan needs a hint!
* I have a single webpage containing information I want to export into a CSV file.
* At the end of the process, I'm expecting 3 columns (name, address, URL).
* With my current flow, I get a single column containing all the names in the first rows, then all the addresses, then all the URLs...
Here's the process (RapidMiner 5.3, but I get the same result with 9.2):
Thank you!
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
        <parameter key="file" value="E:\Rapidminer\Expert.htm"/>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="180" y="30">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="add_meta_information" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document (3)" width="90" x="45" y="30">
            <list key="string_machting_queries">
              <parameter key="url" value="&lt;a href=&quot;.&quot;&gt;&lt;span"/>
              <parameter key="title" value="&lt;span class=&quot;title&quot;&gt;.&lt;/span&gt;"/>
              <parameter key="address" value="&lt;span class=&quot;address&quot;&gt;.&lt;/span&gt;"/>
            </list>
            <list key="regular_expression_queries"/>
            <list key="regular_region_queries"/>
            <list key="xpath_queries">
              <parameter key="link" value="//h:a[@class=&quot;PinImage ImgLink&quot;]/@href"/>
            </list>
            <list key="namespaces"/>
            <list key="index_queries"/>
            <process expanded="true">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="document" to_op="Cut Document (3)" to_port="document"/>
          <connect from_op="Cut Document (3)" from_port="documents" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="5.3.015" expanded="true" height="76" name="Select Attributes" width="90" x="315" y="30">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="|address|url|title"/>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.3.015" expanded="true" height="76" name="Write Excel" width="90" x="450" y="30">
        <parameter key="excel_file" value="E:\Rapidminer\expert.xls"/>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
Best Answer
scepxko Member Posts: 15 Maven
Thank you for reminding me about splitting and merging using IDs!
I found a workable (if inelegant) solution that does the job: Read Document -> Multiply -> one Cut Document, one Documents to Data, and one Generate ID for each element I wanted, then Join the resulting attributes on the ID, as (1+2)+3.
Here's my working solution:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.015">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.3.015" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:read_document" compatibility="5.3.002" expanded="true" height="60" name="Read Document" width="90" x="45" y="75">
        <parameter key="file" value="E:\Rapidminer\Expert.htm"/>
        <parameter key="extract_text_only" value="false"/>
        <parameter key="use_file_extension_as_type" value="false"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <operator activated="true" class="multiply" compatibility="5.3.015" expanded="true" height="112" name="Multiply (2)" width="90" x="179" y="75"/>
      <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document (5)" width="90" x="313" y="30">
        <list key="string_machting_queries">
          <parameter key="title" value="&lt;span class=&quot;title&quot;&gt;.&lt;/span&gt;"/>
        </list>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="link" value="//h:a[@class=&quot;PinImage ImgLink&quot;]/@href"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <process expanded="true">
          <connect from_port="segment" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data (5)" width="90" x="447" y="30">
        <parameter key="text_attribute" value="text2"/>
        <parameter key="add_meta_information" value="false"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID (7)" width="90" x="581" y="30"/>
      <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document (6)" width="90" x="313" y="210">
        <list key="string_machting_queries">
          <parameter key="url" value="&lt;a href=&quot;.&quot;&gt;&lt;span"/>
        </list>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="link" value="//h:a[@class=&quot;PinImage ImgLink&quot;]/@href"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <process expanded="true">
          <connect from_port="segment" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data (4)" width="90" x="447" y="210">
        <parameter key="text_attribute" value="text1"/>
        <parameter key="add_meta_information" value="false"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID" width="90" x="581" y="210"/>
      <operator activated="true" class="text:cut_document" compatibility="5.3.002" expanded="true" height="60" name="Cut Document (7)" width="90" x="313" y="120">
        <list key="string_machting_queries">
          <parameter key="address" value="&lt;span class=&quot;address&quot;&gt;.&lt;/span&gt;"/>
        </list>
        <list key="regular_expression_queries"/>
        <list key="regular_region_queries"/>
        <list key="xpath_queries">
          <parameter key="link" value="//h:a[@class=&quot;PinImage ImgLink&quot;]/@href"/>
        </list>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <process expanded="true">
          <connect from_port="segment" to_port="document 1"/>
          <portSpacing port="source_segment" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:documents_to_data" compatibility="5.3.002" expanded="true" height="76" name="Documents to Data (6)" width="90" x="447" y="120">
        <parameter key="text_attribute" value="text3"/>
        <parameter key="add_meta_information" value="false"/>
      </operator>
      <operator activated="true" class="generate_id" compatibility="5.3.015" expanded="true" height="76" name="Generate ID (6)" width="90" x="581" y="120"/>
      <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join (3)" width="90" x="715" y="75">
        <list key="key_attributes"/>
      </operator>
      <operator activated="true" class="join" compatibility="5.3.015" expanded="true" height="76" name="Join (4)" width="90" x="782" y="210">
        <list key="key_attributes"/>
      </operator>
      <operator activated="true" class="write_excel" compatibility="5.3.015" expanded="true" height="76" name="Write Excel" width="90" x="916" y="210">
        <parameter key="excel_file" value="E:\Rapidminer\expert.xls"/>
      </operator>
      <connect from_op="Read Document" from_port="output" to_op="Multiply (2)" to_port="input"/>
      <connect from_op="Multiply (2)" from_port="output 1" to_op="Cut Document (5)" to_port="document"/>
      <connect from_op="Multiply (2)" from_port="output 2" to_op="Cut Document (7)" to_port="document"/>
      <connect from_op="Multiply (2)" from_port="output 3" to_op="Cut Document (6)" to_port="document"/>
      <connect from_op="Cut Document (5)" from_port="documents" to_op="Documents to Data (5)" to_port="documents 1"/>
      <connect from_op="Documents to Data (5)" from_port="example set" to_op="Generate ID (7)" to_port="example set input"/>
      <connect from_op="Generate ID (7)" from_port="example set output" to_op="Join (3)" to_port="left"/>
      <connect from_op="Cut Document (6)" from_port="documents" to_op="Documents to Data (4)" to_port="documents 1"/>
      <connect from_op="Documents to Data (4)" from_port="example set" to_op="Generate ID" to_port="example set input"/>
      <connect from_op="Generate ID" from_port="example set output" to_op="Join (4)" to_port="right"/>
      <connect from_op="Cut Document (7)" from_port="documents" to_op="Documents to Data (6)" to_port="documents 1"/>
      <connect from_op="Documents to Data (6)" from_port="example set" to_op="Generate ID (6)" to_port="example set input"/>
      <connect from_op="Generate ID (6)" from_port="example set output" to_op="Join (3)" to_port="right"/>
      <connect from_op="Join (3)" from_port="join" to_op="Join (4)" to_port="left"/>
      <connect from_op="Join (4)" from_port="join" to_op="Write Excel" to_port="input"/>
      <connect from_op="Write Excel" from_port="through" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
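For anyone who wants to check the idea outside RapidMiner, here is a minimal Python sketch of the same approach: extract each field separately, number the matches (the Generate ID step), and join the three result sets on that ID. It is only an illustration under assumptions, not the process itself. The sample HTML is made up because Expert.htm is not shown in this thread, the regular expressions are stand-ins for the three string-matching queries, and the function names are hypothetical.

```python
import csv
import io
import re

# Hypothetical sample page; the real Expert.htm is not shown in the thread.
SAMPLE_HTML = """
<div><a href="http://example.com/a"><span class="title">Alice</span></a>
<span class="address">1 Main St</span></div>
<div><a href="http://example.com/b"><span class="title">Bob</span></a>
<span class="address">2 Side Rd</span></div>
"""

# Regex stand-ins (assumed) for the three string-matching queries.
PATTERNS = {
    "title": r'<span class="title">(.*?)</span>',
    "address": r'<span class="address">(.*?)</span>',
    "url": r'<a href="(.*?)"><span',
}


def extract_with_ids(html, pattern):
    """Mimic Cut Document + Generate ID: number each match starting at 1."""
    return {i: m for i, m in enumerate(re.findall(pattern, html), start=1)}


def join_on_id(html):
    """Mimic the two joins: keep only the ids present in every column."""
    columns = {name: extract_with_ids(html, p) for name, p in PATTERNS.items()}
    shared_ids = set.intersection(*(set(c) for c in columns.values()))
    return [
        {name: columns[name][i] for name in PATTERNS}
        for i in sorted(shared_ids)
    ]


def to_csv(rows):
    """Write the joined rows as CSV text with one column per field."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(PATTERNS), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = join_on_id(SAMPLE_HTML)
print(to_csv(rows))
```

As with the process above, pairing by position only works if every entry on the page has all three fields and they appear in the same order. If one address is missing, every later row shifts by one, so it is worth comparing the three match counts before trusting the output.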