Read Excel Table with 300+ URLs and Get Page Information
Naveen_Vimalan
Member Posts: 3 Learner I
in Help
I would like to get information such as the response code, response message, content type, etc. for the URLs in my Excel table. I used Read Excel -> Store -> Handle Exception (Get Pages) -> Store as my process chain. For some reason I only get the URL in my result instead of all the information I want. Hopefully someone can help out.
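For comparison, the fields asked about here (response code, response message, content type) can be collected per URL with a short Python sketch using only the standard library. This is an illustrative stand-in for what Get Pages should deliver, not part of the original process; the function name and fields are hypothetical:

```python
from urllib.request import urlopen, Request
from urllib.error import URLError, HTTPError

def fetch_url_info(url, timeout=10):
    """Return response code, message, and content type for one URL."""
    try:
        req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(req, timeout=timeout) as resp:
            return {
                "url": url,
                "response_code": resp.status,
                "response_message": resp.reason,
                "content_type": resp.headers.get("Content-Type", ""),
            }
    except HTTPError as err:
        # 4xx/5xx responses still carry a code, reason, and headers.
        return {
            "url": url,
            "response_code": err.code,
            "response_message": err.reason,
            "content_type": err.headers.get("Content-Type", ""),
        }
    except (URLError, ValueError) as err:
        # Unreachable host or malformed URL: keep the row and record the
        # error, the same idea as wrapping Get Pages in Handle Exception.
        return {
            "url": url,
            "response_code": None,
            "response_message": str(err),
            "content_type": "",
        }
```

The URLs themselves could be read from the Excel sheet (for example with pandas' `read_excel`) and the returned dictionaries collected into a result table.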
This is the code:
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.9.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="9.9.000" expanded="true" height="68" name="Read Excel" width="90" x="112" y="136">
<parameter key="excel_file" value="/Users/XXX/datamining/excel/Leuphana.xlsx"/>
<parameter key="sheet_selection" value="sheet number"/>
<parameter key="sheet_number" value="1"/>
<parameter key="imported_cell_range" value="A1"/>
<parameter key="encoding" value="SYSTEM"/>
<parameter key="first_row_as_names" value="true"/>
<list key="annotations"/>
<parameter key="date_format" value=""/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="locale" value="German (Germany)"/>
<parameter key="read_all_values_as_polynominal" value="false"/>
<list key="data_set_meta_data_information">
<parameter key="0" value="Links.true.file_path.attribute"/>
</list>
<parameter key="read_not_matching_values_as_missings" value="false"/>
</operator>
<operator activated="true" class="store" compatibility="9.9.000" expanded="true" height="68" name="Store" width="90" x="246" y="136">
<parameter key="repository_entry" value="../data/Leuphana_Links"/>
</operator>
<operator activated="true" class="handle_exception" compatibility="9.9.000" expanded="true" height="82" name="Handle Exception" width="90" x="380" y="136">
<parameter key="add_details_to_log" value="false"/>
<process expanded="true">
<operator activated="true" class="web:retrieve_webpages" compatibility="9.7.000" expanded="true" height="68" name="Get Pages" width="90" x="179" y="34">
<parameter key="link_attribute" value="Links"/>
<parameter key="page_attribute" value="Inhalt"/>
<parameter key="random_user_agent" value="true"/>
<parameter key="connection_timeout" value="10000"/>
<parameter key="read_timeout" value="10000"/>
<parameter key="follow_redirects" value="true"/>
<parameter key="accept_cookies" value="all"/>
<parameter key="cookie_scope" value="thread"/>
<parameter key="request_method" value="POST"/>
<parameter key="delay" value="none"/>
<parameter key="delay_amount" value="1000"/>
<parameter key="min_delay_amount" value="0"/>
<parameter key="max_delay_amount" value="1000"/>
</operator>
<connect from_port="in 1" to_op="Get Pages" to_port="Example Set"/>
<connect from_op="Get Pages" from_port="Example Set" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
<process expanded="true">
<connect from_port="in 1" to_port="out 1"/>
<portSpacing port="source_in 1" spacing="0"/>
<portSpacing port="source_in 2" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="store" compatibility="9.9.000" expanded="true" height="68" name="Store (2)" width="90" x="514" y="136">
<parameter key="repository_entry" value="../data/Leuphana_Result"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Store" to_port="input"/>
<connect from_op="Store" from_port="through" to_op="Handle Exception" to_port="in 1"/>
<connect from_op="Handle Exception" from_port="out 1" to_op="Store (2)" to_port="input"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
</process>
</operator>
</process>
Best Answer
yyhuang Administrator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist
Hi @Naveen_Vimalan,
I used your Excel file as input for the URL links and got 325 results and 8 errors (see the attached screenshot for the error messages). The errors mostly come from bad URL links that contain regex patterns (why regex?).
A process with Loop and Get Page is attached for your reference.
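The point about malformed URLs can be checked up front before any request is made. A minimal pre-filter sketch (a hypothetical helper, assuming only http/https links should be fetched):

```python
from urllib.parse import urlparse

def is_valid_url(url):
    """Crude pre-filter: require an http(s) scheme and a host part.
    Regex fragments such as '(.*?)' fail both checks."""
    parts = urlparse(str(url).strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)
```

Rows failing this check could be logged separately instead of being sent to the retrieval step at all.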
Cheers,
YY
Answers
Please read this interesting thread about the Web Page Operator
network connection with Get Pages - operator — RapidMiner Community
I attached a simple process to handle this; please try it.
Best
The process you posted is broken. Are you able to attach the Excel file or process file (.rmp)? I have built some web scraping and web mining processes to get reviews from Indeed, Yelp, G2, etc. Attached is the one used for storing the HTML web pages as the first step.
HTH!
YY
I attached the Excel and .rmp file below. I also added a picture of the results I want to achieve with the 300+ URLs, instead of only the 4 results shown in the screenshot.
Best Regards,
Naveen
Generally, the "Get Page" operator works better than "Get Pages".
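The Loop + Get Page pattern amounts to isolating each request so that one failure cannot abort the whole batch. A hypothetical Python equivalent of that idea (function name and delay parameter are illustrative):

```python
import time
from urllib.request import urlopen
from urllib.error import URLError

def fetch_all(urls, delay=1.0, timeout=10):
    """Fetch URLs one at a time; record the status code or the error text."""
    results = []
    for url in urls:
        try:
            with urlopen(url, timeout=timeout) as resp:
                results.append((url, resp.status))
        except (URLError, ValueError) as err:
            # A single bad URL produces one error row, not a failed run.
            results.append((url, str(err)))
        time.sleep(delay)  # be polite; mirrors the operator's delay setting
    return results
```

With 300+ URLs, a per-request delay keeps the target server from throttling or blocking the crawl.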