How to stop the Get Pages module stopping the process when it cannot read a URL

davidellis · November 2015

I have a process that reads an Excel file, gets the pages and then processes the results. With a dataset of 98 records it runs perfectly. If I add another 500 records, I get random "read URL" errors.

I have checked all the URLs and they work perfectly, and my internet connection is solid. I found a solution on the forum based on the Handle Exception operator, but it doesn't seem to make any difference and I am not sure how it works.

Any ideas how to fix the errors or, failing that, how to skip those URLs?

SGolbert · May 2018

Hi David,

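A long time after your post, I have run into the same problem. It can be worked around by looping over the URLs one at a time and placing Get Page inside a Handle Exception operator, so that a URL that cannot be read is skipped instead of stopping the process: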

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="45" y="34">
        <parameter key="script" value="import pandas&#10;&#10;# rm_main is a mandatory function, &#10;# the number of arguments has to be the number of input ports (can be none)&#10;def rm_main():&#10;    &#10;    data2 = pandas.DataFrame({'link':['https://www.presseportal.de/blaulicht/pm/70116/3951184','https://www.nonexisting.ar', 'https://www.tu-dortmund.de/uni/de/Einstieg/aktuelles/meldungen/2018-01/18-01-31-Do-camp-ing/index.html']})&#10;&#10;    # connect the output port to see the results&#10;    return data2"/>
      </operator>
      <operator activated="true" class="extract_macro" compatibility="8.2.000" expanded="true" height="68" name="Extract Macro" width="90" x="246" y="34">
        <parameter key="macro" value="number_examples"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="concurrency:loop" compatibility="8.2.000" expanded="true" height="82" name="Loop" width="90" x="447" y="34">
        <parameter key="number_of_iterations" value="%{number_examples}"/>
        <process expanded="true">
          <operator activated="true" class="filter_example_range" compatibility="8.2.000" expanded="true" height="82" name="Filter Example Range" width="90" x="112" y="34">
            <parameter key="first_example" value="%{iteration}"/>
            <parameter key="last_example" value="%{iteration}"/>
          </operator>
          <operator activated="true" class="extract_macro" compatibility="8.2.000" expanded="true" height="68" name="Extract Macro (2)" width="90" x="246" y="34">
            <parameter key="macro" value="link"/>
            <parameter key="macro_type" value="data_value"/>
            <parameter key="attribute_name" value="link"/>
            <parameter key="example_index" value="1"/>
            <list key="additional_macros"/>
          </operator>
          <operator activated="true" class="handle_exception" compatibility="8.2.000" expanded="true" height="82" name="Handle Exception" width="90" x="514" y="34">
            <process expanded="true">
              <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34">
                <parameter key="url" value="%{link}"/>
                <list key="query_parameters"/>
                <list key="request_properties"/>
              </operator>
              <connect from_op="Get Page" from_port="output" to_port="out 1"/>
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
            <process expanded="true">
              <portSpacing port="source_in 1" spacing="0"/>
              <portSpacing port="sink_out 1" spacing="0"/>
              <portSpacing port="sink_out 2" spacing="0"/>
            </process>
          </operator>
          <connect from_port="input 1" to_op="Filter Example Range" to_port="example set input"/>
          <connect from_op="Filter Example Range" from_port="example set output" to_op="Extract Macro (2)" to_port="example set"/>
          <connect from_op="Handle Exception" from_port="out 1" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="8.1.000" expanded="true" height="103" name="Process Documents" width="90" x="715" y="34">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="keep_text" value="true"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="380" y="34"/>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Execute Python" from_port="output 1" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Extract Macro" from_port="example set" to_op="Loop" to_port="input 1"/>
      <connect from_op="Loop" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

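Strangely enough, the Loop Examples operator seems to be broken, so I emulated it with the normal Loop operator: in each iteration, Filter Example Range selects the row for that iteration and Extract Macro copies its link into the %{link} macro that Get Page reads.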
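For anyone who prefers to do the fetching in a script, the same "try each URL, skip the ones that fail" pattern can be sketched in plain Python. This is only an illustration of the idea, not part of the process above: the names fetch_page and fetch_all, the 10 second timeout and the replaceable fetch argument are all assumptions made for the example.

```python
import urllib.request


def fetch_page(url, timeout=10):
    """Download one page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch=fetch_page):
    """Fetch each URL in turn; record failures instead of stopping.

    `fetch` is a hypothetical hook so the download step can be swapped out.
    """
    pages, failed = [], []
    for url in urls:
        try:
            pages.append((url, fetch(url)))
        except Exception as exc:
            # Deliberately broad, in the same spirit as Handle Exception:
            # any error for this URL is noted and the loop carries on.
            failed.append((url, str(exc)))
    return pages, failed
```

Calling fetch_all with the list of links returns the pages that could be read plus a list of the URLs that failed and why, so the failures can be inspected afterwards rather than silently lost. Inside an Execute Python operator, rm_main could call it on the link column and return the results as a data frame.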

It would be nice if the Get Pages operator could ignore not found responses!
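Outside the operator, a script can tell a page that really is missing apart from a failure that might go away on a second attempt. The sketch below assumes the random errors in the original question are temporary (the thread never established their cause), so it retries those a few times and gives up immediately on a 404. The function name fetch_with_retry and its attempts, delay, opener and sleep parameters are hypothetical, chosen only for this example.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retry(url, attempts=3, delay=2.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Return the page bytes, or None if the page is missing or all attempts fail."""
    for attempt in range(1, attempts + 1):
        try:
            with opener(url, timeout=10) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code == 404:
                return None  # page does not exist; retrying will not help
            # any other HTTP status falls through to the retry below
        except OSError:
            pass  # URLError and socket timeouts are OSError subclasses
        if attempt < attempts:
            sleep(delay * attempt)  # wait a little longer before each retry
    return None
```

A None result means "skip this URL", which is roughly what the empty second subprocess of Handle Exception achieves in the process above. Whether retrying helps depends on why the reads fail, so the attempt count and delay are values to experiment with rather than recommendations.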

Regards,

Sebastian
