CrawlWeb a news site for a specific keyword
ittaj_goldberge
Member Posts: 6 Learner III
Hi everyone,
I am new here! I have a problem with Crawl Web which I'm not able to solve; I've tried and googled for weeks now (it seems pretty simple, but I just don't get it).
I want to crawl a news site (here: http://www.bbc.com/) for a keyword (here: .*zuckerberg.*) and save 100 results as .txt files.
But it just doesn't work; I've tried everything, but I can't seem to get it done.
I hope you can help me. Please see my process in .xml below.
Thank you very much in advance for your help!
<?xml version="1.0" encoding="UTF-8"?>
<process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator name="Process" expanded="true" compatibility="8.2.000" class="process" activated="true">
    <parameter value="init" key="logverbosity"/>
    <parameter value="2001" key="random_seed"/>
    <parameter value="never" key="send_mail"/>
    <parameter value="" key="notification_email"/>
    <parameter value="30" key="process_duration_for_mail"/>
    <parameter value="SYSTEM" key="encoding"/>
    <process expanded="true">
      <operator name="Crawl Web" expanded="true" compatibility="7.3.000" class="web:crawl_web" activated="true" y="34" x="112" width="90" height="68">
        <parameter value="http://www.bbc.com/" key="url"/>
        <list key="crawling_rules">
          <parameter value=".*tech.*" key="follow_link_with_matching_url"/>
          <parameter value=".*zuckerberg.*" key="store_with_matching_url"/>
          <parameter value=".*news.*" key="follow_link_with_matching_url"/>
          <parameter value=".*zuckerberg.*" key="store_with_matching_content"/>
        </list>
        <parameter value="false" key="write_pages_into_files"/>
        <parameter value="true" key="add_pages_as_attribute"/>
        <parameter value="txt" key="extension"/>
        <parameter value="100" key="max_pages"/>
        <parameter value="4" key="max_depth"/>
        <parameter value="web" key="domain"/>
        <parameter value="1000" key="delay"/>
        <parameter value="2" key="max_threads"/>
        <parameter value="10000" key="max_page_size"/>
        <parameter value="rapid-miner-crawler" key="user_agent"/>
        <parameter value="true" key="obey_robot_exclusion"/>
        <parameter value="false" key="really_ignore_exclusion"/>
      </operator>
      <operator name="Process Documents from Data" expanded="true" compatibility="8.1.000" class="text:process_document_from_data" activated="true" y="34" x="313" width="90" height="82">
        <parameter value="false" key="create_word_vector"/>
        <parameter value="TF-IDF" key="vector_creation"/>
        <parameter value="true" key="add_meta_information"/>
        <parameter value="true" key="keep_text"/>
        <parameter value="none" key="prune_method"/>
        <parameter value="3.0" key="prune_below_percent"/>
        <parameter value="30.0" key="prune_above_percent"/>
        <parameter value="0.05" key="prune_below_rank"/>
        <parameter value="0.95" key="prune_above_rank"/>
        <parameter value="double_sparse_array" key="datamanagement"/>
        <parameter value="auto" key="data_management"/>
        <parameter value="false" key="select_attributes_and_weights"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator name="Extract Content" expanded="true" compatibility="7.3.000" class="web:extract_html_text_content" activated="true" y="34" x="45" width="90" height="68">
            <parameter value="true" key="extract_content"/>
            <parameter value="5" key="minimum_text_block_length"/>
            <parameter value="true" key="override_content_type_information"/>
            <parameter value="true" key="neglegt_span_tags"/>
            <parameter value="true" key="neglect_p_tags"/>
            <parameter value="true" key="neglect_b_tags"/>
            <parameter value="true" key="neglect_i_tags"/>
            <parameter value="true" key="neglect_br_tags"/>
            <parameter value="true" key="ignore_non_html_tags"/>
          </operator>
          <operator name="Unescape HTML Document" expanded="true" compatibility="7.3.000" class="web:unescape_html" activated="true" y="34" x="179" width="90" height="68"/>
          <operator name="Write Document" expanded="true" compatibility="8.1.000" class="text:write_document" activated="true" y="34" x="313" width="90" height="82">
            <parameter value="true" key="overwrite"/>
            <parameter value="SYSTEM" key="encoding"/>
          </operator>
          <operator name="Write File" expanded="true" compatibility="8.2.000" class="write_file" activated="true" y="136" x="447" width="90" height="68">
            <parameter value="file" key="resource_type"/>
            <parameter value="C:\Users\Ittaj\Desktop\rapidminer\tests\%{t}-%{a}.txt" key="filename"/>
            <parameter value="application/octet-stream" key="mime_type"/>
          </operator>
          <connect to_port="document" to_op="Extract Content" from_port="document"/>
          <connect to_port="document" to_op="Unescape HTML Document" from_port="document" from_op="Extract Content"/>
          <connect to_port="document" to_op="Write Document" from_port="document" from_op="Unescape HTML Document"/>
          <connect to_port="document 1" from_port="document" from_op="Write Document"/>
          <connect to_port="file" to_op="Write File" from_port="file" from_op="Write Document"/>
          <portSpacing spacing="0" port="source_document"/>
          <portSpacing spacing="0" port="sink_document 1"/>
          <portSpacing spacing="0" port="sink_document 2"/>
        </process>
      </operator>
      <connect to_port="example set" to_op="Process Documents from Data" from_port="Example Set" from_op="Crawl Web"/>
      <connect to_port="result 1" from_port="example set" from_op="Process Documents from Data"/>
      <portSpacing spacing="0" port="source_input 1"/>
      <portSpacing spacing="0" port="sink_result 1"/>
      <portSpacing spacing="0" port="sink_result 2"/>
    </process>
  </operator>
</process>
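For readers untangling the crawling_rules above: conceptually, Crawl Web performs a breadth-first crawl that follows links whose URL matches a "follow" rule and stores pages whose URL or content matches a "store" rule, until max_pages or max_depth is reached. Below is a minimal sketch of that idea in plain Python. It is an illustration only, not RapidMiner's implementation; the patterns, user agent, and limits are taken from the process above.

# Conceptual sketch of the crawling_rules above: a breadth-first crawl that
# follows links whose URL matches a "follow" pattern and stores pages whose
# content matches the "store" pattern, until max_pages/max_depth is reached.
# Illustration only; not RapidMiner's actual implementation.
import re
import time
from collections import deque
from urllib.parse import urljoin
from urllib.request import Request, urlopen

FOLLOW = [re.compile(r".*tech.*"), re.compile(r".*news.*")]       # follow_link_with_matching_url
STORE = re.compile(r".*zuckerberg.*", re.IGNORECASE | re.DOTALL)  # store_with_matching_content

def crawl(start_url, max_pages=100, max_depth=4, delay=1.0):
    stored, seen = [], {start_url}
    queue = deque([(start_url, 0)])  # (url, depth)
    while queue and len(stored) < max_pages:
        url, depth = queue.popleft()
        try:
            req = Request(url, headers={"User-Agent": "rapid-miner-crawler"})
            html = urlopen(req, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # skip unreachable pages
        if STORE.match(html):  # page content mentions the keyword -> store it
            stored.append((url, html))
        if depth < max_depth:
            for href in re.findall(r'href="([^"]+)"', html):
                link = urljoin(url, href)
                if link not in seen and any(p.match(link) for p in FOLLOW):
                    seen.add(link)  # only follow links matching a "follow" rule
                    queue.append((link, depth + 1))
        time.sleep(delay)  # the operator's 1000 ms "delay" parameter
    return stored

The key point: pages are only ever discovered through links that match a "follow" rule, so if the Zuckerberg articles are not reachable through .*tech.* or .*news.* URLs within max_depth hops of the start page, they are never stored no matter how many exist on the site.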
Answers
Hmm, I think your XML code is broken. Can you please go to the XML panel and copy and paste it into this thread?
Thanks, I'll try it again:
Hi @ittaj_goldberge
This type of setting works for me, retrieving articles that mention Zuckerberg.
When you say "it doesn't work", what exactly do you mean? Does the process hang, or deliver wrong results?
Vladimir
http://whatthefraud.wtf
Hi @kypexin
I tried a lot of different variants (rule types and values, and also depth and link patterns).
Usually the process runs for a second and returns no results. Sometimes I got a few results (fewer than 20, but I need around 100).
I'm trying it right now with your rules; it has been running for 2 minutes. I will update soon.
So I tried it again with your rules, and I only got 8 results, with some duplicates.
Any idea how I can crawl a news site for Zuckerberg and get 100 results?
@ittaj_goldberge Does the news site have more than 8 Zuckerberg articles? You might have to increase the max_depth parameter to dig deeper.
Hi @Thomas_Ott
When I go to the search bar on BBC and look for Zuckerberg, there are thousands of results:
https://www.bbc.co.uk/search?q=zuckerberg#page=5
@ittaj_goldberge I'm by no means a web crawling expert, but for some recent client work I was exposed to web browser automation. Websites have gotten smart: to keep people from crawling them, they use various scripts to hide content that isn't on the first page or 'above the fold.'
I suspect that this is the case here. The link you shared is really a search query; it requires a browser to render and probably doesn't work with a web crawler like RapidMiner's. So that could be the problem.
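One quick way to test this theory, as a minimal sketch using only the Python standard library: fetch the raw HTML of the search page (which is all a plain crawler ever sees) and count the keyword occurrences. If the count is near zero, the results are rendered client-side by JavaScript, and an HTTP crawler will never see them. Note also that the #page=5 fragment in the shared link is never even sent to the server; it is interpreted entirely by the browser.

# Does the raw HTML of the search page contain the results a browser shows?
from urllib.request import Request, urlopen

url = "https://www.bbc.co.uk/search?q=zuckerberg"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
raw_html = urlopen(req, timeout=10).read().decode("utf-8", "replace")
print("keyword occurrences in raw HTML:", raw_html.lower().count("zuckerberg"))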
If this is the case, as @Thomas_Ott mentioned, I would also expect that you could play around with the 'user agent' and 'obey robot exclusion' parameters of the Crawl Web operator (namely, change the user agent string, disable the checkbox, and then compare the results):
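As a starting point, here is a minimal sketch using only the Python standard library (the URLs are illustrative) that checks what the site's robots.txt actually allows for a given user-agent string, before you decide whether to disable robot exclusion:

# Check what robots.txt permits for different user-agent strings.
# Illustrative URLs; uses only the Python standard library.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.bbc.com/robots.txt")
rp.read()

for agent in ("rapid-miner-crawler", "Mozilla/5.0"):
    # can_fetch reports whether this agent may crawl the given URL
    print(agent, "->", rp.can_fetch(agent, "http://www.bbc.com/news"))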
Vladimir
http://whatthefraud.wtf