
Webcrawler Doubt

newbierapid Member Posts: 6 Contributor II
edited November 2018 in Help
Hi All,

I am using RM 5.1 and am currently experimenting with web mining. My objective is to crawl a web page and display the results according to the crawling rules. After applying the crawling rules, I am not able to see any output.

I appreciate any help; thanks in advance.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.008">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
    <process expanded="true" height="503" width="604">
      <operator activated="true" class="web:crawl_web" compatibility="5.1.002" expanded="true" height="60" name="Crawl Web" width="90" x="122" y="119">
        <parameter key="url" value="http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&amp;Sect2=HITOFF&amp;u=/netahtml/PTO/search-adv.htm&amp;r=0&amp;p=1&amp;f=S&amp;l=50&amp;Query=apple&amp;d=PTXT"/>
        <list key="crawling_rules">
          <parameter key="store_with_matching_url" value=".*(Apple)"/>
          <parameter key="store_with_matching_content" value=".*(Apple"/>
          <parameter key="follow_link_with_matching_text" value=".*(Apple"/>
        </list>
        <parameter key="write_pages_into_files" value="false"/>
        <parameter key="max_pages" value="5"/>
      </operator>
      <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>


Thanks

Answers

  • colo Member Posts: 236 Maven
    Hi,

    It seems there are some closing brackets missing for the last two rules.

    There is one special thing to consider when using "store_with_matching_content": if you want the dot to match all symbols including line breaks, you have to activate the dot-all mode. This is possible by placing "(?s)" at the beginning of your expression. But this will make crawling slow, since whole webpages have to be scanned (see http://rapid-i.com/rapidforum/index.php/topic,2102.0.html).
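
    As a quick illustration of what "(?s)" changes (just a sketch using plain java.util.regex, which is the syntax RapidMiner's regular expressions follow since it is Java-based; the page content below is made up):

    import java.util.regex.Pattern;

    public class DotAllDemo {
        public static void main(String[] args) {
            // made-up page content that spans two lines
            String content = "<html>\n<body>Apple patents</body></html>";

            // without (?s) the dot does not match the line break,
            // so the expression cannot cover the whole document
            System.out.println(Pattern.matches(".*Apple.*", content));     // false

            // with (?s) (dot-all mode) the dot also matches "\n"
            System.out.println(Pattern.matches("(?s).*Apple.*", content)); // true
        }
    }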

    Regards
    Matthias
  • newbierapidnewbierapid Member Posts: 6 Contributor II
    Hi Matthias,

    I have tried it the way you explained, but I still couldn't find the solution. Please find the XML code below.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.1.008">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.1.008" expanded="true" name="Process">
        <process expanded="true" height="521" width="622">
          <operator activated="true" class="web:crawl_web" compatibility="5.1.002" expanded="true" height="60" name="Crawl Web" width="90" x="179" y="210">
            <parameter key="url" value="http://www.google.com/search?q=apple&amp;btnG=Search+Patents&amp;tbm=pts&amp;tbo=1&amp;hl=en"/>
            <list key="crawling_rules">
              <parameter key="follow_link_with_matching_text" value="(?s).*apple.*"/>
            </list>
            <parameter key="write_pages_into_files" value="false"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


    Thanks
  • colo Member Posts: 236 Maven
    Hi,

    You're right, this is simply not working. I also can't obtain any pages from either of the URLs you tried (even without crawling rules, which means that all links should be followed). I tried a smaller webpage instead and that works. Maybe those big pages block the crawler somehow?
    Further investigation of the returned messages will certainly be required, which means working with the source code. Or maybe I am also missing something necessary to get this working... Sorry.
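
    A rough way to check that suspicion (just a sketch with plain java.net; the class and helper names below are only illustrative) is to compare the HTTP status codes the site returns for different User-Agent headers, and to look at what its robots.txt allows:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CrawlCheck {
        // returns the HTTP status code the server answers with
        // when the request carries the given User-Agent header
        static int statusFor(String url, String userAgent) throws IOException {
            HttpURLConnection con = (HttpURLConnection) new URL(url).openConnection();
            con.setRequestProperty("User-Agent", userAgent);
            con.setInstanceFollowRedirects(false);
            int code = con.getResponseCode();
            con.disconnect();
            return code;
        }

        public static void main(String[] args) throws IOException {
            String url = "http://www.google.com/search?q=apple";
            // default-looking client vs. a browser-like User-Agent
            System.out.println(statusFor(url, "Java"));
            System.out.println(statusFor(url, "Mozilla/5.0"));
            // robots.txt tells you what the site wants crawlers to skip
            System.out.println(statusFor("http://www.google.com/robots.txt", "Java"));
        }
    }

    If the two status codes differ, or robots.txt disallows the path, the site is most likely rejecting automated clients rather than the crawler misbehaving.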

    Regards
    Matthias
  • newbierapid Member Posts: 6 Contributor II
    Thanks Matthias,