
"Problem crawling pages with blank spaces"

oju987 Member Posts: 4 Contributor I
edited June 2019 in Help

I have been using RapidMiner for a while and have some experience with web crawling without major problems. But one new assignment has me puzzled.

URLs look like this:

http:\\www.movilauto.com\toyota rav4 2012.html

http:\\www.movilauto.com\bmw 320 2013.html

I would normally use .+movilauto.+ to get these pages, and it would work out pretty well. But apparently the spaces are a problem.

To complicate things even further, the number of spaces is not fixed: sometimes there are two, as in the previous examples, and sometimes three, as in the following example:

http:\\www.movilauto.com\toyota rav4  automatic 2012.html
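
For what it's worth, the matching rule itself does not seem to be the problem, since the regex happily matches URLs that contain spaces. A quick check outside RapidMiner (a minimal Python sketch; I wrote the URL with normal slashes here):

    import re

    # The same crawling rule I use in RapidMiner, tried outside it.
    pattern = re.compile(r".+movilauto.+")

    # One of the problematic URLs (note the literal spaces).
    url = "http://www.movilauto.com/toyota rav4 2012.html"

    # The rule matches even though the URL contains spaces,
    # so the matching step is not where things break.
    print(bool(pattern.fullmatch(url)))  # prints: True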


Any suggestions?



Answers

  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    Hi!

    Use the Encode URLs operator (in the Web Mining extension) to correctly pass the URLs.

    Note: your use of backslashes \ instead of slashes / will also break everything, so replace those as well.
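
    What Encode URLs gives you is percent-encoding: every space in the path becomes %20, however many there are in a row. A minimal Python sketch of the same idea (just an illustration, not the operator itself), using the URL from your question:

    from urllib.parse import quote

    raw = "http://www.movilauto.com/toyota rav4  automatic 2012.html"

    # Percent-encode the path but leave the URL structure (: and /) alone;
    # each space, including consecutive ones, becomes %20.
    encoded = quote(raw, safe=":/")
    print(encoded)
    # http://www.movilauto.com/toyota%20rav4%20%20automatic%202012.html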


    Regards,

    Balázs

  • oju987 Member Posts: 4 Contributor I

    Thank you, Balázs, for your answer.

    My mistake with the backslashes: I checked the RapidMiner operator and I was using the correct slashes; it was a typing mistake while writing the post.


    I found the Encode URLs operator, but I am unsure how to use it; my process is extremely simple:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.012">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.012" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="web:crawl_web" compatibility="5.3.002" expanded="true" height="60" name="Crawl Web" width="90" x="246" y="120">
            <parameter key="url" value="http://autopunto.net/"/>
            <list key="crawling_rules">
              <parameter key="store_with_matching_url" value=".+autopunto.+"/>
            </list>
            <parameter key="output_dir" value="C:\predios\autopunto"/>
            <parameter key="extension" value=".html"/>
            <parameter key="max_threads" value="5"/>
            <parameter key="obey_robot_exclusion" value="false"/>
            <parameter key="really_ignore_exclusion" value="true"/>
          </operator>
          <connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    The site has few pages, and the crawling operator finds the pages, but it doesn't store them.


    I attached the log file.


    Very grateful for your help!


    Attachment: log.txt (12.5K)
  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    OK, this seems to be a limitation in the web crawler. 

    Your best bet is to parse the links yourself.

    The crawler gives you a list of pages; these are the main pages. You can process them with Process Documents from Files (Text Processing extension) and Extract Information to get the link URLs that contain spaces. Then use Encode URLs to get the correct URLs, which you can fetch in the next step, as sketched below.
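
    In plain code, the whole loop looks roughly like this. A minimal Python sketch of the same idea, outside RapidMiner; the start URL and the href pattern are assumptions to adapt to the real pages:

    import re
    from urllib.parse import quote, urljoin
    from urllib.request import urlopen

    # Hypothetical start page, standing in for the site being crawled.
    start = "http://www.movilauto.com/"
    html = urlopen(start).read().decode("utf-8", errors="replace")

    # Step 1: extract the raw link targets; they may still contain spaces.
    for raw in re.findall(r'href="([^"]+\.html)"', html):
        # Step 2: resolve relative links and percent-encode the spaces,
        # which is what Encode URLs does.
        url = quote(urljoin(start, raw), safe=":/")
        # Step 3: fetch the now-valid URL and store the page.
        with open(url.rsplit("/", 1)[-1], "wb") as f:
            f.write(urlopen(url).read())

    In RapidMiner terms, Extract Information plays the role of the findall step here, and the final fetch would be an operator such as Get Pages.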


    Regards,

    Balázs
