"Can I crawl JavaScript-based websites using RapidMiner?"
I have a problem crawling a website. I believe the problem is that the website is built with JavaScript. Is it possible to crawl such a page using RapidMiner?
For example: http://www.booking.com/hotel/nl/easyhotel-amsterdam.nl.html?sid=9fc05dc001129cc3698397a2efbfba2f;dcid=1#hash-blockdisplay4
When I use the Crawl Web operator, it only creates two files. Both files lead to the start page of the hotel, not the review page, even though I use the review page as the URL in the operator.
How can I crawl this website?
Thanks Arno
Answers
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.3.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.3.008" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="web:crawl_web" compatibility="5.3.000" expanded="true" height="60" name="Crawl Web" width="90" x="112" y="30">
<parameter key="url" value="http://www.booking.com/hotel/nl/easyhotel-amsterdam.nl.html?sid=9fc05dc001129cc3698397a2efbfba2f;dcid=1#hash-blockdisplay4"/>
<list key="crawling_rules">
<parameter key="store_with_matching_url" value=".+/easyhotel-amsterdam.nl..+"/>
<parameter key="follow_link_with_matching_text" value=".+/easyhotel-amsterdam.nl..+|#hash-blockdisplay4"/>
</list>
<parameter key="output_dir" value="C:\Improve Your Business\Qing\Pilot\test\crawlbooking.com"/>
<parameter key="extension" value="html"/>
<parameter key="max_pages" value="1000"/>
<parameter key="max_depth" value="18"/>
<parameter key="max_page_size" value="100000"/>
</operator>
<connect from_op="Crawl Web" from_port="Example Set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Unfortunately, this is not possible at the moment.
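One likely reason the crawl ends up on the hotel's start page, sketched in Python below: the `#hash-blockdisplay4` part of the URL is a fragment, and fragments are never sent to the server. Assuming the review block is shown by JavaScript that reacts to that fragment, any crawler that only downloads HTML would receive the same page as for the plain hotel URL. The helper `request_target` is a hypothetical name used for illustration; it is not part of RapidMiner.

```python
from urllib.parse import urlsplit


def request_target(url: str) -> str:
    """Return the path and query that an HTTP client sends to the server.

    Hypothetical helper for illustration: the fragment (the part after '#')
    is deliberately left out, because it is handled only on the client side.
    """
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    return target


# The review-page URL from the question.
review_url = (
    "http://www.booking.com/hotel/nl/easyhotel-amsterdam.nl.html"
    "?sid=9fc05dc001129cc3698397a2efbfba2f;dcid=1#hash-blockdisplay4"
)

parts = urlsplit(review_url)
print("fragment (stays in the browser):", parts.fragment)
print("requested from the server:", request_target(review_url))
```

If that assumption holds, the two stored files are simply the server's response for the hotel page itself, and the reviews would only appear after a browser runs the page's scripts.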
Best,
Nils
Thanks for your answer. Maybe this functionality could be added in a future release, since more and more websites are using JavaScript.
I crawled the websites using 'Mozenda', which works perfectly!
Regards, Arno