Returning website HTML code
I am a RapidMiner learner and need to download the HTML code for any given website in order to determine whether any of its pages include a login, a form submission or some other workflow. The idea is to download the HTML and then search it for identifiers unique to such functionality. My questions are:
a) Is this the best way to accomplish the task?
b) What is the best sequence of operators to do so?
Thank you in advance for your help; it is greatly appreciated. BK
Best Answer
Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn
If you have multiple pages to retrieve, you can also use a CSV file of URLs with the "Get Pages" operator.
And if you need to crawl through an entire site, the very useful "Crawl Web" operator lets you specify crawling rules and crawling depth and save all retrieved pages as HTML files, so it is perfect for your use case. Just be sure to observe any crawling rules posted in the T&C of the sites you are scraping.
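Once the pages are saved as HTML, the "search for identifiers" step could look roughly like the sketch below. This is not a RapidMiner process, just a plain Python illustration using the standard library's html.parser. The names FormFinder and inspect_html are hypothetical, and treating a form tag or a password input as the marker of a form or login is an assumption; the real identifiers would depend on the sites in question.

```python
from html.parser import HTMLParser


class FormFinder(HTMLParser):
    """Collects evidence of forms and login fields in one HTML document."""

    def __init__(self):
        super().__init__()
        self.form_count = 0
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names; values may be None.
        attrs = dict(attrs)
        if tag == "form":
            self.form_count += 1
        elif tag == "input" and (attrs.get("type") or "").lower() == "password":
            # Assumption: a password input is a reasonable sign of a login.
            self.has_password_field = True


def inspect_html(html):
    """Return how many forms a page has and whether it looks like a login page."""
    finder = FormFinder()
    finder.feed(html)
    finder.close()
    return {"forms": finder.form_count, "login": finder.has_password_field}
```

Parsing the tags rather than searching the raw text avoids false hits when a word like "form" simply appears in the page copy.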
Answers
Hi @bking - sure. My first thought is to use the "Get Page" operator in the Web Mining extension. That should do the trick nicely.
Scott
Thank you, Scott & Brian. Very helpful. I used the "Crawl Web" and "Filter Examples" operators to grab the HTML of individual pages and then filter it based on keywords (yet to be defined by the web development team). I will keep you posted as the project develops. Thank you again.
Bill
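For readers who want to see the keyword-filtering step outside RapidMiner, here is a minimal Python sketch of the same idea. The function name pages_matching, the mapping of URL to HTML text, and the example keywords are all hypothetical, since the real keyword list has not been defined yet.

```python
def pages_matching(pages, keywords):
    """Return the URLs whose HTML contains at least one keyword.

    pages: dict mapping a URL to that page's HTML text (assumed input shape).
    keywords: iterable of plain strings, matched case-insensitively.
    """
    lowered = [k.lower() for k in keywords]
    return [
        url
        for url, html in pages.items()
        if any(k in html.lower() for k in lowered)
    ]


# Hypothetical crawl result and keyword list, for illustration only.
crawled = {
    "https://example.com/": "<h1>Welcome</h1>",
    "https://example.com/account": '<form><input type="password"></form>',
}
matches = pages_matching(crawled, ["<form", 'type="password"'])
```

This is a plain substring match, so it will also flag pages that merely mention a keyword in their visible text.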