Returning website HTML code

bking · April 2018

I am a RapidMiner learner and need to be able to download the HTML code for any given website in order to determine whether any of its pages include some form of login, form submission, or other workflow. The thought is to download the HTML code and then search for identifiers unique to such functionality. My questions are:

a) Is this the best way to accomplish the task?

b) What is the best sequence of operators to do so?

Thank you in advance for your help, it is greatly appreciated. BK

Telcontar120 · April 2018

If you have multiple pages to retrieve, you can also use a CSV file of URLs with the "Get Pages" operator.

And if you need to crawl through an entire site, then the very useful "Crawl Web" operator allows you to specify crawling rules and crawling depth and save all retrieved pages as html files, so it is perfect for your use case. Just be sure that you observe any crawling rules as posted in the T&C on sites that you are scraping.
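For readers who want to see what a depth-limited crawl does conceptually, here is a minimal Python sketch. It is not how the "Crawl Web" operator is implemented; the names `crawl`, `LinkCollector`, `fetch`, and `max_depth` are all hypothetical, and the rule "stay on the starting host" is just one assumed crawling rule. The `fetch` argument is any function you supply that maps a URL to its HTML text (or `None`), so the sketch itself makes no network calls and does not check robots.txt or site terms; that remains your responsibility (Python's standard `urllib.robotparser` module can help with the former).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, max_depth=2):
    """Breadth-first crawl that stays on the start URL's host.

    `fetch` is a caller-supplied function: URL -> HTML text, or None
    when the page is unavailable. Returns a dict {url: html}.
    """
    host = urlparse(start_url).netloc
    pages = {}
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        if depth >= max_depth:
            continue  # keep the page, but do not follow its links
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any #fragment.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Passing `fetch` in as a parameter keeps the crawl logic separate from the download step, so you can try it against a small dictionary of sample pages before pointing it at a real site.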

sgenzer · April 2018

hi @bking - sure...my first thought is to use the "Get Page" operator in the Web Mining extension. That should do the trick nicely.

Scott

bking · April 2018

Thank you, Scott & Brian. Very helpful. I used the "Crawl Web" and "Filter Examples" operators to grab each page's HTML and then filter on keywords (yet to be defined by the web development team). Will keep you posted as the project develops. Thank you again.

Bill
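As a rough illustration of the "search the HTML for identifiers" step outside RapidMiner, here is a minimal Python sketch using only the standard library. The markers it looks for (any `<form>` tag, and an `<input type="password">` as a sign of a login) are assumptions chosen for the example, since the real keywords in this thread were still to be defined; `FormDetector` and `inspect_page` are hypothetical names.

```python
from html.parser import HTMLParser


class FormDetector(HTMLParser):
    """Counts <form> tags and notes whether a password input appears."""

    def __init__(self):
        super().__init__()
        self.form_count = 0
        self.has_password_field = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.form_count += 1
        elif tag == "input":
            # Assumed marker: a password field suggests a login form.
            if (attrs.get("type") or "").lower() == "password":
                self.has_password_field = True


def inspect_page(html):
    """Return simple flags describing the forms found in one page."""
    detector = FormDetector()
    detector.feed(html)
    return {
        "has_form": detector.form_count > 0,
        "has_login": detector.has_password_field,
    }
```

Parsing the tags rather than searching the raw text for the word "form" avoids false hits from ordinary page copy. Pages that build their forms with JavaScript after loading would not be caught by this kind of static check.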
