Web Crawling for contact directory
I'm trying to crawl this site to create an Excel document containing the names, locations, phone numbers, and specialty type of individual practitioners on https://www.psychologytoday.com/us/therapists
The link above has links underneath for each state, and each state has about 50 pages of contacts. I'm just trying to get the HTML pulled so I can later extract the contact data, likely with Tableau Prep. The CSS selectors I got from SelectorGadget are span, h1, and .location-address-phone
This is the operator I'm using, and it's returning absolutely nothing. Can someone please help me figure this out? Thanks!
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.5.001">
  <operator activated="true" class="web:crawl_web_modern" compatibility="9.0.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="34">
    <parameter key="url" value="https://www.psychologytoday.com/us/therapists"/>
    <list key="crawling_rules">
      <parameter key="follow_link_with_matching_url" value="https://www.psychologytoday.com/us/therapists/.*"/>
      <parameter key="store_with_matching_url" value="https://www.psychologytoday.com/us/therapists/.*"/>
    </list>
    <parameter key="max_crawl_depth" value="52"/>
    <parameter key="retrieve_as_html" value="true"/>
    <parameter key="enable_basic_auth" value="false"/>
    <parameter key="add_content_as_attribute" value="false"/>
    <parameter key="write_pages_to_disk" value="true"/>
    <parameter key="include_binary_content" value="false"/>
    <parameter key="output_dir" value="/Users/ME/Desktop/Web Crawls"/>
    <parameter key="output_file_extension" value="html"/>
    <parameter key="max_pages" value="2500"/>
    <parameter key="max_page_size" value="10000"/>
    <parameter key="delay" value="500"/>
    <parameter key="max_concurrent_connections" value="100"/>
    <parameter key="max_connections_per_host" value="50"/>
    <parameter key="user_agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"/>
    <parameter key="ignore_robot_exclusion" value="false"/>
  </operator>
</process>
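Once the pages are saved to disk, the .location-address-phone selector mentioned above can be applied to the stored HTML. Below is a minimal standard-library sketch of that extraction step; BeautifulSoup's select() would be more convenient in practice. The sample markup is invented for illustration, as the site's real structure will differ.

```python
from html.parser import HTMLParser

class ContactParser(HTMLParser):
    """Collects the text inside elements with class 'location-address-phone'."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while inside a matching element
        self.contacts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "location-address-phone" in classes:
            self.depth += 1
            if self.depth == 1:
                self.contacts.append("")  # start a new contact record

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.contacts[-1] += data.strip() + " "

# Invented sample markup -- the real pages must be inspected to confirm it.
sample = ('<div class="location-address-phone">'
          '<span>Austin, TX</span><span>(555) 010-0000</span></div>')
parser = ContactParser()
parser.feed(sample)
results = [c.strip() for c in parser.contacts]
print(results)
```

The same parser can be fed each saved .html file in a loop, producing rows ready for a CSV that Excel or Tableau Prep can open.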
Best Answer
Telcontar120 · RapidMiner Certified Analyst, RapidMiner Certified Expert · Member Posts: 1,635 · Unicorn

Unfortunately, the Crawl Web operator doesn't work with https pages (and has several other known problems besides). You can replicate its functionality with Get Pages by preparing a CSV file containing the page links you want to store. Since the page links seem to follow a regular pattern, you can easily create such a list in Excel or even in RapidMiner itself. That should enable you to store the data you want (assuming doing so doesn't violate the site's terms and conditions of use).
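The CSV of links for Get Pages can also be generated with a short script. This sketch assumes a hypothetical ?page=N pagination pattern and a per-state path segment; both must be verified against the actual site before use, and the state list extended to all states.

```python
import csv

BASE = "https://www.psychologytoday.com/us/therapists"

# Illustrative subset -- extend to all state slugs used by the site.
states = ["alabama", "alaska", "arizona"]

rows = []
for state in states:
    # Roughly 50 pages per state, per the question above.
    for page in range(1, 51):
        rows.append([f"{BASE}/{state}?page={page}"])

# Write a one-column CSV that Get Pages can read as its link attribute.
with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Link"])
    writer.writerows(rows)
```

Read the resulting file into RapidMiner with Read CSV and point Get Pages at the Link column.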