Web Crawling for contact directory

CashCash Member Posts: 11 Contributor II
I'm trying to crawl this site to create an Excel document containing the names, locations, phone numbers, and specialty types of individual practitioners listed on https://www.psychologytoday.com/us/therapists

The page above links to a page for each state, and each state has about 50 pages of contacts. I'm just trying to get the HTML pulled so I can later extract the contact data, likely with Tableau Prep. The CSS selectors I got from SelectorGadget are span, h1, and .location-address-phone

This is the operator I'm using, and it's returning absolutely nothing.  Can someone please help me figure this out?  Thanks!

<?xml version="1.0" encoding="UTF-8"?><process version="9.5.001">
  <operator activated="true" class="web:crawl_web_modern" compatibility="9.0.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="34">
    <parameter key="url" value="https://www.psychologytoday.com/us/therapists"/>
    <list key="crawling_rules">
      <parameter key="follow_link_with_matching_url" value="https://www.psychologytoday.com/us/therapists/.*"/>
      <parameter key="store_with_matching_url" value="https://www.psychologytoday.com/us/therapists/.*"/>
    </list>
    <parameter key="max_crawl_depth" value="52"/>
    <parameter key="retrieve_as_html" value="true"/>
    <parameter key="enable_basic_auth" value="false"/>
    <parameter key="add_content_as_attribute" value="false"/>
    <parameter key="write_pages_to_disk" value="true"/>
    <parameter key="include_binary_content" value="false"/>
    <parameter key="output_dir" value="/Users/ME/Desktop/Web Crawls"/>
    <parameter key="output_file_extension" value="html"/>
    <parameter key="max_pages" value="2500"/>
    <parameter key="max_page_size" value="10000"/>
    <parameter key="delay" value="500"/>
    <parameter key="max_concurrent_connections" value="100"/>
    <parameter key="max_connections_per_host" value="50"/>
    <parameter key="user_agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"/>
    <parameter key="ignore_robot_exclusion" value="false"/>
  </operator>
</process>
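
For the later extraction step mentioned above, here is a minimal sketch in Python of pulling text out of saved HTML using two of the selectors from the post (h1 and .location-address-phone). It is an illustration only, not part of the RapidMiner process: the class name ContactParser, the function extract_contacts, and the sample markup are all made up for this example, and it assumes the real pages use those selectors the way SelectorGadget reported and that the markup is reasonably well nested.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be tracked on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ContactParser(HTMLParser):
    """Collect text inside <h1> tags and inside elements whose class list
    contains 'location-address-phone' (selectors taken from the post)."""

    TARGET_CLASS = "location-address-phone"

    def __init__(self):
        super().__init__()
        self.names = []      # one string per <h1>
        self.contacts = []   # one string per .location-address-phone element
        self._stack = []     # one entry per open tag: "name", "contact", or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h1":
            self._stack.append("name")
            self.names.append("")
        elif self.TARGET_CLASS in classes:
            self._stack.append("contact")
            self.contacts.append("")
        else:
            self._stack.append(None)

    def handle_endtag(self, tag):
        # Assumes well-nested markup: each end tag closes the innermost open tag.
        if tag not in VOID_TAGS and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attach text to the innermost enclosing target element, if any.
        for kind in reversed(self._stack):
            if kind == "name":
                self.names[-1] += data
                break
            if kind == "contact":
                self.contacts[-1] += data
                break


def extract_contacts(html):
    """Return (names, contacts) with whitespace collapsed to single spaces."""
    parser = ContactParser()
    parser.feed(html)
    parser.close()

    def clean(items):
        return [" ".join(text.split()) for text in items]

    return clean(parser.names), clean(parser.contacts)


# Hypothetical markup for illustration; not copied from the actual site.
SAMPLE = """
<html><body>
  <h1>Jane Doe</h1>
  <div class="location-address-phone">
    <span>Springfield, IL 62701</span><br>
    <span>(555) 010-0000</span>
  </div>
</body></html>
"""

if __name__ == "__main__":
    names, contacts = extract_contacts(SAMPLE)
    print(names)
    print(contacts)
```

The sketch uses only the standard library so it runs without extra installs. Real pages often contain unclosed tags, in which case a dedicated HTML library with CSS selector support would likely be more robust than this stack-based approach.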

Best Answer

Answers

  • CashCash Member Posts: 11 Contributor II
    Thank you, Brian. That's disappointing to hear. I don't think I'll be able to do this in RapidMiner, and I don't really know how to do the process you're referring to. I verified in the site's T&Cs that scraping was okay, and I found different software that let me scrape the site very easily, so I have the information I was looking for. Thank you again for the response!