Web Crawling for contact directory
I'm trying to crawl this site to create an Excel document containing the names, locations, phone numbers, and specialty type of individual practitioners on https://www.psychologytoday.com/us/therapists
The link above has links underneath for each state, and each state has about 50 pages of contacts. I'm just trying to get the HTML pulled so I can later extract the contact data, likely with Tableau Prep. The CSS selectors I got from SelectorGadget are span, h1, and .location-address-phone
This is the operator I'm using, and it's returning absolutely nothing. Can someone please help me figure this out? Thanks!
<?xml version="1.0" encoding="UTF-8"?>
<process version="9.5.001">
  <operator activated="true" class="web:crawl_web_modern" compatibility="9.0.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="34">
    <parameter key="url" value="https://www.psychologytoday.com/us/therapists"/>
    <list key="crawling_rules">
      <parameter key="follow_link_with_matching_url" value="https://www.psychologytoday.com/us/therapists/.*"/>
      <parameter key="store_with_matching_url" value="https://www.psychologytoday.com/us/therapists/.*"/>
    </list>
    <parameter key="max_crawl_depth" value="52"/>
    <parameter key="retrieve_as_html" value="true"/>
    <parameter key="enable_basic_auth" value="false"/>
    <parameter key="add_content_as_attribute" value="false"/>
    <parameter key="write_pages_to_disk" value="true"/>
    <parameter key="include_binary_content" value="false"/>
    <parameter key="output_dir" value="/Users/ME/Desktop/Web Crawls"/>
    <parameter key="output_file_extension" value="html"/>
    <parameter key="max_pages" value="2500"/>
    <parameter key="max_page_size" value="10000"/>
    <parameter key="delay" value="500"/>
    <parameter key="max_concurrent_connections" value="100"/>
    <parameter key="max_connections_per_host" value="50"/>
    <parameter key="user_agent" value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36"/>
    <parameter key="ignore_robot_exclusion" value="false"/>
  </operator>
</process>
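Once the pages are saved to disk, the .location-address-phone selector mentioned above can be applied to the stored HTML. Below is a minimal standard-library sketch of that extraction step; BeautifulSoup's select() would be more convenient in practice. The sample markup is invented for illustration, as the site's real structure will differ.

```python
from html.parser import HTMLParser

class ContactParser(HTMLParser):
    """Collects the text inside elements with class 'location-address-phone'."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # > 0 while inside a matching element
        self.contacts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or "location-address-phone" in classes:
            self.depth += 1
            if self.depth == 1:
                self.contacts.append("")  # start a new contact record

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.contacts[-1] += data.strip() + " "

# Invented sample markup -- the real pages must be inspected to confirm it.
sample = ('<div class="location-address-phone">'
          '<span>Austin, TX</span><span>(555) 010-0000</span></div>')
parser = ContactParser()
parser.feed(sample)
results = [c.strip() for c in parser.contacts]
print(results)
```

The same parser can be fed each saved .html file in a loop, producing rows ready for a CSV that Excel or Tableau Prep can open.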
Best Answer
Telcontar120 · RapidMiner Certified Analyst, RapidMiner Certified Expert · Member Posts: 1,635 · Unicorn

Unfortunately, the Crawl Web operator doesn't work with https pages (and has several other known problems besides). You can replicate its functionality with Get Pages by preparing a CSV file containing the page links you want to store. Since the page links seem to follow a regular pattern, you can easily create such a list in Excel or even in RapidMiner itself. That should enable you to store the data you want (assuming doing so doesn't violate the site's terms and conditions of use).
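The CSV of links for Get Pages can also be generated with a short script. This sketch assumes a hypothetical ?page=N pagination pattern and a per-state path segment; both must be verified against the actual site before use, and the state list extended to all states.

```python
import csv

BASE = "https://www.psychologytoday.com/us/therapists"

# Illustrative subset -- extend to all state slugs used by the site.
states = ["alabama", "alaska", "arizona"]

rows = []
for state in states:
    # Roughly 50 pages per state, per the question above.
    for page in range(1, 51):
        rows.append([f"{BASE}/{state}?page={page}"])

# Write a one-column CSV that Get Pages can read as its link attribute.
with open("links.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Link"])
    writer.writerows(rows)
```

Read the resulting file into RapidMiner with Read CSV and point Get Pages at the Link column.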