Using regex for exclusion of facebook share url in web crawler operator
nikohollfelder
Member Posts: 1 Learner II
Hi,
I want to crawl the site:
https://www.bmwgroup.com/en.html
and the sites of some other companies.
Therefore I used the regex .+bmwgroup.+en.+
I use "en" because I only want to crawl the English-language pages, and intentionally not "/en" because some URLs contain the "en" without a preceding /.
The problem is that the crawler also follows all the social media share links. The crawl then takes forever, because the share links of Facebook and co. match the regex too.
How can I exclude Facebook, LinkedIn, Twitter and co.?
I tried something like .+(?!facebook)bmwgroup.+en.+ but without success.
Do you have any ideas? I should add that I can't use a regex like https\:\/\/www\.bmwgroup.+en.+ to avoid crawling pages that do not start with https://www.bmwgroup, because other links on this site use plain http or begin with http://w3.bmwgroup, and those pages would then be ignored. I want to crawl all links except the social media links.
Could you please help?
Answers
You probably do need to pin down the start of the URL, so try something like
https?:\/\/(www|w3)\.bmwgroup.+en.+
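This allows both http and https, and both the www and w3 hosts, followed by bmwgroup. Other domains are then no longer crawled, while the ones of interest still are. The share links start with the social network's own host and only contain the bmwgroup address further along, so they should no longer match.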
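To check this before starting a long crawl, the patterns can be tried against a few URLs outside the crawler. Below is a minimal Python sketch, not the crawler's own code. It assumes the crawler matches the rule against the whole URL, which is why it uses fullmatch. The w3 page and the share URLs are made-up examples, and the third pattern is only a possible alternative that keeps the original loose match and rejects social media hosts with a negative lookahead placed at the start.

```python
import re

# Pattern from the answer: http or https, host www or w3, then bmwgroup, then "en".
ANSWER_PATTERN = re.compile(r"https?://(www|w3)\.bmwgroup.+en.+")

# Attempt from the question. The lookahead is only tested at the position
# directly before "bmwgroup", where the next characters are never "facebook",
# so it excludes nothing.
ORIGINAL_ATTEMPT = re.compile(r".+(?!facebook)bmwgroup.+en.+")

# Possible alternative (an assumption, not from the thread): test the lookahead
# at the start of the URL and reject any host containing a social media name.
HOST_LOOKAHEAD = re.compile(
    r"^(?!https?://[^/]*(?:facebook|linkedin|twitter)).+bmwgroup.+en.+"
)

# Illustrative URLs; only the first one appears in the question.
SAMPLE_URLS = [
    "https://www.bmwgroup.com/en.html",
    "http://w3.bmwgroup.com/en/company.html",
    "https://www.facebook.com/sharer/sharer.php?u=https://www.bmwgroup.com/en.html",
    "https://twitter.com/intent/tweet?url=https://www.bmwgroup.com/en.html",
]


def matches(pattern, url):
    """Return True if the pattern matches the entire URL."""
    return pattern.fullmatch(url) is not None


if __name__ == "__main__":
    for url in SAMPLE_URLS:
        print(url)
        print("  answer pattern  :", matches(ANSWER_PATTERN, url))
        print("  original attempt:", matches(ORIGINAL_ATTEMPT, url))
        print("  host lookahead  :", matches(HOST_LOOKAHEAD, url))
```

The lookahead variant may be useful when the list of allowed hosts is not known in advance, but then every social network to skip has to be listed by name.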