Using regex for exclusion of facebook share url in web crawler operator

nikohollfelder · June 2017

Hi,
I want to crawl the site:

https://www.bmwgroup.com/en.html
and the sites of some other companies.

Therefore I used the regex .+bmwgroup.+en.+
I use "en" because I only want to crawl the pages in English, and intentionally not "/en" because some pages include the "en" without a "/".
The problem is that the crawler follows all the social media share links, too. As a result the crawl takes practically forever, because the share links of Facebook and co. match the regex as well.
How can I exclude Facebook, LinkedIn, Twitter and co.?
I tried something like .+(?!facebook)bmwgroup.+en.+ but without success.
Do you have any ideas? I should add that I can't use a regex like https\:\/\/www\.bmwgroup.+en.+ to avoid crawling pages that don't start with https://www.bmwgroup, because other links on this site use plain http or begin with http://w3.bmwgroup, so those pages would be ignored. I want to crawl all of those links, just not the social media links.
Could you please help?

kayman · June 2017

You probably do need to get the start of the pattern right, so try something like

https?:\/\/(www|w3)\.bmwgroup.+en.+

This lets you crawl both http and https, for www and w3, followed by bmwgroup. Because the pattern pins down the start of the URL, other domains are not crawled this way, while the ones of interest are.
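
If you want to check the patterns outside the crawler first, here is a minimal Python sketch. It assumes the crawler rule has to match the whole URL (hence `fullmatch`), and the sample URLs and the `should_follow` helper are made up for illustration, not taken from the real site or from RapidMiner. It also shows why the `(?!facebook)` attempt does not filter anything: the lookahead is tested right in front of "bmwgroup", where the next characters are never "facebook". The last pattern is a possible alternative, not something from this thread: a negative lookahead placed at the very start, which rejects any URL that mentions one of the listed networks anywhere.

```python
import re

# Pattern from the answer above. "/" needs no escaping in Python,
# but the escapes are harmless.
ANCHORED = re.compile(r"https?:\/\/(www|w3)\.bmwgroup.+en.+")

# Attempt from the question, kept for comparison.
LOOKAHEAD_ATTEMPT = re.compile(r".+(?!facebook)bmwgroup.+en.+")

# Assumed alternative: the lookahead sits at the start and scans the whole URL.
EXCLUDE_SOCIAL = re.compile(r"(?!.*(?:facebook|twitter|linkedin)).+bmwgroup.+en.+")

# Hypothetical sample URLs, for illustration only.
urls = [
    "https://www.bmwgroup.com/en.html",
    "http://w3.bmwgroup.com/en/company.html",
    "https://www.facebook.com/sharer/sharer.php?u=https://www.bmwgroup.com/en.html",
    "https://twitter.com/intent/tweet?url=https://www.bmwgroup.com/en.html",
]


def should_follow(url, pattern=ANCHORED):
    """Return True if the whole URL matches the given pattern."""
    return pattern.fullmatch(url) is not None


if __name__ == "__main__":
    for url in urls:
        print(
            should_follow(url, ANCHORED),
            should_follow(url, LOOKAHEAD_ATTEMPT),
            should_follow(url, EXCLUDE_SOCIAL),
            url,
        )
```

The anchored pattern and the leading-lookahead pattern both keep the two bmwgroup URLs and drop the two share links, while the original attempt still accepts all four. The crawler uses Java regular expressions; these particular constructs behave the same way in Python, but it is worth re-testing the final pattern inside the operator.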

Altair RapidMiner Community
