Crawling rules for "store_with_matching_content" without regular expressions
Hi everyone,
I'm using RapidMiner Studio and I have a problem with the "store_with_matching_content" crawling rule in the "Crawl Web" operator. I want to collect the contact information from several websites with different URL structures. Because of these different URL structures and the large number of sites, I use the store_with_matching_content rule to find the contact page of each site and save it. Unfortunately, the crawler saves every page of a website on which it finds the term "contact", even when the term is only the label of a link in the site navigation, instead of saving only the page that holds the contact information.
So my question is: is there a way to limit the matched content to a specific position in the HTML file? That means setting up a rule like "when you find 'contact' between p, h1, h2 or h3 tags, save the page; when you find 'contact' between a tags, don't save the page".
I know how to do it with regular expressions, but the store_with_matching_content rule doesn't accept regular expressions, only a fixed term.
Do you have any idea how to solve this issue? I would be really grateful.
Thank you.
Lukei_11
Best Answer
Telcontar120
I don't think you can do this directly with the Crawl Web operator, but you should be able to do it after the crawl: store the full pages first, then filter them by using either Cut Document or Extract Content inside the subprocess of a Process Documents from Data operator.
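If it helps, the same post-filtering idea can also be sketched outside RapidMiner, for example in a script that runs over the stored HTML files. The following is only a rough Python illustration, not something Crawl Web provides: the names `ContactFinder` and `is_contact_page` are made up for this example, and it assumes the pages are reasonably well-formed HTML. It keeps a page only when the term appears inside a p, h1, h2 or h3 element and not inside a link.

```python
from html.parser import HTMLParser

KEEP_TAGS = {"p", "h1", "h2", "h3"}  # text here counts as a real match
SKIP_TAGS = {"a"}                    # text here is only a link label


class ContactFinder(HTMLParser):
    """Hypothetical helper: tracks which relevant tags are currently open."""

    def __init__(self, term="contact"):
        super().__init__()
        self.term = term.lower()
        self.stack = []      # open tags from KEEP_TAGS or SKIP_TAGS
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag in KEEP_TAGS or tag in SKIP_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag and anything opened after it,
        # which tolerates some sloppy nesting.
        if tag in self.stack:
            idx = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[idx:]

    def handle_data(self, data):
        if self.term not in data.lower():
            return
        inside_keep = any(t in KEEP_TAGS for t in self.stack)
        inside_skip = any(t in SKIP_TAGS for t in self.stack)
        if inside_keep and not inside_skip:
            self.found = True


def is_contact_page(html, term="contact"):
    """Return True if `term` occurs in a p/h1/h2/h3 element outside any link."""
    finder = ContactFinder(term)
    finder.feed(html)
    finder.close()
    return finder.found
```

A page whose only "Contact" is a navigation link would be rejected, while a page with a heading such as "Contact us" would be kept. Pages with unclosed p tags can still produce false positives, so treat this as a starting point.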