The Problem with RM Crawling Rules - how they are explained
leptserkhan
Member Posts: 7 Contributor II
I think a big reason people are not getting the results they expect from the RapidMiner crawling rules is that the documentation of how the four rules work is minimal at best. Here are the explanations provided:
* store_with_matching_url: If the regular expression matches the url, this page will be stored in the resulting ExampleSet.
* store_with_matching_content: If the regular expression matches the page content, this page will be stored in the resulting ExampleSet.
* follow_link_with_matching_url: If the regular expression matches the url, the crawler will follow the link and load the url.
* follow_link_with_matching_text: If the regular expression matches the text of the hyperlink, the crawler will follow the link and load the according url.
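To make the question concrete, here is a minimal Python sketch of how I would read the four rules if each one were simply an independent regular-expression test. This is an assumption for illustration only, not RapidMiner's actual implementation: the function names `should_store` and `should_follow`, the `(rule_type, regex)` tuple layout, the choice of a full match on URLs versus a substring search on content and link text, and the `example.com` URLs are all my own guesses.

```python
import re


def should_store(rules, page_url, page_content):
    """Assumed reading: a loaded page is stored if ANY store rule matches it."""
    for rule_type, pattern in rules:
        if rule_type == "store_with_matching_url" and re.fullmatch(pattern, page_url):
            return True
        if rule_type == "store_with_matching_content" and re.search(pattern, page_content):
            return True
    return False


def should_follow(rules, link_url, link_text):
    """Assumed reading: a link is followed if ANY follow rule matches it."""
    for rule_type, pattern in rules:
        if rule_type == "follow_link_with_matching_url" and re.fullmatch(pattern, link_url):
            return True
        if rule_type == "follow_link_with_matching_text" and re.search(pattern, link_text):
            return True
    return False


# Hypothetical rule set and URLs, purely for illustration.
rules = [
    ("follow_link_with_matching_url", r"https?://example\.com/.*"),
    ("store_with_matching_url", r".*/docs/.*"),
]
print(should_follow(rules, "http://example.com/docs/a.html", "Docs"))
print(should_store(rules, "http://example.com/docs/a.html", "<html>...</html>"))
```

Under this reading the rules are combined with a plain OR, so their order in the list would make no difference. Whether the real crawler behaves this way is exactly what the documentation does not say.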
There is absolutely no explanation of the importance of *precedence*, if that even applies. If it doesn't apply, the documentation should say so, so that people don't spend time reordering the rules to see which order of precedence might work.
"follow_link_with_matching_text ... follow the link and load the according url." Besides the fact that "load the according url" is ungrammatical and only serves to confuse a native English speaker, what does it mean? Does the crawler load the page at the URL that contains the link being followed, or the page that is reached after following the link with the matching text?
You can see how just two poorly explained rules, combined in every possible permutation with the other rules, can lead to mayhem. And judging by the requests for help, that is what is happening.
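The ambiguity in follow_link_with_matching_text can be spelled out in a short Python sketch of the two readings. Everything here is hypothetical: the `LinkCollector` class, the `urls_to_load` function, its `reading` parameter, and the `example.com` URLs are names I made up for illustration, and I am not claiming that either branch is what RapidMiner does.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects (href, anchor_text) pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def urls_to_load(source_url, html, text_pattern, reading):
    """Return the URLs loaded under one of the two readings.

    reading="target": load the page the matching link points to.
    reading="source": load the page that contains the matching link.
    """
    collector = LinkCollector()
    collector.feed(html)
    loaded = []
    for href, text in collector.links:
        if re.search(text_pattern, text):
            if reading == "target":
                loaded.append(urljoin(source_url, href))
            else:
                loaded.append(source_url)
    return loaded


# Hypothetical page with two links; only the first has matching text.
page = '<p><a href="/page2.html">Next page</a> <a href="/about.html">About</a></p>'
print(urls_to_load("http://example.com/index.html", page, r"Next", "target"))
print(urls_to_load("http://example.com/index.html", page, r"Next", "source"))
```

The same rule and the same page give two different results depending on the reading, which is why a single clarifying sentence in the documentation would help.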
Please *clarify* how the rules work, and provide an easy-to-find link in the main dashboard to a page that explains them with examples.
Otherwise a great product.
Thank you.