RapidMiner difficulty accessing long URLs to extract web data
leptserkhan
Member Posts: 7 Contributor II
I'm new to RapidMiner and hope someone can assist. My starting URL for web crawling is:
http://www.domain.com/search?attrs=&;cflt=restaurants&find_desc=&find_loc=bigcity&rpp=40&start=100
The regex is:
follow_link_with_matching_url = .*dig.*
I have also tried follow_link_with_matching_url = .*+dig+.* and .+dig.+
I have tried each of these with all four rule applications, and nothing is returned.
The link I want to access looks like this when copied and pasted as text:
http://www.domain.com/dig/hoopla
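For reference, here is a minimal Python sketch that checks two of the patterns above against the target link outside of RapidMiner. It rests on assumptions not confirmed in this post: that the crawler resolves the relative href against the page URL, and that the rule must match the whole URL (hence `re.fullmatch`). The names `full_match` and `checks` are illustrative only.

```python
import re
from urllib.parse import urljoin

# Starting URL and relative href, copied from the post.
START_URL = ("http://www.domain.com/search?attrs=&;cflt=restaurants"
             "&find_desc=&find_loc=bigcity&rpp=40&start=100")
HREF = "/dig/hoopla"

# Resolve the relative href against the page URL, as a browser would.
absolute = urljoin(START_URL, HREF)


def full_match(pattern, text):
    """Return True when the pattern matches the entire string (assumed rule semantics)."""
    return re.fullmatch(pattern, text) is not None


# Check each pattern against both the resolved and the relative form of the link.
checks = {}
for pattern in (r".*dig.*", r".+dig.+"):
    checks[pattern] = {
        "absolute": full_match(pattern, absolute),
        "relative": full_match(pattern, HREF),
    }

print(absolute)
for pattern, result in checks.items():
    print(pattern, result)
```

Under these assumptions both patterns match the link in either form, which suggests the pattern itself may not be what blocks the crawl. The third attempt, `.*+dig+.*`, is left out of the sketch: in Java-style regex `*+` is a possessive quantifier, so the leading `.*+` consumes the rest of the line without backtracking and that pattern cannot match.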
I know that the text "dig" is contained in the links I wish to access. Below I have pasted the link's HTML and its XPath, both copied from the browser's Inspect Element (I would prefer to use XPath, but don't know how to use it with RapidMiner to select web data):
HTML:
<a id="digLinkabc" href="/dig/hoopla">abc. Dig</a>
XPath:
//*[@id="digTLinkabc"]
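As a rough illustration of selecting the link by XPath rather than by URL pattern, the sketch below runs an id lookup over the pasted HTML using Python's standard library. This is not RapidMiner's mechanism; it only shows what the expression selects. The wrapping `<div>` and the helper `href_for_id` are hypothetical, and `xml.etree.ElementTree` supports only a subset of XPath and needs well-formed markup, so a real page would need an HTML parser. Note that the id in the pasted XPath (`digTLinkabc`) differs from the id in the pasted HTML (`digLinkabc`).

```python
import xml.etree.ElementTree as ET

# The anchor from the post, wrapped in a hypothetical <div> so the
# descendant search (.//) has a parent element to start from.
SNIPPET = '<div><a id="digLinkabc" href="/dig/hoopla">abc. Dig</a></div>'
root = ET.fromstring(SNIPPET)


def href_for_id(tree, element_id):
    """Return the href of the first descendant with the given id, or None."""
    node = tree.find(f".//*[@id='{element_id}']")
    return None if node is None else node.get("href")


found = href_for_id(root, "digLinkabc")     # id as written in the HTML
missing = href_for_id(root, "digTLinkabc")  # id as written in the XPath

print(found)
print(missing)
```

With the id from the HTML the lookup returns the href; with the id from the XPath line it finds nothing, so the two pasted values are worth re-checking against the live page.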