"embedded crawler (websphinx) and RegEx"

tsschmidt · November 2008

(How) can I use RegEx within that crawler? It did not work...

I tried this several times as follows (see also attachment):
visit_content: ^water$
or
visit_content: \<water\>
or
visit_content: (?s)\<water\>
...

(I don't want waterfall...)

Please don't suggest HTTRACK. As far as I know, HTTRACK cannot filter pages by their content, only by URL.

[attachment deleted by admin]

land · November 2008

Hi,
the crawler does not support regular expressions. Only the following condition types are supported for specifying which links to follow:
follow_url: A link is only followed if the target URL contains all terms stated in this parameter.
link_text: A link is only followed if the link text contains all terms stated in this parameter.

The conditions that determine whether a page is stored allow the following expressions:
visit_url: A page is only stored if its URL contains all terms stated in this parameter.
visit_content: A page is only stored if its content contains all terms stated in this parameter.
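
Since the conditions are plain term checks, one possible workaround is to let the crawler store every page containing "water" and then filter the stored pages with a regular expression afterwards. The following is only a minimal sketch in Python, not the crawler's actual code: the function names and the sample pages are made up for illustration, and it assumes that "contains all terms" behaves like a simple case-sensitive substring test, which may not match the real implementation. Note that the word boundary is written \b in Python (and Java) regular expressions, not \< and \>, and that ^water$ would only match a text (or line) consisting of nothing but "water".

```python
import re

def contains_all_terms(text, terms):
    """Assumed crawler behaviour: every term occurs somewhere in the text."""
    return all(term in text for term in terms)

# Whole-word pattern: \b marks a word boundary, so "waterfall" does not match.
WATER_WORD = re.compile(r"\bwater\b")

def contains_whole_word(text):
    """True only if 'water' occurs as a standalone word."""
    return WATER_WORD.search(text) is not None

# Hypothetical page contents standing in for what the crawler stored.
pages = {
    "a.html": "The waterfall is 30 m high.",
    "b.html": "Drinking water is scarce here.",
}

# Step 1: the term condition keeps both pages (substring match).
stored = [name for name, text in pages.items()
          if contains_all_terms(text, ["water"])]

# Step 2: the regex post-filter drops the page that only mentions "waterfall".
filtered = [name for name in stored if contains_whole_word(pages[name])]
```

With these sample pages, the term check alone keeps both files, while the post-filter keeps only b.html.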

Further information can be found at http://nemoz.org/joomla/content/view/64/53/lang,de/

Greetings,
Sebastian
