Crawl Web - empty results (PHP script)

mspiess · 2018 21

Hello there!

I'm a social scientist learning to use RapidMiner for data/text mining and text analysis.

I've been trying to apply "Crawl Web" for the following address http://www.scielo.br/scielo.php?script=sci_issuetoc&pid=0102-690920180001&lng=pt&nrm=iso with no crawling rules applied and depth of 1, but I keep getting empty results.

I wonder if this is caused by the target page's PHP script. If so, does anyone know a workaround for this issue?

Also, any hints on setting the crawling rules so that I get only the links with a specific link text? For example, at the URL above, I'm mostly interested in the pages linked with the text "Texto em Português".

Greetings from Brazil,

Maiko Spiess
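
For the link-text part of the question, one option outside RapidMiner is to filter the anchors of an already downloaded page by their visible text. The following is only a minimal sketch using Python's standard library; the names `LinkTextFilter` and `links_with_text` and the sample markup are made up for illustration and are not taken from the real page or from the Crawl Web operator.

```python
from html.parser import HTMLParser


class LinkTextFilter(HTMLParser):
    """Collect href values of <a> tags whose visible text contains a phrase."""

    def __init__(self, phrase):
        super().__init__()
        self.phrase = phrase.lower()
        self.matches = []   # hrefs whose link text matched
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse whitespace so line breaks inside the link do not matter.
            link_text = " ".join("".join(self._text).split()).lower()
            if self.phrase in link_text:
                self.matches.append(self._href)
            self._href = None


def links_with_text(html, phrase):
    """Return the hrefs of all links whose text contains phrase (case-insensitive)."""
    parser = LinkTextFilter(phrase)
    parser.feed(html)
    parser.close()
    return parser.matches


# Made-up sample markup, not copied from the real page.
sample = """
<a href="/article1?lng=pt">Texto em Português</a>
<a href="/article1?lng=en">Text in English</a>
<a href="/article2?lng=pt"><b>Texto</b> em Português</a>
"""
print(links_with_text(sample, "Texto em Português"))
```

Matching on the link text rather than on the URL is useful when the URLs of the wanted and unwanted pages look alike and only the label differs.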

sgenzer · 2018 22

So it seems that there is a bot block on that site. If you uncheck "ignore robot exclusion", you get results (I crawled only two pages, just to test). So ethically I cannot tell you to do this unless you own the site OR have explicit permission from the owner to crawl his/her site.

Scott
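
To see whether a site's robot exclusion rules cover a given URL before crawling it, the rules can be checked directly. This is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `allowed_by_robots`, the user agent string, the example.org URLs, and the rules text are all hypothetical and do not reflect the actual robots.txt of the site discussed here.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only to illustrate the check.
example_rules = """
User-agent: *
Disallow: /scielo.php
"""

blocked_url = "http://www.example.org/scielo.php?script=sci_issuetoc"
open_url = "http://www.example.org/about.html"

print(allowed_by_robots(example_rules, "my-crawler", blocked_url))  # False
print(allowed_by_robots(example_rules, "my-crawler", open_url))     # True
```

Against a live site, the same parser can load the rules itself with `set_url(...)` followed by `read()` instead of `parse(...)`.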

sgenzer · 2018 21

Hello @mspiess - welcome to the community. Have you tried looking at other threads in the community? A quick search revealed a thread that may be useful: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Crawl-Web-with-follow-link-with-matching-url-returning-empty/m-p/38561#M26480

Scott

mspiess · 2018 22

Hi @sgenzer! Thanks for replying.

I had checked the thread you mentioned before posting my own, but I kept getting empty results. I figured that if I ran the operator without any rules, it should return all the pages within the specified depth. I then tried this with a different URL and it worked okay. However, on this particular page I am still getting empty results.

So, crawl rules aside, I'm still wondering if this is something related to the page's PHP script. Any thoughts?

Greetings,

Maiko

mspiess · 2018 22

Okay! Got it!

Thank you for your attention.

Altair RapidMiner Community