Crawl Web - empty results (PHP script)
Hello there!
I'm a social scientist learning to use RapidMiner for data/text mining and text analysis.
I've been trying to apply "Crawl Web" to the following address http://www.scielo.br/scielo.php?script=sci_issuetoc&pid=0102-690920180001&lng=pt&nrm=iso with no crawling rules applied and a depth of 1, but I keep getting empty results.
I wonder if this is caused by the target page's PHP script. If so, does anyone know a workaround for this issue?
Also, are there any hints on setting the crawling rules so that I get only the links with a specific link text? For example, in the URL above, I'm mostly interested in the pages linked with the text "Texto em Português".
Greetings from Brazil,
Maiko Spiess
Best Answer
sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
So it seems that there is a bot block on that site. If you uncheck "ignore robot exclusion", you get results (I crawled only two pages, just to test). Ethically, though, I cannot tell you to do this unless you own the site OR have explicit permission from the owner to crawl his/her site.
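For anyone who wants to see how this kind of block works, here is a minimal Python sketch (outside RapidMiner) using the standard library's urllib.robotparser. The robots.txt content and the "my-crawler" user agent below are made-up placeholders, not the actual rules of that site; the sketch only shows how a Disallow rule makes a well-behaved crawler skip a URL and return nothing.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the robots.txt content as a list of lines
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt for illustration only (NOT the real file of any site)
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /scielo.php
"""

if __name__ == "__main__":
    # Hypothetical URLs on a placeholder host
    blocked = "http://example.org/scielo.php?script=sci_issuetoc"
    open_page = "http://example.org/index.html"
    print(blocked, is_allowed(SAMPLE_ROBOTS, "my-crawler", blocked))
    print(open_page, is_allowed(SAMPLE_ROBOTS, "my-crawler", open_page))
```

To check a live site you would instead call set_url() with the address of the site's robots.txt and then read() before calling can_fetch(). A crawler that obeys the exclusion rules simply skips every disallowed URL, which looks like an empty result.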
Scott
Answers
Hello @mspiess - welcome to the community. Have you tried looking at other threads in the community? A quick search turned up a thread that may be useful: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Crawl-Web-with-follow-link-with-matching-url-returning-empty/m-p/38561#M26480
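On the link-text part of the question, and separately from the Crawl Web rules, here is a rough Python sketch of the general idea: parse a page, keep only the links whose visible text contains a given phrase, and resolve them to absolute URLs. It uses only the standard library. The class name, the helper function, the sample HTML and the example.org URLs are all hypothetical, and it is only meant for pages you are allowed to fetch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTextFilter(HTMLParser):
    """Collect hrefs of <a> tags whose visible text contains wanted_text."""

    def __init__(self, base_url: str, wanted_text: str):
        super().__init__()
        self.base_url = base_url
        self.wanted_text = wanted_text.casefold()
        self.matches = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text pieces seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace before comparing
            text = " ".join("".join(self._text).split())
            if self.wanted_text in text.casefold():
                self.matches.append(urljoin(self.base_url, self._href))
            self._href = None


def links_with_text(html: str, base_url: str, wanted_text: str) -> list:
    """Return absolute URLs of links whose text contains wanted_text."""
    parser = LinkTextFilter(base_url, wanted_text)
    parser.feed(html)
    parser.close()
    return parser.matches


# Hypothetical page fragment for illustration only
SAMPLE_HTML = """
<a href="/artigo-1.html">Texto em Português</a>
<a href="/resumo-1.html">Resumo</a>
<a href="/artigo-2.html"> Texto em
   Português </a>
"""

if __name__ == "__main__":
    for link in links_with_text(SAMPLE_HTML, "http://example.org/", "Texto em Português"):
        print(link)
```

Matching is case-insensitive and ignores extra whitespace inside the link text, since real pages often wrap link text across lines.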
Scott
Hi @sgenzer! Thanks for replying.
I had checked the thread you mentioned before posting my own, but I kept getting empty results. I figured that if I ran the operator without any rules, it should return all the pages within the specified depth. I then tried this with a different URL and it worked okay. However, on this particular page I am still getting empty results.
So, crawl rules aside, I'm still wondering if this is something related to the page's PHP script. Any thoughts?
Greetings,
Maiko
Okay! Got it!
Thank you for your attention.