Web scraping on a site created in JavaScript
marcelolimabati
Hi experts,
I'm trying to build a scraper for a site created in JavaScript, but I'm not getting anywhere. Which operator can I use to scrape the site?
I was using the webtable operator, but that only works for sites created in plain HTML, correct?
Could you help me please?
Thank you in advance for your help.
Marcelo Batista
Answers
Get Page or Get Pages are the basic web scrapers that work for specific given URLs. Crawl Web is a more advanced version that can actually go through a site and follow links that match a specified pattern.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
@marcelolimabati and @Telcontar120 I've been running into problems where websites completely disallow web scrapers. While I have not implemented this yet, the solution would appear to be web browser automation. There are several non-RapidMiner packages that can do this. The big one is Selenium (Python).
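The basic pattern looks something like the sketch below (assuming Chrome and a matching chromedriver are installed; the URL is just one from this thread):

# Minimal Selenium sketch: load a JavaScript-rendered page and
# grab the HTML after the scripts have run.
from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get("https://www.rapidminer.com/")  # any JavaScript-heavy URL
    html = driver.page_source  # the DOM *after* JavaScript has executed
    print(html[:500])
finally:
    driver.quit()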
Something to think about.
Hello all,
I know Ruby isn't as popular as Python over here (and I can almost feel the crowd chanting "Switch to Python! Switch to Python!"), but it is quite handy when it comes to automating web manipulation. I used it on a daily basis as part of my testing process when I was a Ruby developer, and you can still use it with RapidMiner: just use the Execute Program operator and read its output.
Please find attached a zip file with the code. You need Ruby from https://www.ruby-lang.org/, Google Chrome installed, and the Selenium Chrome Driver from http://chromedriver.chromium.org/downloads to make this work.
Here it is!!!
Once you have Ruby installed, you can uncompress the zip file in your $HOME, run bundle install to install the required libraries, and execute the code with ruby website.rb. If, on the other hand, you want to pass the URL as a parameter, you only need to change line 6 of the script (replace browser.get 'https://www.rapidminer.com/' with browser.get ARGV[0]) and that's it. Beware that with this modification, the script will throw an error if you don't call it with a URL as the last parameter, like the following:
ruby website.rb https://www.datasciencegems.com/
(BTW, I haven't tried this on Windows. Mac is immensely more popular among Rubyists).
@rfuentealba I barely know Python, you want me to learn Ruby now? lol.
Ruby doesn't play nicely with Windows, so I'll default over to Python.
(silently chanting Python / Python / Python ...)
Using Selenium in combination with Python / RapidMiner works really nicely. Attached is an example that I used to get the IP addresses of some forum, as these were not retrievable through the normal API but required a login. Not that this matters, but it sets the scene a bit.
What the script below was doing is open a specific page, enter a username and password, click the login button, look for the element stating a user's name, then open that page and get the IP information from his/her profile. Next, the logic took all of these IP addresses and user codes and exported them nicely in one big table / ExampleSet.
It doesn't work anymore since we closed down the forum itself, but it shows you can do pretty complex stuff. You will have to modify it for your own use, but it may get you started more quickly.
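In outline, the flow looked something like this sketch. The URLs and element locators here are made-up placeholders, not the real ones:

# Sketch of the login-then-scrape flow described above.
# All URLs and element locators are hypothetical placeholders.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    # 1. Open the login page and sign in
    driver.get("https://forum.example.com/login")
    driver.find_element(By.ID, "username").send_keys("my_user")
    driver.find_element(By.ID, "password").send_keys("my_password")
    driver.find_element(By.ID, "login-button").click()

    # 2. Visit each user's profile page and read the IP field
    rows = []
    for user_code in ["user1", "user2"]:
        driver.get("https://forum.example.com/profile/" + user_code)
        ip = driver.find_element(By.CLASS_NAME, "ip-address").text
        rows.append({"user": user_code, "ip": ip})
finally:
    driver.quit()

# 3. Export everything as one big table (an ExampleSet on the RapidMiner side)
print(pd.DataFrame(rows))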
The trick for non-Python gurus is to use Selenium first with your browser. Install the Katalon Automation Recorder (great plugin, should be standard in everybody's scraping toolkit), let it record, and copy-paste the generated Python code. You may need to tune the code a bit, but it should get you started.
Hello Sensei @Thomas_Ott,
I just wrote that Ruby code to test whether Selenium does what @marcelolimabati wants (it does!), and thought it was better to share it than to keep it for myself. You will still need to install Google Chrome and the Chrome Driver, which was my main concern, but it wasn't difficult at all, at least on my Mac.
And no, Ruby doesn't play nice with Windows, but that's mostly true for server applications (e.g. Rails, Puma, Sinatra). The one I attached shouldn't be much of a problem, though.
(BTW, I went back to sleep before pressing the "Post" button and woke up to @kayman chanting "Python! Python! Python!", and was about to say it would be nice if someone else could post Python code that does the same. Thanks, mate!)
@kayman man, you just saved me hours of trying to figure this stuff out. Are you going to Wisdom? If so, I'm buying you a beer.
Naah, I wish... My budget is too small for this.
But if you're ever in the neighbourhood, I'll remind you of the offer...
Selenium is a good choice for PHP- or cookie-heavy websites, but it is very slow.
Regarding web scraping limitations, they can be partially addressed by adding pauses and setting the user agent correctly. But if a site really doesn't want any scraper, i.e. it has a robots.txt that disallows it, even scraping with Selenium would be against the site's terms.
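For the pauses and user-agent part, the pattern is something like the sketch below, using requests and the standard library's robotparser. The URL and user-agent string are just illustrative:

# Sketch: polite scraping with a custom user agent, a robots.txt check,
# and a pause between requests. URLs and the UA string are illustrative.
import time
import requests
from urllib import robotparser

rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()

user_agent = "MyScraper/1.0"
headers = {"User-Agent": user_agent}

for path in ["/page1", "/page2"]:
    url = "https://www.example.com" + path
    if not rp.can_fetch(user_agent, url):
        print("robots.txt disallows", url, "- skipping")
        continue
    response = requests.get(url, headers=headers)
    print(url, response.status_code)
    time.sleep(2)  # pause so you don't hammer the site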
OT: I am also not a Ruby fan. It is very powerful, but I value code clarity above everything else.
Hello @SGolbert,
Remember that the site was created in JavaScript, so the only ways to scrape it are using a browser, a headless browser, or a JavaScript parser. You might be able to render it with other solutions such as PhantomJS, but the amount of work to do just that is sometimes unfeasible.
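For what it's worth, Chrome itself can run headless nowadays, so you don't even need PhantomJS. A minimal sketch with the same Selenium setup as above (again assuming Chrome and the Chrome Driver are installed):

# Sketch: render a JavaScript site with headless Chrome via Selenium,
# so no visible browser window is needed.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.rapidminer.com/")
    print(driver.page_source[:500])  # HTML after JavaScript has executed
finally:
    driver.quit()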
OT: I began telling everyone that I used plain TextMate so that they couldn't drag me into their vim vs. emacs flame wars. Rest assured I won't bring a knife with gems to a snake-firing gunfight. Mine was just a partial solution, and with Python one has to install Selenium and the Chrome Driver too, so it's not that different.
@rfuentealba: "...that they couldn't include me in their vim vs emacs flamewars"
LOL, I didn't know people still used emacs. I used it on a VT-100 dumb terminal in 1989... wow, I'm old!
Scott
Hi Marcelo, I'd have a look at a web scraping library that actually scrapes using JavaScript, like https://github.com/apifytech/apify-js from Apify.
Until now, JavaScript hasn't had a library comparable to Scrapy for Python. It simplifies deep crawls of complex JavaScript sites using lists of 100k URLs, among many other things.
Here are the docs: https://www.apify.com/docs/sdk/apify-runtime-js/latest
Cheers