Including a filter in the Get Page operator
Dear experts,
I have an issue with retrieving the info I need from a web page. My issue might go beyond RapidMiner, but I hope there'll still be some useful input :-)
I am trying to retrieve all documents from a search engine of a standardisation organisation, and in particular to retrieve some information regarding these documents, that isn't displayed by default in the search result. This is the page:
https://eur-lex.europa.eu/search.html?qid=1538673501151&scope=EURLEX&type=quick&lang=en&FM_CODED=REG
On the page there is an option to modify the information displayed by clicking on "Change displayed metadata" and selecting the desired fields. However, when I apply the filter, I do see the info I wanted, but nothing changes in the URL, and consequently the content I get from the Get Page operator stays the same.
Any idea how to solve this? I thought that using the query parameters of the Get Page operator could be useful, but I didn't manage to find any examples of what these parameters do and how they can be used.
Any input would be much appreciated! Many thanks in advance!
Cheers,
Snežana
Answers
hi @s_nektarijevic so that's a good question. From what I know, the short answer is no, you cannot do what you're asking with the "query parameters" feature the way you want. The "Change displayed metadata" selections create a JS query that goes back to their server via https://eur-lex.europa.eu/change-displayed-metadata.html, makes a new list (and gives it a new qid), and sends it back to you. If you look at the network traffic when you make this selection, you will see the request that is sent.
So this is a POST request with a form. In order to do this yourself in RapidMiner you would need to make the same query (with your sessionID etc). It would be a lot of work.
If it were me I would do it in RapidMiner because (a) I am a terrible coder, and (b) I really enjoy puzzles like this. But most likely you just want to get it done. I am sure there is a Python library somewhere that will do something like this. If you know Python, I'd poke around GitHub and see what you find. Otherwise perhaps some of my coder friends will have a suggestion.
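For illustration, here is a minimal Python sketch of what such a form POST could look like, using only the standard library. The endpoint and the `_metadataSelected[SP_DISPLAY]` field come from this thread; the cookie string is a made-up placeholder, and the real request most likely needs more fields (and your own session cookie) copied from the browser. The sketch only builds the request and does not send anything.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint mentioned in this thread.
ENDPOINT = "https://eur-lex.europa.eu/change-displayed-metadata.html"


def build_metadata_request(form_fields, session_cookie=None):
    """Build (but do not send) a form-encoded POST request."""
    # urlencode percent-encodes the square brackets in the field names.
    body = urlencode(form_fields).encode("ascii")
    request = Request(ENDPOINT, data=body, method="POST")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    if session_cookie:
        # Assumption: the server identifies your search via a session cookie.
        request.add_header("Cookie", session_cookie)
    return request


request = build_metadata_request(
    {"_metadataSelected[SP_DISPLAY]": "on"},
    session_cookie="session=PLACEHOLDER",  # hypothetical cookie, not a real one
)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

To actually send it, you would pass the request to `urllib.request.urlopen` and read the response, after checking that the site permits automated access.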
Scott
Hi @s_nektarijevic,
I would refrain from using the Get Page operator for heavy JS things and/or processing POST. In my experience, your best bet would be to use Selenium (yes, it's slow and many more things, but it is indeed useful). A few months ago some of us had a similar discussion that might help you. You can find it here.
Basically, Selenium can drive a headless browser, meaning that you get all the benefits of a browser except for the visual representation. You have to interact with the browser programmatically. I use it with Ruby, but (oh no, the crowd again!) you can find tutorials to use it with Python. If you use the Anaconda Python distribution and the Python Extension for RapidMiner, you can program it directly.
Hope this helps,
Dear Scott,
Many thanks for the response, this is very useful! I'll try something with the Python extension in RapidMiner, fingers crossed :-)
Cheers,
Snežana
Dear Rodrigo,
Thanks a lot for the inputs! I'll try Python Extension and see how it goes :-)
Cheers,
Snežana
One way to tackle this is as follows:
-> Use Firefox and load your page. You can do the same with other browsers but FF is a bit easier
-> use Ctrl+Shift+E to open the network inspector
-> select HTML in the network menu / pane to avoid too much clutter showing up, and click on the trashcan to remove everything stored
-> click on "Change displayed metadata" and apply your settings
-> click OK and you will see a POST request appear in the network pane, going to change-displayed-metadata.html
-> right click this entry and select -> copy -> copy post data and save this somewhere for now (like a text file)
-> next use the Get Page operator (I agree there are better ways using Python but this one works also)
-> set the URL to that of the POST request (right click -> copy -> copy url)
-> set request method to POST
-> enable follow redirects
-> in the 'query parameters' add the details you got from your post data above.
So if you have this in your file: _metadataSelected[SP_DISPLAY]=on set it as follows:
query key : _metadataSelected[SP_DISPLAY]
query value : on
You may not need all of these, as some of them may be default values, so experiment by trial and error. Worst case you may need to include them all, but it's a one-time effort.
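If the copied post data is long, splitting it into key/value pairs by hand gets tedious. Below is a small Python sketch that does the split, assuming the saved text contains `key=value` pairs separated by `&` or by line breaks (the exact format may differ per browser). Only `_metadataSelected[SP_DISPLAY]=on` comes from this thread; the other field names in the sample are invented for illustration.

```python
from urllib.parse import parse_qsl


def post_data_to_pairs(raw_post_data):
    """Turn copied POST data into a list of (query key, query value) pairs."""
    # Treat line breaks like '&' so both copy formats are handled.
    lines = [line.strip() for line in raw_post_data.splitlines() if line.strip()]
    normalised = "&".join(lines)
    # keep_blank_values keeps fields that were sent with an empty value.
    return parse_qsl(normalised, keep_blank_values=True)


# Sample input: the first field is from this thread, the rest are hypothetical.
sample = "_metadataSelected[SP_DISPLAY]=on\nexampleField=two%20words&emptyField="
pairs = post_data_to_pairs(sample)
for key, value in pairs:
    print("query key   :", key)
    print("query value :", value)
```

Each printed pair corresponds to one row in the 'query parameters' list of the Get Page operator.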
Good luck!
Hi Snežana,
The Scrapy framework can also do the trick. Don't forget to check the robots file:
https://eur-lex.europa.eu/robots.txt
In particular these fields can be problematic:
As far as I understood, what you are trying to do is illegal, but I know very little about crawling restrictions.
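A robots file can also be checked programmatically with Python's standard `urllib.robotparser` module. The sketch below parses a made-up set of rules on a placeholder domain, so it says nothing about what the EUR-Lex file actually allows; the user agent name `my-crawler` is also hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only, NOT the contents of the EUR-Lex file.
SAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]


def is_allowed(rules, user_agent, url):
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(user_agent, url)


blocked = is_allowed(SAMPLE_RULES, "my-crawler", "https://example.com/private/page.html")
allowed = is_allowed(SAMPLE_RULES, "my-crawler", "https://example.com/public/page.html")
print("private page allowed:", blocked)
print("public page allowed:", allowed)
```

For a live check you would instead call `set_url` with the robots.txt address and then `read()` on the parser before calling `can_fetch`. Note that robots.txt only covers crawler etiquette; the site's terms of use are a separate matter.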
Regards,
Sebastian