Including a filter in the Get Page operator

s_nektarijevic · October 2018

Dear experts,

I have an issue with retrieving the info I need from a web page. My issue might go beyond RapidMiner, but I hope there'll still be some useful input :-)

I am trying to retrieve all documents from the search engine of a standardisation organisation, and in particular some information about these documents that isn't displayed by default in the search results. This is the page:

https://eur-lex.europa.eu/search.html?qid=1538673501151&scope=EURLEX&type=quick&lang=en&FM_CODED=REG

On the page there is an option to modify the information displayed by clicking on "Change displayed metadata" and selecting the desired fields. However, when I apply the filter, I do see the info I wanted, but nothing changes in the URL, so the content I get out of the Get Page operator stays the same.

Any idea how to solve this? I thought that using the query parameters of the Get Page operator could be useful, but I didn't manage to find any examples of what these parameters do and how they can be used.

Any input would be much appreciated! Many thanks in advance!

Cheers,

Snežana

sgenzer · October 2018

Hi @s_nektarijevic, that's a good question. From what I know, the short answer is no: you cannot do what you're asking with the "query parameters" feature the way you want. The "Change displayed metadata" selections create a JS request that goes back to their server via https://eur-lex.europa.eu/change-displayed-metadata.html, which makes a new list (and gives it a new qid) and sends it back to you. If you look at the network traffic when you make this selection, you will see something like this:

Request URL: https://eur-lex.europa.eu/change-displayed-metadata.html
Request Method: POST
Status Code: 302 Moved Temporarily
Remote Address: 147.67.210.44:443
Referrer Policy: no-referrer-when-downgrade
Connection: Keep-Alive
Content-Language: en
Date: Thu, 04 Oct 2018 18:33:53 GMT
Location: https://eur-lex.europa.eu/search.html?qid=1538673501151&FM_CODED=REG&scope=EURLEX&type=quick&lang=en
Server: Europa
Transfer-Encoding: chunked
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9,fr;q=0.8
Cache-Control: max-age=0
Connection: keep-alive
Content-Length: 3161
Content-Type: application/x-www-form-urlencoded
Cookie: ELX_SESSIONID=e4dAV0CaIYd7y9OXWEi-bZmoGqhZGVoRGB9585PVa6TET1xvcJP7!1567286828; validateConsentCookies=true; WT_FPC=id=10.235.250.103-663370864.30694416:lv=1538695991009:ss=1538695755372; ACOOKIE=C8ctADEwLjIzNS4yNTAuMTAzLTY2MzM3MDg2NC4zMDY5NDQxNgAAAAAAAAABAAAAAwAAAOdctlv7W7ZbAQAAAAEAAADnXLZb+1u2WwEAAAADAAAAITEwLjIzNS4yNTAuMTAzLTY2MzM3MDg2NC4zMDY5NDQxNg--
DNT: 1
Host: eur-lex.europa.eu
Origin: https://eur-lex.europa.eu
Referer: https://eur-lex.europa.eu/search.html?qid=1538673501151&FM_CODED=REG&scope=EURLEX&type=quick&lang=en
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36
qid: 1538673501151
callingUrl: /search.html?qid=1538673501151&FM_CODED=REG&scope=EURLEX&type=quick&lang=en
id: 1538677936442
defaultProfile: true
profileName: Custom profile
multilingualLink: false
firstMultilingualLanguage: en
secondMultilingualLanguage: 
thirdMultilingualLanguage: 
firstSortCriteria: LEGAL_RELEVANCE_SORT
firstSortCritAsc: DESC
secondSortCriteria: NULL
secondSortCritAsc: DESC
isExpertMode: false
nbResultPerPage: 10
highlightResult: true
_highlightResult: on
filter: 
filter: 
metadataSelected[DD_DISPLAY]: DD_DISPLAY
_metadataSelected[DD_DISPLAY]: on
_metadataSelected[CELLAR_ID_ACT_DISPLAY]: on
_metadataSelected[XC_DISPLAY]: on
_metadataSelected[XA_DISPLAY]: on
_metadataSelected[DC]: on
_metadataSelected[CT]: on
_metadataSelected[CC]: on
_metadataSelected[RJ]: on
metadataSelected[ECLI]: ECLI
_metadataSelected[ECLI]: on
metadataSelected[AU]: AU
_metadataSelected[AU]: on
metadataSelected[FM]: FM
_metadataSelected[FM]: on
_metadataSelected[DN-old]: on
metadataSelected[DTS]: DTS
_metadataSelected[DTS]: on
_metadataSelected[DTA]: on
metadataSelected[DTT]: DTT
_metadataSelected[DTT]: on
_metadataSelected[DTC]: on
_metadataSelected[TT]: on
_metadataSelected[PAGES_TOTAL]: on
metadataSelected[SO]: SO
_metadataSelected[SO]: on
_metadataSelected[PD_DISPLAY]: on
_metadataSelected[IF_DISPLAY]: on
_metadataSelected[EV_DISPLAY]: on
_metadataSelected[SG_DISPLAY]: on
_metadataSelected[DB_DISPLAY]: on
_metadataSelected[LO_DISPLAY]: on
_metadataSelected[DL_DISPLAY]: on
_metadataSelected[DH_DISPLAY]: on
_metadataSelected[NF_DISPLAY]: on
_metadataSelected[RP_DISPLAY]: on
_metadataSelected[TP_DISPLAY]: on
_metadataSelected[VO_DISPLAY]: on
_metadataSelected[MS_DISPLAY]: on
_metadataSelected[BF_DISPLAY]: on
_metadataSelected[CI_DISPLAY]: on
_metadataSelected[AJ_DISPLAY]: on
_metadataSelected[EA_DISPLAY]: on
_metadataSelected[CD_DISPLAY]: on
_metadataSelected[MD_DISPLAY]: on
_metadataSelected[SP_DISPLAY]: on
_metadataSelected[LB_DISPLAY]: on
_metadataSelected[AP]: on
_metadataSelected[DF]: on
_metadataSelected[OB]: on
_metadataSelected[PR]: on
_metadataSelected[AG_DISPLAY]: on
_metadataSelected[JR_DISPLAY]: on
_metadataSelected[NA]: on
_metadataSelected[NO]: on
_metadataSelected[NC]: on
_metadataSelected[COLL_DISPLAY]: on
_metadataSelected[NO_OJ]: on
_metadataSelected[NO_OJ_CLASS]: on
_metadataSelected[COLL_OJ_DISPLAY]: on
_metadataSelected[AS_DISPLAY]: on
_metadataSelected[CM]: on
_metadataSelected[IC]: on
_metadataSelected[AF]: on
_metadataSelected[MI]: on
_metadataSelected[LG]: on
_metadataSelected[RI]: on
_metadataSelected[REP]: on
_metadataSelected[TOC_DISPLAY]: on
_metadataSelected[PROC_GR_DISPLAY]: on
_metadataSelected[DP]: on
_metadataSelected[AD]: on
_metadataSelected[LF]: on
_metadataSelected[RS_DISPLAY]: on
metadataSelected[MNE_IMPLEMENTS_DIR_DISPLAY]: MNE_IMPLEMENTS_DIR_DISPLAY
_metadataSelected[MNE_IMPLEMENTS_DIR_DISPLAY]: on
_metadataSelected[ELI]: on
button.apply: Apply

So this is a POST request with a form. In order to do this yourself in RapidMiner you would need to make the same query (with your sessionID etc). It would be a lot of work.
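
For anyone who would rather script it, here is a minimal sketch of how the captured form could be rebuilt as a POST request using only Python's standard library. The endpoint and field names come from the capture above; the helper name build_metadata_request, its parameters, and the choice to send only a subset of the captured fields are assumptions for illustration. The server may well require more of the fields, or a valid ELX_SESSIONID cookie, before it accepts the request. The sketch only builds the request and does not send it.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the captured network traffic.
ENDPOINT = "https://eur-lex.europa.eu/change-displayed-metadata.html"


def build_metadata_request(qid, selected_fields, session_cookie=None):
    """Build (but do not send) a form POST that mimics the captured request.

    Hypothetical helper: which fields the server really requires is an
    assumption, so treat this as a starting point for experimentation.
    """
    calling_url = f"/search.html?qid={qid}&FM_CODED=REG&scope=EURLEX&type=quick&lang=en"
    form = [
        ("qid", qid),
        ("callingUrl", calling_url),
        ("nbResultPerPage", "10"),
    ]
    for field in selected_fields:
        # Each ticked checkbox appears twice in the capture: once as
        # metadataSelected[X]=X and once as _metadataSelected[X]=on.
        form.append((f"metadataSelected[{field}]", field))
        form.append((f"_metadataSelected[{field}]", "on"))
    form.append(("button.apply", "Apply"))

    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    if session_cookie:
        headers["Cookie"] = f"ELX_SESSIONID={session_cookie}"

    # urlencode percent-encodes the brackets, so the body is plain ASCII.
    body = urlencode(form).encode("ascii")
    return Request(ENDPOINT, data=body, headers=headers, method="POST")


request = build_metadata_request("1538673501151", ["DD_DISPLAY", "AU", "FM"])
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Sending it would be a matter of passing the object to urllib.request.urlopen, ideally through an opener that keeps cookies so that the redirected search page sees the same session.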

If it were me I would do it in RapidMiner because (a) I am a terrible coder, and (b) I really enjoy puzzles like this. But most likely you just want to get it done. I am sure there is a Python library somewhere that will do something like this. If you know Python, I'd poke around GitHub and see what you find. Otherwise perhaps some of my coder friends will have a suggestion.

Scott

rfuentealba · October 2018

Hi @s_nektarijevic,

I would refrain from using the Get Page operator for heavy JS things and/or processing POST. In my experience, your best bet would be to use Selenium (yes, it's slow and has other drawbacks, but it is indeed useful). A few months ago some of us had a similar discussion that might help you. You can find it here.

Basically, Selenium can drive a headless browser, meaning that you get all the benefits of a browser except for the visual representation. You have to interact with the browser programmatically. I use it with Ruby, but (oh no, the crowd again!) you can find tutorials to use it with Python. If you use the Anaconda Python distribution and the Python Extension for RapidMiner, you can program it directly.

Hope this helps,

s_nektarijevic · October 2018

Dear Scott,

Many thanks for the response, this is very useful! I'll try something with the Python extension in RapidMiner, fingers crossed :-)

Cheers,

Snežana

s_nektarijevic · October 2018

Dear Rodrigo,

Thanks a lot for the input! I'll try the Python Extension and see how it goes :-)

Cheers,

Snežana

kayman · October 2018

One way to tackle this is as follows:

-> Use Firefox and load your page. You can do the same with other browsers, but Firefox is a bit easier.

-> Use Ctrl+Shift+E to open the network inspector.

-> Select HTML in the network pane to avoid too much clutter showing up, and click on the trash can to clear everything stored so far.

-> Click on "Change displayed metadata" and apply your settings.

-> Click OK and you will see a POST request to change-displayed-metadata.html appear in the network pane.

-> Right-click this request and select Copy -> Copy POST Data, and save this somewhere for now (like a text file).

-> Next, use the Get Page operator (I agree there are better ways using Python, but this one also works).

-> Set the URL of the page (Copy -> Copy URL).

-> Set the request method to POST.

-> Enable follow redirects.

-> In the 'query parameters', add the details you got from your POST data above.

So if you have this in your file: _metadataSelected[SP_DISPLAY]=on, set it as follows:

query key : _metadataSelected[SP_DISPLAY]

query value : on
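
Typing dozens of key/value pairs by hand is error-prone, so here is a small sketch that splits the saved POST data into the pairs to enter as query parameters. It assumes the copied data is a single URL-encoded string with fields joined by "&" (the exact format may differ between browser versions); the function name and the sample string are made up for illustration.

```python
from urllib.parse import parse_qsl


def post_data_to_query_parameters(post_data):
    """Split a URL-encoded form body into decoded (key, value) pairs."""
    # keep_blank_values=True preserves empty fields such as "filter=".
    return parse_qsl(post_data.strip(), keep_blank_values=True)


# Shortened, hypothetical sample of what the saved text file might contain.
sample = (
    "qid=1538673501151&filter="
    "&metadataSelected%5BAU%5D=AU"
    "&_metadataSelected%5BSP_DISPLAY%5D=on"
)

for key, value in post_data_to_query_parameters(sample):
    print(f"query key: {key}    query value: {value}")
```

The percent-encoded brackets are decoded back to [ and ], so each printed key can be pasted into the operator as shown above.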

You may not need to use all of these, as some of them may be default values, so use a bit of trial and error. Worst case, you may need to include them all, but it's a one-time effort.

Good luck!

SGolbert · October 2018

Hi Snežana,

The Scrapy framework can also do the trick. Don't forget to check the robots file:

https://eur-lex.europa.eu/robots.txt

In particular, these entries can be problematic:

Crawl-delay: 10
Disallow: /autocomplete
Disallow: /change-displayed-metadata

As far as I understood, what you are trying to do is illegal, but I know very little about crawling restrictions.
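
To check a URL against those rules before crawling, something like the following sketch with Python's standard urllib.robotparser should work. It parses only the three lines quoted above; the "User-agent: *" line is an assumption added so the parser has a group to attach the rules to, and the live robots.txt may contain more than this.

```python
from urllib.robotparser import RobotFileParser

# Rules quoted above. The "User-agent: *" line is an assumption.
ROBOTS_LINES = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /autocomplete",
    "Disallow: /change-displayed-metadata",
]


def load_rules(lines):
    """Return a parser loaded with the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


rules = load_rules(ROBOTS_LINES)
urls = (
    "https://eur-lex.europa.eu/search.html?qid=1538673501151&scope=EURLEX&type=quick&lang=en&FM_CODED=REG",
    "https://eur-lex.europa.eu/change-displayed-metadata.html",
)
for url in urls:
    print(url, "allowed:", rules.can_fetch("*", url))
print("crawl delay (seconds):", rules.crawl_delay("*"))
```

Against the real site you would call set_url with the robots.txt address and then read() instead of parsing hard-coded lines.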

Regards,

Sebastian
