"Crawl Web -- Enable Basic Auth"
Hi - I'm brand new to RapidMiner as of this week (using Studio 7.3). I'm using Crawl Web to access the web page http://www.thetimes.co.uk/search?q= (with added search parameters), and I can successfully return a set of news articles. However, each search result returns only the first few paragraphs of the article because my login has not been recognized. I've entered the correct account credentials in "Enable Basic Auth". Any ideas please?
Best Answer
Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
I remember watching this video https://www.youtube.com/watch?v=-Sr3i7klRHM a while back, and I believe they passed Twitter OAuth credentials in RapidMiner using Generate User Data and something else. This was right before we came out with the Twitter operators, but if you hack this you might be able to get into your login.
Answers
Hi...yes, I have had the same issues with this operator. You're doing everything correctly, but I have found the "basic auth" feature rather hit-or-miss. It's basic <grin>. Note that the help documentation says to use this only over https because it places the auth credentials in the header. But on a news site like the New York Times (where I have a subscription), that's not how the login works. I am not an expert in authentication, so I will defer to others on the differences here.
That said, I have gotten this kind of thing to work in RapidMiner but it will not be one click like you are hoping...
Scott
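For anyone following along, here is a minimal Python sketch (outside RapidMiner, and not how the Crawl Web operator is implemented) of what "credentials in the header" means for HTTP Basic authentication. The function name `basic_auth_header` and the example credentials are made up for illustration. The value is only Base64-encoded, not encrypted, which is why the documentation's advice to use https matters.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header used by HTTP Basic authentication.

    Hypothetical helper for illustration only: the credentials are joined
    with a colon and Base64-encoded, so anyone who can read the request
    can decode them unless the connection is encrypted (https).
    """
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Example with placeholder credentials (not a real account)
headers = basic_auth_header("user", "pass")
print(headers)
```

A site that logs you in through a form on a web page ignores this header entirely, which would explain why the articles still come back truncated.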
Might I suggest checking out our Mozenda extension? Although you pay some $ to Mozenda, you can scrape things way more easily.
Thanks, Tom. I forgot about Mozenda because it is never an option for me (it requires a Windows client) and it is very expensive. But for those with Windows and a budget of $99+ per month, it is certainly a good option.
Scott
Thanks for the responses. I did take a brief look at Mozenda (looks interesting), but was hoping there might be an alternative approach for the same reasons as Scott, i.e. because I use a MacBook and because of the cost. I know it is an option if I install a virtual machine program like VMware Fusion, so I may yet have to reconsider. The Times login is https/login.thetimes.co.uk, so I had hoped that maybe I'd just misstepped in my set-up of Enable Basic Auth.
That's a very helpful video. Thanks, Tom.
Thanks Thomas. I haven't had chance to try the video idea as I'm wrestling with Process Documents from the Web at the moment. But will take a look when I get chance.
That's a great Youtube video! Looks like it's also using one of my example processes from back in the day too! #FeelingProud
http://community.rapidminer.com/t5/RapidMiner-Server/SOLVED-Open-File-with-basic-authentication-in-RapidAnalytics/m-p/24073
You might need to change a bit of the XML on this link to convert it from 5.3 to 7.3 formatting.
I have a whole set of template processes somewhere around that set up OAuth integration for a couple of email marketing APIs (Silverpop & DotMailer) as well as Twitter authentication.
Hi,
basic auth is the authentication where in your browser you'd get the ugly input dialog box overlay.
If you have a form login (a login embedded in the web page), that's not basic auth anymore. The problem is that those logins would theoretically be supported, but due to Cross-Site Request Forgery prevention, it almost never works. Thus it was excluded from the operator.
Regards,
Marco
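To make the form-login point above concrete, here is a rough Python sketch of why a crawler cannot simply send a username and password to such a page. A login form typically carries a hidden anti-CSRF token that changes per session and must be sent back with the credentials. Everything here is an assumption for illustration: the sample HTML, the field names `csrf_token`, `username` and `password`, and the helper `build_login_payload` do not come from The Times or from RapidMiner.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.hidden = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        is_hidden = (attributes.get("type") or "").lower() == "hidden"
        if is_hidden and attributes.get("name"):
            self.hidden[attributes["name"]] = attributes.get("value") or ""


def build_login_payload(login_html: str, username: str, password: str) -> dict:
    """Combine the page's hidden fields (e.g. a CSRF token) with credentials.

    The credential field names are hypothetical; a real site defines its own.
    """
    parser = HiddenInputParser()
    parser.feed(login_html)
    payload = dict(parser.hidden)
    payload["username"] = username
    payload["password"] = password
    return payload


# Made-up login page standing in for whatever the real site serves
SAMPLE_LOGIN_PAGE = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

print(build_login_payload(SAMPLE_LOGIN_PAGE, "me", "secret"))
```

In practice the payload would then have to be posted back in the same session (with the cookies from the first request) before any article pages are fetched, and many sites add further checks on top of this.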