XPath in RapidMiner
Hello RapidMiner Community,
I started working with RapidMiner about two weeks ago for educational purposes, and I really like the tool. But I have now been stuck for two days. My problem relates to the crawling, information extraction, and cutting operators. To be precise, it is about using XPath queries (at the moment I rely on workarounds with regex, but they are not really consistent).
My problem is the following:
Consider, for example, the famous IMDb reviews, e.g. http://www.imdb.com/title/tt0307901/reviews?start=0. When I try to extract a specific element, I get stuck with my query. For example, to extract the individual review texts, I tried the selector " .//*[@id='tn15content']/p[1] ", as suggested by the Chrome developer tools, but when I use it in RapidMiner I do not get a single result.
As you can see, I'm a total beginner with XPath (sorry for that; data science in general is a completely new area for me, and I know I'm getting old for learning new stuff). I really couldn't find an answer to this kind of question in previous threads that I could understand; they always seem too sophisticated for my limited understanding. So if you could give me some hints, examples, or resources on how to use and practice XPath queries in RapidMiner, that would be very nice.
Kind regards
Morgan!
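A minimal sketch of one common reason a selector copied from the browser returns nothing. It assumes the page is evaluated as namespaced XHTML after conversion, which may or may not be what happens in RapidMiner. Python's standard library stands in for RapidMiner here, and the markup and the prefix `h` are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the review page after
# conversion to XHTML; the real IMDb markup is not reproduced here.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="tn15content">
      <p>First review text.</p>
      <p>Second review text.</p>
    </div>
  </body>
</html>"""

root = ET.fromstring(XHTML)

# The selector as copied from the browser: the bare name "p" refers to an
# element in no namespace, so it matches nothing in a namespaced document.
plain = root.findall(".//*[@id='tn15content']/p[1]")

# The same selector with a prefix bound to the XHTML namespace.
ns = {"h": "http://www.w3.org/1999/xhtml"}
prefixed = root.findall(".//*[@id='tn15content']/h:p[1]", ns)

print(len(plain))                   # number of matches without a prefix
print([p.text for p in prefixed])   # text of the matches with the prefix
```

If the unprefixed query is empty while the prefixed one finds the paragraph, the namespace is the likely culprit and the selector itself was fine.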
Answers
Hi Morgan,
Is there any chance you could download a single file and use the wizard of Read XML to build the XPath? That's the way I usually do it.
~Martin
Dortmund, Germany
Thanks for the suggestion, I really appreciate your help.
I have the HTML files downloaded and will try the Read XML operator later tonight. I will come back and let you know if I get good results.
Many thanks,
Morgan
Hello again,
Since I have the raw HTML of the individual pages (I used the Crawl Web operator), the Read XML operator doesn't work out of the box. So I tried to "convert" the HTML to XML by hand, because I couldn't figure out how to do this in RapidMiner (how do I save the document as XML after the HTML to XML operator? There is no such thing as "Write XML"). Anyway, using the import wizard of Read XML on my "hand-made" XML file gives me the error message "invalid XML format".
Oh my gosh, I'm so stupid!!!
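A short sketch of why raw HTML often fails with an "invalid XML" style error. Browsers tolerate unclosed tags such as `<br>`, whereas an XML parser rejects them. The snippets and the helper `is_well_formed` are made up for illustration, and Python's standard XML parser stands in for Read XML here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments: the first is acceptable HTML but not XML,
# because <br> is never closed; the second closes it the XML way.
html_like = "<div><p>Great movie<br></p></div>"
xml_like = "<div><p>Great movie<br/></p></div>"


def is_well_formed(text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(html_like))
print(is_well_formed(xml_like))
```

This is why a dedicated HTML-to-XML conversion step is usually needed before an XML reader will accept a crawled page; renaming the file or editing a few tags by hand rarely catches every such case.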
Hi again,
Any chance you could post the HTML here so I can have a look?
~Martin
Hi,
Sorry for the late reply. I was tied up with my regular job and couldn't make progress on my private machine learning and RapidMiner workshop.
Sure, I can give you a quick example of the IMDb review HTML (http://www.imdb.com/title/tt0307901/reviews?start=0). My first step was "Crawl Web" and saving the HTML (as a backup; I know that I could also use the "Process Documents from Web" operator for a more integrated workflow). After that I wanted to extract a set of attributes via XPath, for example the author of the review, the rating, and the actual text. But as I said, I was not able to create the right XPaths, so I tried regex and strings, which worked quite well. Still, I think XPath should be the first choice, so I really want to master it :-) The HTML code for the example website is attached to this post (because the raw HTML code would exceed the post limit).
Many thanks,
Morgan
Since there was no response to my last post, just a short follow-up: does anyone have a few resources for learning XPath syntax as applied to RapidMiner? Also, another crawler-related question: is there an official RapidMiner tutorial on building crawlers for websites with authentication?
Kind regards
Morgan