text extraction from html

nrwstudent · July 2016

After I got nearly immediatly response from the community for my first question I feel encouraged to ask another one

I'm trying to extract text from html. Therefore I would like to use Xpath instead of the Extract Content Operator.
Therefore I would like to use the Extract Information operator. But when I copy and paste the xpath I get from google chrome (//*[@id="content"]/div[1]/div/p[1]/text()I get no propper results.

In another post I read that I have to insert h: like //h:*[@id="content"]/h:div[1]/h:div/h:p[1]/text() - but no improvement.

Could you tell me what i did wrong?

Thanks in advance !

MartinLiebig · July 2016

Hi,

is it possible to post the full process? A bit hard to do this on the fly.

To be honest i got a bit lazy lately. The Aylien extension provides an Extract Article option which makes the parsing obsolute. The free API is capped at 1k pages/day though.

~Martin

Thomas_Ott · July 2016

I've done a bit of Xpath extraction using RapidMIner and found that you can't just paste the Xpath that Google gives you. Not sure why, but it doesn't work 99% of the time. I would look at the structure of the page and then build accordingly.

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

text extraction from html

Answers