text extraction from html

nrwstudent Member Posts: 4 Contributor I
edited October 2019 in Help

After getting a nearly immediate response from the community to my first question, I feel encouraged to ask another one =)

I'm trying to extract text from HTML. I would like to use XPath instead of the Extract Content operator,
so I am using the Extract Information operator. But when I copy and paste the XPath I get from Google Chrome (//*[@id="content"]/div[1]/div/p[1]/text()), I get no proper results.

In another post I read that I have to insert the h: prefix, like //h:*[@id="content"]/h:div[1]/h:div/h:p[1]/text(), but that brought no improvement.

 

Could you tell me what I did wrong?

 

Thanks in advance !

Answers

  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist

    Hi,

     

    Is it possible to post the full process? It's a bit hard to do this on the fly.

     

    To be honest, I got a bit lazy lately. The Aylien extension provides an Extract Article option which makes the parsing obsolete. The free API is capped at 1k pages/day though.

     

    ~Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    I've done a bit of XPath extraction using RapidMiner and found that you can't just paste the XPath that Google Chrome gives you. Not sure why, but it doesn't work 99% of the time. I would look at the structure of the page and then build the XPath accordingly.
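
One possible reason the h: prefix comes up at all can be sketched outside RapidMiner. This is only an illustration, assuming the page ends up parsed as XHTML, where every element lives in the XHTML namespace. The snippet below uses Python's standard xml.etree.ElementTree on a made-up page (the markup and the NS mapping are hypothetical, not taken from the original process). ElementTree has no text() step, so the text is read from the matched element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML snippet standing in for the real page.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="content">
      <div>
        <div>
          <p>First paragraph.</p>
          <p>Second paragraph.</p>
        </div>
      </div>
    </div>
  </body>
</html>"""

# Assumed prefix mapping: "h" bound to the XHTML namespace.
NS = {"h": "http://www.w3.org/1999/xhtml"}

root = ET.fromstring(XHTML)

# Unprefixed names only match elements in no namespace, so nothing is found.
plain = root.findall(".//div[@id='content']/div[1]/div/p[1]")

# Prefixed names are resolved through NS and match the XHTML elements.
prefixed = root.findall(".//h:div[@id='content']/h:div[1]/h:div/h:p[1]", NS)

plain_texts = [p.text for p in plain]
prefixed_texts = [p.text for p in prefixed]

print(plain_texts)     # []
print(prefixed_texts)  # ['First paragraph.']
```

If a prefixed path still returns nothing, the element structure the parser sees may simply differ from what the browser shows, which is why building the path step by step from the actual document tends to be more reliable than pasting it.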
