Website-Content into one cell
Hello everyone,
I want to use text mining methods on the lyrics from a website.
What I have now is:
Artist | Song | Lyrics |
---|---|---|
The Killers | Mr. Brightside | http://lyrics.html |
What I want is:
Artist | Song | Lyrics |
---|---|---|
The Killers | Mr. Brightside | Coming out of my cage and I'm doing just fine... |
Do you see what I mean? The lyrics are inside a <p></p> element, and I want the whole string in one single cell.
I know that I need "Retrieve", "Get Pages" and "Process Documents to Data" (with "Extract Content" inside it), but I don't know how to go on from there.
Which operator puts the content of the <p> element into one cell?
I hope someone can help me, because I need the lyrics for further processing.
Thank you
Answers
I think you want "Cut Document" rather than (or in addition to) "Extract Content" in this case. After you have retrieved the pages using "Get Pages" and created your text documents using "Data to Documents", you can use "Cut Document" and specify the region of the HTML that you want to extract, using either XPath (if the lyrics are in a named element) or some kind of regex query.
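To make the regex idea concrete outside RapidMiner, here is a rough Python sketch of the same logic: find the <p> element, drop any nested tags, and collapse the text into one string for the Lyrics cell. It is not a RapidMiner process. The sample page and the function name `extract_paragraph_text` are made up for illustration, and a real lyrics page will probably need a more specific pattern.

```python
import html
import re

# Hypothetical sample page; the markup of a real lyrics page will differ.
SAMPLE_PAGE = """
<html><body>
<h1>Mr. Brightside</h1>
<p>Coming out of my cage<br/>
and I'm doing just fine</p>
</body></html>
"""


def extract_paragraph_text(page: str) -> str:
    """Return the text of the first <p>...</p> block as one single-line string."""
    match = re.search(r"<p\b[^>]*>(.*?)</p>", page, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        return ""  # no <p> element found
    inner = re.sub(r"<[^>]+>", " ", match.group(1))  # drop nested tags such as <br/>
    inner = html.unescape(inner)                      # decode entities such as &amp;
    return " ".join(inner.split())                    # collapse line breaks and extra spaces


# One row of the target table, with the lyrics in a single cell.
row = {
    "Artist": "The Killers",
    "Song": "Mr. Brightside",
    "Lyrics": extract_paragraph_text(SAMPLE_PAGE),
}
print(row["Lyrics"])
```

In "Cut Document" the equivalent would be a regex (or XPath) query that selects only the <p> region, so that each page ends up as one document and therefore one row.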
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts