Regular expression for HTML content extraction
Hi guys, I have an HTML page and want to extract the content of the <p> tag that follows a specific <h2> tag.
I am using the Extract Information component with Regular Expression as the query type. I have tried to extract the
content of the <h2> tag (regex: <h2>(.+?)</h2>), which gives me the right result, Specific 1 (the HTML snippet is listed below).
But when I try to extract the <p>blabla...</p> content after this specific <h2> tag using the
regex <h2>Specific 1</h2><p>(.+?)</p>, it doesn't work.
...
<h2>Specific 1</h2>
<p>blablabla...</p>
...
Can someone tell me why, and what the right regex is to get the <p> content?
Thank you
Best Answer
mike075i Member Posts: 11 Contributor II
Hello, I have solved the problem myself. The whole problem was that I had to add the h: prefix before the HTML tag names in the XPath query. The solution is related to this post: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/XPath-with-quot-Cut-Document-quot-or-quot-Extract-Information/td-p/45582
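To illustrate why the h: prefix can matter, here is a small sketch in Python using the standard library's ElementTree, not RapidMiner itself. It assumes the page is parsed as XHTML, so that every element lives in the XHTML namespace; under that assumption an unprefixed name such as h2 matches nothing. The sample markup and the helper first_p_after are hypothetical. ElementTree supports only a subset of XPath (no following-sibling axis), so the sibling step is done in plain Python.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
NS = {"h": XHTML_NS}  # "h" is only a label bound to the XHTML namespace

# Hypothetical sample document, not the page from this thread
doc = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<h2 id="infocollect">Specific 1</h2>'
    '<p>blablabla...</p>'
    '<h2 id="other">Specific 2</h2>'
    '<p>other text</p>'
    '</body></html>'
)
root = ET.fromstring(doc)

# Unprefixed name: looks for an h2 in no namespace, so nothing is found
unprefixed = root.find(".//h2[@id='infocollect']")

# Prefixed name: looks for an h2 in the XHTML namespace
prefixed = root.find(".//h:h2[@id='infocollect']", NS)


def first_p_after(root, heading_id):
    """Return the text of the first <p> sibling that follows the given <h2>."""
    h2_tag = f"{{{XHTML_NS}}}h2"
    p_tag = f"{{{XHTML_NS}}}p"
    for parent in root.iter():
        seen_heading = False
        for child in parent:
            if child.tag == h2_tag and child.get("id") == heading_id:
                seen_heading = True
            elif seen_heading and child.tag == p_tag:
                return "".join(child.itertext())
    return None


p_text = first_p_after(root, "infocollect")
print(unprefixed, prefixed.text, p_text)
```

In an XPath engine that supports the full axis set, the equivalent query under the same assumption would presumably look like //h:h2[@id='infocollect']/following-sibling::h:p[1], with h bound to the XHTML namespace.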
Answers
Can you post your html file? The expression you've given seems like it should work but it is hard to tell or test without a data sample.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
This was only an example. I have attached the whole HTML document, which contains Google's policies in different languages (for simplicity I have attached the English one). Because of the upload restrictions on file extensions, I changed it from .html to .txt. Below is the <p>...</p> part that I want to extract after the <h2> tag:
Not sure if you will be able to manage this with regex; XPath might be a better candidate for your needs.
But if there is only one match in your HTML, this may work:
(Read as: start at the beginning of the file, do not stop at line breaks, until you find the first h2 with id="infocollect"; next take the content of the following p tag and store it; then ignore everything again until the end of the page.)
So replacing with $1 gives just the p tag content.
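A minimal sketch of the approach described above, written in Python rather than as a RapidMiner expression. It is not the expression originally posted in this reply; the sample HTML and both patterns are assumptions for illustration. It also shows one likely reason a pattern that expects <p> directly after </h2> fails: the line break between the two tags is not matched.

```python
import re

# Hypothetical sample; the real page is not reproduced here
html = (
    "<html><body>\n"
    '<h2 class="H8KnQb" id="infocollect">Specific 1</h2>\n'
    "<p>blablabla...</p>\n"
    "<h2>Specific 2</h2>\n"
    "<p>other text</p>\n"
    "</body></html>\n"
)

# Expects <p> immediately after </h2>, so the line break between
# the two tags prevents a match
strict = re.search(
    r'<h2[^>]*\bid="infocollect"[^>]*>.*?</h2><p>(.+?)</p>', html
)

# \s* allows whitespace (including line breaks) between the tags,
# and DOTALL lets "." cross line breaks inside the paragraph
pattern = re.compile(
    r'<h2[^>]*\bid="infocollect"[^>]*>.*?</h2>\s*<p\b[^>]*>(.*?)</p>',
    re.DOTALL,
)
match = pattern.search(html)
p_content = match.group(1) if match else None

print(strict)     # no match for the strict pattern
print(p_content)  # content of the first <p> after the heading
```

Searching for the capture group directly, as done here, avoids having to match the whole file and replace it with $1, but it relies on the same idea: anchor on the h2 with the known id and capture only what sits inside the next p tag.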
Thank you, but the same issue remains: all the content in the attribute is shown as ?. You are right that XPath is the main choice, but I don't have much time to learn XPath now. In addition, every time I execute the process using the regex I get this error message (example for the Danish language):
@sgenzer are you able to read this text file? I can open it in Notepad++ and it looks fine and says it is encoded UTF-8, but when I try to read it in RapidMiner, it comes back with unreadable characters (both using System encoding as well as UTF-8). I feel like there was another thread with this problem recently, but now I can't find it. Is this another known bug? Or is there some other encoding setting that I am missing somewhere? Thanks!
Hi @Telcontar120, yes, I can read this file fine. However, I cannot see the </p> tag in that text file, so I wrote the regex to include a small snippet of the next piece.
Scott
I believed that XPath was something like a new programming language, which is why I wrote that I did not have much time to learn it. That is not the case: it has an easy syntax for finding the right elements in the DOM structure. XPath is by far the better solution, but I had no experience with it before. In addition, ChroPath for Chrome is an awesome extension for checking the right path. Thank you.
I have tried to extract the <p> content after the <h2 class="H8KnQb" id="infocollect">Τα στοιχεία που συλλέγουμε</h2> tag using the XPath query //h2[@id='infocollect']/following-sibling::p[1] in the Extract Information component, but the problem remains in the output. As you can see in the screenshot below, the same XPath query extracts the content correctly in ChroPath.
I have also added the Extract Content operator to strip the HTML tags and get only the text, which starts with Συλλέγουμε στοιχεία, για να παρέχουμε καλύτερες υπηρεσίες σε όλους τους χρήστες μας. Here is my XML code; maybe you can help me fix this problem:
I am using RapidMiner Studio version 8.1.001 on the Win64 platform.