Read Greek, Danish etc. HTML pages
Hi guys,
I am new to RapidMiner Studio. I want to do a web scraping task that crawls some Greek (and later Danish etc.) HTML sites and extracts the content. In the resulting columns, all the Greek letters look weird, as the screenshot shows.
The Process Document from Data operator contains the following two components.
One idea was to add the Keep Document Parts operator with a regular expression for UTF-8, so I entered \p{L} (letters in any language) in the extraction regex parameter, following this article: Java regex for support Unicode?. But that did not fix the problem. So my questions are:
1. What regular expression is the right one?
2. Is there any other way to get columns that contain the Greek letters?
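For what it's worth, a character class such as \p{L} only selects letters from text that has already been decoded; it cannot repair text that was decoded with the wrong charset. Here is a minimal Python sketch of that point. Python's built-in re module has no \p{L}, so `[^\W\d_]` is used as a stand-in for "any Unicode letter". The sample string and the Latin-1 mis-decoding are assumptions for illustration, not taken from the process above.

```python
import re

# Stand-in for \p{L}: any word character that is neither a digit nor "_".
LETTERS = re.compile(r"[^\W\d_]+")

sample = "Πολιτική απορρήτου 2016"  # assumed sample text

# On correctly decoded text the pattern keeps the Greek words.
good_words = LETTERS.findall(sample)

# Simulate a wrong decode: UTF-8 bytes read as Latin-1 (assumed wrong charset).
garbled = sample.encode("utf-8").decode("latin-1")

# The pattern still matches "letters", but they are no longer Greek ones.
garbled_words = LETTERS.findall(garbled)

print(good_words)
print(garbled_words)
```

So the regex is probably not the place to fix this; the decoding step before it is.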
Thank you in advance for help
Comments
Hi,
Did you try to change the main process encoding to UTF-8? You can get there by clicking into the white background of "Process".
Best,
Martin
Dortmund, Germany
Hi, yes, but it didn't fix the problem. Below I have also posted a screenshot of the output of the Extract Content component.
I built the same process using Read RSS Feed in the main process instead of the Crawl Web component, and there the encoding works fine. I don't know why the problem occurs with the Crawl Web component.
This looks like ISO-8859-7 interpreted as UTF-8 to me. Do you have the URL of the crawled website?
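To illustrate what such a charset mix-up looks like, here is a small Python sketch. The sample string is an assumption, and which direction the mix-up actually goes for this site is only a guess at this point.

```python
text = "Πολιτική απορρήτου"  # assumed sample, "privacy policy" in Greek

# Case 1: the bytes are UTF-8 but are read as ISO-8859-7.
# Every Greek letter turns into two unrelated characters.
utf8_as_greek = text.encode("utf-8").decode("iso-8859-7", errors="replace")

# Case 2: the bytes are ISO-8859-7 but are read as UTF-8.
# The bytes are not valid UTF-8, so replacement characters appear.
greek_as_utf8 = text.encode("iso-8859-7").decode("utf-8", errors="replace")

# Decoding with the charset the bytes were written in round-trips cleanly.
round_trip = text.encode("utf-8").decode("utf-8")

print(utf8_as_greek)
print(greek_as_utf8)
print(round_trip)
```

Comparing the garbled output in the result columns with these two patterns can hint at which charset was wrongly assumed.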
I have tested ISO-8859-7 too, but the same issue remains. The site is this one: https://www.google.gr/intl/el/policies/privacy/archive/. I have to crawl all the past privacy policies (Greek) and gather some information from every page. I want to mention that there is no such problem with the Read RSS Feed operator, but I don't need an RSS reader for my purpose.
Hi @mike075i - I can get this working on my computer, but I needed to do two things:
(a) Make sure I had Roboto font installed with Greek characters (I'm not sure this is necessary)
(b) override the encoding to UTF-8
(note that you did not post your XML process so I just did Get Page of this URL: https://www.google.gr/intl/el/policies/privacy/archive/20160325/)
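The idea behind (b) can be sketched in Python as follows. This is not how RapidMiner implements it; `decode_page`, its parameters, and the wrongly detected `iso-8859-1` are hypothetical and only illustrate an explicit override winning over automatic detection.

```python
from typing import Optional


def decode_page(raw: bytes, detected: Optional[str], override: Optional[str] = None) -> str:
    """Decode page bytes, letting an explicit override win over detection."""
    charset = override or detected or "utf-8"
    return raw.decode(charset, errors="replace")


# Assumed sample page content, stored as UTF-8 bytes.
raw = "Πολιτική απορρήτου".encode("utf-8")

# Hypothetical wrong detection: the text comes out garbled.
auto = decode_page(raw, detected="iso-8859-1")

# Same bytes with the encoding overridden to UTF-8: the Greek text is intact.
fixed = decode_page(raw, detected="iso-8859-1", override="utf-8")

print(auto)
print(fixed)
```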
Scott
Oh sorry, my fault, I forgot to post my XML code, so here it is:
> (b) override the encoding to UTF-8
Where is this setting located (in which component)? I was not able to find it :smileysad:.
I filed a bug report for the wrong encoding detection.
I hope this is working for you
Thank you very much, this solution fixed my problem. Thumbs up :smileyhappy:.