Read Greek, Danish etc. HTML pages

mike075i · April 2018

Hi guys,

I am new to RapidMiner Studio. I want to do a web scraping task that crawls some Greek (and later Danish etc.) HTML sites and extracts the content. In the resulting columns, all the Greek letters look weird, as the screenshot shows.

The Process Documents from Data operator contains the following two components.

One idea was to add the Keep Document Parts operator with a Unicode-aware regular expression, so I entered \p{L} (which matches letters in any language) in the extraction regex parameter, following this article: "Java regex for support Unicode?". That did not fix the problem. So my questions are:

1. What regular expression is the right one?

2. Is there any other way to get columns containing the Greek letters?

Thank you in advance for help

MartinLiebig · April 2018

Hi,

Did you try changing the main process encoding to UTF-8? You can get there by clicking on the white background of "Process".

Best,

Martin

mike075i · April 2018

Hi, yes, but it didn't fix the problem. I have also posted a screenshot of the output of the Extract Content component below.

I ran the same process using Read RSS Feed in the main process instead of the Crawl Web component, and the encoding works fine. I don't know why this problem occurs with the Crawl Web component.

jwpfau · April 2018

This looks like ISO-8859-7 interpreted as UTF-8 to me. Do you have the URL of the crawled website?
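To make the symptom easier to recognize, here is a minimal Python sketch (not part of the RapidMiner process) of what Greek text looks like when its bytes are decoded with the wrong charset. The helper name `misdecode` and the sample word are illustrative assumptions; only standard-library codecs are used. It prints both directions of the mix-up so the output can be compared with the garbled column.

```python
def misdecode(text: str, actual: str, assumed: str) -> str:
    """Encode text with its real charset, then decode it with the wrong one."""
    return text.encode(actual).decode(assumed, errors="replace")


# Sample Greek word "Ελλάδα", written with escapes to avoid look-alike letters.
sample = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"

# Case 1: ISO-8859-7 bytes read as UTF-8.
# The bytes are not valid UTF-8, so they turn into U+FFFD replacement marks.
iso_read_as_utf8 = misdecode(sample, "iso-8859-7", "utf-8")

# Case 2: UTF-8 bytes read as ISO-8859-7.
# Every Greek letter takes two bytes in UTF-8, so each letter becomes two characters.
utf8_read_as_iso = misdecode(sample, "utf-8", "iso-8859-7")

for label, value in [
    ("original", sample),
    ("ISO-8859-7 read as UTF-8", iso_read_as_utf8),
    ("UTF-8 read as ISO-8859-7", utf8_read_as_iso),
]:
    print(f"{label:>26}: {value!r}")
```

If the garbled column shows replacement marks, the first case is the likely one; if it shows pairs of unrelated letters and symbols per Greek letter, the page is probably UTF-8 that was decoded with a single-byte charset.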

mike075i · April 2018

I have tested ISO-8859-7 too, but the same issue remains. The site is this one: https://www.google.gr/intl/el/policies/privacy/archive/. I have to crawl all the past (Greek) privacy policies and gather some information from every page. I want to mention that there is no such problem with the Read RSS Feed operator, but an RSS reader does not fit my purpose.

sgenzer · April 2018

Hi @mike075i - I can get this working on my computer, but I needed to do two things:

(a) Make sure I had Roboto font installed with Greek characters (I'm not sure this is necessary)

(b) override the encoding to UTF-8

(Note that you did not post your XML process, so I just ran Get Page on this URL: https://www.google.gr/intl/el/policies/privacy/archive/20160325/)

Scott

Screen Shot 2018-04-19 at 8.53.37 PM.png Screen Shot 2018-04-19 at 8.55.15 PM.png
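The idea behind step (b) can be sketched in a few lines of Python. This is not how RapidMiner implements Get Page; the function `decode_page`, its parameters, and the wrongly detected charset `iso-8859-1` are assumptions made for illustration. The point is only that an explicit override takes precedence over whatever charset was detected.

```python
from typing import Optional


def decode_page(raw: bytes, detected: str, override: Optional[str] = None) -> str:
    """Decode raw page bytes, letting an explicit override win over detection."""
    charset = override or detected
    return raw.decode(charset, errors="replace")


# Sample Greek word "Ελλάδα"; the page bytes are assumed to be UTF-8.
greek = "\u0395\u03bb\u03bb\u03ac\u03b4\u03b1"
raw = greek.encode("utf-8")

# Without an override, the (assumed) wrong detection produces garbled text.
garbled = decode_page(raw, detected="iso-8859-1")

# With the override, the bytes are decoded as what they really are.
fixed = decode_page(raw, detected="iso-8859-1", override="utf-8")

print(repr(garbled))
print(repr(fixed))
```

Once the text has been decoded with the wrong charset, later steps such as a regular expression filter only see the garbled characters, which is why the override has to be applied where the page is fetched.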

mike075i · April 2018

Oh sorry, my fault, I forgot to post my XML code. Here it is:

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="136">
        <parameter key="url" value="https://www.google.gr/intl/el/policies/privacy/archive/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value=".+privacy/archive.+"/>
        </list>
      </operator>
      <operator activated="false" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="TEST" width="90" x="45" y="493">
        <parameter key="url" value="http://www.samos.aegean.gr/st/"/>
        <list key="crawling_rules"/>
      </operator>
      <operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="313" y="136">
        <parameter key="link_attribute" value="Link"/>
        <parameter key="random_user_agent" value="true"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="238">
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="add_meta_information" value="false"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="313" y="187"/>
          <operator activated="false" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="340"/>
          <connect from_port="document" to_op="Extract Content" to_port="document"/>
          <connect from_op="Extract Content" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Crawl Web" from_port="example set" to_op="Get Pages" to_port="Example Set"/>
      <connect from_op="Get Pages" from_port="Example Set" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

> (b) override the encoding to UTF-8

Where is this setting located (in which component)? I was not able to find it :smileysad:.

jwpfau · April 2018

I filed a bug report for the wrong encoding detection.

I hope this works for you:

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="34">
        <parameter key="url" value="https://www.google.gr/intl/el/policies/privacy/archive/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value=".+privacy/archive.+"/>
        </list>
      </operator>
      <operator activated="true" class="concurrency:loop_values" compatibility="8.1.003" expanded="true" height="82" name="Loop Values" width="90" x="246" y="34">
        <parameter key="attribute" value="Link"/>
        <parameter key="iteration_macro" value="link"/>
        <process expanded="true">
          <operator activated="true" class="web:get_webpage" compatibility="7.3.000" expanded="true" height="68" name="Get Page" width="90" x="112" y="34">
            <parameter key="url" value="%{link}"/>
            <list key="query_parameters"/>
            <list key="request_properties"/>
            <parameter key="override_encoding" value="true"/>
            <parameter key="encoding" value="UTF-8"/>
          </operator>
          <connect from_op="Get Page" from_port="output" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="text:process_documents" compatibility="8.2.000-SNAPSHOT" expanded="true" height="103" name="Process Documents" width="90" x="447" y="34">
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="add_meta_information" value="false"/>
        <process expanded="true">
          <operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content (4)" width="90" x="447" y="34"/>
          <connect from_port="document" to_op="Extract Content (4)" to_port="document"/>
          <connect from_op="Extract Content (4)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Crawl Web" from_port="example set" to_op="Loop Values" to_port="input 1"/>
      <connect from_op="Loop Values" from_port="output 1" to_op="Process Documents" to_port="documents 1"/>
      <connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
      <connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
    </process>
  </operator>
</process>
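For anyone comparing their own process against the one above, the following Python sketch lists the encoding-related parameters of each operator in a process XML. It is a hypothetical helper, not a RapidMiner tool: the function name `encoding_settings` and the trimmed-down `PROCESS_XML` fragment are assumptions, modelled on the operator and parameter names that appear in the processes posted in this thread.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Trimmed-down fragment modelled on the processes posted in this thread.
PROCESS_XML = """<process version="8.1.003">
  <operator class="process" name="Process">
    <parameter key="encoding" value="SYSTEM"/>
    <process>
      <operator class="web:get_webpage" name="Get Page">
        <parameter key="override_encoding" value="true"/>
        <parameter key="encoding" value="UTF-8"/>
      </operator>
      <operator class="web:retrieve_webpages" name="Get Pages">
        <parameter key="link_attribute" value="Link"/>
      </operator>
    </process>
  </operator>
</process>"""

ENCODING_KEYS = ("encoding", "override_encoding")


def encoding_settings(process_xml: str) -> Dict[str, Dict[str, str]]:
    """Map each operator name to the encoding parameters set directly on it."""
    root = ET.fromstring(process_xml)
    report = {}
    for op in root.iter("operator"):
        # findall only looks at direct children, so nested operators
        # do not leak their parameters into the parent's entry.
        report[op.get("name")] = {
            p.get("key"): p.get("value")
            for p in op.findall("parameter")
            if p.get("key") in ENCODING_KEYS
        }
    return report


for name, params in encoding_settings(PROCESS_XML).items():
    print(f"{name}: {params or 'no encoding parameters set'}")
```

An operator that reports no encoding parameters in its XML entry relies on its defaults, which is where the wrong detection showed up in the original process.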

mike075i · April 2018

Thank you very much, this solution fixed my problem. Thumbs up :smileyhappy:.


Fixed and Released · Last Updated October 2019
