Problem with collecting specific information using RegEx

lukei_11 · February 2018

Hey RapidMiner community,

I have a problem with the use of a RegEx:

I'd like to collect the addresses of different institutions and companies. To do this I use the Crawl Web operator and collect the pages that contain the address information. This step is working perfectly. In the next step I want to retrieve the street and the ZIP code + city. For that I use the following RegEx in the "Extract Information" operator:

(.+\s)((D|d|DE)?\-?[6-7][0-9]{4}\s[A-Z][a-z]{1,})

With this RegEx I'd like to collect the following:

For example from this site http://www.vfb.de/de/1893/club/service/formales/impressum/

I want "Mercedesstraße 109" and "70372 Stuttgart" as the result.

The part with the ZIP code (starting with either 6 or 7) and the name of the city works. Since the street is on the line above, I want to capture that line as well. But as soon as I add the first part (.+\s) to collect the line above the ZIP code and city, the result in the results view of my process is just a ? (question mark). Is there a mistake in my RegEx, or does RapidMiner require a special format? When I test my RegEx in a free online RegEx tester it works properly...
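For reference, here is a minimal sketch of what the two capture groups are meant to return on plain text. It uses Python's re module as a stand-in (RapidMiner itself uses Java regular expressions, but these particular constructs behave the same way in both). The sample text is an assumption built from the address quoted above, not the actual page source.

```python
import re

# The pattern from the post, unchanged.
PATTERN = re.compile(r"(.+\s)((D|d|DE)?\-?[6-7][0-9]{4}\s[A-Z][a-z]{1,})")

# Assumed plain-text sample: street and ZIP code + city on consecutive lines.
text = "VfB Stuttgart 1893 AG\nMercedesstraße 109\n70372 Stuttgart"

match = PATTERN.search(text)

# Group 1 is the line above the ZIP code, including the trailing line break.
street = match.group(1).strip()
# Group 2 is the ZIP code followed by the city name.
zip_city = match.group(2)

print(street)
print(zip_city)
```

On this clean input the pattern does what the post describes, which suggests that any difference in the process comes from the text the operator actually receives rather than from the expression itself.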

Thank you!

lukei_11

BalazsBarany · February 2018

Hi!

There are special cases for multiline regexes. Make sure to use an online regular expression tester that lets you use Java syntax, as that's what RapidMiner uses. It is best to use the tester built into RapidMiner, which is available e.g. in the Replace operator.
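One such multiline special case, sketched here in Python as a stand-in for Java: by default the dot does not match a line break, so (.+\s) can cover at most one line. With the "dot matches all" flag (re.DOTALL in Python, Pattern.DOTALL in Java) the same group swallows every preceding line. Whether a given tester or operator sets such a flag is an assumption you would have to check; the sample text is illustrative only.

```python
import re

PATTERN = r"(.+\s)((D|d|DE)?\-?[6-7][0-9]{4}\s[A-Z][a-z]{1,})"

# Illustrative sample text, not taken from the real page.
text = "VfB Stuttgart 1893 AG\nMercedesstraße 109\n70372 Stuttgart"

# Default mode: "." stops at line breaks, so group 1 is a single line.
default_group = re.search(PATTERN, text).group(1)

# DOTALL mode: "." also matches line breaks, so the greedy group 1
# extends back to the start of the text.
dotall_group = re.search(PATTERN, text, re.DOTALL).group(1)

print(repr(default_group))
print(repr(dotall_group))
```

If two tools disagree about the same expression, comparing their flag settings is a reasonable first check.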

Here's an example process:

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.6.003" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="112" y="34">
        <list key="attribute_values">
          <parameter key="daten" value="&quot;VfB Stuttgart 1893 AG&#10;VfB Stuttgart 1893 AG&#10;Mercedesstraße 109&#10;70372 Stuttgart&quot;"/>
        </list>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="replace" compatibility="7.6.003" expanded="true" height="82" name="Replace" width="90" x="313" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="daten"/>
        <parameter key="replace_what" value="(.+\s)((D|d|DE)?\-?[6-7][0-9]{4}\s[A-Z][a-z]{1,})"/>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Click on the Replace operator and then on the small button on the right side of "replace what". You'll see the built-in regular expression tester. Paste your example text; the matched part will be highlighted. Here you can play with the expression until it does what you want.

Regular expressions are not the best method for this, though, if your output is not "regular".

Regards,

Balázs

lukei_11 · February 2018

Dear Balazs,

Thank you for your quick response! I tried your solution, but it isn't working... When I test my RegEx in the testing environment it matches what it should, but when I run my process it only returns a ? (question mark) as the result.

[Screenshot: RegEx problem.PNG. In the test environment the RegEx matches the right parts.]

Is there any mistake in my process?

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.000" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
      <operator activated="true" class="web:crawl_web_modern" compatibility="7.3.000" expanded="true" height="68" name="Crawl Web" width="90" x="45" y="238">
        <parameter key="url" value="https://www.vfb.de/"/>
        <list key="crawling_rules">
          <parameter key="follow_link_with_matching_url" value=".+impressum(.*)?"/>
          <parameter key="store_with_matching_url" value=".+impressum(.*)?"/>
        </list>
        <parameter key="max_crawl_depth" value="1"/>
        <parameter key="retrieve_as_html" value="true"/>
        <parameter key="enable_basic_auth" value="false"/>
        <parameter key="add_content_as_attribute" value="true"/>
        <parameter key="write_pages_to_disk" value="false"/>
        <parameter key="include_binary_content" value="false"/>
        <parameter key="output_dir" value="C:\Users\lukei\Desktop\Impressum"/>
        <parameter key="output_file_extension" value="html"/>
        <parameter key="max_pages" value="100"/>
        <parameter key="max_page_size" value="1000"/>
        <parameter key="delay" value="200"/>
        <parameter key="max_concurrent_connections" value="100"/>
        <parameter key="max_connections_per_host" value="50"/>
        <parameter key="user_agent" value="rapidminer-web-mining-extension-crawler"/>
        <parameter key="ignore_robot_exclusion" value="false"/>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="246" y="238">
        <parameter key="create_word_vector" value="false"/>
        <parameter key="vector_creation" value="TF-IDF"/>
        <parameter key="add_meta_information" value="true"/>
        <parameter key="keep_text" value="false"/>
        <parameter key="prune_method" value="none"/>
        <parameter key="prune_below_percent" value="3.0"/>
        <parameter key="prune_above_percent" value="30.0"/>
        <parameter key="prune_below_rank" value="0.05"/>
        <parameter key="prune_above_rank" value="0.95"/>
        <parameter key="datamanagement" value="double_sparse_array"/>
        <parameter key="data_management" value="auto"/>
        <parameter key="select_attributes_and_weights" value="false"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:extract_information" compatibility="8.1.000" expanded="true" height="68" name="Extract Information" width="90" x="246" y="34">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries">
              <parameter key="PLZ und Ort" value="([6-7][0-9]{4}\s[A-Z][a-z]{1,})"/>
              <parameter key="PLZ" value="[6-7][0-9]{4}\s"/>
              <parameter key="Strasse" value="(.+\s)((D|d|DE)?\-?[6-7][0-9]{4}\s[A-Z][a-z]{1,})"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
          </operator>
          <connect from_port="document" to_op="Extract Information" to_port="document"/>
          <connect from_op="Extract Information" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Crawl Web" from_port="example set" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <description align="center" color="yellow" colored="false" height="105" resized="false" width="180" x="426" y="12"> YYY ABC</description>
    </process>
  </operator>
</process>

You mentioned that this is possibly not the best way of doing it... Do you have another idea for how to automatically collect the contact information and addresses of soccer clubs (and there are many soccer clubs in Germany...)?

Thank you!

BalazsBarany · February 2018

Hi!

You're searching for only characters and numbers in the street. However, the web site contains: Mercedesstraße 109. The ß is expressed as an HTML entity.

The text you're testing is not what's coming out of the crawler operator. Your process is set up to return the HTML code.
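To make the difference concrete, here is a small Python sketch (Python's re standing in for Java's regex engine). The markup string is hypothetical, written only to illustrate the kind of HTML a crawler might return, and html_to_text is an illustrative helper, not a RapidMiner function. In this assumed markup the ZIP code is preceded by a tag rather than by whitespace, so the expression finds nothing until the tags and entities are converted to plain text.

```python
import html
import re

PATTERN = re.compile(r"(.+\s)((D|d|DE)?\-?[6-7][0-9]{4}\s[A-Z][a-z]{1,})")

# Hypothetical markup; the real page source may look different.
raw = ("<p>VfB Stuttgart 1893 AG<br />"
       "Mercedesstra&szlig;e 109<br />70372 Stuttgart</p>")


def html_to_text(markup):
    """Illustrative helper: turn <br> into line breaks, drop other tags,
    and decode HTML entities such as &szlig;."""
    text = re.sub(r"<br\s*/?>", "\n", markup, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    return html.unescape(text)


# On the raw HTML no whitespace directly precedes the ZIP code: no match.
raw_match = PATTERN.search(raw)

# After conversion the street and the ZIP code sit on consecutive lines.
clean = html_to_text(raw)
clean_match = PATTERN.search(clean)

print(raw_match)
print(clean_match.group(1).strip(), "|", clean_match.group(2))
```

A failed match on the raw HTML would be consistent with the ? in the results, though the exact cause depends on the real page source. Stripping tags with regular expressions is itself fragile, so treat this as a diagnostic aid rather than a robust extraction method.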

This approach doesn't work well: you start on a few sites and tune your regexp to detect the addresses there. Then you encounter additional sites with a different format and tune the regexp further, after which it no longer works on the original site or gives you too many false hits.

This kind of processing is very hard. Google is trying to do it and even for them it sometimes fails if somebody was very creative when writing the address.

Your best bet is to find a structured listing. Maybe on Wikipedia? Wikidata? The DFB?

Regards,

Balázs

sgenzer · February 2018

Have you tried using the Data Search extension? Your example looks very similar to the one used by @ey in the tutorial.

Scott

ey1 · February 2018

Hi lukei_11,

Please try out the Read HTML Tables operator from the Web Table Extraction extension.

Best Wishes,

Edwin
