
"Split Text Case - RegEx dropping letters"

mob Member Posts: 37 Contributor II
edited June 2019 in Help
I have some text that comes in different cases: sometimes tokens look like TextendNewtext, other times like textendNewtext. I found this regex online for Python:
([A-Z])([A-Z])([a-z])|([a-z])([A-Z])

but when I apply it to my dataset using the Replace Tokens operator, with ([A-Z])([A-Z])([a-z])|([a-z])([A-Z]) replaced by $1 $2, I get
Texten ewtext

I'm far from an expert in regex in any flavour, but can anyone help me resolve this?

Answers

  • mob Member Posts: 37 Contributor II
    For reference, this is the StackOverflow post where I got the regex: http://stackoverflow.com/questions/15369566/putting-space-in-camel-case-string-using-regular-expression

    Is there a way in RapidMiner to handle CamelCaseTextOfVariousLengths and split it into tokens?
  • MartinLiebig Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,533 RM Data Scientist
    Hi mob,

    Why not simply use Replace and replace each capital letter with whitespace followed by that letter? It seems to work.

    ~Martin

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="7.0.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="7.0.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="generate_data_user_specification" compatibility="7.0.000" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
            <list key="attribute_values">
              <parameter key="text" value="&quot;CamelCaseTextOfVariousLengths&quot;"/>
            </list>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="nominal_to_text" compatibility="7.0.000" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34"/>
          <operator activated="true" class="replace" compatibility="7.0.000" expanded="true" height="82" name="Replace" width="90" x="313" y="34">
            <parameter key="replace_what" value="([A-Z])"/>
            <parameter key="replace_by" value=" $1"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="6.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="6.5.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="34"/>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Generate Data by User Specification" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    The problem with your replacement is the following:

    In the regular expression
    ([A-Z])([A-Z])([a-z])|([a-z])([A-Z])
    all the parentheses are numbered. You're trying to replace by the value coming from the first and second parentheses, which would be two capital letters if your text matched that. It doesn't, so it goes to the second (alternate) match after the pipe symbol. But the contents of those parentheses are not in the replacement string. They would be $4 and $5.

    Martin's approach seems to be what you want.
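
The group numbering described above can be reproduced outside RapidMiner. The sketch below uses Python's `re` module as a stand-in, which is an assumption: the RapidMiner operators use `$1`-style references, written `\1` in Python, and Python 3.5+ substitutes an empty string for a group that did not take part in the match. The `\1\4 \2\3\5` replacement is only one illustrative way to cover both alternatives; it does not come from the thread.

```python
import re

# The pattern from the question: groups 1-3 belong to the first
# alternative, groups 4-5 to the second one after the pipe.
PATTERN = re.compile(r"([A-Z])([A-Z])([a-z])|([a-z])([A-Z])")

text = "TextendNewtext"

# "dN" matches the second alternative, so groups 1 and 2 are empty
# and the two matched letters are replaced by a lone space.
dropped = PATTERN.sub(r"\1 \2", text)

# Referencing the groups of both alternatives keeps every letter:
# whichever alternative did not match contributes empty strings.
kept = PATTERN.sub(r"\1\4 \2\3\5", text)

print(dropped)  # Texten ewtext
print(kept)     # Textend Newtext
```

The first result reproduces the dropped letters from the question; the second keeps them because the replacement refers to the groups that actually matched.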
  • mob Member Posts: 37 Contributor II
    Thanks, Martin and Balázs. Martin's simple solution did exactly what I needed and also handled tokens like notcamelCase.
  • AndreasS Member Posts: 4 Contributor I

    Hi Martin, hi everybody

    I am facing the same problem as mob. Unfortunately I couldn't solve it using your comment from 02-01-2016.

    Problem:

    I want to separate the following text: "PleaseSeparateMeByCapitalLetters" into "Please Separate Me By Capital Letters"

    I tried to use the Replace Tokens operator

    - replace what: [A-Z]
    - replace by: $1

    However the result is " lease eparate e y apital etters".

    Thanks in advance for your help

    Andreas

  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    The $1 refers to a "capture expression". You define capture expressions with (). If that's not in the replace what part, then $1 will be empty.
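
A minimal Python sketch of the same point, assuming Python's `re` module is close enough to RapidMiner's regex engine for illustration (`\1` plays the role of `$1`). One difference to be aware of: unlike the RapidMiner result quoted above, Python rejects a reference to a group that does not exist instead of inserting an empty string.

```python
import re

text = "PleaseSeparateMeByCapitalLetters"

# No parentheses: there is no group 1 to refer to.
# Python raises an error here rather than substituting nothing.
try:
    re.sub(r"[A-Z]", r" \1", text)
    missing_group_failed = False
except re.error:
    missing_group_failed = True

# With a capture group, \1 holds the matched capital letter.
spaced = re.sub(r"([A-Z])", r" \1", text)

print(missing_group_failed)  # True
print(repr(spaced))          # ' Please Separate Me By Capital Letters'
```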

  • kayman Member Posts: 662 Unicorn

    Try this: ([A-Z])(.), replaced by " $1$2" (make sure there is a space before $1). It also adds a space before the first word, but you can remove that one again with a second regex or by trimming the string.

    So like this:


    <parameter key="replace_what" value="([A-Z])(.)"/>
    <parameter key="replace_by" value=" $1$2"/>

     

    Probably not the sexiest solution, but plain and simple sometimes does the trick too.
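
The clean-up step mentioned above (a second regex or a trim) could look like this in Python, again only as a stand-in for the RapidMiner operators. The lookaround variant at the end is an alternative that is not from the thread: it inserts the space only between a lowercase and an uppercase letter, so no leading space appears in the first place.

```python
import re

text = "PleaseSeparateMeByCapitalLetters"

# kayman's pattern: a capital letter plus the character after it,
# re-emitted with a leading space.
spaced = re.sub(r"([A-Z])(.)", r" \1\2", text)

# Option 1: remove the leading space with a second regex.
via_regex = re.sub(r"^\s+", "", spaced)

# Option 2: trim the string.
via_trim = spaced.strip()

# Alternative (not from the thread): zero-width lookarounds match the
# position between a lowercase and an uppercase letter.
via_lookaround = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)

print(via_trim)  # Please Separate Me By Capital Letters
```

Note that `([A-Z])(.)` needs a character after the capital letter, so a capital at the very end of a token would not get a space in front of it.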

  • AndreasS Member Posts: 4 Contributor I

    Thank you !!!
