Remove or replace URL and RT from Twitter dataset

ikayunida123 · June 2018

Hello everyone!

I'm working on the data-cleaning phase of a text classification project that uses a Twitter dataset, but I can't work out how to replace (or remove) URLs, "RT" and the @ character. I've read some posts on the forum, but I didn't understand them :catsad:

For the URLs in the dataset, I want to replace anything starting with "https:" or "http:" with the word "link" (I don't know why the replacement can't be an empty value like " "). But when I ran my process with the Replace operator, "http://blablabla" did not become just "link"; it came out as "linkblablabla". Maybe it has something to do with the RegEx? :catsad: I know what RegEx is, but I don't know how to write or use it :catsad:
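A minimal sketch of why the result is "linkblablabla", written in Python so it can be run outside RapidMiner (the function names are hypothetical, and `\S+` is an assumption about what "the whole URL" should mean: everything up to the next whitespace). A pattern that matches only the scheme replaces only the scheme, so the rest of the URL is left in place.

```python
import re

def replace_scheme_only(text: str) -> str:
    # Matches just "http://" or "https://", so the rest of the URL survives.
    return re.sub(r"https?://", "link", text)

def replace_whole_url(text: str) -> str:
    # \S+ extends the match up to the next whitespace, covering the whole URL.
    return re.sub(r"https?://\S+", "link", text)

print(replace_scheme_only("http://blablabla"))  # linkblablabla
print(replace_whole_url("http://blablabla"))    # link
```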

I'm really confused right now. Please help me.

This is my RapidMiner process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="retrieve" compatibility="8.1.001" expanded="true" height="68" name="Retrieve Dataset Skripsi" width="90" x="45" y="34">
        <parameter key="repository_entry" value="Dataset Skripsi"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
        <parameter key="attribute_name" value="Label"/>
        <parameter key="target_role" value="label"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="8.1.001" expanded="true" height="103" name="Filter Examples" width="90" x="447" y="34">
        <parameter key="condition_class" value="no_missing_attributes"/>
        <list key="filters_list"/>
      </operator>
      <operator activated="true" class="remove_duplicates" compatibility="8.1.001" expanded="true" height="103" name="Remove Duplicates" width="90" x="581" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="Text"/>
      </operator>
      <operator activated="true" class="replace" compatibility="8.1.001" expanded="true" height="82" name="Replace" width="90" x="715" y="34">
        <parameter key="replace_what" value="(https://)"/>
        <parameter key="replace_by" value="link"/>
      </operator>
      <connect from_op="Retrieve Dataset Skripsi" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Remove Duplicates" to_port="example set input"/>
      <connect from_op="Remove Duplicates" from_port="example set output" to_op="Replace" to_port="example set input"/>
      <connect from_op="Replace" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

I need your help. Thank you!

rfuentealba · June 2018

Hi @ikayunida123

I found another one for your viewing pleasure (...or not):

(https?|http)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]

This one produces the following results:

Screen Shot 2018-06-07 at 01.20.12.png

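It's not simple to read, but it is indeed easy to understand: instead of using (.+) or (.*), the square brackets list exactly which characters may follow the ://. The first bracket can repeat any number of times; the second bracket must match the last character of the URL and leaves out punctuation such as . , ; : ! and ?, so a comma or full stop right after a URL is not swallowed.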
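As a rough illustration of how the bracketed pattern behaves, here is a small Python sketch (the sample tweet and the name `mask_urls` are made up for this example; RapidMiner itself uses Java regular expressions, which accept this pattern unchanged as far as I can tell).

```python
import re

# The pattern from this post, copied as-is.
URL = re.compile(r"(https?|http)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]")

def mask_urls(text: str) -> str:
    # Replace every URL found in the text with the word "link".
    return URL.sub("link", text)

# The comma after the URL is not part of the match, because the
# final bracket does not allow a URL to end in punctuation.
print(mask_urls("RT @user: check https://t.co/abc123, nice!"))
# RT @user: check link, nice!
```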

Hmmm... I've decided that it is not easy to understand either. I tried to explain, I swear. But hopefully this or the other ones I already shared with you might help you.

All the best,

rfuentealba · June 2018

Hi @ikayunida123

There are many things you can do with regular expressions. Don't worry, what you are asking is not trivial. First of all, you have to handle a feature of regular expressions named "groups", something that not everyone is able to explain (myself included).

Look at this regex: (http|https):\/\/([\w\s\d\.]+)(\/?)(.*)

It declares four groups. The first one, inside the first pair of parentheses, matches http or https; it must be followed by :// (each \ only escapes the / after it). The second group, also in parentheses, uses square brackets to list the allowed characters: word characters (\w, which already covers letters, digits and the underscore), digits (\d), whitespace (\s) and the dot (\.), so it reads everything up to the next slash. The third group reads an optional slash (that is why it ends with a question mark), and the fourth group reads the rest of the URL (.* matches everything up to the end of the line).
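To see what each of the four groups captures, here is a small Python sketch (Python is used only so this can be run standalone; `split_url` is a hypothetical helper name).

```python
import re

# The four-group regex from this post, copied as-is.
FOUR_GROUPS = re.compile(r"(http|https):\/\/([\w\s\d\.]+)(\/?)(.*)")

def split_url(url: str):
    # Return the four captured groups, or None if the text is not a URL.
    match = FOUR_GROUPS.match(url)
    return match.groups() if match else None

print(split_url("http://community.rapidminer.com/t5/forums/"))
# ('http', 'community.rapidminer.com', '/', 't5/forums/')
```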

With the following setting, you can convert http://community.rapidminer.com/t5/forums/ to link:community.rapidminer.com. This might not be what you want.

Screen Shot 2018-06-07 at 00.42.32.png

Another solution is to strip only the http:// or https:// part, so that https://community.rapidminer.com/t5/forums becomes link:community.rapidminer.com/t5/forums. This regex: (http|https):\/\/(.*) does the trick for you. It declares two groups: the first matches http or https, then comes the :// (which is not in a group), and the second captures all the rest. Then, in the replacement, you use link:$2, which replaces the first group and the :// with the word "link:" and keeps the second group.

Screen Shot 2018-06-07 at 00.49.11.png
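The same replacement can be tried in Python as a sketch. One difference to keep in mind: RapidMiner (Java) writes the back-reference as `$2`, while Python writes it as `\2`. The function name `strip_scheme` is hypothetical.

```python
import re

def strip_scheme(text: str) -> str:
    # RapidMiner's "link:$2" corresponds to r"link:\2" in Python.
    return re.sub(r"(http|https):\/\/(.*)", r"link:\2", text)

print(strip_scheme("https://community.rapidminer.com/t5/forums"))
# link:community.rapidminer.com/t5/forums
```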

Let's say you want just the host, with no protocol or path: only the DNS part. You can use this regex: (http|https):\/\/([\w\d\.\:]*)(\/?)(.*) and replace by just $2 (which makes reference to your second group), and your result will look like this:

Screen Shot 2018-06-07 at 00.52.06.png

Notice that here I added \: to the second group so you get the port too. What if you don't want the port? Well, just move \: to the third group and make it optional, so that your regular expression looks like this one: (http|https):\/\/([\w\d\.]*)(\:?|\/?)(.*). The port number then lands in the fourth group, which the replacement $2 throws away. Pretty easy, huh?
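A quick way to compare the two variants is to run both against a URL with a port. This is a Python sketch under two assumptions: the sample URL is invented, and Java's `$2` is written as `\2` here. `host_only` is a hypothetical helper.

```python
import re

# Both patterns are copied from this post.
WITH_PORT = r"(http|https):\/\/([\w\d\.\:]*)(\/?)(.*)"
WITHOUT_PORT = r"(http|https):\/\/([\w\d\.]*)(\:?|\/?)(.*)"

def host_only(url: str, pattern: str) -> str:
    # Keep only the second group (the host, with or without the port).
    return re.sub(pattern, r"\2", url)

url = "http://localhost:8080/app/index.html"
print(host_only(url, WITH_PORT))     # localhost:8080
print(host_only(url, WITHOUT_PORT))  # localhost
```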

No, not really. It takes a long time to master regular expressions, and you will probably wonder why anyone would care about all this magic and nonsense. The thing is that regular expressions are ugly, but if you do a lot of data preparation (as you should as a data scientist), it is much more difficult to parse things like URLs by reading the content from left to right and checking conditions by hand.

The simplest way to learn Regular Expressions is by practicing. You will find that these should be part of your toolkit. I always give my students this URL:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions

They freeze at first, but after three or four weeks, they become little sorcerers.

Just promise me one thing: if anyone tells you that they need the regular expression for an e-mail, don't try to do it by yourself and go to http://emailregex.com/ instead. Especially if they want a regex which works with Perl. The first time I saw it, I thought it was a compressed file, or much worse, a damaged one.

Hope this helps,

Rodrigo.

David_A · June 2018

Woah great solution and very detailed.

I took the liberty to re-use it to answer the same question on Stack Overflow.

ikayunida123 · June 2018

@rfuentealba Oh my god, thank you so much! It works nicely on my process :catvery-happy:

rfuentealba · June 2018

Glad it helped. However, I was reading my answer again and found that I made a mistake. Not a serious one unless you are parsing thousands of URLs (in that case, every saved flop counts):

https?://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]

This is the final regular expression you should use. The (https?|http) group at the start was redundant (like asking if it's http, or https, or http), because s? already means that the s may or may not be there.
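To check that dropping the group changes nothing, both versions can be compared on the same text. This is a Python sketch; the sample sentence and the name `same_matches` are invented for illustration.

```python
import re

TAIL = r"://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]"
REDUNDANT = re.compile(r"(https?|http)" + TAIL)   # earlier version
SIMPLIFIED = re.compile(r"https?" + TAIL)         # final version

def same_matches(text: str) -> bool:
    # Compare the full matches found by both patterns.
    old = [m.group(0) for m in REDUNDANT.finditer(text)]
    new = SIMPLIFIED.findall(text)
    return old == new

sample = "http://a.example/x and https://b.example/y."
print(SIMPLIFIED.findall(sample))
# ['http://a.example/x', 'https://b.example/y']
print(same_matches(sample))  # True
```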

Also, for future reference, I've found that in this implementation of regular expressions there is no need to escape the / character. That's a habit I picked up from using UNIX command-line tools such as vim or sed.

AmosGH · October 2019

I also tried (https|http)(.*) for my URL and it worked
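One thing worth checking with this shorter pattern: `.*` runs to the end of the line, so any text after the URL is replaced as well. That is fine when the URL is the last thing in the tweet, but perhaps not otherwise. A small Python sketch with a made-up sample and a hypothetical function name:

```python
import re

def replace_to_end_of_line(text: str) -> str:
    # (.*) is greedy: it consumes the URL and everything after it.
    return re.sub(r"(https|http)(.*)", "link", text)

print(replace_to_end_of_line("see https://t.co/x now"))
# see link
```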

kayman · October 2019

If you want a bit more 'readability' you could also replace a-zA-Z0-9_ with \w, which covers every letter, digit and underscore (a separate \d is not needed, since \w already includes the digits).
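A sketch of that shorthand in Python. Two assumptions are worth flagging: Python's `\w` also matches non-ASCII letters unless `re.ASCII` is set (Java's `\w`, which RapidMiner uses, is ASCII-only by default as far as I know), and the compact pattern below is my own rewrite of the one from this thread, not something posted here.

```python
import re

def matches_all(pattern: str, chars: str) -> bool:
    # True if every single character in chars is matched by the pattern.
    return all(re.fullmatch(pattern, c, flags=re.ASCII) for c in chars)

print(matches_all(r"\w", "azAZ09_"))            # True
print(matches_all(r"[a-zA-Z0-9_]", "azAZ09_"))  # True

# Hypothetical compact form: a-zA-Z0-9_ replaced by \w inside the brackets.
COMPACT = re.compile(r"https?://[-\w+&@#/%?=~|!:,.;]*[-\w+&@#/%=~|]", re.ASCII)
print(COMPACT.sub("link", "check https://t.co/abc123, nice!"))
# check link, nice!
```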
