Music Lyrics Analyzer: how to handle repeated lyrics?

mt_12345 Member Posts: 4

Learner III

2018 22 edited 2018 01 in Help

Hey guys,

I'm currently working on an automatic Music Lyrics Analyzer (MLA). The MLA uses text analytics methods on an established platform to analyze the vocabulary used in song lyrics of different artists and genres, and to build clusters of songs based on their lyrics. In many songs, some sections of the lyrics are repeated, which is indicated by the string "x2".

In my opinion, I have to account for those repetitions to avoid skewing the classification model's results. Do you agree? If so, how should I handle this? Which operators should I choose?

Many thanks for your help! Have a good day!

Answers

sgenzer Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

2018 22

Hmm, I'm not really sure whether you should be weighting the repetitions, but if you use tokenization and TF-IDF, the repetitions will be weighted accordingly anyway.

Scott

mt_12345 Member Posts: 4 Learner III

2018 23

Thanks a lot for your answer. I will try it out!

Cheers

mt_12345 Member Posts: 4 Learner III

2018 23

Just to make sure that everyone understands my question: the repetitions are only indicated by the string x2; the text itself is not included twice in the lyrics. So we have to transform the text so that it really appears twice, right? Any ideas how we can do this?

I think what Scott suggested is what comes one step later.
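To illustrate why the expansion has to come before tokenization, here is a minimal Python sketch (outside RapidMiner, with a made-up lyric line and a hypothetical `term_counts` helper). It only computes raw term counts with a naive whitespace tokenizer, not full TF-IDF, but it shows that the marker itself becomes a token while the repeated words are not counted twice.

```python
from collections import Counter

def term_counts(text):
    # Naive lowercase whitespace tokenization; an assumption for this sketch,
    # a real pipeline would use a proper tokenizer.
    return Counter(text.lower().split())

raw = "la la love x2"                 # marker only, text not repeated
expanded = "la la love la la love"    # same line with the repetition written out

raw_counts = term_counts(raw)
expanded_counts = term_counts(expanded)

print(raw_counts)       # "x2" is counted as a word, "la" only twice
print(expanded_counts)  # "la" four times, "love" twice, no "x2" token
```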

Thanks

David_A Administrator, Moderator, Employee-RapidMiner, RMResearcher, Member Posts: 297 RM Research

2018 23
Hi,

it depends a bit on how the lyrics are returned, e.g. one token per line or per stanza. If that is the case, you can play with regular expressions and the Replace operator.

Perhaps a bit cumbersome, but something like this should do the trick:

Replace what: (.+) x2

Replace with: $1 $1

Then you can repeat that pattern for x3, x4, ...
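Instead of one rule per multiplier, the same idea can be generalized by capturing the number as well. The following is only a sketch in Python rather than a RapidMiner process: `expand_repeats` is a hypothetical helper, and it assumes the marker always appears as " xN" at the end of a line. Note that Python writes backreferences as `\1` where the Replace operator uses `$1`.

```python
import re

# Assumed marker format: text, one space, "x", a number, end of line.
MARKER = re.compile(r"^(.+?) x(\d+)$", re.MULTILINE)

def expand_repeats(lyrics):
    """Repeat each line that ends in ' xN' N times and drop the marker."""
    def repeat(match):
        text = match.group(1)
        count = int(match.group(2))
        return " ".join([text] * count)
    # Lines without a marker do not match and are left unchanged.
    return MARKER.sub(repeat, lyrics)

print(expand_repeats("first line\nsing along x3\nlast line"))
```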

Hope this helps.
kayman Member Posts: 662 Unicorn

2018 24

Regular expressions are probably the best approach here indeed, but the quality will depend on your original data. The one given by David already works to some extent, but since it's greedy it can capture too much if a line contains more than one x2. If your structure is as follows (so with line breaks):

some sentence

another sentence x2

yet again another sentence

and some other x2

The regular expression that will work best in that case is (?m)^(.*?) x2$

Roughly translated, this means: for each line, start at the beginning and group everything up to the x2 that ends the line.

Replacing it with $1 $1 then gives you the same string twice. If there is no x2 in the string/line, it simply keeps the original.

If everything is on one line, (.*?) x2 will do fine as well, but make sure you use the question mark if x2 occurs more than once in your string. This ensures the capture stops as soon as it finds an x2; otherwise it will take everything up to the last x2.

Note that if your x2 is in parentheses, the pattern becomes (.*?) \(x2\)
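For anyone who wants to check the difference between the greedy and the lazy version outside RapidMiner, here is a small Python sketch with made-up sample strings. Python's `re` module writes the backreference as `\1` instead of `$1`; the patterns themselves are the ones from this thread.

```python
import re

one_line = "first part x2 second part x2"

# Greedy: the group runs up to the LAST x2, so the first marker stays in the text.
greedy = re.sub(r"(.+) x2", r"\1 \1", one_line)

# Lazy: the group stops at the first x2 it can find, so each marker is handled.
lazy = re.sub(r"(.*?) x2", r"\1 \1", one_line)

# Line-anchored version for lyrics with line breaks.
text = "some sentence\nanother sentence x2\nyet again another sentence\nand some other x2"
per_line = re.sub(r"(?m)^(.*?) x2$", r"\1 \1", text)

# Marker written in parentheses.
parenthesized = re.sub(r"(.*?) \(x2\)", r"\1 \1", "hey now (x2)")

for result in (greedy, lazy, per_line, parenthesized):
    print(repr(result))
```

In the one-line lazy case the second group also captures the space in front of "second part", so the output contains a double space; that is harmless once the text is tokenized.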

mt_12345 Member Posts: 4 Learner III

2018 25

Thanks a lot, guys! I need to try it out to see if the results are satisfactory.
