Grouping profile strings that contain the same words in a different order (Python)
I have a data frame containing a column of profile types, which looks like this:
    0    Android Java
    1    Software Development Developer
    2    Full-stack Developer
    3    JavaScript Frontend Design
    4    Android iOS JavaScript
    5    Ruby JavaScript PHP

I've used NLP to fuzzy match similar profiles, which returned the following similarity dataframe:

        left_side                   right_side                   similarity
    7   JavaScript Frontend Design  Design JavaScript Frontend   0.849943
    8   JavaScript Frontend Design  Frontend Design JavaScript   0.814599
    9   JavaScript Frontend Design  JavaScript Frontend          0.808010
    10  JavaScript Frontend Design  Frontend JavaScript Design   0.802881
    12  Android iOS JavaScript      Android iOS Java             0.925126
    15  Machine Learning Engineer   Machine Learning Developer   0.839165
    21  Android Developer           Developer Android Developer  0.872646
    25  Design Marketing Testing    Design Marketing             0.817195
    28  Quality Assurance           Quality Assurance Developer  0.948010

While this has helped, taking me from 478 unique profiles to 461, what I want to focus on are profiles like this:

    Frontend Design JavaScript    Design Frontend JavaScript

The only tool I've seen that seems to address this problem is difflib. My question is: what other techniques are available to standardize profiles that consist of the same words in a different order into one standard string? The desired output would take any string containing "Design", "Frontend" and "JavaScript" and replace it with "Design Frontend JavaScript".

Right now, I'm merging my original dataframe with the similarity dataframe to replace all occurrences of a profile string on the right_side with the one on the left_side. That also means I'm replacing the right_side below ("Java Python Data Science") with the left_side below ("JavaScript Python Data Science"), even though Java and JavaScript are different skills.

    53  JavaScript Python Data Science  Java Python Data Science

Any help would be greatly appreciated!
Answers
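One possible approach, sketched here under a few assumptions, is to skip fuzzy matching for this particular case and build a canonical key for each profile by splitting it on whitespace, sorting the words, and joining them again. Profiles made of the same words then collapse to the same string, while "Java ..." and "JavaScript ..." stay distinct because their word sets differ. The function names `canonical_profile` and `group_by_words` are made up for this illustration, and the sample values are copied from the question. Note that this treats every whitespace-separated token as a word, so a multi-word term such as "Data Science" is split and sorted as two words.

```python
from collections import defaultdict


def canonical_profile(profile: str) -> str:
    """Return the words of a profile sorted alphabetically and re-joined.

    Sorting is case-insensitive first (with the original spelling as a
    tie-breaker), so the result does not depend on the input word order.
    Repeated words are kept, not de-duplicated.
    """
    words = profile.split()
    return " ".join(sorted(words, key=lambda w: (w.casefold(), w)))


def group_by_words(profiles):
    """Map each canonical string to the original spellings that produce it."""
    groups = defaultdict(list)
    for profile in profiles:
        groups[canonical_profile(profile)].append(profile)
    return dict(groups)


# Sample values taken from the question.
profiles = [
    "JavaScript Frontend Design",
    "Design JavaScript Frontend",
    "Frontend Design JavaScript",
    "Frontend JavaScript Design",
    "JavaScript Python Data Science",
    "Java Python Data Science",
]

groups = group_by_words(profiles)
for canonical, originals in groups.items():
    print(canonical, "<-", originals)
```

With pandas, the same function could presumably be applied to the profile column with something like `df["profile"].map(canonical_profile)`, where `profile` is a placeholder for whatever the column is actually called. Fuzzy matching would then only be needed for the remaining near-duplicates, such as a profile with one word missing.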