I want to group different items by their brand
I have a bunch of items and I want to group them by their brand. The item descriptions in the data I receive seem to concatenate the brand name and the item name, and the brand names vary in length. For example, in the picture, I would like all OREO items to be grouped together; instead, they are separated into different groups. Thank you!
Best Answers
FBT
Do you have a dictionary containing all possible brand names? If not, I believe your best choice would be to combine the ideas of previous responses (i.e. build some regex logic) to create such a dictionary on which you can then run your grouping. This does require some manual labour and can, depending on the amount of different brand names, take up a lot of time, but based on your input data structure there is just no way to directly make an aggregation on brand name.
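Once such a dictionary exists, the grouping itself could be sketched roughly like this in Python. This is only an illustration outside RapidMiner: the brand list and the sample item descriptions are made up, and `find_brand` / `group_by_brand` are hypothetical helper names, not part of any library.

```python
from collections import defaultdict

# Hypothetical brand dictionary; in practice this is the list you build by hand.
BRANDS = ["OREO", "VICTORIA'S SECRET", "THE PEGASUS GROUP"]


def find_brand(description, brands=BRANDS):
    """Return the longest known brand that the description starts with, or None."""
    text = description.strip().upper()
    # Try longer names first so a multi-word brand wins over a shorter prefix.
    for brand in sorted(brands, key=len, reverse=True):
        if text.startswith(brand):
            rest = text[len(brand):]
            # Only accept a match that ends on a word boundary.
            if rest == "" or rest[0].isspace():
                return brand
    return None


def group_by_brand(descriptions, brands=BRANDS):
    """Map each brand (or None for unknown) to the descriptions that start with it."""
    groups = defaultdict(list)
    for description in descriptions:
        groups[find_brand(description, brands)].append(description)
    return dict(groups)


# Made-up sample descriptions, for illustration only.
items = [
    "OREO CHOCOLATE CREME 137G",
    "OREO VANILLA 137G",
    "VICTORIA'S SECRET BODY MIST",
]
groups = group_by_brand(items)
```

Descriptions that match no dictionary entry end up under the `None` key, which gives a quick list of brands still missing from the dictionary.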
rfuentealba
There are tons of ways to accomplish that task.
1.- Create a CSV file with Excel and import it.
2.- Use the Data Editor from RapidMiner to create a new data object.
3.- Use the "Create ExampleSet" operator from the Operator Toolbox.
4.- Create a copy of your 200000 products and create something that groups by similarities.
Before going that way, do you mind analyzing your content with a hexadecimal editor first? Perhaps you can find a pattern that allows you to actually do the split. On Mac, you can use HexFiend, which is super easy to use.
As for me, the most difficult way to split strings I can think of (and the one I would never choose to explain to others but would probably choose for myself) is to split each string by commas and execute multiple orderings by word1, word1+word2, word1+word2+word3, until I can analyze each one in terms of depth (depth 1 would be OREO, depth 2 would be VICTORIA'S SECRET, depth 3 would be THE PEGASUS GROUP, and so on) and the number of products per depth. However, that is time-consuming and I would use Ruby for such a task. Please don't follow this. I'm just being creative and encouraging you to build your own solution, as preparing data doesn't have to be done in RapidMiner if you have other ways to do that.
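For the curious, the "depth" idea above could be sketched roughly as follows. This is an illustration in Python rather than Ruby, it assumes the words are separated by whitespace (not commas), and `prefix_counts`, `candidate_brands` and the sample data are hypothetical.

```python
from collections import Counter


def prefix_counts(descriptions, max_depth=3):
    """Count how many descriptions share each leading word sequence, per depth."""
    counts = Counter()
    for description in descriptions:
        words = description.split()
        for depth in range(1, min(max_depth, len(words)) + 1):
            # Depth 1 is word1, depth 2 is word1+word2, and so on.
            counts[(depth, " ".join(words[:depth]))] += 1
    return counts


def candidate_brands(descriptions, max_depth=3, min_items=2):
    """Return (depth, prefix, item count) for prefixes shared by enough items."""
    counts = prefix_counts(descriptions, max_depth)
    return sorted(
        (depth, prefix, n)
        for (depth, prefix), n in counts.items()
        if n >= min_items
    )


# Made-up sample descriptions, for illustration only.
items = ["OREO A", "OREO B", "THE PEGASUS GROUP X", "THE PEGASUS GROUP Y"]
candidates = candidate_brands(items)
```

The output still needs a human eye: "THE" and "THE PEGASUS" show up as candidates alongside "THE PEGASUS GROUP", so someone has to decide which depth is the real brand.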
All the best,
Rodrigo.
Answers
Hi @demonlovesong
Use the following:
(\S+)\s+.*
This regular expression means "capture anything that isn't whitespace (\S+) that comes before one or more whitespace characters (\s+), which in turn come before any other characters (.*)". That is why you use $1 as the replacement: you need the first (and only) captured string, the one before the \s whitespace.
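To try the pattern outside RapidMiner, a minimal Python sketch could look like this. Note that Python's `re` module writes the back-reference as `\1` instead of `$1`; the `first_word` helper and the sample strings are made up for illustration.

```python
import re

# Same pattern as above: first run of non-whitespace, then whitespace, then the rest.
FIRST_WORD = re.compile(r"(\S+)\s+.*")


def first_word(description):
    """Replace the whole match with capture group 1, i.e. keep only the first word."""
    return FIRST_WORD.sub(r"\1", description, count=1)


one_word_brand = first_word("OREO CHOCOLATE CREME 137G")
two_word_brand = first_word("VICTORIA'S SECRET BODY MIST")
```

Here `one_word_brand` is `"OREO"`, while `two_word_brand` is only `"VICTORIA'S"`, which shows that this pattern works for single-word brand names only.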
All the best,
Rodrigo.
Thank you for the solution, it is really helpful. But what if the brand names contain more than one word? I have 1913 items to extract the brand name from, and they are in random order. Is it achievable?
In your example it seems the brand is separated from the other content using a tab (or multiple spaces), can you confirm that?
If that's the case, it should be fairly straightforward. In the case of tabs, your regex needs to be adjusted as follows:
^(.*?)\t.*
Or, even easier: install the Operator Toolbox extension and use the 'Create ExampleSet' operator to copy your data and convert it to a dataset. The attached example gives an idea of how to do this.
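As a quick way to check the tab-based pattern on a few rows, here is a small Python sketch. It assumes the brand really is followed by a tab character; `brand_before_tab` and the sample strings are hypothetical.

```python
import re

# Lazy capture of everything up to the first tab, then the rest of the line.
BEFORE_TAB = re.compile(r"^(.*?)\t.*")


def brand_before_tab(description):
    """Return the text before the first tab, or None if the line has no tab."""
    match = BEFORE_TAB.match(description)
    return match.group(1) if match else None


with_tab = brand_before_tab("VICTORIA'S SECRET\tBODY MIST")
without_tab = brand_before_tab("OREO CHOCOLATE CREME 137G")
```

Rows that come back as `None` have no tab at all, which is a cheap way to confirm whether the data is actually tab-separated.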
That is not the case. There are brand names in the form shown in the attached picture, and this is what gives us a problem when creating the column. Is there any way to extract them correctly? Thank you!
Alright, I see, thank you very much.
By the way, if I want to create a dictionary of all the brand names, how would I do it with RapidMiner? Thank you!
I really appreciate your time. Thank you very much and have a nice day!