Combining Data Sets

ari_31321 · October 2020

Hi, I am new at rapidminer and i am wondering if there is a solution to the following question

Is there a possibility to match the following Data Set1 and Set2 with the logic, that the Needed Data Set can be obtained?
- "?" in Set2 should be replaced by the corresponding values of Set1
- if there is a unambiguous assignment such as in row 2, only the corresponding value should be added
- if there is a ambiguous assignment such as in row 1 and 3, all possible values should be added

Data Set1:
Attr_A Attr_B Attr_C
A_111 B_123 C_999
A_111 B_234 C_999
A_111 B_345 C_222
A_222 B_456 C_111
A_222 B_567 C_222

Data Set2:
Attr_1 Attr_A Attr_B Attr_C
... A_111 ? ?
...   A_111 B_234 ?
...   A_222 ? ?

Data Set Needed:
Attr_1 Attr_A Attr_B Attr_C
...    A_111    B_123, B_234, B_345   C_999, C_222
... A_111      B_234 C_999
... A_222 B_456, B_567 C_111, C_222

MartinLiebig · October 2020

Hi,

I think what you want is first Append Data_set1 and Data_set 2 with an Append (SuperSet) operator. Afterwards you use Aggregate and Concat on Attr_b and Attr_C?

Best,

Martin

ari_31321 · October 2020

Hi, thanks for the quick answer!

Data Set1 is some kind of a "Relation Table", where the relations between article numbers are shown
(e.g. A_111 is related to B_123 and C_999, but there is also a relation between A_111 , B_234, C_999 and so on).

Therefore Data Set1 won't have the same size as Data Set2. The Append Operator won't work in this case.
Maybe I am wrong with the joins operator group in general for this task. But so far I have no solution how Data Set2 could "make use" of Data Set1 to fill the missing cells in respect of the relation on the article numbers as they are shown in Data Set1.

Thanks in advance for help!

MartinLiebig · October 2020

Hi,

that is why i would recommend the operator Append SuperSet, which also works if one example set is the superset of the other. It is part of Operator Toolbox extension.

Best,

Martin

ari_31321 · October 2020

Thanks for explaining again. I haven't had that extension installed yet.
Unfortunatelly your suggested solution does not work in my case.

I try again to explain my question again, maybe that helps:

Data Set2 should use Data Set1 (as a relation table) to fill the missing values in Data Set2/ or create the new Data Set Needed.
That means e.g. that Data Set2 should "look up" in Data Set1 and recognize that value A_111 is related to B_123, B_234, B_345. So the missing value in Attr. B of the Needed Data Set should be "B_123, B_234, B_345". (The same question with Att C for the Needed Data Set).
As a further aspect Data Set2 should also "understand" if it contains already the values A_111 and B_234, that those two values in combination have only a relation with C_999 (see relation table/ Data Set1).

I hope the explanation helps to understand my objective. Sorry for the long explanation.

Thanks again!

Telcontar120 · October 2020

This is a multiple step operation. First you are going to have to aggregate to get all the values of B and C associated with a given value of A. You can then concatenate these into a single cell. You can then use that to join back with the original dataset with all the missings.

Of course, you may want to take a step back and ask whether you actually need this data structure, or whether another structure would be suitable. If you are doing anything with machine learning, this is going to be an awkward structure to use. join based on attribute A. You have a couple of alternatives:

you could generate a dataset with only one row for each unique value of A and all associated values of B and C (this is actually easier to generate)
you could generate a dataset with each unique combination of A, B, and C

Generally speaking having multiple distinct values in a single db cell is harder for computers to deal with.
Also, you said that Martin's approach would not work, but I am not sure why you said that, did you actually try it out?

ari_31321 · October 2020

Thanks for explaining again in more detail! I could find a way to solve it using your further explication.

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

Combining Data Sets

Answers