Remove duplicate rows including rearrangements across columns

dataguy · 2017 08

I have generated several "optimized" Fantasy Football Challenge lineups.

I am trying to figure out how to remove duplicate rows where the generated lineup differs only by rearrangements within same-position columns.

My goal is to remove any row that is a duplicate, whether or not the columns match exactly (the same set of players is the same lineup regardless of order).

QB	RB	RB	WR	WR	WR	TE	FLEX	DST
Tom Brady (9864209)	Carlos Hyde (9864280)	Alex Collins (9864148)	Demaryius Thomas (9864449)	Cordarrelle Patterson (9864841)	Michael Thomas (9864513)	Austin Seferian-Jenkins (9864619)	Todd Gurley (9864760)	Los Angeles Chargers (9864756)
Tom Brady (9864209)	Carlos Hyde (9864280)	Alex Collins (9864148)	Michael Thomas (9864513)	Emmanuel Sanders (9864451)	Seth Roberts (9864845)	Austin Seferian-Jenkins (9864619)	Todd Gurley (9864760)	Los Angeles Chargers (9864756)
Tom Brady (9864209)	Carlos Hyde (9864280)	Kareem Hunt (9864565)	Seth Roberts (9864845)	Demaryius Thomas (9864449)	Tyrell Williams (9864740)	Austin Seferian-Jenkins (9864619)	Todd Gurley (9864760)	Los Angeles Chargers (9864756)
Tom Brady (9864209)	Carlos Hyde (9864280)	Kenyan Drake (9864473)	Emmanuel Sanders (9864451)	Cordarrelle Patterson (9864841)	Michael Thomas (9864513)	Austin Seferian-Jenkins (9864619)	Todd Gurley (9864760)	Los Angeles Chargers (9864756)
Tom Brady (9864209)	Carlos Hyde (9864280)	Kenyan Drake (9864473)	Jermaine Kearse (9864605)	Demaryius Thomas (9864449)	Michael Thomas (9864513)	Austin Seferian-Jenkins (9864619)	Kareem Hunt (9864565)	Los Angeles Chargers (9864756)
Tom Brady (9864209)	Kareem Hunt (9864565)	Carlos Hyde (9864280)	Michael Thomas (9864513)	Demaryius Thomas (9864449)	Tyrell Williams (9864740)	Austin Seferian-Jenkins (9864619)	Marshawn Lynch (9864829)	Los Angeles Chargers (9864756)
Tom Brady (9864209)	Kenyan Drake (9864473)	Carlos Hyde (9864280)	Tyrell Williams (9864740)	Demaryius Thomas (9864449)	Michael Thomas (9864513)	Austin Seferian-Jenkins (9864619)	Kareem Hunt (9864565)	Los Angeles Chargers (9864756)
Tom Brady (9864209)	Leonard Fournette (9864406)	Alex Collins (9864148)	Demaryius Thomas (9864449)	Seth Roberts (9864845)	Michael Thomas (9864513)	Austin Seferian-Jenkins (9864619)	Carlos Hyde (9864280)	Los Angeles Chargers (9864756)
Tom Brady (9864209)	Carlos Hyde (9864280)	Leonard Fournette (9864406)	Seth Roberts (9864845)	Emmanuel Sanders (9864451)	Michael Thomas (9864513)	Austin Seferian-Jenkins (9864619)	Kenyan Drake (9864473)	Los Angeles Chargers (9864756)
Tom Brady (9864209)	Leonard Fournette (9864406)	Carlos Hyde (9864280)	Tyrell Williams (9864740)	Michael Thomas (9864513)	Emmanuel Sanders (9864451)	Austin Seferian-Jenkins (9864619)	Marshawn Lynch (9864829)	Los Angeles Chargers (9864756)
Tom Brady (9864209)	Alex Collins (9864148)	Carlos Hyde (9864280)	Demaryius Thomas (9864449)	Cordarrelle Patterson (9864841)	Michael Thomas (9864513)	Austin Seferian-Jenkins (9864619)	Todd Gurley (9864760)	Los Angeles Chargers (9864756)

Afterward, I noticed that most of the generated lineups are the same players, simply rearranged across columns (see the Carlos Hyde and Kenyan Drake RB columns above).

To make things worse, any player can be put into the FLEX position column.

My hope is to stay with simple per-cell comparison logic so that I can apply a duplicate remover using a per-row rule such as this:

(($A2=$A3)or($A2=$H3))&
(($B2=$B3)or($B2=$C3)or($B2=$H3))&
(($C2=$B3)or($C2=$C3)or($C2=$H3))&
(($D2=$D3)or($D2=$E3)or($D2=$F3)or($D2=$H3))&
(($E2=$D3)or($E2=$E3)or($E2=$F3)or($E2=$H3))&
(($F2=$D3)or($F2=$E3)or($F2=$F3)or($F2=$H3))&
(($G2=$G3)or($G2=$H3))&
(($H2=$G3)or($H2=$H3))&
($I2=$I3)

The above matching would provide a brute-force, per-cell check for matches within the same player position category (QB, RB, WR, TE, FLEX, DST), but I hope someone knows of a better solution.
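For illustration only, here is a minimal sketch of the order-independent comparison that the rule above is aiming at, written in Python rather than as a RapidMiner process. It assumes each row is a list of the nine cell strings and that two rows with the same cells in any order count as the same lineup, as stated in the goal. The function names `lineup_key` and `drop_rearranged_duplicates` are hypothetical.

```python
def lineup_key(row):
    """Order-independent key: the sorted tuple of the row's cells."""
    return tuple(sorted(cell.strip() for cell in row))


def drop_rearranged_duplicates(rows):
    """Keep the first occurrence of each lineup, ignoring column order."""
    seen = set()
    unique = []
    for row in rows:
        key = lineup_key(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# First, second and last lineups from the table above.
first = [
    "Tom Brady (9864209)", "Carlos Hyde (9864280)", "Alex Collins (9864148)",
    "Demaryius Thomas (9864449)", "Cordarrelle Patterson (9864841)",
    "Michael Thomas (9864513)", "Austin Seferian-Jenkins (9864619)",
    "Todd Gurley (9864760)", "Los Angeles Chargers (9864756)",
]
second = [
    "Tom Brady (9864209)", "Carlos Hyde (9864280)", "Alex Collins (9864148)",
    "Michael Thomas (9864513)", "Emmanuel Sanders (9864451)",
    "Seth Roberts (9864845)", "Austin Seferian-Jenkins (9864619)",
    "Todd Gurley (9864760)", "Los Angeles Chargers (9864756)",
]
# The last row of the table is the first row with the two RB columns swapped.
last = [first[0], first[2], first[1]] + first[3:]

unique_rows = drop_rearranged_duplicates([first, second, last])
print(len(unique_rows))  # 2: the swapped-RB row is dropped
```

Sorting the cells replaces the pairwise `or` checks with a single comparison per row, at the cost of ignoring which column a player sat in.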

Any RapidMiner guidance would be appreciated.

Thanks in advance...

Telcontar120 · 2017 10

There may be more elegant ways of doing this, but certainly one way that would work would be as follows:

1. Assign each player in each position a unique prime number (the same player will have a different number in a different position). Add that to your dataset as "player number" or something similar.
2. Create a synthetic id for each "lineup" that is the product of all the player numbers in the lineup (Generate Attributes would do this). Thus, any two teams that have the same players/positions combination regardless of order will share the same id.
3. Dedupe by this id.
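The steps above can be sketched roughly as follows, in Python rather than RapidMiner operators. One assumption differs from the description: this sketch gives each player a single prime keyed on the ID in parentheses, regardless of position, so that a player moved into the FLEX column still produces the same product. All function names here are hypothetical.

```python
import re
from itertools import count


def primes():
    """Yield 2, 3, 5, ... by trial division (adequate for a small player pool)."""
    found = []
    for n in count(2):
        if all(n % p for p in found):
            found.append(n)
            yield n


def player_id(cell):
    """Extract the ID in parentheses, e.g. 'Tom Brady (9864209)' -> '9864209'."""
    return re.search(r"\((\d+)\)", cell).group(1)


def assign_primes(rows):
    """Map each distinct player ID to its own prime (assumption: one per player)."""
    gen = primes()
    table = {}
    for row in rows:
        for cell in row:
            pid = player_id(cell)
            if pid not in table:
                table[pid] = next(gen)
    return table


def lineup_id(row, table):
    """Synthetic lineup id: the product of the primes of all players in the row."""
    product = 1
    for cell in row:
        product *= table[player_id(cell)]
    return product


first = [
    "Tom Brady (9864209)", "Carlos Hyde (9864280)", "Alex Collins (9864148)",
    "Demaryius Thomas (9864449)", "Cordarrelle Patterson (9864841)",
    "Michael Thomas (9864513)", "Austin Seferian-Jenkins (9864619)",
    "Todd Gurley (9864760)", "Los Angeles Chargers (9864756)",
]
# Same lineup as the last row of the table: the two RB columns swapped.
swapped = [first[0], first[2], first[1]] + first[3:]

table = assign_primes([first, swapped])
print(lineup_id(first, table) == lineup_id(swapped, table))  # True
```

Because prime factorizations are unique, two rows share an id only when they contain exactly the same players. Python integers do not overflow; in a tool with fixed-width numbers the product of many primes may need checking for overflow or loss of precision.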

I hope this is helpful.
