Is there any way to find common values in records?
Would you Please guide me to convert my CSV data file:
1,A,C,Z,F,G
2,G,Q,R,C,
3,Z,G,Q,
4,C,F,
5,O,P,
6,O,X,Y,J,
7,A,P,X,
I have this table with ~1,000,000 records like these 7 records that you see (**In real Database A,B,C,... are words in string**), Records 1 and 2 are common in G and C value and 2,3 and 1,3 and ...
I want to sync records if they have at least two common value like Records 1 & 2,3,4 (but record 5,6,7 haven't at least 2 shared values with others) and generate a list like this:
1 A C Z F G Q R
2 G Q R C A Z F
3 Z G Q A C F R
4 C F A Z G Q R
5 O P
6 O X Y J
7 A P X
at the end we must have 4 same records if we sort data and one others without sync:
1 A C F G Q R Z
2 A C F G Q R Z
3 A C F G Q R Z
4 A C F G Q R Z
5 O P
6 J O X Y
7 A P X
Maybe I do not use good term for my meaning, please see:
1 A C Z F G
2 G Q R C
record 1 has C and G common with Record 2 now 1 has not R and Q thus we must have 1 A C Z F G + Q and R and Record 2 has not A,Z and F thus we must have: 2 G Q R C + A,Z and F thus at the end we have:
1 A C Z F G Q R
2 G Q R C A Z F
I need all records Respectively in the queue from top to bottom.
for other example: record 4:
1,A,C,Z,F,G
2,G,Q,R,C,
3,Z,G,Q,
4,C,F,
5,O,P,
6,O,X,Y,J,
7,A,P,X,
at first we have :
1 A C Z F G Q R
2 G Q R C A Z F
3 Z G Q A C F R
4,C,F,
5,O,P,
6,O,X,Y,J,
7,A,P,X,
Now at 4 we have C and F common with 1
thus
we generate:
4 C F + a z g q r
at first I wrote a delphi code but it is so slow. second Someone suggest me this groovy code:
def f=[:]
new File('Data.csv').readLines().each{
def items=it.split(',')
def name
items.eachWithIndex { String entry, int i ->
if(i==0){
name=entry
}
else if(entry){
if(!f[entry])
f[entry]=[]
f[entry]<<name
}
}
}
f.findAll {it.value.size()>1}
It is very fast (because of using a map file I think), but It only finds the common values.
Now I am here,
Please give me any help or suggestion.
Thank you
Answers
Hello, @smmsamm
The best way to do such a thing, considering that A, B, C... etc, are strings, is by using a database, especially a SQL database. I assume that 1, 2, 3, ..., 7 are sentences, hence:
Why would anyone care about using a database for these things? Piece of cake: because you are trying to find a relation between words and sentences, and there is nothing better to process than a relational database. Once you get your data sorted, you have many choices to find what groups can be together, e.g.:You can also use SQL superpowers, create subqueries to filter pairs of sentences that appear more than once when using words as counters. Since you have a million records, either you aren't on the free tier of RapidMiner or you are still unable to find the right bazooka to address your problem. Not that this is an issue to answer but it will be if you want to apply RapidMiner to this problem with your license.
All the best,
Rodrigo.