"Market basket analysis using fpgrowth"
Hi
I tried using FPGrowth for market basket analysis on a small data sample.
Correct me if I am wrong, but I do not seem to get accurate support values. By support I understand the number of orders that contain a given set of items, divided by the total number of orders.
For item 1 = "id_200-5745" and item 2 = "id_202-6176" the reported support is 0.9.
But only 1 order out of 19 contains these two items together, so should the support not be 0.05 (= 1/19)?
The data set is already in the format required for FPGrowth: each row represents one order and each column represents one item. A column value of 'true' indicates that the item is in the order, and 'false' indicates that it is not.
Data set:
id_200-8216 id_204-6359 id_202-4110 id_204-4431 id_203-6751 id_205-3148 id_204-7961 id_203-8852 id_204-3920 id_203-2779 id_203-7381 id_200-2303 id_100-5163 id_200-3492 id_203-0515 id_202-3716 id_202-8819 id_201-2956 id_203-1769 id_100-6580 id_203-7222 id_204-4623 id_200-0748 id_204-6247 id_204-8905 id_200-5745 id_204-1463 id_202-6176 id_204-7074 id_202-5694
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
Output is:
Size Support Item 1 Item 2
1 0.95 id_200-5745
1 0.9 id_202-6176
2 0.9 id_200-5745 id_202-6176
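For comparison, the support the post expects can be recomputed by hand from the posted table. The following is only a minimal sketch in Python, not part of the RapidMiner process: `ORDERS` is a transcription of the 19 rows above (each set holds the items marked TRUE in one row), and the `support` helper is a hypothetical name introduced here for illustration.

```python
# Orders transcribed from the posted table: one set of TRUE items per row.
ORDERS = [
    {"id_200-5745", "id_202-6176"},
    {"id_203-2779"},
    {"id_202-5694"},
    {"id_203-8852", "id_204-3920", "id_204-6247"},
    {"id_202-4110"},
    {"id_205-3148", "id_204-7961"},
    {"id_100-6580"},
    {"id_200-3492"},
    {"id_201-2956"},
    {"id_204-7074"},
    {"id_200-2303"},
    {"id_200-8216"},
    {"id_204-4431"},
    {"id_202-6176"},
    {"id_203-7381"},
    {"id_204-6359", "id_100-5163", "id_203-0515",
     "id_203-7222", "id_204-4623", "id_204-8905"},
    {"id_202-8819"},
    {"id_200-0748"},
    {"id_203-6751", "id_202-3716", "id_203-1769", "id_204-1463"},
]


def support(itemset, orders):
    """Fraction of orders that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= order for order in orders) / len(orders)


pair = {"id_200-5745", "id_202-6176"}
print(len(ORDERS), round(support(pair, ORDERS), 3))  # 19 orders, support 1/19
```

Under this definition the pair is contained in exactly one of the 19 orders, which gives the 1/19 (about 0.05) mentioned in the question rather than the reported 0.9.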
XML code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExcelExampleSource (2)" class="ExcelExampleSource" breakpoints="after">
        <parameter key="excel_file" value="C:\market basket\fpgrowth_data2.xls"/>
        <parameter key="first_row_as_names" value="true"/>
    </operator>
    <operator name="FPGrowth" class="FPGrowth">
        <parameter key="find_min_number_of_itemsets" value="false"/>
        <parameter key="min_support" value="0.8"/>
    </operator>
</operator>
Many Thanks
Answers
And again, somebody has fallen into the "meta-data-was-not-defined-and-rapidminer-guessed-wrong" trap.
Just search a bit in this forum (e.g. in this recent thread here http://rapid-i.com/rapidforum/index.php/topic,776.0.html ) and you will see what I mean. All those problems go back to a combination of the facts that RapidMiner internally stores nominal values as numbers (mainly for performance and also for memory consumption reasons) and users usually are too lazy to define all their meta data for RM correctly (although this is possible as the thread above shows).
The question (after reading those threads) is: what do those items with this "wrong" support have in common? Correct! They start with a "TRUE" value in the first line. In RapidMiner, if no meta data is defined, the first value becomes "negative" and the second becomes "positive". So from there everything is correctly counted - but the problem is that the wrong thing is counted (in this case the number of "FALSE"s) which leads to the high supports.
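The effect described above can be imitated outside RapidMiner. The sketch below is plain Python and only mimics the guessing rule as stated in this post (first value seen in a column becomes "negative", the other value becomes "positive"); it is not RapidMiner's actual code. The helper names are hypothetical, and the two columns are rebuilt from the posted table, where id_200-5745 is TRUE only in row 1 and id_202-6176 is TRUE in rows 1 and 14.

```python
N_ORDERS = 19  # rows in the first posted data set


def column(true_rows):
    """Build one item column; `true_rows` are the 1-based rows marked TRUE."""
    return ["TRUE" if r in true_rows else "FALSE" for r in range(1, N_ORDERS + 1)]


col_5745 = column({1})      # id_200-5745
col_6176 = column({1, 14})  # id_202-6176


def guessed_positive(col):
    """Assumed guessing rule: the first value seen is 'negative', the other is 'positive'."""
    return "FALSE" if col[0] == "TRUE" else "TRUE"


def support(columns):
    """Fraction of rows in which every column holds its guessed 'positive' value."""
    hits = sum(
        all(value == guessed_positive(col) for value, col in zip(row, columns))
        for row in zip(*columns)
    )
    return hits / N_ORDERS


print(round(support([col_5745]), 2))            # 18/19
print(round(support([col_6176]), 2))            # 17/19
print(round(support([col_5745, col_6176]), 2))  # 17/19
```

Because both columns start with TRUE, the sketch ends up counting the FALSE rows and yields 18/19 and 17/19 (roughly 0.95 and 0.89), close to the 0.95 and 0.9 shown in the output. The small remaining difference may come from rounding or row handling in the original run; that part is a guess.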
So what can be done? Well, this is pretty easy: define the meta data (e.g. by defining an .aml file describing your data) or simply make sure that the first line contains only falses. You can also combine both approaches and add such a line to your data, create the .aml file for example with the Attribute Editor tool and remove the line again so that your counts will not be distorted.
Cheers,
Ingo
P.S.: By the way: I moved this thread into the "Problems" board of this forum.
I added a row with all falses to the data set. The support value now seems to be OK for item 1 = id_200-5745 and item 2 = id_202-6176.
However, there is still an issue with the support values I get for the other combinations of items.
For example, none of the orders in the data set has these three items together, but the support is 0.05: item 1 = id_204-8905, item 2 = id_204-6359, item 3 = id_204-4623.
Similarly, I get support values of 0.05 for other combinations of items that are not found in any order.
XML code:
<operator name="Root" class="Process" expanded="yes">
    <operator name="ExcelExampleSource" class="ExcelExampleSource">
        <parameter key="excel_file" value="C:\market basket\data4.xls"/>
        <parameter key="first_row_as_names" value="true"/>
    </operator>
    <operator name="FPGrowth" class="FPGrowth">
        <parameter key="find_min_number_of_itemsets" value="false"/>
        <parameter key="min_support" value="0.05"/>
    </operator>
</operator>
data set
false false false false false false false false false false false false false false false false false false false false false false false false false false false false false false
false false false false false false false false false false false false false false false false false false false false false false false false false true false true false false
false false false false false false false false false true false false false false false false false false false false false false false false false false false false false false
false false false false false false false false false false false false false false false false false false false false false false false false false false false false false true
false false false false false false false true true false false false false false false false false false false false false false false true false false false false false false
false false true false false false false false false false false false false false false false false false false false false false false false false false false false false false
false false false false false true true false false false false false false false false false false false false false false false false false false false false false false false
false false false false false false false false false false false false false false false false false false false true false false false false false false false false false false
false false false false false false false false false false false false false true false false false false false false false false false false false false false false false false
false false false false false false false false false false false false false false false false false true false false false false false false false false false false false false
false false false false false false false false false false false false false false false false false false false false false false false false false false false false true false
false false false false false false false false false false false true false false false false false false false false false false false false false false false false false false
true false false false false false false false false false false false false false false false false false false false false false false false false false false false false false
false false false true false false false false false false false false false false false false false false false false false false false false false false false false false false
false false false false false false false false false false false false false false false false false false false false false false false false false false false true false false
false false false false false false false false false false true false false false false false false false false false false false false false false false false false false false
false true false false false false false false false false false false true false true false false false false false true true false false true false false false false false
false false false false false false false false false false false false false false false false true false false false false false false false false false false false false false
false false false false false false false false false false false false false false false false false false false false false false true false false false false false false false
false false false false true false false false false false false false false false false true false false true false false false false false false false true false false false
The list of support values:
1 0.1 id_202-6176
1 0.05 id_205-3148
1 0.05 id_204-8905
1 0.05 id_204-7961
1 0.05 id_204-7074
1 0.05 id_204-6359
1 0.05 id_204-6247
1 0.05 id_204-4623
1 0.05 id_204-4431
1 0.05 id_204-3920
1 0.05 id_204-1463
1 0.05 id_203-8852
1 0.05 id_203-7381
1 0.05 id_203-7222
1 0.05 id_203-6751
1 0.05 id_203-2779
1 0.05 id_203-1769
1 0.05 id_203-0515
1 0.05 id_202-8819
1 0.05 id_202-5694
1 0.05 id_202-4110
1 0.05 id_202-3716
1 0.05 id_201-2956
1 0.05 id_200-8216
1 0.05 id_200-5745
1 0.05 id_200-3492
1 0.05 id_200-2303
1 0.05 id_200-0748
1 0.05 id_100-6580
1 0.05 id_100-5163
2 0.05 id_202-6176 id_200-5745
2 0.05 id_205-3148 id_204-7961
2 0.05 id_204-8905 id_204-6359
2 0.05 id_204-8905 id_204-4623
2 0.05 id_204-8905 id_203-7222
2 0.05 id_204-8905 id_203-0515
2 0.05 id_204-8905 id_100-5163
2 0.05 id_204-6359 id_204-4623
2 0.05 id_204-6359 id_203-7222
2 0.05 id_204-6359 id_203-0515
2 0.05 id_204-6359 id_100-5163
2 0.05 id_204-6247 id_204-3920
2 0.05 id_204-6247 id_203-8852
2 0.05 id_204-4623 id_203-7222
2 0.05 id_204-4623 id_203-0515
2 0.05 id_204-4623 id_100-5163
2 0.05 id_204-3920 id_203-8852
2 0.05 id_204-1463 id_203-6751
2 0.05 id_204-1463 id_203-1769
2 0.05 id_204-1463 id_202-3716
2 0.05 id_203-7222 id_203-0515
2 0.05 id_203-7222 id_100-5163
2 0.05 id_203-6751 id_203-1769
2 0.05 id_203-6751 id_202-3716
2 0.05 id_203-1769 id_202-3716
2 0.05 id_203-0515 id_100-5163
3 0.05 id_204-8905 id_204-6359 id_204-4623
3 0.05 id_204-8905 id_204-6359 id_203-7222
3 0.05 id_204-8905 id_204-6359 id_203-0515
3 0.05 id_204-8905 id_204-6359 id_100-5163
3 0.05 id_204-8905 id_204-4623 id_203-7222
3 0.05 id_204-8905 id_204-4623 id_203-0515
3 0.05 id_204-8905 id_204-4623 id_100-5163
3 0.05 id_204-8905 id_203-7222 id_203-0515
3 0.05 id_204-8905 id_203-7222 id_100-5163
3 0.05 id_204-8905 id_203-0515 id_100-5163
3 0.05 id_204-6359 id_204-4623 id_203-7222
3 0.05 id_204-6359 id_204-4623 id_203-0515
3 0.05 id_204-6359 id_204-4623 id_100-5163
3 0.05 id_204-6359 id_203-7222 id_203-0515
3 0.05 id_204-6359 id_203-7222 id_100-5163
3 0.05 id_204-6359 id_203-0515 id_100-5163
3 0.05 id_204-6247 id_204-3920 id_203-8852
3 0.05 id_204-4623 id_203-7222 id_203-0515
3 0.05 id_204-4623 id_203-7222 id_100-5163
3 0.05 id_204-4623 id_203-0515 id_100-5163
3 0.05 id_204-1463 id_203-6751 id_203-1769
3 0.05 id_204-1463 id_203-6751 id_202-3716
3 0.05 id_204-1463 id_203-1769 id_202-3716
3 0.05 id_203-7222 id_203-0515 id_100-5163
3 0.05 id_203-6751 id_203-1769 id_202-3716
4 0.05 id_204-8905 id_204-6359 id_204-4623 id_203-7222
4 0.05 id_204-8905 id_204-6359 id_204-4623 id_203-0515
4 0.05 id_204-8905 id_204-6359 id_204-4623 id_100-5163
4 0.05 id_204-8905 id_204-6359 id_203-7222 id_203-0515
4 0.05 id_204-8905 id_204-6359 id_203-7222 id_100-5163
4 0.05 id_204-8905 id_204-6359 id_203-0515 id_100-5163
4 0.05 id_204-8905 id_204-4623 id_203-7222 id_203-0515
4 0.05 id_204-8905 id_204-4623 id_203-7222 id_100-5163
4 0.05 id_204-8905 id_204-4623 id_203-0515 id_100-5163
4 0.05 id_204-8905 id_203-7222 id_203-0515 id_100-5163
4 0.05 id_204-6359 id_204-4623 id_203-7222 id_203-0515
4 0.05 id_204-6359 id_204-4623 id_203-7222 id_100-5163
4 0.05 id_204-6359 id_204-4623 id_203-0515 id_100-5163
4 0.05 id_204-6359 id_203-7222 id_203-0515 id_100-5163
4 0.05 id_204-4623 id_203-7222 id_203-0515 id_100-5163
4 0.05 id_204-1463 id_203-6751 id_203-1769 id_202-3716
5 0.05 id_204-8905 id_204-6359 id_204-4623 id_203-7222 id_203-0515
5 0.05 id_204-8905 id_204-6359 id_204-4623 id_203-7222 id_100-5163
5 0.05 id_204-8905 id_204-6359 id_204-4623 id_203-0515 id_100-5163
5 0.05 id_204-8905 id_204-6359 id_203-7222 id_203-0515 id_100-5163
5 0.05 id_204-8905 id_204-4623 id_203-7222 id_203-0515 id_100-5163
5 0.05 id_204-6359 id_204-4623 id_203-7222 id_203-0515 id_100-5163
6 0.05 id_204-8905 id_204-6359 id_204-4623 id_203-7222 id_203-0515 id_100-5163
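The listed supports can be cross-checked against the table as posted. This is a hedged sketch in plain Python, not RapidMiner output: `TRUE_COLUMNS` transcribes the 1-based positions of `true` in each of the 20 rows above (row 1 being the added all-false row), and it assumes the columns are in the same order as the header of the first data set, since the second table was posted without a header. The names `HEADER`, `TRUE_COLUMNS`, and `support` are introduced here for illustration.

```python
from itertools import combinations

# Column order assumed to match the header of the first posted data set.
HEADER = (
    "id_200-8216 id_204-6359 id_202-4110 id_204-4431 id_203-6751 id_205-3148 "
    "id_204-7961 id_203-8852 id_204-3920 id_203-2779 id_203-7381 id_200-2303 "
    "id_100-5163 id_200-3492 id_203-0515 id_202-3716 id_202-8819 id_201-2956 "
    "id_203-1769 id_100-6580 id_203-7222 id_204-4623 id_200-0748 id_204-6247 "
    "id_204-8905 id_200-5745 id_204-1463 id_202-6176 id_204-7074 id_202-5694"
).split()

# 1-based positions of `true` per row; row 1 is the added all-false row.
TRUE_COLUMNS = [
    [], [26, 28], [10], [30], [8, 9, 24], [3], [6, 7], [20], [14], [18],
    [29], [12], [1], [4], [28], [11], [2, 13, 15, 21, 22, 25], [17], [23],
    [5, 16, 19, 27],
]

orders = [{HEADER[c - 1] for c in cols} for cols in TRUE_COLUMNS]


def support(itemset, orders):
    """Fraction of orders that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= order for order in orders) / len(orders)


triple = {"id_204-8905", "id_204-6359", "id_204-4623"}
rows = [i for i, order in enumerate(orders, start=1) if triple <= order]
print(rows, support(triple, orders))

# Every non-empty subset of a single order is an itemset with support >= 1/20.
frequent = {
    frozenset(combo)
    for order in orders
    for size in range(1, len(order) + 1)
    for combo in combinations(sorted(order), size)
}
print(len(frequent))
```

Under these assumptions, row 17 of the posted table holds six items at once, including the three named in the example, so each of their combinations is present in one of 20 rows and gets support 1/20 = 0.05. The number of distinct itemsets found this way equals the number of entries in the list above. The denominator is 20 rather than 19 because the added all-false row is still part of the data.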
Many Thanks
Kind regards,
Tobias
I have a file with 1399 columns and 7745 lines; if I use true and false instead of 1 and 0, the file becomes very large.
Thanks
No, there is no difference if you use nominal numbers.
Greetings,
Sebastian
I have the following which seems to be working fine.
I would like to know if there is any operator for visualizing the association rules, like the one shown in the D2K tutorial here http://algdocs.ncsa.uiuc.edu/TU-20031101-2.pdf, or any other.
Any suggestions?
Thanks
Unfortunately we don't have a plot exactly like that, but I think it's a good idea to add one in the future. I will note it...
But we do have a graphical visualization, which can be very informative, especially for a large number of rules. You simply have to switch to the graph view in the result tab of the association rules. Take a look at it and play with the controls until you get a feeling for the best configuration for your rules...
Greetings,
Sebastian