Help on How to Use FP-Growth

JieDu Member Posts: 6 Contributor II
edited November 2018 in Help
Hi,

Can someone help me with how to use the FP-Growth operator? I am new to RapidMiner and am trying to use it for some data mining work.

Here is the toy problem I used:

Transaction Beef Boots Cheese Chicken Clothes Milk
1 TRUE FALSE FALSE TRUE FALSE TRUE
2 TRUE FALSE TRUE FALSE FALSE FALSE
3 FALSE TRUE TRUE FALSE FALSE FALSE
4 TRUE FALSE TRUE TRUE FALSE FALSE
5 TRUE FALSE TRUE TRUE TRUE TRUE
6 FALSE FALSE FALSE TRUE TRUE TRUE
7 FALSE FALSE FALSE TRUE TRUE TRUE

With the minimum support set to 0.3, I can easily find the following frequent itemsets by hand:

Itemset                  Trans Count  Support
Beef                     4            0.57
Cheese                   4            0.57
Chicken                  5            0.71
Clothes                  3            0.43
Milk                     4            0.57
Beef, Cheese             3            0.43
Beef, Chicken            3            0.43
Chicken, Clothes         3            0.43
Chicken, Milk            4            0.57
Clothes, Milk            3            0.43
Chicken, Clothes, Milk   3            0.43
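
For anyone who wants to double-check the hand calculation, here is a minimal brute-force sketch in Python. It is not how FP-Growth works internally; it simply counts every candidate itemset against the seven toy transactions. The names `TRANSACTIONS` and `frequent_itemsets` are made up for this illustration.

```python
from itertools import combinations

ITEMS = ["Beef", "Boots", "Cheese", "Chicken", "Clothes", "Milk"]

# The toy table from above, one tuple per transaction (1 = TRUE, 0 = FALSE).
ROWS = [
    (1, 0, 0, 1, 0, 1),
    (1, 0, 1, 0, 0, 0),
    (0, 1, 1, 0, 0, 0),
    (1, 0, 1, 1, 0, 0),
    (1, 0, 1, 1, 1, 1),
    (0, 0, 0, 1, 1, 1),
    (0, 0, 0, 1, 1, 1),
]

# Each transaction becomes the set of items marked TRUE.
TRANSACTIONS = [{item for item, flag in zip(ITEMS, row) if flag} for row in ROWS]


def frequent_itemsets(transactions, min_support):
    """Return {itemset tuple: transaction count} for itemsets meeting min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for combo in combinations(items, size):
            count = sum(1 for t in transactions if set(combo) <= t)
            if count / n >= min_support:
                result[combo] = count
                found_any = True
        if not found_any:
            # No frequent itemset of this size, so none of a larger size either.
            break
    return result


hand_check = frequent_itemsets(TRANSACTIONS, 0.3)
for itemset, count in hand_check.items():
    print(", ".join(itemset), count, round(count / len(TRANSACTIONS), 2))
```

With a minimum support of 0.3 and seven transactions, an itemset needs at least three transactions to qualify, which is why Boots (one transaction) never appears.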

However, FP-Growth outputs:

Size    Support    Item1    Item2
1      0.571      Cheese
1      0.429      Milk
1      0.429      Clothes
1      0.429      Beef
2      0.429      Cheese  Milk

Both the support values and the itemsets differ from my hand calculation.

I only used two operators: one to retrieve the data from the repository (I checked the data output, and it looks good) and FP-Growth with "Find min number of itemsets" unchecked and "min support" set to 0.3.

Maybe there are some parameters I should set? I would really appreciate your help!

Jie

Answers

  • haddock Member Posts: 849 Maven
    Hi there Jie,

    It looks like your value count is inverted, so you need to declare explicitly that the positive value is 'TRUE' and the negative value is 'FALSE'. You can do this by placing a 'Remap Binominals' operator upstream of the 'FPGrowth' operator. While this may seem onerous, it can be useful in other applications, for instance where the absence rather than the presence of something is being investigated.

    Good luck!
  • JieDu Member Posts: 6 Contributor II
    Hi Haddock,

    It works! Thank you very much for the help!

    May I ask a follow-up question?

    I am looking for a web mining tool to do web usage analysis (association rules, sequential patterns, etc.) on clickstream data. The dataset is about 10 to 100 million records with 100 variables. Do you think RapidMiner is the right tool? I know companies that use SAS Enterprise Miner, but it is really pricey. Some friends recommend Knowledge Studio or Revolution R. I have watched several RapidMiner video tutorials, and I like the elegant GUI design and the simplicity of drag-and-drop. There is a rich set of operators covering a wide range of problems. What about performance and accuracy? I would really appreciate your advice.

    Thanks a lot in advance.

    Jie
  • haddock Member Posts: 849 Maven
    Hi Jie,

    Glad that worked. On your more general questions, it is difficult to be specific; I rather doubt that anyone has sufficient knowledge of all the available packages. FWIW, I use RapidMiner to sift for patterns in datasets of the size you mention, and because I need the answers fast, I greatly value that RM is open source and therefore checkable and extendable. I use RM to marshal the data, and CUDA to grind it. Zoooom!
  • JieDu Member Posts: 6 Contributor II
    Hi Haddock,

    Thank you very much for the quick response. It is very nice to know that you are using RM to mine datasets of a similar size. I will give RM a try.

    Thanks again.

    Jie
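
The "inverted value count" from the first answer can be illustrated with a small Python sketch on the toy data. It rests on an assumption that the thread does not confirm: that, without an explicit remapping, each column treats whichever value it sees second as the positive one. The helper names (`second_seen_value`, `frequent_itemsets`) are hypothetical and are not RapidMiner API.

```python
from itertools import combinations

ITEMS = ["Beef", "Boots", "Cheese", "Chicken", "Clothes", "Milk"]
T, F = "TRUE", "FALSE"
ROWS = [
    [T, F, F, T, F, T],
    [T, F, T, F, F, F],
    [F, T, T, F, F, F],
    [T, F, T, T, F, F],
    [T, F, T, T, T, T],
    [F, F, F, T, T, T],
    [F, F, F, T, T, T],
]


def second_seen_value(column):
    # ASSUMPTION (not confirmed in the thread): the first distinct value in a
    # column is taken as negative and the second distinct value as positive.
    for value in column:
        if value != column[0]:
            return value
    return None


def frequent_itemsets(rows, positive, min_support=0.3):
    """Brute-force supports, counting an item as present when it equals positive[item]."""
    transactions = [
        {item for item, value in zip(ITEMS, row) if value == positive[item]}
        for row in rows
    ]
    result = {}
    for size in range(1, len(ITEMS) + 1):
        for combo in combinations(ITEMS, size):
            support = sum(set(combo) <= t for t in transactions) / len(rows)
            if support >= min_support:
                result[combo] = round(support, 3)
    return result


columns = {item: [row[i] for row in ROWS] for i, item in enumerate(ITEMS)}
guessed = {item: second_seen_value(col) for item, col in columns.items()}
remapped = {item: "TRUE" for item in ITEMS}  # what an explicit remapping would declare

inverted_result = frequent_itemsets(ROWS, guessed)
fixed_result = frequent_itemsets(ROWS, remapped)
print(inverted_result)
print(len(fixed_result))
```

Under that assumption, Beef, Chicken and Milk end up with 'FALSE' as their positive value, so an entry such as "Beef 0.429" really counts the transactions without beef. The sketch then yields the same five itemsets and supports that FP-Growth reported in the question, and declaring 'TRUE' as positive for every column yields the eleven itemsets from the hand calculation.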