Help on How to Use FP-Growth

JieDu · March 2012

Hi,

Can someone help me use the FP-Growth operator? I am new to RapidMiner and am trying to use it for some data mining work.

Here is the toy problem I used:

Transaction Beef Boots Cheese Chicken Clothes Milk
1 TRUE FALSE FALSE TRUE FALSE TRUE
2 TRUE FALSE TRUE FALSE FALSE FALSE
3 FALSE TRUE TRUE FALSE FALSE FALSE
4 TRUE FALSE TRUE TRUE FALSE FALSE
5 TRUE FALSE TRUE TRUE TRUE TRUE
6 FALSE FALSE FALSE TRUE TRUE TRUE
7 FALSE FALSE FALSE TRUE TRUE TRUE

With minimum support set at 0.3, I can easily find the following frequent itemsets by hand:

Itemset Trans Count Support
Beef 4 0.57
Cheese 4 0.57
Chicken 5 0.71
Clothes 3 0.43
Milk 4 0.57
Beef, Cheese 3 0.43
Beef, Chicken 3 0.43
Chicken, Clothes 3 0.43
Chicken, Milk 4 0.57
Clothes, Milk 3 0.43
Chicken, Clothes, Milk 3 0.43
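For anyone who wants to check the hand calculation, here is a minimal brute-force sketch in Python. It is not FP-Growth itself, just a plain enumeration of candidate itemsets over the seven toy transactions; the function and variable names are made up for illustration.

```python
from itertools import combinations

# Toy data from the post: one row per transaction, one flag per item.
ITEMS = ["Beef", "Boots", "Cheese", "Chicken", "Clothes", "Milk"]
ROWS = [
    (True, False, False, True, False, True),
    (True, False, True, False, False, False),
    (False, True, True, False, False, False),
    (True, False, True, True, False, False),
    (True, False, True, True, True, True),
    (False, False, False, True, True, True),
    (False, False, False, True, True, True),
]


def frequent_itemsets(rows, items, min_support):
    """Return {itemset tuple: transaction count} for itemsets meeting min_support."""
    # Treat TRUE as "item is in the basket".
    transactions = [
        {item for item, flag in zip(items, row) if flag} for row in rows
    ]
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count / n >= min_support:
                result[candidate] = count
                found_any = True
        if not found_any:
            # No frequent itemset of this size, so none of a larger size either.
            break
    return result


frequent = frequent_itemsets(ROWS, ITEMS, 0.3)
for itemset, count in frequent.items():
    print(", ".join(itemset), count, round(count / len(ROWS), 2))
```

With 7 transactions and a minimum support of 0.3, an itemset needs to appear in at least 3 transactions (3/7 ≈ 0.43), which is why Boots (1 transaction) drops out.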

However, FP-Growth outputs:

Size Support Item1 Item2
1 0.571 Cheese
1 0.429 Milk
1 0.429 Clothes
1 0.429 Beef
2 0.429 Cheese Milk

Both the support values and the itemsets differ from my hand calculation.

I used only two operators: one to retrieve the data from the repository (I checked the data output, and it looks good), and FP-Growth with "Find min number of itemsets" unchecked and "min support" set to 0.3.

Maybe there are some parameters I should set up? Really appreciate your help!

Jie

haddock · April 2012

Hi there Jie,

It looks like your value count is inverted, so you need to declare explicitly that the positive value is 'TRUE' and the negative value is 'FALSE'. You can do this by placing a 'Remap Binominals' operator upstream of the 'FP-Growth' operator. While this may seem onerous, it can be useful in other applications, where for instance the absence or presence of something is being investigated.
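To illustrate what an inverted positive value does to the counts, here is a small Python sketch over the same toy data. It rests on a guess, not on documented RapidMiner behaviour: it assumes that, without an explicit remapping, the second distinct value seen in each column ends up as the "positive" one. The helper names are hypothetical.

```python
from itertools import combinations

ITEMS = ["Beef", "Boots", "Cheese", "Chicken", "Clothes", "Milk"]
ROWS = [
    (True, False, False, True, False, True),
    (True, False, True, False, False, False),
    (False, True, True, False, False, False),
    (True, False, True, True, False, False),
    (True, False, True, True, True, True),
    (False, False, False, True, True, True),
    (False, False, False, True, True, True),
]


def second_seen_value(column):
    """ASSUMPTION: the second distinct value in a column is taken as positive."""
    first = column[0]
    for value in column[1:]:
        if value != first:
            return value
    return first


def supports(rows, items, positives, min_support):
    """Support of every itemset, counting a cell as present when it equals
    that column's positive value."""
    n = len(rows)
    out = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(range(len(items)), size):
            count = sum(
                all(row[i] == positives[i] for i in combo) for row in rows
            )
            if count / n >= min_support:
                out[tuple(items[i] for i in combo)] = round(count / n, 3)
    return out


# Positive value guessed per column from the order in which values appear.
guessed = [second_seen_value(col) for col in zip(*ROWS)]
inverted = supports(ROWS, ITEMS, guessed, 0.3)

# Positive value declared explicitly as TRUE for every column.
remapped = supports(ROWS, ITEMS, [True] * len(ITEMS), 0.3)

print(inverted)
print(remapped)
```

Under that assumption, Beef, Chicken and Milk are counted on their FALSE rows, and the sketch yields the same five itemsets and supports that the operator reported. With TRUE fixed as the positive value for every column, it yields the eleven itemsets from the hand calculation.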

Good luck!

JieDu · April 2012

Hi Haddock,

It works! Thank you very much for the help!

May I ask a follow-up question?

I am looking for a web mining tool to do web usage analysis (association rules, sequential patterns, etc.) on clickstream data. The dataset size is about 10 to 100 million records with 100 variables. Do you think RapidMiner is the right tool? I know companies that use SAS Enterprise Miner, but it is really pricey. Some friends recommend Knowledge Studio or Revolution R. I have watched several RapidMiner video tutorials. I like the elegant GUI design and the simplicity of drag-and-drop. There is a rich set of operators covering a wide range of problems. What about the performance and accuracy? I would really appreciate your advice.

Thanks a lot in advance.

Jie

haddock · April 2012

Hi Jie,

Glad that worked. On your more general questions it is difficult to be specific; I rather doubt that anyone has sufficient knowledge of all the available packages. FWIW, I use RapidMiner to sift for patterns in datasets of the size you mention, and because I need the answers fast I greatly value that RM is open source, and therefore checkable and extendable. I use RM to marshal the data, and CUDA to grind it. Zoooom!

JieDu · April 2012

Hi Haddock,

Thank you very much for the quick response. It is very nice to know that you are using RM to mine datasets of a similar size. I will give RM a try.

Thanks again.

Jie
