Problem with Naive Bayesian
Hello,
I'm from Germany and studying Financial Management. Right now I have to give a presentation about Naive Bayes in RapidMiner. My problem is that I don't understand how the results "prediction(no)" / "prediction(yes)" are computed.
Here is my XML Process:
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Golf" width="90" x="45" y="34">
<parameter key="repository_entry" value="//Samples/data/Golf"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="naive_bayes" compatibility="8.0.001" expanded="true" height="82" name="Naive Bayes" width="90" x="246" y="34">
<parameter key="laplace_correction" value="true"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="retrieve" compatibility="8.0.001" expanded="true" height="68" name="Retrieve Golf-Testset" width="90" x="45" y="187">
<parameter key="repository_entry" value="//Samples/data/Golf-Testset"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="apply_model" compatibility="8.0.001" expanded="true" height="82" name="Apply Model" width="90" x="313" y="340">
<list key="application_parameters"/>
<parameter key="create_view" value="false"/>
</operator>
</process>
<?xml version="1.0" encoding="UTF-8"?><process version="8.0.001">
<operator activated="true" class="performance" compatibility="8.0.001" expanded="true" height="82" name="Performance" width="90" x="380" y="85">
<parameter key="use_example_weights" value="true"/>
</operator>
</process>
I've used the Golf data set. Take the first row as an example: Outlook: sunny, Temperature: 85, Humidity: 85 and Wind: false.
For Temperature and Humidity I've used the probability density function, which gave me the following results: no = 0,0003074677... and yes = 0,000059000924.
What should I do with those results to get the predictions No = 0,711 and Yes = 0,289?
Thank you in advance!
Best Answer
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi again @domi_wiese,
I think you have reached your goal exactly:
Don't forget that you performed the calculation without Laplace correction.
Without Laplace correction, the results from RapidMiner are:
To come back to the calculations:
5/14 => OK
2/5 => OK
3/5 => OK
0,04125 (Humidity, can you confirm?) => OK
0,02121 (Temperature, can you confirm?) => NOK (I get 0,1204, so I made an error somewhere. Can you give the details of your calculation for this case? I don't know where the error in my calculation is.)
Moral: is the solution in the Windows calculator?
Best regards,
Lionel
Answers
If you want to understand the calculations behind NB (also using the Golf dataset), check out Ingo's short video here:
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thank you for this video. I have already watched it, but it only explains the basics, which are not the problem for me. I'm asking about the next step: how to combine the continuous numeric values with those from Outlook and Wind to get the predictions (yes and no). In other words: what is the equation that produces those predictions, for example for the first row?
Hi,
for numeric variables we use a Gaussian assumption: the likelihood is given by the usual Gaussian pdf with the mean and variance calculated per class. For nominal variables we get the probabilities by simple counting.
Best
Dortmund, Germany
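As a minimal sketch of the two rules above (not RapidMiner's actual implementation), the Python below estimates one class-conditional likelihood per attribute. The 14 Golf rows are typed in by hand and are an assumption to check against `//Samples/data/Golf`; the helper names are hypothetical. The sample standard deviation (n - 1) is used because it reproduces the densities 0,04125 and 0,02121 quoted in the accepted answer of this thread.

```python
from math import exp, pi, sqrt
from statistics import mean, stdev

# (Outlook, Temperature, Humidity, Wind, Play)
# Assumption: hand-typed copy of the Golf sample data; verify before relying on it.
GOLF = [
    ("sunny", 85, 85, False, "no"),
    ("sunny", 80, 90, True, "no"),
    ("overcast", 83, 78, False, "yes"),
    ("rain", 70, 96, False, "yes"),
    ("rain", 68, 80, False, "yes"),
    ("rain", 65, 70, True, "no"),
    ("overcast", 64, 65, True, "yes"),
    ("sunny", 72, 95, False, "no"),
    ("sunny", 69, 70, False, "yes"),
    ("rain", 75, 80, False, "yes"),
    ("sunny", 75, 70, True, "yes"),
    ("overcast", 72, 90, True, "yes"),
    ("overcast", 81, 75, False, "yes"),
    ("rain", 71, 80, True, "no"),
]


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))


def numeric_likelihood(rows, col, x, label):
    """Gaussian density of x, fitted on column col of the rows with this label."""
    values = [r[col] for r in rows if r[4] == label]
    return gaussian_pdf(x, mean(values), stdev(values))  # stdev divides by n - 1


def nominal_likelihood(rows, col, value, label):
    """Relative frequency of value in column col among the rows with this label."""
    in_class = [r for r in rows if r[4] == label]
    return sum(r[col] == value for r in in_class) / len(in_class)


for label in ("yes", "no"):
    print(
        label,
        nominal_likelihood(GOLF, 0, "sunny", label),
        nominal_likelihood(GOLF, 3, False, label),
        numeric_likelihood(GOLF, 1, 85, label),
        numeric_likelihood(GOLF, 2, 85, label),
    )
```

Each printed line holds the four per-attribute factors for the row (sunny, 85, 85, false); they still have to be combined with the class prior to get a prediction.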
Thank you very much.
Let's stick to the first row with the prediction for no (71,1%).
I used the probability density function for Temperature (85) and Humidity (85) and got the following results:
       Temperature      Humidity
yes    0,00097307096    0,00319056274
no     0,0464961233     0,0412564316
Next, I computed the following probabilities for Outlook = sunny and Wind = false:
       Sunny    False
yes    3/9      4/9
no     2/5      2/5
To get the predictions no (71,1%) and yes (28,9%), I thought it would work like this:
Multiply all the results for yes together with the prior (9/14), and all the results for no with the prior (5/14). Then add those two products to get the basis (the evidence). Finally, divide the product for yes by the evidence and the product for no by the evidence to get the predictions.
What am I doing wrong?
Thank you in advance!
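The procedure described above (multiply by the prior, sum to get the evidence, divide) can be sketched in Python as follows. This is an illustration, not RapidMiner's code. The `no` factors are the ones confirmed in the accepted answer (5/14, 3/5, 2/5, 0,02121, 0,04125). The `yes` factors are a recount and recomputation from the Golf rows without Laplace correction and should be treated as assumptions; they differ from the table above (counts 2/9 and 6/9, and densities with the same digits but the decimal point one place further right).

```python
from math import prod

# Factors for the first Golf row (sunny, 85, 85, false), no Laplace correction.
# Order: prior, P(sunny | class), P(wind=false | class), pdf(Temperature), pdf(Humidity)
factors = {
    "no": [5 / 14, 3 / 5, 2 / 5, 0.02121, 0.04125],   # values confirmed in the thread
    "yes": [9 / 14, 2 / 9, 6 / 9, 0.00973, 0.03191],  # assumption: recomputed by hand
}


def posteriors(factors):
    """Multiply the factors per class, then normalize so the results sum to 1."""
    scores = {label: prod(values) for label, values in factors.items()}
    evidence = sum(scores.values())  # the "basis" both scores are divided by
    return {label: score / evidence for label, score in scores.items()}


result = posteriors(factors)
for label, p in result.items():
    print(f"confidence({label}) = {p:.3f}")
```

With these inputs the sketch gives roughly 0.717 for no, in line with the 71,7% reported further down in this thread for the calculation without Laplace correction.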
Hi,
This topic interests me a lot.
Indeed, in my opinion, it is essential to understand the theory behind the algorithms.
I hope you can give me a few minutes of your attention:
1. Here are the confidences for the Golf test set (after training on the Golf data set) given by RapidMiner
without Laplace correction:
2. I tried to reproduce these results manually, but I get these illogical results for the first row of the Golf data set:
You can find the whole Excel calculation file by following this link:
https://drive.google.com/open?id=18T153eElmtsjOzihGwLENVh8cwHdaHMT
3. I also used Python, and the results differ from RapidMiner's:
You can find the process here:
and the training and test Golf data sets (Excel file) by following this link:
https://drive.google.com/open?id=18Dht5-aTuJVehZvbU3LZLAvzQTBixCLB
4. I think I understand how the confidences are calculated, and my calculations are almost there.
Can you help me find my error, if there is one?
Why are the Python results different from RapidMiner's?
Is there any postprocessing of the probabilities in RapidMiner?
Thank you for your help.
Best regards,
Lionel
dang, @IngoRM - @lionelderkrikor also uses Excel to check calculations!! I thought I was the only luddite lingering around. Now if I could only get my hands on my old HP 42S RPN calculator....
(sorry @lionelderkrikor - I was just showing Ingo some calcs on Excel today and could not resist. Believe me, I am sometimes very proud of my luddite skills...)
Scott
Hi,
thanks for sharing your point of view, Lionel! I really appreciate it.
In my opinion, you did not use the "correct" equation for the probability density function. I think the following video has it right at 02:20 min:
https://www.youtube.com/watch?v=k2diLn5Nqbs&t=125s&list=PL7r4RQYRQRfgw3-ccVUzdlYh5HK-tQHFs&index=3
Using that equation and proceeding as described in my last post, I got prediction no = 78,756%, which still isn't 71,1%.
Could someone please help me find the solution?
Thank you.
Hi,
I think @lionelderkrikor forgot the priors. So you need to multiply by 5/14 (no) and 9/14 (yes) respectively.
Best,
Martin
Dortmund, Germany
Hi,
Thank you for your feedback @domi_wiese. I admit that you are much closer to the expected results.
Several things:
1. A priori, the equation I use and the equation in your video are equivalent:
2. In my intermediate results, I get exactly the same values as RapidMiner shows in the Distribution Table (mean/std dev of Temperature and Humidity, counts of the nominal attributes) without Laplace correction:
That's why I don't understand why I obtain these illogical results.
3. @mschmitz, a priori I have not forgotten the priors in the calculations; admittedly, they were not explicit and detailed in the Excel calculation file.
Although there is no change in the results, here is the link to the second version of my Excel file:
https://drive.google.com/open?id=12mELZ_SW8fv-VfeRkY-mUjqEUb42ODx6
4. @domi_wiese, maybe you can share your calculation file and/or your intermediate results - P(Xi|Y = yes/no) and P(Y = yes/no) - so that we can find the solution to this mysterious Naive Bayes problem.
5. Do not give up: I'm sure we will find the solution to this problem, and if we cannot do it with Excel, @sgenzer will lend us his HP 42S RPN calculator... or I will retrieve my old TI 86 calculator from college:
I hope that I have advanced the thinking on this topic a little bit.
Best regards,
Lionel
@lionelderkrikor. @sgenzer
Ok boys, I'm dropping my beast on the table too...
And on the 7th day, God created the HP 48GX
Hi,
@lionelderkrikor
I'm really sorry. I made a mistake when using the probability density function.
But I've corrected it. I have now computed the intermediate result for no as shown in the first picture below, and I did the same for yes. After that, I got the predictions, with 71,7% for no. This is still around 0,5% too high, but I think it could be correct. What do you think?
Hi @lionelderkrikor,
thank you for drawing my attention to the Laplace correction. I'll look into that by tomorrow.
Of course I will send you my calculation.
Hi again @domi_wiese,
Thanks to you, I found my error: a problem with brackets and the exponent in Excel.
Out of curiosity: what calculator software do you use?
And good luck with your presentation.
Best regards,
Lionel
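One plausible reading of the bracket-and-exponent problem, offered as an assumption since the spreadsheet formula itself is not shown in the thread: Excel evaluates unary minus before `^`, so `-(x-mu)^2` comes out positive unless the square is wrapped in its own brackets. The Python sketch below mimics both variants for Temperature = 85 in the `no` class, using mean 74.6 and sample variance 62.3 (recomputed from the Golf rows, so also an assumption); the function names are hypothetical.

```python
from math import exp, pi, sqrt

MU, VAR = 74.6, 62.3  # assumed mean and sample variance of Temperature, class "no"


def pdf_correct(x):
    # Intended exponent: -((x - mu)^2) / (2 * var)
    return exp(-((x - MU) ** 2) / (2 * VAR)) / sqrt(2 * pi * VAR)


def pdf_sign_lost(x):
    # Mimics squaring the negated difference: (-(x - mu))^2 is positive,
    # so the exponent loses its minus sign.
    return exp(((-(x - MU)) ** 2) / (2 * VAR)) / sqrt(2 * pi * VAR)


print(round(pdf_correct(85), 5))
print(round(pdf_sign_lost(85), 4))
```

The first value is about 0,0212 and the second about 0,1204, which matches the two Temperature figures discussed in the accepted answer.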
Hi @lionelderkrikor,
I'm glad we found our mistakes and solved them, and thank you for wishing me luck.
To be honest: at first I used my own calculator, but then I switched to a calculator on the internet. Here is the link:
https://web2.0rechner.de/
Have a nice day!
Hi @lionelderkrikor,
just one thing: could you please send me a picture of your design view with the process? And where is the Laplace correction option? I know what it is, but I can't find where it is set.
Thank you in advance!
Hi @lionelderkrikor,
I've already found out how it works. So thanks again, and have a nice day!