User k-NN - How to get the list of user IDs a recommendation is generated from?
Take the following user-book rating dataset as an example:
User | Book | Rating
User 1 | Book 1 | 5
User 2 | Book 1 | 4
User 2 | Book 3 | 3
...
We can use the User k-NN operator from the Recommender extension to find out which books we should recommend to each user, based on how similar their book preferences are to those of other users. The output looks like:
User | Recommended book
User 1 | Book 3
User 1 | Book 5
User 2 | Book 6
...
However, is there any way to find out who the 'similar' users are that each recommendation comes from?
Current output: We recommend Book 3 to User 1
Expected output: We recommend Book 3 to User 1 because of 85% similarity to User X
I have tried using the Cross Distances operator to calculate the distance between users and find the shortest-distance users. However, Cross Distances treats both of these scenarios the same:
1. Two users who have read the same book
2. Two users who have not read the same book
User k-NN's similarity, on the other hand, is based solely on #1, the books two users have both rated. Hence, it turns out that the book recommendations are not drawn from the users with the shortest Euclidean distance.
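To make the mismatch concrete, here is a small Python sketch (outside RapidMiner, with made-up ratings). Euclidean distance on the zero-filled pivoted matrix picks a user with no shared books as the nearest neighbour, while a similarity restricted to co-rated books picks the user who actually read the same book:

```python
import numpy as np

# Toy pivoted user-item matrix rows (columns = books);
# unrated books are filled with 0, as a pivot step typically does.
u1 = np.array([5, 0, 0, 0])   # rated only Book 1
u2 = np.array([4, 5, 5, 5])   # rated Book 1 similarly, plus three more books
u3 = np.array([0, 1, 0, 0])   # shares no rated books with u1

def euclidean(a, b):
    # Distance over ALL columns: matching zeros (books neither user
    # rated) contribute nothing, so non-overlapping users can look close.
    return float(np.linalg.norm(a - b))

def cosine_corated(a, b):
    # Similarity restricted to books BOTH users rated, which is closer
    # to how a user-based k-NN judges taste overlap.
    mask = (a > 0) & (b > 0)
    if not mask.any():
        return 0.0
    a, b = a[mask], b[mask]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(euclidean(u1, u2), euclidean(u1, u3))            # ~8.72 vs ~5.10: u3 is "nearer"
print(cosine_corated(u1, u2), cosine_corated(u1, u3))  # 1.0 vs 0.0: u2 is more similar
```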
Answers
P/S: Sorry, I did not mention clearly in the description earlier that it's "User k-NN"; I have re-edited the description now.
I have tried to perform the steps below:
1. Use "User k-NN" (k=3, not weighted) to generate recommendation
2. Use "k-NN" (k=3, not weighted, cosine similarity) on the same dataset, using user column as label
However, after I take the 3 users with the highest confidence factors from step 2, I am not able to match them with the results generated in step 1.
For example, step 1 recommended book 5 to user 1.
However, in step 2, all 3 most similar users have never rated book 5.
Do you have any idea what could possibly have caused the difference?
However, if you decide to use a two-step solution with two k-NNs, one for recommendations and another for identifying the likely recommender, the discrepancy you see may be due to several issues.
(1) It seems that the initial k-NN uses the Euclidean metric while the later k-NN relies on the cosine metric, and the two metrics are very different (if you were to measure similarity between two stars, the first measures the physical distance between each star and the observer, the latter the angular separation between the stars in the eye of the observer). So if you use different similarity measures you will end up with different nearest neighbours in each case - I think this explains your results.
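To see how differently the two metrics behave, here is a toy sketch in plain NumPy (made-up vectors): two users with the same taste profile but different rating scales are identical under cosine yet far apart under Euclidean.

```python
import numpy as np

a = np.array([1.0, 1.0])   # low but consistent ratings
b = np.array([5.0, 5.0])   # the same taste profile on a higher scale
c = np.array([1.2, 0.8])   # close in absolute terms, different direction

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(np.linalg.norm(a - b), np.linalg.norm(a - c))  # ~5.66 vs ~0.28: c is the Euclidean neighbour
print(cos(a, b), cos(a, c))                          # 1.0 vs ~0.98: b is the cosine neighbour
```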
(2) Assume that both k-NNs use the same similarity measurements. If there is confusion between several possible answers, you may decide to use a 1-NN in the latter case to get the single best match for your recommender.
(3) It is also possible that several neighbours are equidistant from your solution, in which case a random equidistant neighbour may be returned - and not necessarily the same one for both models.
(4) It is also possible that you are getting different answers in both because the models may be undertaking different pre-processing steps, e.g. elimination of missing values, conversion of nominals to numerical values or normalisation/standardisation of values. So if there are any pre-processing options available in both, switch them off and do the pre-processing manually.
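As a quick illustration of (4), even a min-max normalisation step on its own can flip which neighbour comes out nearest (made-up numbers, plain NumPy):

```python
import numpy as np

X = np.array([
    [100.0, 1.0],   # user 0
    [110.0, 9.0],   # user 1
    [160.0, 1.1],   # user 2
])

def nearest_to_first(data):
    # Euclidean distance from user 0 to every user, excluding itself.
    d = np.linalg.norm(data - data[0], axis=1)
    d[0] = np.inf
    return int(np.argmin(d))

print(nearest_to_first(X))   # 1: the large first column dominates raw distances

# Min-max normalise each column to [0, 1], as a pre-processing step might.
Xn = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(nearest_to_first(Xn))  # 2: after scaling, the nearest neighbour flips
```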
Note that in systems which, unlike RapidMiner, return the likelihoods of neighbour recommendations, the typical process for measuring distances with k-NN is to create the k-NN model first and then apply it to the training set, which returns the likelihood measurements (these often come separately).
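Outside RapidMiner that pattern might look like the sketch below; I am assuming scikit-learn's NearestNeighbors purely for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical pivoted user-item matrix (rows = users, 0 = unrated).
X = np.array([
    [5, 0, 4, 0],
    [4, 0, 3, 3],
    [0, 5, 0, 2],
    [5, 1, 4, 0],
])

k = 2
# Create the k-NN model first...
knn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(X)

# ...then apply it back to the training set. The distances returned
# here are the per-neighbour measurements that come separately.
distances, indices = knn.kneighbors(X)

# Each training point is its own nearest neighbour (distance 0),
# so drop the first column.
distances, indices = distances[:, 1:], indices[:, 1:]
for user in range(len(X)):
    print(f"user {user}: neighbours {indices[user].tolist()}, "
          f"cosine distances {distances[user].round(3).tolist()}")
```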
From the description of "User k-NN", it seems to use cosine similarity as well. I have used the same pre-processing steps for both, except for one extra step (pivoting the data) when feeding the data to "k-NN". Nonetheless, I understand your point; (3) may still happen even if all other steps are the same.
If I wanted to build a user-based collaborative filtering recommender system from scratch so that both results (recommendations and nearest neighbours) align, would these be the steps to follow? Or is there an easier way of doing this?
Step 1. Use the k-NN or cosine similarity operator to find the top nearest neighbours for each user
Step 2. Loop over all users; for each item not rated by a user, calculate the average score his neighbours gave it
Step 3. Loop over all users and select the top-scoring items from the result of Step 2 (a rough sketch of these steps follows below)
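For reference, here is a minimal Python sketch of those three steps on a toy dense matrix (0 = unrated; assuming every user has rated at least one item). This is only my own sketch, not how the Recommender extension implements it, but it ties each recommendation to the neighbours it came from, so both outputs stay consistent:

```python
import numpy as np

def recommend(ratings, k=3, top_n=2):
    """User-based CF on a dense user-item matrix where 0 means unrated."""
    # Step 1: cosine similarity between all users, then top-k neighbours.
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    sim = (ratings @ ratings.T) / (norms @ norms.T)
    np.fill_diagonal(sim, -1.0)                  # a user is not its own neighbour
    neighbours = np.argsort(-sim, axis=1)[:, :k]

    results = {}
    for u in range(ratings.shape[0]):
        nb = neighbours[u]
        # Step 2: average the neighbours' ratings for items u has not rated.
        scores = {}
        for item in np.where(ratings[u] == 0)[0]:
            given = ratings[nb, item]
            given = given[given > 0]
            if given.size:
                scores[item] = float(given.mean())
        # Step 3: keep the top-scoring items, together with the neighbours
        # they were drawn from.
        top = sorted(scores, key=scores.get, reverse=True)[:top_n]
        results[u] = {"neighbours": nb.tolist(), "recommend": top}
    return results

ratings = np.array([
    [5, 0, 4, 0, 0],
    [4, 0, 3, 3, 0],
    [5, 1, 4, 0, 2],
    [0, 4, 0, 5, 0],
], dtype=float)
print(recommend(ratings, k=2))
```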
P/S:
The key reason I'm using the "User k-NN" extension is that it provides the recommendation result directly (Step 3), while "k-NN" only tells me who the most similar users are (Step 1). In my case, however, both results are required.