Similarity between data
Hi everyone!
I have a data set that contains about 250,000 products, consisting of various columns like "artid", "title", "longtext" and so on.
Now, I want to find similar products to each product, where the result should look like:
artid; similar1-artid; similar2-artid; and so on.
For this, I'd like to select the columns that should be analyzed, and I'd like to set a "limit of similarity" that tells RapidMiner when to write the artid of a similar product into the results list (next to each product) and when to ignore it.
I looked at many video tutorials dealing with text classification, but none of them showed how to create such a data set (listing each product again together with the artids of the similar products).
I also tried "data to similarity", but it fails to display the results, even if I filter down to 1 % of the data.
Does anyone have an idea on that?
Many thanks in advance!
Answers
Many thanks for your input!
Well, I just ran the process (which took about 3 hours on a machine with 16 GB of RAM, even after splitting the data) and received a result containing four columns:
Row Number; request; document; distance
request seems to be the artid which I set to role "id", document contains numbers between 1 and to and distance is always filled with "?".
I thought that I'd receive something like:
id; "similar id"; percentage
Regarding the (250K*250K)/2 rows:
Wouldn't it be possible to get the exact same number of rows as the input data set contains and just add the similar artids in each row (while each of them is a product) as a new column?
Is there anything wrong with my code?
Thanks!
So 250k x 250k.
Many thanks for your input regarding the similarity to data-operator.
Unfortunately this exceeds my RAM (16 GB) even if I filter the data to 0.005.
Is there any chance to do this analysis without needing a 3-digit RAM server?
Thanks!
The output of cross distance is exactly what you need: it returns three columns, namely the id of the product (request column), the id of the most similar product (document column), and the distance between these two products (a measure of how similar the two products are).
Some additional comments: in your scenario I wouldn't set the compute similarities parameter, because you would then obtain the three most dissimilar products.
To analyse your results in the Results View perspective, click on the request column so that the results are sorted by it. You will then see that each product appears only 3 times (as you selected the top 3 products), that the document column contains the most similar products, and that the last column is the distance.
I also suggest not using the Data2Similarity operator, as it is not necessary and consumes a lot of memory. In the scenario presented above, the result of the cross distance operator would consume 250k * 3 (number of k) * 3 (number of columns) * 8 (size of the double type) bytes, so about 18 MB of RAM to store your results. That is much less than 16 GB, but first you have to correct your process.
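As a rough check of the numbers in this thread, the two storage strategies can be compared with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes 8 bytes per stored value, counts nothing but the raw values, and uses made-up function names for illustration.

```python
# Rough memory estimates, assuming every stored value is an 8-byte double.
N_PRODUCTS = 250_000
K = 3                # neighbours kept per product
COLUMNS = 3          # request, document, distance
BYTES_PER_VALUE = 8


def top_k_bytes(n, k, columns=COLUMNS, width=BYTES_PER_VALUE):
    """Size of a result table that keeps k rows per product."""
    return n * k * columns * width


def full_matrix_bytes(n, width=BYTES_PER_VALUE):
    """Size of one distance per product pair, using the (n*n)/2 count from the thread."""
    return n * n // 2 * width


print(top_k_bytes(N_PRODUCTS, K) / 1e6, "MB")    # top-3 table
print(full_matrix_bytes(N_PRODUCTS) / 1e9, "GB")  # all pairwise distances
```

Under these assumptions the top-3 table needs 18 MB, while storing a distance for every pair needs about 250 GB, which is consistent with the all-pairs approach running out of memory on a 16 GB machine.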
Best
Do you mean something like in the following code?
This was the only version that ran for at least a few minutes before crying for more RAM.
But it also just stopped working (without any notice) after half an hour.
Thanks!
As I wrote, you have to (!!!) use the text processing or text mining extension to convert your product text descriptions into some meaningful numbers.
I suggest looking at the YouTube channel to learn about text mining in RM; then you'll be able to solve your problem.
In your process you have to replace the Nominal2Numerical operator with the Process Documents from Data operator. Before that, you have to convert nominal to text, or re-read your data and set the correct attribute type for the longtext attribute.
If you run such a process for 250k documents it will take some time, but your computer should be able to do it without any problems.
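For readers who want to see the idea outside RapidMiner, here is a minimal plain-Python sketch of the same pipeline: turn each text into a TF-IDF vector, then keep only the k most similar products per product. It is not what the Process Documents from Data operator does internally. The whitespace tokenizer, the smoothed IDF formula, the min_similarity threshold, the sample products, and all function names are assumptions chosen for illustration, and the brute-force pairwise loop is far too slow for 250k products.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: dict id -> text. Returns dict id -> {term: weight}, scaled to unit length."""
    tokens = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    df = Counter()                      # number of documents containing each term
    for toks in tokens.values():
        df.update(set(toks))
    n = len(docs)
    vectors = {}
    for doc_id, toks in tokens.items():
        tf = Counter(toks)
        # Smoothed IDF (an assumption; other weighting schemes are possible).
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors[doc_id] = {t: w / norm for t, w in vec.items()} if norm else {}
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (a plain dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def top_k_similar(docs, k=3, min_similarity=0.0):
    """For each id, return up to k (other_id, similarity) pairs above the threshold."""
    vecs = tfidf_vectors(docs)
    result = {}
    for i, vi in vecs.items():
        scored = [(j, cosine(vi, vj)) for j, vj in vecs.items() if j != i]
        scored = [(j, s) for j, s in scored if s >= min_similarity]
        scored.sort(key=lambda pair: (-pair[1], str(pair[0])))
        result[i] = scored[:k]
    return result


# Hypothetical sample data, not taken from the thread.
docs = {
    "A1": "red cotton t-shirt",
    "A2": "blue cotton t-shirt",
    "A3": "stainless steel kettle",
}
similar = top_k_similar(docs, k=3, min_similarity=0.1)
for artid, neighbours in similar.items():
    print(artid, [j for j, _ in neighbours])
```

Only k entries per product are kept, so the result stays small even though every pair is compared once. The min_similarity argument plays the role of the "limit of similarity" asked for in the original question.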
Please excuse my late reply.
After watching a few videos and reading an article on another website, I managed to get results when using the "data to similarity" operator.
But when I use "cross distance", I only get "?" in the "distance" column of the results. (In particular, I don't know if my connections are correct.)
process with "data to similarity" (working):
process with "cross distance" (not working):
Does anyone have an idea why v1 is working and v2 is not?
Thank you very much!
After adding the "multiply" operator between "process documents" and "cross distance", I'm finally getting results! ;D
I don't know if this is the correct way to feed the ref port of "cross distance", but I hope so.
I now only have two questions left:
1.
Can I make the result look like:
request;document
productX;productA,productB,productC
Instead of
request;document
productX;productA
productX;productB
productX;productC
2.
What does a similarity score like 0.XX mean?
Isn't it possible to show values between 1.0 and 0.0 (meaning percentages)?
At the moment I get something between 0 and 1.55 (small sample set).
Thank you very much!
1.
Can I make the result look like:
request;document
productX;productA,productB,productC
Yes, the Aggregate operator will do that for you. See this example:
2.
What does a similarity score like 0.XX mean?
Isn't it possible to show values between 1.0 and 0.0 (meaning percentages)?
At the moment I get something between 0 and 1.55 (small sample set).
I'm not 100% sure about your example here, but it sounds like Normalize is the operator you are looking for.
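To make the normalization idea concrete, here is a small Python sketch of min-max scaling, which maps the observed distances onto the range 0 to 1 and then flips them into a similarity score. It is an illustration only, not the Normalize operator itself; the function names and the sample distances are assumptions. Note that the result is relative to the smallest and largest distance in the data, so it is not a true percentage.

```python
def min_max_scale(distances):
    """Scale distances linearly so the smallest becomes 0.0 and the largest 1.0."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # All distances are equal; there is no spread to scale.
        return [0.0 for _ in distances]
    return [(d - lo) / (hi - lo) for d in distances]


def to_similarity(distances):
    """Turn scaled distances into similarities: 1.0 = closest, 0.0 = farthest."""
    return [1.0 - s for s in min_max_scale(distances)]


# Hypothetical distances in the 0 to 1.55 range mentioned above.
distances = [0.0, 0.31, 1.55]
print(min_max_scale(distances))
print(to_similarity(distances))
```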
Number 2 seems to work, thanks!
For problem 1:
Setting "aggregation attributes" to "document" and "concatenation" doesn't work, as it says:
"The value type of the attribute is not compatible with the aggregation function "concatenation"."
Is there another aggregation function that works, or is there another operator that has to be put in front of "aggregate"?
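The error message suggests that the "document" column is not of a text-like type, although the thread does not confirm this. The same situation can be sketched in plain Python, where joining values with a separator only works on strings, so numeric ids have to be converted first. The function name and the sample rows below are hypothetical, and whether a type conversion in front of Aggregate resolves the RapidMiner error is an assumption.

```python
from collections import defaultdict


def concatenate_neighbours(rows, separator=","):
    """Group (request, document) rows by request and join the documents into one string."""
    grouped = defaultdict(list)
    for request, document in rows:
        # str() is the important step: joining raw numbers would raise a TypeError.
        grouped[request].append(str(document))
    return {request: separator.join(docs) for request, docs in grouped.items()}


# Hypothetical rows in the long format shown above, with numeric document ids.
rows = [("productX", 17), ("productX", 42), ("productX", 99), ("productY", 17)]
for request, documents in concatenate_neighbours(rows).items():
    print(f"{request};{documents}")
```

This yields one row per request with all neighbour ids in a single field, which is the layout asked for in question 1.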