Export "Data To Similarity" results to a CSV
Hi!
I am working with text mining in RapidMiner and the following problem has arisen:
I use the Data to Similarity operator from the Text Extension and the "sim" output port gives a table with three columns: one object, another object, and the similarity between them. However, I can't sort or export that result, which I'd love to do, in order to be able to work with that data as a CSV file.
Is there any way to export that table?
Thank you very much!
Answers
You can use Similarity to Data to get an ExampleSet out of it. Afterwards you can store it with any Write operator, for example Write CSV.
~Martin
Dortmund, Germany
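For readers who want to see the shape of the exported file outside RapidMiner, here is a minimal Python sketch. It is not the RapidMiner process: the cosine measure, the sample vectors, and the function names are assumptions made for illustration; only the three-column layout (FIRST, SECOND, SIMILARITY) comes from the thread.

```python
import csv
import io
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_rows(vectors):
    """Yield (first, second, similarity) for every ordered pair of distinct ids."""
    ids = sorted(vectors)
    for i in ids:
        for j in ids:
            if i != j:
                yield i, j, cosine(vectors[i], vectors[j])


def write_similarity_csv(rows, handle):
    """Write the long-format similarity table with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["FIRST", "SECOND", "SIMILARITY"])
    writer.writerows(rows)


# Hypothetical sample data: id -> feature vector.
vectors = {1: [1.0, 0.0], 2: [1.0, 0.0], 3: [0.0, 1.0]}

# An in-memory buffer keeps the sketch self-contained; pass an open file
# (opened with newline="") to write to disk instead.
buffer = io.StringIO()
write_similarity_csv(similarity_rows(vectors), buffer)
csv_text = buffer.getvalue()
```

Once the table is in this long format, sorting or filtering it is an ordinary row operation.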
I am facing now another problem, though.
After using the Similarity to Data operator, I have a dataset with three columns: the first id used for comparison, the second id used for comparison, and the similarity percentage. Now, I would like to combine that information with my original database (which has many attributes). I don't know how to, for example, obtain the rows from my original database where the similarity percentage is greater than 50%. Any idea?
Thank you in advance.
Use Filter Examples to remove the examples with a similarity below 0.5. Afterwards you can join the result with the original data. If you do not have an ID in the dataset, you can use Generate ID beforehand to add one.
~Martin
Dortmund, Germany
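The filter-then-join logic described above can be sketched in Python as follows. This is an illustration of the idea rather than the RapidMiner process; the "id" key, the "first_row"/"second_row" field names, and the sample rows are hypothetical.

```python
def filter_and_join(similarities, originals, threshold=0.5):
    """Drop pairs below the threshold, then attach the original rows by id."""
    by_id = {row["id"]: row for row in originals}
    joined = []
    for sim in similarities:
        if sim["SIMILARITY"] < threshold:
            continue  # same effect as filtering out examples < 0.5
        first = by_id.get(sim["FIRST"])
        second = by_id.get(sim["SECOND"])
        if first is None or second is None:
            continue  # inner-join behaviour: skip ids missing from the original data
        joined.append({**sim, "first_row": first, "second_row": second})
    return joined


# Hypothetical sample data.
originals = [
    {"id": 1, "text": "ABC is a good text"},
    {"id": 2, "text": "ABC is a good text"},
    {"id": 3, "text": "XYZ is great"},
]
similarities = [
    {"FIRST": 1, "SECOND": 2, "SIMILARITY": 1.0},
    {"FIRST": 1, "SECOND": 3, "SIMILARITY": 0.2},
]

result = filter_and_join(similarities, originals)
```

Joining on both FIRST and SECOND keeps the attributes of each side of the pair available in one record.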
Thank you very much!
However, I have a last question. I have applied Data to Similarity and then Similarity to Data right after, to be able to use the output dataset. But the dataset contains all results duplicated, since I have applied both operators. How could I prevent this from happening? Or how could I get rid of the duplicated results and just keep a row per similarity between two objects?
Thank you.
A general suggestion is to use Cross Distances instead; it is a bit more flexible.
For your question:
Do I understand correctly that each distance appears twice, once for each ordering of the two IDs? My first idea would be to create a new ID from the two IDs you have, always taking the smaller one first. So you always get a string like
SmallNumber_BigNumber
Afterwards you can use Remove Duplicates on this new ID. See the attached process.
Dortmund, Germany
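The "smaller ID first" trick can be sketched in Python like this. The attached process is not reproduced here; the function names and sample rows are hypothetical, and the sketch assumes the IDs are comparable values such as integers.

```python
def pair_key(first, second):
    """Build an order-independent key: the smaller id always comes first."""
    small, big = sorted((first, second))
    return f"{small}_{big}"


def remove_duplicate_pairs(rows):
    """Keep the first row seen for each unordered pair of ids."""
    seen = set()
    unique = []
    for first, second, similarity in rows:
        key = pair_key(first, second)
        if key not in seen:
            seen.add(key)
            unique.append((first, second, similarity))
    return unique


# Hypothetical sample data: (1, 2) and (2, 1) describe the same pair.
rows = [(1, 2, 0.9), (2, 1, 0.9), (1, 3, 0.4)]
deduplicated = remove_duplicate_pairs(rows)
```

If the IDs are stored as strings, they sort lexicographically ("10" before "9"). That is still consistent for deduplication, but the key will not always read as smaller number first.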
That worked perfectly.
And how can I get a count of rows with the same value in the text field? For the set below I want counts like:
ABC is a good text ----- 3
XYZ is great ----- 2
FIRST  SECOND  SIMILARITY  textfield
1      2       1           ABC is a good text
3      8       1           ABC is a good text
4      9       1           ABC is a good text
12     32      1           XYZ is great
31     77      1           XYZ is great
Can't you use an Aggregate operator for this?
Thanks Thomas. Results achieved.
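The grouping-and-counting that the Aggregate suggestion refers to can be illustrated in Python with the sample rows from the question. This is only a sketch of the idea, not the RapidMiner operator; the variable names are hypothetical.

```python
from collections import Counter

# Rows from the question: (FIRST, SECOND, SIMILARITY, textfield).
rows = [
    (1, 2, 1, "ABC is a good text"),
    (3, 8, 1, "ABC is a good text"),
    (4, 9, 1, "ABC is a good text"),
    (12, 32, 1, "XYZ is great"),
    (31, 77, 1, "XYZ is great"),
]

# Group by the text field and count the rows in each group.
counts = Counter(text for _, _, _, text in rows)

for text, count in counts.items():
    print(f"{text} ----- {count}")
```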