"Possible Bug: Missing Results"
I'm a bit new to RapidMiner, so I don't want to file an official bug report until I get some community feedback. Using the Text extension, I've been using the Process Data from Files operator with success. However, when I combine it with the Similarity from Data operator, the results perspective stops working. The log still reports that everything went fine, but nothing new appears.
This issue continues even after I remove the similarity operator. The only way to restore normal functioning is to close RapidMiner and delete the perspective XML files.
Am I doing something wrong, or is this a bug?
Answers
Can you provide a process (and, if it depends on the data, the data as well) so we can reproduce it?
As a general rule of thumb, if you need to delete some file afterwards to get everything working again, something is not working as intended.
Regards,
Marco
Just jumping in: another thing that came to my mind is a closed result history:
http://rapid-i.com/rapidforum/index.php/topic,3598.msg13402.html
Maybe it's simply this...
Cheers,
Ingo
http://www.mediafire.com/?6t86rwieaw5b12d
Also, this problem was replicable on another computer (Amazon EC2 instance).
Thanks for the process and data, this really helps to find the problem.
I can, at least partially, replicate your problem. The reason, however, is not a result display broken by the similarity operator but simply a very long runtime for creating the display of the similarity. On my computer, it took about 40 minutes until the tab for the similarity object was created, and another 50 minutes until the message "Please standby while the display is created..." vanished and the results finally appeared.
You can easily try this yourself:
- Use your process and text data but change the parameter "prune_below_absolute" to 200 and "prune_above_absolute" to 250: it will take about 10 seconds until the tab is created and another 10 seconds until the display creation has finished. The number of created terms is about 100.
- Now change the parameter "prune_above_absolute" to 500: it will now take about 25 seconds until the tab is created and another 40 seconds until the display creation has finished. The number of created terms with these pruning settings is about 250.
- You can repeat this by slightly increasing the setting each time: check the number of created terms and the increase in time. With your pruning settings, you ended up with more than 13000 terms, which causes the long display creation times mentioned above...
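To make the effect of the two pruning parameters concrete, here is a minimal Python sketch of pruning by absolute document frequency. It is not the Text extension's code: the function `prune_terms`, the toy documents, and the assumption that both bounds are inclusive are made up for illustration. It only shows why a wider window between "prune_below_absolute" and "prune_above_absolute" lets more terms through.

```python
from collections import Counter

def prune_terms(documents, prune_below_absolute, prune_above_absolute):
    """Keep terms whose document frequency lies within the given bounds.

    Assumption for this sketch: both bounds are inclusive.
    """
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    return sorted(
        term for term, df in doc_freq.items()
        if prune_below_absolute <= df <= prune_above_absolute
    )

# Hypothetical toy corpus, already tokenized
docs = [
    ["data", "mining", "text"],
    ["data", "text", "similarity"],
    ["data", "cluster"],
    ["data", "text", "viewer"],
]

narrow = prune_terms(docs, 3, 3)  # tight window: few terms survive
wide = prune_terms(docs, 1, 4)    # wide window: every term survives

print(narrow)
print(len(wide))
```

Each surviving term becomes one attribute of the example set, so the width of the window directly controls how many attributes the similarity has to work with.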
So the result will actually be created, but it simply takes more than an hour. During this time, one of my computer's CPUs was at 100% load, so RapidMiner really had some calculations to do. That is not too much of a problem if the similarity is used for additional calculations in the rest of an automated process, but it is certainly not much fun for an interactive exploration of the similarities.
Interesting observation: the number of examples (about 1000) was a smaller problem than the number of attributes. I would not have expected this, since the number of attributes should contribute only linearly to the runtime for most of the similarity / distance measures. I will think about that and discuss it with the others.
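As a rough sanity check of the "only linearly" expectation, the timings reported in this thread can be compared with a linear extrapolation from the smallest run. This is only a back-of-the-envelope sketch: it assumes the total time is tab creation plus display creation, takes the approximate term counts at face value (13000 is a lower bound), and the variable names are made up for illustration.

```python
# (number of terms, total seconds until the display was finished),
# taken from the approximate figures reported above
observations = [
    (100, 10 + 10),            # about 100 terms: 10 s + 10 s
    (250, 25 + 40),            # about 250 terms: 25 s + 40 s
    (13000, (40 + 50) * 60),   # more than 13000 terms: 40 min + 50 min
]

def linear_estimate(base, terms):
    """Time expected if runtime grew strictly linearly with the term count."""
    base_terms, base_seconds = base
    return base_seconds * terms / base_terms

base = observations[0]
rows = []
for terms, seconds in observations:
    expected = linear_estimate(base, terms)
    rows.append((terms, seconds, expected, seconds / expected))

for terms, seconds, expected, ratio in rows:
    print(f"{terms:>6} terms: observed {seconds:>5} s, "
          f"linear estimate {expected:>7.1f} s, ratio {ratio:.2f}")
```

With these figures, the observed times come out at roughly 1.3 times (250 terms) and roughly 2 times (13000 terms) the linear estimate. Given how coarse the measurements are, this hints at somewhat worse than linear growth, but it does not prove it.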
So this is indeed not really a bug but maybe a chance for a performance improvement in the creation of the similarity viewer (if you like, you can still file a report in our bugtracker at http://bugs.rapid-i.com as a feature request and add a link to this conversation). For now, you have several options, such as stronger pruning, filtering, stemming, and other approaches that help to reduce the number of features. If you do not want to look at the similarities themselves but simply use them in the rest of the process, I would recommend filtering down the number of attributes during process design, as in the small test above, and removing the filter once the full process has been designed.
Cheers,
Ingo
On a related note, is there a maximum limit to the size of an ExampleSet? Using a larger data input via the Similarity Data operator, I'm getting negative 637040551 examples in the result set.