DBSCAN to classify another Dataset with the same Attributes
Hi,
My idea was to classify objects with a trained DBSCAN model. If an object from the test dataset is within or near a cluster from the model, it is labeled with that cluster; when there is no such cluster for the object, it is labeled as unknown or something similar (I think "?" is the label to use). I used the "Apply Model" operator for this, but it does not work as intended. It basically checks whether an object has the same ID as an object in the training dataset, and if so, the object gets the same cluster as that training object. Basically, I am trying to create a binominal classifier.
So my question: is there any possible way to create a process that implements this idea, but without a predictive operator (tree, k-NN, ...)?
My idea was to go through every cluster with "Loop Clusters" and compare every object from the test set with the objects in the cluster (e.g. cluster_0 has 10 objects and the test set has 10 objects, so the loop runs 100 times) using a distance measure (Euclidean distance). If the result is below a threshold, the object from the test set is labeled with the cluster it was compared with. The output should be the labeled test data.
Any ideas? And thanks.
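Outside RapidMiner, the comparison loop described above could be sketched in plain Python as follows. This is only an illustration under assumptions: the function name `label_test_objects`, the toy coordinates and the threshold value are all invented for the example, and the sketch picks the nearest training object within the threshold rather than the first match.

```python
import math

NOISE = "?"  # label for test objects that match no cluster (assumption from the post)


def label_test_objects(train_points, train_clusters, test_points, threshold):
    """Give each test point the cluster of its nearest training point,
    provided that point lies within `threshold`; otherwise label it NOISE."""
    labels = []
    for test_point in test_points:
        best_label, best_dist = NOISE, threshold
        for train_point, cluster in zip(train_points, train_clusters):
            dist = math.dist(test_point, train_point)  # Euclidean distance
            if dist <= best_dist:
                best_label, best_dist = cluster, dist
        labels.append(best_label)
    return labels


# Toy data, invented for illustration only.
train_points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
train_clusters = ["cluster_1", "cluster_1", "cluster_2", "cluster_2"]
test_points = [(0.1, 0.1), (5.05, 5.0), (10.0, 10.0)]

result = label_test_objects(train_points, train_clusters, test_points, threshold=0.5)
print(result)
```

Training objects that were themselves marked as noise would have to be left out of `train_points`, otherwise test objects could be assigned to the noise group.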
Best Answer
lionelderkrikor RapidMiner Certified Analyst, Member Posts: 1,195 Unicorn
Hi @Tucka,
Your first solution is indeed not possible because, a priori, RapidMiner's DBSCAN algorithm assigns all the data to the cluster(s): in the end, there is no "unlabelled data" or "noise".
I highlighted this behaviour of RapidMiner (vs. Python's DBSCAN algorithm) in this thread at the beginning of this year. See the last two posts: thanks to the illustrations (a small picture is better than a long speech), you will see and understand the difference in behaviour between the two algorithms.
So I can propose a workaround based on a Python script (you can implement and execute Python scripts inside RapidMiner).
Indeed, given the parameters of the DBSCAN algorithm (min points / epsilon), the data satisfying these conditions are "classified" into the generated cluster(s) by Python's DBSCAN algorithm, and the others are classified as "unlabelled".
So to perform your task, we can apply your first method (train / apply a DBSCAN), but using Python's DBSCAN algorithm instead of RapidMiner's DBSCAN.
- For this task, can you share your dataset and the DBSCAN parameters you used to train this model (min points / epsilon)?
- I allow myself to ask (again) the question from the thread I mentioned above: why is RapidMiner's DBSCAN clustering all the data? Why is there no "unlabelled" data?
NB: If I'm asking these last two questions, it's because "It feels good to understand how things work" (to quote @earmijo in a recent post).
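As a rough illustration of the behaviour described above, and assuming "Python's DBSCAN" means scikit-learn's `sklearn.cluster.DBSCAN`, the following sketch shows a point that does not satisfy the density conditions receiving the noise label `-1`. The coordinates and the `eps` / `min_samples` values are made up for the example and are not the parameters used in this thread.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data, invented for illustration: two dense groups and one isolated point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # dense group A
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # dense group B
    [20.0, 20.0],                         # isolated point
])

# eps is the neighbourhood radius, min_samples the "min points" parameter.
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_

# scikit-learn marks noise with the label -1.
n_clusters = len(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(labels, n_clusters, n_noise)
```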
Regards,
Lionel
Answers
Hi @lionelderkrikor
Thanks for the Python idea! This nearly does the job, far better than the RapidMiner DBSCAN.
As you mentioned before, I have seen that RapidMiner's DBSCAN puts all "unknown" objects into a cluster where the density is not there. So I get some good-looking clusters and one that includes all the objects that are, in my opinion, "unknown", but also a whole lot of objects that are clustered on a density basis.
For now I can't share the data, sorry about that, but I can describe it as follows: it is a time series of temperature data from a system, and the data also includes some fatal errors of the system condition. My task was to find the errors and also to try to detect them automatically via data mining (DM). The DM method I chose, from the literature, was outlier detection with DBSCAN.
Thanks to you I can now detect outliers in my training data, but when I try to validate the model, Python builds a new model. So I have done some research, and my conclusion is that the DBSCAN algorithm is not capable of finding outliers in another dataset. For now I have implemented a decision tree to compensate for that, and this solution works, not quite as nicely as a normal classifier, but hey, this is something I can write down.
So thanks for that!
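On the point that validation builds a new model: assuming the Python DBSCAN here is scikit-learn's, the estimator offers `fit` and `fit_predict` but no separate `predict`, which matches that observation. One possible workaround, sketched below and not the method used in this thread, is to reuse the core samples of the fitted model and give a new point the cluster of its nearest core sample if it lies within `eps`, and `-1` otherwise. The helper name `assign_to_clusters` and the toy temperature values are hypothetical.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def assign_to_clusters(db, X_new):
    """Label new points using a fitted DBSCAN: the cluster of the nearest
    core sample if it is within db.eps, otherwise -1 (noise/outlier)."""
    X_new = np.asarray(X_new, dtype=float)
    core_points = db.components_                       # coordinates of core samples
    core_labels = db.labels_[db.core_sample_indices_]  # their cluster labels
    new_labels = np.full(len(X_new), -1, dtype=int)
    if len(core_points) == 0:
        return new_labels
    # Pairwise Euclidean distances, shape (n_new, n_core).
    dists = np.linalg.norm(X_new[:, None, :] - core_points[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    within = dists[np.arange(len(X_new)), nearest] <= db.eps
    new_labels[within] = core_labels[nearest[within]]
    return new_labels


# Toy one-dimensional "temperature" readings, invented for illustration.
X_train = np.array([[20.0], [20.2], [20.4], [20.1], [35.0], [35.2], [35.1]])
db = DBSCAN(eps=0.5, min_samples=3).fit(X_train)

X_test = np.array([[20.3], [35.05], [60.0]])
test_labels = assign_to_clusters(db, X_test)
print(test_labels)
```

With this kind of rule the training clusters stay fixed, so points from the second dataset that fall outside every cluster come back as `-1` instead of forming new clusters.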
"So my question is, why RapidMiner's DBSCAN is clustering all the data regardless of the setting epsilon / min points ?"
Yeah, this is something I'd like to know too!
Thanks
Hey @lionelderkrikor
So my question is, why RapidMiner's DBSCAN is clustering all the data regardless of the setting epsilon / min points ?
Answer: https://docs.rapidminer.com/latest/studio/operators/modeling/segmentation/dbscan.html
What the RapidMiner DBSCAN does is cluster the "noise" into "cluster_0". I have compared the results of both DBSCAN algorithms and they are nearly the same.
Hi,
You're welcome, @Tucka.
Glad that your process works fine.
And of course, thanks for your answer about the behaviour of RapidMiner's DBSCAN algorithm.
Note : For outlier detection, you have dedicated operators in RapidMiner :
...and in the extension Anomaly Detection (to install from the Marketplace) :
All the best,
Regards,
Lionel