Clustering with labels?

Fred12 · November 2016

Hi,

Is there any way to do clustering with labels in order to check performance (as in classification)? Which operator can I use for that (e.g. with k-means)?

And is there some way to cluster the data with the "help" of labels when the class is known? I mean clustering based on the given labels, e.g. finding out which class labels are clustered together, then getting the centroid of each local cluster, and so on.

Is there an existing operator that uses labels for clustering? I just want to find out more about the properties of my dataset and my classes (e.g. local cluster labels, centroid tables, etc.).

MartinLiebig · November 2016

Did you try Map Clustering on Labels, followed by the performance operators?

dang · November 2016

If you have labeled data, most of the time clustering is like bringing owls to Athens...

Of course, you can use 'Set Role' to turn the label column into a regular attribute and pretend not to have any label information. Once the data no longer has the special attribute 'label', you can do any clustering you want.

Hope that makes sense...

Fred12 · November 2016

I know the purpose of clustering, but I want to compare the found clusters with the labeled "clusters", if you know what I mean, to measure the "goodness" of the clusters against some ground truth.

Is there any sophisticated way to do so? Any ideas?

Fred12 · November 2016

Yeah, thanks, that seemed to work, but I still don't know how that operator works.

How does it choose which cluster gets which label?

MartinLiebig · November 2016

Mh, good question. The important code is in ClusterToPrediction.java - but it's quite a chunk.

	@Override
	public void doWork() throws OperatorException {
		ExampleSet exampleSet = exampleSetInput.getData(ExampleSet.class);
		ClusterModel model = clusterModelInput.getData(ClusterModel.class);

		// generate the predicted attribute
		Attribute labelAttribute = exampleSet.getAttributes().getLabel();
		PredictionModel.createPredictedLabel(exampleSet, labelAttribute);
		Attribute predictedLabel = exampleSet.getAttributes().getPredictedLabel();

		HashMap<Integer, String> intToClusterMapping = new HashMap<Integer, String>();
		int[][] mappingTable = new int[model.getNumberOfClusters()][model.getNumberOfClusters()];

		// count the occurrence of each label with every cluster
		int a = 0;
		for (int i = 0; i < model.getNumberOfClusters(); i++) {
			HashMap<String, Integer> labelOccurrence = new HashMap<String, Integer>();
			for (Example example : exampleSet) {
				String label = example.getValueAsString(labelAttribute);
				if (!labelOccurrence.containsKey(label)) {
					labelOccurrence.put(label, 0);
					if (i == 0) {
						intToClusterMapping.put(a, label);
						a++;
					}
				}
				if (example.getValue(example.getAttributes().getCluster()) == i) {
					labelOccurrence.put(label, labelOccurrence.get(label) + 1);
				}
			}

			if (i == 0 && model.getNumberOfClusters() != labelOccurrence.size()) {
				throw new UserError(this, 943, labelOccurrence.size(), model.getNumberOfClusters());
			}

			for (int j = 0; j < mappingTable[i].length; j++) {
				String clusterName = intToClusterMapping.get(j);
				int occ = labelOccurrence.get(clusterName);
				mappingTable[i][j] = occ;
			}
		}
		/*
		 * Munkres-algorithm or the hungarian method
		 */
		// find the maximum
		int maxValue = -1;
		for (int i = 0; i < mappingTable.length; i++) {
			for (int j = 0; j < mappingTable[i].length; j++) {
				if (mappingTable[i][j] > maxValue) {
					maxValue = mappingTable[i][j];
				}
			}
		}

		// compute the new (inverted) table (and column-minima)
		for (int i = 0; i < mappingTable.length; i++) {
			int minimum = Integer.MAX_VALUE;
			for (int j = 0; j < mappingTable[i].length; j++) {
				mappingTable[i][j] = maxValue - mappingTable[i][j];
				if (mappingTable[i][j] < minimum) {
					minimum = mappingTable[i][j];
				}
			}
			// subtract the column-minima
			if (minimum > 0) {
				for (int j = 0; j < mappingTable[i].length; j++) {
					mappingTable[i][j] = mappingTable[i][j] - minimum;
				}
			}
		}
		// compute and subtract the row-minima
		for (int i = 0; i < mappingTable[0].length; i++) {
			int minimum = Integer.MAX_VALUE;
			for (int j = 0; j < mappingTable.length; j++) {
				if (mappingTable[j][i] < minimum) {
					minimum = mappingTable[j][i];
				}
			}
			// subtract the row-minima
			if (minimum > 0) {
				for (int j = 0; j < mappingTable.length; j++) {
					mappingTable[j][i] = mappingTable[j][i] - minimum;
				}
			}
		}
		while (!assignmentAvailable(mappingTable)) {
			Vector<Integer> markedRows = new Vector<Integer>();
			Vector<Integer> markedColumns = new Vector<Integer>();

			// mark all rows which have no marked zero (start labeling)
			for (int i = 0; i < mappingTable[0].length; i++) {
				boolean markedZero = false;
				for (int j = 0; j < mappingTable.length; j++) {
					if (mappingTable[j][i] == Integer.MIN_VALUE) {
						markedZero = true;
						break;
					}
				}
				if (!markedZero) {
					markedRows.add(i);
				}
			}

			boolean newMarked = true;
			while (newMarked) {
				newMarked = false;
				// mark all columns with a slashed zero in a marked row
				for (int i = 0; i < mappingTable.length; i++) {
					for (int j = 0; j < mappingTable[i].length; j++) {
						if (mappingTable[i][j] == Integer.MAX_VALUE) {
							if (markedRows.contains(j) && !markedColumns.contains(i)) {
								newMarked = true;
								markedColumns.add(i);
							}
						}
					}
				}
				// mark all rows with a marked zero in a marked column
				for (int i = 0; i < mappingTable[0].length; i++) {
					for (int j = 0; j < mappingTable.length; j++) {
						if (mappingTable[j][i] == Integer.MIN_VALUE) {
							if (markedColumns.contains(j) && !markedRows.contains(i)) {
								newMarked = true;
								markedRows.add(i);
							}
						}
					}
				}
			} // end while (newMarked)

			// inverting of the marked columns
			for (int i = 0; i < mappingTable.length; i++) {
				if (!markedColumns.contains(i)) {
					markedColumns.add(i);
				} else {
					markedColumns.removeElement(i);
				}
			}

			// find the minimum in the marked range
			int minimum = Integer.MAX_VALUE;
			for (int i = 0; i < markedRows.size(); i++) {
				for (int j = 0; j < markedColumns.size(); j++) {
					if (mappingTable[markedColumns.get(j)][markedRows.get(i)] < minimum) {
						minimum = mappingTable[markedColumns.get(j)][markedRows.get(i)];
					}
				}
			}
			// subtract the minimum from all elements in the marked range
			for (int i = 0; i < markedRows.size(); i++) {
				for (int j = 0; j < markedColumns.size(); j++) {
					mappingTable[markedColumns.get(j)][markedRows.get(i)] -= minimum;
				}
			}

			// add the minimum to all elements which are neither marked in a row nor in a column
			for (int i = 0; i < mappingTable.length; i++) {
				if (!markedColumns.contains(i)) {
					for (int j = 0; j < mappingTable[i].length; j++) {
						if (!markedRows.contains(j)) {
							mappingTable[i][j] = mappingTable[i][j] + minimum;
						}
					}
				}
			}
			// reset the Integer.MIN_VALUE and Integer.MAX_VALUE to zero
			for (int i = 0; i < mappingTable.length; i++) {
				for (int j = 0; j < mappingTable[i].length; j++) {
					if (mappingTable[i][j] == Integer.MAX_VALUE) {
						mappingTable[i][j] = 0;
					}
					if (mappingTable[i][j] == Integer.MIN_VALUE) {
						mappingTable[i][j] = 0;
					}
				}
			}
		} // end while(!assignmentAvailable)

		// compute the mapping (there must be a possible assignment)
		HashMap<Integer, String> clusterToPrediction = new HashMap<Integer, String>();
		for (int i = 0; i < mappingTable.length; i++) {
			int result = -1;
			for (int j = 0; j < mappingTable[i].length; j++) {
				if (mappingTable[i][j] == Integer.MIN_VALUE) {
					result = j;
					break;
				}
			}
			String resultCluster = intToClusterMapping.get(result);
			clusterToPrediction.put(i, resultCluster);
		}

		// insert the result in the predicted attribute
		HashMap<String, Integer> predictionToCluster = new HashMap<String, Integer>();
		// set the predictedLabel in the example table and compute to each prediction the cluster
		int i = 0;
		Attribute clusterAttribute = exampleSet.getAttributes().getCluster();
		for (Example example : exampleSet) {
			String resultLabel = clusterToPrediction.get((int) example.getValue(example.getAttributes().getCluster()));
			example.setValue(predictedLabel, resultLabel);
			if (predictionToCluster.size() < model.getNumberOfClusters()) {
				if (!predictionToCluster.containsKey(example.getValueAsString(example.getAttributes().getPredictedLabel()))) {
					String clusterNumber = example.getValueAsString(clusterAttribute).replaceAll("[^\\d]+", "");
					try {
						int number = Integer.parseInt(clusterNumber);
						predictionToCluster.put(example.getValueAsString(example.getAttributes().getPredictedLabel()),
								number);
					} catch (NumberFormatException e) {
						throw new UserError(this, 145, clusterAttribute.getName());
					}
				}
			}
			i++;
		}

		// set the confidence in the example table
		i = 0;
		for (Example example : exampleSet) {
			if (model.getClass() == FlatFuzzyClusterModel.class) {
				FlatFuzzyClusterModel fuzzyModel = (FlatFuzzyClusterModel) model;
				for (int j = 0; j < clusterToPrediction.size(); j++) {
					String label = clusterToPrediction.get(j);
					example.setConfidence(label,
							fuzzyModel.getExampleInClusterProbability(i, predictionToCluster.get(label)));
				}
			} else {
				example.setConfidence(clusterToPrediction.get((int) example.getValue(example.getAttributes().getCluster())),
						1);
			}
			i++;
		}

		exampleSetOutput.deliver(exampleSet);
		clusterModelOutput.deliver(model);
	}

	/* Returns true if there is a solution available. */
	private boolean assignmentAvailable(int[][] mappingTable) {
		int markedZeros = 0;
		boolean modificationDone = true;

		while (modificationDone) {
			while (modificationDone) {
				modificationDone = false;
				// column by column
				for (int i = 0; i < mappingTable.length; i++) {
					int position = -1;
					for (int j = 0; j < mappingTable[i].length; j++) {
						if (mappingTable[i][j] == 0) {
							if (position == -1) {
								position = j;
							} else {
								position = -1;
								break;
							}
						}
					}
					if (position != -1) {
						modificationDone = true;
						mappingTable[i][position] = Integer.MIN_VALUE; // marked zero
						for (int k = 0; k < mappingTable.length; k++) {
							if (mappingTable[k][position] == 0) {
								mappingTable[k][position] = Integer.MAX_VALUE; // slashed zeros
							}
						}
						markedZeros++;
					}
				}
				if (markedZeros == mappingTable.length) {
					return true;
				}

				// line by line
				for (int i = 0; i < mappingTable[0].length; i++) {
					int position = -1;
					for (int j = 0; j < mappingTable.length; j++) {
						if (mappingTable[j][i] == 0) {
							if (position == -1) {
								position = j;
							} else {
								position = -1;
								break;
							}
						}
					}
					if (position != -1) {
						modificationDone = true;
						mappingTable[position][i] = Integer.MIN_VALUE;// marked zero
						for (int k = 0; k < mappingTable[0].length; k++) {
							if (mappingTable[position][k] == 0) {
								mappingTable[position][k] = Integer.MAX_VALUE; // slashed zeros
							}
						}
						markedZeros++;
					}
				}
				if (markedZeros == mappingTable.length) {
					return true;
				}
			}
			// modificationDone is here always false
			// ambiguous zeros
			int aktMarkedZeros = markedZeros;
			for (int i = 0; i < mappingTable.length; i++) {
				for (int j = 0; j < mappingTable[i].length; j++) {
					if (mappingTable[i][j] == 0) {
						mappingTable[i][j] = Integer.MIN_VALUE;// marked zero
						for (int k = j + 1; k < mappingTable[i].length; k++) {
							if (mappingTable[i][k] == 0) {
								mappingTable[i][k] = Integer.MAX_VALUE; // slashed zeros in the same
																		// column
							}
						}
						for (int k = 0; k < mappingTable.length; k++) {
							if (mappingTable[k][j] == 0) {
								mappingTable[k][j] = Integer.MAX_VALUE; // slashed zeros
							}
						}
						modificationDone = true;
						markedZeros++;
						break;
					}
				}
				if (aktMarkedZeros != markedZeros) {
					break;
				}
			}
			if (markedZeros == mappingTable.length) {
				return true;
			}
		}

		return false;
	}
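
The first part of doWork() above (counting labels per cluster, then inverting the table) can be looked at in isolation. The following is only a minimal sketch, not the RapidMiner code: it assumes clusters and labels are already encoded as integers 0..k-1, and the names ContingencySketch, countTable and toCost are made up for illustration.

```java
import java.util.Arrays;

class ContingencySketch {

    // counts[c][l] = number of examples assigned to cluster c that carry label l.
    // Assumption: cluster ids and label ids are both integers in 0..k-1.
    static int[][] countTable(int[] clusters, int[] labels, int k) {
        int[][] counts = new int[k][k];
        for (int n = 0; n < clusters.length; n++) {
            counts[clusters[n]][labels[n]]++;
        }
        return counts;
    }

    // Turn "bigger is better" counts into "smaller is better" costs by
    // subtracting every entry from the table maximum, as the listing does
    // before it starts the Hungarian method.
    static int[][] toCost(int[][] counts) {
        int max = 0;
        for (int[] row : counts) {
            for (int value : row) {
                max = Math.max(max, value);
            }
        }
        int[][] cost = new int[counts.length][];
        for (int c = 0; c < counts.length; c++) {
            cost[c] = new int[counts[c].length];
            for (int l = 0; l < counts[c].length; l++) {
                cost[c][l] = max - counts[c][l];
            }
        }
        return cost;
    }

    public static void main(String[] args) {
        int[] clusters = {0, 0, 1, 1, 1};
        int[] labels = {1, 1, 0, 0, 1};
        int[][] counts = countTable(clusters, labels, 2);
        System.out.println(Arrays.deepToString(counts));
        System.out.println(Arrays.deepToString(toCost(counts)));
    }
}
```

A zero in the cost table marks a cluster/label pair with the highest overlap, which is why the rest of the listing is all about finding one zero per cluster and per label.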

student_compute · June 2018

Hi, how should I use this code in the program? Where should I copy it to and use it?
Thanks, and sorry for asking.

student_compute · June 2018

hi

sorry

please help me

thanks

Muhammed_Fatih_ · June 2020

Hi @mschmitz,

One further question in this connection: which classification model does the "Map Clustering on Labels" operator consider with regard to the subsequent calculation of performance values?

Thank you in advance for your response!

Best regards!

Telcontar120 · June 2020

The Map Clustering on Labels "model" simply chooses a cluster for each class and maps to that, by minimizing the total number of errors produced by the mapping. Assignments by cluster are exclusive. It then calculates the performance metrics by looking at "predictions" (based on the mapped clusters) and the "actual" (the label). You need to have the same number of clusters as you have label classes for this operator to work.
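
To make that concrete, here is a minimal sketch of the idea, assuming a small number of clusters so that simply trying every one-to-one mapping is feasible. It is not how the operator is implemented (the listing earlier in this thread uses the Hungarian method), and the names ClusterLabelMappingSketch, bestMapping and accuracy are hypothetical. The input is a table where counts[c][l] is the number of examples in cluster c with actual label l.

```java
import java.util.Arrays;

class ClusterLabelMappingSketch {

    // Returns mapping[c] = label assigned to cluster c, chosen so that the
    // number of examples whose mapped label equals the actual label is maximal
    // (equivalently, the number of errors is minimal). Each label is used once.
    static int[] bestMapping(int[][] counts) {
        int k = counts.length;
        int[] best = new int[k];
        int[] bestScore = {-1};
        search(counts, 0, 0, new int[k], new boolean[k], best, bestScore);
        return best;
    }

    // Plain backtracking over all exclusive cluster-to-label assignments.
    private static void search(int[][] counts, int cluster, int score,
            int[] current, boolean[] used, int[] best, int[] bestScore) {
        int k = counts.length;
        if (cluster == k) {
            if (score > bestScore[0]) {
                bestScore[0] = score;
                System.arraycopy(current, 0, best, 0, k);
            }
            return;
        }
        for (int label = 0; label < k; label++) {
            if (!used[label]) {
                used[label] = true;
                current[cluster] = label;
                search(counts, cluster + 1, score + counts[cluster][label],
                        current, used, best, bestScore);
                used[label] = false;
            }
        }
    }

    // Accuracy of the "predictions" obtained by replacing each cluster with
    // its mapped label and comparing against the actual label.
    static double accuracy(int[][] counts, int[] mapping) {
        int correct = 0;
        int total = 0;
        for (int c = 0; c < counts.length; c++) {
            for (int l = 0; l < counts[c].length; l++) {
                total += counts[c][l];
                if (mapping[c] == l) {
                    correct += counts[c][l];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    public static void main(String[] args) {
        int[][] counts = { {1, 8, 1}, {7, 2, 1}, {0, 1, 9} };
        int[] mapping = bestMapping(counts);
        System.out.println(Arrays.toString(mapping));
        System.out.println(accuracy(counts, mapping));
    }
}
```

The table has to be square for this to work, which matches the requirement that the number of clusters equals the number of label classes.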
