Combine documents + weighting

simon_knoll · September 2010

Hello dear RM Team,
it would be a cool feature if the Combine Documents operator had the capability to weight incoming documents (so that the terms of one document count as more important than those of others).

all the best,
simon

simon_knoll · September 2010

I wrote a quick implementation of this on top of the Combine Documents operator source code, and it seems to be working. Any comments?

/*
 *  RapidMiner
 *
 *  Copyright (C) 2001-2009 by Rapid-I and the contributors
 *
 *  Complete list of developers available at our web site:
 *
 *       http://rapid-i.com
 *
 *  This program is free software: you can redistribute it and/or modify
 *  it under the terms of the GNU Affero General Public License as published by
 *  the Free Software Foundation, either version 3 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU Affero General Public License for more details.
 *
 *  You should have received a copy of the GNU Affero General Public License
 *  along with this program.  If not, see http://www.gnu.org/licenses/.
 */
package com.rapidminer.operator.text.io.transformer;

import java.util.ArrayList;
import java.util.List;

import com.rapidminer.operator.Operator;
import com.rapidminer.operator.OperatorDescription;
import com.rapidminer.operator.OperatorException;
import com.rapidminer.operator.Value;
import com.rapidminer.operator.ports.InputPortExtender;
import com.rapidminer.operator.ports.OutputPort;
import com.rapidminer.operator.text.Document;
import com.rapidminer.operator.text.Token;

/**
 * This operator combines several documents by appending their content to a new
 * document. The meta data will be added from all documents but the values of
 * the first documents will be overwritten by the values of the following.
 * 
 * @author Tobias Malbrecht, Sebastian Land
 */
public class CombineDocumentsOperator extends Operator {

	private InputPortExtender documentInputPorts = new InputPortExtender(
			"documents", getInputPorts());

	private OutputPort documentOutput = getOutputPorts().createPort("document");

	public CombineDocumentsOperator(OperatorDescription description) {
		super(description);
		documentInputPorts.start();
		getTransformer().addGenerationRule(documentOutput, Document.class);
	}

	@Override
	public void doWork() throws OperatorException {
		List<Document> documents = documentInputPorts.getData(true);

		List<Token> tokens = new ArrayList<Token>();
		Document result = new Document(tokens);
		// Inspect the label of each document. If it follows the pattern
		// <label>_weight_<weight>, parse <weight> as a float and multiply
		// the weight of every token of that document by it.
		for (Document document : documents) {
			String label = (String) document.getMetaDataValue("label");
			if (label == null) {
				// No label, hence no weight: copy the tokens unchanged.
				tokens.addAll(document.getTokenSequence());
				result.addMetaData(document);
				continue;
			}
			String[] parts = label.split("_weight_");
			if (parts.length > 1) {
				// Note: a malformed <weight> raises a NumberFormatException here.
				float weight = Float.parseFloat(parts[1]);
				for (Token token : document.getTokenSequence()) {
					tokens.add(new Token(token.getToken(), token.getWeight() * weight));
				}
			} else {
				tokens.addAll(document.getTokenSequence());
			}

			// Cosmetic only: strip the weight suffix from the label.
			document.addMetaData("label", parts[0],
					document.getMetaDataType("label"));

			result.addMetaData(document);
		}
		documentOutput.deliver(result);
	}
}

fischer · September 2010

Hi,

we have thought about this and think it is a good idea in general. However, assuming that you have something like "label_weight_0.7" in the annotations looks a bit weird. We should at least have a weight meta data or something similar that does not require this parsing operation. How are you constructing this string in your case?

Best,
Simon

simon_knoll · September 2010

Hi Simon,
putting the weight into the label was the easiest way for me to integrate it into my program.
Which string are you talking about?

If you mean the string for the label, then it goes like this:

First, a bit of context:
I want to cluster web services, and for that I have documents related to each service. As not every document has the same importance, I have to weight them.

Now, how I build the label name:
the prefix is always the service ID, followed by "_weight_" and then a weight value such as 0.5
e.g.: SMSService01_weight_0.5
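
To make the naming scheme concrete, here is a minimal sketch of how such a label could be split back into service ID and weight. The class WeightedLabel is a hypothetical helper for illustration only, not part of RapidMiner, and treating a label without a suffix as weight 1.0 is an assumption.

```java
// Hypothetical helper, not part of RapidMiner: splits "<id>_weight_<w>" labels.
final class WeightedLabel {
    static final String SEPARATOR = "_weight_";

    final String id;
    final float weight;

    WeightedLabel(String id, float weight) {
        this.id = id;
        this.weight = weight;
    }

    /** Parses a label; labels without a weight suffix get the neutral weight 1.0 (assumption). */
    static WeightedLabel parse(String label) {
        int pos = label.lastIndexOf(SEPARATOR);
        if (pos < 0) {
            return new WeightedLabel(label, 1.0f);
        }
        String id = label.substring(0, pos);
        // Float.parseFloat throws NumberFormatException on a malformed weight.
        float weight = Float.parseFloat(label.substring(pos + SEPARATOR.length()));
        return new WeightedLabel(id, weight);
    }
}
```

For example, WeightedLabel.parse("SMSService01_weight_0.5") yields the ID "SMSService01" and the weight 0.5.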

all the best,
simon

fischer · September 2010

Hi Simon,

thanks for clarifying this. Actually, I was wondering which operator you are using to construct these strings. Is it an RM operator or your own implementation?

Do you agree that this concatenation of strings is not the most elegant solution if we want to incorporate it into the release?

Best,
Simon

simon_knoll · September 2010

Hi Simon,

The string is not constructed by a RapidMiner operator but by my own code, where I set the label names of the Create Document operators.

But I agree with you that for a release there should be a more elegant and general way, perhaps a meta data value that can be set for every document, as you mentioned in your previous post.

This was just quick and dirty code that fit into my own implementation. If this makes it into a release, I would also appreciate being able to handle it via meta data, for instance.

all the best,
Simon

fischer · September 2010

Hi,

if you change that so we have an additional meta data field "weight" which always contains a number, I would copy that to the next release. What do you think?
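
A rough sketch of what combining with a numeric "weight" meta data field could look like follows. Tok, Doc and WeightedCombiner are hypothetical stand-ins so the example is self-contained; they are not the Text Extension's actual Token and Document classes, and defaulting to 1.0 when no weight is set is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for the Text Extension's Token and Document classes.
final class Tok {
    final String text;
    final float weight;

    Tok(String text, float weight) {
        this.text = text;
        this.weight = weight;
    }
}

final class Doc {
    final List<Tok> tokens = new ArrayList<Tok>();
    final Map<String, Object> metaData = new HashMap<String, Object>();
}

final class WeightedCombiner {
    /** Concatenates all tokens, scaling each by its document's numeric "weight" entry. */
    static Doc combine(List<Doc> documents) {
        Doc result = new Doc();
        for (Doc doc : documents) {
            Object value = doc.metaData.get("weight");
            // Assumption: documents without a numeric weight keep their token weights.
            float factor = (value instanceof Number) ? ((Number) value).floatValue() : 1.0f;
            for (Tok tok : doc.tokens) {
                result.tokens.add(new Tok(tok.text, tok.weight * factor));
            }
            // Later documents overwrite earlier meta data values.
            result.metaData.putAll(doc.metaData);
        }
        // The per-document weight has been applied and has no meaning for the result.
        result.metaData.remove("weight");
        return result;
    }
}
```

With the weight stored as a number, no label parsing is needed and the label itself stays untouched.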

Best,
Simon

simon_knoll · October 2010

Hi Simon,
sorry for the late answer. I would appreciate it if this feature made it into the next release.

When will the next release happen?

all the best
simon

land · October 2010

Hi,
we will include weighting in the next major release of the Text Extension. There are many ongoing changes besides this one, so it might take some time.

Greetings,
Sebastian
