"Clustering and Normalization"
Dear All,
I have a dataset which consists of 20 numeric variables.
I would like to apply a z-score transformation to all variables: I use the Normalization operator and everything is fine up to this point.
The problem now is that I want to de-normalize the values of all 20 fields back to the original values, so that the cluster values make sense.
1) Is there a node to do this for all 20 fields?
2) If not, can someone provide an example of how to do it for a single field only?
Thanks!
Answers
The only hint I can give you is to use AttributeConstruction. Unfortunately you have to include the mean and stdev manually.
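For what it's worth, the arithmetic you would put into AttributeConstruction is just the inverse of the z-score transform. A minimal sketch of that math in plain Python (not RM syntax, and with a made-up column), just to show the formula:

```python
import numpy as np

# toy stand-in for one of the 20 numeric attributes
x = np.array([4.0, 7.0, 1.0, 9.0, 5.0])

mu, sigma = x.mean(), x.std()   # the mean and stdev you have to supply manually
z = (x - mu) / sigma            # z-score normalization
x_back = z * sigma + mu         # de-normalization: value = z * stdev + mean

assert np.allclose(x, x_back)   # round-trips back to the original values
```

So per attribute the expression is simply `value * stdev + mean`.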
regards,
Steffen
The nice thing about RM is that you can do things in many different ways... Bit of a mess, because normalization seems to hit objects even if you store them away, but it does the job... I think.
PS Can someone prod Ingo towards his PM box here, thanks.
However, I do not understand the example given: where is the de-normalization happening for every attribute?
Thanks again!
Really interesting method!
The problem, though, is that the clustering output still does not show you the de-normalized values, such as in:
Cluster 0 :
attr1 : x
attr2 : y
attr3 : z
with x,y,z being DE-normalized
Perhaps a DE-normalize operator would be useful!?
The original problem was... The method shows the original values, or do you not agree? If by "DE-normalized" you mean something other than the "original" values then perhaps so, but that was not the question.
In short, I disagree that a de-normalizer operator is necessary, because you can always just keep the originals!
First of all: ***Thanks for your help***, I do not mean to sound rude :-)
However, the *full* quote was:
Notice that the last part says: "so that cluster values make sense"
Unfortunately this is not the case with your solution. Again, I do not want to appear rude; I am just giving my opinion that perhaps an operator would prove helpful. Just trying to add my 2 cents...
Thanks!
I'm always amused by posts that start "i do not mean to sound rude".
Versions one and two of the code did the job. Did you run them? Version three was only put in to make things clearer for you. Something got flipped and the clusters got lost. So I'll edit version three out.
Maybe you'll want to edit your last post as well.
So that means there can be an output like the one I explained? To have the numbers in the cluster model as they were prior to the normalization? I would sure like to see how this is possible, because this is actually what I wanted originally. And sure, if you explain why I should, I will edit my post, no problem!
The point is that your solution does NOT output a ***Clustering Model window*** with de-normalized values! The sequence should be the following (there is a rough sketch of it at the end of this post):
1) Get the unnormalized values
2) Normalize them
3) Run the clustering model using the normalized values
4) Show the CLUSTERING MODEL'S RESULTS DE-NORMALIZED. I do *not* want every row's associated de-normalized value!!
Your solution does not do step (4); it writes the de-normalized values to a table! Do you understand the difference, Haddock??
Please try to understand what is sought here...
From what I can tell (as Steffen said), there is no way to do this automatically in RM. If someone else can help with this, please do so.
Thanks!
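Just to pin down what step (4) should produce, here is a rough sketch of that whole sequence outside RM (Python/scikit-learn, with made-up data and attribute counts), where it is the *model's* centroids that come back on the original scale, not each row:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# made-up data standing in for the numeric attributes
rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 200.0, 0.5], scale=[2.0, 50.0, 0.1], size=(100, 3))

scaler = StandardScaler()
Z = scaler.fit_transform(X)                                   # steps (1)-(2): normalize

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(Z)   # step (3): cluster on z-scores

centroids = scaler.inverse_transform(km.cluster_centers_)     # step (4): de-normalize the MODEL
print(centroids)   # one row per cluster, in the original units
```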
He's not talking about having an ExampleSet that contains both the raw values and the normalized values for each data point. He wants to describe the clusters on the data's natural scale. This would help, for example, in explaining what the clusters are to other people, or even just in better interpreting the model himself.
If my reading of the problem is correct, then the following discussion may be helpful...
You'd need to know the mean and standard deviation of each attribute in the original data to convert the normalized centroid values to original scale values (i.e. "denormalize"). While RM computes the sum and std dev as part of the meta data view of an ExampleSet, I'm not sure there's a way to get to those values. If you're reading data from a database, you might be able to have a second DatabaseExampleSource with a query that returns the mean and std dev for each attribute.
Once you have the mean and std dev, you need to get the centroid values into an example set. I haven't worked with clustering models, so I don't know how this would be done in RM. But once you have both the mean+stddev and the centroid values, you can probably use one of the Join operators to match up the clusters with their mean+stdev, and then use AttributeConstruction (as steffen mentioned in the first reply to this thread) to build the centroid values on the original data's scale.
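To make the shape of that construction step concrete, here is a rough equivalent in pandas rather than RM operators (the frame names and numbers are made up, and whether the std uses n or n-1 should match whatever the Normalization operator used):

```python
import pandas as pd

# original (un-normalized) data and the centroid table on the z-score scale
df = pd.DataFrame({"attr1": [4.0, 7.0, 1.0], "attr2": [10.0, 30.0, 20.0]})
centroids_z = pd.DataFrame({"attr1": [-0.5, 0.9], "attr2": [1.1, -0.3]},
                           index=["cluster_0", "cluster_1"])

stats = df.agg(["mean", "std"])                  # mean and std dev per attribute
centroids_original = centroids_z * stats.loc["std"] + stats.loc["mean"]
print(centroids_original)                        # centroids back on the original scale
```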
Hopefully this doesn't add further confusion to the situation...
Keith
Now that I do understand (and curiously, he'll still need the original/raw data), I think this does the necessary, and it works out the average for each cluster - I just added a change of role on the cluster attribute and an OLAP operator to my original offering. Thanks again for bringing clarity to the question; how we were meant to get that from the original question remains a mystery to me.
This is what I am talking about, and Steffen understood what I meant right from my 1st post.
So by using attribute construction it can be done, but imagine building new attributes for 60 input variables! So the question is whether some node can be used to calculate all this information for, say, 60 attributes, and I guess this cannot happen (?) as Steffen originally said.
@haddock
It appears that you still don't get it, but maybe I am wrong... Can you do the same example that you last posted for 60 input variables? How much time would it take you to do it? Let alone also having to apply a log transformation to each of the 60 variables to fix their skewed distributions...
@haddock:
I did not know the MovingAverage operator yet... really nice. However, it seems the calculation of the stdev is messed up, isn't it?
@hgwelec:
haddock's second process does exactly what you want. He calculates the cluster centroids from the de-normalized (i.e. not normalized) values and hence gets the de-normalized cluster centers (this is only correct if the cluster centroids of the clustering operator are calculated as means, which is the case for KMeans). The issue of scalability remains, but: either you add an entry for each attribute in the aggregation operator manually, OR you use a loop... in JAVA, which means hacking an operator yourself. I do not see another option.
Again we have faced an example of the law of leaky abstraction ...
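Just to illustrate the aggregation idea itself (outside RM, so it does not answer the operator question): with K-Means centroids being means, the de-normalized centers are simply the per-cluster means of the original values, however many attributes there are. A quick pandas sketch with made-up data, assuming the labels come from the clustering run:

```python
import pandas as pd

# un-normalized data; any number of attribute columns works the same way
df_original = pd.DataFrame({
    "attr1": [4.0, 7.0, 1.0, 9.0],
    "attr2": [10.0, 30.0, 20.0, 40.0],
})
df_original["cluster"] = ["c0", "c1", "c0", "c1"]   # labels from the K-Means run

denormalized_centroids = df_original.groupby("cluster").mean()
print(denormalized_centroids)   # one row per cluster, in the original units
```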
kind regards,
Steffen
PS: the process of haddock is ok, but I did not check the calculation of the values by an example (just to be sure) .. my head is a little fuzzy today...
Reminds me of an old Oxford philosophy exam story.....
Is this a question?
Yes, if this is an answer.
If it was possible to access the centroid values directly and apply the mean/stdev calculations from your first code sample, that would probably be a more scalable solution than joining the data to itself and computing the sum/stdev across the entire data set (depends on how many rows he's dealing with). It would also (I think) handle the case where the cluster centers are calculated by something other than mean (as steffen alludes to). But what you presented certainly solves the problem as presented. Thanks, I learned something today.
That's what's great about having a forum where you get many eyeballs looking at a question. For example, to me, when I read: ... it was pretty quickly apparent that, even if he didn't have the terminology quite right, he was talking about data that describe the clusters ("cluster values" a.k.a. centroids), and meant "original scale" rather than "original values". But I never would have come up with the solution haddock did.
Despite the frustrations expressed on this thread, this forum is still a friendlier place for earnest newbies (which I was not that long ago) to learn RapidMiner than the R-help list is for R, and is one of the many things I think is great about RM.
Keith
Both you and Steffen come out of this episode as very solid citizens who deserve the respect you get, so many thanks to you both on behalf of all Rapido heads. I've learnt from two sources, Ralf's most excellent course, and trying to answer the puzzles set right here, so absolutely spot on, my friend, spot on.
The below is a tricky (in fact, a very tricky) way of extracting the centroid values directly from the model.
A Note:
1. This method can be applied even for KMedoids... I mean to say, it also avoids the issue of "What if the cluster centers are not the mean?".
2. The centroid values are accurate to three decimal places, because the centroid values are read as-is from the "Text View" of the model. If the "Text View" gave, say, five digits after the decimal point, then the same would be the result in the ExampleSet produced.
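For anyone curious about the idea, here is a toy sketch of the parsing step in Python rather than RM; the layout of the Text View shown below is only an assumption, so the pattern would need adjusting to whatever the model actually prints:

```python
import re

# assumed (not verified) layout of the model's Text View
model_text = """Cluster 0
attr1 = 0.482
attr2 = -1.137
Cluster 1
attr1 = -0.654
attr2 = 0.991
"""

centroids, current = {}, None
for line in model_text.splitlines():
    if line.startswith("Cluster"):
        current = line.strip()
        centroids[current] = {}
    else:
        m = re.match(r"\s*(\w+)\s*=\s*(-?\d+(?:\.\d+)?)", line)
        if m and current is not None:
            centroids[current][m.group(1)] = float(m.group(2))

print(centroids)   # values are only as precise as the text view prints them
```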
Best,
Shubha Karanth
I think there is a problem with your first example, because it only covers the case where there are two clusters, and with the second there is no data by the time of the first split, so I'm not sure why it is here at all. Bemused readers should run to the break, like this (I've just removed the drive letter and put in a break)... Perhaps you could explain what I've missed? ;D
Good weekend!
I have seen other users express that my terminology was not correct; I have no reason to think otherwise, and so I have to agree. It wasn't.
But since the essence of discussions in this forum is both to solve our problems *and* to draw some insights as to how RM can become better, I feel that even though JAVA code could be a solution (when the dataset contains MANY attributes), for users who do not have the necessary programming skills the problem cannot be easily fixed.
Since normalization prior to any clustering process is usually required, perhaps a De-Normalize node would prove to be very useful.
Many Thanks!