Two simple questions

earmijo Member Posts: 271 Unicorn
edited November 2018 in Help
I teach Data Mining at a Business School and I'm considering using Rapid-Miner as the official software (last year I used XLMiner and Rattle/R). I'm translating everything I did with those two packages to Rapid-i.  I have two very simple questions.

1) After running a cluster algorithm (say k-means), I'd like to get some basic stats (means, medians, st devs) BY cluster membership. Can I do that?

2) Suppose I have a set of variables (beer = label; income, education, age, woman, etc. = attributes) and I want to run a simple linear regression. I want to be able to manually leave some variables out. For instance, I want to omit "age" and "woman". How could I do that? I've tried to use FeatureNameFilter but I can only list one of the two. (I've tried to separate the list of variables I want to omit with commas, semicolons, etc., with no success.)

Thanks in advance for any help,

E.

Answers

  • TobiasMalbrecht Moderator, Employee-RapidMiner, Member Posts: 295 RM Product Management
    Hi,
    earmijo wrote:

    I teach Data Mining at a Business School and I'm considering using Rapid-Miner as the official software (last year I used XLMiner and Rattle/R). I'm translating everything I did with those two packages to Rapid-i.  I have two very simple questions.
    That is great; we really appreciate it when RapidMiner is used in data mining classes. Would you mind telling us which Business School you are teaching at? We are always curious where RM is used... :)

    Now back to your questions, they are actually ... well .. quite simple! ;)
    earmijo wrote:

    1) After running a cluster algorithm (say k-means), I'd like to get some basic stats (means, medians, st devs) BY cluster membership. Can I do that?
    Place an [tt]Aggregation[/tt] operator after the clustering algorithm. You then have to specify which attributes should be aggregated and by which function (mean, median, stddev, min, max, etc.). As the [tt]group_by[/tt] attribute, specify the cluster id.
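
As a rough illustration of what a group-by aggregation computes (this is plain Python, not RapidMiner code), the sketch below groups made-up values of one attribute by cluster id and reports the mean, median, and standard deviation per cluster. The rows, the cluster labels, and the use of the sample standard deviation are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean, median, stdev

# Made-up rows for illustration: (cluster id, value of one attribute).
rows = [
    ("cluster_0", 5.0), ("cluster_0", 5.5), ("cluster_0", 4.5),
    ("cluster_1", 6.0), ("cluster_1", 7.0), ("cluster_1", 8.0),
]

# Collect the attribute values separately for each cluster id (the "group by" step).
groups = defaultdict(list)
for cluster, value in rows:
    groups[cluster].append(value)

# Apply each aggregation function to every group.
# Assumption: stdev here is the sample standard deviation (n - 1 denominator).
summary = {
    cluster: {"mean": mean(values), "median": median(values), "stdev": stdev(values)}
    for cluster, values in groups.items()
}

for cluster, stats in sorted(summary.items()):
    print(cluster, stats)
```

The result has one row of statistics per cluster, which is the shape of output the question asks for.
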
    earmijo wrote:

    2) Suppose I have a set of variables (beer = label; income, education, age, woman, etc. = attributes) and I want to run a simple linear regression. I want to be able to manually leave some variables out. For instance, I want to omit "age" and "woman". How could I do that? I've tried to use FeatureNameFilter but I can only list one of the two. (I've tried to separate the list of variables I want to omit with commas, semicolons, etc., with no success.)
    The [tt]FeatureNameFilter[/tt] recognizes regular expressions. A regular expression that matches both attributes, age and woman, is [tt]age|woman[/tt]. The [tt]|[/tt] acts like a logical or. By the way: the [tt]FeatureNameFilter[/tt] has been replaced by the [tt]AttributeFilter[/tt] operator, which also lets you filter by conditions other than given names or regular expressions.
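
To see how the alternation behaves, here is a small sketch in plain Python (not RapidMiner code) that applies the pattern [tt]age|woman[/tt] to the attribute names from the question. It assumes the pattern is compared against the whole attribute name; the list of names is taken from the question and the variable names are made up for the example.

```python
import re

# Attribute names from the question above (the label "beer" included).
attributes = ["beer", "income", "education", "age", "woman"]

# The pattern from the answer: "|" means "either the left or the right alternative".
pattern = re.compile(r"age|woman")

# Assumption: the filter compares the pattern against the whole attribute name,
# so fullmatch is used rather than a substring search.
removed = [name for name in attributes if pattern.fullmatch(name)]
kept = [name for name in attributes if not pattern.fullmatch(name)]

print("removed:", removed)
print("kept:", kept)
```

A comma or semicolon inside the pattern would be treated as a literal character of the name, which would explain why separating the names that way did not work.
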

    Hope that helps,
    Tobias
  • earmijo Member Posts: 271 Unicorn
    Thanks, Tobias, for your quick response. I teach at the Rotterdam School of Management in Europe and INCAE Business School in Latin America. The answer about filtering solved my problem perfectly, but I couldn't make the one about clustering work. Here it is applied to one of the sample programs. The program complains that 'cluster' is not a valid variable (but that's the name given by the program to the cluster_id).

    <operator name="Root" class="Process" expanded="yes">
        <parameter key="logverbosity" value="warning"/>
        <operator name="ExampleSource" class="ExampleSource">
            <parameter key="attributes" value="../data/iris.aml"/>
        </operator>
        <operator name="KMeans" class="KMeans">
            <parameter key="k" value="3"/>
        </operator>
        <operator name="Aggregation" class="Aggregation">
            <list key="aggregation_attributes">
              <parameter key="a1" value="average"/>
            </list>
            <parameter key="group_by_attributes" value="cluster"/>
        </operator>
    </operator>
  • TobiasMalbrecht Moderator, Employee-RapidMiner, Member Posts: 295 RM Product Management
    Hi,

    The problem here is that the [tt]Aggregation[/tt] operator does not look at special attributes when matching the names given as parameters. Hence, you have to turn the special cluster attribute (named cluster) into a regular attribute. You can do this by placing a [tt]ChangeAttributeRole[/tt] operator between the clustering operator and the aggregation operator. You can use this code ...

        <operator name="ChangeAttributeRole" class="ChangeAttributeRole">
            <parameter key="name" value="cluster"/>
        </operator>
    Hope that solves the problem.
    Regards,
    Tobias
  • earmijo Member Posts: 271 Unicorn
    Fantastic. Thanks for your time.