KernelKMeans now produces an error when classifying text
RM team
I have switched to RM 4.2. I began testing by using an existing project that classifies text with KernelKMeans. Text is read from a database and passed through StringTextInput and StringTokenizer. This operator chain worked before. Now I receive an error message:
Error 104 - non-numeric
Error in: KernelKMeans (KernelKMeans) The example set contains non-numerical attribute #0: StockItemDesc (nominal/single_value)/values=
Using KMedoids to classify text works. Looking at the metadata with ExampleVisualizer, there are string vectors and weights.
Here is the project.
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/XXX"/>
<parameter key="id_attribute" value="IDNbr"/>
<parameter key="password" value="y6sa3JX9Wrc="/>
<parameter key="query" value="SELECT [Text], [IDNbr] FROM [Classify]"/>
<parameter key="username" value="sa"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
<operator name="ExampleVisualizer" class="ExampleVisualizer" breakpoints="before">
</operator>
<operator name="KernelKMeans" class="KernelKMeans" breakpoints="after">
<parameter key="k" value="500"/>
<parameter key="kernel_type" value="KernelDot"/>
</operator>
<operator name="ClusterModel2ExampleSet" class="ClusterModel2ExampleSet">
<parameter key="keep_cluster_model" value="false"/>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="example_set_file" value="Example.dat"/>
<parameter key="special_format" value="$i $v[cluster]"/>
</operator>
</operator>
Thanks for your help.
B
Answers
It does not fail immediately on starting like KernelKMeans does now.
Are you sure this process worked with RM 4.1 and before? I am asking because, as far as I can see, the "usual" kernel functions of RapidMiner are used, and those never supported nominal values...
However, you could of course use the operator Nominal2Numeric before the clustering; it might even be more appropriate to apply Nominal2Binominal first.
Cheers,
Ingo
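A minimal sketch of the Nominal2Numeric suggestion, reusing the operator names from the process posted above. The only addition is a Nominal2Numeric operator placed directly before KernelKMeans. It is shown with default parameters, and whether the resulting numeric coding is meaningful for clustering depends on the data.

```xml
<!-- Sketch: convert any remaining nominal attributes before the kernel-based clustering. -->
<operator name="Nominal2Numeric" class="Nominal2Numeric">
</operator>
<operator name="KernelKMeans" class="KernelKMeans">
    <parameter key="k" value="500"/>
    <parameter key="kernel_type" value="KernelDot"/>
</operator>
```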
I reinstalled RM 4.1 alongside RM 4.2. I tested this project. It runs under 4.1 and fails under 4.2.
Same SQL query to pull records and same text in the records.
+++++++++++++
<operator name="Root" class="Process" expanded="yes">
<description text="#ylt#h3#ygt#Specifying texts by an example set#ylt#/h3#ygt##ylt#p#ygt#Using the parameter list or the wizard are simple methods for setting up the directories from which the text documents are read. Sometimes, however, a more flexible solution is needed. If, for instance, your text documents have different types of encoding or are written in different languages, you might wish to provide this information for each input directory separately.#ylt#/p#ygt# #ylt#p#ygt#You can do this by using an example set that contains one row for each input directory and corresponding attributes for source, encoding, type and class. If such an example set is provided, the texts in the parameter list are ignored.#ylt#/p#ygt#"/>
<operator name="DatabaseExampleSource" class="DatabaseExampleSource">
<parameter key="database_system" value="Microsoft SQL Server (JTDS)"/>
<parameter key="database_url" value="jdbc:jtds:sqlserver://localhost:1433/SqlServer"/>
<parameter key="id_attribute" value="RecID"/>
<parameter key="password" value="y6sa3JX9Wrc="/>
<parameter key="query" value="SELECT [Text1], [Text2], [RecID] FROM
<parameter key="username" value="sa"/>
</operator>
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
<parameter key="filter_nominal_attributes" value="true"/>
<list key="namespaces">
</list>
<operator name="StringTokenizer" class="StringTokenizer">
</operator>
</operator>
<operator name="ExampleVisualizer" class="ExampleVisualizer">
</operator>
<operator name="KernelKMeans" class="KernelKMeans">
<parameter key="k" value="500"/>
<parameter key="kernel_type" value="KernelDot"/>
</operator>
<operator name="ClusterModel2ExampleSet" class="ClusterModel2ExampleSet">
<parameter key="keep_cluster_model" value="false"/>
</operator>
<operator name="ExampleSetWriter" class="ExampleSetWriter">
<parameter key="example_set_file" value="C:\TestDataOutput.dat"/>
<parameter key="special_format" value="$i $v[cluster]"/>
</operator>
</operator>
+++
4.2 error message
Error in: KernelKMeans (KernelKMeans) The example set contains non-numerical attribute #0: StockItemDesc
++++++++++++++++++
<as far as I can see the "usual" kernel functions of RapidMiner are used and those never supported nominal values>
Doesn't the FilterNominalAttributes convert the attributes to a usable format for further processing?
Thanks for your help.
B.
Yes, but with the new parameter they are also still kept as part of the example set as long as "remove_original_attributes" is set to "false". Instead of removing them directly here (with the parameter setting mentioned above), you could of course also use the operator "AttributeFilter" after the text processing to filter out all nominal attributes and keep only the numerical ones.
Cheers,
Ingo
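A sketch of the first option, assuming the "remove_original_attributes" parameter belongs to the StringTextInput operator (the thread names the parameter but does not state explicitly which operator carries it). Everything else is copied from the process posted above.

```xml
<operator name="StringTextInput" class="StringTextInput" expanded="yes">
    <parameter key="filter_nominal_attributes" value="true"/>
    <!-- Assumption: this parameter sits on StringTextInput; "true" drops the original
         nominal text attributes so only the generated numerical ones remain. -->
    <parameter key="remove_original_attributes" value="true"/>
    <list key="namespaces">
    </list>
    <operator name="StringTokenizer" class="StringTokenizer">
    </operator>
</operator>
```

With the original nominal attributes removed, KernelKMeans should no longer see the non-numerical attribute it reports in error 104.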
This runs successfully now. Thanks for the help.
B.