Text Mining / Clustering / Label Prediction
Hello there,
I am playing around with some text processing. I've got a collection of about 1,000 sports news articles (especially soccer/football) collected from different RSS feeds.
To start from a good basis, I categorized them all manually into 7 categories. That leads to the following distribution (labels are in German):
| label      | count | %      |
| Teamnews   | 430   | 37.01  |
| Rest       | 166   | 14.29  |
| Transfers  | 143   | 12.31  |
| Skandal    | 141   | 12.13  |
| Verletzung | 124   | 10.67  |
| Management | 99    | 8.52   |
| Liganews   | 59    | 5.08   |
| Total      | 1162  | 100.00 |
My aim now is to set up a prediction model that categorizes future articles on its own.
That's where I'm stuck a little. Basically, I do the following text processing:
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.001">
<operator activated="true" class="text:process_document_from_data" compatibility="7.2.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="136">
<parameter key="create_word_vector" value="true"/>
<parameter key="vector_creation" value="TF-IDF"/>
<parameter key="add_meta_information" value="true"/>
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="none"/>
<parameter key="prune_below_percent" value="3.0"/>
<parameter key="prune_above_percent" value="30.0"/>
<parameter key="prune_below_rank" value="0.05"/>
<parameter key="prune_above_rank" value="0.95"/>
<parameter key="datamanagement" value="double_sparse_array"/>
<parameter key="select_attributes_and_weights" value="false"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.2.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="112" y="34">
<parameter key="mode" value="non letters"/>
<parameter key="characters" value=".:"/>
<parameter key="language" value="German"/>
<parameter key="max_token_length" value="3"/>
</operator>
<operator activated="true" class="text:filter_by_length" compatibility="7.2.000" expanded="true" height="68" name="Filter Tokens (2)" width="90" x="246" y="34">
<parameter key="min_chars" value="4"/>
<parameter key="max_chars" value="25"/>
</operator>
<operator activated="true" class="text:transform_cases" compatibility="7.2.000" expanded="true" height="68" name="Transform Cases (3)" width="90" x="380" y="34">
<parameter key="transform_to" value="lower case"/>
</operator>
<operator activated="true" class="text:filter_stopwords_german" compatibility="7.2.000" expanded="true" height="68" name="Filter Stopwords (2)" width="90" x="514" y="34">
<parameter key="stop_word_list" value="Standard"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.2.000" expanded="true" height="68" name="Filter Stopwords (3)" width="90" x="648" y="34"/>
<operator activated="true" class="open_file" compatibility="7.2.001" expanded="true" height="68" name="Open File (2)" width="90" x="715" y="136">
<parameter key="resource_type" value="file"/>
<parameter key="filename" value="C:\Master\RSS\stoplist_manuell.txt"/>
</operator>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.2.000" expanded="true" height="82" name="Filter Stopwords (4)" width="90" x="849" y="34">
<parameter key="case_sensitive" value="false"/>
<parameter key="encoding" value="SYSTEM"/>
</operator>
<operator activated="true" class="open_file" compatibility="7.2.001" expanded="true" height="68" name="Open File (3)" width="90" x="916" y="136">
<parameter key="resource_type" value="file"/>
<parameter key="filename" value="C:\Master\RSS\stoplist_manuell_begriffe_aller_kategorien.txt"/>
</operator>
<operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.2.000" expanded="true" height="82" name="Filter Stopwords (5)" width="90" x="1050" y="34">
<parameter key="case_sensitive" value="false"/>
<parameter key="encoding" value="SYSTEM"/>
</operator>
<connect from_port="document" to_op="Tokenize (2)" to_port="document"/>
<connect from_op="Tokenize (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
<connect from_op="Filter Tokens (2)" from_port="document" to_op="Transform Cases (3)" to_port="document"/>
<connect from_op="Transform Cases (3)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
<connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Stopwords (3)" to_port="document"/>
<connect from_op="Filter Stopwords (3)" from_port="document" to_op="Filter Stopwords (4)" to_port="document"/>
<connect from_op="Open File (2)" from_port="file" to_op="Filter Stopwords (4)" to_port="file"/>
<connect from_op="Filter Stopwords (4)" from_port="document" to_op="Filter Stopwords (5)" to_port="document"/>
<connect from_op="Open File (3)" from_port="file" to_op="Filter Stopwords (5)" to_port="file"/>
<connect from_op="Filter Stopwords (5)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
</process>
In another process I filtered by label, checked the created word lists, and was satisfied with the results: it recognized the most "important" words for every label.
I stored them in a MySQL DB. I also created a top-50 word list which includes the 50 most frequent words per label, but I do not use either list right now.
But back to my current problem. To create a model I chose the X-Validation operator and tried different classification learners (e.g. Naive Bayes, k-NN, ID3, and Decision Tree).
Because the results of the Performance operator were so disappointing in all cases, I also used the Optimize Parameters operator, unfortunately without success.
For example, I got an accuracy of 12.48% with my k-NN prediction model.
Here is an example output:
accuracy: 12.48% +/- 0.59% (micro: 12.48%)
|                  | true Skandal | true Management | true Transfers | true Verletzung | true Teamnews | true Rest | true Liganews | class precision |
| pred. Skandal    | 141          | 98              | 142            | 124             | 430           | 161       | 58            | 12.22%          |
| pred. Management | 0            | 0               | 1              | 0               | 0             | 0         | 0             | 0.00%           |
| pred. Transfers  | 0            | 1               | 0              | 0               | 0             | 0         | 0             | 0.00%           |
| pred. Verletzung | 0            | 0               | 0              | 0               | 0             | 0         | 0             | 0.00%           |
| pred. Teamnews   | 0            | 0               | 0              | 0               | 0             | 0         | 0             | 0.00%           |
| pred. Rest       | 0            | 0               | 0              | 0               | 0             | 4         | 1             | 80.00%          |
| pred. Liganews   | 0            | 0               | 0              | 0               | 0             | 1         | 0             | 0.00%           |
| class recall     | 100.00%      | 0.00%           | 0.00%          | 0.00%           | 0.00%         | 2.41%     | 0.00%         |                 |
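As the matrix shows, the model predicts "Skandal" for nearly every article. A useful sanity check is the majority-class baseline: always guessing the most frequent label ("Teamnews") from the distribution in the question already beats 12.48% by a wide margin, so the learner is doing worse than no learning at all. A quick sketch:

```python
# Sanity-check baseline: always predict the most frequent label ("Teamnews").
# Counts are taken from the class distribution given in the question.
counts = {
    "Teamnews": 430, "Rest": 166, "Transfers": 143, "Skandal": 141,
    "Verletzung": 124, "Management": 99, "Liganews": 59,
}
total = sum(counts.values())             # 1162 labeled articles
baseline = max(counts.values()) / total  # accuracy of majority-class guessing
print(f"majority-class baseline: {baseline:.2%}")  # ~37%, vs. 12.48% from k-NN
```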
Tests with reducing the number of articles in label "Teamnews" to 150 to get a better distribution weren't successful either.
So, is there any hint or tip on how I can increase my accuracy to something higher than 70%?
Is there a mistake in my earlier text processing steps?
Should I use my stored word lists for each category instead of the whole articles?
Or is this completely the wrong way of doing it?
If you need any more information, please let me know.
Thanks.
Best,
David
Answers
Hi,
quick thought: have you tried a Linear SVM inside the Polynominal by Binominal Classification operator?
~Martin
Dortmund, Germany
Hi.
No, I've never used it before...
I had a look at the tutorial process. It is used with numerical attributes?!
My input training example set has the following structure:
nominal
And the operator combination is not able to work with that?
EDIT:
Fixed: the text attribute is of type "text", not nominal.
Hi,
you are right, it does not work on nominal/text attributes. But you usually do not train on the text itself. I cannot load your process (for some unknown reason), but it uses tokenization, so the structure of your data should be:
label (nominal, label)
Text (text, special)
count_wordA (numerical, regular)
count_wordB (numerical, regular)
count_wordC (numerical, regular)
count_wordD (numerical, regular)
where count might be the TF-IDF value. That is perfect for an SVM.
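Martin's layout — train on the numeric TF-IDF columns while the text column stays special — could be sketched like this outside RapidMiner (scikit-learn used as an analogue; the toy articles and labels are made up for illustration):

```python
# Sketch: a linear SVM trained on TF-IDF features (the numeric word columns),
# not on the raw text attribute itself -- analogous to Martin's data layout.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy data standing in for the labeled articles
texts = [
    "Stürmer wechselt für zehn Millionen zum Rivalen",    # Transfers
    "Kapitän fällt mit Muskelfaserriss wochenlang aus",   # Verletzung
    "Trainer nominiert überraschend den jungen Torwart",  # Teamnews
    "Verein verpflichtet Mittelfeldspieler aus Spanien",  # Transfers
]
labels = ["Transfers", "Verletzung", "Teamnews", "Transfers"]

# the vectorizer turns each text into TF-IDF columns; the SVM sees only those
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, labels)
print(model.predict(["Club verpflichtet neuen Stürmer aus Spanien"]))
```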
Best,
Martin
Dortmund, Germany
Hey Martin,
thanks for your quick thoughts.
Running the Linear SVM inside the Polynominal by Binominal Classification operator gets me an accuracy of 63.19% +/- 4.75% (micro: 63.17%).
Not perfect, but much better than my first results, and I think it's acceptable.
Unfortunately, I had to cut my input examples down to a limit of 150 per label; otherwise RapidMiner crashes on my computer (i5 core, 16 GB RAM).
So thanks again.
Hi Dave,
try changing the pruning settings. How many attributes did you create? If it is something like 2k, I can imagine why the SVM crashes.
The next step for better results would be to optimize the C parameter of the SVM. Use a logarithmic "grid" between 1e-3 and 1e3. That should boost it.
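The suggested logarithmic grid over C could be sketched like this (scikit-learn's GridSearchCV as an analogue for RapidMiner's Optimize Parameters operator; the generated data is a stand-in for the real TF-IDF matrix):

```python
# Sketch: cross-validated search over C on a logarithmic grid from 1e-3 to 1e3.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# stand-in data; in the real process this would be the TF-IDF example set
X, y = make_classification(n_samples=300, n_features=50, n_informative=10,
                           n_classes=3, random_state=0)

grid = GridSearchCV(
    LinearSVC(),
    param_grid={"C": np.logspace(-3, 3, 7)},  # 0.001, 0.01, ..., 1000
    cv=5,
)
grid.fit(X, y)
print("best C:", grid.best_params_["C"],
      "cv accuracy:", round(grid.best_score_, 3))
```

Seven candidate values spanning six orders of magnitude is usually enough for a first pass; the grid can then be refined around the winner.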
Best,
Martin
Dortmund, Germany
Hey Martin,
I tried, but 9 out of 10 times RapidMiner runs into what looks like an endless loop: the process time counter keeps going but nothing happens. I only succeed if I use a very small number of examples (fewer than 100), which costs me a lot of accuracy.
My original database includes 1162 examples and 16 regular attributes.
I filter this down to 725 examples and 1 regular attribute, then start my text preprocessing.
After this step my example set includes 725 examples with 2 special and 72 regular attributes.
With pruning, my word list only contains 47 entries across all 6 labels (which I think is very few?!).
When the process reaches the "Polynominal by Binominal Classification" operator including the SVM (Linear), RapidMiner gets stuck.
Any idea?
This is a good hint and I will check it once the problem above is solved.
Best wishes,
David
Hi,
can you maybe send me the data and the process? I am keen to have a look at it. Of course we will treat the data as confidential. My email address is mschmitz at rapidminer dot com
~ Martin
Dortmund, Germany
Hello, you could also first cluster the texts, then add the cluster assignment as a new column, treat that cluster id as the label, and run a classification algorithm on it, for example an SVM.
Thank you so much
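The cluster-then-classify idea above could be sketched like this (a hypothetical illustration with scikit-learn; the toy texts and the choice of KMeans with 2 clusters are assumptions, not part of the original process):

```python
# Sketch of the suggestion above: cluster the TF-IDF vectors first,
# then use each document's cluster id as its label and train an SVM on it.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = [
    "Stürmer wechselt zum Ligarivalen",
    "Verein verpflichtet neuen Stürmer",
    "Kapitän fällt verletzt aus",
    "Torwart erleidet Muskelfaserriss",
]

X = TfidfVectorizer().fit_transform(texts)
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# the cluster id now plays the role of the label
svm = LinearSVC().fit(X, cluster_ids)
print(svm.predict(X))
```

Note that the clusters found this way will not necessarily match the seven hand-made categories, so this replaces the manual labeling rather than reproducing it.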