How can I filter no missing values in a special attribute

ArnoG · April 2014

I'm using the "Process Documents from Data" operator to select sentences containing a certain word. The operator generates an example set with a special attribute named text. Now I want to select only the records containing text, but whatever I try doesn't seem to work. I tried Filter Examples with no_missing_attributes, but somehow I can't isolate the records that contain text. Does anybody have suggestions?

Regards Arno

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="6.0.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="6.0.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="read_excel" compatibility="6.0.003" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
<parameter key="excel_file" value="C:\Improve Your Business\Qing\Rapidminer\Hampshire hotel\Prediction model.xlsx"/>
<parameter key="sheet_number" value="2"/>
<parameter key="imported_cell_range" value="A1:F8"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations">
<parameter key="0" value="Name"/>
</list>
<list key="data_set_meta_data_information">
<parameter key="0" value="Date.false.date_time.attribute"/>
<parameter key="1" value="Rate.false.numeric.attribute"/>
<parameter key="2" value="Guest category.false.binominal.attribute"/>
<parameter key="3" value="Positivereview.true.text.attribute"/>
<parameter key="4" value="Negativereview.true.text.attribute"/>
<parameter key="5" value="Sentiment.true.attribute_value.label"/>
</list>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="246" y="75">
<parameter key="vector_creation" value="Term Occurrences"/>
<parameter key="keep_text" value="true"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="112" y="30">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=".:?!"/>
</operator>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="5.3.002" expanded="true" height="60" name="Filter Tokens (by Content)" width="90" x="246" y="30">
<parameter key="string" value="room"/>
</operator>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="filter_examples" compatibility="6.0.003" expanded="true" height="94" name="Filter Examples" width="90" x="447" y="75">
<parameter key="condition_class" value="no_missing_attributes"/>
<list key="filters_list"/>
</operator>
<connect from_op="Read Excel" from_port="output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>

awchisholm · April 2014

Hello Arno

I don't have your data so I can't be certain but I think this is what is happening.

Your process reads a spreadsheet, and the resulting example set has three attributes: two of type text and a third that is a label. The Process Documents from Data operator processes the text attributes together for each example in the example set. The tokenizing splits on the characters .:?! which means each document is split into sentence-like tokens, and you are keeping only those that contain the word 'room'. The resulting document vector will therefore contain attributes corresponding to sentences containing the word 'room'. The 'Term Occurrences' setting for the Process Documents operator counts the number of times the token appears in each example within the example set. A value of zero means the example has no match.
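
To make that concrete, here is a rough Python sketch of the same idea. It is not RapidMiner code and not how the operators are implemented; the function names and the sample reviews are made up for illustration, and stripping whitespace around tokens is a simplification I am assuming.

```python
import re

def sentence_tokens(text, separators=".:?!"):
    """Split text on any separator character and drop empty pieces."""
    pattern = "[" + re.escape(separators) + "]"
    return [piece.strip() for piece in re.split(pattern, text) if piece.strip()]

def term_occurrences(documents, keyword="room"):
    """Keep only tokens containing the keyword, then count them per document."""
    kept = [[t for t in sentence_tokens(d) if keyword in t] for d in documents]
    # One column per distinct surviving token, in sorted order.
    vocabulary = sorted({t for tokens in kept for t in tokens})
    rows = [[tokens.count(t) for t in vocabulary] for tokens in kept]
    return vocabulary, rows

# Hypothetical reviews; the second one never mentions "room".
reviews = [
    "The room was clean. Breakfast was fine!",
    "Nice staff",
    "Small room: noisy room? Small room",
]
vocabulary, rows = term_occurrences(reviews)
for review, row in zip(reviews, rows):
    print(row, review)
```

The second review ends up with a row of zeros only, which is the situation described above: the example is still there, it just has no match.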

Could it be that you want to remove those examples which have the value 0 for all possible attributes?

regards

Andrew

ArnoG · April 2014

Hi Andrew,
That is exactly what I'm trying to do. My process reads a spreadsheet with 2 text columns and 1 label column for the sentiment. The process results in an example set containing 7 examples, 2 special attributes and 3 regular attributes.
3 out of the 7 examples contain text; 4 have no text. The examples with text have at least 1 regular attribute with a 1. The 4 examples without text have all 0s.

I'd like to remove all the examples with a 0 for all attributes. So you're exactly right. Is that possible?

Regards.

Arno

awchisholm · April 2014

Hello Arno

There are many ways. One to try would be "Remove Useless Attributes".

regards

Andrew

ArnoG · April 2014

Hi Andrew,
Thanks for your response. I was not familiar with this operator. I tried it, but it removes attributes instead of examples. Am I using it the right way?

Regards,

Arno

awchisholm · April 2014

Hello Arno

My mistake - silly me - not thinking straight.

You could add up all values of the attributes to create a new attribute and then keep only those examples where the new attribute is not zero. The operator to use would be "Generate Aggregation"; set the attribute filter type to "value_type", the value type to "numeric", and ensure the aggregation function is "sum". Using this operator means you don't need to know the names of the attributes you are summing.
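
In case the logic is easier to follow outside the GUI, this is roughly what the two steps do, written as a plain Python sketch. It is not RapidMiner code; the function names, the "sum" attribute name and the sample rows are assumptions for illustration only.

```python
from numbers import Number

def add_numeric_sum(example, new_name="sum"):
    """Return a copy of the example with the sum of all its numeric values added."""
    total = sum(
        value for value in example.values()
        # Select by type, not by name; bool is excluded because it is a Number in Python.
        if isinstance(value, Number) and not isinstance(value, bool)
    )
    return {**example, new_name: total}

def keep_nonzero(examples, new_name="sum"):
    """Keep only the examples whose numeric values do not add up to zero."""
    summed = [add_numeric_sum(example, new_name) for example in examples]
    return [example for example in summed if example[new_name] != 0]

# Hypothetical example set: one text attribute plus two token-count attributes.
examples = [
    {"text": "The room was clean", "token_a": 1, "token_b": 0},
    {"text": "", "token_a": 0, "token_b": 0},
    {"text": "Small room", "token_a": 0, "token_b": 2},
]
for example in keep_nonzero(examples):
    print(example)
```

Selecting the values by type rather than by name mirrors the point above: the attribute names depend on the tokens found, so you don't want to list them by hand. This works here because term occurrences are never negative, so a sum of zero means every count is zero.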

regards

Andrew

ArnoG · April 2014

Hi Andrew,
Thanks! That worked. I created a new attribute and added up all the values. Then I used the Filter Examples operator, set to custom_filters with the condition that the new attribute is not 0. Now I have the examples containing text.
Thanks.

Regards,

Arno
