Classification - comparison of one attribute to other attributes
Hi. I'm trying to classify authors of texts. I have 4 attributes containing the most commonly used words: attributes A, B, C and D. Currently, attribute A is compared only against A in the rest of the data, B against B in the rest of the data, etc.
But I want to check whether the value of attribute A exists in any of the attributes A, B, C and D of another row. For example:
1) row X has A with "example" value and B with "test" value
2) row Y has A with "test" value and B with "qwerty" value
3) the "test" value exists in both X and Y, so the check should return true, meaning there is a bigger chance that the author of X is the same as the author of Y
How can I do that? I want to use it together with operators like Decision Tree, KNN, etc.
Answers
What does your data look like? Would you mind sharing a small example?
There are many ways to do this, but it all depends on what your data looks like.
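As one possible reading of the check described in the question, here is a minimal Python sketch (not a RapidMiner operator). The function name `shares_word` and the dictionary layout of a row are assumptions made for illustration; only the X and Y values come from the question.

```python
def shares_word(row_x, row_y, columns=("A", "B", "C", "D")):
    """Return True if any word in row_x's columns also appears in row_y's columns."""
    # Collect the non-empty words of each row, ignoring which column they sit in.
    words_x = {row_x[c] for c in columns if row_x.get(c)}
    words_y = {row_y[c] for c in columns if row_y.get(c)}
    return not words_x.isdisjoint(words_y)


# Rows X and Y from the question (only A and B were given there).
row_x = {"A": "example", "B": "test"}
row_y = {"A": "test", "B": "qwerty"}

print(shares_word(row_x, row_y))  # True: "test" is in both rows
```

A boolean like this describes a pair of rows rather than a single row, so it would still have to be turned into a per-row feature before a learner such as Decision Tree or k-NN could use it.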
Here is a picture of what I'm thinking:
...and here is the XML code for that operation.
<?xml version="1.0" encoding="UTF-8"?><process version="9.3.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process">
<parameter key="logverbosity" value="init"/>
<parameter key="random_seed" value="2001"/>
<parameter key="send_mail" value="never"/>
<parameter key="notification_email" value=""/>
<parameter key="process_duration_for_mail" value="30"/>
<parameter key="encoding" value="UTF-8"/>
<process expanded="true">
<operator activated="true" class="utility:create_exampleset" compatibility="9.3.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">
<parameter key="generator_type" value="comma separated text"/>
<parameter key="number_of_examples" value="100"/>
<parameter key="use_stepsize" value="false"/>
<list key="function_descriptions"/>
<parameter key="add_id_attribute" value="false"/>
<list key="numeric_series_configuration"/>
<list key="date_series_configuration"/>
<list key="date_series_configuration (interval)"/>
<parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
<parameter key="time_zone" value="SYSTEM"/>
<parameter key="input_csv_text" value="Author,A,B,C,D&#10;Tolstoi,word1,word2,word3,word4&#10;Chejov,word4,word5,word2,word6&#10;Dostoievski,word7,word8,word9,word6&#10;Solzhenitsyn,word10,word11,word3,word12"/>
<parameter key="column_separator" value=","/>
<parameter key="parse_all_as_nominal" value="false"/>
<parameter key="decimal_point_character" value="."/>
<parameter key="trim_attribute_names" value="true"/>
</operator>
<operator activated="true" class="de_pivot" compatibility="9.3.000" expanded="true" height="82" name="De-Pivot" width="90" x="179" y="34">
<list key="attribute_name">
<parameter key="Word" value="\w"/>
</list>
<parameter key="index_attribute" value="Index"/>
<parameter key="create_nominal_index" value="true"/>
<parameter key="keep_missings" value="false"/>
<description align="center" color="transparent" colored="false" width="126">The De-Pivot operator produces a list of words, each with a nominal index showing which attribute the word came from.</description>
</operator>
<operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply" width="90" x="313" y="34">
<description align="center" color="transparent" colored="false" width="126">The Multiply operator copies the example set so that it can be joined with itself.</description>
</operator>
<operator activated="true" class="concurrency:join" compatibility="9.3.000" expanded="true" height="82" name="Join" width="90" x="447" y="34">
<parameter key="remove_double_attributes" value="false"/>
<parameter key="join_type" value="inner"/>
<parameter key="use_id_attribute_as_key" value="false"/>
<list key="key_attributes">
<parameter key="Word" value="Word"/>
</list>
<parameter key="keep_both_join_attributes" value="false"/>
<description align="center" color="transparent" colored="false" width="126">A simple inner join by words can show us what words are common among authors.</description>
</operator>
<operator activated="true" class="generate_attributes" compatibility="9.3.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
<list key="function_descriptions">
<parameter key="Same?" value="Author == Author_from_ES2"/>
</list>
<parameter key="keep_all" value="true"/>
<description align="center" color="transparent" colored="false" width="126">The Join also matches each author with itself. We compare the two author attributes and record the result in &quot;Same?&quot;...</description>
</operator>
<operator activated="true" class="filter_examples" compatibility="9.3.000" expanded="true" height="103" name="Filter Examples" width="90" x="715" y="34">
<parameter key="parameter_expression" value=""/>
<parameter key="condition_class" value="custom_filters"/>
<parameter key="invert_filter" value="false"/>
<list key="filters_list">
<parameter key="filters_entry_key" value="Same?.equals.false"/>
</list>
<parameter key="filters_logic_and" value="true"/>
<parameter key="filters_check_metadata" value="true"/>
<description align="center" color="transparent" colored="false" width="126">...so that we can filter out the rows where an author is matched with itself.</description>
</operator>
<operator activated="true" class="select_attributes" compatibility="9.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="849" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attribute" value=""/>
<parameter key="attributes" value="Author|Author_from_ES2|Index|Index_from_ES2|Word"/>
<parameter key="use_except_expression" value="false"/>
<parameter key="value_type" value="attribute_value"/>
<parameter key="use_value_type_exception" value="false"/>
<parameter key="except_value_type" value="time"/>
<parameter key="block_type" value="attribute_block"/>
<parameter key="use_block_type_exception" value="false"/>
<parameter key="except_block_type" value="value_matrix_row_start"/>
<parameter key="invert_selection" value="false"/>
<parameter key="include_special_attributes" value="false"/>
<description align="center" color="transparent" colored="false" width="126">Finally, we select only the attributes we need.</description>
</operator>
<connect from_op="Create ExampleSet" from_port="output" to_op="De-Pivot" to_port="example set input"/>
<connect from_op="De-Pivot" from_port="example set output" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_op="Join" to_port="left"/>
<connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
<connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
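For readers who want to follow the logic outside RapidMiner, the same steps (de-pivot, self-join on the word, drop rows where both sides are the same author) can be sketched in plain Python. This is an illustration under assumptions, not a translation of the operators: the function names `de_pivot` and `shared_words` are made up here, and only the toy data comes from the process above.

```python
from itertools import product

# Same toy data as the Create ExampleSet operator in the process above.
rows = [
    {"Author": "Tolstoi", "A": "word1", "B": "word2", "C": "word3", "D": "word4"},
    {"Author": "Chejov", "A": "word4", "B": "word5", "C": "word2", "D": "word6"},
    {"Author": "Dostoievski", "A": "word7", "B": "word8", "C": "word9", "D": "word6"},
    {"Author": "Solzhenitsyn", "A": "word10", "B": "word11", "C": "word3", "D": "word12"},
]


def de_pivot(rows, columns=("A", "B", "C", "D")):
    """Turn each wide row into one (Author, Index, Word) record per column."""
    return [
        {"Author": r["Author"], "Index": c, "Word": r[c]}
        for r in rows
        for c in columns
    ]


def shared_words(long_rows):
    """Inner self-join on Word, dropping pairs where both sides are the same author."""
    return [
        (left["Author"], right["Author"], left["Word"])
        for left, right in product(long_rows, repeat=2)
        if left["Word"] == right["Word"] and left["Author"] != right["Author"]
    ]


matches = shared_words(de_pivot(rows))
for match in matches:
    print(match)
```

Because the join is symmetric, every shared word shows up once per direction, for example both `("Chejov", "Dostoievski", "word6")` and `("Dostoievski", "Chejov", "word6")`.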
This process has a problem, though: the Join returns every match twice, once in each direction:
Chéjov == Dostoievski
Dostoievski == Chéjov
You can eliminate those duplicated pairs. I used Generate Attributes to create an attribute that says KEEP if the first author is less than the second (Chéjov is less than Dostoievski because it begins with C, and C < D) and DELETE if the first author is greater than the second (Dostoievski is greater than Chéjov because D > C). This is the corrected process:
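As a rough Python illustration of the KEEP/DELETE rule only (not the RapidMiner process itself), the sketch below keeps a pair when the first author sorts before the second. The names `pairs` and `keep_one_direction` are hypothetical, and the sample pairs are taken from the toy data in the process above.

```python
# Ordered pairs as a self-join would produce them: each match appears twice.
pairs = [
    ("Chejov", "Dostoievski", "word6"),
    ("Dostoievski", "Chejov", "word6"),
    ("Tolstoi", "Chejov", "word2"),
    ("Chejov", "Tolstoi", "word2"),
]


def keep_one_direction(pairs):
    """Keep a pair only when the first author sorts before the second (KEEP)."""
    return [p for p in pairs if p[0] < p[1]]


unique_pairs = keep_one_direction(pairs)
print(unique_pairs)
# [('Chejov', 'Dostoievski', 'word6'), ('Chejov', 'Tolstoi', 'word2')]
```

Python compares strings by code point, so the ordering is case-sensitive; any consistent ordering works, since the goal is only to pick one direction of each pair.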
Hope this helps,
Rodrigo.
"100395", "1000866", "1640", "318", "44", "6", "0,6006289", "anyway", "really", "decided", "write"
"104212", "1000866", "1155", "230", "57", "6", "0,6173913", "we're", "almost", "scrub", "really"
"108960", "1000866", "1774", "336", "59", "6", "0,5119048", "because", "chris", "about", "people"
"111351", "1000866", "1034", "192", "47", "6", "0,6666667", "really", "peter", "because", "happy"
Scott
"108960", "1000866", "1774", "336", "59", "6", "0,5119048", "decided", "chris", "really", "people"
"114290", "1011289", "1777", "328", "77", "6", "0,6128049", "jacen", "talking", "about", "they"
"116160", "1011289", "1777", "348", "93", "6", "0,5545977", "about", "really", "write", "ending"
"104488", "1011311", "1027", "196", "79", "6", "0,6479592", "lives", "worry", "control", "melody"
"105743", "1011311", "1261", "243", "97", "6", "0,5884774", "little", "right", "think", "drivers"