Join on partial match
Hi!
I'm working on increasing the relevance of product-search results on our website by importing synonyms into our system.
To do that, I downloaded the synonym list from OpenThesaurus.
I don't want to import all the synonyms (that would increase the indexing time), only those synonym groups in which at least one word also occurs in our database.
I therefore processed our product documents to get a word list and converted it to data. First question: Which type of data is it now? Text or polynominal?
Second question:
How can I now filter out those synonym-pairs where none of the included synonyms is also in the word list?
An example:
My list
Bike
Boat
Car
The synonym list from OpenThesaurus
bike, bicycle
boat, motorboat, sailboat
airplane, plane
In the example above, the resulting data set should be:
bike, bicycle
boat, motorboat, sailboat
(as "airplane, plane" isn't in the wordlist)
I tried Loop Attributes, Loop Values, and several other combinations.
Is there a simple method for this?
Thanks!
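For reference, the filtering described above can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the logic, not a RapidMiner process: the function name `filter_synonym_groups` is made up for this example, and it assumes that each synonym group is one comma-separated line and that matching should ignore case (so that "Bike" in the word list matches "bike" in the synonyms).

```python
def filter_synonym_groups(synonym_lines, product_words):
    """Keep only synonym lines that share at least one word with product_words."""
    # Lower-case once so that "Bike" in the word list matches "bike" in the synonyms.
    known = {word.strip().lower() for word in product_words}
    kept = []
    for line in synonym_lines:
        # Split the group into its individual synonyms and trim whitespace.
        terms = [term.strip() for term in line.split(",")]
        # Exact (whole-term) comparison against the word list.
        if any(term.lower() in known for term in terms):
            kept.append(line)
    return kept


words = ["Bike", "Boat", "Car"]
synonyms = ["bike, bicycle", "boat, motorboat, sailboat", "airplane, plane"]
print(filter_synonym_groups(synonyms, words))
# ['bike, bicycle', 'boat, motorboat, sailboat']
```

Because the word list is turned into a set, each synonym is checked with a single lookup instead of being compared against every product word.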
Answers
Sounds like you could use Generate Attributes to create a new attribute such as "Contains Bike" and then join on that?
~Martin
Dortmund, Germany
Thanks!
You mean like the following process?
At the moment, "cable" would, for example, also be found if the synonym is "energy-cable" or similar. (Please see the function in Generate Attributes.)
Is there a way to match only those values that have no other letter directly before or after the loop_value (only a space, comma, or punctuation mark would be allowed; I guess using a regex)?
Thanks!
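The whole-word condition asked about here can be expressed with regex lookarounds. The sketch below uses Python's `re` module only to show the pattern; it is not a statement about which regex features a particular RapidMiner function accepts. The helper name `contains_whole_word` is hypothetical, and treating the hyphen as part of a word (so that "energy-cable" does not count as a match for "cable") is an assumption based on the example in the question.

```python
import re


def contains_whole_word(text, word):
    """Return True if `word` occurs in `text` as a complete term."""
    # (?<![\w-]) : the character before the match must not be a letter,
    #              digit, underscore, or hyphen (or the match is at the start).
    # (?![\w-])  : the same restriction for the character after the match.
    # re.escape  : keeps characters such as "." or "+" in `word` literal.
    pattern = r"(?<![\w-])" + re.escape(word) + r"(?![\w-])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(contains_whole_word("cable, wire", "cable"))               # True
print(contains_whole_word("energy-cable, power cord", "cable"))  # False
print(contains_whole_word("cables, wires", "cable"))             # False
```

If hyphenated compounds should count as matches after all, replacing `[\w-]` with `\w` in both lookarounds would allow them.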
Sure. I think contains actually takes regexes, even though this is not explicitly documented.
~Martin
I added a Split operator inside the loop so that it can test each resulting attribute with an exact-match comparison.
How can I say "equals either attribute1 or attribute2 or attribute3, ..."?
The process tells me that "||" is only allowed for Boolean or numerical attributes.
Thanks,
Steven
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="7.1.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.1.000" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.1.000" expanded="true" height="68" name="Retrieve t123_product_words" width="90" x="112" y="85">
<parameter key="repository_entry" value="//Local Repository/data/t123_product_words"/>
</operator>
<operator activated="true" class="sample_stratified" compatibility="7.1.000" expanded="true" height="82" name="Sample (Stratified)" width="90" x="246" y="85"/>
<operator activated="true" breakpoints="after" class="loop_values" compatibility="7.1.000" expanded="true" height="82" name="Loop Values" width="90" x="380" y="85">
<parameter key="attribute" value="word"/>
<process expanded="true">
<operator activated="true" class="retrieve" compatibility="7.1.000" expanded="true" height="68" name="Retrieve synonyms_all" width="90" x="179" y="85">
<parameter key="repository_entry" value="//Local Repository/data/synonyms_all"/>
</operator>
<operator activated="true" class="split" compatibility="7.1.000" expanded="true" height="82" name="Split" width="90" x="313" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="att1"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.1.000" expanded="true" height="82" name="Generate Attributes" width="90" x="514" y="136">
<list key="function_descriptions">
<parameter key="contains_attribute" value="if(equals([att1_1]||[att1_2]||[att1_3]||[att1_4]||[att1_5]||[att1_6]||[att1_7]||[att1_8]||[att1_9]||[att1_10]||[att1_11]||[att1_12]||[att1_13]||[att1_14]||[att1_15]||[att1_16]||[att1_17]||[att1_18]||[att1_19],%{loop_value}),&quot;YESMATCH&quot;,&quot;NOMATCH&quot;)"/>
</list>
</operator>
<connect from_op="Retrieve synonyms_all" from_port="output" to_op="Split" to_port="example set input"/>
<connect from_op="Split" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_port="out 1"/>
<portSpacing port="source_example set" spacing="0"/>
<portSpacing port="sink_out 1" spacing="0"/>
<portSpacing port="sink_out 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="append" compatibility="7.1.000" expanded="true" height="82" name="Append" width="90" x="514" y="85"/>
<operator activated="true" class="remove_duplicates" compatibility="7.1.000" expanded="true" height="82" name="Remove Duplicates" width="90" x="648" y="85">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="contains_attribute"/>
</operator>
<operator activated="true" class="filter_examples" compatibility="7.1.000" expanded="true" height="103" name="Filter Examples" width="90" x="782" y="85">
<list key="filters_list">
<parameter key="filters_entry_key" value="contains_attribute.does_not_equal.NOMATCH"/>
</list>
</operator>
<connect from_op="Retrieve t123_product_words" from_port="output" to_op="Sample (Stratified)" to_port="example set input"/>
<connect from_op="Sample (Stratified)" from_port="example set output" to_op="Loop Values" to_port="example set"/>
<connect from_op="Loop Values" from_port="out 1" to_op="Append" to_port="example set 1"/>
<connect from_op="Append" from_port="merged set" to_op="Remove Duplicates" to_port="example set input"/>
<connect from_op="Remove Duplicates" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
<connect from_op="Filter Examples" from_port="example set output" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
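The "||" error is consistent with how the expression in Generate Attributes is written: the nominal attributes themselves are joined with "||" inside a single equals(...), whereas "||" expects Boolean operands. One way to read "equals either attribute1 or attribute2 or ..." is therefore one comparison per attribute, with the Boolean results joined by "||". Typing that out for 19 attributes is tedious, so the sketch below builds the expression string in Python. It reuses only the functions already present in the process (`if`, `equals`, the `%{loop_value}` macro); the helper name `build_match_expression` is made up, and how the resulting expression behaves for missing values after the Split is an assumption that would need to be checked in RapidMiner.

```python
def build_match_expression(attribute_names, macro="loop_value"):
    """Build an expression that is YESMATCH if any attribute equals the macro."""
    # One equals(...) call per attribute; each call yields a Boolean.
    comparisons = [f"equals([{name}],%{{{macro}}})" for name in attribute_names]
    # Join the Boolean results with "||" instead of joining the attributes.
    condition = "||".join(comparisons)
    return f'if({condition},"YESMATCH","NOMATCH")'


# Attribute names as produced by Split in the process above: att1_1 ... att1_19.
names = [f"att1_{i}" for i in range(1, 20)]
print(build_match_expression(names[:2]))
# if(equals([att1_1],%{loop_value})||equals([att1_2],%{loop_value}),"YESMATCH","NOMATCH")
```

The printed string is meant to be pasted into the expression field; inside the process XML, the double quotes would have to be written as `&quot;`.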
I managed to get this process working, but unfortunately it always gets stuck between loop iterations 150 and 300.
Do you have an idea how to make this simpler or make it consume less memory?
Thanks,
Steven
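A possible reason for the memory growth is visible in the process itself: the full synonym set is retrieved once per product word inside Loop Values, and all copies are appended afterwards, so the intermediate data grows with (number of words) × (number of synonym rows). A join-style formulation avoids this by touching each synonym row only once. The Python sketch below illustrates that idea under the assumption that every synonym row is a comma-separated string; the names `explode` and `join_on_term` are hypothetical, and how to map the steps onto RapidMiner operators is left open.

```python
def explode(synonym_lines):
    """Yield one (term, group_id) pair per synonym in each line."""
    for group_id, line in enumerate(synonym_lines):
        for term in line.split(","):
            term = term.strip().lower()
            if term:
                yield term, group_id


def join_on_term(synonym_lines, product_words):
    """Return the synonym lines that share at least one term with product_words."""
    known = {word.strip().lower() for word in product_words}
    # Inner join on the term: keep the ids of groups with a matching term.
    matched_ids = {gid for term, gid in explode(synonym_lines) if term in known}
    # Collect each matching group exactly once, in the original order.
    return [line for gid, line in enumerate(synonym_lines) if gid in matched_ids]


synonyms = ["bike, bicycle", "boat, motorboat, sailboat", "airplane, plane"]
print(join_on_term(synonyms, ["Bike", "Boat", "Car"]))
# ['bike, bicycle', 'boat, motorboat, sailboat']
```

Here the extra memory is limited to the set of product words and the set of matching group ids, regardless of how many product words there are.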