Self-modifying stop-list/n-gram filter in text mining?
I'm working on a process that scrapes websites from a selected industry and hunts for industry-specific keywords in the collected text. The difficulty I'm facing is that when I look for phrases or n-grams, a lot of them are just rubbish. In the final output, I would like to see only n-grams that contain those specific keywords, followed or preceded (in certain cases) by words that would otherwise have been filtered out or that provide no valuable insight on their own.
E.g., in the ship-building industry: sonar_systems. Normally I would not be interested in the word "systems", since on its own it tells me little about the industry, but the word "sonar", and the n-gram sonar_systems, is quite valuable from an analysis point of view.
So basically I could either have a stoplist that populates itself (somehow!) by looking at intermediate results from Process Documents from Data/Files, so that only the relevant n-grams are passed on to further analysis such as text clustering, association rules, etc., OR I could find some clever way of filtering certain n-grams before they are passed to other operators, something like the sketch below.
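To make option two concrete, here is a rough sketch of the kind of filter I have in mind, written as plain Python rather than RapidMiner operators; the seed keyword list and the sample n-grams are made up for illustration:

# Sketch of option two: keep only n-grams anchored on a seed keyword.
# SEED_KEYWORDS and the sample n-grams below are hypothetical.

SEED_KEYWORDS = {"sonar", "radar", "hull"}  # hypothetical industry seeds

def keep_ngram(ngram):
    # RapidMiner joins n-gram parts with "_", e.g. "sonar_systems";
    # keep the n-gram if any of its parts is a seed keyword.
    return any(part in SEED_KEYWORDS for part in ngram.split("_"))

ngrams = ["sonar_systems", "click_here", "hull_design", "privacy_policy"]
print([g for g in ngrams if keep_ngram(g)])  # ['sonar_systems', 'hull_design']

In RapidMiner terms this would act as a keep-list applied right after Generate n-Grams (Terms), rather than a stop-list.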
Any ideas on how to do this? Thank you very much!
P.S. I do not have an end-to-end process yet, just the text-mining part of it. Configuring a crawler with exception handling isn't much of a problem, and if I put this on a server I can build some kind of app around it.
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:process_document_from_file" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Files" width="90" x="112" y="391">
        <list key="text_directories">
          <parameter key="Legal" value="C:\Users\Pari\Documents\Odin\Data"/>
        </list>
        <parameter key="vector_creation" value="Binary Term Occurrences"/>
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="5"/>
        <parameter key="prune_above_absolute" value="1000"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="7.5.000" expanded="true" height="68" name="Transform Cases" width="90" x="112" y="85"/>
          <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="246" y="85"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="7.5.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="380" y="85"/>
          <operator activated="true" class="text:filter_by_length" compatibility="7.5.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="514" y="85">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="15"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="7.5.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="648" y="85"/>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (Dictionary)" width="90" x="782" y="85">
            <parameter key="file" value="C:\Users\Pari\Documents\Odin\Aero Stoplist.txt"/>
          </operator>
          <operator activated="false" class="text:stem_snowball" compatibility="7.5.000" expanded="true" height="68" name="Stem (Snowball)" width="90" x="514" y="340"/>
          <operator activated="true" class="text:filter_stopwords_dictionary" compatibility="7.5.000" expanded="true" height="82" name="Filter Stopwords (2)" width="90" x="916" y="85">
            <parameter key="file" value="C:\Users\Pari\Documents\Odin\Aero n-gram Stoplist.txt"/>
          </operator>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_op="Filter Stopwords (Dictionary)" to_port="document"/>
          <connect from_op="Filter Stopwords (Dictionary)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="numerical_to_binominal" compatibility="7.5.001" expanded="true" height="82" name="Numerical to Binominal" width="90" x="313" y="340"/>
      <operator activated="true" class="fp_growth" compatibility="7.5.001" expanded="true" height="82" name="FP-Growth" width="90" x="514" y="238">
        <parameter key="find_min_number_of_itemsets" value="false"/>
        <parameter key="positive_value" value="true"/>
        <parameter key="min_support" value="0.02"/>
      </operator>
      <operator activated="true" class="create_association_rules" compatibility="7.5.001" expanded="true" height="82" name="Create Association Rules" width="90" x="648" y="238">
        <parameter key="min_confidence" value="0.5"/>
        <parameter key="gain_theta" value="1.0"/>
      </operator>
      <operator activated="true" class="converters:rules_2_example_set" compatibility="0.3.000" expanded="true" height="82" name="Association Rules to ExampleSet" width="90" x="782" y="34"/>
      <connect from_op="Process Documents from Files" from_port="example set" to_op="Numerical to Binominal" to_port="example set input"/>
      <connect from_op="Process Documents from Files" from_port="word list" to_port="result 4"/>
      <connect from_op="Numerical to Binominal" from_port="example set output" to_op="FP-Growth" to_port="example set"/>
      <connect from_op="FP-Growth" from_port="frequent sets" to_op="Create Association Rules" to_port="item sets"/>
      <connect from_op="Create Association Rules" from_port="rules" to_op="Association Rules to ExampleSet" to_port="rules input"/>
      <connect from_op="Create Association Rules" from_port="item sets" to_port="result 3"/>
      <connect from_op="Association Rules to ExampleSet" from_port="example set" to_port="result 1"/>
      <connect from_op="Association Rules to ExampleSet" from_port="original rules output" to_port="result 2"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
      <portSpacing port="sink_result 5" spacing="0"/>
    </process>
  </operator>
</process>
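For anyone who doesn't read RapidMiner XML, the inner Process Documents chain above boils down to roughly the following plain-Python simplification (bigrams only; the stopword sets and the sample sentence are placeholders, not my real dictionaries):

import re

ENGLISH_STOPWORDS = {"the", "a", "of", "and", "for", "in", "is", "are"}
DICTIONARY_STOPWORDS = {"click", "here", "click_here"}  # stands in for the .txt stoplists

def process_document(text):
    tokens = re.findall(r"[a-z]+", text.lower())                # Transform Cases + Tokenize
    tokens = [t for t in tokens if t not in ENGLISH_STOPWORDS]  # Filter Stopwords (English)
    tokens = [t for t in tokens if 3 <= len(t) <= 15]           # Filter Tokens (by Length)
    bigrams = ["_".join(p) for p in zip(tokens, tokens[1:])]    # Generate n-Grams (Terms)
    terms = tokens + bigrams
    return [t for t in terms if t not in DICTIONARY_STOPWORDS]  # Filter Stopwords (Dictionary)

print(process_document("The sonar systems of the ship, click here for more"))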
Answers
I built a self-populating stopword list for a customer two years ago, but I can't find it. If memory serves, we text-mined a corpus of insurance documents with pruning and used n-grams. From there we exported the Wordlist and saved it to a repository.
That repository entry would then be looped back in and appended to the stopword list used for the next iteration. It was quite complex, but I do remember using Loops for this, and possibly the Remember and Recall operators too. The general idea looks something like the sketch below.
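Stripped of the RapidMiner operators, the loop amounts to roughly this; the seed keywords, the sample corpus terms, and mine_wordlist() are hypothetical stand-ins for the real mining step:

SEED_KEYWORDS = {"sonar", "radar", "hull"}  # hypothetical seeds
STOPLIST_PATH = "stoplist.txt"              # stands in for the repository entry

def mine_wordlist(stoplist):
    # Placeholder for Process Documents from Files with pruning and n-grams.
    corpus_terms = {"sonar_systems", "click_here", "hull_design", "read_more"}
    return corpus_terms - stoplist

stoplist = set()
for _ in range(3):  # a fixed number of passes, like a Loop operator
    wordlist = mine_wordlist(stoplist)
    # Terms with no seed keyword in any part are junk: add them to the stoplist.
    junk = {t for t in wordlist
            if not any(part in SEED_KEYWORDS for part in t.split("_"))}
    if not junk:
        break         # nothing new to stop: the list has converged
    stoplist |= junk  # like Remember/Recall: carry the grown list into the next pass

with open(STOPLIST_PATH, "w") as f:
    f.write("\n".join(sorted(stoplist)))
print(sorted(stoplist))  # ['click_here', 'read_more']

Each pass mines with the current stoplist, flags wordlist entries that carry no seed keyword, and folds them back into the stoplist until nothing new turns up.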
Thank you @Thomas_Ott, I will try to build something along the lines of what you said. Have a great day!