I want to filter tokens by content with two-word terms such as GLOBAL-WARMING|HARM-REDUCTION
Linhnm_178
Member Posts: 4 Learner I
in Help
I genuinely appreciate your suggestions. I am writing a thesis for which I build my own set of words (using Filter Tokens (by Content) with a pasted regular expression) and want to find how many times those words appear in a PDF file.
However, I do not think RapidMiner can read my two-word terms such as Global-warming and harm-reduction. These are separate words that only make sense when they stand together.
Can someone please give me advice?
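A note on why this happens, visible in the process XML below: the Tokenize operator uses mode "non letters", so every hyphen and space is a split point, and "Global-warming" reaches the later operators as two separate tokens, "global" and "warming". Regex alternatives such as GLOBAL WARMING or HARM-REDUCTION in Filter Tokens (by Content) can therefore never match a single token.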
<?xml version="1.0" encoding="UTF-8"?><process version="9.10.001"> <context> <input/> <output/> <macros/> </context> <operator activated="true" class="process" compatibility="9.10.001" expanded="true" name="Process"> <parameter key="logverbosity" value="init"/> <parameter key="random_seed" value="2001"/> <parameter key="send_mail" value="never"/> <parameter key="notification_email" value=""/> <parameter key="process_duration_for_mail" value="30"/> <parameter key="encoding" value="SYSTEM"/> <process expanded="true"> <operator activated="true" class="concurrency:loop_files" compatibility="9.10.001" expanded="true" height="82" name="Loop Files" width="90" x="179" y="34"> <parameter key="directory" value="C:/Master Thesis/file"/> <parameter key="filter_type" value="glob"/> <parameter key="filter_by_glob" value="*.pdf"/> <parameter key="recursive" value="false"/> <parameter key="enable_macros" value="false"/> <parameter key="macro_for_file_name" value="file_name"/> <parameter key="macro_for_file_type" value="file_type"/> <parameter key="macro_for_folder_name" value="folder_name"/> <parameter key="reuse_results" value="false"/> <parameter key="enable_parallel_execution" value="true"/> <process expanded="true"> <operator activated="true" class="text:read_document" compatibility="9.4.000" expanded="true" height="68" name="Read Document" width="90" x="313" y="34"> <parameter key="extract_text_only" value="true"/> <parameter key="use_file_extension_as_type" value="true"/> <parameter key="content_type" value="pdf"/> <parameter key="encoding" value="SYSTEM"/> </operator> <connect from_port="file object" to_op="Read Document" to_port="file"/> <connect from_op="Read Document" from_port="output" to_port="output 1"/> <portSpacing port="source_file object" spacing="0"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_output 1" spacing="0"/> <portSpacing port="sink_output 2" spacing="0"/> </process> </operator> <operator activated="false" class="text:combine_documents" compatibility="9.4.000" expanded="true" height="68" name="Combine Documents" width="90" x="179" y="136"/> <operator activated="true" class="text:process_documents" compatibility="9.4.000" expanded="true" height="103" name="Process Documents" width="90" x="447" y="85"> <parameter key="create_word_vector" value="true"/> <parameter key="vector_creation" value="TF-IDF"/> <parameter key="add_meta_information" value="true"/> <parameter key="keep_text" value="false"/> <parameter key="prune_method" value="none"/> <parameter key="prune_below_percent" value="3.0"/> <parameter key="prune_above_percent" value="30.0"/> <parameter key="prune_below_rank" value="0.05"/> <parameter key="prune_above_rank" value="0.95"/> <parameter key="datamanagement" value="double_sparse_array"/> <parameter key="data_management" value="auto"/> <process expanded="true"> <operator activated="true" class="text:tokenize" compatibility="9.4.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"> <parameter key="mode" value="non letters"/> <parameter key="characters" value=".:"/> <parameter key="language" value="English"/> <parameter key="max_token_length" value="3"/> </operator> <operator activated="true" class="text:transform_cases" compatibility="9.4.000" expanded="true" height="68" name="Transform Cases" width="90" x="380" y="34"> <parameter key="transform_to" value="lower case"/> </operator> <operator activated="true" class="text:filter_by_length" compatibility="9.4.000" expanded="true" height="68" name="Filter Tokens (by Length)" 
width="90" x="514" y="34"> <parameter key="min_chars" value="4"/> <parameter key="max_chars" value="25"/> </operator> <operator activated="true" class="text:filter_tokens_by_content" compatibility="9.4.000" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="715" y="34"> <parameter key="condition" value="matches"/> <parameter key="regular_expression" value="ANTIBIOTIC|BARRIER|BIOMASS|BREEDING|CHILD|CHILDREN|CHOLESTEROL|CLAIM|COMPLIANCE|CONCERN|CORRECTIVE|DECREASE|DEFENSE|DEPLETION|DIETARY|DISCHARGE|DISCLOSE|DISCLOSURE|DONATION|EMISSION|FOOTPRINT|FOSSIL|GAP|GLOBAL WARMING|GMO|GRAZING|GREENHOUSE GAS|HARM-REDUCTION|HORMONE|HYDRO|IMPACT|INCREASE|INTENSIVE|LIABILITIES|LIMITATION|LOSS|LOSSES|NEGATIVE|NICOTINE|OBESITY|OPPORTUNITIES|OPPORTUNITY|QUALITY|REDUCE|REDUCTION|REMANUFACTURED|REPROCESSED|REPUTATIONAL|RESPONSIBILITY|RESTRICTION|RISK|SAFETY|SCARCITY|SETTLEMENT|STANDARD|TURNOVER|WASTE|WELFARE "/> <parameter key="case_sensitive" value="false"/> <parameter key="invert condition" value="false"/> </operator> <operator activated="true" class="text:filter_stopwords_english" compatibility="9.4.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="916" y="34"/> <operator activated="true" class="text:generate_n_grams_terms" compatibility="9.4.000" expanded="true" height="68" name="Generate n-Grams (Terms)" width="90" x="1050" y="34"> <parameter key="max_length" value="2"/> </operator> <connect from_port="document" to_op="Tokenize" to_port="document"/> <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/> <connect from_op="Transform Cases" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/> <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/> <connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/> <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/> <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/> <portSpacing port="source_document" spacing="0"/> <portSpacing port="sink_document 1" spacing="0"/> <portSpacing port="sink_document 2" spacing="0"/> </process> </operator> <operator activated="true" class="text:wordlist_to_data" compatibility="9.4.000" expanded="true" height="82" name="WordList to Data" width="90" x="715" y="238"/> <connect from_op="Loop Files" from_port="output 1" to_op="Process Documents" to_port="documents 1"/> <connect from_op="Process Documents" from_port="example set" to_port="result 1"/> <connect from_op="Process Documents" from_port="word list" to_op="WordList to Data" to_port="word list"/> <connect from_op="WordList to Data" from_port="word list" to_port="result 3"/> <connect from_op="WordList to Data" from_port="example set" to_port="result 2"/> <portSpacing port="source_input 1" spacing="0"/> <portSpacing port="sink_result 1" spacing="0"/> <portSpacing port="sink_result 2" spacing="0"/> <portSpacing port="sink_result 3" spacing="0"/> <portSpacing port="sink_result 4" spacing="0"/> </process> </operator> </process>
Best Answers
-
ceaperez Member Posts: 541 Unicorn
Hi @Linhnm_178,
I understand now. Please check the n-Grams operator from the Text Processing extension. Check this link for more information.
Best,
Cesar
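As a minimal illustration of what Generate n-Grams (Terms) does (assuming its underscore separator): with max_length set to 2, the token sequence

global warming harm reduction

becomes

global global_warming warming warming_harm harm harm_reduction reduction

that is, every adjacent token pair is added as one underscore-joined token, which a downstream Filter Tokens (by Content) can then match directly.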
-
Linhnm_178 Member Posts: 4 Learner I
Hi Ceaperez,
Thank you so much for your reply. Is it possible to intentionally search for "Harm-reduction" instead of letting the process generate all those repeated word pairs on its own? In my case, my PDF file has 150 pages, so I'm afraid the results will be too long. Do you have any suggestions?
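A sketch of the targeted variant (the term list here is only an example): since the filter condition "matches" requires the whole token to match the expression, following Generate n-Grams (Terms) with Filter Tokens (by Content) and a regular expression listing only the wanted terms, e.g.

harm_reduction|global_warming

should leave at most those tokens in the word list, however long the PDF is; every other single token and generated pair is discarded before the counts are built.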