Extracting the most representative 10 keywords from web page
singing_bird_1
Hi all,
I am new to RapidMiner.
I want to know how I can extract the 10 most representative keywords from a web page.
Is there an operator that can do this? If not, please tell me how I can do it.
I want to give the URL of a web page as input and get the 10 most representative keywords of that page as output.
Thanks in advance.
Answers
You're going to use the Get Page operator, do some HTML cleaning with another operator, then put the result into a Text Processing routine. I'm running out the door, but do take a look through the Community for some XML examples.
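For anyone who wants to see what the HTML cleaning step amounts to outside RapidMiner, here is a minimal Python sketch using only the standard library. It is not how the RapidMiner operators are implemented; `TextExtractor` and `html_to_text` are hypothetical names made up for this illustration, and fetching the page is left out, so the function assumes the HTML is already in a string.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text of an HTML string with tags, scripts and styles removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(html_to_text(page))
```

The cleaned text is what would then go into the text processing step (tokenizing, stopword removal, and so on).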
As @Thomas_Ott suggests, this is definitely possible, but it will require a series of operators. Working with text from web pages can be quite tricky because of all the extra HTML and formatting.
It also depends on what you mean by "10 most representative" words. Many times, the most frequent words are not necessarily the words that capture the main topic of the page. So even after you have done text processing and have a word vector, you need to think about what exactly your definition of "most representative" might mean. Different ways of calculating the word vector can help with that: TF-IDF vs term frequency, for example.
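To make the difference concrete, here is a small Python sketch (outside RapidMiner, standard library only) that ranks the words of one page by raw term frequency and by TF-IDF against a small set of pages. The function names, the tiny stopword list and the plain `log(N / df)` weighting are assumptions chosen for illustration; RapidMiner's own word vector calculation may differ in detail.

```python
import math
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on", "for"}


def tokenize(text):
    """Lowercase, split on non-letters, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def top_terms_tf(doc, k=10):
    """Top k words of doc by raw term frequency."""
    counts = Counter(tokenize(doc))
    return [w for w, _ in counts.most_common(k)]


def top_terms_tfidf(doc, corpus, k=10):
    """Top k words of doc by tf * log(N / df); doc is assumed to be in corpus."""
    n_docs = len(corpus)
    df = Counter()
    for other in corpus:
        df.update(set(tokenize(other)))  # count each word once per document
    tf = Counter(tokenize(doc))
    scores = {w: c * math.log(n_docs / df[w]) for w, c in tf.items() if df[w]}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


if __name__ == "__main__":
    page = "birds birds sing songs"
    pages = [page, "birds fly south", "birds eat seeds"]
    print(top_terms_tf(page, 1))
    print(top_terms_tfidf(page, pages, 2))
```

In this toy corpus "birds" is the most frequent word on the page, but because it appears on every page its TF-IDF score is zero, so words specific to the page rank higher. Which behavior counts as "most representative" depends on your definition.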
Might I suggest using this process from my Tutorial page here: http://www.neuralmarkettrends.com/use-rapidminer-discover-twitter-content as a starting point. TTYL!
By "10 most representative keywords" I mean that, out of all the keywords extracted from the page, I want only the 10 that best describe the content or context of the page.
Yes, I agree with @Telcontar120. I would learn how to use the Text Processing Extension so you can tokenize, create word vectors, etc.
Scott
Thanks all for your replies.
I am now preprocessing the web pages.
First I filter out the HTML tags, then I will start the rest of the preprocessing.
I have a question, please. I am on the first step, removing the HTML tags.
I included 9 URLs in a CSV file to be processed, but after removing the HTML tags I get the text of only one web page, not all 9.
How can I get the text, with the HTML tags removed, for more than one URL?
Here is the XML for my process:
<?xml version="1.0" encoding="UTF-8"?><process version="7.5.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="open_file" compatibility="7.5.003" expanded="true" height="68" name="Open File" width="90" x="45" y="34">
<parameter key="filename" value="C:\Users\Mennatollah\Desktop\url_test_test.csv"/>
</operator>
<operator activated="true" class="read_csv" compatibility="7.5.003" expanded="true" height="68" name="Read CSV" width="90" x="179" y="34">
<parameter key="use_quotes" value="false"/>
<parameter key="parse_numbers" value="false"/>
<parameter key="first_row_as_names" value="false"/>
<list key="annotations"/>
<list key="data_set_meta_data_information"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.5.003" expanded="true" height="82" name="Nominal to Text" width="90" x="246" y="187"/>
<operator activated="true" class="web:retrieve_webpages" compatibility="7.3.000" expanded="true" height="68" name="Get Pages" width="90" x="447" y="34">
<parameter key="link_attribute" value="att1"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply" width="90" x="380" y="289"/>
<operator activated="true" class="text:process_document_from_data" compatibility="7.5.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="581" y="34">
<parameter key="keep_text" value="true"/>
<parameter key="prune_method" value="by ranking"/>
<parameter key="prune_below_rank" value="0.009"/>
<parameter key="prune_above_rank" value="0.095"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="web:extract_html_text_content" compatibility="7.3.000" expanded="true" height="68" name="Extract Content" width="90" x="45" y="34">
<parameter key="ignore_non_html_tags" value="false"/>
</operator>
<operator activated="true" class="multiply" compatibility="7.5.003" expanded="true" height="103" name="Multiply (2)" width="90" x="448" y="44"/>
<connect from_port="document" to_op="Extract Content" to_port="document"/>
<connect from_op="Extract Content" from_port="document" to_op="Multiply (2)" to_port="input"/>
<connect from_op="Multiply (2)" from_port="output 1" to_port="document 2"/>
<connect from_op="Multiply (2)" from_port="output 2" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
<portSpacing port="sink_document 3" spacing="0"/>
</process>
</operator>
<connect from_op="Open File" from_port="file" to_op="Read CSV" to_port="file"/>
<connect from_op="Read CSV" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Get Pages" to_port="Example Set"/>
<connect from_op="Get Pages" from_port="Example Set" to_op="Multiply" to_port="input"/>
<connect from_op="Multiply" from_port="output 1" to_port="result 2"/>
<connect from_op="Multiply" from_port="output 2" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
Hello @singing_bird_1, I'm glad you're making progress. Can you please re-post your XML inside the </> tool so that we can copy and paste it into RapidMiner ourselves?
Thanks.
Scott
attached the xml code
thank you
here is the xml code
thank you
Hello @singing_bird_1, OK, we're making some progress. Thank you for pasting your XML. It seems that you are running RM 7.5, which is an old version. Some of your operators were updated in 7.6, and you have pasted things like
in your XML which does not work well. Can you please try updating RapidMiner to 7.6, opening your process, going to the XML tab, copying exactly what is there, and pasting it here again in this thread?
Scott