Performing Principal Component Analysis of a set of tweets.
Hello! First and foremost, I apologize if this topic has already been covered elsewhere. I have spent a considerable amount of time looking for a method.
I have found two social science studies that ran PCA on text data using RapidMiner. They displayed in a table which words had the highest eigenvalue for a particular factor. I am interested in learning how to do this, but so far I have been frustrated by the lack of a documented process or steps. I also wonder whether it is so elementary that nobody has written up the process.
To be more specific, I am interested in analyzing an Excel file containing 2000 tweets (for starters). Thank you in advance for your sincere assistance!
Best Answer
Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
Well, without reading the whole study, it's hard to figure out exactly what they did.
I suspect it must be something like this:
<?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.2.003" expanded="true" name="Process">
<process expanded="true">
<operator activated="true" class="social_media:search_twitter" compatibility="7.2.000" expanded="true" height="68" name="Search Twitter" width="90" x="112" y="34">
<parameter key="connection" value="Twitter Connection"/>
<parameter key="query" value="iphone"/>
<parameter key="language" value="en"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.2.003" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.2.003" expanded="true" height="82" name="Generate Attributes" width="90" x="380" y="34">
<list key="function_descriptions">
<parameter key="label" value="&quot;iPhone&quot;"/>
</list>
</operator>
<operator activated="true" class="social_media:search_twitter" compatibility="7.2.000" expanded="true" height="68" name="Search Twitter (2)" width="90" x="112" y="136">
<parameter key="connection" value="Twitter Connection"/>
<parameter key="query" value="samsung"/>
<parameter key="language" value="en"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.2.003" expanded="true" height="82" name="Select Attributes (2)" width="90" x="246" y="136">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="generate_attributes" compatibility="7.2.003" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="380" y="136">
<list key="function_descriptions">
<parameter key="label" value="&quot;samsung&quot;"/>
</list>
</operator>
<operator activated="true" class="append" compatibility="7.2.003" expanded="true" height="103" name="Append" width="90" x="581" y="34"/>
<operator activated="true" class="set_role" compatibility="7.2.003" expanded="true" height="82" name="Set Role" width="90" x="715" y="34">
<parameter key="attribute_name" value="label"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.2.003" expanded="true" height="82" name="Nominal to Text" width="90" x="849" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.2.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="983" y="34">
<parameter key="prune_method" value="percentual"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:tokenize" compatibility="7.2.001" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
<connect from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="principal_component_analysis" compatibility="7.2.003" expanded="true" height="103" name="PCA" width="90" x="1117" y="34"/>
<connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
<connect from_op="Generate Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/>
<connect from_op="Search Twitter (2)" from_port="output" to_op="Select Attributes (2)" to_port="example set input"/>
<connect from_op="Select Attributes (2)" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
<connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
<connect from_op="Append" from_port="merged set" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="PCA" to_port="example set input"/>
<connect from_op="PCA" from_port="example set output" to_port="result 2"/>
<connect from_op="PCA" from_port="preprocessing model" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
That said, I'm a bit cautious about the 100% accuracy of their model.
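For anyone who wants to see what the last two steps of the process (word vector creation, then PCA) compute, here is a minimal pure-Python sketch. It is not what RapidMiner runs internally: it uses raw term counts instead of the operator's weighting, and a simple power iteration with deflation stands in for a proper eigen-solver. The function names and the example tweets are made up for illustration. Note that each component has a single eigenvalue; the per-word numbers you would put in a table are the loadings, i.e. the entries of that component's eigenvector.

```python
import math
import re
from collections import Counter


def term_count_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([float(counts[term]) for term in vocabulary])
    return vocabulary, rows


def covariance(rows):
    """Sample covariance matrix of the columns (terms) of a document-term matrix."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def top_components(cov, k, iterations=500):
    """Power iteration with deflation; returns up to k (eigenvalue, eigenvector) pairs."""
    p = len(cov)
    work = [row[:] for row in cov]  # copy, because deflation modifies the matrix
    components = []
    for _ in range(k):
        # Deterministic, non-uniform start vector, scaled to unit length.
        length = math.sqrt(sum((i + 1) ** 2 for i in range(p)))
        vec = [(i + 1) / length for i in range(p)]
        for _step in range(iterations):
            nxt = [sum(work[i][j] * vec[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:
                break  # no variance left in this direction
            vec = [x / norm for x in nxt]
        # Rayleigh quotient v^T C v estimates the eigenvalue for unit-length v.
        eigenvalue = sum(
            vec[i] * work[i][j] * vec[j] for i in range(p) for j in range(p)
        )
        if eigenvalue < 1e-12:
            break  # remaining components carry no variance
        components.append((eigenvalue, vec))
        # Deflate: remove the component just found so the next pass finds another.
        for i in range(p):
            for j in range(p):
                work[i][j] -= eigenvalue * vec[i] * vec[j]
    return components


def top_words(vocabulary, vector, n=3):
    """Words with the largest absolute loading on one component."""
    ranked = sorted(
        zip(vocabulary, vector), key=lambda pair: abs(pair[1]), reverse=True
    )
    return ranked[:n]


# Hypothetical tweets, made up for illustration only.
tweets = [
    "iphone camera is great",
    "iphone battery is bad",
    "samsung screen is great",
    "samsung battery is bad",
    "iphone camera beats samsung camera",
]
vocabulary, rows = term_count_matrix(tweets)
components = top_components(covariance(rows), k=2)
for number, (eigenvalue, vector) in enumerate(components, 1):
    words = ", ".join(
        f"{word} ({loading:+.2f})" for word, loading in top_words(vocabulary, vector)
    )
    print(f"PC{number}: eigenvalue={eigenvalue:.3f}  top words: {words}")
```

The sign of a component is arbitrary, which is why the words are ranked by absolute loading. For 2000 real tweets the vocabulary gets large, so you would want to prune rare terms first (as the prune option in Process Documents from Data does) and use a numerical library rather than this hand-rolled loop.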
Answers
Can you provide a link to where this was done? My initial thought is that the text was transformed into word vectors using TF-IDF or something similar.
Hello! Here is one article that claims to do it. I apologize that I cannot provide the whole article, but to quote the specific portion:
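To make "word vectors by TF-IDF" concrete, here is a minimal Python sketch of one common textbook variant: term frequency (count divided by document length) times the log of inverse document frequency. This is an assumption about the general idea, not RapidMiner's exact formula, which may normalize differently. The function names and the example tweets are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, rows) with one TF-IDF vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid dividing by zero for empty documents
        rows.append(
            [
                (counts[term] / total) * math.log(n_docs / doc_freq[term])
                for term in vocabulary
            ]
        )
    return vocabulary, rows


# Hypothetical tweets, made up for illustration only.
tweets = [
    "new iphone camera is great",
    "samsung screen is great",
    "iphone battery drains fast",
]
vocabulary, rows = tfidf_vectors(tweets)
for tweet, row in zip(tweets, rows):
    weighted = {term: round(w, 3) for term, w in zip(vocabulary, row) if w > 0}
    print(tweet, "->", weighted)
```

Each tweet becomes one row of numbers, with one column per word, and that table is what PCA would then be run on. Words that appear in every document get a weight of zero under this variant, so they drop out on their own.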
"We separated China from Philippine news reports, then extracted principal components from our two separate sets-of-words. This procedure is intuitively similar to what principal components analysis does to quantified variables." (Montiel et al., 2014)
Thank you! I will attempt to make sense of this.