Problem extracting data
Hello, I am new to RapidMiner. I started out with a simple Craigslist scrape. However, I do not get any data back. Can someone please advise?
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.2.008">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.2.008" expanded="true" name="Process">
<process expanded="true" height="-20" width="-50">
<operator activated="true" class="web:process_web" compatibility="5.2.003" expanded="true" height="60" name="Process Documents from Web" width="90" x="36" y="46">
<parameter key="url" value="http://tampa.craigslist.org/cto"/>
<list key="crawling_rules"/>
<parameter key="add_pages_as_attribute" value="true"/>
<parameter key="domain" value="subtree"/>
<parameter key="max_page_size" value="10000"/>
<parameter key="user_agent" value="Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.95 Safari/537.11"/>
<parameter key="obey_robot_exclusion" value="false"/>
<parameter key="really_ignore_exclusion" value="true"/>
<process expanded="true" height="171" width="738">
<operator activated="true" class="text:extract_information" compatibility="5.2.004" expanded="true" height="60" name="Extract Information" width="90" x="114" y="24">
<parameter key="query_type" value="XPath"/>
<list key="string_machting_queries"/>
<parameter key="attribute_type" value="Binominal"/>
<list key="regular_expression_queries"/>
<list key="regular_region_queries"/>
<list key="xpath_queries">
<parameter key="link" value="//*[@id=&quot;toc_rows&quot;]/p"/>
<parameter key="price" value="//*[@id=&quot;toc_rows&quot;]/p[2]/span"/>
<parameter key="location" value="//*[@id=&quot;toc_rows&quot;]/p[2]/span[6]/font"/>
<parameter key="title" value="/html/body/article/section/h2"/>
<parameter key="ad body" value="//*[@id=&quot;userbody&quot;]"/>
<parameter key="postingid" value="/html/body/article/section/p"/>
<parameter key="email" value="/html/body/article/section/section[1]/small/a"/>
</list>
<list key="namespaces">
<parameter key="postingtitle" value="*[local-name(.) = 'postingtitle']"/>
<parameter key="body" value="*[local-name(.) = 'userbody']"/>
<parameter key="email" value="*[local-name(.) = 'small']"/>
</list>
<list key="index_queries"/>
</operator>
<connect from_port="document" to_op="Extract Information" to_port="document"/>
<connect from_op="Extract Information" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Process Documents from Web" from_port="example set" to_port="result 1"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
</process>
</operator>
</process>
Answers
In Extract Information you have ticked "assume html". This is usually a good idea. However, it also means that all HTML tags are in the h: namespace, so an XPath step such as "p" or "span" matches nothing. If you adjust your XPaths to match e.g. "h:p" and "h:span" instead of just "p" and "span", you will get results.
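As a rough sketch of that adjustment, three of the queries from the process above might be rewritten as follows. This is untested against the live page and assumes that every element step, including html, body, article and section, sits in the h: namespace; attribute tests such as @id stay unprefixed. Single quotes inside the XPath are used here only to keep the attribute value valid XML.

```xml
<list key="xpath_queries">
  <!-- Element steps carry the h: prefix; the wildcard and @id test are unchanged. -->
  <parameter key="link" value="//*[@id='toc_rows']/h:p"/>
  <parameter key="price" value="//*[@id='toc_rows']/h:p[2]/h:span"/>
  <!-- Absolute paths need the prefix on every step (assumption: all steps are HTML elements). -->
  <parameter key="title" value="/h:html/h:body/h:article/h:section/h:h2"/>
</list>
```

The remaining queries would follow the same pattern, prefixing each element name with h: and leaving predicates on attributes as they are.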
Next time, please be more careful when posting the process xml and follow the instructions in the post linked from my signature - I had a hard time copying it into my RapidMiner instance.
Best regards,
Marius