Read data from HTML tables
There are many pages on the web that contain useful data in the form of simple html tables. Here's an example:
https://en.wikipedia.org/wiki/List_of_metropolitan
RapidMiner can retrieve pages automatically in HTML form using "get page" and store them as documents, and can even do this iteratively if a set of related pages is required. But what users often want is to extract the information in the HTML table into a usable example set in RapidMiner. So an operator should be created that does the following:
- collect the table column headers and use them as attribute names
- collect each data row from the table and store it as an example
- identify and set the appropriate data type for each resulting attribute
An operator that did all this automatically ("HTML table to data" or something similar) would be incredibly useful. In theory it could work much like the read csv operator, with a small wizard to identify the table and the columns, set the data types, etc.
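To make the three steps above concrete, here is a minimal Python sketch using only the standard library. It is not RapidMiner code and not how any existing operator works; the class name `FirstTableParser`, the function `html_table_to_records`, and the sample table are all made up for illustration. It assumes a simple table whose header row uses th cells and whose data rows use td cells, and it only reads the first table on the page.

```python
from html.parser import HTMLParser


class FirstTableParser(HTMLParser):
    """Hypothetical helper: collects headers and rows of the first table."""

    def __init__(self):
        super().__init__()
        self.headers = []
        self.rows = []
        self._row = None            # cells of the row currently being read
        self._cell = None           # text fragments of the current cell
        self._row_is_header = True  # stays True while a row has only <th>
        self._finished = False      # set when the first </table> is seen

    def handle_starttag(self, tag, attrs):
        if self._finished:
            return
        if tag == "tr":
            self._row, self._row_is_header = [], True
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            if tag == "td":
                self._row_is_header = False

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._finished:
            return
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row:
            if self._row_is_header and not self.headers:
                self.headers = self._row   # column headers -> attribute names
            else:
                self.rows.append(self._row)  # data row -> one example
            self._row = None
        elif tag == "table":
            self._finished = True


def html_table_to_records(html):
    """Return one dict per data row, keyed by the header cells."""
    parser = FirstTableParser()
    parser.feed(html)
    return [dict(zip(parser.headers, row)) for row in parser.rows]


# Made-up sample data, only to show the shape of the result.
SAMPLE = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30720</td></tr>
  <tr><td>Shelbyville</td><td>28100</td></tr>
</table>
"""
records = html_table_to_records(SAMPLE)
```

All values come out as strings here, so the third step (setting data types) would still have to be done on top of this.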
P.S. I know it is technically feasible using a series of XPath expressions in the read xml operator, but after consulting with some RM product experts in the general studio forum, the consensus is that with the current options it is still a multi-step process that requires good knowledge of XPath. So adding a single operator that did all the parsing, renaming, etc., would be a significant improvement.
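To show why the XPath route takes several steps, here is a rough Python analogue using the limited XPath support in `xml.etree.ElementTree`. This is not the read xml operator itself, only an illustration of the same idea; the table id `cities` and the data are made up, and the sketch assumes the page is well-formed XML, which real web pages often are not.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed XHTML; real pages usually need cleaning first.
XHTML = """<html><body>
<table id="cities">
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30720</td></tr>
  <tr><td>Shelbyville</td><td>28100</td></tr>
</table>
</body></html>"""

root = ET.fromstring(XHTML)

# Step 1: one expression to locate the table.
table = root.find(".//table[@id='cities']")

# Step 2: another expression for the header cells (attribute names).
names = [th.text.strip() for th in table.findall("./tr/th")]

# Step 3: select only the rows that contain data cells.
data_rows = table.findall("./tr[td]")

# Step 4: a further expression per row to pull out the cell values.
xpath_rows = [[td.text.strip() for td in tr.findall("./td")] for tr in data_rows]
```

Each step needs its own expression, and renaming the columns and setting types would still come afterwards.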
Comments
This sounds a little like a simple version of how Diffbot works. Have you tried out their service?
I agree having this as a simple operator would be pretty handy.
Hi all,
my colleague Edwin Yaqub recently developed an extension for this use case. Maybe you can check this out and give us some feedback?
Here is the link to the RapidMiner blog post. The extension is available on the marketplace.
http://community.rapidminer.com/t5/Community-Blog/The-Web-Table-Extraction-Operator/ba-p/37353
Best regards,
Edin
Hi all,
This seems to be a slightly older thread, but we just released an updated version of the Web Table Extraction extension.
It has an operator 'Read HTML Table' which addresses the features requested above by Telcontar120. The latest release adds type guessing on the ExampleSet and also allows loading HTML documents from a local file path in addition to a web URL. The latter is helpful when dealing with a large number of HTML files collected through crawls.
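For readers wondering what type guessing on extracted cells can look like, here is a small Python sketch. It is not the extension's implementation, which is not described in this thread; the function name `guess_type`, the three type labels, the treatment of "?" and empty strings as missing, and the handling of commas as thousands separators are all assumptions made for this example.

```python
def guess_type(values):
    """Guess 'integer', 'real' or 'nominal' for a column of cell strings."""
    # Assumption: empty cells and "?" mark missing values and are skipped.
    present = [v.strip() for v in values if v.strip() not in ("", "?")]
    if not present:
        return "nominal"
    # Try the narrowest type first, then fall back to wider ones.
    for name, cast in (("integer", int), ("real", float)):
        try:
            for v in present:
                # Assumption: commas are thousands separators, e.g. "2,500".
                cast(v.replace(",", ""))
        except ValueError:
            continue  # at least one value did not parse; try the next type
        return name
    return "nominal"
```

A column is only given a numeric type if every non-missing value parses, so a single stray label such as "n/a" makes the whole column nominal.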
HTML tables can have complicated header structures, and many of them are used for layout, i.e. they are not data tables. The operator uses a model to distinguish layout tables from data tables. The model is trained on a large number of HTML documents that were crawled and painstakingly cleaned in public research projects such as the Common Crawl, the Dresden Web Table Corpus and the Web Data Commons project (Uni Mannheim). Since there are infinitely many ways to structure a table in HTML, some standardization rules apply when extracting data from HTML pages: tables are not nested, the headers are single-layered, and the table has at least 2 attributes and 3 rows. The operator has been tested on Wiki-like pages and on many other websites whose HTML tables follow similar structures.
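The structural rules mentioned above can be written down as a simple check. The Python sketch below is only an illustration of those rules and not the operator's code; it does not cover the trained layout-versus-data model at all. The function name and its parameters are hypothetical, and it assumes that "3 rows" means three data rows, not counting the header.

```python
def passes_structure_rules(header_rows, data_rows, contains_nested_table=False):
    """Check a parsed table against the standardization rules.

    header_rows: list of header rows, each a list of cell strings
    data_rows:   list of data rows, each a list of cell strings
    """
    if contains_nested_table:
        return False          # tables must not be nested
    if len(header_rows) != 1:
        return False          # headers must be single-layered
    if len(header_rows[0]) < 2:
        return False          # at least 2 attributes
    if len(data_rows) < 3:
        return False          # at least 3 rows (assumed: data rows)
    return True
```

For example, a table with one header row of two cells and three data rows passes, while the same table with only two data rows does not.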
You can download the latest release from this link:
https://marketplace.rapidminer.com/UpdateServer/faces/download.xhtml?productId=rmx_web_table_extraction&platform=ANY&version=0.1.6
Hope it helps.
Kind Regards,
Edwin Yaqub