Highlights of the Extension Updates
At RapidMiner Research, we have just released updates to several extensions developed under the DS4DM research project. Here are the highlights of these updates.
Web Table Extraction Extension
The new version is 0.1.6. In this version, the ‘Read HTML Table’ operator can load HTML documents from a local file path in addition to a web URL. This is helpful when dealing with large numbers of HTML files that may have been collected through web crawling. Once the HTML tables have been retrieved and converted into ExampleSets, the operator can also guess the numeric data types of the attributes.
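To illustrate what type guessing means in practice, here is a minimal Python sketch. It is not the extension's actual implementation; the function name `guess_column_type` and the three type labels are assumptions made for this example. The idea is simply to try progressively looser numeric parses on the non-empty cells of a column and fall back to a nominal type when any cell fails to parse.

```python
from typing import List, Optional


def guess_column_type(values: List[Optional[str]]) -> str:
    """Guess 'integer', 'real', or 'nominal' for a column of raw cell strings.

    Hypothetical helper for illustration only; empty cells are ignored.
    """
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not non_empty:
        # Nothing to inspect, so stay with the most general type.
        return "nominal"

    def all_parse(cast) -> bool:
        # True only if every non-empty cell can be converted with `cast`.
        for v in non_empty:
            try:
                cast(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "real"
    return "nominal"


if __name__ == "__main__":
    print(guess_column_type(["1", "2", ""]))
    print(guess_column_type(["1.5", "2"]))
    print(guess_column_type(["Berlin", "3"]))
```

A real implementation would also need to handle locale-specific number formats (for example, decimal commas), which this sketch deliberately leaves out.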
Spreadsheet Table Extraction Extension
The new version is 0.2.1. In this version, the following updates are available:
- The ‘Read Google Spreadsheet’ operator provides type guessing so you can retrieve sheets from an online Google Spreadsheet document and directly process the numeric data.
- The ‘Read Excel Online’ operator is a completely new addition to the extension. It extends your reach to Excel Online spreadsheets. A dedicated blog post details the salient features of this operator: Reading Excel files directly from your companies OneDrive
PDF Table Extraction Extension
The new version is 0.1.4. This version also adds type guessing to the ‘Read PDF Table’ operator.
Data Search for Data Mining Extension
The new version is 0.1.2. This update includes various enhancements, the most notable of which are in the ‘Translate’ operator. The extension provides a Search-Join mechanism through the joint use of the ‘Data Search’, ‘Translate’ and ‘Fuse’ operators. Translate filters for tables that have a schema and instance match for the new attribute you want to discover and integrate into your original (query) table. Before fusion is performed, the discovered tables are converted to the schema of the query table. This requires statistical measures of interest to be defined at the cell level and the table level for the new attributes. In this update, we added metrics for defining “trust” in the new data, based on the similarity and dissimilarity of the data discovered by the Data Search operator. To this end, the following trust and mistrust measures have been added:
- Levenstein Mistrust: Mean value of the Levenshtein cross-distance for each non-empty cell value present in the discovered collection.
- Jaro Winkler Trust: Mean value of the Jaro-Winkler cross-distance for each non-empty cell value present in the discovered collection.
- Fuzzy Trust: Mean value of the Fuzzy cross-distance for each non-empty cell value present in the discovered collection.
- Missing Values: The number of empty values in a translated table.
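As a rough illustration of how such measures could be computed, the Python sketch below calculates a mean Levenshtein distance and a missing-value count for a list of cell values. It rests on an assumption: “cross-distance” is interpreted here as the distance between every pair of non-empty cell values, which may differ from what the Translate operator actually does. All function names are hypothetical.

```python
from itertools import combinations
from typing import List, Optional


def is_empty(cell: Optional[str]) -> bool:
    """A cell counts as empty if it is None or contains only whitespace."""
    return cell is None or cell.strip() == ""


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def mean_cross_distance(cells: List[Optional[str]]) -> float:
    """Mean Levenshtein distance over all pairs of non-empty cells.

    Assumed interpretation of "cross-distance"; returns 0.0 if there are
    fewer than two non-empty cells.
    """
    values = [c for c in cells if not is_empty(c)]
    pairs = list(combinations(values, 2))
    if not pairs:
        return 0.0
    return sum(levenshtein(x, y) for x, y in pairs) / len(pairs)


def count_missing(cells: List[Optional[str]]) -> int:
    """Number of empty cells, analogous to the Missing Values metric."""
    return sum(1 for c in cells if is_empty(c))


if __name__ == "__main__":
    column = ["abc", "abd", None, ""]
    print(mean_cross_distance(column), count_missing(column))
```

Under this reading, a higher mean distance means the discovered values disagree more with each other, which is why it is framed as a mistrust measure, while similarity-based scores such as Jaro-Winkler point the other way.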
Other metrics include Coverage and Trust (please refer to the earlier post for more details [1]). The figure below shows the distributions of these metrics on the Control Panel view of the Translate operator.
The Control Panel view of the Translate operator shows the list of translated tables and the distribution of statistical metrics such as Coverage, Ratio, Levenstein Mistrust, Jaro Winkler Trust, Fuzzy Trust and Missing Values, which are to be used by the Fuse operator.
This update paves the way for performing data fusion not only at the data level (using Voting, Clustered Voting, Intersection, etc.) but also at a more advanced metadata level, for example by optimizing over multiple objectives.
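For readers unfamiliar with data-level fusion, a simple voting strategy can be sketched in Python as follows. This is only an illustration of the general idea, not the Fuse operator's implementation; the function name `fuse_by_voting` is an assumption, and ties are broken arbitrarily in this sketch.

```python
from collections import Counter
from typing import List, Optional


def fuse_by_voting(candidates: List[Optional[str]]) -> Optional[str]:
    """Pick the most frequent non-empty value proposed by the translated tables.

    Hypothetical helper; returns None when no table offers a value.
    """
    votes = Counter(c for c in candidates if c is not None and c.strip() != "")
    if not votes:
        return None
    # most_common(1) yields a one-element list of (value, count) pairs.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Three discovered tables propose a value for the same cell; one has none.
    print(fuse_by_voting(["Berlin", "Berlin", "Munich", None]))
```

A metadata-level approach would instead weight or select candidate tables using metrics such as the trust and coverage scores described above, rather than counting raw values alone.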
Acknowledgments
The extensions are developed as part of the “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German Federal Ministry of Education and Research (BMBF).
References
[1] The Data Search for Data Mining, Release post, Web-link: http://community.rapidminer.com/t5/Community-Blog/The-Data-Search-for-Data-Mining-Extension-Release/ba-p/38231