Tutorial for the GeoProcessing extension
BalazsBarany
Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
There is a new extension called GeoProcessing in the RapidMiner Marketplace. To give you an idea of what you can do with this extension, here is a tutorial using some of the operators.
Our fictional scenario: We're working with the city of Vienna, Austria, to celebrate the long history of Vienna and the river Danube. For the celebrations, we would like to organize a boat race and a running event for children. We are working with geodata from the Open Data server of Vienna.
In the 1970s Vienna built an artificial island inside the Danube, called Donauinsel (Danube Island). Since then there's the Danube (left arm on the picture) and the New Danube (right). Here's a map to give you an idea:
We are only interested in the parts of the Danube and the New Danube that flow through Vienna. These are highlighted in the next map:
The boat race should be in the longest part of the Danube (or New Danube) through Vienna, so we want to determine the length of the river parts.
For the children's running event, we want to select the two bridges with the shortest distance between them. All bridges in Vienna are of course also available on the Open Data server:
We are obviously only interested in the bridges over the Danube, not every bridge in Vienna. So we will filter the data accordingly:
Then we will calculate the distance between every bridge and select the shortest one (ignoring very short distances of multi-part bridges).
In order to make RapidMiner capable of doing all this, install the GeoProcessing extension from the Marketplace. Make sure that you see the Geoprocessing folder in your Extensions in the Operators panel.
Some background knowledge
Earth is an irregular ellipsoid, but we like to look at maps in two dimensions, as these are more suitable for computer screens or paper. This transformation to two dimensions also allows the application of geometry calculations like distance, length, area and so on.
We express global coordinates in latitude and longitude degrees (counted from the equator and from the international 0 meridian in Greenwich). These are angles, so the distance between coordinates depends on the geographic position. We can't use these coordinates for calculating absolute sizes in our favourite measurement system (meters, yards, miles, ...).
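To illustrate why degrees are not a usable length unit, here is a minimal Python sketch (not part of the RapidMiner process). It assumes a perfectly spherical Earth with a mean radius of 6,371 km, which is only an approximation, and the function name is made up for this example.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumption: mean Earth radius, spherical approximation

def meters_per_degree_longitude(latitude_deg: float) -> float:
    """Approximate ground length of one degree of longitude at a given latitude."""
    return math.radians(1) * EARTH_RADIUS_M * math.cos(math.radians(latitude_deg))

# Equator, roughly the latitude of Vienna, and a location far in the north
for lat in (0.0, 48.2, 70.0):
    km = meters_per_degree_longitude(lat) / 1000
    print(f"latitude {lat:4.1f}: one degree of longitude is about {km:.1f} km")
```

The same angular difference covers less and less ground the further you move away from the equator, which is why lengths and distances should be calculated on projected coordinates.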
The process of transforming coordinates to a new coordinate system (CRS, coordinate reference system) is called projection or reprojection. You can think of it as taking a photo from an airplane or a satellite to transform the three-dimensional earth surface to a two-dimensional picture. The projected coordinates can be measured in meters or other units, and geometry functions will give us the expected measurements.
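The idea of a projection can be sketched in a few lines of Python. This is a simple local equirectangular approximation on a spherical Earth, chosen only because it fits in a few lines. It is not the projection used later in this tutorial and it is not what the extension does internally. The function name and the reference point are assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # assumption: mean Earth radius, spherical approximation

def project_local(lon_deg, lat_deg, lon0_deg, lat0_deg):
    """Project lon/lat degrees to local x/y meters around a reference point."""
    x = math.radians(lon_deg - lon0_deg) * EARTH_RADIUS_M * math.cos(math.radians(lat0_deg))
    y = math.radians(lat_deg - lat0_deg) * EARTH_RADIUS_M
    return x, y

origin = (16.37, 48.21)  # approximate lon/lat of central Vienna, used as reference
a = project_local(16.37, 48.21, *origin)
b = project_local(16.38, 48.22, *origin)

# After projecting, plain Euclidean geometry gives a result in meters
print(f"distance between a and b: {math.dist(a, b):.0f} m")
```

Such an approximation is only reasonable for small areas close to the reference point. For real work, use a proper coordinate reference system for your region.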
Coordinate systems are referred to by EPSG codes. You can check epsg.io to find an appropriate coordinate system for the area you're working on.
It's not always necessary to reproject coordinates. If we only want to know if a geometry contains or touches another geometry, we can calculate that in the original coordinate system (if we ignore problems spanning the line between longitudes -180° and 180°).
Getting the data
The Vienna open data server contains geodata in many formats. We can easily use the CSV version in RapidMiner. The example process loads the data directly from the web; you could of course save them locally if you need them more often.
This process contains standard RapidMiner operators only; the extension is not yet in use. The Read CSV operators are set up with the comma as the separator and UTF-8 encoding, but otherwise with the default settings. The attribute names come from the first line, and the data format is determined automatically.
We only keep a few attributes (the geometry and the object name) and rename them for later use. For example, the river geometry is renamed to riverGeom.
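For readers who want to see the equivalent logic outside RapidMiner, here is a rough Python sketch of the read, select and rename steps. The sample data and the column names NAME and SHAPE are hypothetical; the real files on the open data server may use different names.

```python
import csv
import io

# Hypothetical sample in the style of an open data CSV export.
# When reading a real file, open it with encoding="utf-8" and newline="".
SAMPLE_CSV = '''FID,NAME,SHAPE,OTHER
1,Donau,"LINESTRING (16.36 48.25, 16.40 48.23)",x
2,Neue Donau,"LINESTRING (16.37 48.26, 16.41 48.24)",y
'''

def read_rivers(text):
    """Read comma-separated data, keep name and geometry, and rename them."""
    reader = csv.DictReader(io.StringIO(text), delimiter=",")  # header from first line
    return [{"riverName": row["NAME"], "riverGeom": row["SHAPE"]} for row in reader]

rivers = read_rivers(SAMPLE_CSV)
for river in rivers:
    print(river)
```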
The standard for expressing geometries in textual form is called WKT, Well Known Text. The open data server delivers the geometries in this format, and this is also the format used by the GeoProcessing operators. If you have GIS data in a database, you can use ST_AsText in SQL to get them in this format.
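To show what a WKT string looks like and how its coordinates are laid out, here is a small Python sketch that parses a two-dimensional LINESTRING. It is deliberately limited: it does not handle other geometry types, Z values or EMPTY geometries, and the function name is made up for this example.

```python
import re

def parse_linestring(wkt: str) -> list[tuple[float, float]]:
    """Parse a simple 2D WKT LINESTRING into a list of (x, y) tuples."""
    match = re.fullmatch(r"\s*LINESTRING\s*\((.*)\)\s*", wkt, flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError(f"not a simple LINESTRING: {wkt!r}")
    points = []
    for pair in match.group(1).split(","):
        x, y = pair.split()  # WKT order is x y, i.e. longitude before latitude
        points.append((float(x), float(y)))
    return points

print(parse_linestring("LINESTRING (16.36 48.25, 16.40 48.23)"))
```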
The tutorial process
After reading the data, we first extract the parts of the Danube inside the boundaries of Vienna. We use Calculate Geometry Relation for this (Danube inside Vienna in the process). It has one input, so we need both the Vienna and the Danube coordinates in one example set. The easiest way to achieve this is a Cartesian join (it combines every row from the first example set with every row from the second one). We use the intersection function of Calculate Geometry Relation for getting the result. It returns the common part of the two geometries (a polygon and a line) as another geometry, in our case a shorter line (just the part of the Danube inside the Vienna polygon).
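The Cartesian join itself is easy to picture in code. The following Python sketch only shows the join, not the intersection calculation. The example rows and geometries are made up and heavily simplified.

```python
from itertools import product

# Hypothetical, heavily simplified rows standing in for the two example sets
cities = [{"cityName": "Wien", "cityGeom": "POLYGON ((16.2 48.1, 16.6 48.1, 16.6 48.3, 16.2 48.3, 16.2 48.1))"}]
rivers = [
    {"riverName": "Donau", "riverGeom": "LINESTRING (16.1 48.3, 16.7 48.1)"},
    {"riverName": "Neue Donau", "riverGeom": "LINESTRING (16.3 48.3, 16.6 48.15)"},
]

def cartesian_join(left, right):
    """Combine every row of the left table with every row of the right table."""
    return [{**l, **r} for l, r in product(left, right)]

joined = cartesian_join(cities, rivers)
for row in joined:
    print(row["cityName"], "+", row["riverName"])
```

Each resulting row carries both geometries, so an operator with a single input can work on the pair.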
We then filter out the New Danube for the bridges, but keep both parts for the river part length calculation.
We want to get the length in meters here, not in degrees on the ellipsoid. So we reproject the original coordinates to a projection commonly used in Austria, ETRS89 / Austria Lambert (EPSG code: 3416). This projection is appropriate here. If you work in a different geographical area, be sure to select an appropriate projection. (Choosing a wrong projection will lead to big distortions in the calculated measures.)
After reprojecting to EPSG:3416, we can calculate the length of the river arms with Calculate measures on a geometry (called Calculate river length here).
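Conceptually, the length of a line in a projected coordinate system is the sum of the lengths of its segments. A minimal Python sketch, using made-up coordinates in meters rather than the real river geometries:

```python
import math

def line_length(points):
    """Sum of straight segment lengths along a line given as (x, y) tuples."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Hypothetical projected coordinates in meters, not the real river data
river_arms = {
    "arm 1": [(0, 0), (3000, 4000), (3000, 10000)],
    "arm 2": [(0, 500), (6000, 8500)],
}

lengths = {name: line_length(points) for name, points in river_arms.items()}
longest = max(lengths, key=lengths.get)
print(lengths, "longest:", longest)
```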
Now on to the bridges.
First we want to find bridges that cross the Danube. This is a geographic join operation if we apply it on two example sets.
We select the function crosses here. Other functions include contains/containedBy, intersects, overlaps, touches, etc. The function parameter stays empty here, it is only used by isWithinDistance.
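To give an intuition for what a crosses test checks, here is a simplified Python sketch for two lines. It only detects proper crossings of segment interiors and ignores touching endpoints, crossings exactly at a vertex and collinear overlaps, so it is not a full implementation of the crosses relation used by the operator. All names and coordinates are made up.

```python
def orientation(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1 left turn, -1 right turn, 0 collinear."""
    value = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (value > 0) - (value < 0)

def segments_cross(a, b, c, d):
    """True if segments ab and cd intersect in a single interior point."""
    return (orientation(a, b, c) * orientation(a, b, d) < 0
            and orientation(c, d, a) * orientation(c, d, b) < 0)

def lines_cross(line1, line2):
    """True if any segment of line1 properly crosses any segment of line2."""
    return any(segments_cross(a, b, c, d)
               for a, b in zip(line1, line1[1:])
               for c, d in zip(line2, line2[1:]))

river = [(0, 0), (10, 0)]
bridge_over_river = [(5, -1), (5, 1)]
bridge_elsewhere = [(5, 1), (6, 2)]
print(lines_cross(bridge_over_river, river), lines_cross(bridge_elsewhere, river))
```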
Now we can create a distance "matrix" (not formatted as a matrix) for all the selected bridges. This happens in a subprocess.
To calculate distances, we will of course reproject the bridge coordinates to the Austrian meter-based coordinate system. We join the bridge table with itself using a Cartesian join so we get a row for every combination of bridges, but remove the row if it compares the bridge with itself.
Then we use Calculate Geometry Relation with the distance function on the projected geometries.
We then filter out everything with a distance of less than 100 meters to avoid returning irrelevant combinations (some smaller parts of the bridges are separate entries in the data).
Now we can sort the data by distance and return the first row. According to our data, "Steg an der Nordbahnbrücke" and "Floridsdorfer Brücke" would be the nearest ones, with a distance of 481 meters.
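The logic of the subprocess (self join, remove self pairs, calculate distance, filter, sort, take the first row) can be sketched in Python as follows. As a simplifying assumption, each bridge is represented by a single point in projected meters, and the names and coordinates are invented, so the numbers have nothing to do with the real result above.

```python
import math
from itertools import product

# Hypothetical bridges as single points in projected coordinates (meters)
bridges = {
    "Bridge A": (0.0, 0.0),
    "Bridge A (ramp)": (30.0, 40.0),  # 50 m from Bridge A, part of the same structure
    "Bridge B": (600.0, 800.0),
    "Bridge C": (900.0, 1200.0),
}

def nearest_pair(points, min_distance=100.0):
    """Return (name1, name2, distance) of the closest pair at least min_distance apart."""
    # Cartesian self join, dropping rows that compare a bridge with itself
    rows = [(a, b, math.dist(points[a], points[b]))
            for a, b in product(points, repeat=2) if a != b]
    # Drop very short distances that come from multi-part bridges
    rows = [row for row in rows if row[2] >= min_distance]
    rows.sort(key=lambda row: row[2])
    return rows[0] if rows else None

print(nearest_pair(bridges))
```

Because of the Cartesian join, every pair appears twice (A to B and B to A). That does not change which pair is nearest, but it doubles the number of rows.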
That's it, we are done with the analysis. We imported geodata from the Web, transformed coordinates, combined different example sets with different methods and calculated real-world measures on the geometries.
Some directions you could go from there:
- Use the operator Geometry to Coordinates to visualize data (it works best with point geometries, or if you have a large number of geometries)
- Try different ways to geographically join example sets
- Try out the different functions in Calculate Geometry Relation and Calculate measures on a geometry
I'm looking forward to your questions and remarks on the GeoProcessing extension and this tutorial.
Comments
Get Data (1st process shown above)
Calculating Distances (2nd process shown above)
Strangely enough, it still works on Studio version 9.10.001. However, when executed on AI Hub, there is an issue with the connection to the SQL database on version 9.10.001: after 1 hour and 45 minutes of execution, the following error is thrown: java.lang.IllegalAccessError: tried to access class com.microsoft.sqlserver.jdbc.SQLServerDriverIntProperty from class com.microsoft.sqlserver.jdbc.SQLServerDriver.
RapidMiner support advised me to upgrade to 9.10.008, but when I add the bundles of geoscript jar files, the SQL connection breaks. Any help would be much appreciated. Note that I've also developed scripts that use geohash and interpolation for quicker data matching, so I would really need to keep using the Groovy script with geoscript (unless geohash and interpolation operators are also available as an extension).
Nice to hear from you after such a long time.
I haven't looked into updating the geotools and geoscript lately. However, I'm actively using the GeoProcessing extension which should have newer libraries, and accessing MySQL and PostgreSQL databases is not a problem in the latest Studio. I don't have MS SQL to test.
I guess that updating the geo* library jars one by one to current versions is the best approach. Maybe some common logging or utility library is too old, it gets loaded when Studio starts, and then the MSSQL driver breaks.
Regards,
Balázs
I managed to get geoscript working without breaking the SQL connector. I only added the 113 jar files below out of the 142 jar files you have in the package. The SQL connection still works (both in Studio and AI Hub).
Yes. Crossing the boundary means that you had a point outside of the boundary and then in the next reading of that sensor it is inside.
Another way of expressing the same is having the boundary as a linestring instead of a polygon and the two points united to a linestring. Then you would actually use the "crosses" operation on these linestrings. But you could get false positives with irregular shapes, so I would recommend the first solution.
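As a rough illustration of the first solution, here is a Python sketch that flags the readings where a sensor moves from outside to inside a boundary. It uses a basic ray casting test for point in polygon and is not code from the extension. The polygon, the readings and the function names are made up, and points exactly on the boundary are not handled specially.

```python
def point_in_polygon(point, polygon):
    """Ray casting test: True if the point lies inside the polygon (list of (x, y) vertices)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal ray through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def entry_indices(readings, polygon):
    """Indices of readings that are inside while the previous reading was outside."""
    flags = [point_in_polygon(p, polygon) for p in readings]
    return [i for i in range(1, len(flags)) if flags[i] and not flags[i - 1]]

boundary = [(0, 0), (10, 0), (10, 10), (0, 10)]
readings = [(-5, 5), (-1, 5), (2, 5), (5, 5), (12, 5), (8, 5)]
print(entry_indices(readings, boundary))
```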
Regards,
Balázs
I started with data having Lat,Long as columns
e.g. 120.16006113532586 22.98360854837837
I wanted to create a POLYGON in WKT, assuming this would allow me to use 'Calculate Geometry Relation'.
So I used ReadCSV -> Coordinates To Geometry.
... hoping to convert the above Lat, Long into POLYGON((120 22, 120 22, 120 22,. ...))
But instead the result from Coordinates to Geometry is:
Is the overall strategy correct? If yes, how to solve the part above?
Thanks
Yes, coordinates are just points, so Coordinates to Geometry only creates Points.
Grouping points to linestrings or polygons is not available in the Geoprocessing extension. You might be able to create the correct polygon WKT using Generate Attributes and Aggregate in RapidMiner.
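The string building that such an aggregation would have to do looks roughly like this in Python. It is only a sketch of the WKT format, not a RapidMiner process: it assumes the points are already in the correct ring order and given as (longitude, latitude), and the function name is made up.

```python
def points_to_polygon_wkt(points):
    """Build a WKT POLYGON string from an ordered list of (x, y) points."""
    if len(points) < 3:
        raise ValueError("a polygon needs at least three points")
    ring = list(points)
    if ring[0] != ring[-1]:
        ring.append(ring[0])  # the ring of a WKT polygon must be closed
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON (({coords}))"

# Hypothetical points in (longitude, latitude) order
print(points_to_polygon_wkt([(120.16, 22.98), (120.17, 22.98), (120.17, 22.99)]))
```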
These complex things are usually done on the data level in a GIS-enabled database like PostGIS, or in a tool like QGIS. RapidMiner, even with the Geoprocessing extension, is not a replacement for an entire GIS pipeline.
If your data are polygons, you should have them as polygons, then you can use Geoprocessing e. g. for the matching process.
Regards,
Balázs