
Clustering of GPS coordinate data?

CausalityvsCorr Member Posts: 17 Contributor II
edited November 2018 in Help

I have a dataset that contains only a set of coordinate IDs (the names of buildings) and their latitude and longitude coordinates. The dataset has three rows: the name (coordinate ID) of each building, then the latitude, and finally, below it, the longitude.

 

I need to cluster the IDs based on their mutual information so that each cluster consists of IDs that are near each other.

=>

How should I pre-process the data?

Does RapidMiner have a proper algorithm for this task?


Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Hi, to use Geo and GIS functions in RapidMiner you'll have to hack some Groovy. There's a great thread here: http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Geographic-operations-in-RapidMiner/m-p/25118

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    If the attributes are on different rows, you probably also need to pivot your data so that the lat and long for each building are on the same row.  That's the way you'll want it to be for the clustering.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • CausalityvsCorr Member Posts: 17 Contributor II

    Thank you very much for the feedback.

     

    For Thomas: Do I really need all these extra packages? For example, I do not need to show anything on a map.

     

    For Telcontar120: Yes, currently the dataset is in the form row 1: building ID, row 2: lat, row 3: long. What exactly does "to have all in one row" mean? And if everything is in one row, how would it be possible to use clustering such as k-means when the number of rows is just one?

  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn

    Hi,

     

    the representation you want to use in RapidMiner is this (CSV example):

     

    Building ID;latitude;longitude

    123;22.524;19.4904

     

    etc. This means that each example is a row. 

     

    You describe your representation like this:

     

    123

    22.524

    19.49

     

    If this is actually the case, you need to preprocess your data to get the first (tabular) form.
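
As a rough illustration of that pre-processing step outside RapidMiner, here is a minimal Python sketch. It assumes the values really are stacked in groups of three (ID, latitude, longitude), or alternatively stored as three parallel rows as the original question describes. The function names and the second building (124) are hypothetical and only serve the example.

```python
def stacked_to_rows(values):
    """Turn [id, lat, lon, id, lat, lon, ...] into one (id, lat, lon) tuple per building."""
    if len(values) % 3 != 0:
        raise ValueError("expected a multiple of three values (id, lat, lon)")
    rows = []
    for i in range(0, len(values), 3):
        building_id, lat, lon = values[i:i + 3]
        rows.append((str(building_id), float(lat), float(lon)))
    return rows


def wide_to_rows(ids, lats, lons):
    """Turn three parallel rows (all IDs, all lats, all lons) into one tuple per building."""
    return [(str(b), float(lat), float(lon)) for b, lat, lon in zip(ids, lats, lons)]


# First building taken from the post above; the second one is made up.
stacked = ["123", "22.524", "19.49", "124", "22.530", "19.50"]

print("Building ID;latitude;longitude")
for building_id, lat, lon in stacked_to_rows(stacked):
    print(f"{building_id};{lat};{lon}")
```

Either function yields one example per building, which is the tabular layout described above.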

     

    About the need for converting coordinates: this depends on the geographical area. In an ideal world, one degree of latitude and one degree of longitude would represent the same distance. This is almost true around the equator but spectacularly wrong at high northern and southern latitudes, where one degree of longitude covers far less ground than one degree of latitude. If you have an actual globe, you can easily see why (the cells of the coordinate grid are not squares but trapezoids). 
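
To put rough numbers on this, the sketch below assumes a spherical Earth with a mean radius of 6371 km (a simplification). Under that assumption a degree of latitude is about 111 km everywhere, while a degree of longitude shrinks with the cosine of the latitude. The function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical-Earth assumption
KM_PER_DEGREE_LATITUDE = math.radians(1) * EARTH_RADIUS_KM  # about 111 km


def km_per_degree_longitude(latitude_deg):
    """Approximate east-west length of one degree of longitude at a given latitude."""
    return KM_PER_DEGREE_LATITUDE * math.cos(math.radians(latitude_deg))


for lat in (0, 30, 60, 80):
    ratio = km_per_degree_longitude(lat) / KM_PER_DEGREE_LATITUDE
    print(f"latitude {lat:2d}: 1 degree of longitude is {ratio:.2f} x 1 degree of latitude")
```

At 60 degrees north or south the ratio is about one half, so treating raw degrees as equal distances would stretch east-west separations by roughly a factor of two there.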

     

    So the correct, applicable-to-every-situation way is to convert your coordinates into a representation (projection or CRS = coordinate reference system) that is defined for the area you're applying your process in. Doing this conversion is possible in RapidMiner with the mentioned Groovy scripts (or ready-to-use processes), but you need to install the Geoscript libraries for it to work. If you don't need to do this often, you might want to transform the coordinates in QGIS (graphical) or with a command line program like ogr2ogr.
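
For a small area, a much cruder stand-in for a real projection is the equirectangular approximation sketched below. This is not the CRS conversion described above and it is not what the Groovy scripts, QGIS or ogr2ogr do. It only rescales longitude by the cosine of a reference latitude on an assumed spherical Earth, so treat it as an illustration of the idea. The function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical-Earth assumption


def to_local_xy_km(points, ref_lat=None):
    """Map (lat, lon) pairs to approximate (x, y) kilometres.

    x grows eastwards and y grows northwards. Only reasonable when all points
    lie in a small region around ref_lat (defaults to the mean latitude).
    """
    if ref_lat is None:
        ref_lat = sum(lat for lat, _ in points) / len(points)
    scale = math.cos(math.radians(ref_lat))
    return [
        (EARTH_RADIUS_KM * math.radians(lon) * scale,
         EARTH_RADIUS_KM * math.radians(lat))
        for lat, lon in points
    ]


# Hypothetical points at 60 degrees north.
for x, y in to_local_xy_km([(60.0, 0.0), (60.0, 1.0), (61.0, 0.0)], ref_lat=60.0):
    print(round(x, 1), round(y, 1))
```

Euclidean distances between the resulting x/y pairs are then roughly in kilometres in both directions, which is what a distance-based clustering method expects.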

     

    That said, you might be in a region where the difference between latitude and longitude distance is negligible, or it wouldn't harm the operation you're applying. For example, some clustering methods would be less affected than others. (You might want to Normalize your lat/long data for some clustering algorithms.)

     

    Regards,

    Balázs

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Hi CvC, you'll need those libraries if you want to do distance calculations based on lat/long and other geo calculations.
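
For reference, the distance calculation itself is short. The sketch below uses the haversine formula in plain Python, outside RapidMiner, assuming a spherical Earth with a 6371 km mean radius. It is not the code from the Groovy thread, and the function name is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical-Earth assumption


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude at the equator versus at 60 degrees north.
print(round(haversine_km(0.0, 0.0, 0.0, 1.0), 1))
print(round(haversine_km(60.0, 0.0, 60.0, 1.0), 1))
```

A pairwise distance matrix built this way could feed a clustering method that accepts precomputed distances. Plain k-means works on coordinates rather than a distance matrix, so it would need projected or rescaled coordinates instead.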

  • Telcontar120 RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    Clustering groups examples (rows) by comparing their attributes (columns), so each building has to be one example.  Based on your description, you need to take the data from the following structure:

    id1 Building #1

    id1 Lat #1

    id1 Long #1

    id2 Building #2

    id2 Lat #2

    id2 Long #2

     

    To this:

    id  Building  Lat  Long

     

    So each building is its own row with an associated latitude and longitude.  You can then cluster on latitude and longitude to find the buildings that are closest to each other.  Don't forget to normalize your numerical data too before clustering!
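
Put together, the workflow described above (one row per building, normalize, then cluster on latitude and longitude) can be sketched in plain Python. This is an illustrative k-means written from scratch, not RapidMiner's operator. The building names and coordinates are made up, and the deterministic farthest-point initialisation is a choice made here only so the example is reproducible.

```python
import math


def zscore_columns(points):
    """Rescale each column (latitude, longitude) to mean 0 and standard deviation 1."""
    scaled_columns = []
    for col in zip(*points):
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        scaled_columns.append([(v - mean) / std if std else 0.0 for v in col])
    return list(zip(*scaled_columns))


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=100):
    """Return one cluster index per point (basic Lloyd iteration)."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Deterministic start: first point, then repeatedly the point farthest
    # from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


# Hypothetical buildings: (name, latitude, longitude), one row per building.
buildings = [
    ("A", 22.524, 19.490),
    ("B", 22.526, 19.492),
    ("C", 22.700, 19.900),
    ("D", 22.702, 19.903),
]
coords = [(lat, lon) for _, lat, lon in buildings]
labels = kmeans(zscore_columns(coords), k=2)
for (name, _, _), label in zip(buildings, labels):
    print(name, "-> cluster", label)
```

Whether z-score normalization or a projected coordinate system is the better preparation depends on the region, as discussed earlier in the thread.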

     

     

     

  • CausalityvsCorr Member Posts: 17 Contributor II

    Thank you for the excellent feedback and the multiple viewpoints. I think I can now manage my stuff properly.
