
What is the best way to use Version Control with RapidMiner?

CarlosB Member Posts: 10 Contributor I
edited November 2018 in Help

I would like to save my RapidMiner work in a source code repository such as GitHub. I noticed that the RapidMiner repository on my local machine has files that seem to correspond to my processes. It looks like each process has three files on the hard drive: ".ioo", ".md", and ".properties". I'm thinking of checking these files into GitHub and relying on it to store and version my work. I'm interested in knowing how other people are versioning their work.
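Concretely, I'm picturing something along these lines, sketched in Python just to spell out the git steps (the repository path is only a guess at a typical local install, and the GitHub remote is a placeholder; adjust both to your own setup):

```python
import subprocess
from pathlib import Path

# Guess at the default location of a local RapidMiner repository --
# adjust this to wherever your "Local Repository" actually lives.
repo_dir = Path.home() / ".RapidMiner" / "repositories" / "Local Repository"

def git(*args):
    """Run a git command inside the RapidMiner repository folder."""
    subprocess.run(["git", *args], cwd=repo_dir, check=True)

git("init")
git("add", ".")  # picks up the .ioo/.md/.properties files
git("commit", "-m", "Snapshot of my RapidMiner processes")

# After creating an empty repository on GitHub (placeholder URL; branch name
# depends on your git defaults):
# git("remote", "add", "origin", "git@github.com:<user>/<repo>.git")
# git("push", "-u", "origin", "main")
```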


Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Great question. With Server I usually do simple versioning in the versioning view panel. For just Studio, I save the processes to a repository and then push that up to GitHub.

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn
    Hi all,

    I've been thinking about this problem too. I want to implement a data science project in the spirit of TDSP (more or less the same applies to CRISP-DM), and I want to keep most of the documentation in a GitHub or Bitbucket repository.

    The problem is how to integrate the RapidMiner processes into the equation. One option would be for analysts to have their local RM repository associated with the GitHub one, so that they can open pull requests. The admin would then accept the changes and pull the updated GitHub repository into RM Server!

    One problem with that is that it practically negates RM Server's collaboration functions (which are dated, BTW!). Another is the handling of datasets and big models, which shouldn't be passed around through git. To avoid that, they could be managed by the processes themselves, so that RM Server can fetch the data on its own (e.g. with scheduling); see the rough sketch at the end of this post.

    If anyone can provide more ideas I would appreciate it!
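
    As a rough illustration of the git side of that idea (I'm assuming here that the .ioo files are the bulky serialized data objects and that the local repository sits in its default location; both are guesses worth double-checking):

    ```python
    import subprocess
    from pathlib import Path

    # Assumed default location of the local RapidMiner repository.
    repo_dir = Path.home() / ".RapidMiner" / "repositories" / "Local Repository"

    # Keep the heavy data objects out of version control; only the process
    # definitions and their metadata get committed and reviewed via pull requests.
    (repo_dir / ".gitignore").write_text("*.ioo\n")

    subprocess.run(["git", "add", ".gitignore"], cwd=repo_dir, check=True)
    subprocess.run(["git", "commit", "-m", "Exclude data objects from versioning"],
                   cwd=repo_dir, check=True)
    ```

    The datasets themselves would then be re-created by a scheduled process on RM Server instead of travelling through git.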
  • BalazsBarany Administrator, Moderator, Employee-RapidMiner, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 955 Unicorn
    Hi,

    this is something that I think about quite often. 

    First of all, many data scientists work with data that doesn't belong in a public repository. But you can of course use Git completely offline.
    If you don't have these concerns, GitHub is a good choice, or you might have a company Git or SVN server.

    As the others said, this workflow is OK if you use a local repository, but you lose the benefits of the server repository for collaboration. It would be great to have a more advanced versioning system on the Server. (Now that recent versions keep the repository in files, an integration with git seems easier to do.) 

    I've also seen people who kept their repositories in a synchronized, versioned Dropbox folder. That's again an external cloud service, so maybe not the best for private data, but hard to beat in terms of convenience: every change in your processes and data is automatically archived and you can access a limited number of previous versions.
    Non-external solutions for this include Syncthing (really awesome) and a self-hosted Nextcloud or Owncloud server. Even a Raspberry Pi is powerful enough for running these.

    Regards,
    Balázs