Git Workflow For Data Scientists (git rebase)

Intro

It is not uncommon for new data scientists to enter the workforce with strong programming and analytical skills but limited experience coding on team projects. This can happen for a variety of reasons, and it sometimes limits their productivity until they get used to the team's workflow.

The goal of this post is to shed some light on a simple but effective git workflow based on git rebase, to help new data scientists hit the ground running when joining team projects.

Git is one of the most popular version control systems used by developers to track and store changes to almost any kind of file in a software project. It has great features for code synchronization and backup that are especially helpful for developers working on bigger projects.

The following workflow might not be suited for every situation, but it has been great for our team at Auto ML Station.

Git workflow based on git rebase

The following explanation assumes that the team in question uses the develop branch as their main branch for features in development and that you are about to start a new task on my_new_branch.

Since other people are working on the same project, two things have to be considered:

  1. It is not recommended to work directly on the develop branch;
  2. You have to pull all the latest changes from the remote develop branch to your local repository to make sure your work is up to date with the rest of the team.

This can be done with:

  • git checkout develop
  • git pull

After synchronizing your local and remote repos, you can see below that E is the team's latest commit, and you want to base your work on it.

    A---B---C---D---E 'develop'

You can use the following commands to create my_new_branch and switch into it to start working:

  • git branch my_new_branch (this step can be done in a web browser via Jira, Bitbucket or some other similar tool that might be synchronized to your remote repository)
  • git checkout my_new_branch
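As a shorthand, git checkout -b creates the branch and switches to it in a single step. The sketch below tries this in a throwaway repository; the temporary directory, the "Example User" identity and the empty commit standing in for E are assumptions for illustration, not part of any real project setup.

```shell
# Throwaway repository so the sketch does not touch a real project.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
git checkout -q -b develop

# Empty commit standing in for E, the team's latest commit.
git commit -q --allow-empty -m "E"

# Create my_new_branch from the current commit and switch to it in one step.
git checkout -q -b my_new_branch

# Print the name of the branch that is now checked out.
git rev-parse --abbrev-ref HEAD
```

On Git 2.23 or newer, git switch -c my_new_branch does the same thing.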

After completing your task, let's assume you used the commands below to add 3 new commits (H, I and J) to your new branch.

  • git status (to check for modified files)
  • git add <file_names> (to stage modified files)
  • git commit -m 'my new commit message' (to commit staged modifications)

Your git tree should look like the following:

          H---I---J 'my_new_branch'
         /
    D---E 'develop'
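To check the shape of the tree yourself, git log can draw it in the terminal. The following is a minimal sketch that rebuilds the diagram above in a throwaway repository and prints it; the make_commit helper, the one-file-per-commit layout and the placeholder identity are hypothetical and exist only to make the example self-contained.

```shell
# Throwaway repository so the sketch does not touch a real project.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
git checkout -q -b develop

# Hypothetical helper: commit one new file named after the commit label.
make_commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

# Commits D and E on develop.
make_commit D
make_commit E

# Commits H, I and J on my_new_branch.
git checkout -q -b my_new_branch
make_commit H
make_commit I
make_commit J

# Draw all branches as a compact graph, one line per commit.
git log --oneline --graph --all
```

In a real project only the last command is needed; the rest just recreates the example history.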

Suppose that while you were working on my_new_branch, one of your teammates was also working on commits F and G, and their git tree looks like this:

          F---G 'teammate_branch'
         /
    D---E 'develop'

At this point, either of you can merge into develop without conflicts, but as soon as that's done, the other one will have to complete some extra steps to update their local work branch before being able to merge into develop without issues.

Let's assume your colleague pushed their branch before you did, and your team merged that work into the remote develop.

At this point, you have to pull the new changes into your local develop branch with:

  • git checkout develop
  • git pull

After this synchronization your git tree will be as follows:

          H---I---J 'my_new_branch'
         /
    D---E---F---G 'develop'

Now you can switch to my_new_branch and update it with the latest develop commits with the commands below.

  • git checkout my_new_branch
  • git rebase develop

The rebase command moves the base of your branch to the tip of develop: it replays your commits on top of F and G, giving them new hashes (H*, I* and J*), as seen below.

                  H*--I*--J* 'my_new_branch'
                 /
    D---E---F---G 'develop'
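The effect of the rebase can be reproduced end to end in a throwaway repository. This is only a sketch under simplifying assumptions: the make_commit helper is hypothetical, every commit adds its own file so that no conflicts arise, and the teammate's commits F and G are created directly on the local develop instead of being pulled.

```shell
# Throwaway repository so the sketch does not touch a real project.
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
git checkout -q -b develop

# Hypothetical helper: commit one new file named after the commit label.
make_commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

# Shared history, then your work on my_new_branch.
make_commit D
make_commit E
git checkout -q -b my_new_branch
make_commit H
make_commit I
make_commit J
old_tip="$(git rev-parse my_new_branch)"

# Teammate's commits arrive on develop (simulated locally here).
git checkout -q develop
make_commit F
make_commit G

# Replay H, I and J on top of G.
git checkout -q my_new_branch
git rebase -q develop
new_tip="$(git rev-parse my_new_branch)"

# The tip hash changed because the commits were rewritten.
echo "before: $old_tip"
echo "after:  $new_tip"
git log --oneline --graph develop my_new_branch
```

If your commits touch the same lines as F or G, the rebase stops at the conflicting commit. After fixing the files and staging them with git add, git rebase --continue resumes the replay, and git rebase --abort returns the branch to its state before the rebase.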

With that, all that's left to do is push your work into the remote repo:

  • git push --set-upstream origin my_new_branch, or just git push if the target branch already exists on the remote.
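One case the steps above do not cover: if you had already pushed H, I and J before rebasing, the remote branch still holds the old commits, and a plain git push is rejected as a non-fast-forward. A common way out is git push --force-with-lease, which overwrites the remote branch only if nobody else has pushed to it in the meantime. The sketch below simulates this with a local bare repository standing in for the remote; the directory names, the make_commit helper and the single-commit history are assumptions for illustration.

```shell
work="$(mktemp -d)"

# Hypothetical helper: commit one new file named after the commit label.
make_commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

# A local bare repository stands in for the remote.
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
git remote add origin "$work/remote.git"
git checkout -q -b develop
make_commit E

# Your work, pushed BEFORE the rebase.
git checkout -q -b my_new_branch
make_commit H
git push -q --set-upstream origin my_new_branch

# Teammate's commit arrives on develop (simulated locally), then you rebase.
git checkout -q develop
make_commit F
git checkout -q my_new_branch
git rebase -q develop

# The remote still has the pre-rebase H, so a plain push is refused.
if git push -q origin my_new_branch 2>/dev/null; then
    echo "plain push succeeded"
else
    echo "plain push rejected"
fi

# Overwrite the remote branch, but only if it is still where we last saw it.
git push -q --force-with-lease origin my_new_branch
```

Force-pushing rewrites the branch for everyone who uses it, so this is best kept to branches that only you work on.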

Now your team can review and merge your branch into develop (probably by creating a pull request, which I expect to explore in a future post).

Assuming the merge is a fast-forward (no extra merge commit), the final result will be a develop branch that looks like this:

    D---E---F---G---H*--I*--J* 'develop'

Summary

  1. git checkout develop
  2. git pull
  3. git branch my_new_branch (this step can be done in a web browser via Jira, Bitbucket or some other similar tool that might be synchronized to your remote repository)
  4. git checkout my_new_branch
  5. Work on your code...
  6. git status
  7. git add <file_names>
  8. git commit -m 'my new commit message'
  9. git checkout develop
  10. git pull
  11. git checkout my_new_branch
  12. git rebase develop
  13. git push --set-upstream origin my_new_branch, or just git push if the target branch already exists on the remote
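Steps 9 to 12 are the ones you will repeat most often, so they can be wrapped in a small shell function. The sync_and_rebase name is hypothetical, and the sketch assumes that develop has an upstream configured and that a fast-forward-only pull is acceptable. The surrounding setup (a local bare repository as the remote, a second clone playing the teammate, the make_commit helper) exists only to make the example runnable.

```shell
work="$(mktemp -d)"

# Hypothetical helper: commit one new file named after the commit label.
make_commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

# Hypothetical helper wrapping steps 9 to 12; stops at the first failure.
# Usage: sync_and_rebase <branch>
sync_and_rebase() {
    git checkout -q develop &&
    git pull -q --ff-only &&
    git checkout -q "$1" &&
    git rebase -q develop
}

# Your clone: commit E on develop, published to a local bare "remote".
git init -q --bare "$work/remote.git"
git init -q "$work/you"
cd "$work/you"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
git remote add origin "$work/remote.git"
git checkout -q -b develop
make_commit E
git push -q --set-upstream origin develop

# Your task: commit H on my_new_branch.
git checkout -q -b my_new_branch
make_commit H

# Teammate's clone: commit F lands on the remote develop.
git clone -q -b develop "$work/remote.git" "$work/teammate"
(
    cd "$work/teammate"
    git config user.name "Teammate"
    git config user.email "teammate@example.com"
    make_commit F
    git push -q origin develop
)

# Pull F into develop and replay H on top of it.
sync_and_rebase my_new_branch
git log --oneline --graph develop my_new_branch
```

Because the commands are chained with &&, a failed pull or a rebase conflict stops the function so that you can inspect the repository before continuing.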

For a more detailed explanation of git rebase and its options, check the official git rebase documentation.


