<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[AutoML Station's team blog]]></title><description><![CDATA[Our goal is to share relevant Machine Learning and Data Science knowledge in order to boost developers’ productivity.]]></description><link>https://blog.amlstation.com</link><generator>RSS for Node</generator><lastBuildDate>Wed, 08 Apr 2026 20:12:25 GMT</lastBuildDate><atom:link href="https://blog.amlstation.com/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Clean Code - What does it mean, and why is it important?]]></title><description><![CDATA[Intro
Most of us have already seen something like "Must be able to write Clean Code"  in the list of requirements for some data science and/or software development opportunity, but what does it mean, and why is it important?
Robert C. Martin (Uncle B...]]></description><link>https://blog.amlstation.com/clean-code-what-does-it-mean-and-why-is-it-important</link><guid isPermaLink="true">https://blog.amlstation.com/clean-code-what-does-it-mean-and-why-is-it-important</guid><category><![CDATA[Python]]></category><category><![CDATA[clean code]]></category><category><![CDATA[agile]]></category><category><![CDATA[Productivity]]></category><category><![CDATA[Programming Tips]]></category><dc:creator><![CDATA[Patrick Metzner Morais]]></dc:creator><pubDate>Fri, 01 Jul 2022 14:08:44 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1656683857482/re0NLFmrg.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-intro">Intro</h2>
<p>Most of us have already seen something like <em>"Must be able to write Clean Code"</em>  in the list of requirements for some data science and/or software development opportunity, but what does it mean, and why is it important?</p>
<p>Robert C. Martin (Uncle Bob), one of the professionals behind the Agile Manifesto (2001), extensively explored the principles of Clean Code in his book <strong><em>Clean Code: A Handbook of Agile Software Craftsmanship</em></strong>, originally published in 2008.</p>
<p>In short, Clean Code is about being professional: writing maintainable and testable code that enables programmers to be more productive. It comprises a set of techniques that allow developers to go fast in the long run.</p>
<p>This post summarizes some of the main topics covered in Uncle Bob's book, but reading the book in full is highly encouraged.</p>
<h2 id="heading-why-clean-code-matters">Why does Clean Code matter?</h2>
<p>Unless a project is in its initial stage, much of the software development work consists of maintaining and extending existing code, meaning that a lot of time is spent reading code rather than writing new lines.</p>
<p>Not following the principles of Clean Code can drastically increase the time needed to read and fully understand a piece of code, leading to bottlenecks and large losses. In extreme circumstances the development speed can slow down almost to a halt.</p>
<p>Clean Code is a great way to go fast and stay productive in the long run, and the following section explores how to write it.</p>
<h2 id="heading-how-to-write-clean-code">How to write Clean Code?</h2>
<p>To make this post lighter to read and easier to consult later, this section is divided into a few lists that roughly summarize what I believe are the most important and easiest-to-implement principles of Clean Code, leaving denser and more advanced topics, like <strong><em>S.O.L.I.D.</em></strong>, for future posts.</p>
<h3 id="heading-chapter-2-meaningful-names">Chapter 2 - Meaningful Names</h3>
<p>Any software is filled with names, and choosing them wisely can take time, but significantly improves code readability. When choosing names for variables and functions, as well as files and directories, keep the following in mind:</p>
<ul>
<li>Names should be relevant and explicit. No relevant information should be just implicit;</li>
<li>Names should encode intent, and should reveal what something is and what it does without the need for comments;</li>
<li>Names may be long if necessary;</li>
<li>Names should be intuitive and searchable. Using pronounceable names may help with that;</li>
<li>Prefer verbs for naming functions and methods, and nouns for classes and objects;</li>
<li>Avoid using "magic numbers" (unexplained constants). Use named constants instead;</li>
</ul>
<pre><code class="lang-python"><span class="hljs-comment"># Bad Example</span>
vf = vi + <span class="hljs-number">9.81</span> * t

<span class="hljs-comment"># Good Example</span>
gravitational_acceleration = <span class="hljs-number">9.81</span>
final_velocity = initial_velocity + gravitational_acceleration * time_in_seconds
</code></pre>
<h3 id="heading-chapter-3-functions">Chapter 3 - Functions</h3>
<p>Many of the aspects taken into account when writing clean functions also apply when writing clean classes, which are covered in Chapter 10 of the book.</p>
<p>I will explore that in more detail in a future post about <strong>S.O.L.I.D.</strong>, but in short, some of the key factors to consider when writing functions are:</p>
<ul>
<li>Functions should be small. They should have one single responsibility and accomplish it in the simplest way possible;</li>
<li>A function should be reusable in multiple places throughout the code to make maintenance easier in the long run;</li>
<li>Don't Repeat Yourself (DRY):<ul>
<li>If you find yourself repeating a piece of code, evaluate if it could become a function;</li>
<li>Keeping a single source for each piece of logic avoids ambiguity, and the same principle applies to other aspects of the project's development, such as documentation, tests and databases;</li>
</ul>
</li>
<li>Use the fewest arguments necessary. Too many arguments, especially boolean ones, make it harder to respect the single responsibility rule;</li>
</ul>
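<p>As a small illustration of the points above, the sketch below replaces a boolean flag argument with two single-purpose functions that share one helper. The function names and the price-formatting scenario are hypothetical and are not taken from the book.</p>

```python
# Hypothetical example: names and scenario are illustrative only.

# Harder to read: the boolean flag means the function does two different things.
def format_price(amount, with_currency):
    if with_currency:
        return "$" + f"{amount:.2f}"
    return f"{amount:.2f}"


# Easier to read: each function has one responsibility, and the shared
# formatting logic lives in a single place (DRY).
def format_amount(amount):
    return f"{amount:.2f}"


def format_amount_with_currency(amount, currency_symbol="$"):
    return currency_symbol + format_amount(amount)


print(format_amount(3.5))                # 3.50
print(format_amount_with_currency(3.5))  # $3.50
```

<p>A call such as <code>format_price(3.5, True)</code> forces the reader to look up what <code>True</code> means, while <code>format_amount_with_currency(3.5)</code> states its intent in the name.</p>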
<h3 id="heading-chapter-4-comments">Chapter 4 - Comments</h3>
<p>Comments do not make up for bad code, and they should often be avoided, but if you find yourself in a situation where comments are really necessary, consider the following:</p>
<ul>
<li>Do not explain the code. Well-written code should be self-explanatory, and there should be only one source of truth, which is the code itself;</li>
<li>Comments lie:<ul>
<li>While code is refactored, comments are not always updated, and eventually they no longer represent the code they are attached to;</li>
<li>A comment that is false is worse than the absence of a comment;</li>
</ul>
</li>
<li>Instructions for other programmers and rationale behind decisions might be worthy of comments, but should be kept to a minimum, and should be reviewed and updated as the code evolves;</li>
<li>Comments that generate documentation are good. The further removed your documentation is from the source code, the more likely it is to become outdated. Embedding the documentation directly into the code is sometimes a good strategy;</li>
</ul>
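<p>A minimal sketch of the two kinds of comments the list above considers worthwhile: a docstring that documentation tools can pick up, and a short note on the rationale behind a decision. The function itself is a hypothetical example chosen only for illustration.</p>

```python
def moving_average(values, window_size):
    """Return the simple moving average of ``values``.

    Because this docstring lives next to the code, tools such as ``help()``
    or a documentation generator can read it directly from the source.
    """
    # Rationale (worth a comment): a running sum avoids re-adding the whole
    # window at every position.
    averages = []
    running_sum = 0.0
    for index, value in enumerate(values):
        running_sum += value
        if index >= window_size:
            running_sum -= values[index - window_size]
        if index >= window_size - 1:
            averages.append(running_sum / window_size)
    return averages


print(moving_average([1, 2, 3, 4], window_size=2))  # [1.5, 2.5, 3.5]
```

<p>Note that no comment restates what a line does; the names carry that information.</p>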
<h3 id="heading-chapter-5-formatting">Chapter 5 - Formatting</h3>
<ul>
<li>Formatting is about communication within a team, and good formatting improves code readability and maintainability;</li>
<li>It can be done with indentation, vertical and horizontal alignment, spaces and other IDE rules;</li>
<li>There are some tools that automate part of the formatting job in some languages. <a target="_blank" href="https://pypi.org/project/black/"><strong><em>Black</em></strong></a> is an interesting option if you are working with Python;</li>
</ul>
<h3 id="heading-chapter-7-error-handling">Chapter 7 - Error Handling</h3>
<p>Things can go wrong, but the professional programmer makes sure that the code always does what it is supposed to do. When handling errors, consider the following:</p>
<ul>
<li>Use exceptions over error codes:<ul>
<li>Exceptions are easier to debug;</li>
<li>When using exceptions, it is possible, but not necessary, to handle all possible cases;</li>
</ul>
</li>
<li>Treat exceptions and try-catch blocks appropriately:<ul>
<li>Your "catches" should always lead the program to a consistent state;</li>
<li>You should be able to determine the source of errors, so create informative messages to go along with any exception;</li>
</ul>
</li>
</ul>
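<p>The sketch below shows one way these points could look in Python, where the "try-catch" block is written as <code>try</code>/<code>except</code>. The exception class, function names and default value are hypothetical and chosen only for illustration.</p>

```python
# Hypothetical example: the exception and function names are illustrative.

class InvalidQuantityError(ValueError):
    """Raised when a quantity cannot be parsed from user input."""


def parse_quantity(text):
    # Raise an exception with an informative message instead of
    # returning a special error code such as -1.
    cleaned = text.strip()
    if not cleaned.isdecimal():
        raise InvalidQuantityError(
            f"Expected a non-negative whole number, got {text!r}"
        )
    return int(cleaned)


def read_quantity_or_default(text, default=0):
    # The handler leaves the program in a consistent state:
    # the caller always receives a valid integer.
    try:
        return parse_quantity(text)
    except InvalidQuantityError as error:
        print(f"Using default quantity {default}: {error}")
        return default


print(read_quantity_or_default(" 12 "))    # 12
print(read_quantity_or_default("twelve"))  # prints a message, then 0
```

<p>The message includes the offending input, which makes the source of the error easier to trace.</p>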
<h3 id="heading-chapter-8-boundaries">Chapter 8 - Boundaries</h3>
<p>It is very common to work on projects that depend on third-party software, packages and libraries. Sometimes they are bought, sometimes we rely on open source projects, and sometimes we use code developed by colleagues from our own organization. </p>
<p>Either way, we must set boundaries when integrating foreign code into ours. To keep your code clean, consider the following:</p>
<ul>
<li>Write tests for the third-party code:<ul>
<li>It is a great opportunity to learn how to use it;</li>
<li>It enables us to detect behavioral differences when there are new releases of the third-party packages;</li>
</ul>
</li>
<li>Write wrapper APIs:<ul>
<li>Your code remains unchanged when updating or migrating (only the wrappers change);</li>
<li>Testing updates and migrations becomes easier;</li>
</ul>
</li>
</ul>
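<p>As a rough sketch of both ideas, the snippet below wraps a library behind two small functions and adds a test that documents the behaviour we rely on. Python's standard <code>json</code> module stands in for a third-party package here, and the wrapper names are hypothetical.</p>

```python
import json


# Hypothetical wrapper: the rest of the code base calls these two functions
# and never imports the underlying library directly. If the library is
# updated or replaced, only this module has to change.
def serialize_record(record):
    return json.dumps(record, sort_keys=True)


def deserialize_record(text):
    return json.loads(text)


# A test for the foreign code: it records the behaviour we depend on and
# fails loudly if a new release changes it.
def test_round_trip_preserves_record():
    record = {"name": "model_a", "accuracy": 0.93}
    assert deserialize_record(serialize_record(record)) == record


test_round_trip_preserves_record()
print(serialize_record({"b": 1, "a": 2}))  # {"a": 2, "b": 1}
```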
<h3 id="heading-chapter-9-unit-tests">Chapter 9 - Unit Tests</h3>
<p>Code is only really clean once it is validated with tests, and test code should also follow the Clean Code principles already mentioned. When writing tests, also consider:</p>
<ul>
<li>One assert per test is ideal;</li>
<li>Tests should be F.I.R.S.T.:<ul>
<li><strong>Fast</strong>. Should run fast to enable frequent execution;</li>
<li><strong>Independent</strong>. Should not depend on each other. You should be able to run only a small group of tests, and there should not be a cascading effect when something goes wrong;</li>
<li><strong>Repeatable</strong>. Should be repeatedly executable in different environments (Q.A., production, development, etc.);</li>
<li><strong>Self-validating</strong>. Tests should return True or False so that errors are not subjective and don't require specific knowledge to interpret the result;</li>
<li><strong>Timely</strong>. Tests should be implemented along with the code (ideally before the code), so that the code never gets too complicated to be tested;</li>
</ul>
</li>
</ul>
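<p>A minimal sketch of tests written along these lines, using Python's built-in <code>unittest</code> module. The <code>final_velocity</code> function is a hypothetical example that reuses the formula from the naming section.</p>

```python
import unittest


def final_velocity(initial_velocity, acceleration, time_in_seconds):
    return initial_velocity + acceleration * time_in_seconds


class FinalVelocityTest(unittest.TestCase):
    # Each test checks one behaviour with a single assert, needs no external
    # resources (fast, repeatable) and does not rely on any other test
    # (independent). The outcome is a plain pass or fail (self-validating).

    def test_zero_time_returns_initial_velocity(self):
        self.assertEqual(final_velocity(5.0, 9.81, 0.0), 5.0)

    def test_velocity_grows_with_time(self):
        self.assertAlmostEqual(final_velocity(0.0, 9.81, 2.0), 19.62)


# Load and run the two tests explicitly so the snippet also works when
# pasted into a notebook or an interactive session.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FinalVelocityTest)
result = unittest.TextTestRunner().run(suite)
```

<p>In a real project the tests would usually live in their own file and be run by a test runner instead of being executed inline.</p>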
<h3 id="heading-boy-scout-rule">Boy Scout rule</h3>
<p>One last piece of advice: whenever you write code, try to follow the Boy Scout rule:</p>
<ul>
<li>Leave your code base cleaner than you found it:<ul>
<li>If it is safe to do so;</li>
<li>If your code is already covered by tests;</li>
</ul>
</li>
<li>Make small improvements, such as renaming variables and functions or breaking large functions into smaller ones, but avoid large rewrites so you don't waste time;</li>
<li>Writing good code is not good enough. The code has to be kept clean over time, and we must play an active role in it;</li>
</ul>
<h2 id="heading-what-are-the-results-of-clean-code">What are the results of Clean Code?</h2>
<p>If you consistently follow the principles of Clean Code, you will have:</p>
<ul>
<li>Readable, testable and maintainable code that is easy to change, validate and extend; </li>
<li>Productive developers who can easily implement new features and are happy working on the code;</li>
<li>Fewer surprises, with code that behaves as expected;</li>
<li>A more scalable project;</li>
</ul>
<p>Last but not least, according to Michael C. Feathers, also involved in the early Agile movement:</p>
<blockquote>
<p>“Clean code always looks like it was written by someone who cares.”</p>
</blockquote>
<hr />
<p><a target="_blank" href="https://elpythonista.com/review-of-clean-code">Photo by El Pythonista</a></p>
]]></content:encoded></item><item><title><![CDATA[How to run TensorFlow on NVIDIA GPU (Ubuntu 20.04 - May/2022)]]></title><description><![CDATA[Intro
The use of GPUs is incredibly helpful for many activities related to Machine Learning and Data Science, but correctly setting up your environment to leverage the processing power of these devices can often be a little confusing and time consumi...]]></description><link>https://blog.amlstation.com/how-to-run-tensorflow-on-nvidia-gpu-ubuntu-2004-may2022</link><guid isPermaLink="true">https://blog.amlstation.com/how-to-run-tensorflow-on-nvidia-gpu-ubuntu-2004-may2022</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[software development]]></category><category><![CDATA[TensorFlow]]></category><category><![CDATA[Artificial Intelligence]]></category><dc:creator><![CDATA[Patrick Metzner Morais]]></dc:creator><pubDate>Fri, 20 May 2022 18:15:35 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1653069722050/v28trS4DK.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-intro">Intro</h3>
<p>The use of GPUs is incredibly helpful for many activities related to Machine Learning and Data Science, but correctly setting up your environment to leverage the processing power of these devices can often be a little confusing and time-consuming, especially for people new to the field.</p>
<p>The goal of this post is to summarize all necessary steps to run TensorFlow on an NVIDIA GPU from a fresh Ubuntu 20.04 installation in May/2022. </p>
<p>This guide will cover the setup of:</p>
<ol>
<li>NVIDIA drivers</li>
<li>CUDA Toolkit</li>
<li>cuDNN</li>
<li>NVIDIA Container Toolkit <em>(optional)</em></li>
</ol>
<p>The computer (notebook) used to develop this guide was equipped with an <code>Intel® Core™ i7</code> CPU and an <code>NVIDIA GeForce MX250</code> GPU. The following steps may vary slightly depending on your equipment. </p>
<hr />
<h3 id="heading-1-nvidia-drivers-installation">1 - NVIDIA drivers installation</h3>
<p>To install the latest NVIDIA drivers you will need to:</p>
<ol>
<li>Uninstall old drivers</li>
<li>Retrieve new lists of packages</li>
<li>Remove unused packages</li>
<li>Search for latest driver version</li>
<li>Install latest drivers (510 in the example below)</li>
<li>Reboot</li>
</ol>
<p>These steps can be done with the following commands:</p>
<pre><code>sudo apt<span class="hljs-operator">-</span>get purge <span class="hljs-string">'nvidia-*'</span>

sudo apt<span class="hljs-operator">-</span>get update

sudo apt<span class="hljs-operator">-</span>get autoremove

apt search nvidia<span class="hljs-operator">-</span>driver

sudo apt install libnvidia<span class="hljs-operator">-</span>common<span class="hljs-number">-510</span>

sudo apt install libnvidia<span class="hljs-operator">-</span>gl<span class="hljs-number">-510</span>

sudo apt install nvidia<span class="hljs-operator">-</span>driver<span class="hljs-number">-510</span>

sudo reboot
</code></pre><hr />
<h3 id="heading-2-cuda-toolkit-installation">2 - CUDA Toolkit installation</h3>
<ol>
<li><p>Check the <a target="_blank" href="https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#pre-installation-actions">pre-installation steps from the official documentation</a> to make sure you have all necessary prerequisites.</p>
</li>
<li><p>Install Linux headers</p>
<pre><code> sudo apt<span class="hljs-operator">-</span>get install linux<span class="hljs-operator">-</span>headers<span class="hljs-operator">-</span>$(uname <span class="hljs-operator">-</span>r)
</code></pre></li>
<li><p>Install CUDA Toolkit following the <a target="_blank" href="https://developer.nvidia.com/cuda-downloads?target_os=Linux&amp;target_arch=x86_64&amp;Distribution=Ubuntu&amp;target_version=20.04&amp;target_type=deb_local">official documentation</a> or running the commands below:</p>
<pre><code> wget https:<span class="hljs-comment">//developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin</span>

 sudo mv cuda<span class="hljs-operator">-</span>ubuntu2004.pin <span class="hljs-operator">/</span>etc<span class="hljs-operator">/</span>apt<span class="hljs-operator">/</span>preferences.d/cuda<span class="hljs-operator">-</span>repository<span class="hljs-operator">-</span>pin<span class="hljs-number">-600</span>

 wget https:<span class="hljs-comment">//developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb</span>

 sudo dpkg <span class="hljs-operator">-</span>i cuda<span class="hljs-operator">-</span>repo<span class="hljs-operator">-</span>ubuntu2004<span class="hljs-number">-11</span><span class="hljs-number">-7</span><span class="hljs-operator">-</span>local_11<span class="hljs-number">.7</span><span class="hljs-number">.0</span><span class="hljs-number">-515.43</span><span class="hljs-number">.04</span><span class="hljs-operator">-</span>1_amd64.deb

 sudo cp <span class="hljs-operator">/</span><span class="hljs-keyword">var</span><span class="hljs-operator">/</span>cuda<span class="hljs-operator">-</span>repo<span class="hljs-operator">-</span>ubuntu2004<span class="hljs-number">-11</span><span class="hljs-number">-7</span><span class="hljs-operator">-</span>local<span class="hljs-operator">/</span>cuda<span class="hljs-operator">-</span><span class="hljs-operator">*</span><span class="hljs-operator">-</span>keyring.gpg <span class="hljs-operator">/</span>usr<span class="hljs-operator">/</span>share<span class="hljs-operator">/</span>keyrings<span class="hljs-operator">/</span>

 sudo apt<span class="hljs-operator">-</span>get update

 sudo apt<span class="hljs-operator">-</span>get <span class="hljs-operator">-</span>y install cuda
</code></pre></li>
<li><p>Reboot to fix mismatched versions of drivers and libraries if you get the following error when running <code>nvidia-smi</code></p>
<pre><code> Failed <span class="hljs-keyword">to</span> initialize NVML: Driver/library <span class="hljs-keyword">version</span> mismatch
</code></pre></li>
</ol>
<hr />
<h3 id="heading-4-cudnn-installation">4 - cuDNN installation</h3>
<ol>
<li><p>Download cuDNN</p>
<ul>
<li>Register for the <a target="_blank" href="https://developer.nvidia.com/accelerated-computing-developer">NVIDIA Developer Program</a>.</li>
<li>Go to: <a target="_blank" href="https://developer.nvidia.com/cudnn">NVIDIA cuDNN home page</a>.</li>
<li>Click <code>Download cuDNN</code>.</li>
<li>Complete the short survey and click Submit.</li>
</ul>
</li>
<li><p>Install cuDNN following the <a target="_blank" href="https://docs.nvidia.com/deeplearning/cudnn/install-guide/index.html#installlinux-deb">official documentation</a> or running the commands below for version <code>8.4.0.27</code>:</p>
<pre><code> sudo dpkg <span class="hljs-operator">-</span>i cudnn<span class="hljs-operator">-</span>local<span class="hljs-operator">-</span>repo<span class="hljs-operator">-</span>ubuntu2004<span class="hljs-number">-8.4</span><span class="hljs-number">.0</span><span class="hljs-number">.27_1</span><span class="hljs-number">.0</span><span class="hljs-operator">-</span>1_amd64.deb 

 sudo apt<span class="hljs-operator">-</span>key add <span class="hljs-operator">/</span><span class="hljs-keyword">var</span><span class="hljs-operator">/</span>cudnn<span class="hljs-operator">-</span>local<span class="hljs-operator">-</span>repo<span class="hljs-operator">-</span>ubuntu2004<span class="hljs-number">-8.4</span><span class="hljs-number">.0</span><span class="hljs-number">.27</span><span class="hljs-operator">/</span>7fa2af80.pub

 sudo apt<span class="hljs-operator">-</span>get update

 sudo apt<span class="hljs-operator">-</span>get install libcudnn8

 sudo apt<span class="hljs-operator">-</span>get install libcudnn8<span class="hljs-operator">-</span>dev
</code></pre></li>
</ol>
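<p>Once the drivers, CUDA Toolkit and cuDNN are in place, a quick check from Python can confirm that TensorFlow sees the GPU. The sketch below assumes TensorFlow 2.x has already been installed in your Python environment (for example with <code>pip</code>), which this guide does not cover. The helper name <code>list_visible_gpus</code> is hypothetical.</p>

```python
def list_visible_gpus():
    """Return the names of the GPUs TensorFlow can see (empty list if none)."""
    try:
        import tensorflow as tf
    except ImportError:
        # Assumption: TensorFlow is installed separately from this guide.
        print("TensorFlow is not installed in this environment.")
        return []
    return [gpu.name for gpu in tf.config.list_physical_devices("GPU")]


gpus = list_visible_gpus()
print(f"GPUs visible to TensorFlow: {gpus}")
```

<p>An empty list usually means that TensorFlow could not load the CUDA or cuDNN libraries, so it is worth rechecking the installed versions against the ones your TensorFlow release expects.</p>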
<hr />
<h3 id="heading-5-nvidia-container-toolkit-installation">5 - NVIDIA Container Toolkit installation</h3>
<p>The steps above (NVIDIA drivers, CUDA Toolkit and cuDNN) are enough to run TensorFlow locally on NVIDIA GPUs, but a few extra steps are necessary if you want to use the GPU in a Docker container with <a target="_blank" href="https://github.com/NVIDIA/nvidia-docker">NVIDIA Container Toolkit</a>.</p>
<ol>
<li><p>Uninstall previous versions of Docker Engine</p>
<pre><code> sudo apt<span class="hljs-operator">-</span>get purge docker<span class="hljs-operator">-</span>ce docker<span class="hljs-operator">-</span>ce<span class="hljs-operator">-</span>cli containerd.io docker<span class="hljs-operator">-</span>compose<span class="hljs-operator">-</span>plugin

 sudo rm <span class="hljs-operator">-</span>rf <span class="hljs-operator">/</span><span class="hljs-keyword">var</span><span class="hljs-operator">/</span>lib<span class="hljs-operator">/</span>docker

 sudo rm <span class="hljs-operator">-</span>rf <span class="hljs-operator">/</span><span class="hljs-keyword">var</span><span class="hljs-operator">/</span>lib<span class="hljs-operator">/</span>containerd
</code></pre></li>
<li><p>Install NVIDIA Container Toolkit following the <a target="_blank" href="https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker">official documentation</a> or running the commands below:</p>
<pre><code> curl https:<span class="hljs-comment">//get.docker.com | sh \</span>
   <span class="hljs-operator">&amp;</span><span class="hljs-operator">&amp;</span> sudo systemctl <span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-built_in">now</span> enable docker

 distribution<span class="hljs-operator">=</span>$(. /etc<span class="hljs-operator">/</span>os<span class="hljs-operator">-</span>release;echo $ID$VERSION_ID) \
   <span class="hljs-operator">&amp;</span><span class="hljs-operator">&amp;</span> curl <span class="hljs-operator">-</span>fsSL https:<span class="hljs-comment">//nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \</span>
   <span class="hljs-operator">&amp;</span><span class="hljs-operator">&amp;</span> curl <span class="hljs-operator">-</span>s <span class="hljs-operator">-</span>L https:<span class="hljs-comment">//nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \</span>
         sed <span class="hljs-string">'s#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g'</span> <span class="hljs-operator">|</span> \
         sudo tee <span class="hljs-operator">/</span>etc<span class="hljs-operator">/</span>apt<span class="hljs-operator">/</span>sources.list.d/nvidia<span class="hljs-operator">-</span>container<span class="hljs-operator">-</span>toolkit.list

 sudo apt<span class="hljs-operator">-</span>get update

 sudo apt<span class="hljs-operator">-</span>get install <span class="hljs-operator">-</span>y nvidia<span class="hljs-operator">-</span>docker2

 sudo systemctl restart docker
</code></pre></li>
<li><p>Test the installation with:</p>
<pre><code> sudo docker run <span class="hljs-operator">-</span><span class="hljs-operator">-</span>rm <span class="hljs-operator">-</span><span class="hljs-operator">-</span>gpus all tensorflow<span class="hljs-operator">/</span>tensorflow:latest<span class="hljs-operator">-</span>gpu nvidia<span class="hljs-operator">-</span>smi
</code></pre><p> The result should look similar to this:
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1653068594151/3ohhRC4CW.png" alt="image.png" /></p>
</li>
</ol>
<hr />
<h3 id="heading-6-bonus-enabling-gpu-access-with-docker-compose">6 - Bonus: Enabling GPU access with Docker Compose</h3>
<p>According to the <a target="_blank" href="https://docs.docker.com/compose/gpu-support/#enabling-gpu-access-to-service-containers">official Docker documentation</a>, in order to enable GPU access with Docker Compose, the following <code>deploy</code> information should be added to your <code>docker-compose.yml</code> file.</p>
<pre><code><span class="hljs-attr">services:</span>
  <span class="hljs-attr">test:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-comment"># your image</span>
    <span class="hljs-attr">deploy:</span>
      <span class="hljs-attr">resources:</span>
        <span class="hljs-attr">reservations:</span>
          <span class="hljs-attr">devices:</span>
            <span class="hljs-bullet">-</span> <span class="hljs-attr">driver:</span> <span class="hljs-string">nvidia</span>
              <span class="hljs-attr">count:</span> <span class="hljs-number">1</span>
              <span class="hljs-attr">capabilities:</span> [<span class="hljs-string">"gpu"</span>]
</code></pre><hr />
<p><a target="_blank" href="https://www.pexels.com/photo/gray-laptop-computer-343239/">Photo by Jordan Harrison</a></p>
]]></content:encoded></item><item><title><![CDATA[Git Workflow For Data Scientists (git rebase)]]></title><description><![CDATA[Intro
It is not uncommon to find new data scientists entering the workforce with great programming and analytical skills, but with limited experience related to coding on team projects. This can happen for a variety of reasons, and sometimes limits t...]]></description><link>https://blog.amlstation.com/git-workflow-for-data-scientists-git-rebase</link><guid isPermaLink="true">https://blog.amlstation.com/git-workflow-for-data-scientists-git-rebase</guid><category><![CDATA[Git]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[software development]]></category><category><![CDATA[workflow]]></category><category><![CDATA[Productivity]]></category><dc:creator><![CDATA[Patrick Metzner Morais]]></dc:creator><pubDate>Thu, 05 May 2022 00:03:34 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1651537530578/dHbtxeAv6.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3 id="heading-intro">Intro</h3>
<p>It is not uncommon to find new data scientists entering the workforce with great programming and analytical skills, but with limited experience related to coding on team projects. This can happen for a variety of reasons, and sometimes limits the productivity of these individuals until they get used to the team's workflow. </p>
<p>The goal of this post is to shed some light on a simple, but effective git workflow based on <code>git rebase</code> to hopefully help new data scientists hit the ground running when joining team projects.</p>
<p>Git is one of the most popular version control systems used by developers to track and store changes to almost any kind of file in a software project. It has great features for code synchronization and backup that are especially helpful for developers working on bigger projects.</p>
<p>The following workflow might not be suited for every situation, but it has been great for our team at <a target="_blank" href="https://amlstation.com/">Auto ML Station</a>.</p>
<h3 id="heading-git-workflow-based-on-git-rebase">Git workflow based on git rebase</h3>
<p>The following explanation assumes that the team in question uses the <code>develop</code> branch as their main branch for features in development and that you are about to start a new task on <code>my_new_branch</code>.</p>
<p>Since there are more people working on the project, two things have to be considered:</p>
<ol>
<li>It is not recommended to work directly on the <code>develop</code> branch;</li>
<li>You have to pull all the latest changes from the remote <code>develop</code> branch to your local repository to make sure your work is up to date with the rest of the team. </li>
</ol>
<p>This can be done with:</p>
<ul>
<li><code>git checkout develop</code></li>
<li><code>git pull</code></li>
</ul>
<p>After synchronizing your local and remote repos, you can see below that <code>E</code> is the team's latest commit and you want to start your work based on it.</p>
<pre><code>    A<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>B<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>C<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>D<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>E <span class="hljs-string">'develop'</span>
</code></pre><p>You can use the following commands to create <code>my_new_branch</code> and switch into it to start working:</p>
<ul>
<li><code>git branch my_new_branch</code> (this step can be done on a web browser via Jira, Bitbucket or some other similar tool that might be synchronized to your remote repository)</li>
<li><code>git checkout my_new_branch</code></li>
</ul>
<p>After completing your task, let's assume you used the commands below to add 3 new commits (<code>H</code>, <code>I</code> and <code>J</code>) to your new branch.</p>
<ul>
<li><code>git status</code> (to check for modified files)</li>
<li><code>git add &lt;file_names&gt;</code> (to stage modified files)</li>
<li><code>git commit -m 'my new commit message'</code> (to commit staged modifications)</li>
</ul>
<p>Your git tree should look like the following:</p>
<pre><code>          H<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>I<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>J <span class="hljs-string">'my_new_branch'</span>
         <span class="hljs-operator">/</span>
    D<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>E <span class="hljs-string">'develop'</span>
</code></pre><p>Suppose that while you were working on <code>my_new_branch</code>, one of your teammates was also working on commits <code>F</code> and <code>G</code>, and his/her git tree looks like this:</p>
<pre><code>          F<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>G <span class="hljs-string">'teammate_branch'</span>
         <span class="hljs-operator">/</span>
    D<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>E <span class="hljs-string">'develop'</span>
</code></pre><p>At this point, either of you can merge into <code>develop</code> without conflicts, but as soon as that's done, the other one will have to complete some extra steps to update the local work branch before being able to merge into <code>develop</code> without issues.</p>
<p>Let's assume your colleague pushed his/her branch before you and your team merged his/her work into the remote <code>develop</code>. </p>
<p>At this point, you have to pull his/her new changes to your local <code>develop</code> branch with:</p>
<ul>
<li><code>git checkout develop</code></li>
<li><code>git pull</code></li>
</ul>
<p>After this synchronization your git tree will be as follows:</p>
<pre><code>          H<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>I<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>J <span class="hljs-string">'my_new_branch'</span>
         <span class="hljs-operator">/</span>
    D<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>E<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>F<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>G <span class="hljs-string">'develop'</span>
</code></pre><p>Now you can switch to <code>my_new_branch</code> and update it with the latest <code>develop</code> commits with the commands below.</p>
<ul>
<li><code>git checkout my_new_branch</code></li>
<li><code>git rebase develop</code></li>
</ul>
<p>The <code>rebase</code> command will update your branch with all commits from <code>develop</code> and then replay your commits on top of them with new hashes (<code>H*</code>, <code>I*</code> and <code>J*</code>) as seen below.</p>
<pre><code>                  H<span class="hljs-operator">*</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>I<span class="hljs-operator">*</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>J<span class="hljs-operator">*</span> <span class="hljs-string">'my_new_branch'</span>
                 <span class="hljs-operator">/</span>
    D<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>E<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>F<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>G <span class="hljs-string">'develop'</span>
</code></pre><p>With that, all that's left to do is push your work into the remote repo:</p>
<ul>
<li><code>git push --set-upstream origin my_new_branch</code> or just <code>git push</code> in case the target branch already exists remotely.</li>
</ul>
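<p>To make the effect of <code>git rebase develop</code> more concrete, here is a toy model in Python that treats each branch as a list of commit labels. It is only an illustration of the diagrams in this post, not how Git works internally, and the <code>rebase</code> function and its <code>fork_point</code> parameter are hypothetical.</p>

```python
def rebase(branch, onto, fork_point):
    """Toy model: replay the commits made after fork_point on top of onto."""
    own_commits = branch[branch.index(fork_point) + 1:]
    # Replayed commits carry the same changes but get new hashes,
    # marked here with an asterisk.
    replayed = [commit + "*" for commit in own_commits]
    return onto + replayed


develop = ["D", "E", "F", "G"]
my_new_branch = ["D", "E", "H", "I", "J"]

print(rebase(my_new_branch, develop, fork_point="E"))
# ['D', 'E', 'F', 'G', 'H*', 'I*', 'J*']
```

<p>The new hashes are the reason a branch that was already pushed before the rebase cannot simply be pushed again without extra care.</p>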
<p>Now your team can review and merge your branch into <code>develop</code> (probably by creating a <code>pull request</code>, which I expect to explore in a future post).</p>
<p>The final result will be a <code>develop</code> branch that looks like this:</p>
<pre><code>    D<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>E<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>F<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>G<span class="hljs-operator">-</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>H<span class="hljs-operator">*</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>I<span class="hljs-operator">*</span><span class="hljs-operator">-</span><span class="hljs-operator">-</span>J<span class="hljs-operator">*</span> <span class="hljs-string">'develop'</span>
</code></pre><h3 id="heading-summary">Summary</h3>
<ol>
<li><code>git checkout develop</code></li>
<li><code>git pull</code></li>
<li><code>git branch my_new_branch</code> (this step can be done on a web browser via Jira, Bitbucket or some other similar tool that might be synchronized to your remote repository)</li>
<li><code>git checkout my_new_branch</code></li>
<li>Work on your code...</li>
<li><code>git status</code></li>
<li><code>git add &lt;file_names&gt;</code></li>
<li><code>git commit -m 'my new commit message'</code></li>
<li><code>git checkout develop</code></li>
<li><code>git pull</code></li>
<li><code>git checkout my_new_branch</code></li>
<li><code>git rebase develop</code></li>
<li><code>git push --set-upstream origin my_new_branch</code> or just <code>git push</code> in case the target branch already exists remotely</li>
</ol>
<p>For more information on <code>git rebase</code>, check the <a target="_blank" href="https://git-scm.com/docs/git-rebase">official git rebase documentation</a> for a more detailed explanation of the command and its options.</p>
<hr />
<p><a target="_blank" href="https://www.pexels.com/photo/woman-programming-on-a-notebook-1181359/">Photo by Christina Morillo</a></p>
]]></content:encoded></item></channel></rss>