AutoML Station's team blog

Clean Code - What does it mean, and why is it important?

Patrick Metzner Morais — Fri, 01 Jul 2022 14:08:44 GMT

Intro

Most of us have already seen something like "Must be able to write Clean Code" in the list of requirements for some data science and/or software development opportunity, but what does it mean, and why is it important?

Robert C. Martin (Uncle Bob), one of the professionals behind the Agile Manifesto (2001), extensively explored the principles of Clean Code in his book Clean Code: A Handbook of Agile Software Craftsmanship, originally published in 2008.

In short, Clean Code is about being professional: writing maintainable and testable code so that programmers can be more productive. It comprises a set of techniques that allow developers to go fast in the long run.

This post summarizes some of the main topics covered in Uncle Bob's book, but reading the whole book is highly encouraged.

Why does Clean Code matter?

Unless a project is in its initial stage, most software development work consists of maintaining and extending existing code, meaning that far more time is spent reading code than writing new lines.

Not following the principles of Clean Code can drastically increase the time needed to read and fully understand a piece of code, leading to bottlenecks and large losses. In extreme circumstances the development speed can slow down almost to a halt.

Clean Code is a great way to go fast and stay productive in the long run, and in the following section we explore how to write it.

How to write Clean Code?

To make this post lighter to read and easier to consult later, this section is divided into a few lists roughly summarizing what I believe are the most important and easiest-to-implement principles of Clean Code, leaving denser and more advanced topics, like S.O.L.I.D., for future posts.

Chapter 2 - Meaningful Names

Any software is filled with names, and choosing them wisely can take time, but significantly improves code readability. When choosing names for variables and functions, as well as files and directories, keep the following in mind:

Names should be relevant and explicit. No relevant information should be just implicit;
Names should encode intent, and should reveal information regarding what something is and what it does without the need of comments;
Names may be long if necessary;
Names should be intuitive and searchable. Using pronounceable names may help with that;
Prefer verbs for naming functions and methods, and nouns for classes and objects;
Avoid using "magic numbers" (unexplained constants). Use named constants instead;

# Bad Example
vf = vi + 9.81 * t

# Good Example
GRAVITATIONAL_ACCELERATION = 9.81  # m/s^2
final_velocity = initial_velocity + GRAVITATIONAL_ACCELERATION * time_in_seconds

Chapter 3 - Functions

Many of the considerations for writing clean functions also apply to writing clean classes, covered in Chapter 10 of the book.

I will explore that in more detail in a future post about S.O.L.I.D., but in short, some of the key factors to consider when writing functions are:

Functions should be small. They should have one single responsibility and accomplish it in the simplest way possible;
A function should be reusable in multiple places throughout the code to make maintenance easier in the long run;
Don't Repeat Yourself (DRY):
- If you find yourself repeating a piece of code, evaluate if it could become a function;
- This keeps a single, unambiguous source for each piece of logic, and the same principle applies to other aspects of the project, such as documentation, tests and databases;
Use the fewest arguments necessary. Too many arguments, especially boolean ones, make it harder to respect the single responsibility rule;
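
As a minimal sketch of the points above (the function names and the tax scenario are hypothetical, chosen only for illustration), a boolean flag argument can often be replaced by two small functions that share one helper:

```python
# Hypothetical example: names and the tax scenario are illustrative only.

TAX_RATE = 0.2  # assumed rate, named to avoid a magic number


# Harder to read: the boolean flag means the function does two things.
def format_price_with_flag(amount, include_tax):
    if include_tax:
        return f"${amount * (1 + TAX_RATE):.2f}"
    return f"${amount:.2f}"


# Cleaner: one shared helper (DRY) and two single-purpose functions.
def format_currency(amount):
    return f"${amount:.2f}"


def format_net_price(amount):
    return format_currency(amount)


def format_gross_price(amount):
    return format_currency(amount * (1 + TAX_RATE))
```

A call such as format_gross_price(10) states its intent at the call site, whereas format_price_with_flag(10, True) forces the reader to look up what True means.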

Chapter 4 - Comments

Comments do not make up for bad code, and they should often be avoided, but if you find yourself in a situation where comments are really necessary, consider the following:

Do not explain the code. Well-written code should be self-explanatory, and there should be only one source of truth, which is the code itself;
Comments lie:
- When code is refactored, comments are not always updated, and they eventually stop describing the code they are attached to;
- A comment that is false is worse than the absence of a comment;
Instructions for other programmers and rationale behind decisions might be worthy of comments, but should be kept to a minimum, and should be reviewed and updated as the code evolves;
Comments that generate documentation are good. The further removed your documentation is from the source code, the more likely it is to become outdated. Embedding the documentation directly into the code is sometimes a good strategy;
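
In Python, one way to embed documentation in the code is a docstring. The sketch below is only an illustration (the function is hypothetical); it shows that the text stays next to the code and can be read by tools at runtime:

```python
def celsius_to_fahrenheit(temperature_celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit.

    The text lives next to the code, so tools such as help() or a
    documentation generator can read it directly from the source.
    """
    return temperature_celsius * 9 / 5 + 32


# The docstring is available at runtime, e.g. via help(celsius_to_fahrenheit).
documentation = celsius_to_fahrenheit.__doc__
```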

Chapter 5 - Formatting

Formatting is about communication within a team, and good formatting improves code readability and maintainability;
It can be achieved with indentation, vertical and horizontal alignment, spacing and other rules enforced by your IDE;
There are some tools that automate part of the formatting job in some languages. Black is an interesting option if you are working with Python;

Chapter 7 - Error Handling

Things can go wrong, but the professional programmer makes sure that the code always does what it is supposed to do. When handling errors, consider the following:

Use exceptions over error codes:
- Exceptions are easier to debug;
- When using exceptions, it is possible, but not necessary, to handle all possible cases;
Treat exceptions and try-catch blocks appropriately:
- Your "catches" should always lead the program to a consistent state;
- You should be able to determine the source of errors, so create informative messages to go along with any exception;
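
A minimal Python sketch of these ideas follows. The ConfigError class, the function names and the default port are assumptions made for the example; Python spells try-catch as try/except:

```python
class ConfigError(Exception):
    """Raised when a configuration value is missing or invalid."""


def parse_port(raw_value):
    """Return the port as an int or raise ConfigError with a clear message."""
    try:
        port = int(raw_value)
    except ValueError as error:
        raise ConfigError(f"Port must be an integer, got {raw_value!r}") from error
    if not 0 < port < 65536:
        raise ConfigError(f"Port must be between 1 and 65535, got {port}")
    return port


def load_port(raw_value, default_port=8080):
    """Fall back to a known default so the program stays in a consistent state."""
    try:
        return parse_port(raw_value)
    except ConfigError as error:
        print(f"Invalid configuration, using default port {default_port}: {error}")
        return default_port
```

The caller of parse_port cannot silently ignore a bad value, as it could with a returned error code, and the message says both what was expected and what was received.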

Chapter 8 - Boundaries

It is very common to work on projects that depend on third-party software, packages and libraries. Sometimes they are bought, sometimes we rely on open source projects, and sometimes we use code developed by colleagues from our own organization.

Either way we must set the boundaries to integrate foreign code into ours. To keep your code clean, consider the following:

Write tests for the third-party code:
- It is a great opportunity to learn how to use it;
- It enables us to detect behavioral differences when there are new releases of the third-party packages;
Write wrapper APIs:
- Your code remains unchanged when updating or migrating (only the wrappers change);
- Testing updates and migrations becomes easier;
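
The sketch below illustrates both points under a simplifying assumption: the standard json module stands in for a third-party package, and the Serializer class and test name are hypothetical. The rest of the project would only ever call the wrapper:

```python
import json  # stands in for a third-party dependency in this sketch


class Serializer:
    """The only place in the project that knows which library is used."""

    def dumps(self, data):
        return json.dumps(data, sort_keys=True)

    def loads(self, text):
        return json.loads(text)


def test_round_trip_preserves_data():
    """Boundary test: documents the behavior we rely on."""
    serializer = Serializer()
    original = {"model": "baseline", "epochs": 3}
    assert serializer.loads(serializer.dumps(original)) == original
```

If the underlying library were replaced, only the two method bodies would change, and the boundary test would show whether the new library behaves the same way.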

Chapter 9 - Unit Tests

Code is only really clean once it is validated with tests, and test code should also follow the Clean Code principles already mentioned. When writing tests, also consider:

One assert per test is ideal;
Tests should be F.I.R.S.T.:
- Fast. Should run fast to enable frequent execution;
- Independent. Should not depend on each other. You should be able to run only a small group of tests, and there should not be a cascading effect when something goes wrong;
- Repeatable. Should be repeatedly executable in different environments (Q.A., production, development, etc.);
- Self-validating. Tests should return True or False so that errors are not subjective and don't require specific knowledge to interpret the result;
- Timely. Tests should be implemented along with the code (ideally before the code), so that the code never gets too complicated to be tested;
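
As an illustration only (the normalize function is a hypothetical example, and unittest is just one of several test frameworks), tests following these guidelines could look like this:

```python
import unittest


def normalize(values):
    """Scale values linearly to the range [0, 1]."""
    lowest, highest = min(values), max(values)
    if lowest == highest:
        raise ValueError("Cannot normalize a constant sequence")
    return [(value - lowest) / (highest - lowest) for value in values]


class NormalizeTest(unittest.TestCase):
    # Each test builds its own input (Independent), uses no network or
    # files (Fast, Repeatable) and makes a single check that either
    # passes or fails (Self-validating).

    def test_smallest_value_maps_to_zero(self):
        self.assertEqual(normalize([2, 4, 6])[0], 0.0)

    def test_largest_value_maps_to_one(self):
        self.assertEqual(normalize([2, 4, 6])[-1], 1.0)

    def test_constant_input_is_rejected(self):
        with self.assertRaises(ValueError):
            normalize([3, 3, 3])
```

Saved in a file, these tests can be run with python -m unittest.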

Boy Scout rule

One last piece of advice. Whenever writing code, always try to follow the Boy Scout rule:

Leave your code base cleaner than you found it:
- If it is safe to do so;
- If your code is already covered by tests;
Rename variables and functions, and perhaps break large functions into smaller ones, but avoid large rewrites so you don't waste time;
Writing good code is not good enough. The code has to be kept clean over time, and we must play an active role in it;

What are the results of Clean Code?

If you consistently follow the principles of Clean Code, you will have:

Readable, testable and maintainable code that is easy to change, validate and extend;
Productive developers who can easily implement new features and are happy working on the code;
Fewer surprises, with code that behaves as expected;
A more scalable project;

Last but not least, according to Michael C. Feathers, also involved in the early Agile movement:

“Clean code always looks like it was written by someone who cares.”

Photo by El Pythonista

How to run TensorFlow on NVIDIA GPU (Ubuntu 20.04 - May/2022)

Patrick Metzner Morais — Fri, 20 May 2022 18:15:35 GMT

Intro

The use of GPUs is incredibly helpful for many activities related to Machine Learning and Data Science, but correctly setting up your environment to leverage the processing power of these devices can often be a little confusing and time-consuming, especially for people new to the field.

The goal of this post is to summarize all necessary steps to run TensorFlow on an NVIDIA GPU from a fresh Ubuntu 20.04 installation in May/2022.

This guide will cover the setup of:

NVIDIA drivers
CUDA Toolkit
cuDNN
NVIDIA Container Toolkit (optional)

The computer (notebook) used to develop this guide was equipped with an Intel® Core™ i7 CPU and an NVIDIA GeForce MX250 GPU. The following steps may vary slightly depending on your equipment.

1 - NVIDIA drivers installation

To install the latest NVIDIA drivers you will need to:

Uninstall old drivers
Retrieve new lists of packages
Remove unused packages
Search for latest driver version
Install latest drivers (510 in the example below)
Reboot

These steps can be done with the following commands:

sudo apt-get purge 'nvidia-*'

sudo apt-get update

sudo apt-get autoremove

apt search nvidia-driver

sudo apt install libnvidia-common-510

sudo apt install libnvidia-gl-510

sudo apt install nvidia-driver-510

sudo reboot
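
After the reboot, the installed driver can be checked by running nvidia-smi. The Python sketch below is one possible way to do the same check from a script; the function name is hypothetical, and it assumes the driver package put nvidia-smi on the PATH:

```python
import shutil
import subprocess


def installed_driver_version():
    """Return the NVIDIA driver version string, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None  # driver tools are not installed or not on the PATH
    completed = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        return None  # e.g. a driver/library mismatch before a reboot
    lines = completed.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


print(installed_driver_version())
```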

2 - CUDA Toolkit installation

Check the pre-installation steps from the official documentation to make sure you have all necessary prerequisites.

Install Linux headers

 sudo apt-get install linux-headers-$(uname -r)

Install CUDA Toolkit following the official documentation or running the commands below:

 wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin

 sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600

 wget https://developer.download.nvidia.com/compute/cuda/11.7.0/local_installers/cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb

 sudo dpkg -i cuda-repo-ubuntu2004-11-7-local_11.7.0-515.43.04-1_amd64.deb

 sudo cp /var/cuda-repo-ubuntu2004-11-7-local/cuda-*-keyring.gpg /usr/share/keyrings/

 sudo apt-get update

 sudo apt-get -y install cuda

Reboot to fix mismatched versions of drivers and libraries if you get the following error when running nvidia-smi:
```
 Failed to initialize NVML: Driver/library version mismatch
```

4 - cuDNN installation

Download cuDNN
- Register for the NVIDIA Developer Program.
- Go to: NVIDIA cuDNN home page.
- Click Download cuDNN.
- Complete the short survey and click Submit.

Install cuDNN following the official documentation or running the commands below for version 8.4.0.27:

 sudo dpkg -i cudnn-local-repo-ubuntu2004-8.4.0.27_1.0-1_amd64.deb 

 sudo apt-key add /var/cudnn-local-repo-ubuntu2004-8.4.0.27/7fa2af80.pub

 sudo apt-get update

 sudo apt-get install libcudnn8

 sudo apt-get install libcudnn8-dev
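
Once the steps so far are done, a quick way to confirm that TensorFlow sees the GPU is to list the physical devices from Python. This sketch assumes TensorFlow is already installed in your Python environment, which this guide does not cover; the function name is hypothetical:

```python
def visible_gpus():
    """Return the GPUs TensorFlow can see, or an empty list without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return []  # TensorFlow is not installed in this environment
    return tf.config.list_physical_devices("GPU")


gpus = visible_gpus()
print(f"GPUs visible to TensorFlow: {len(gpus)}")
```

A count of zero on a machine with TensorFlow installed suggests that one of the previous steps (drivers, CUDA Toolkit or cuDNN) needs to be revisited.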

5 - NVIDIA Container Toolkit installation

Steps 1-4 are enough to run TensorFlow locally on NVIDIA GPUs, but a few extra steps are necessary if you want to use the GPU in a Docker container with NVIDIA Container Toolkit.

Uninstall previous versions of Docker Engine

 sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-compose-plugin

 sudo rm -rf /var/lib/docker

 sudo rm -rf /var/lib/containerd

Install NVIDIA Container Toolkit following the official documentation or running the commands below:

 curl https://get.docker.com | sh \
   && sudo systemctl --now enable docker

 distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
   && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
   && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
         sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
         sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

 sudo apt-get update

 sudo apt-get install -y nvidia-docker2

 sudo systemctl restart docker

Test the installation with:

 sudo docker run --rm --gpus all tensorflow/tensorflow:latest-gpu nvidia-smi

The result should be the usual nvidia-smi table, listing your GPU, the driver version and the CUDA version.

6 - Bonus: Enabling GPU access with Docker Compose

According to the official Docker documentation, to enable GPU access with Docker Compose, the following deploy information should be added to your docker-compose.yml file:

services:
  test:
    image: # your image
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: ["gpu"]

Photo by Jordan Harrison

Git Workflow For Data Scientists (git rebase)

Patrick Metzner Morais — Thu, 05 May 2022 00:03:34 GMT

Intro

It is not uncommon to find new data scientists entering the workforce with great programming and analytical skills, but with limited experience related to coding on team projects. This can happen for a variety of reasons, and sometimes limits the productivity of these individuals until they get used to the team's workflow.

The goal of this post is to shed some light on a simple, but effective git workflow based on git rebase to hopefully help new data scientists hit the ground running when joining team projects.

Git is one of the most popular version control systems used by developers to track and store changes to almost any kind of file in a software project. It has great functionality for code synchronization and backup that is especially helpful for developers working on bigger projects.

The following workflow might not be suited for every situation, but it has been great for our team at AutoML Station.

Git workflow based on git rebase

The following explanation assumes that the team in question uses the develop branch as their main branch for features in development and that you are about to start a new task on my_new_branch.

Since there are more people working on the project, two things have to be considered:

It is not recommended to work directly on the develop branch;
You have to pull all the latest changes from the remote develop branch to your local repository to make sure your work is up to date with the rest of the team.

This can be done with:

git checkout develop
git pull

After synchronizing your local and remote repos, you can see below that E is the team's latest commit and you want to start your work based on it.

    A---B---C---D---E 'develop'

You can use the following commands to create my_new_branch and switch into it to start working:

git branch my_new_branch (this step can be done on a web browser via Jira, Bitbucket or some other similar tool that might be synchronized to your remote repository)
git checkout my_new_branch

After completing your task, let's assume you used the commands below to add 3 new commits (H, I and J) to your new branch.

git status (to check for modified files)
git add <file> (to stage modified files)
git commit -m 'my new commit message' (to commit staged modifications)

Your git tree should look like the following:

          H---I---J 'my_new_branch'
         /
    D---E 'develop'

Suppose that while you were working on my_new_branch one of your teammates was also working on commits F and G, and his/her git tree looks like this:

          F---G 'teammate_branch'
         /
    D---E 'develop'

At this point, either of you can merge into develop without conflicts, but as soon as that's done, the other one will have to complete some extra steps to update the local work branch before being able to merge into develop without issues.

Let's assume your colleague pushed his/her branch before you and your team merged his/her work into the remote develop.

At this point, you have to pull his/her new changes to your local develop branch with:

git checkout develop
git pull

After this synchronization your git tree will be as follows:

          H---I---J 'my_new_branch'
         /
    D---E---F---G 'develop'

Now you can switch to my_new_branch and update it with the latest develop commits with the commands below.

git checkout my_new_branch
git rebase develop

The rebase command will update your branch with all commits from develop and then replay your commits on top of them, giving them new hashes (H*, I* and J*) as seen below.

                  H*--I*--J* 'my_new_branch'
                 /
    D---E---F---G 'develop'
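
To see why the rebased commits get new hashes, here is a toy Python model. It is a deliberate simplification and not how Git is implemented: a real commit hash also covers the file tree, author, dates and message. The function names are hypothetical; the only idea kept is that a commit's identifier depends on its parent:

```python
import hashlib


def commit_id(parent_id, change):
    """Toy stand-in for a commit hash: depends on the parent and the change."""
    return hashlib.sha1(f"{parent_id}:{change}".encode()).hexdigest()[:7]


def build_branch(base_id, changes):
    """Apply changes one after another on top of base_id; return the new ids."""
    ids, parent_id = [], base_id
    for change in changes:
        parent_id = commit_id(parent_id, change)
        ids.append(parent_id)
    return ids


# develop before and after the teammate's work (E, then F and G on top).
e_id = build_branch("root", ["A", "B", "C", "D", "E"])[-1]
g_id = build_branch(e_id, ["F", "G"])[-1]

my_changes = ["H", "I", "J"]
before_rebase = build_branch(e_id, my_changes)  # H, I, J on top of E
after_rebase = build_branch(g_id, my_changes)   # H*, I*, J* on top of G

print(before_rebase)
print(after_rebase)
```

The changes are the same in both lists, but every identifier differs because the parent of H changed from E to G, and that difference propagates to I and J.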

With that, all that's left to do is push your work into the remote repo:

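git push --set-upstream origin my_new_branch or just git push in case the target branch already exists remotely. Note that if you had already pushed H, I and J before rebasing, a plain git push will be rejected because the rebase rewrote those commits; in that case use git push --force-with-lease.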

Now your team can review and merge your branch into develop (probably by creating a pull request, which I expect to explore in a future post).

The final result will be a develop branch that looks like this:

    D---E---F---G---H*--I*--J* 'develop'

Summary

git checkout develop
git pull
git branch my_new_branch (this step can be done on a web browser via Jira, Bitbucket or some other similar tool that might be synchronized to your remote repository)
git checkout my_new_branch
Work on your code...
git status
git add <file>
git commit -m 'my new commit message'
git checkout develop
git pull
git checkout my_new_branch
git rebase develop
git push --set-upstream origin my_new_branch or just git push in case the target branch already exists remotely
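
For those who prefer to script the synchronization part of this summary, the sketch below wraps the four commands in Python. The function names and the dry_run parameter are assumptions made for this example; by default it only prints the commands instead of running them:

```python
import subprocess


def sync_commands(work_branch, base_branch="develop"):
    """Return the git commands that rebase work_branch onto base_branch."""
    return [
        ["git", "checkout", base_branch],
        ["git", "pull"],
        ["git", "checkout", work_branch],
        ["git", "rebase", base_branch],
    ]


def sync_branch(work_branch, base_branch="develop", dry_run=True):
    """Print the commands and, unless dry_run is set, run them in order."""
    for command in sync_commands(work_branch, base_branch):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failing step, e.g. a rebase conflict
            subprocess.run(command, check=True)


sync_branch("my_new_branch")  # dry run: only prints the four commands
```

Staging, committing and pushing are left out on purpose, since they depend on what you changed and are better done by hand.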

For a more detailed explanation of the command and its options, check the official git rebase documentation.

Photo by Christina Morillo