Author: Krzysztof Satola (satola.net)
From: https://github.com/ksatola
Version: 0.1.0

Project Environment Setup


Apple macOS (Mojave)

Introduction

There are several technical options for a data science project setup: Python, R, cloud analytical services (AWS SageMaker, Azure ML) or locally installed software such as IBM SPSS or KNIME, to name a few. The approach and toolkit should be chosen not only according to the data scientist's personal preference and professional skills, but mainly according to how well they fit the requirements, cost and timing of the business problem's solution. Other questions matter too: how the data product is meant to be used and maintained after it is created, by whom, and which other factors could affect technical decisions. Last but not least, it is good practice to strive for consistency, which makes the learning curve shallower for all project team members. This applies not only to a single project but also to the whole portfolio of projects run by a company, institution or any other organization or group.


Python

This project's technical components are based mainly on Python-related tools and modules. One reason is that most of the components needed are natively available in Python; another is that the final data product is a web app which we want to be able to customize. Building all the pieces in Python allows full flexibility and speeds up combining different components. It also makes it easier to combine ready-to-use components with custom code, getting us to the specified point faster.

To install Python 3, go to the python.org download page, then download and install the latest Python 3 version available for your platform. For more detailed instructions, see Using Python on a Macintosh or Installing Python 3 on Mac OS X.


Python Virtual Environment

Virtualenv is a tool that lets you create an isolated Python environment for your project. Each environment has its own installation directories and does not share dependencies with other virtualenv environments. You can also configure which version of Python to use for each individual environment. Using virtualenv is strongly recommended when working with Python applications.

Remember to add venv to your project's .gitignore file so you don't include all of that in your source code (for details see the section about Git).

It is preferable to install big packages (like NumPy) or packages you always use (like IPython) globally, unless you need different versions of these packages for different projects. Everything else should be installed in a virtualenv.

For more, see macOS Setup Guide - Virtualenv or Virtualenv documentation.

Open the Terminal window and navigate to the project folder. The next step is to install virtualenv and create a virtual environment.

$ cd /Users/ksatola/Documents/git/air-polution
$ pwd

Install the virtualenv package:

$ pip3 install virtualenv

Create a virtual environment for the project. While in the project folder:

$ virtualenv venv

Start the environment:

# bash
$ source venv/bin/activate

# fish
$ source venv/bin/activate.fish

After the work is done, deactivate the environment:

$ deactivate
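
When it is unclear whether the environment is currently active, the interpreter itself can be asked. The following is a minimal sketch, not part of the original setup; the function name in_virtualenv is hypothetical. It relies on the fact that inside a virtual environment sys.prefix differs from sys.base_prefix (older virtualenv releases set sys.real_prefix instead).

```python
import sys


def in_virtualenv() -> bool:
    """Return True if the current interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer virtualenv
    # releases make sys.prefix differ from sys.base_prefix instead.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("interpreter prefix:", sys.prefix)
```

Run with the environment activated, the script should report the venv folder as the interpreter prefix.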

Managing Python packages

Install

In the Terminal, while in the project's folder, run:

(venv) $ pip install matplotlib
(venv) $ pip install pandas
(venv) $ pip install jupyterlab
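(venv) $ pip install qgrid

The tree command used below is a system utility rather than a Python package, so install it with a system package manager instead of pip, for example with Homebrew (assuming Homebrew is available):

$ brew install tree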

Document

To record the installed packages in a requirements.txt file, which speeds up later installation and makes the environment settings easy to share:

$ pip freeze > requirements.txt

To install all dependencies from the file:

$ pip install -r requirements.txt
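
The file written by pip freeze is plain text with one name==version pin per line, so it is easy to inspect programmatically. The sketch below is an illustration only: parse_requirements is a hypothetical helper, the version numbers in the sample are placeholders, and lines that are not exact pins (for example URLs or editable installs) are simply skipped.

```python
def parse_requirements(text: str) -> dict:
    """Map package names to pinned versions for 'name==version' lines."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and anything that is not an exact pin
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


# Hypothetical file content; the version numbers are placeholders.
sample = """\
matplotlib==3.1.0
pandas==0.24.2  # placeholder pin

jupyterlab==1.0.0
"""
print(parse_requirements(sample))
```

A dictionary like this makes it simple to compare the pins of two environments or to spot a package that is missing from the file.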

To document the project's folder structure, use tree, a recursive directory listing command that produces a depth-indented listing of files:

$ tree -L 1

.
├── README.md
├── data
├── notebooks
├── requirements.txt
├── src
├── text
└── venv
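
If the tree command is not available, a similar one-level listing can be approximated in Python. This is a rough sketch under stated assumptions, not a replacement for tree: the function name tree_level_one is hypothetical, entries are sorted by plain string order, and hidden entries (names starting with a dot) are skipped.

```python
from pathlib import Path


def tree_level_one(root: str = ".") -> list:
    """Return lines resembling `tree -L 1` output for the given folder."""
    # Hidden entries (names starting with a dot) are left out of the listing.
    names = sorted(p.name for p in Path(root).iterdir() if not p.name.startswith("."))
    lines = ["."]
    for index, name in enumerate(names):
        # The last entry gets the closing connector, all others the branching one.
        connector = "└── " if index == len(names) - 1 else "├── "
        lines.append(connector + name)
    return lines


if __name__ == "__main__":
    print("\n".join(tree_level_one(".")))
```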

Git source code repository

Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency. We will use git together with GitHub, a web-based hosting service for git repositories that offers free accounts, to manage and share the code.

Check if you have git installed:

$ git --version

git version 2.20.1 (Apple Git-117)

If git is not installed, you will be prompted to install it. A recommended way of installing git on macOS (10.9 or above) is to install the Xcode Command Line Tools. You can also install git using a binary installer from https://git-scm.com/download/mac.

Create a git repository (for example on github.com). Then, in the Terminal, navigate to the parent folder in which the repository folder should be created and clone the repository:

$ cd /Users/ksatola/Documents/git/
$ git clone https://github.com/ksatola/air-polution.git

Go to the created folder containing all other project resources. The initial folder tree should look like this:

$ cd air-polution

$ tree -L 1

.
├── data
├── notebooks
├── src
└── text

Configure basic git settings. The --global flag applies them to every repository of the current user. Do not use my login or e-mail; choose your own:

$ git config --global user.name "ksatola"
$ git config --global user.email "krzysztof@satola.net"
$ git config -l

Create a .gitignore file to exclude the text/, notebooks/, data/, venv/ and .idea folders from being managed by git. These folders contain binaries or other resources which we do not want to store in git and send to the origin repository: they can be large, they are not plain text (so git cannot manage them properly), or they contain personal settings which differ among developers and should be kept locally (otherwise developers would override each other's settings on every pull from or push to the remote repository).
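
The .gitignore file can be written by hand in any text editor. As an alternative, the minimal sketch below creates it from Python. The helper name write_gitignore and the IGNORED list are hypothetical; the entries are the folders named above, with a trailing slash so that each pattern matches folders only. Entries already present in an existing file are not duplicated.

```python
from pathlib import Path

# Entries taken from the folders listed above; trailing slashes mark folders.
IGNORED = ["text/", "notebooks/", "data/", "venv/", ".idea/"]


def write_gitignore(project_dir: str = ".") -> Path:
    """Add any missing entries from IGNORED to the project's .gitignore."""
    path = Path(project_dir) / ".gitignore"
    existing = path.read_text().splitlines() if path.exists() else []
    missing = [entry for entry in IGNORED if entry not in existing]
    path.write_text("\n".join(existing + missing) + "\n")
    return path
```

Calling write_gitignore() from the project folder creates or updates the file in place; afterwards, git status should no longer list the excluded folders as untracked.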

Simple one-branch git flow:

(venv) $ git status
(venv) $ git pull
(venv) $ git add .
(venv) $ git commit -m "FEAT: git repo and python configuration update"
(venv) $ git push

For more about git and how to use it, see the GitHub and Atlassian tutorials.


Jupyter Notebook/Lab for experimentation and prototyping

JupyterLab is a web-based interactive development environment for Jupyter notebooks, code, and data. Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

In most cases JupyterLab and Jupyter Notebook are interchangeable, and either can be used to read the notebooks created for this project. The former has an IDE-like interface (configurable, resizable windows and tabs); the latter is the classic version of the tool.

For more, see the installation instructions.


PyCharm as an Integrated Development Environment

PyCharm belongs to the JetBrains family of software development tools and is described by its authors as "The Python IDE for Professional Developers". It is a full-fledged integrated development environment (IDE) with many configuration and extension options. It is a good choice for Python programming as it can be used for free (Community Edition) and supports good programming practices.

To start using PyCharm after installing it, create a new PyCharm project in the folder cloned from the git repository. This configures the PyCharm project inside that folder by adding a hidden .idea folder.

For more, see https://www.jetbrains.com/pycharm/.

In PyCharm, right-click the project folders in the Project tab and choose Mark Directory As from the context menu:

  • excluded: venv, notebooks, data, text
  • sources root: src

This makes PyCharm work faster, because excluded folders are not indexed.

Microsoft Windows 10

For general comments regarding Python, Python virtual environments, managing Python packages, git, Jupyter Notebook/Lab and PyCharm, see the macOS sections above. Specific instructions on how to configure these tools on Windows 10 are given below. All console commands provided above work the same on macOS and on Windows 10 (in the Linux virtual machine).

# Available soon