The Three Flavors of Python Development

Working on a data science team, I've seen many different types of Python code. These styles and practices tend to fall into three categories of Python development, and all have their place.

This is just my attempt to define these different practices and determine where they should be used.

Type 1: Research Code

Put simply, this is code that was written to "get the job done" as quickly as possible. This code is probably not designed to be read by someone else and might have been written in a Jupyter notebook.

Go Fast 🔥

This is great for making sure a workflow works or for prototyping functionality. The main advantage of research code is that it is extremely fast to write. No fancy software patterns, and no worries about obnoxiously nested for-loops, DRY code [1], documentation, or clarity of code.

This is the main (and I would argue only) benefit of research code. This speed of development is really important in cases in which feasibility analysis is needed before more time is spent on building out an intermediate or final solution.

Summary

Benefits of Research Code:
✅ ease/speed of development

Downsides of Research Code:
⛔ little/no security considerations
⛔ minimal readability
⛔ little/no documentation
⛔ lack of robustness
⛔ resource inefficiency

Type 2: Semi-Prod Code

Semi-Prod code is the compromise between research code and Full-Prod code, our third flavor of Python. It addresses the downsides of research code without taking too much of a hit to speed and overall development experience. Things might still be subject to change in this type of development, so we want to ensure that we're not abstracting away all our options. Prefer locality of behavior (keeping related logic close together) over 'clean code' abstractions, as this is almost always easier to refactor. If something can be reused, be thoughtful!

At this point, our code still isn't documented (because it might change), but we're going to write self-documenting code that makes full use of the type hinting that Python offers. This way, other developers don't have to guess what you were thinking.

Using Type Hinting

If there's a type you're looking for, check Python's typing module.

from typing import List, Dict

def my_function(a: int, b: Dict[int, float]) -> List[str]:
  """We'll now get type hinting for everything that uses this function!"""
  ...

Depending on the version of Python that you're using, more or fewer typing features are baked into the standard library. See this source for more details.
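
As a rough illustration of those version differences, assuming Python 3.9 or newer: the built-in list and dict types can be subscripted directly, so the List and Dict imports become optional. The function name and sample data here are made up for the example.

```python
from typing import Optional


def label_scores(scores: dict[int, float], prefix: Optional[str] = None) -> list[str]:
    """Format each score as a string, optionally prefixed (hypothetical example)."""
    start = f"{prefix}: " if prefix else ""
    return [f"{start}{key}={value:.2f}" for key, value in scores.items()]


# Built-in generics (dict[...], list[...]) need Python 3.9+;
# on older versions, fall back to typing.Dict and typing.List.
labels = label_scores({1: 0.5, 2: 0.25}, prefix="run")
```

On older interpreters the typing.List / typing.Dict spelling still works, so this is mostly a readability win rather than a requirement.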

There are also some tools you'll want to use in this flavor of development to keep your codebase clean while you work fast and implement changes.

Dependency Management

Ensure that you're tracking dependencies in the codebase through tools like pip with requirements.txt/requirements-dev.txt files or a more heavyweight solution like pipenv.

Formatting / Linting

Ensure standard code structure & cleanliness by requiring formatting. This keeps the code looking good and easy to read while requiring minimal time to set up for a project. My favorite formatter is Black; it's super opinionated so you don't need to be. 😅

Also tack on linting. A linter is software that statically analyzes your code and looks for mistakes or bad practices. This is super helpful for catching stupid bugs that Python would otherwise let slide until it's too late. My favorite is Flake8, but there are many others to choose from.
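
Here's a small sketch of the kind of bug I mean. The function is hypothetical, and the typo is deliberate: the name fallback_value is never defined. Python happily imports and runs this module, and only blows up when the rarely used branch executes. Flake8 should flag it up front as F821 (undefined name).

```python
def safe_divide(numerator: float, denominator: float) -> float:
    """Divide two numbers (hypothetical example with a deliberate bug)."""
    if denominator == 0:
        # Deliberate typo: `fallback_value` is never defined. Python only
        # notices when this branch actually runs; a linter notices immediately.
        return fallback_value  # noqa: F821
    return numerator / denominator


# The happy path works, so the bug can hide for a long time.
result = safe_divide(6, 3)
```

The noqa comment is only there so the example itself doesn't fail a lint run; drop it and Flake8 will report the line.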

Include it in your CI/CD!

If you want to get really fancy, include formatting/linting checks as part of your CI/CD pipeline and have them run before merge requests (MRs) can be merged.

Maybe you're okay with them not passing at times, but it's a good reference and reminder to clean things up when you can and catch those silly little bugs. 🐛

Summary

Benefits of Semi-Prod Code:
✅ self-documenting code
✅ easy to read code
✅ thoughtful code reuse
✅ easy to work on with multiple devs
🟡 ease/speed of development

Downsides of Semi-Prod Code:
⛔ minimal documentation
⛔ unoptimized
⛔ little test coverage

Type 3: Full-Prod Code

For Full-Prod Python code, our #1 priority is robustness. We're going for a solution that is readable, tested, reliable, and (blazingly 🔥) fast. This is the point where time is allocated for optimizing performance.

More Static Code Analysis

At this point you'll want to break out the heavy-hitters of static code analysis to ensure that you're using Python's type hinting as a bug-catching machine.

mypy is going to be one of the first tools that you'll want to reach for. It statically analyzes your code to ensure that your types line up: that you're passing the correct types of arguments to functions and using each function's return value as the type annotated in its signature.
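
A minimal sketch of what that buys you, using made-up function names. The interesting part is the None check: if you delete it, the code still runs fine for known users, but mypy should complain that the value may be None before you ever hit the crash (the exact wording of the message varies by version).

```python
from typing import Dict, Optional


def find_email(users: Dict[int, str], user_id: int) -> Optional[str]:
    """Return the email for user_id, or None when the user is unknown."""
    return users.get(user_id)


def email_domain(users: Dict[int, str], user_id: int) -> str:
    """Return the domain part of a user's email, or "unknown"."""
    email = find_email(users, user_id)
    # Without this check, mypy should report that `email` may be None,
    # so calling .split() on it is not safe.
    if email is None:
        return "unknown"
    return email.split("@")[1]
```

Running mypy against the file (or the whole package) is enough; no changes to the code itself are needed beyond the annotations.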

If you're using Docker as part of your workflow, integrating container scanning is also important to help make sure that your system is free of known vulnerabilities.

Unit / Integration Testing

Add unit tests via the unittest module and ensure that all tests pass before MR approval. If your system depends on other systems (e.g. a database or an object store), add integration tests via Docker and Docker Compose.
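
For reference, a unit test with unittest can be as small as this. The function under test, normalize_name, is a hypothetical stand-in for your own code.

```python
import unittest


def normalize_name(raw: str) -> str:
    """Hypothetical function under test: trim whitespace and title-case."""
    return raw.strip().title()


class NormalizeNameTests(unittest.TestCase):
    def test_strips_whitespace_and_capitalizes(self) -> None:
        self.assertEqual(normalize_name("  ada lovelace "), "Ada Lovelace")

    def test_empty_string_stays_empty(self) -> None:
        self.assertEqual(normalize_name(""), "")
```

Running python -m unittest from the project root discovers and runs test classes like this one, assuming the file follows the default test*.py naming pattern, and the same command can serve as the check that gates the MR.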

Document Reusable Code

I'm of the personal belief that there is such a thing as too much documentation, but adding some docs to guide the use of reusable functions via docstrings can be very helpful.
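
As a sketch of what "some docs" might look like, here is a hypothetical reusable helper with a docstring that covers arguments, return value, and failure mode. The Args/Returns/Raises layout is one common convention, not the only one.

```python
from typing import List


def moving_average(values: List[float], window: int) -> List[float]:
    """Return the simple moving average of `values`.

    Args:
        values: Numbers to average, in order.
        window: How many consecutive values go into each average; must be
            at least 1 and no larger than len(values).

    Returns:
        One average per full window, so len(values) - window + 1 items.

    Raises:
        ValueError: If `window` is out of range.
    """
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i : i + window]) / window
        for i in range(len(values) - window + 1)
    ]
```

The type hints say what goes in and out; the docstring only has to cover what the signature can't, like the valid range for window.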

Summary

Benefits of Production Code:
✅ fully documented code
✅ easy to read code
✅ unit/integration tests
✅ fewer bugs due to static code analysis

Downsides of Production Code:
⛔ development time

Final Thoughts

While you could always just borrow components from the different flavors of Python dev, I think this breakdown is a helpful tool to guide priorities and establish the standards that are necessary for your goals.

When to use ___

If you're not sure if something is going to work, or just need a working prototype, use Research code. The important thing is that research code should be labeled as such so that others don't misinterpret its use case.

If you have a solution that you know works, but it's bound to change over time, use Semi-Prod Python. Include some extra stuff to let you keep moving fast while being organized. It really doesn't add that much overhead and shouldn't really slow you down. If in doubt, this is a good flavor to choose, since it's harder to go straight from Research Python to Full-Prod Python.

If you have a solution that works and have flushed out the major design issues through Semi-Prod code, port over to Full-Prod. Include extra checks to ensure security and correctness. Integrate testing to make sure that you can confidently deploy changes.

Please 🙏

Always, always, always ensure that we're not introducing vulnerabilities into our software by allowing SQL or HTML injection. Maybe this is okay in research code, but definitely not in any code that could possibly reach production.
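
To make that concrete, here's a small sketch using only the standard library (sqlite3 and html). The table, the data, and the malicious inputs are all made up; the point is the contrast between building a query with string formatting and passing the value as a parameter.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("alice", "admin"), ("bob", "viewer")],
)

user_input = "nobody' OR '1'='1"  # a classic injection attempt

# Unsafe: string formatting lets the input rewrite the query,
# so the WHERE clause ends up matching every row.
unsafe_sql = f"SELECT name FROM users WHERE name = '{user_input}'"
leaked = conn.execute(unsafe_sql).fetchall()

# Safer: a placeholder makes the driver treat the input purely as data.
matched = conn.execute(
    "SELECT name FROM users WHERE name = ?", (user_input,)
).fetchall()

# Same idea for HTML: escape untrusted text before putting it in markup.
comment = "<script>alert('hi')</script>"
safe_html = f"<p>{html.escape(comment)}</p>"
```

Here leaked contains every user in the table, while matched is empty because nobody is literally named "nobody' OR '1'='1". Other database drivers use different placeholder styles, so check the one you're using.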

I definitely missed some things here, but hopefully you get the gist.

👋


[1] DRY (Don't Repeat Yourself) is a programming principle where any duplicated logic is consolidated into a single abstraction.