The #1 reason Python rocks for data science

A few weeks ago we developed a simulation tool at AdTriba. It required solving convex optimization problems, which is not the kind of code we wanted to write ourselves, so we needed a package for it. For any statistical analysis, the first thing that comes to mind is R. R has packages for virtually everything. But we never looked on CRAN and went straight to the Python Package Index (PyPI) instead.

Why? Because our tool was never intended to only run on a single computer. We needed a service for our users. We needed a data product. For us, as a small team, it's crucial to iterate quickly and to avoid overhead. Thanks to this lean approach it took us only three days from our first line of code to a working product our users could use to optimize their marketing budgets.

Python for data products

The decision to use Python might not be obvious, so let me explain our reasoning. Python excels in two areas: machine learning and web development. At first glance they may seem unrelated, but if you look closer they complement each other. You can use your favorite tools like IPython and pandas to wrangle your data and build an ML model from it. Afterward, you can use one of the many Python web frameworks to turn your model into a real product. There is no need to switch languages or rewrite a single line of code [1].

Using Python as a basis for data products is quite simple. The only requirement is that you have some coding skills. As a data scientist you should be capable of writing quality code that's checked into version control and has at least some structure, i.e. it is split into separate functions with well-defined inputs and outputs. If your ML code fulfills these requirements, then you're halfway done with your prototype.

Turn your script into a product

To illustrate the basic process of building a data product I will show you the steps we took to develop our simulation tool:

  1. With IPython notebooks we implemented the first proof-of-concept that used global variables to control the script.
  2. We extracted the IPython code into a regular *.py file to give it more structure. This refactoring resulted in a function simulate() that could easily be called with input data and returned calculated values.
  3. Next, we built a RESTful API around simulate() that would accept a JSON request as input and return a JSON response.
  4. We added the front end in our AdTriba Dashboard and consumed the RESTful API.
  5. Celebrate! 🎉
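
Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not our production code: the parameters of simulate() (budgets, return_rates), the helper handle_request() and the "model" itself (revenue = budget * rate) are all hypothetical placeholders. The point is only the shape: a plain function with explicit inputs and outputs instead of global variables, plus a thin wrapper that speaks JSON.

```python
import json


def simulate(budgets, return_rates):
    """Return the simulated revenue per channel and in total.

    Placeholder model (an assumption for this sketch): revenue is simply
    budget * rate. The real optimization logic would live here.
    """
    revenue = {channel: budgets[channel] * return_rates[channel]
               for channel in budgets}
    return {"revenue": revenue, "total": sum(revenue.values())}


def handle_request(body):
    """Hypothetical JSON wrapper: takes a JSON string, returns a JSON string."""
    payload = json.loads(body)
    result = simulate(payload["budgets"], payload["return_rates"])
    return json.dumps(result)


# Example request, as the front end might send it.
request_body = json.dumps({
    "budgets": {"search": 100, "social": 50},
    "return_rates": {"search": 2, "social": 3},
})
print(handle_request(request_body))
```

Because simulate() no longer depends on notebook state, it can be called the same way from a notebook, a test or a web framework.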

Alternatively, we could have started with step 3 and had the simulate() function return dummy data. Going this route enables you to tackle the problem from two sides: one side is the ML code, the other the web framework integration. You now have the huge advantage that both sides can work completely independently of each other, because you agreed on a mutual "contract": simulate() accepts data in a specific format and returns data in a specific format. On the ML side you implement the model, and whenever you are ready, you switch simulate() from dummy data output to real values. For the web development part nothing changes, because the data format you agreed upon is still valid. The only difference is that the HTTP endpoint now delivers correct output depending on what you feed in.
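
Such a contract could look roughly like the sketch below. All names here (check_contract, simulate_dummy, simulate_real) and the output format with "revenue" and "total" keys are assumptions made up for illustration; the "real" model is again just a placeholder.

```python
def check_contract(result):
    """Raise ValueError if a result does not match the agreed format."""
    if set(result) != {"revenue", "total"}:
        raise ValueError("result must have exactly 'revenue' and 'total'")
    if not isinstance(result["revenue"], dict):
        raise ValueError("'revenue' must map channel names to numbers")
    if not isinstance(result["total"], (int, float)):
        raise ValueError("'total' must be a number")
    return True


def simulate_dummy(budgets):
    """Stub for the web side: correct format, made-up numbers."""
    return {"revenue": {channel: 0.0 for channel in budgets}, "total": 0.0}


def simulate_real(budgets):
    """Placeholder for the ML side (assumption: revenue = 2 * budget)."""
    revenue = {channel: 2.0 * amount for channel, amount in budgets.items()}
    return {"revenue": revenue, "total": sum(revenue.values())}


budgets = {"search": 100.0, "social": 50.0}

# Both implementations satisfy the same contract, so the web code
# does not notice when the stub is swapped for the real model.
for implementation in (simulate_dummy, simulate_real):
    check_contract(implementation(budgets))
```

A check like this can run in the tests of both sides, so a change to the format is noticed early.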

We have already applied this process multiple times at AdTriba. It allowed us to iterate quickly on these kinds of services. The developer responsible for the web part didn't have to wait for the data scientist to finish the ML code, so no time was wasted because nobody got blocked.

If you are working solo, then it makes more sense to start with the ML part and continue from there. This approach has the advantage that you don't need to decide on the exact data format but can focus on your model and define the data structure whenever you're ready. Also, you minimize the context switching involved as you're solving one problem before starting with another.

Know your web tools

Assuming you already know a bit about building ML models, you're probably asking yourself how to get started with the web development part.

The first question you have to answer is: What web framework should you use? As always, it depends. The two most popular Python web frameworks are Django and Flask. Both of them are great frameworks but with different philosophies:

  • Flask is a micro framework. It includes only the essentials. If you need anything else, you have to add it via extension libraries or write it yourself.
  • Django, on the other hand, is batteries-included. You don't have to care about user management, database interactions, etc. Django has you covered.

I have used both frameworks quite a lot, and both are great. If you want to dive deeper into either of them, I can highly recommend these books:

The simulation tool didn't require any persistence or user management, so we decided on using Flask. This decision allowed us to write a very lightweight service without any boilerplate code. It's only a thin layer around our simulation logic.
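
To give an idea of how thin such a layer can be, here is a minimal Flask sketch. It is not our actual service: the /simulate route, the request format and the placeholder simulate() function are assumptions for illustration. It uses Flask's built-in test client to send a request, so no server has to be started.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def simulate(budgets):
    """Placeholder for the simulation logic (assumption: revenue = 2 * budget)."""
    return {"total": sum(2 * amount for amount in budgets.values())}


@app.route("/simulate", methods=["POST"])
def simulate_endpoint():
    # Parse the JSON body, call the model code, return the result as JSON.
    payload = request.get_json()
    return jsonify(simulate(payload["budgets"]))


# Try the endpoint with Flask's test client instead of a running server.
with app.test_client() as client:
    response = client.post(
        "/simulate", json={"budgets": {"search": 100, "social": 50}}
    )
    print(response.status_code, response.get_json())
```

The web layer only translates between HTTP and Python. Everything interesting stays in simulate(), which can be developed and tested without Flask.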

When I started to work on, I decided on Django immediately. Why? Because there was a database involved, I needed user management, and it required the functionality to create, edit and delete A/B tests (in web dev speak this is referred to as CRUD: create, read, update, delete).

If you don't know which framework to use, my rule of thumb is simple: Use Flask for internal tools without too much logic. Use Django for everything else. Every project will grow, and requirements will change. Sooner or later the effort of adding functionality to Flask will exceed the initial overhead of setting up Django.

In the next blog posts, I will cover data products in more detail and will show some examples of services I built.

As always, feel free to leave a comment or send me a tweet.

  1. I am aware that you cannot develop every data product in Python. In bigger companies, there might be requirements about which languages to use for user-facing services, e.g. Java. Or the data scientists might not be proficient enough in Python. But if you are in a position to accompany the development process end-to-end (or you're doing this as a side project), then Python should be on your radar.

Andy Goldschmidt

I'm Head of Data Science at AdTriba, a company for data-driven marketing attribution. Previously, I worked at Akanoo, an on-site targeting company and at Jimdo, a DIY website builder.

Hamburg, Germany