The #1 reason Python rocks for data science

A few weeks ago we developed a simulation tool at AdTriba. It required solving convex optimization problems, which is not the kind of thing we wanted to implement ourselves, so we needed a package for it. For any statistical analysis, the first thing that comes to mind is R. R has packages for virtually everything. But we never looked on CRAN and went straight to the Python Package Index (PyPI) instead.

Why? Because our tool was never intended to only run on a single computer. We needed a service for our users. We needed a data product. For us, as a small team, it’s crucial to iterate quickly and to avoid overhead. Thanks to this lean approach it took us only three days from our first line of code to a working product our users could use to optimize their marketing budgets.

Python for data products

The decision to use Python might not be obvious, so let me explain our reasoning. Python excels in two areas: machine learning and web development. At first glance they may seem unrelated, but if you look closer they complement each other. You can use your favorite tools like IPython and pandas to wrangle your data and build an ML model from it. Afterward, you can use one of the many Python web frameworks to turn your model into a real product. There is no need to switch languages or rewrite a single line of code.

Using Python as a basis for data products is quite simple. The only requirement is that you have some coding skills. As a data scientist you should be capable of writing quality code that’s checked into version control and has at least some structure, i.e. it is split into separate functions and has well-defined inputs and outputs. If your ML code fulfills these requirements, then you’re halfway done with your prototype.

Turn your script into a product

To illustrate the basic process of building a data product I will show you the steps we took to develop our simulation tool:

  1. With IPython notebooks we implemented the first proof-of-concept that used global variables to control the script.
  2. We extracted the IPython code into a regular *.py file to give it more structure. This refactoring resulted in a function simulate() that could easily be called with input data and returned calculated values.
  3. Next, we built a RESTful API around simulate() that would accept a JSON request as input and return a JSON response.
  4. We added the front end in our AdTriba Dashboard and consumed the RESTful API.
  5. Celebrate! 🎉
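To make step 3 a bit more concrete, here is a minimal, framework-agnostic sketch of the JSON layer around simulate(). The request shape (an object with a "budgets" key), the handle_request() helper and the toy model inside simulate() are assumptions for illustration, not our production code. In a Flask app, handle_request() would roughly correspond to the body of a route handler.

```python
import json


def simulate(budgets):
    """Hypothetical stand-in for the real model: doubles every channel budget."""
    return {channel: amount * 2 for channel, amount in budgets.items()}


def handle_request(body):
    """Map a raw JSON request body to an (HTTP status, JSON response body) pair."""
    try:
        payload = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return 400, json.dumps({"error": "request body is not valid JSON"})

    # Only accept a JSON object that carries a "budgets" object.
    budgets = payload.get("budgets") if isinstance(payload, dict) else None
    if not isinstance(budgets, dict):
        return 400, json.dumps({"error": "expected an object with a 'budgets' object"})

    return 200, json.dumps({"result": simulate(budgets)})


status, response = handle_request('{"budgets": {"search": 100, "social": 50}}')
print(status, response)
```

Keeping the JSON handling in a plain function like this means the model code never has to know about HTTP, and the web layer stays a thin wrapper.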

Alternatively, we could have started with step 3 and had simulate() return dummy data. Going this route enables you to tackle the problem from two sides: one side is the ML code, the other the web framework integration. The huge advantage is that both sides can work completely independently of each other, because you agreed on a mutual “contract”: simulate() accepts data in a specific format and returns data in a specific format. On the ML side you implement the model, and whenever you are ready, you switch simulate() from dummy output to real values. For the web development part nothing changes, because the data format you agreed upon is still valid; the only difference is that the HTTP endpoint now delivers correct output for whatever you feed in.
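Here is a small sketch of what such a contract could look like in code. The field name "conversions", both simulate variants and the numbers in them are made up for illustration; the only point is that the dummy and the real implementation return data of the same shape, so the web side cannot tell them apart.

```python
def simulate_dummy(budgets):
    """Placeholder used while the model is not ready: right shape, fake values."""
    return {"conversions": {channel: 0.0 for channel in budgets}}


def simulate_real(budgets):
    """Hypothetical model: assumes one conversion per 50 units of spend."""
    return {"conversions": {channel: amount / 50 for channel, amount in budgets.items()}}


def fulfills_contract(budgets, response):
    """Check the agreed format: one float per requested channel under 'conversions'."""
    conversions = response.get("conversions")
    if not isinstance(conversions, dict):
        return False
    same_channels = set(conversions) == set(budgets)
    all_floats = all(isinstance(value, float) for value in conversions.values())
    return same_channels and all_floats


request = {"search": 100, "social": 50}
for simulate in (simulate_dummy, simulate_real):
    print(simulate.__name__, fulfills_contract(request, simulate(request)))
```

A check like fulfills_contract() can live in a shared test, so both sides notice immediately if the format drifts.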

We have applied this process multiple times at AdTriba already. It allowed us to iterate on these kinds of services quickly. The developer responsible for the web part didn’t have to wait for the data scientist to finish the ML code, so nobody was blocked and no time was wasted.

If you are working solo, then it makes more sense to start with the ML part and continue from there. This approach has the advantage that you don’t need to decide on the exact data format but can focus on your model and define the data structure whenever you’re ready. Also, you minimize the context switching involved as you’re solving one problem before starting with another.

Know your web tools

Assuming you already know a bit about building ML models, you’re probably asking yourself how to get started with the web development part.

The first question you have to answer is: What web framework should you use? As always, it depends. The two most popular Python web frameworks are Django and Flask. Both of them are great frameworks but with different philosophies:

  • Flask is a microframework. It includes only the essentials; if you need anything else, you have to add it via extension libraries or write it yourself.
  • Django, on the other hand, is batteries-included. You don’t have to care about user management, database interactions, etc. Django has you covered.

I have used both frameworks quite a lot, and both are great. If you want to dive deeper into either of them, I can highly recommend these books:

The simulation tool didn’t require any persistence or user management, so we decided on using Flask. This decision allowed us to write a very lightweight service without any boilerplate code. It’s only a thin layer around our simulation logic.

When I started to work on ABBY, a side project for evaluating and documenting A/B tests, I decided on Django immediately. Why? Because there was a database involved, I needed user management, and it required the functionality to create, edit and delete A/B tests (in web dev speak this is referred to as CRUD: create, read, update, delete).

If you don’t know which framework to use, my rule of thumb is simple: Use Flask for internal tools without too much logic. Use Django for everything else. Every project will grow, and requirements will change. The moment will come when the effort of adding functionality to Flask is bigger than the initial overhead of setting up Django.

ABBY: The story of a failed side project

This is the story of ABBY, an evaluation and documentation service for A/B testing. It was a side project I invested much time and effort in, but in the end it never took off, and I pulled the plug.

Let’s dive into ABBY’s history and the steps that led to its eventual failure.

The origins

The original idea was born out of necessity. I wanted to approximate how many users were needed for an A/B test at our company. But all I could find from former tests was a lousily maintained Google Sheet that didn’t help me at all.

Some days later I sat at the airport waiting for my flight to PyData London. With some time left until departure, I began sketching a prototype of a documentation service. After I laid the groundwork, it became clear that the whole thing was not overly complex. It seemed reasonable to have a working prototype by the time I returned from the conference.

So I spent my two evenings in London writing the first rough version of ABBY – what a creative name, right?

When I returned to the office I showed it to my colleagues and they encouraged me to go on with it and polish it. After using it for the first A/B tests we realized ABBY would be way more useful if it could also handle the evaluation part. Until then, everyone evaluating A/B tests in the company used their own method of choice. No standard script existed, only custom R scripts and Python files.

In the course of weeks, ABBY became an indispensable tool for our A/B tests. And the more I worked on the project the more convinced I got that other people must have the same problems.

ABBY for everybody

I talked with my CEO about extending ABBY to a software-as-a-service. I truly appreciate that he didn’t stop me, but gave me the OK to work on it. Immediately after getting home from work that day I purchased a domain and a Bootstrap theme. (Learning 1: Never buy a domain before you need it). The initial plan was to just add a user concept around the existing code.

In retrospect, I’m not surprised that I ended up rewriting the whole thing from scratch. There are so many things that a SaaS product requires that an internal company tool doesn’t need. The good thing was that I could avoid a lot of the design mistakes I made in the first attempt. I knew the bottlenecks and problems from the first implementation. And I was eager to ship quality software that didn’t feel like a hack. (Learning 2: Lean is king: you have to be ashamed of your initial version.)

It was fun to work on something I “own”, to see the progress I made and the challenges I tackled one after another. And I learned a ton about things I wouldn’t have touched in my day job as a data scientist. For instance, I completely underestimated the time needed for the design. I had no clue about CSS (and I still don’t), but it’s necessary if you want to change the appearance of your website. JavaScript has always been a weird language to me. Now I’m fascinated by AngularJS. It makes it so easy to build websites that are more interactive and fun to use.

Launch and gaining traction

My initial plan was to release ABBY (again, very creative, I know) as soon as possible. But I wasn’t satisfied with the result and decided that I needed to add one more feature. Again and again. It felt like an endless cycle, where I sat at home and implemented new features, tweaked the design or thought about possible next steps.

By the end of January 2015, after working on the project for more than half a year, I was sick of waiting any longer and put the website live. I tweeted about it and posted a link to Hacker News. Both gave me some initial traffic. There wasn’t much traffic from HN and it didn’t take long for the link to disappear from the front page. But it was enough to attract some people and convince them to sign up.

It turned out there was some real interest in ABBY!

I was lucky enough to get featured on Product Hunt, too. That got me massive traffic and a good number of signups for a few days. Additionally, I set up an AdWords campaign and spent some bucks to gauge further interest in the product. About 10% of users who clicked the ad signed up. (Admittedly, that was helped by the fact that ABBY was free while in beta.)

The slow death

So what happened? Why did I shut it down instead of celebrating that I launched a project many people would be happy about?

After being happy to see all the signups come in I took a look at their usage behavior and that’s where I hit a wall. Only three customers were actively using the product, the rest dropped off after a page view or two. 97% of my user base was dead. Worse, they never were alive.

There are three main reasons why I think this endeavour failed:

1. User education

It turned out the problems ABBY addressed are real and many people face them. But they don’t know they face them. If you start a project you can’t afford to spend time on user education. Many people I spoke with ran test after test without any documentation; very few realized the benefit of a service like ABBY. When someone finds your product they should immediately be able to understand the problem for which you offer a solution.

2. Use case

ABBY was a niche product. You couldn’t run A/B tests with it. You were “just” able to evaluate tests and document them so you don’t lose valuable knowledge and insights. This reduced the potential audience enormously. Lots of companies use Optimizely or Google Analytics to run A/B tests. These tools already offer the evaluation part (although not the documentation part). Why should these users add another tool if they don’t see its benefit?

3. Usability and Onboarding

So ABBY obviously couldn’t meet users’ expectations. And it really didn’t surprise me once I tried to see the website from their perspective. After signing up you were presented with a white screen and a message to set up your first test. Great, most people actually clicked that. But what happened then was a UI/UX nightmare: a form with about 15 fields (only 2 or so mandatory), without any explanation. And only after that could you start evaluating your test. Sure, I would have closed the browser tab just as quickly as these abandoning users did.

So I decided to overhaul the flow to create a new A/B test. You could start with an empty test, import existing results or start with an evaluation of a finished test right away.

The idea was probably great and might have saved ABBY, but it was way too large for a side project. I was lucky if I could spend 5-10 hours a week on the project. Implementing this flow took me weeks. And some weekends I just couldn’t find the motivation to continue. And that was the real reason ABBY shut down: I just couldn’t find the motivation to finish this amount of work for a change that may or may not have turned the product around. I simply put too much on my plate for a project with very limited resources.


But I learned an important lesson on the way: Start small and continue small. Never put more on your plate than you can eat. Or in the case of a side project: break down your work into easily achievable chunks.

The most important thing is to stay motivated. And nothing is more frustrating than working on a project for a year before realizing that nobody cares about it.

There are many guides out there on how to approach a side project. But the most crucial thing is to gather feedback as early as possible. You won’t lose users with an ugly interface as long as you provide value to your customers.

Lift analysis – A data scientist’s secret weapon

Whenever I read articles about data science I feel like there is some important aspect missing: evaluating the performance and quality of a machine learning model.

There is always a neat problem at hand that gets solved, and the process of data acquisition, handling, and model creation is discussed, but the evaluation part is too often very brief. Yet I truly believe it’s the most important step when building a new model. Consequently, the first post on this blog will deal with a pretty useful evaluation technique: lift analysis.

Machine learning covers a wide variety of problems like regression and clustering. Lift analysis, however, is used for classification tasks. Therefore, the remainder of this article will concentrate on these kinds of models.

The reason behind lift charts

When evaluating machine learning models there is a plethora of possible metrics to assess performance. There are things like accuracy, precision and recall, the ROC curve and so on. All of them can be useful, but they can also be misleading or fail to answer the question at hand.

Accuracy¹, for example, might be a useful metric for balanced classes (that is, each label has about the same number of occurrences), but it’s totally misleading for imbalanced classes. The problem is: data scientists have to deal with imbalanced classes all the time, e.g. when predicting if a user will buy something in an online shop. If only 2 out of 100 customers buy anyway, it’s easy for the model to predict everyone as not buying, and it would still achieve an accuracy of 98%! That’s absolutely not useful when trying to assess the model’s quality.
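The 98% figure is easy to reproduce. The following sketch builds exactly the situation described above (2 buyers out of 100 customers) and scores a "model" that predicts "does not buy" for everyone. I also compute recall, the share of actual buyers the model found, to show what accuracy hides.

```python
# 100 customers, only 2 of them buy (label 1); the rest do not (label 0).
labels = [1] * 2 + [0] * 98

# A "model" that predicts "does not buy" for everyone.
predictions = [0] * len(labels)

# Accuracy: correctly labeled observations / all observations.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall: buyers the model found / all actual buyers.
buyers_found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = buyers_found / sum(labels)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# prints: accuracy=0.98, recall=0.00
```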

Of course, other metrics like precision and recall give you important information about your model as well. But I want to dig a bit deeper into another valuable evaluation technique, generally referred to as lift analysis.

To illustrate the idea, we’ll consider a simple churn model: we want to predict if a customer of an online service will cancel their subscription or not. This is a binary classification problem: the user either cancels the subscription (churn=1) or keeps it (churn=0).

The basic idea of lift analysis is as follows:

  1. group data based on the predicted churn probability (value between 0.0 and 1.0). Typically, you look at deciles, so you’d have 10 groups: 0.0 – 0.1, 0.1 – 0.2, …, 0.9 – 1.0
  2. calculate the true churn rate per group. That is, you count how many people in each group churned and divide this by the total number of customers per group.
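The two steps above can be sketched in a few lines of plain Python. The function name churn_rate_by_bucket() and the sample data are made up for illustration. Note that this sketch uses fixed-width score ranges (0.0 to 0.1, 0.1 to 0.2, and so on) as described in step 1, and reports None for a bucket that contains no customers.

```python
def churn_rate_by_bucket(scores, churned, n_buckets=10):
    """Group customers by predicted score and return the true churn rate per bucket.

    scores:  predicted churn probabilities between 0.0 and 1.0
    churned: actual outcomes, 1 if the customer churned, else 0
    """
    counts = [0] * n_buckets
    churners = [0] * n_buckets
    for score, label in zip(scores, churned):
        # A score of exactly 1.0 belongs to the last bucket.
        index = min(int(score * n_buckets), n_buckets - 1)
        counts[index] += 1
        churners[index] += label
    return [
        churners[i] / counts[i] if counts[i] else None
        for i in range(n_buckets)
    ]


# Made-up example data: seven customers with their scores and true outcomes.
scores = [0.05, 0.08, 0.15, 0.95, 0.92, 0.97, 0.91]
churned = [0, 0, 1, 1, 1, 1, 0]
rates = churn_rate_by_bucket(scores, churned)
print(rates)
```

Plotting these rates as a bar chart, one bar per bucket, gives you the lift chart discussed below.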

Why is this useful?

The purpose of our model is to estimate how likely it is that a customer will cancel their subscription. This means our predicted (churn) probability should be directly proportional to the true churn probability, i.e. a high predicted score should correlate with a high actual churn rate. Vice versa, if the model predicts that a customer won’t churn, then we want to be sure that it’s really unlikely that this customer will churn.

But as always, a picture is worth a thousand words. So let’s see what an ideal lift chart would look like:

Here you can see that the churn rate in the rightmost bucket is highest, just as expected. For scores below 0.5, the actual churn rate in the buckets is almost zero. You can use this lift chart to verify that your model is doing what you expect from it.

Let’s say there was a spike in the lower-scored groups; then you would know right away that your model has some flaw and doesn’t reflect reality properly. If it did, the true churn rate could only decrease with decreasing score. Of course, lift analysis can help you only so far. It’s up to you to identify the cause of this problem and to fix it, if necessary. After improving the model, you can just come back to the lift chart and see if the quality improved.

Additionally, I drew a black line for the hypothetical average churn rate (20%). This is useful to define a targeting threshold: scores below the threshold will be set to 0, scores above to 1. In our example, you might want to try to keep customers from canceling their subscription by giving them a discount. Then you would target all users with a score between 0.8 and 1.0, because this is the range where the churn rates are higher than the average churn rate. You don’t want to pour money down the drain for customers who have a below-average churn probability.
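Picking the threshold can be automated once you have the churn rate per bucket. In the sketch below, the function name and most of the per-bucket rates are invented for illustration; only the 20% average, the 0.97 in the top bucket and the resulting 0.8 cut-off come from the example in this post.

```python
def buckets_to_target(bucket_rates, average_rate):
    """Return the lower score bound of every bucket whose churn rate beats the average."""
    n_buckets = len(bucket_rates)
    return [
        index / n_buckets
        for index, rate in enumerate(bucket_rates)
        if rate is not None and rate > average_rate
    ]


# Hypothetical churn rates for the score buckets 0.0-0.1, 0.1-0.2, ..., 0.9-1.0.
bucket_rates = [0.0, 0.0, 0.01, 0.01, 0.02, 0.05, 0.10, 0.15, 0.36, 0.97]
average_rate = 0.2

targets = buckets_to_target(bucket_rates, average_rate)
threshold = min(targets)  # target every customer scoring at or above this value
print(targets, threshold)
```

With these numbers only the two highest buckets qualify, which matches the 0.8 to 1.0 targeting range mentioned above.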

But what is lift exactly?

Until now, we only looked at nice charts. But usually, you’re interested in the lift score as well. The definition is pretty simple:

$$ \text{lift} = \frac{\text{predicted rate}}{\text{average rate}} $$

Here, “rate” refers to the churn rate, but it might as well be a conversion rate, response rate, etc.

Looking back at our example chart, the highest group would have a lift of 0.97 / 0.2 = 4.85 and the second-highest group of 1.8. That means, if you only target users with a score higher than 0.9, you can expect to catch nearly five times more churning users than you would by targeting the same number of people randomly.
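As a quick check of the arithmetic, here is the lift formula as a tiny helper. The function name lift() is my own; the numbers are the ones from the example chart (a churn rate of 0.97 in the top group against an average of 0.2).

```python
def lift(group_rate, average_rate):
    """Lift of a group: its rate divided by the average rate over all customers."""
    if average_rate <= 0:
        raise ValueError("average_rate must be positive")
    return group_rate / average_rate


top_group_lift = lift(0.97, 0.2)
print(f"{top_group_lift:.2f}")
# prints: 4.85
```

A lift of 1.0 means the group is no better than random targeting; anything above 1.0 means the model concentrates churners in that group.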


Just like every other evaluation metric, lift charts aren’t a one-stop solution. But they help you get a better picture of the overall performance of your model. You can quickly spot flaws if the lift chart is not monotonic, i.e. if the churn rate does not rise steadily with the score. Additionally, it helps you to set a threshold for which users are worth targeting. Last but not least, you have an estimate of how much better you can target users compared to random targeting.

I hope this first blog post gave you some new insights or you enjoyed it as a refresher. If you have any questions or feedback, just leave a comment or shoot me a tweet.

  1. The ratio of correctly labeled observations to the total number of observations.