There’s no one better to talk about Cake than the people who work there. In our “Tell us about…” series, we interview Cake employees.
This time, we give the floor to Davy Cielen, Data Scientist, and Jessica Ruelens, co-founder and Head of Data Science at Cake. 👇
🍰 What do you do at Cake?
Jessica: I’m a co-founder of Cake and in the very beginning, I mainly worked as a Data Scientist. Now that the team has expanded, I’m more involved with architecture and coordination. I also align with other parts of the company, such as DevOps and the business side.
Davy: I was the second person to be recruited within the data team at Cake. I work as a Data Scientist.
Jessica: And meanwhile our Data team consists of 5 people. Nick is a Data Engineer, Thomas is a Data Analyst and Mohamed is an ETL Engineer.
🔎 What’s the difference between all these roles?
Jessica: A Data Scientist is a specialist in Machine Learning who develops the predictive models or provides clustering algorithms.
Data Engineers ensure that all enriched data ends up in the right place, at the right time and in the right format.
An ETL Engineer takes care of the processes that bring together the different sources of data in a format that can be used by the data team (the transactions, the feedback from users, …).
And a Data Analyst creates reports of the data that is already there and that has been enriched, and ensures that it reaches the various stakeholders in the best possible way (both consumers and commercial partners of Cake).
Davy: The roles also differ in their business orientation. Whereas a Data Scientist or Data Analyst looks more at the business value and formulates answers to business questions based on data, the work of a Data Engineer is more technical.
Also, the time frame in which the work is done is different. A Data Analyst usually generates insights on ad hoc issues. A Data Scientist builds models on historical data and an engineer ensures robust data flows, especially when they need to be used in real-time.
Jessica: The distinction between all these different roles is very important to create a well-functioning team. I notice that in reality this is often neglected and the wrong profiles are being recruited for the wrong roles. As a consequence you end up with a mismatch of expectations. The Data domain is new to many companies and often you see some mistakes happen along the way. A good overview of the different roles can be found in this article.
🚀 How do you start building on such an empty page?
Jessica: That’s an iterative process. And actually, we still don’t know. 😂 No, of course that’s a joke. Actually, everything started with Cake’s mission, which was very clear from the start: improve the financial well-being of everyday consumers through a better banking app.
From this mission, we entered into discussions with institutions such as CEBUD (Centre for Budgetary Advice and Research) and the OCMW. And it quickly became clear that insight into your financial behaviour is crucial to your financial well-being. That’s also what all financial experts say: the less we understand how our money works, the more bad decisions we make. And we started from there.
Davy: Your banking transactions contain all the information you need to gain a solid insight, just not in a clear form. So enriching and organizing the banking transactions became the first step. In such a process you alternate between phases of more strategic thinking and phases of more technical implementation and concerns.
Jessica: As an academic institution, CEBUD remains an important sounding board for the further development of the app. We, in turn, help them to test certain assumptions.
💁 How does this data get enriched?
Jessica: We build a model for data processing that is self-learning. The more time we invest in developing the model and the more users there are and the more data comes in, the more accurate the information that comes out of the model. But of course you can’t build such a model in one day. It goes through several phases before it is finished and ready for implementation.
The first phase is called the exploratory phase. That is no more and no less than using good common sense to understand what information is coming in. There are no fixed rules for that.
In a second phase, we collect concrete examples on the basis of which we can later give the model “hints” (the so-called features) of what it has to look out for in order to assign something to a particular transaction. For example, you can teach the model that transactions on a Saturday evening between 8pm and midnight are probably restaurant or bar expenses.
In a third phase, the story becomes more technical and the Data Scientists build a prototype.
In a fourth phase, the Data Engineers build a working model. This is the implementation phase in which real-time data can start flowing through the model.
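To make the “hints” from the second phase a bit more concrete, here is a minimal sketch of how a transaction timestamp could be turned into features, including the Saturday-evening example mentioned above. The function name and the feature names are assumptions made for illustration; this is not Cake’s actual code.

```python
from datetime import datetime


def time_features(ts: datetime) -> dict:
    """Turn a transaction timestamp into simple hints (features) for a model.

    Hypothetical helper: the feature names are illustrative only.
    """
    is_saturday = ts.weekday() == 5      # Monday is 0, so Saturday is 5
    is_evening = 20 <= ts.hour < 24      # between 8pm and midnight
    return {
        "weekday": ts.weekday(),
        "hour": ts.hour,
        "saturday_evening": is_saturday and is_evening,
    }


# 1 February 2020 was a Saturday.
features = time_features(datetime(2020, 2, 1, 21, 30))
```

A model would not treat `saturday_evening` as a hard rule. It is one signal among many, which is also why it can be down-weighted again when behaviour changes.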
Davy: Of course you start by analyzing the transactions and the information that is available. For Cake, this is all the information that, thanks to the PSD2 legislation, comes to us through the connection with the banks (the so-called API).
I then came up with the idea of applying Natural Language Processing. This is a discipline in artificial intelligence that allows computers to interpret and understand human language. However, in this case we are not dealing with human language but with a “pile” of terms that lack normal sentence structure: no verbs, no capital letters, no punctuation marks. Hence it is a difficult task to make this system work on raw and unstructured data like this one:
Because we are using the method for something it was not originally designed for, we had to “retrain” it. And there were (and still are) some challenges involved:
- You don’t have any examples to learn from. This is already difficult for large retail chains with multiple points of sale, not to mention for small entrepreneurs who only have one point of sale.
- Often the information we receive from the bank is limited in length, so we have to deal with abbreviations and sometimes cryptic descriptions (e.g. MRS in the transaction below stands for a Bpost branch).
- Multiple languages (especially in Belgium) add to the complexity.
- Different types of spelling per point of sale or store. For example, a “Proximus” transaction sometimes comes in as “Proxihus”.
- Behind a certain shop or point of sale, there is often a legal entity or legal person with a different name or location. E.g. candy store “Zoet” in Mechelen comes in as “Neuhaus” in Londerzeel.
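The spelling-variation challenge in the list above can be illustrated with a small sketch. It uses plain fuzzy string matching from Python’s standard library as a stand-in technique, which is much simpler than the retrained NLP approach Davy describes. The merchant list and the cutoff value are assumptions chosen for the example.

```python
import difflib

# Hypothetical reference list of known merchant names (lower-cased).
KNOWN_MERCHANTS = ["proximus", "bpost", "neuhaus"]


def normalise_merchant(raw: str, cutoff: float = 0.8):
    """Return the closest known merchant name, or None if nothing is similar enough."""
    cleaned = raw.strip().lower()
    matches = difflib.get_close_matches(cleaned, KNOWN_MERCHANTS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(normalise_merchant("Proxihus"))
```

Fuzzy matching only helps with misspellings. It cannot solve the last challenge in the list, where the legal entity behind a shop has a completely different name.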
Jessica: In the meantime, we’ve come a long way for Flanders. We have already analysed more than 1.5 million transactions for a total value of more than 470 million.
Approximately 60% of the terminal transactions have now been enriched with the right point of sale and/or the right category. In the beginning, we had to manually tag a large number of transactions. A great deal now runs automatically, but there will always be exceptions that have to be adjusted manually.
In any case, there remains a tension between making decisions and estimating the right degree of accuracy. We constantly make assumptions, put them into practice and then adjust them. It remains a process of going back and forth, in which we hope to end up further than where we started.
🏁 Will the model ever be finished?
Davy: No, never. 😀 Everything changes every day. Every social, cultural or economic context, or change in it, has an effect on the model that has to be taken into account. Take the example where the model learns that transactions on a Saturday night are probably restaurants or bars. That rule became worthless during the Corona crisis. Transactions on a Saturday evening are now more likely to be the activation of a Netflix subscription. In such cases we have to intervene manually and retrain the model.
Jessica: Moreover, when we later move to other countries, we will have to retrain the model again. Partly because of the different structure of the information that comes in through the banks, but also because of different languages or, in some cases, even a different script.
Davy: For Wallonia too, we have come a long way. Of course the information arrives in a different language, but the banks are the same, the way the information comes in is the same, and the large retail chains are mostly the same too.
In the meantime, we have also started the first tests with Dutch banks, so we will have to rethink our search for new rules. For Belgium, for example, we can deduce from the monthly amount of the child allowance (groeipakket) how many children a person has. If that system works differently in the Netherlands, we will have to come up with different rules.
Jessica: The ultimate goal, however, of providing insight into personal finances is universal. The rules about what a healthy financial life looks like are also mostly universal. Only the way in which we have to build up the insights for each market is different.
On top of that, as we expand we will have to take into account the composition of the team and its evolving needs, even though we already have a fairly diverse team with diverse backgrounds, working from different countries.
📝 What about user feedback?
Jessica: Feedback from users is extremely important to give the model hints or to put it on the right track. And feedback from individual users is used to improve the model for all users. So giving feedback in the app makes the app better for everyone.
This week, a new feature was added that allows users to give feedback that is displayed in real time in their app. As soon as multiple trusted users have given the same feedback, it is applied for everyone. That’s what we call the “wisdom of the crowd”.
Davy: This new feedback feature will be a big leap forward in the enrichment of transactions. We will start with location feedback but that will be expanded in the coming weeks to other types of feedback as well.
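The “wisdom of the crowd” rule described above can be sketched roughly as follows. How Cake decides who counts as a trusted user and how many matching answers are needed is not stated in the interview, so the `trusted_users` set and the `min_votes` threshold are purely hypothetical.

```python
from collections import Counter


def crowd_consensus(feedback, trusted_users, min_votes=3):
    """Return a label once enough trusted users agree on it, else None.

    feedback: iterable of (user_id, label) pairs, oldest first.
    trusted_users and min_votes are illustrative assumptions.
    """
    votes = {}
    for user_id, label in feedback:
        if user_id in trusted_users:
            votes[user_id] = label  # keep only each trusted user's latest answer
    if not votes:
        return None
    label, count = Counter(votes.values()).most_common(1)[0]
    return label if count >= min_votes else None


trusted = {"u1", "u2", "u3"}
feedback = [
    ("u1", "Zoet, Mechelen"),
    ("u2", "Zoet, Mechelen"),
    ("u9", "Neuhaus, Londerzeel"),  # not a trusted user, so ignored
    ("u3", "Zoet, Mechelen"),
]
result = crowd_consensus(feedback, trusted)
```

Counting one vote per user keeps a single enthusiastic user from pushing a correction through alone.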
🔐 How do you guarantee the security of the data?
Jessica: Since we work with transactional data, privacy and security are always our top priority. First of all, as soon as we receive the transaction details, they are completely stripped of all identification data. This means that all information that can link a transaction to a specific person is removed and stored in a separate database. How exactly this works can be read here.
Davy: In addition, only a restricted number of people have access to all information and only when strictly necessary. Data Scientists, for example, do have access to the raw data (i.e. the transactions stripped of the identity data). The Data Engineers do not have that type of access because it is not necessary. They build the model with test data, specimen data so to speak (in IT jargon this is called the QA environment). The Data Analyst, on the other hand, only has access to the anonymized enriched data but never to the raw data.
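As a rough illustration of the separation Jessica describes, the sketch below splits an incoming record into an identity part and a stripped transaction part that would live in separate stores. The field names and the random linking token are assumptions made for the example. The interview does not say how Cake links, or whether it links, the two stores.

```python
import uuid

# Hypothetical names of the fields that identify a person.
IDENTITY_FIELDS = {"account_holder", "iban"}


def split_identity(transaction: dict):
    """Split a raw record into (identity_record, stripped_transaction).

    Both parts share a random token (an assumption) so that an authorised
    process could re-link them; the stripped part alone names no person.
    """
    token = uuid.uuid4().hex
    identity = {k: v for k, v in transaction.items() if k in IDENTITY_FIELDS}
    stripped = {k: v for k, v in transaction.items() if k not in IDENTITY_FIELDS}
    identity["token"] = token
    stripped["token"] = token
    return identity, stripped


raw = {
    "account_holder": "Jane Example",
    "iban": "BE00 0000 0000 0000",   # placeholder, not a real account
    "description": "proxihus",
    "amount": 12.5,
}
identity, stripped = split_identity(raw)
```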
🥳 What is your ultimate goal? When are you going to be satisfied?
Davy: 80% enrichment would be fantastic. That last 20% is very difficult because there is a long tail of so-called one-off transactions. Transactions that happen only once with one or a few users will always be difficult.
Jessica: I will be satisfied if we see an effect in the financial behaviour of people before and after the installation of the app. Then we know that we really have made an impact on financial well-being and that we are living up to our mission. I’m convinced that although the app doesn’t show all of its possibilities yet, we’re already doing this by providing insights. We can already see it in the reactions of users coming in. Even at Cake, we have colleagues who have adjusted their purchasing behaviour after being confronted with their Zalando habits. I’m not mentioning any names. 💪
Davy: In any case, the more users we have, the more transactions we can analyse, the more accurate the insights we generate. That’s the core of the Cake ecosystem!