In a few of my jobs, and more frequently as a consultant, I’ve seen businesses try to build a data science capability. It’s well known that most of these attempts fail.
Having seen both successes and failures up close, I’m confident that there is a clearly definable process for successfully creating a data science capability. I’m going to sketch out that process here.
The “engineering first” approach: an example
The process I advocate depends heavily on engineering and product working together with domain experts long before any data scientists are hired. In fact, the process of preparing for your first data science hire generally begins about six months to a year before your first job advertisement goes out.
If you’re thinking about bringing data science into your business, you’re in the following situation:
- You have data and you have good reason to think that it’s potentially valuable.
- You have use cases in mind: products to build, features to enable, or analyses to be done.
- Someone has hypothesized that data science could be useful in turning that data into value.
Businesses in this situation see large opportunities that are not being exploited at all. There’s some kind of process that is far less efficient than it could be, or there is a core capability that the business is completely lacking.
The “engineering first” approach is based on the observation that when there is a significant gap between a business’s current capabilities and what it could do with data science, there are simpler solutions that don’t require data science, but can make a measurable impact on their own. Furthermore, these simpler solutions often require the same core capabilities required by a data science group. This is where the opportunity lives.
Let’s consider a hypothetical example. Suppose you own a company called “ShirtCo”, which sells shirts online. The shirts are available in a variety of styles and they’re made by many different designers. You want to add a shirt recommendation system to your website, which will recommend shirts for people to purchase based on their previous purchases.
This is a classic data science problem, and there are well-known techniques for building a recommendation system. Thus, the temptation is for ShirtCo to run out and hire a data scientist immediately. But in the “engineering first” approach, there is an opportunity to have a major impact on the business, while simultaneously laying the foundation for doing great data science work later.
Because ShirtCo doesn’t have a recommendation system at all, there is an opportunity to create a pretty good recommender without any data science, and therefore without any data scientists. In every case I’ve seen, whenever someone thinks that data science could be of value, it has also turned out that there are simple deterministic rules or other heuristics that could be implemented. These simple rules aren’t as valuable as a real data science model, but they’re far better than the status quo, which is nothing.
For example, someone at ShirtCo might have good reason to think that people are likely to purchase shirts made by the same designer as the shirts they’ve purchased in the past. So there could be a recommendation system that applies the simple process:
- Find the customer’s most recent shirt purchase.
- Identify the designer of that shirt.
- Find the newest shirt from the same designer.
- Recommend it.
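The four steps above can be sketched in a few lines of Python. This is only an illustration: the `Shirt` and `Purchase` records, their field names, and the `recommend_shirt` function are hypothetical, not part of any real ShirtCo system. The sketch also assumes that shirts the customer already owns should be skipped, which the steps above do not spell out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Shirt:
    shirt_id: str
    designer: str
    released: date


@dataclass(frozen=True)
class Purchase:
    customer_id: str
    shirt_id: str
    purchased: date


def recommend_shirt(
    customer_id: str, purchases: list[Purchase], catalog: list[Shirt]
) -> Optional[Shirt]:
    """Recommend the newest shirt by the designer of the latest purchase."""
    by_id = {shirt.shirt_id: shirt for shirt in catalog}
    history = [
        p for p in purchases
        if p.customer_id == customer_id and p.shirt_id in by_id
    ]
    if not history:
        return None  # no purchase history, so the rule does not apply

    # Step 1: find the customer's most recent shirt purchase.
    latest = max(history, key=lambda p: p.purchased)
    # Step 2: identify the designer of that shirt.
    designer = by_id[latest.shirt_id].designer
    # Step 3: find the newest shirt from the same designer.
    # Assumption: shirts the customer already bought are excluded.
    owned = {p.shirt_id for p in history}
    candidates = [
        s for s in catalog
        if s.designer == designer and s.shirt_id not in owned
    ]
    if not candidates:
        return None
    # Step 4: recommend it.
    return max(candidates, key=lambda s: s.released)
```

Nothing here goes beyond lookups and a sort, which is the point: the rule itself is trivial, and the real work lies in getting the data to it.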
We notice a few things about this recommender process. First, it’s simple and transparent, requiring no data science or other statistical techniques whatsoever. This means that it could be implemented by any competent engineer. Second, even without a rigorous statistical test, it’s highly likely that this would be a vast improvement over the status quo, which is no recommendation system at all. Third, the engineering challenges required to implement this recommendation system are almost identical to the engineering challenges required to implement a “real” data science model. This is the key to the “engineering first” approach.
Let’s list some of the engineering challenges that must be overcome in order to implement this simple recommender:
- Data on shirt sales, customers, and designers must be located.
- The various tables containing that data have to be joined appropriately.
- There needs to be an automated process for keeping that data up to date.
- Someone needs to build a service that takes a customer’s ID, looks up their previous purchases, finds the right designer’s shirts in the current inventory, and returns a recommendation.
- All of this has to have a viable UX design and be built into ShirtCo’s website.
- There should be a monitoring system for tracking how many shirts from the recommendation system are actually purchased.
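To make a few of these challenges concrete, here is a minimal sketch using Python’s built-in `sqlite3` module. It covers only the join, the lookup service, and the purchase tracking; data refresh and UX are out of scope. The schema, table names, and function names are all assumptions made for illustration, and a real system would sit on whatever database ShirtCo already uses.

```python
import sqlite3

# Hypothetical schema; dates are ISO strings so they sort chronologically.
SCHEMA = """
CREATE TABLE shirts (shirt_id TEXT PRIMARY KEY, designer TEXT, released TEXT);
CREATE TABLE purchases (customer_id TEXT, shirt_id TEXT, purchased TEXT);
CREATE TABLE recommendations (customer_id TEXT, shirt_id TEXT);
"""


def connect() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def recommend(conn: sqlite3.Connection, customer_id: str):
    """Take a customer ID and return a recommended shirt ID, or None."""
    # Join purchases to shirts to find the designer of the latest purchase.
    row = conn.execute(
        """
        SELECT s.designer
        FROM purchases AS p
        JOIN shirts AS s ON s.shirt_id = p.shirt_id
        WHERE p.customer_id = ?
        ORDER BY p.purchased DESC
        LIMIT 1
        """,
        (customer_id,),
    ).fetchone()
    if row is None:
        return None
    # Newest shirt by that designer that the customer has not bought yet.
    pick = conn.execute(
        """
        SELECT shirt_id FROM shirts
        WHERE designer = ?
          AND shirt_id NOT IN (
              SELECT shirt_id FROM purchases WHERE customer_id = ?
          )
        ORDER BY released DESC
        LIMIT 1
        """,
        (row[0], customer_id),
    ).fetchone()
    if pick is None:
        return None
    # Log the recommendation so its outcome can be tracked later.
    conn.execute(
        "INSERT INTO recommendations VALUES (?, ?)", (customer_id, pick[0])
    )
    return pick[0]


def conversion_rate(conn: sqlite3.Connection) -> float:
    """Share of logged recommendations that the customer went on to buy."""
    total = conn.execute("SELECT COUNT(*) FROM recommendations").fetchone()[0]
    if total == 0:
        return 0.0
    converted = conn.execute(
        """
        SELECT COUNT(*) FROM recommendations AS r
        WHERE EXISTS (
            SELECT 1 FROM purchases AS p
            WHERE p.customer_id = r.customer_id AND p.shirt_id = r.shirt_id
        )
        """
    ).fetchone()[0]
    return converted / total
```

Swapping the body of `recommend` for a call to a trained model would leave the tables, the logging, and the conversion metric untouched.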
Those engineering challenges are exactly the same as what you would face if you were trying to support a data science team. There is no difference in kind between the core capabilities required for a simple rules-driven recommender and those required for a sophisticated, state-of-the-art recommendation system. To be sure, there could be differences in degree: for example, a data science model might require a lot more features of shirts and customers. But the core capabilities required by both are the same.
Hence, the “engineering first” approach: Solve all the engineering and product challenges first, before hiring your first data scientist. Do this by building a rules-based system that is an improvement over the status quo. Only after you’ve shown that the business can do this should you even consider hiring your first data scientist.
The five steps of the “engineering first” approach
By following this approach, you make it far less likely that your data science effort will fall into any of the common traps. After all, the failure modes for a data science effort stem from two sources: First, there is a failure to define a product that can have a significant positive impact on the business. Second, there is a failure to provide the right data and engineering infrastructure to put the data science model into production. Falling into these traps after you’ve hired one or more data scientists is often fatal.
Therefore, you should prove that the business can avoid these traps before hiring a data scientist. If you can successfully define a high-impact product while simultaneously laying the data and engineering foundation, then you can confidently hire a data scientist with the knowledge that this hire will not be a waste of time and money.
Thus, the “engineering first” approach consists of the following steps:
- Identify an opportunity where data science could make a big impact on the business, and where currently the business is doing nothing.
- Consult with product and domain experts to see if there is any conventional wisdom or domain knowledge that suggests a simple set of rules which could be implemented first. Those rules should be able to “move the needle” for the business problem, even if they have some obvious flaws.
- Identify all of the engineering and design challenges required to implement those rules.
- Build the rules-based system as an internal product, and use this task as an opportunity to lay the foundation for future data science work.
- Measure the impact of the system you’ve built.
As I mentioned above, this process is not simple and it generally takes between six months and a year. But at the end of the process, you can confidently hire a data scientist with the knowledge that you’ll be able to avoid the vast majority of traps that typically ensnare new data science groups.
Additional benefits to the “engineering first” approach
Beyond enabling a data science effort, this approach has additional benefits.
First, you substantially reduce your risk. If the effort fails, it will have failed without the additional time and money spent hiring a data scientist. The risk to the business’s reputation is reduced as well: there will be no data scientist out there saying that your business wasted time and couldn’t get a model deployed.
Second, you create value for the business faster and more cheaply. Your rules-based system is not just a “dry run” for future data science projects; it is a valuable project in its own right. And it can probably be done in half the time you would need in order to hire a data scientist to do similar work.
Third, it makes it much easier to hire a good data scientist. A good data scientist will want to have an impact on the business. But it’s well known among data scientists that most data science work ultimately goes unused. If you can go into an interview and point to all the preliminary work that’s been done, you increase the chances that the right data scientist will accept your job offer.
Fourth, you are training your entire organization in how to work with a data scientist. Everyone will have already practiced creating data sets and pipelines, and building a project into the core infrastructure of the business. For a new data scientist, it will feel like the company has been working with data scientists already.
Lastly, by implementing and measuring the impact of a simpler system, you create a baseline for measuring the impact of the data science group. Too often, people suspect that the data science group is having an impact, but they’re not confident of this. A properly functioning data science group should be able to regularly beat the baseline defined by your rules-based system.
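As a rough sketch of what “beating the baseline” could mean in practice, the snippet below compares the conversion rate of a later model against the rules-based system. The function name and its inputs are hypothetical, and the two-proportion z-test is just one common way to check whether a difference is larger than noise; it is an assumption of this sketch, not something the approach prescribes.

```python
import math


def compare_to_baseline(
    baseline_buys: int, baseline_shown: int, model_buys: int, model_shown: int
) -> tuple[float, float]:
    """Return (relative lift, z-score) of the model over the baseline.

    Each *_shown value is the number of recommendations displayed and each
    *_buys value is how many of those led to a purchase.
    """
    p_base = baseline_buys / baseline_shown
    p_model = model_buys / model_shown
    lift = p_model / p_base - 1.0  # e.g. 0.25 means 25% better than baseline

    # Two-proportion z-test with a pooled conversion rate.
    pooled = (baseline_buys + model_buys) / (baseline_shown + model_shown)
    std_err = math.sqrt(
        pooled * (1.0 - pooled) * (1.0 / baseline_shown + 1.0 / model_shown)
    )
    z_score = (p_model - p_base) / std_err if std_err > 0 else 0.0
    return lift, z_score
```

A z-score above roughly 1.96 corresponds to the conventional 5% two-sided significance level. Without the rules-based system in place first, there would be no baseline counts to feed into a comparison like this.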
This approach to starting a data science group in your business accomplishes the goals you should always be trying to meet: risk mitigation, fast learning, quick value production, and the ability to iterate and improve. It is an excellent method for avoiding all the most common failure modes for a new data science effort, and it has a number of substantive additional benefits.