The Data Scientist’s Guide to Choosing Data Vendors
I’ve served as the VP of Data Science, AI, and Research for the past five years at two publicly traded companies. In both roles, AI was central to the company’s core product, and we partnered with data vendors who enriched our data with relevant features that improved our models’ performance. Having had my fair share of missteps with data vendors, I wrote this post to help you save time and money when testing out new vendors.
Warning: Don’t start this process until you have very clear business metrics for your model and you’ve already put a decent amount of time into optimizing it. Working with most data vendors for the first time is usually a long process (weeks at best, often months) and can be very expensive (some data vendors I’ve worked with cost tens of thousands of dollars a year; others have run into the millions of dollars annually when operating at scale).
Since this is typically a big investment, don’t even start the process unless you can clearly formulate how the go/no-go decision will be made. This is the #1 mistake I’ve seen, so please reread that sentence. For me, this has always required translating all the decision inputs into dollars.
For example, your model’s performance metric might be the PRAUC of a classification model predicting fraud. Let’s assume your PRAUC increases from 0.9 to 0.92 with the new data added, which might be a tremendous improvement from a data science perspective. However, the enrichment costs 25 cents per call. To figure out if this is worth it, you’ll need to translate the incremental PRAUC into margin dollars. This stage may take time and requires a good understanding of the business model: how exactly does a higher PRAUC translate into higher revenue or margin for your company? For most data scientists, this isn’t straightforward.
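To make that translation concrete, here is a minimal back-of-the-envelope sketch in Python. Every number in it is made up, and it assumes the PRAUC lift shows up as extra recall at your chosen operating threshold while the false-positive rate stays roughly flat:

```python
# Back-of-the-envelope ROI check (all numbers are hypothetical).
# Assumes the PRAUC lift materializes as extra recall at your operating
# threshold, with the false-positive rate staying roughly flat.

monthly_orders = 1_000_000     # orders scored per month
fraud_rate = 0.005             # 0.5% of orders are fraudulent
avg_fraud_loss = 120.0         # average loss per missed fraudulent order, in dollars

baseline_recall = 0.80         # recall at the threshold with your current model
enriched_recall = 0.84         # recall at the same threshold with vendor data added

cost_per_call = 0.25           # vendor price per enrichment call, in dollars

extra_fraud_caught = monthly_orders * fraud_rate * (enriched_recall - baseline_recall)
monthly_savings = extra_fraud_caught * avg_fraud_loss
monthly_cost = monthly_orders * cost_per_call   # enriching every single order

print(f"savings: ${monthly_savings:,.0f}/mo")
print(f"cost:    ${monthly_cost:,.0f}/mo")
print(f"ROI:     {monthly_savings / monthly_cost:.2f}x")
# With these made-up numbers, enriching every order is deeply ROI-negative,
# which is exactly why the partial and phased enrichment options later in
# this post matter.
```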
This post won’t cover all aspects of selecting a data vendor (e.g., we won’t discuss negotiating contracts) but will cover the main aspects expected of you as the data science lead.
If it looks like you’re the decision maker and your company operates at scale, you’ll most likely get cold emails from vendors periodically. While a random vendor might have some value, it’s usually best to talk to industry experts and understand what data vendors are commonly used in that industry. There are tremendous network effects and economies of scale when working with data, so the largest, best-known vendors can typically bring more value. Don’t trust vendors who offer solutions to every problem/industry, and remember that the most valuable data is typically the most painstaking to create, not something easily scraped online.
A few points to cover when starting the initial conversations:
- Who are their customers? How many large customers do they have in your industry?
- Cost (at least order of magnitude), as this might be an early deal breaker
- Time travel capability: Do they have the technical capability to ‘travel back in time’ and tell you what their data looked like at a given historical snapshot? This is critical when running a historical proof of concept (more on that below).
- Technical constraints: Latency (pro-tip: always look at p99 or other higher percentiles, not averages), uptime SLA, etc.
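On the latency point above, a tiny simulation shows why the average can hide what your slowest calls (and your timeouts) actually experience. The numbers below are simulated, not from any real vendor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated vendor response times in milliseconds: mostly fast, with a heavy
# tail (retries, cold caches, the vendor's own downstream calls, etc.).
latencies = np.concatenate([
    rng.normal(80, 10, size=9_800),     # typical calls
    rng.normal(900, 150, size=200),     # the slowest ~2% of calls
])

print(f"mean: {latencies.mean():.0f} ms")
print(f"p95:  {np.percentile(latencies, 95):.0f} ms")
print(f"p99:  {np.percentile(latencies, 99):.0f} ms")
# With these simulated numbers the mean (~96 ms) looks harmless, while the
# p99 (~900 ms) is what timeouts and your worst-affected users actually see.
```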
Assuming the vendor has checked the boxes on the main points above, you’re ready to plan a proof of concept test. You should have a benchmark model with a clear evaluation metric that can be translated to business metrics. Your model should have a training set and an out-of-time test set (perhaps one or more validation sets as well). Typically, you’ll send the relevant features of the training and test sets, with their timestamps, for the vendor to merge their data as it existed historically (time travel). You can then retrain your model with their features and evaluate the difference on the out-of-time test set.
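Here is a rough sketch of that flow. The column names, the `vendor` table, and the models are all illustrative, and `train`/`test` are assumed to be your existing out-of-time split, already loaded as DataFrames:

```python
import pandas as pd
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import average_precision_score

# `train` and `test` are your existing out-of-time split: each has a `timestamp`,
# an `entity_id`, your own features, and the target `is_fraud`.
# `vendor` holds the vendor's point-in-time ("time travel") features, keyed by
# `entity_id` and the timestamp at which each snapshot became available.

def merge_point_in_time(df, vendor):
    # For every observation, take the latest vendor snapshot that existed
    # strictly before the observation's timestamp -- never future information.
    return pd.merge_asof(
        df.sort_values("timestamp"),
        vendor.sort_values("timestamp"),
        on="timestamp",
        by="entity_id",
        direction="backward",
        allow_exact_matches=False,
    )

def prauc(model, df, cols):
    # Average precision is a common stand-in for the area under the PR curve.
    return average_precision_score(df["is_fraud"], model.predict_proba(df[cols])[:, 1])

base_cols = ["amount", "account_age_days"]               # your existing features
vendor_cols = ["vendor_risk_score", "vendor_velocity"]   # the vendor's features

train_enriched = merge_point_in_time(train, vendor)
test_enriched = merge_point_in_time(test, vendor)

baseline = HistGradientBoostingClassifier().fit(train[base_cols], train["is_fraud"])
enriched = HistGradientBoostingClassifier().fit(
    train_enriched[base_cols + vendor_cols], train_enriched["is_fraud"]
)

print("baseline PRAUC:", prauc(baseline, test, base_cols))
print("enriched PRAUC:", prauc(enriched, test_enriched, base_cols + vendor_cols))
```

The backward, non-exact merge is what keeps future information out of the features, which is the whole point of the time-travel requirement.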
Ideally, you won’t be sharing your target variable with the vendor. At times, vendors may request to receive your target variable to ‘calibrate/tweak’ their model, train a bespoke model, perform feature selection, or any other type of manipulation to better fit their features to your needs. If you do go ahead and share the target variable, be sure that it’s only for the train set, never the test set.
If you got the willies reading the paragraph above, kudos to you. When working with vendors, they’ll always be eager to demonstrate the value of their data, and this is especially true for smaller vendors (where every deal can make a huge difference for them).
One of my worst experiences working with a vendor was a few years back. A new data vendor had just closed a Series A, generated a bunch of hype, and promised extremely relevant data for one of our models. It was a new product where we lacked relevant data, and we believed this could be a good way to kickstart things. We went ahead and started a POC, during which their features improved our AUC from 0.65 to 0.85 on our training set. On the test set, their model tanked completely; they had wildly overfit the training set. After discussing this with them, they requested the test set target variable to analyze the situation. They put their senior data scientist on the job and asked for a second iteration. We waited a few more weeks for new data to be gathered (to serve as a new, unseen test set). Once again, they improved the AUC dramatically on the new training set, only to bomb once more on the test set. Needless to say, we did not move forward.
- Set a higher ROI threshold: Start by calculating the ROI: estimate the incremental net margin generated by the model relative to the cost. Most projects will want a comfortably positive return. Since there is plenty of room for issues that erode your return (data drift, gradual deployment, limitations on using the data with some of your segments, etc.), set a higher threshold than you typically would. At times, I’ve required a 5X financial return on the enrichment costs as a minimum bar to move forward with a vendor, as a buffer against data drift, potential overfitting, and uncertainty in our ROI point estimate.
- Partial enrichment: Perhaps the ROI across the entire model isn’t sufficient, but some segments demonstrate a much higher lift than others. It might be best to split your model in two and enrich only those segments. For example, perhaps you’re running a classification model to identify fraudulent payments, and the new data gives a strong ROI in Europe but not elsewhere.
- Phased enrichment: If you’ve got a classification model, you can consider splitting your decision into two phases:
- Phase 1: Run the existing model.
- Enrich only the observations near your decision threshold (or above it, depending on the use case); observations further from the threshold are decided in Phase 1.
- Phase 2: Run the second model on the enriched observations to refine the decision.
This approach can be very useful in reducing costs by enriching a small subset while gaining most of the lift, especially when working with imbalanced data. It won’t be as useful if the enriched model overturns decisions that looked clear-cut in Phase 1: for example, if apparently very safe orders are later identified as fraud thanks to the enriched data, you’ll have to enrich most (if not all) of the data to capture that lift. Phasing your enrichment will also potentially double your latency, since you’ll be running two similar models sequentially, so carefully consider how you optimize the tradeoff across latency, cost, and performance lift. A minimal sketch of the two-phase flow is below.
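The decision threshold, the band width, and the `fetch_vendor_features` callable are placeholders for whatever your production setup uses; this is a sketch of the idea, not a drop-in implementation:

```python
import numpy as np

DECISION_THRESHOLD = 0.5   # where you draw the fraud / not-fraud line
BAND = 0.15                # how far around the threshold counts as "uncertain"

def phased_scores(X, base_model, enriched_model, fetch_vendor_features):
    # Phase 1: score everything with the existing (cheap) model.
    phase1 = base_model.predict_proba(X)[:, 1]

    # Only observations close to the threshold are worth paying to enrich;
    # everything else keeps its Phase 1 score and decision.
    uncertain = np.abs(phase1 - DECISION_THRESHOLD) < BAND

    final = phase1.copy()
    if uncertain.any():
        # Phase 2: call the vendor only for the uncertain slice, then re-score
        # with the model trained on the enriched feature set.
        X_enriched = fetch_vendor_features(X[uncertain])
        final[uncertain] = enriched_model.predict_proba(X_enriched)[:, 1]

    # uncertain.mean() is the fraction of traffic you actually paid to enrich.
    return final, uncertain.mean()
```

With imbalanced data and a reasonably calibrated Phase 1 model, the uncertain band is usually a small fraction of traffic, which is where most of the cost savings come from.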
Working effectively with data vendors can be a long and tedious process, but the performance lift to your models can be significant. Hopefully, this guide will help you save time and money. Happy modeling!