When It Comes To Big Data, Is Bigger Always Better?
When it comes to size, we hear a lot about “Super Size,” “Upsize,” “Go big or go home.” So when your company hops on the Big Data bandwagon, is bigger better? Is more data more helpful?
Not necessarily. The fact is that more data does not automatically equal more information. The quality of data is just as important, if not more so.
Predictive models provide the best possible mechanism for gleaning insights from and creating action out of Big Data. In layman’s terms, predictive models use past known outcomes to predict future unknown outcomes. But as data scientist Mio Alter has pointed out, predictive models aren’t magic wands that you can just “point” at massive quantities of data. Jeffrey Heer, professor of computer science at the University of Washington and co-founder of Trifacta, recently told The New York Times, “It’s an absolute myth that you can send an algorithm over raw data and have insights pop up.”
What does that mean for data scientists using predictive models to make the best forecasts? In short, it means that data scientists must be careful about what they put into their models and focus on the quality of data rather than sheer quantity.
Four Rules For Predictive Modeling
When determining which data will drive the best insights via predictive modeling, there are four hard-and-fast rules that prove that bigger is not always better.
1. Feed the algorithm meaningful information. Machine-learning algorithms cannot distinguish between meaningful and non-meaningful information. For example, if we were to predict which companies would be highly likely to make a purchase, the size of the company might be a meaningful feature in the model, but the company’s name would not. To a computer, all data and information is just a sequence of numbers. It’s up to data scientists to ensure they only feed their algorithms meaningful, useful information.
2. The value is in intent activity data, not static attribute data. Big data comes in various forms. Understanding the differences between those types of data is critical. In today’s digital environment, the static attributes of an individual and/or company — employee title and company size, for example — are readily available information. Most anyone can crawl the web for static attribute data. On the other hand, while intent data (also known as activity data) is more challenging to incorporate into the models, it leads to more accurate and relevant predictions.
Let’s see the differences between static and activity data in action. Say you are trying to predict who will buy routers today. While static data can show you the ideal buyer profile (IT department decision-maker), it reveals nothing about an individual’s recent behavior, which could influence the prediction. Leah Smith may be an IT decision-maker, but perhaps she just bought a router last month and has now been researching switches.
This would indicate that Leah is no longer in the market to buy routers, but may be a good prospect for switches. That is the value of intent data. Time-sensitive intent activity data (buyers’ digital footprints) provides insight into whether a prospect is ready to buy now. Understanding what type of data is used in your predictive models is therefore a key indicator of the accuracy and relevancy of the model’s predictions. While models containing only static information aren’t necessarily wrong, they aren’t nearly as useful as models containing recent intent data.
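To make the Leah Smith example concrete, here is a minimal Python sketch of how time-windowed intent features might be derived from an activity log. The event records, the dates, the second prospect, the 30-day window, and the function name intent_features are all assumptions made up for illustration; a real pipeline would look different.

```python
from datetime import date, timedelta

# Hypothetical activity log: (prospect, action, product, event date).
EVENTS = [
    ("Leah Smith", "purchase", "router", date(2014, 5, 20)),
    ("Leah Smith", "research", "switch", date(2014, 6, 2)),
    ("Leah Smith", "research", "switch", date(2014, 6, 5)),
    ("Sam Jones", "research", "router", date(2014, 6, 4)),
]


def intent_features(prospect, product, as_of, window_days=30):
    """Summarize one prospect's recent activity for one product.

    Only events inside the window ending at `as_of` are counted, so
    older activity does not influence the features.
    """
    start = as_of - timedelta(days=window_days)
    recent = [
        event for event in EVENTS
        if event[0] == prospect
        and event[2] == product
        and start <= event[3] <= as_of
    ]
    return {
        "recent_research": sum(1 for e in recent if e[1] == "research"),
        "recent_purchase": any(e[1] == "purchase" for e in recent),
    }


today = date(2014, 6, 10)
leah_router = intent_features("Leah Smith", "router", today)
leah_switch = intent_features("Leah Smith", "switch", today)
```

Under these assumed records, Leah's router features show a recent purchase and no research, while her switch features show two research events and no purchase. A static attribute such as job title could not separate those two cases.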
3. Focus on past data. Data scientists should be careful about blending data that contains “future” information into their models. The core concept of building predictive models requires analyzing past events to predict the future. Say you’re trying to predict which customers are more likely to buy servers using the data of customers who have purchased servers from you in 2013. You need to train the model to recognize the patterns that led to the sale of the server.
Some variables — for example, 2013 total sales — would actually contain future information at the time the server was sold. In other words, since predictive models must be built and trained on past data up to but NOT including the sale, the variable 2013 total sales would trip up the modeling algorithm precisely because it includes the sale of the server.
Just as it is important to include both time-dependent and static attributes to generate the most relevant predictions, it is also critical to understand when static attributes are really time-dependent attributes in disguise and when it’s appropriate to use them.
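The 2013 total sales problem can be illustrated with a short Python sketch. The order history, the amounts, and the helper name total_sales_before are hypothetical; the point is only that a feature should be computed from records dated before the event being predicted.

```python
from datetime import date

# Hypothetical 2013 order history for one customer:
# (order date, product, amount in dollars).
ORDERS = [
    (date(2013, 2, 14), "laptop", 1200.0),
    (date(2013, 6, 3), "server", 8000.0),   # the sale we want to predict
    (date(2013, 9, 21), "storage", 2500.0),
]


def total_sales_before(orders, cutoff):
    """Sum only the orders placed strictly before the cutoff date."""
    return sum(amount for day, _, amount in orders if day < cutoff)


server_sale_date = date(2013, 6, 3)

# Leaky: full-year total includes the server sale and a later order.
leaky_feature = sum(amount for _, _, amount in ORDERS)

# Safe: only what was known before the server was sold.
safe_feature = total_sales_before(ORDERS, server_sale_date)
```

With this assumed history, the full-year total is 11700.0 while the pre-sale total is 1200.0. A model trained on the first number would be learning from the very outcome it is supposed to predict.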
4. Bring in net-new data. Finally, many variables in fact represent the same information. Adding data that doesn’t provide any net-new information can lead to a noisy, unstable model as it tries to pick which of the overlapping variables matters more. For example, employee count and annual revenue are both proxies for company size. If the model indicates, “large companies are more likely to purchase high-end servers,” the model may choose employee count to represent the company size in one iteration, while it may choose annual revenue in another. However, neither model can distinguish between two similarly sized companies in different industries.
If the truth were really “large tech companies are more likely to purchase high-end servers,” then the model would give the same prediction to a large agriculture company as it would to a large computer software company, unless you add a feature to represent industry. The takeaway? Overlapping static data that doesn’t add new value is a barrier to effective predictive intelligence.
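One simple way to spot overlapping variables like employee count and annual revenue is to check how strongly they correlate. The Python sketch below does this with a hand-written Pearson correlation; the five companies, their figures, the feature names, and the 0.95 threshold are assumptions chosen for illustration, and correlation is only one of several possible redundancy checks.

```python
from math import sqrt

# Hypothetical feature columns for five companies.
FEATURES = {
    "employee_count": [50, 200, 1000, 5000, 20000],
    "annual_revenue_musd": [5, 22, 110, 480, 2100],
    "is_tech": [1, 0, 1, 0, 1],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def redundant_pairs(features, threshold=0.95):
    """List feature pairs whose absolute correlation meets the threshold."""
    names = sorted(features)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(features[a], features[b])) >= threshold
    ]


flagged = redundant_pairs(FEATURES)
```

On this assumed data, only the employee count and annual revenue pair is flagged, which suggests keeping one of the two as the company-size signal. The industry indicator is not flagged against either, so it carries information the size proxies do not.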
Therefore, if you are looking to launch a Big Data predictive intelligence initiative, think beyond the numbers. Don’t be fooled by the hype that bigger (or more) is the primary indication of success. When analytics companies claim to be including thousands of signals or features in their predictive models, be skeptical. Are those really thousands of distinct signals, or is there only a fraction that are truly distinct with several different ways of representing the same information? More predictors do not necessarily equal more information.
As Big Data Made Simple has advised: “don’t focus on data collection…focus on existing data.” Most companies already have enough data to begin taking advantage of predictive intelligence. If you are inclined to pursue additional data, look for sources that will complement the data you already have and provide net-new information.
Only then will your models have the best chance of turning that knowledge into action.