With the emergence of big data technologies and the availability of new data sources, one of the main questions in analytics is “how granular should the analyzed data be?” In retail organizations, these technologies now enable us to analyze a single cashier transaction, a product’s location on a shelf, or even the eyeball movements of a potential customer. With these examples in mind, the question becomes whether we really need such a granular level of analysis.
On the one hand, “the more granular the better” is a widely accepted principle in analytics. On the other, it’s not always clear what we gain or lose by working at such a granular level. Moreover, the analytical insights might change dramatically when the same dataset is analyzed at different levels of granularity. For an example, take a look at our previous blog post, Market Basket Analysis and Tricky Analytics.
In short, the notion that granular transactional data is always better than aggregated data is simply not accurate. As with many real-life tradeoffs, granularity comes with a price tag. One evident cost is the effort of collecting, maintaining, and continuously analyzing huge granular datasets. It might take months to gather, enrich, and cleanse the required dataset, sometimes to the point where the work is simply not cost-effective and the project’s ROI is at risk. Professionals tend to underestimate the effort needed in the data collection and organization phase. “After all,” they often say, “the data exists somewhere in the organization and it’s only a question of integrating a few databases.” In reality, though, it usually takes far longer than expected.
The second cost relates to the well-known tradeoff between precision and accuracy: as predictions become more precise (more granular), they often become less accurate, and vice versa. For example, it might be far more accurate to predict the aggregated monthly demand for a specific product over the entire retail chain than to predict the granular demand for that product in a specific store on a specific day.
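To make this tradeoff concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: demand is simulated as Poisson draws for a made-up chain of 50 stores, and the “forecast” is just the historical average. The point is only to compare the same naive forecast at two levels of granularity.

```python
import numpy as np

rng = np.random.default_rng(42)

STORES, DAYS = 50, 365        # hypothetical chain: 50 stores, one year of daily data
MEAN_DAILY_DEMAND = 5         # assumed average units sold per store per day

# Simulated demand: each store-day is an independent Poisson draw around the same mean
demand = rng.poisson(MEAN_DAILY_DEMAND, size=(STORES, DAYS))

# Naive forecast: use the overall historical average everywhere
forecast = demand.mean()

# Error at the granular (store-day) level
mape_granular = np.mean(np.abs(demand - forecast) / np.clip(demand, 1, None))

# The same data and the same forecast, aggregated to chain-month totals
monthly_actual = demand[:, :360].reshape(STORES, 12, 30).sum(axis=(0, 2))
monthly_forecast = forecast * STORES * 30
mape_aggregated = np.mean(np.abs(monthly_actual - monthly_forecast) / monthly_actual)

print(f"MAPE at store-day level  : {mape_granular:.1%}")
print(f"MAPE at chain-month level: {mape_aggregated:.1%}")
```

With these assumed numbers, the store-day error typically comes out on the order of 50%, while the chain-month error is roughly 1%, even though both are computed from exactly the same data and the same forecast.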
Granular retail data is affected by a huge amount of noise from various sources, such as miscalculated inventory levels, sporadic abnormal transactions, irregular customer orders, and dataset errors. The relative noise typically grows as the data becomes more granular. A large inventory mismatch is far more probable at the store level than at the chain level; abnormal transactions do not happen on a regular basis; and dataset errors are diluted, and often corrected, at higher levels of aggregation.
Statistically speaking, the Central Limit Theorem illustrates why aggregation helps: when many noisy observations are summed, random errors and extreme values tend to cancel each other out, and the aggregated totals follow a smoother, more predictable distribution.
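A small simulation makes the same point. The daily series below is synthetic: steady sales plus a handful of hypothetical “abnormal” spikes standing in for the noise sources above. Comparing the relative variability (coefficient of variation) of the daily values with that of their monthly totals shows how much of the noise cancels out.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical daily sales for one store: steady demand plus a few abnormal spikes
days = 360
daily = rng.poisson(20, days).astype(float)
spike_days = rng.choice(days, size=5, replace=False)
daily[spike_days] += rng.integers(100, 200, size=5)   # sporadic abnormal transactions

# Relative variability at the daily level...
cv_daily = daily.std() / daily.mean()

# ...versus the same data aggregated to monthly totals
monthly = daily.reshape(12, 30).sum(axis=1)
cv_monthly = monthly.std() / monthly.mean()

print(f"Coefficient of variation, daily  : {cv_daily:.2f}")
print(f"Coefficient of variation, monthly: {cv_monthly:.2f}")
```

The monthly series is far smoother: the spikes are diluted in the totals and the remaining day-to-day noise largely averages out.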
To summarize, granular data in its raw, disaggregated form can be somewhat misleading, so proper levels of aggregation should be defined. The right level of granularity is usually dictated by the question being asked. If the analytical task operates at the transaction level, such as cross-selling or up-selling to an online customer, then the task itself requires granular transactional data. But if the task relates to store operations, such as fixing local operational failures, localizing assortments, or locating unfulfilled demand for products, then aggregated data will probably fit it better.
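For instance, the same raw transaction log can serve both kinds of tasks. Here is a hedged pandas sketch, with made-up column names and a purely illustrative weekly store/product roll-up:

```python
import pandas as pd

# Hypothetical raw transaction log (column names are assumptions for illustration)
transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-02 09:15", "2024-01-02 09:17",
                                 "2024-01-09 14:03", "2024-01-10 11:40"]),
    "store":    ["S001", "S001", "S002", "S002"],
    "product":  ["milk", "bread", "milk", "milk"],
    "quantity": [2, 1, 3, 1],
})

# Transaction-level tasks (cross-sell / up-sell) work directly on the raw rows.

# Store-level operational tasks are usually better served by an aggregate,
# e.g. weekly units sold per store and product:
weekly = (transactions
          .groupby(["store", "product", pd.Grouper(key="timestamp", freq="W")])["quantity"]
          .sum()
          .reset_index())

print(weekly)
```

Neither table is “right” on its own; the aggregation level simply follows the question being asked.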
Sometimes one simply needs to take a few steps back and give up some granularity in order to see the big picture.