Using Big Data To Predict Consumer Choice

Consumer data is a valuable asset in the current age of data, with “smart things” (sensors) delivering large amounts of information to consumers and businesses. According to IBM, in 2012 more than 2.5 exabytes (2.5 billion gigabytes) of data were generated daily. By 2015 this number had grown and, according to forecasts, will continue to grow to 40,000 exabytes by 2020. Under these circumstances, businesses develop innovative techniques to extract and analyse data “on the fly” in order to create quick value propositions for consumers. The availability of large masses of data catalyses the rise of data-driven business models (DDBM), a domain which looks at how data can be used to develop new business modelling mechanisms and improve existing ones.

Yet the creation of meaningful analytical tools for DDBM is complicated not only by the volume of the data but also by the complexity of human decision processes and the way these processes are reflected in the data. In particular, household consumption data shows that people who shop in the same store may opt for different products and/or brands of products. For example, when making grocery purchases, consumers often alternate between brands of the products they choose. This is one of the reasons why current online systems developed by providers such as Amazon, which suggest products and services to users and are intended to nudge them into purchasing the suggested goods and services, have not gained much popularity.

One of the main disadvantages of the currently available purchasing data is that, even though it allows analysts to observe consumer choices and provides them with useful demographic information about consumers, it is hard to tell whether observed choices are a result of consumers’ true preferences or merely a product of noise in these preferences. Analysis is particularly complicated when consumers opt for products and services from different brands in different environments. Under these circumstances, it is important to pay attention not only to the models which help us analyse the data generated by consumer choices, but also to the types of data used for the analysis.

Not All Data Is The Same

Recent literature on servitization and business models provides some insights into this issue. In particular, researchers tell us that not all data is the same, making a distinction between content data and metadata. In consumer choice, content data provides an account of decisions made by the users with regard to purchasing products or services. However, this data does not contain information about the context in which these decisions were made. Content data includes Big Data and Connected Internet-of-Things (henceforth, IoT) data as we know it. Metadata, by contrast, refers to data which contains specific references to the context and gives us an opportunity to understand how decision architecture (features of the decision environment) affects the choices made by users. While content data is used to create technology-based servitization mechanisms, metadata places the individual consumer at the centre of the provision system, where data is used as a service.

Predicting Brand Loyalty Using Content Data Versus Metadata

My recent paper looks into how content data and metadata can be used to predict brand loyalty and finds that information about context can provide important clues about consumer decision making. It turns out that metadata can generate better (more accurate) predictions of consumption behaviour than content data.

Using decision theory, we can apply the concepts of precise, noisy and imprecise preferences to brand choice and propose a simple mechanism which establishes the link between preference type and brand loyalty.

Preference-Brand Loyalty Mechanism

According to this mechanism, various offerings (products and services) can be divided into three categories: Green items, Yellow items and Red items.

Red items include offerings for which an individual has a strong, precise preference: if these are available, an individual would always prefer them to any other offerings. This means that for these offerings an individual would have high brand loyalty. The figure below shows the shampoo consumption pattern of an actual consumer over 74 days of observation. The vertical axis shows the remaining weight of the shampoo, while the horizontal axis depicts the day of observation from 0 (first day of observation) to 73 (last day of observation). On the horizontal axis the data is arranged by week, where the first week of the study runs from day 0 to day 6 (7 days).

[Figure: shampoo consumption pattern]

Yellow items include offerings for which an individual has a strong preference, but this preference may in some contexts be distorted by noise: the individual would choose these offerings over others every time they are available but, due to fatigue, tremble error or more sophisticated mistakes, may sometimes choose other options over the offerings he or she prefers. This means that the individual would often choose the same brand, but choices of other brands may also be observed, as in the example of toothpaste consumption below:

[Figure: toothpaste consumption pattern]

Finally, Green items are a product of imprecise preferences: an individual will purchase offerings from different suppliers and brand loyalty will be low. For example, the shower gel consumption pattern below is consistent with a Green item.

[Figure: shower gel consumption pattern]
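
As a rough illustration, the three categories can be read off a list of observed brand choices. The sketch below is written in Python; the function name `classify_item` and the 0.5 cut-off between Yellow and Green are assumptions made for this example, not part of the mechanism as stated above.

```python
from collections import Counter


def classify_item(brands, yellow_share=0.5):
    """Classify an offering as Red, Yellow or Green from observed brand choices.

    `yellow_share` is an assumed cut-off: the minimum share of choices that
    must go to the most frequent brand for the item to count as Yellow.
    """
    if not brands:
        raise ValueError("at least one observed choice is required")
    top_count = Counter(brands).most_common(1)[0][1]
    if top_count == len(brands):
        return "Red"      # one brand every time: high brand loyalty
    if top_count / len(brands) >= yellow_share:
        return "Yellow"   # one brand dominates, with occasional deviations
    return "Green"        # no dominant brand: low brand loyalty


# Hypothetical brand labels, used only to exercise the three branches.
print(classify_item(["A", "A", "A", "A"]))       # Red
print(classify_item(["A", "A", "B", "A"]))       # Yellow
print(classify_item(["A", "B", "C", "D", "E"]))  # Green
```

A different cut-off would move the boundary between Yellow and Green, so any real use would need to justify that choice from the data.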

Using a case study, I tested the Preference-Brand Loyalty mechanism and found that metadata predicts brand loyalty better than content data. Using the “Beauty Box” sensor device, the weight of all shower products was recorded after each use in a household with two adults. In addition, study participants were asked to keep a detailed diary recording their purchasing behaviour.

“Beauty Box” Prototype developed by Helen Oliver for the HAT project (Oliver, 2015)

The sensor data provided content data; by combining it with the diary records, we were able to generate metadata about consumption.
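
A minimal sketch of this combination step in Python follows. The record types, field names and sample values are all hypothetical; the actual “Beauty Box” data format is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SensorReading:      # content data: what was used and how much is left
    day: int
    product: str
    brand: str
    weight_g: float


@dataclass
class DiaryEntry:         # context recorded by the participant
    day: int
    product: str
    note: str


@dataclass
class Observation:        # content data plus its context, if any
    reading: SensorReading
    context: Optional[str]


def attach_context(readings, diary):
    """Pair each sensor reading with the diary note for the same day and product."""
    notes = {(entry.day, entry.product): entry.note for entry in diary}
    return [Observation(r, notes.get((r.day, r.product))) for r in readings]


# Made-up sample values, for illustration only.
readings = [
    SensorReading(day=0, product="shampoo", brand="B", weight_g=250.0),
    SensorReading(day=56, product="shampoo", brand="C", weight_g=100.0),
]
diary = [DiaryEntry(day=56, product="shampoo", note="Brand B out of stock")]

for obs in attach_context(readings, diary):
    print(obs.reading.day, obs.reading.brand, obs.context)
```

Readings without a matching diary entry keep a context of `None`, which is exactly the situation in which only content data is available.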

Why Does Metadata Allow Us To Generate Better Predictions?

Consider the following example from my study. Assume that we only observe content data on shower gel, shampoo and toothpaste consumption, which looks like this over 74 days of observation (brand data is recorded using a bar code scanner embedded into the “Beauty Box”):

[Figure: Shower gel]

Shower gel: The figure above shows that the consumer in our example alternated between different brands of shower gel, using 6 brands during the 74 days (roughly 11 weeks) of the study. Using content data, we can conclude that a new bottle of shower gel is purchased every 12-13 days, and that brands of shower gel are alternated without repetition, which allows us to put shower gel in the Green items category according to the Preference-Brand Loyalty mechanism.
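
The replacement interval quoted above follows from simple arithmetic, sketched here in Python. The brand labels G1 to G6 are placeholders, since the shower gel brands are not named.

```python
observation_days = 74
# Placeholder labels for the six shower gel brands observed in the study.
shower_gel_brands = ["G1", "G2", "G3", "G4", "G5", "G6"]

# Average number of days one bottle lasts.
days_per_bottle = observation_days / len(shower_gel_brands)
print(round(days_per_bottle, 1))  # 12.3

# "Alternated without repetition": every purchase is a different brand.
no_repetition = len(set(shower_gel_brands)) == len(shower_gel_brands)
print(no_repetition)  # True
```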

Metadata provides more information about the choices of shower gel brands but, in this case, does not allow us to formulate better predictions about future consumption than content data does. Specifically, analysis of the consumer purchasing diary reveals that all shower gels were bought at different locations/shops. The analysis of both content data and metadata on shower gel consumption in this example allows us to place shower gel in the Green items category. In this case, content data and metadata make the same prediction: we can expect to observe low brand loyalty for shower gel brands (i.e., this consumer will continue alternating brands in the future).

Shampoo: Content data shows that the shampoo consumption pattern is more complex: the consumer chooses Brand B twice and Brand C once during the observation period (the shampoo of Brand B is replaced after 8 weeks and the shampoo of Brand C is replaced after 1.5 weeks). Content data allows us to classify shampoo as a Yellow item.

However, metadata reveals additional information which changes this classification. Specifically, the consumer stated that they always bought Brand B, and that Brand C was purchased only because the pharmacy where the consumer was shopping did not have Brand B in stock. A much smaller bottle of Brand C shampoo was therefore purchased in the hope that it would soon be replaced by Brand B. For this consumer, then, shampoo should not be classified as a Yellow item; rather, it should be classified as a Red item.

In this case, content data and metadata make different predictions: content data would predict that every once in a while the consumer would prefer to buy a shampoo brand other than Brand B, while metadata would predict that the consumer would prefer to always buy shampoo of Brand B (which is a more accurate prediction).

Toothpaste: Using content data, we can draw the following conclusions from the observed patterns: the consumer needs to replace toothpaste every 2 weeks if they use Brand C and every week if they use Brand D; and toothpaste can be classified as a Yellow item because the consumer mostly uses one brand of toothpaste (Brand C) but occasionally deviates from their preferred choice in favour of another brand (Brand D).

Metadata, however, reveals that the picture is more complex than that depicted by the content data. Specifically, the purchasing diary revealed that the consumer always buys toothpaste of Brand C. Brand D was used for 3 weeks during the observation period only because it was prescribed by the dentist: it was purchased on the dentist’s instructions, not because the consumer really preferred it. Taking metadata into account, we should therefore classify toothpaste as a Red item rather than a Yellow item.

In this case, content data and metadata again make different predictions: content data would predict that every once in a while the consumer would prefer to buy a toothpaste brand other than Brand C, while metadata would predict that the consumer would prefer to always buy toothpaste of Brand C (which is a more accurate prediction).
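
The shampoo and toothpaste cases can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in the study: the 0.5 cut-off, the reason labels, the order of the shampoo purchases and the toothpaste tube counts are all hypothetical, chosen only to be consistent with the description above.

```python
from collections import Counter

# Assumed labels for choices that the diary shows were not driven by preference.
CONSTRAINED_REASONS = {"preferred brand out of stock", "prescribed by dentist"}


def classify(brands, yellow_share=0.5):
    """Red: one brand only. Yellow: one dominant brand. Green: otherwise."""
    if not brands:
        raise ValueError("at least one observed choice is required")
    top_count = Counter(brands).most_common(1)[0][1]
    if top_count == len(brands):
        return "Red"
    return "Yellow" if top_count / len(brands) >= yellow_share else "Green"


def classify_content(purchases):
    """Content data only: every observed brand counts."""
    return classify([brand for brand, _ in purchases])


def classify_with_metadata(purchases):
    """Metadata: drop choices the diary explains by context, then classify."""
    free = [b for b, reason in purchases if reason not in CONSTRAINED_REASONS]
    return classify(free)


# Each purchase is (brand, diary reason or None); counts and order are illustrative.
shampoo = [("B", None), ("C", "preferred brand out of stock"), ("B", None)]
toothpaste = (
    [("C", None)] * 2
    + [("D", "prescribed by dentist")] * 3
    + [("C", None)] * 2
)

for name, purchases in [("shampoo", shampoo), ("toothpaste", toothpaste)]:
    print(name, classify_content(purchases), classify_with_metadata(purchases))
# shampoo Yellow Red
# toothpaste Yellow Red
```

The design choice here is to treat context-constrained purchases as uninformative about preference and simply exclude them; other treatments, such as down-weighting them, would also be possible.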

The comparison between content data and metadata is summarized below:

Data type      Shower gel   Shampoo   Toothpaste
Content data   Green        Yellow    Yellow
Metadata       Green        Red       Red

What conclusions can we draw from this example? Overall, metadata allows us to significantly reduce noise in the data even when we have few observations. Content data may be able to produce the same results as metadata; however, to reach the same conclusions, a content dataset has to include more data collected over a longer time period in order to establish robust behavioural patterns. If consumer data is available only over a relatively short time period, metadata is the way to go, as it would produce more accurate predictions about future consumption. It is therefore important to develop markets and technology which would encourage and incentivize consumers to collect context-dependent data within their households (e.g., see the HAT project). If we are able to predict consumption patterns better, we will also be able to develop better (more optimal) business models and create more personalized goods/services in the future.


About the Author:

Ganna Pogrebna

Dr Ganna Pogrebna is a co-investigator in the HARRIET project. Ganna is Associate Professor of Decision Science and Service Systems at WMG. She is a decision theorist/behavioural economist and empirical econometrician with a particular interest in decision making in a digital domain, IoT and behavioural aspects of platform choice. She also works on behavioural aspects of digitisation and business models in application to individual and household choice as well as smart cities, where her areas of expertise include quantitative modelling, data analysis, and new business models. In 2011-2012 Ganna received a Leverhulme Early Career Fellowship in behavioural science. She has published in high-quality peer-refereed economics and business journals. Within HARRIET, Ganna is particularly interested in the influence of digital technology on individual decision making within the household and in the workplace. Her tasks include analysis of sensor data and derivation of HAT/HARRIET Algorithms for new products/services and business models.

One Comment

  • Enable iD says:

    Editor’s Note – Definitions of ‘metadata’ can vary:

    Industry may typically use ‘metadata’ to mean the barcode and properties of a product, i.e. ‘structural metadata’, whereas this post refers to ‘descriptive metadata’: the instances of a product’s application, layered over ‘content data’, which is described as “an account of decisions made by the users with regard to purchasing products or services”.

    Further definitions of ‘structural metadata’ and ‘descriptive metadata’ can be found at
