TV viewership patterns can be a treasure trove of valuable information. One of the pioneering DVR firms in the US wanted to build a service that could comb through viewership patterns to identify the optimum time slot for an advertisement, and asked us to help build the first phase of the project.
Dissecting the data
We were first granted access to a large dump of pseudonymised data. With a single day of data amounting to several gigabytes, the dump was hefty, to say the least. We first had to clean and preprocess all of it, which we did on cloud infrastructure using tools such as AWS Glue and Amazon Athena.
Because we were dealing with TV viewership data, we had to normalise it to account for broadcast duration. Without this step, the results would be skewed toward longer broadcasts. For instance, NFL games and movies typically run for a few hours, so they accumulate more viewing time and give more people the chance to tune in. Left as is, the data would suggest that most viewers prefer longer programmes. In other cases, a household may be tuned in without anyone actively watching the programme. Either effect can easily mislead a decision maker, and more accurate results can be obtained by focusing on actual viewer interest instead.
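To illustrate why normalising by broadcast duration matters, here is a minimal sketch in Python. It is not the code used on the project: the `ViewingEvent` fields, the sample figures, and the choice of "fraction of the broadcast watched" as the normalised measure are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ViewingEvent:
    household_id: str         # pseudonymised household identifier (hypothetical field)
    programme: str
    watched_minutes: float    # minutes this household spent on the programme
    broadcast_minutes: float  # length of the broadcast itself


def completion_fraction(event: ViewingEvent) -> float:
    """Share of the broadcast that was watched, clamped to the range [0, 1]."""
    if event.broadcast_minutes <= 0:
        raise ValueError("broadcast_minutes must be positive")
    return max(0.0, min(event.watched_minutes / event.broadcast_minutes, 1.0))


# Made-up sample data: a long sports broadcast and a short sitcom.
events = [
    ViewingEvent("h1", "NFL game", watched_minutes=90, broadcast_minutes=180),
    ViewingEvent("h2", "Sitcom", watched_minutes=27, broadcast_minutes=30),
]

for e in events:
    print(e.programme, e.watched_minutes, round(completion_fraction(e), 2))
```

On raw minutes the NFL game looks more than three times as popular as the sitcom (90 versus 27). Once each figure is divided by the broadcast length, the sitcom viewer turns out to be the more engaged of the two.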
To account for these quirks, we built an algorithm that adjusts viewership data to reflect programme length and actual engagement (i.e. the length of time during which the viewer was actually watching the programme). The algorithm first takes a data point from a household and extracts the view time for a given programme. It then determines the intervals between that viewing and the times at which the DVR was tuned immediately before and after. If these intervals met a predetermined set of conditions, an adjustment (referred to as ‘capping’) was applied automatically. The capping mechanism also meant that very few conditional expressions were needed, which made the algorithm quite easy to maintain.
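A capping step of this kind might look roughly like the sketch below. The actual capping conditions are not spelled out here, so the two-hour limit, the function name `capped_view_time`, and the rule itself (credit time from tune-in until the next tune event or the end of the programme, whichever comes first, then cap it) are assumptions for illustration only.

```python
from datetime import datetime, timedelta

# Hypothetical cap: the real capping conditions are not stated in this write-up.
MAX_CREDITED = timedelta(hours=2)


def capped_view_time(tuned_in, next_tune, programme_end, cap=MAX_CREDITED):
    """Credit view time from tune-in until the next tune event or the end of
    the programme, whichever comes first, and never credit more than `cap`."""
    if next_tune < tuned_in:
        raise ValueError("next_tune must not precede tuned_in")
    raw = min(next_tune, programme_end) - tuned_in
    # max() guards against a tune-in after the programme has already ended.
    return min(max(raw, timedelta(0)), cap)


# The household tunes in at 20:00 and the DVR is not touched until 07:00 next day.
left_on = capped_view_time(
    tuned_in=datetime(2020, 1, 1, 20, 0),
    next_tune=datetime(2020, 1, 2, 7, 0),
    programme_end=datetime(2020, 1, 1, 23, 30),
)

# The household tunes in at 20:00 and changes channel at 20:45.
brief = capped_view_time(
    tuned_in=datetime(2020, 1, 1, 20, 0),
    next_tune=datetime(2020, 1, 1, 20, 45),
    programme_end=datetime(2020, 1, 1, 21, 0),
)

print(left_on)  # 2:00:00
print(brief)    # 0:45:00
```

Expressing the rule with `min` and `max` rather than a chain of `if` branches is one way to keep the number of conditional expressions small, in the spirit of the maintainability point above.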
Rich, descriptive data
The next step was to generate a predefined set of metrics, such as the Most Popular TV Drama, the Most Popular NBA Games in a Month, the Most Popular TV Programme, and the Average Programme Completion Rate. To do this, we routed the processed data through AWS Glue and ran our queries in Amazon Athena. The results were then visualised in Amazon QuickSight as a set of beautiful, richly detailed, and highly usable insights that any decision maker could draw on at any time.
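The metrics themselves are simple aggregations. On the project they were expressed as Athena queries; the plain-Python sketch below only illustrates the shape of two of them. The row layout, the sample values, and the definition of popularity as the number of distinct households reached are assumptions, not the project's definitions.

```python
from collections import defaultdict

# Hypothetical rows: (household_id, programme, genre, watched_minutes, broadcast_minutes)
rows = [
    ("h1", "Drama A", "drama", 50, 60),
    ("h2", "Drama A", "drama", 30, 60),
    ("h3", "Drama B", "drama", 60, 60),
    ("h1", "Game X", "sports", 90, 180),
]


def most_popular(rows, genre=None):
    """Programme reaching the most distinct households, optionally within one
    genre. Ties are broken alphabetically; returns None if nothing matches."""
    households = defaultdict(set)
    for household, programme, row_genre, _, _ in rows:
        if genre is None or row_genre == genre:
            households[programme].add(household)
    if not households:
        return None
    return max(sorted(households), key=lambda p: len(households[p]))


def average_completion_rate(rows):
    """Mean fraction of each programme that was watched, per programme."""
    fractions = defaultdict(list)
    for _, programme, _, watched, length in rows:
        fractions[programme].append(min(watched / length, 1.0))
    return {p: sum(v) / len(v) for p, v in fractions.items()}


print(most_popular(rows, genre="drama"))  # Drama A
print(average_completion_rate(rows))
```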
Accurate, AI-driven models and predictions
We also took steps to equip our data vault with artificial intelligence so it could provide accurate forecasts. Users were clustered by viewership signature, using tools such as Apache Spark and a number of Python libraries, to help us better understand groups of similar viewers. Machine learning and distributed processing tools such as Apache Spark MLlib and Amazon EMR (Elastic MapReduce) were used to crunch through the entire dataset and produce information such as the best time to telecast a show in order to garner the highest engagement, as well as content recommendations for the viewer. The insights generated by this predictive analytics module could also be valuable to content production houses, as they make it possible to take informed production decisions. After all, why spend money on a show that would likely be a flop when you can plough that money into a potential hit?
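As a small-scale illustration of clustering households by viewership signature, the sketch below runs k-means in plain Python. It is a stand-in for the idea, not the Spark MLlib pipeline used on the project. The signature format (share of viewing time per daypart), the sample households, and the idea of reading a cluster's peak daypart off its centroid are all assumptions.

```python
from math import dist  # available from Python 3.8

DAYPARTS = ["morning", "afternoon", "evening", "night"]  # hypothetical buckets

# Hypothetical signatures: share of each household's viewing time per daypart.
signatures = {
    "h1": (0.7, 0.2, 0.1, 0.0),
    "h2": (0.1, 0.1, 0.7, 0.1),
    "h3": (0.6, 0.3, 0.1, 0.0),
    "h4": (0.0, 0.1, 0.8, 0.1),
}


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a deterministic start (the first k points)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


points = list(signatures.values())
labels, centroids = kmeans(points, k=2)

# Read off the daypart with the highest average share in each cluster.
peak_dayparts = [DAYPARTS[max(range(len(c)), key=c.__getitem__)] for c in centroids]

print(dict(zip(signatures, labels)))  # {'h1': 0, 'h2': 1, 'h3': 0, 'h4': 1}
print(peak_dayparts)                  # ['morning', 'evening']
```

A real deployment would use a proper initialisation scheme and far richer signatures; the point here is only that households with similar viewing habits end up in the same group, and each group can then be examined for when it is most engaged.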
This project was challenging in its own right and came with tight delivery timelines, and we owe a lot of kudos to our data science team for coming through.