Video Recommendation System with TigerGraph
Dashboarding with Plotly Express, Plotly Dash, & TigerGraph
This project has been co-created by Atishaye Jain, Bhavya Tyagi, and Janhavi Lande as a part of the GitHub Externship Program at TigerGraph for the winter cohort of 2022.
Introduction
This application recommends your favorite videos, powered by a graph database, based on the keywords you provide, and it can pinpoint exactly where a keyword was spoken in each video! The dataset spans a variety of themes such as finance, cryptocurrency, animals, astronomy, and literature for you to choose from.
We are using TigerGraph Cloud as our graph database. TigerGraph provides a graph AI platform built on the industry's first and only distributed native graph database, which comes with a SQL-like query language called GSQL and tightly integrated tooling and enterprise connectors, so data scientists and developers can design and deploy analytical solutions in weeks.
Framework
To generate transcripts, we can use Vosk, Google Speech Recognition, or Facebook's speech-recognition model.
Concepts Used
There are a few Machine Learning and Artificial Intelligence concepts being used in different aspects of this project:
- Cleaning & Preprocessing: This helps us clear any kind of noise present in the transcripts extracted from videos.
- Sentiment Analysis of Transcripts: We evaluated sentiment prediction with different models, including Flair and VADER, and chose VADER as it gave us the best accuracy.
- Topic Modelling of Transcripts: We performed topic modeling as a way to categorize video transcripts into topics and themes.
- Keyword Extraction: Using TF-IDF, we have obtained keywords from each video to aid in visualizing the more weighted keywords on our graph.
- Entity Extraction: We extracted named entities from the transcripts to help surface better insights.
- Timestamping of Transcripts: We evaluated several models for timestamping text. Our application needs word-by-word timestamps rather than whole-sentence ones, so we chose the Vosk model.
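To illustrate the TF-IDF keyword extraction mentioned above, here is a self-contained toy sketch (function and variable names are made up; the project's actual implementation lives in its Colab notebooks):

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Rank each document's words by TF-IDF and return the top_k per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: how many transcripts contain each word.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    keywords = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # TF-IDF: frequent in this transcript, rare across all transcripts.
        scores = {w: (tf[w] / len(tokens)) * math.log(n / df[w]) for w in tf}
        ranked = sorted(scores, key=scores.get, reverse=True)
        keywords.append(ranked[:top_k])
    return keywords
```

Words that appear in every transcript (such as stop words) get an IDF of zero, so only the more distinctive, "weighted" keywords survive into the graph.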
The Google Colab notebooks for the parts above are present in the GitHub repository as well as in this drive link.
Getting started with TigerGraph Cloud
First, you'll need to set up your TigerGraph solution. To do so, go to https://tgcloud.io/ and either log in or create a free account if you do not already have one.
- Once logged in, click on the "My Solutions" tab.
- Then, click on "Create Solution" in the upper-right corner.
- Create a blank solution.
- For the instance settings, keep everything at the free defaults.
- The only required field in the solution settings is a password, which you must remember.
In case of any queries, you can refer to the video tutorial below.
Perfect! Wait until the status of your solution is "Ready," and then we'll prepare our datasets.
Importing our solution & schema
From the GraphStudio landing page, click on "Import an Existing Solution" and select the downloaded file vid-rec.tar.gz.
Installing the queries
To the left of "GSQL queries," press the rectangle with the up arrow (its hover text reads "Install all queries").
You're done, congratulations! Your graph is now fully set up!
Queries
Now that we have loaded the data onto our graph, it's time to retrieve the essential information. We will write several queries in GSQL (TigerGraph's graph query language) to do so. For more information, check out this great resource!
Here, we will dive into the Recommendation Query:
This query returns a max heap containing the top three recommended videos' IDs and their respective Jaccard similarity scores. Below is the complete query:
First, we create a TYPEDEF TUPLE consisting of Video_Audio Id along with the Jaccard score. We then create a HeapAccum with a limit of 3 values which orders the values by their respective Jaccard score in descending order. Next, we store the preprocessed transcript of the current video in the SetAccum input. Then, we traverse over all the preprocessed transcripts in our database and consecutively store them in SetAccum vid_with_transcripts. Finally, we accumulate the Jaccard scores of the current transcript with all the other transcripts in the HeapAccum and at last, the HeapAccum is printed.
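The same top-3 logic can be sketched in pure Python with a heap (hypothetical names; the actual query is GSQL, where the HeapAccum does the bookkeeping for us):

```python
import heapq

def jaccard(a, b):
    """Jaccard similarity of two keyword sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def top_three(current_id, transcripts):
    """Mimic the HeapAccum: keep only the three highest-scoring videos."""
    current = transcripts[current_id]
    heap = []  # min-heap of (score, video_id); smallest entry is evicted first
    for vid, words in transcripts.items():
        if vid == current_id:
            continue
        heapq.heappush(heap, (jaccard(current, words), vid))
        if len(heap) > 3:
            heapq.heappop(heap)
    return sorted(heap, reverse=True)  # highest Jaccard score first
```

Bounding the heap at three entries is what keeps the query cheap even as the number of transcripts grows.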
Voila! Now that we have the queries we can head over to the Dashboard.
Dazzling Dashboard | UI Overview
The UI consists of four pages that incorporate a video recommendation display and various visualizations. This project covers the following aspects:
- General Overview: This lets the user select or type in the keywords they're interested in. Users can then select from a dropdown of videos that contain that particular keyword. A third, optional dropdown pinpoints the different locations of the same keyword within a video. On hitting the "Go at Timestamp" button, the selected video pops up (the JavaScript handling this is covered a little later) and, on playing, starts from the chosen timestamp or from 0 seconds.
- The user can then get the same video's visuals, or recommendations based on that particular video, by clicking the respective buttons. For visuals, the user is taken to the second page, where the dropdown comes pre-selected with the video whose card button was clicked. We discuss this functionality in detail under Multipage Integration & Data Transfer.
- For more recommendations based on the selected video, the Jaccard similarity query kicks in, matches against the keywords and topic keywords we have already extracted, and displays videos along with their respective Jaccard similarity scores. This, too, is discussed in detail later in this blog.
- Get Visuals: This page can be reached by clicking its nav link in the sidebar or through the multipage integration functionality we just mentioned. We have kept this one special yet informative, with just two visualizations: colorful Named Entity Recognition and a classic Cytoscape graph (discussed later).
- Topic Analytics: We perform topic modeling utilizing the Latent Dirichlet Allocation algorithm to obtain topics.
- About: This page gives a general description of what we are doing and how things work, along with guidelines on how you can contribute to this project, and finally our contact details in case you need help.
JavaScript Integration & Callbacks
Collapsible divisions would have been quick to build with Dash components alone, but with those we were missing the smooth scrolling and the scroll-back-to-top on page changes. Moreover, using JavaScript puts more controls in our hands.
If you're a little familiar with JS, then writing code like this would be a cakewalk:
But the major challenge was getting this code working on board, as Dash doesn't allow the DOM (Document Object Model) to be manipulated directly. So we wrapped it in a self-invoking JS function returned from a callback, which fires when the page with id="main-page" loads. The following code:
Multipage Integration & Data Transfer
The smooth transition from the "General Overview" page to the "Get Visualizations" page, with a dropdown pre-selected with the video the user picked, is quite interesting, isn't it?
The code above handles which content is visible when the path in the URL changes. Refer to the documentation for more info.
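Stripped of the Dash components, the routing callback's core logic is just a path-to-layout lookup (the page names here are hypothetical stand-ins for the real layouts):

```python
# Hypothetical page registry; in the dashboard these values are Dash layouts.
PAGES = {
    "/": "general-overview",
    "/visuals": "get-visuals",
    "/topics": "topic-analytics",
    "/about": "about",
}

def display_page(pathname):
    """Body of the routing callback: pick the layout for the current URL path."""
    return PAGES.get(pathname, "404-not-found")
```

In the real app, this function is registered as a callback with dcc.Location's pathname as input, so it fires on every URL change.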
Now we know how the page changes when the URL changes, but how is the data transferred to another page? To understand that, we need to know a little about global variables in Python. When a function is triggered through a callback, it cannot directly update variables at runtime. But you can write getters and setters, or in simple words, functions that set and get the value of a variable you want to update at runtime. Global variables let us use those values even before they're declared or defined, and keep a consistent value throughout the app. The critical thing to note is that the variable has to be declared global inside the function as well.
In the code above, you can see a getter function called getIndexValue(). It returns the index of the "Get Video Visuals" button that was clicked, ranging from 0 to 3 (one per card on the General Overview page). The default indexValue is -1; if this value hasn't been updated, that is, the callback hasn't fired yet, the first dropdown value is selected automatically. This covers the case where the user goes directly to the "Get Visuals" page without going via the first page.
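A minimal sketch of that pattern (indexValue, getIndexValue, and the dropdown helper follow the names in the text; the Dash callback wiring is omitted):

```python
# indexValue records which card's button was clicked; -1 means "not yet".
indexValue = -1

def setIndexValue(value):
    # Re-declare the name as global to rebind it at runtime.
    global indexValue
    indexValue = value

def getIndexValue():
    return indexValue

def preselected_dropdown(options):
    """Pick the dropdown default: the clicked card's video, else the first."""
    idx = getIndexValue()
    return options[idx] if idx >= 0 else options[0]
```

The setter runs inside the button callback; the getter runs when the visuals page builds its dropdown.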
The only thing left to discuss is how we know which button was clicked at runtime. For this, too, the Dash documentation offers dash.callback_context; head over to the advanced callbacks section for more. The same is implemented in the dashboard for all four buttons.
Sentiment Prediction using Vader
We predict the overall compound sentiment of each video using VADER's sentiment analyzer, shown as a Dash gauge component. You can find the Google Colab here.
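The idea behind the compound score can be sketched with a toy lexicon (this is an illustration, not VADER itself: the real analyzer ships a lexicon of thousands of scored words plus heuristics for negation, punctuation, and capitalization):

```python
import math

# Toy lexicon with a handful of hand-picked valence scores.
LEXICON = {"great": 3.1, "good": 1.9, "bad": -2.5, "terrible": -3.4}

def compound_score(text):
    """Sum word valences, then squash the total into [-1, 1]."""
    total = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    # VADER normalizes with alpha = 15, giving a score near ±1 for strong text.
    return total / math.sqrt(total * total + 15)
```

The resulting value in [-1, 1] maps naturally onto the gauge component's needle.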
Named Entity Recognition
Named entity recognition (NER) aids in locating and classifying named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. It helps in answering some real-world questions such as:
- Which companies were mentioned in the videos?
- Were specified products and names mentioned in the video?
The bar chart on the right details the count of entities in the selected transcript.
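The bar chart's data boils down to counting entities per label; assuming the NER model has already produced (text, label) pairs, a sketch:

```python
from collections import Counter

def entity_counts(entities):
    """Count recognized entities per NER label for the bar chart."""
    return Counter(label for _, label in entities)
```

The resulting label-to-count mapping plugs straight into a Plotly bar trace.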
Cytoscape
Dash Cytoscape is a graph visualization component for creating easily customizable, high-performance, interactive, web-based networks. It extends and renders Cytoscape.js.
Here, we use Cytoscape to explore the relations between a video's theme, its topic keywords, and the videos they belong to.
The usage-elements.py lets you progressively expand your graph by using âtapNodeDataâ as the input and elements as the output. The app initially pre-loads the entire dataset but only loads the graph with a single node. It then constructs four dictionaries that map every single node ID to its following nodes, following edges, followers nodes, and followers edges.
Then, it lets you expand the incoming or the outgoing neighbors by clicking the node you want to expand. This is done through a callback that retrieves the followers (outgoing) or following (incoming) from the dictionaries and adds them to the elements.
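The neighbor dictionaries can be sketched as follows (two maps instead of the demo's four, with made-up node IDs; the real demo also keeps the edge elements for rendering):

```python
from collections import defaultdict

def build_neighbor_maps(edges):
    """Map each node ID to its outgoing (followers) and incoming (following) neighbors."""
    followers_nodes = defaultdict(set)  # node -> nodes it points to
    following_nodes = defaultdict(set)  # node -> nodes pointing at it
    for src, dst in edges:
        followers_nodes[src].add(dst)
        following_nodes[dst].add(src)
    return followers_nodes, following_nodes

def expand(node_id, followers_nodes, following_nodes, shown):
    """On tapNodeData, add the tapped node's neighbors to the visible elements."""
    return shown | followers_nodes[node_id] | following_nodes[node_id]
```

Because the maps are built once up front, each tap is a constant-time dictionary lookup rather than a scan of the whole dataset.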
Refer to the Google Colab to access the full code.
Topic Modeling with LDA
We have performed topic modeling using Latent Dirichlet Allocation, a well-known technique, on our video transcript datasets to extract the topics present in the videos, which aids our visualizations. The Dirichlet model describes patterns of words that frequently occur together and are similar to each other. This stochastic process uses Bayesian inference to incorporate prior knowledge about the distribution of random variables.
We obtain pyLDAvis visualizations with a theme legend at the bottom that maps each topic to a theme, where each topic consists of similar videos.
You can find the Google Colab link here.
Learn More
Dash-Plotly Documentation: From components to callbacks, head over to this for any queries.
Dash Bootstrap Components: Consists of tons of components to choose from readily available to be used for UI.
TigerGraph Documentation: For any GSQL or schema-related queries.
Project ReadMe: Gives a glance at how this project runs and how you can contribute.
In case you have any queries on any aspect, feel free to contact any of those mentioned below:
Atishaye Jain, Bhavya Tyagi, and Janhavi Lande
Thank you for your time reading this!