Video Recommendation System with TigerGraph
Dashboarding with Plotly Express, Plotly Dash, & TigerGraph
This project has been co-created by Atishaye Jain, Bhavya Tyagi, and Janhavi Lande as a part of the GitHub Externship Program at TigerGraph for the winter cohort of 2022.
Introduction
This application recommends your favorite videos, powered by a graph database, based on the keywords you provide, and it can pinpoint exactly where a keyword was spoken in each video! The dataset spans a variety of themes such as finance, cryptocurrency, animals, astronomy, and literature for you to choose from.
We are using TigerGraph Cloud as our graph database. TigerGraph provides a graph AI platform built on the industry's first and only distributed native graph database, which comes with a SQL-like query language called GSQL and tightly integrated tooling and enterprise connectors, so data scientists and developers can design and deploy analytical solutions in weeks.
Framework
To generate transcripts, we can use Vosk, Google Speech Recognition, or Facebook's speech-recognition model.
Concepts Used
There are a few Machine Learning and Artificial Intelligence concepts being used in different aspects of this project:
- Cleaning & Preprocessing: This helps us clear any kind of noise present in the transcripts extracted from videos.
- Sentiment Analysis of Transcripts: We evaluated sentiment prediction with different models, including Flair and VADER, and chose VADER as it gave us the best accuracy.
- Topic Modelling of Transcripts: We performed topic modeling as a way to categorize video transcripts into topics and themes.
- Keyword Extraction: Using TF-IDF, we have obtained keywords from each video to aid in visualizing the more weighted keywords on our graph.
- Entity Extraction: We extracted named entities from the transcripts to help surface better insights.
- Timestamping of Transcripts: We evaluated several models for timestamping text. Our application needs word-by-word timestamps rather than whole-sentence ones, so we chose the Vosk model.
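To illustrate the TF-IDF keyword extraction mentioned above, here is a self-contained toy sketch (function and variable names are made up; the project's actual implementation lives in its Colab notebooks):

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """Rank each document's words by TF-IDF and return the top_k per document."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: how many transcripts contain each word.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    keywords = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # TF-IDF: frequent in this transcript, rare across all transcripts.
        scores = {w: (tf[w] / len(tokens)) * math.log(n / df[w]) for w in tf}
        ranked = sorted(scores, key=scores.get, reverse=True)
        keywords.append(ranked[:top_k])
    return keywords
```

Words that appear in every transcript (such as stop words) get an IDF of zero, so only the more distinctive, "weighted" keywords survive into the graph.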
The Google Colab notebooks for the parts above are present in the GitHub repository as well as in this drive link.
Getting started with TigerGraph Cloud
First, you'll need to set up your TigerGraph solution. To do so, go to https://tgcloud.io/ and either log in or create a free account if you do not already have one.
- Once logged in, click on the "My Solutions" tab.
- Then, click on "Create Solution" in the upper-right corner.
- Create a blank solution.
- For the instance settings, keep everything at the free defaults.
- The only required field in the solution settings is a password, which you must remember.
In case of any queries, you can refer to the video tutorial below.
Perfect! Wait until the status of your solution is "Ready," and then we'll prepare our datasets.
Importing our solution & schema
From the GraphStudio landing page, click on "Import an Existing Solution" and select the downloaded file vid-rec.tar.gz.
Installing the queries
To the left of "GSQL queries," press the rectangle with the up arrow (its hover text reads "Install all queries").
You're done, congratulations! Your graph is now fully set up!
Queries
Now that we have loaded the data onto our graph, it's time to retrieve the essential information. We will write several queries in GSQL (TigerGraph's graph query language) to do so. For more information, check out this great resource!
Here, we will dive into the Recommendation Query:
This query returns a max heap containing the top three recommended videos' IDs and their respective Jaccard similarity scores. Below is the complete query:
First, we create a TYPEDEF TUPLE consisting of Video_Audio Id along with the Jaccard score. We then create a HeapAccum with a limit of 3 values which orders the values by their respective Jaccard score in descending order. Next, we store the preprocessed transcript of the current video in the SetAccum input. Then, we traverse over all the preprocessed transcripts in our database and consecutively store them in SetAccum vid_with_transcripts. Finally, we accumulate the Jaccard scores of the current transcript with all the other transcripts in the HeapAccum and at last, the HeapAccum is printed.
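The same top-3 logic can be sketched in pure Python with a heap (hypothetical names; the actual query is GSQL, where the HeapAccum does the bookkeeping for us):

```python
import heapq

def jaccard(a, b):
    """Jaccard similarity of two keyword sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def top_three(current_id, transcripts):
    """Mimic the HeapAccum: keep only the three highest-scoring videos."""
    current = transcripts[current_id]
    heap = []  # min-heap of (score, video_id); smallest entry is evicted first
    for vid, words in transcripts.items():
        if vid == current_id:
            continue
        heapq.heappush(heap, (jaccard(current, words), vid))
        if len(heap) > 3:
            heapq.heappop(heap)
    return sorted(heap, reverse=True)  # highest Jaccard score first
```

Bounding the heap at three entries is what keeps the query cheap even as the number of transcripts grows.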
Voila! Now that we have the queries we can head over to the Dashboard.
Dazzling Dashboard | UI Overview
The UI consists of four pages that incorporate a video recommendation display and various visualizations. This project covers the following aspects:
- General Overview: This lets the user select or type in the keywords they're interested in. Users can then select from a dropdown of videos that contain that particular keyword. A third, optional dropdown pinpoints the different locations of the same keyword within a video. On hitting the "Go at Timestamp" button, the selected video pops up (the JavaScript handling this is covered a little later) and, on playing, starts from the chosen timestamp or from 0 seconds.
- The user can then get the same video's visuals, or recommendations based on that particular video, by clicking the respective buttons. For visuals, the user is taken to the second page, where the dropdown comes pre-selected with the video whose card button was clicked. We discuss this functionality in detail under Multipage Integration & Data Transfer.
- For more recommendations based on the selected video, the Jaccard similarity query kicks in, matches against the keywords and topic keywords we have already extracted, and displays videos along with their respective Jaccard similarity scores. This, too, is discussed in detail later in this blog.
- Get Visuals: This page can be reached by clicking its nav link in the sidebar or through the multipage integration functionality we just mentioned. We have kept this one special yet informative, with just two visualizations: colorful Named Entity Recognition and a classic Cytoscape graph (discussed later).
- Topic Analytics: We perform topic modeling utilizing the Latent Dirichlet Allocation algorithm to obtain topics.
- About: This page gives a general description of what we are doing and how things work, along with guidelines on how you can contribute to this project, and finally our contact details in case you need help.
JavaScript Integration & Callbacks
Collapsible divisions would have been quick to build with Dash components alone, but with those we were missing the smooth scrolling and the scroll-back-to-top on page changes. Moreover, using JavaScript puts more controls in our hands.
If you're a little familiar with JS, then writing code like this would be a cakewalk:
But the major challenge was getting this code working on board, as Dash doesn't allow the DOM (Document Object Model) to be manipulated directly. So we wrapped it in a self-invoking JS function returned from a callback, which fires when the page with id="main-page" loads. The following code:
Multipage Integration & Data Transfer
The smooth transition from the "General Overview" page to the "Get Visualizations" page, with a dropdown pre-selected with the video the user picked, is quite interesting, isn't it?
The code above handles which content is visible when the path in the URL changes. Refer to the documentation for more info.
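Stripped of the Dash components, the routing callback's core logic is just a path-to-layout lookup (the page names here are hypothetical stand-ins for the real layouts):

```python
# Hypothetical page registry; in the dashboard these values are Dash layouts.
PAGES = {
    "/": "general-overview",
    "/visuals": "get-visuals",
    "/topics": "topic-analytics",
    "/about": "about",
}

def display_page(pathname):
    """Body of the routing callback: pick the layout for the current URL path."""
    return PAGES.get(pathname, "404-not-found")
```

In the real app, this function is registered as a callback with dcc.Location's pathname as input, so it fires on every URL change.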
Now we know how the page changes when the URL changes, but how is the data transferred to another page? To understand that, we need to know a little about global variables in Python. When a function is triggered through a callback, it cannot directly update variables at runtime. But you can write getters and setters, or in simple words, functions that set and get the value of a variable you want to update at runtime. Global variables let us use those values even before they're declared or defined, and keep a consistent value throughout the app. The critical thing to note is that the variable has to be declared global inside the function as well.
In the code above, you can see a getter function called getIndexValue(). It returns the index of the "Get Video Visuals" button that was clicked, ranging from 0 to 3 (one per card on the General Overview page). The default indexValue is -1; if this value hasn't been updated, that is, the callback hasn't fired yet, the first dropdown value is selected automatically. This covers the case where the user goes directly to the "Get Visuals" page without going via the first page.
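A minimal sketch of that pattern (indexValue, getIndexValue, and the dropdown helper follow the names in the text; the Dash callback wiring is omitted):

```python
# indexValue records which card's button was clicked; -1 means "not yet".
indexValue = -1

def setIndexValue(value):
    # Re-declare the name as global to rebind it at runtime.
    global indexValue
    indexValue = value

def getIndexValue():
    return indexValue

def preselected_dropdown(options):
    """Pick the dropdown default: the clicked card's video, else the first."""
    idx = getIndexValue()
    return options[idx] if idx >= 0 else options[0]
```

The setter runs inside the button callback; the getter runs when the visuals page builds its dropdown.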
The only thing left to discuss is how we know which button was clicked at runtime. For this, too, the Dash documentation offers dash.callback_context; head over to the advanced callbacks section for more. The same is implemented in the dashboard for all four buttons.
Sentiment Prediction using Vader
We predict the overall compound sentiment of each video using VADER's sentiment analyzer, shown as a Dash gauge component. You can find the Google Colab here.
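The idea behind the compound score can be sketched with a toy lexicon (this is an illustration, not VADER itself: the real analyzer ships a lexicon of thousands of scored words plus heuristics for negation, punctuation, and capitalization):

```python
import math

# Toy lexicon with a handful of hand-picked valence scores.
LEXICON = {"great": 3.1, "good": 1.9, "bad": -2.5, "terrible": -3.4}

def compound_score(text):
    """Sum word valences, then squash the total into [-1, 1]."""
    total = sum(LEXICON.get(w, 0.0) for w in text.lower().split())
    # VADER normalizes with alpha = 15, giving a score near ±1 for strong text.
    return total / math.sqrt(total * total + 15)
```

The resulting value in [-1, 1] maps naturally onto the gauge component's needle.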
Named Entity Recognition
Named entity recognition (NER) aids in locating and classifying named entities in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc. It helps in answering some real-world questions such as:
- Which companies were mentioned in the videos?
- Were specified products and names mentioned in the video?
The bar chart on the right details the count of entities in the selected transcript.
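The bar chart's data boils down to counting entities per label; assuming the NER model has already produced (text, label) pairs, a sketch:

```python
from collections import Counter

def entity_counts(entities):
    """Count recognized entities per NER label for the bar chart."""
    return Counter(label for _, label in entities)
```

The resulting label-to-count mapping plugs straight into a Plotly bar trace.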
Cytoscape
Dash Cytoscape is a graph visualization component for creating easily customizable, high-performance, interactive, web-based networks. It extends and renders Cytoscape.js.
Here, we use Cytoscape to explore the relations between a video's theme, its topic keywords, and the videos they belong to.
The usage-elements.py lets you progressively expand your graph by using âtapNodeDataâ as the input and elements as the output. The app initially pre-loads the entire dataset but only loads the graph with a single node. It then constructs four dictionaries that map every single node ID to its following nodes, following edges, followers nodes, and followers edges.
Then, it lets you expand the incoming or the outgoing neighbors by clicking the node you want to expand. This is done through a callback that retrieves the followers (outgoing) or following (incoming) from the dictionaries and adds them to the elements.
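The neighbor dictionaries can be sketched as follows (two maps instead of the demo's four, with made-up node IDs; the real demo also keeps the edge elements for rendering):

```python
from collections import defaultdict

def build_neighbor_maps(edges):
    """Map each node ID to its outgoing (followers) and incoming (following) neighbors."""
    followers_nodes = defaultdict(set)  # node -> nodes it points to
    following_nodes = defaultdict(set)  # node -> nodes pointing at it
    for src, dst in edges:
        followers_nodes[src].add(dst)
        following_nodes[dst].add(src)
    return followers_nodes, following_nodes

def expand(node_id, followers_nodes, following_nodes, shown):
    """On tapNodeData, add the tapped node's neighbors to the visible elements."""
    return shown | followers_nodes[node_id] | following_nodes[node_id]
```

Because the maps are built once up front, each tap is a constant-time dictionary lookup rather than a scan of the whole dataset.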
Refer to the Google Colab to access the full code.
Topic Modeling with LDA
We have performed topic modeling using Latent Dirichlet Allocation, a well-known technique, on our video transcript datasets to extract the topics present in the videos, which aids our visualizations. The Dirichlet model describes patterns of words that frequently occur together and are similar to each other. This stochastic process uses Bayesian inference to incorporate prior knowledge about the distribution of random variables.
We obtain pyLDAvis visualizations with a theme legend at the bottom that maps each topic to a theme, where each topic consists of similar videos.
You can find the Google Colab link here.
Learn More
Dash-Plotly Documentation: From components to callbacks, head over to this for any queries.
Dash Bootstrap Components: Consists of tons of components to choose from readily available to be used for UI.
TigerGraph Documentation: For any GSQL or schema-related queries.
Project ReadMe: Gives a glance at how this project runs and how you can contribute.
In case you have any queries on any aspect, feel free to contact any of those mentioned below:
Atishaye Jain, Bhavya Tyagi, and Janhavi Lande
Thank you for your time reading this!