Geoffrey Wiseman for AWS Community Builders

Top re:Invent 2024 Videos

Every year, Amazon Web Services holds a conference, "AWS re:Invent", and they release the conference sessions as videos on YouTube. It's a big conference with a lot of sessions, so there are lots of videos to pick from. So much so that it's hard to look at the list of videos and decide which ones you're interested in. If you have specific topics you want to learn about, you can search for videos on that topic, but looking at the full list is not realistic.

This year, while looking at the long list of videos, I thought it would be nice to see which are the most popular videos from re:Invent. YouTube doesn't make it very easy to get that list, so I decided to experiment with a custom-built solution. If you want to skip ahead to the working version, it's available here.

V1: The Experiment

I wrote a short Python script to call the YouTube Data API v3, identify all the videos published to the AWS Events channel in December 2024, and rank by view count.
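
The ranking step itself is small. Here is a minimal sketch of it, assuming items shaped like the ones the API returns when you ask for the `snippet` and `statistics` parts; the sample titles and counts are made up for illustration and there is no network call:

```python
def rank_by_views(items: list[dict]) -> list[tuple[str, int]]:
    """Return (title, view_count) pairs, most viewed first."""
    ranked = [
        # The API reports viewCount as a string, so convert before sorting.
        (item["snippet"]["title"], int(item["statistics"]["viewCount"]))
        for item in items
    ]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample items (not real re:Invent data).
sample_items = [
    {"snippet": {"title": "Session A"}, "statistics": {"viewCount": "1200"}},
    {"snippet": {"title": "Session B"}, "statistics": {"viewCount": "98000"}},
    {"snippet": {"title": "Session C"}, "statistics": {"viewCount": "4500"}},
]

for title, views in rank_by_views(sample_items):
    print(f"{views:>7,}  {title}")
```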

V1 Architecture

As an experiment, I'd call it a success -- I got a long list of videos, sorted by views. I shared that list in a few places, but I knew that it wasn't complete. First of all, AWS was still uploading new videos. Secondly, as people continued to watch and share, the view counts would continue to rise, and that would likely impact the rank. In particular, the newer videos had had less time to achieve view count success than the first videos that were posted nearly a week before.

So, although the experiment was a success, I wanted more.

V2: Website

If I wanted to run that script to get updated data and be able to share it, the obvious solution was to turn it into a single-purpose website that I could share and that anyone could refresh whenever they wanted to get updated data.

I considered options -- I could turn the Python script into a static site generation script that made HTML and deploy that (e.g. to S3), then schedule a job to generate new content every so often.

I didn't want to run and schedule the job on my own infrastructure, and since this was an AWS-related project, I decided I'd like it to run on AWS. I didn't need the job to run very often, so I didn't really want to spin up an EC2 instance; this felt like a good job for Lambda.

I packaged up the script and the Google API client into a Lambda .zip file, uploaded it and began some testing. I didn't want to have to redeploy to change simple parameters, so I externalized some of the more obvious parameters into environment variables and added those to the Lambda function as well. I didn't really need a load balancer, so I added a Lambda function URL.

V2 Architecture

That worked -- but even in testing I was starting to run up against the limitations of the YouTube Data API. In order to prevent abuse (and reduce the cost to their infrastructure), YouTube's Data API has quotas. Each search call uses 100 units of the 10,000 units you're allocated per day by default, and getting enough information to rank by views uses at least another unit, so you're limited to fewer than 100 search calls a day. Each search call returns a maximum of 50 results, so listing every video took many search calls; rendering the page once was using up nearly a quarter of my quota, and if I were to share this with others, it would almost immediately stop working.
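
To make the quota math concrete, here is a rough sketch of the arithmetic. The per-call costs and the daily allocation come from the paragraph above; the channel size of 1,200 videos is a hypothetical number I picked only because it lands near "a quarter of the quota", not a measured figure:

```python
import math

DAILY_QUOTA = 10_000   # default daily allocation, as described above
SEARCH_COST = 100      # units per search call, as described above
VIDEO_LIST_COST = 1    # units for the follow-up call that fetches view counts


def render_cost(video_count: int, page_size: int = 50) -> int:
    """Quota units needed to list and rank `video_count` videos once."""
    pages = math.ceil(video_count / page_size)
    return pages * (SEARCH_COST + VIDEO_LIST_COST)


# Hypothetical channel size, chosen only to illustrate the arithmetic.
video_count = 1_200
cost = render_cost(video_count)
print(f"{cost} units per render, {DAILY_QUOTA // cost} renders per day")
```

Under that assumption, 24 pages at 101 units each is 2,424 units, which leaves room for only four full renders a day.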

V3: Caching

Now we're back to architectural choices. If I'm going to use a cache to avoid hitting the YouTube Data API quota, how am I going to do that?

Internal Caching
Do I cache the data from YouTube and only refresh periodically -- once or twice a day? With more intelligent caching, I could even save the video information and on refresh I'd only have to identify new videos and update the video counts on existing videos, which would use far less quota. Then I'd have to store that data somewhere, and since you can't rely on a Lambda staying in memory, that meant an external cache of some kind -- a file on S3, a cache like Redis, or a datastore of some kind, maybe DynamoDB. I was briefly tempted to test out Aurora DSQL.
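
A minimal sketch of that idea, assuming a simple time-to-live check. The `store` dict is a stand-in for whatever external store you would pick (an S3 object, a DynamoDB item), and `get_videos` and `fake_fetch` are hypothetical names, not part of the script I ended up with:

```python
from typing import Callable


def get_videos(store: dict, fetch: Callable[[], list], now: float,
               ttl_seconds: float) -> list:
    """Return cached videos from `store`, refetching once they exceed the TTL."""
    entry = store.get("videos")
    if entry is None or now - entry["fetched_at"] >= ttl_seconds:
        # Cache miss or stale entry: spend quota once and remember the result.
        entry = {"fetched_at": now, "data": fetch()}
        store["videos"] = entry
    return entry["data"]


calls = []


def fake_fetch() -> list:
    """Stand-in for the quota-consuming YouTube calls."""
    calls.append(1)
    return ["video-1", "video-2"]


store = {}  # in a real Lambda this would live outside the function's memory
TWELVE_HOURS = 12 * 60 * 60
get_videos(store, fake_fetch, now=0, ttl_seconds=TWELVE_HOURS)             # miss
get_videos(store, fake_fetch, now=3600, ttl_seconds=TWELVE_HOURS)          # hit
get_videos(store, fake_fetch, now=TWELVE_HOURS, ttl_seconds=TWELVE_HOURS)  # stale
print(f"API fetches: {len(calls)}")
```

Passing `now` in explicitly keeps the sketch easy to test; a real version would read the clock itself.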

V3a Architecture

External Caching
I could also cache externally -- render the HTML and store that instead of the data used to generate the HTML. That also saves the work to generate the HTML, so it's a bit more efficient, although perhaps a little less flexible in some ways.

I could go back to a static site generation plan -- schedule the Lambda to run once or twice a day, take the output from the Lambda function and store it on S3, where I could serve it up directly.

The simplest solution seemed to be CloudFront. In theory I could continue to use the Lambda unmodified and CloudFront could cache the results of the Lambda function, and then many page refreshes would simply be CloudFront cache hits.

This isn't quite as flexible as the internal caching, but it saves me from setting up a scheduled job for static site generation, so I decided to try this path.
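
One way to tell CloudFront how long to keep a page is a `Cache-Control` header on the origin response, which CloudFront honors within the bounds of the distribution's cache policy. This is a sketch under that assumption; the cache lifetime could just as well be set in the cache policy with no code change, and `cacheable_response` is a hypothetical helper:

```python
def cacheable_response(body: str, max_age_seconds: int) -> dict:
    """Build a Lambda function URL response that a CDN may cache."""
    return {
        "statusCode": 200,
        "body": body,
        "headers": {
            "Content-Type": "text/html; charset=utf-8",
            # Shared caches may reuse this response for max_age_seconds.
            "Cache-Control": f"public, max-age={max_age_seconds}",
        },
    }


response = cacheable_response("<html>...</html>", max_age_seconds=12 * 60 * 60)
print(response["headers"]["Cache-Control"])
```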

V3b Architecture

V4: Resolving Quota Issues

I published the CloudFront URL to a few places, and heard back from a friend who was travelling that he got an Internal Server Error. I suspected the quota had been exceeded, and I checked the logs -- sure enough, it had. This revealed two problems in V3: error handling, and CloudFront cache plurality.

I dealt with the error handling first -- I added some code to trap the exception and handle it a little better than the Internal Server Error that was showing. I didn't do anything sophisticated because the goal was to not hit the quota, rather than handle it well.
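
For illustration, here is a sketch of slightly friendlier handling. It assumes the error body follows the JSON shape Google APIs commonly use, with a `quotaExceeded` reason in the `errors` list; treat that shape, the sample payload, and the `friendly_error` helper as assumptions rather than what the script does:

```python
import json

FALLBACK = "Unexpected error from the YouTube Data API"


def friendly_error(status: int, content: bytes) -> dict:
    """Turn a raw API error body into a plain-text Lambda response."""
    try:
        error = json.loads(content.decode("utf-8"))["error"]
        reasons = {e.get("reason") for e in error.get("errors", [])}
        message = error.get("message", FALLBACK)
    except (ValueError, KeyError, TypeError, AttributeError):
        # The body was not JSON, or not in the shape we assumed.
        reasons, message = set(), FALLBACK
    if "quotaExceeded" in reasons:
        message = "The daily YouTube API quota is used up. Please try again tomorrow."
    return {
        "statusCode": status,
        "body": message,
        "headers": {"Content-Type": "text/plain; charset=utf-8"},
    }


# Hypothetical error payload, for illustration only.
sample = json.dumps({
    "error": {"code": 403, "message": "Quota exceeded.",
              "errors": [{"reason": "quotaExceeded"}]}
}).encode("utf-8")
print(friendly_error(403, sample)["body"])
```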

The CloudFront problem required a small architectural change -- each CloudFront location has its own cache, and if people from all over the world load the Lambda, each location will invoke the Lambda, and since I can only get away with about four invocations in a day on the quota I had, that was going to be a problem.

Fortunately, CloudFront has a solution for that -- you can add Origin Shield, which effectively means that CloudFront will also cache internally. If the first request comes in from Toronto, the closest CloudFront location will check its own cache, then the internal cache, and then invoke the Lambda. CloudFront will then store the result in the internal cache and within the location cache. The second request might come in from Japan, which will get a cache miss in the location-based cache, but a cache hit from the internal cache, and CloudFront will populate the location cache from the internal cache and avoid invoking the Lambda a second time.
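
A toy model makes the difference easy to see. This is only a simulation of the lookup order described above: it ignores cache expiry and the rest of CloudFront's real topology, and every name in it is hypothetical:

```python
from typing import Optional


class Origin:
    """Stand-in for the Lambda; counts how often it is invoked."""

    def __init__(self) -> None:
        self.invocations = 0

    def render(self) -> str:
        self.invocations += 1
        return "<html>top videos</html>"


def serve(location: str, edge_caches: dict, shield: Optional[dict],
          origin: Origin) -> str:
    """Serve one request; `shield` is the shared cache, or None if disabled."""
    edge = edge_caches.setdefault(location, {})
    if "page" in edge:                       # hit in the location cache
        return edge["page"]
    if shield is not None and "page" in shield:
        page = shield["page"]                # hit in the shared cache
    else:
        page = origin.render()               # miss everywhere: invoke origin
        if shield is not None:
            shield["page"] = page
    edge["page"] = page
    return page


def count_invocations(locations: list[str], use_shield: bool) -> int:
    origin = Origin()
    edge_caches: dict = {}
    shield = {} if use_shield else None
    for location in locations:
        serve(location, edge_caches, shield, origin)
    return origin.invocations


requests = ["Toronto", "Tokyo", "Toronto", "London"]
print("without shield:", count_invocations(requests, use_shield=False))
print("with shield:   ", count_invocations(requests, use_shield=True))
```

In this model, three distinct locations cost three origin invocations without the shared layer and one with it.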

V4 Architecture

This worked well -- the quota errors went away, but then I discovered a new problem. Some of the earliest popular videos were no longer showing in the list.

V5: Resolving Dropped Videos

Was it a bug in the code? I did some testing. Turns out, the YouTube Data API pagination only gets you so many pages -- at some point YouTube doesn't give you another next page token, even though there are more videos that match your criteria.

I tested a few options -- I could break up the month into smaller date ranges and run a search for each range. I was worried this would simply hit the quota faster, but it did seem to work. I had to do a little work to make sure no duplicates showed up, but otherwise it seemed to be ok.
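
Roughly, the splitting and de-duplication could look like this. It is a sketch, assuming UTC datetimes and week-long windows; `split_range` and `dedupe` are hypothetical helpers, and the sample IDs are made up:

```python
from datetime import datetime, timedelta, timezone


def to_rfc3339(moment: datetime) -> str:
    """Format a UTC datetime the way the search date filters expect."""
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


def split_range(start: datetime, end: datetime, days: int) -> list[tuple[str, str]]:
    """Split [start, end) into consecutive windows of at most `days` days."""
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=days), end)
        windows.append((to_rfc3339(cursor), to_rfc3339(window_end)))
        cursor = window_end
    return windows


def dedupe(pages: list[list[str]]) -> list[str]:
    """Flatten pages of video IDs, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for page in pages:
        for video_id in page:
            if video_id not in seen:
                seen.add(video_id)
                unique.append(video_id)
    return unique


december = split_range(
    datetime(2024, 12, 1, tzinfo=timezone.utc),
    datetime(2025, 1, 1, tzinfo=timezone.utc),
    days=7,
)
for after, before in december:
    print(after, "->", before)
print(dedupe([["a", "b"], ["b", "c"]]))
```

Adjacent windows share a boundary timestamp, which is one reason a video could show up twice and why the de-duplication step is there.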

While I was experimenting with that, I discovered that the YouTube Data API had a feature I missed on the first pass -- I could order the results by view count. This meant that I no longer needed to iterate through every video in the month; I could simply use YouTube's own knowledge of the most popular videos that matched the search criteria that I'd established.

This also meant I could significantly reduce my API traffic -- I could issue just enough search requests to fill my list and stop. No architectural change, just a code change.
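
The stopping rule can be sketched on its own, without touching the API. Here `fetch_page` is a hypothetical callable that takes a page token and returns a list of IDs plus the next token, and `FAKE_PAGES` is made-up data standing in for search results already ordered by view count:

```python
from typing import Callable, Optional

Page = tuple[list[str], Optional[str]]  # (video ids, next page token)


def collect_top(fetch_page: Callable[[Optional[str]], Page], wanted: int,
                max_requests: int) -> tuple[list[str], int]:
    """Fetch pages until `wanted` ids are collected or a limit is reached."""
    ids: list[str] = []
    token = None
    requests_made = 0
    while len(ids) < wanted and requests_made < max_requests:
        page_ids, token = fetch_page(token)
        requests_made += 1
        ids.extend(page_ids)
        if not page_ids or token is None:
            break  # nothing more to fetch
    return ids[:wanted], requests_made


# Hypothetical pages keyed by page token.
FAKE_PAGES = {
    None: (["a", "b", "c"], "p2"),
    "p2": (["d", "e", "f"], "p3"),
    "p3": (["g"], None),
}


def fake_fetch(token: Optional[str]) -> Page:
    return FAKE_PAGES[token]


ids, used = collect_top(fake_fetch, wanted=5, max_requests=10)
print(ids, used)
```

With five results wanted and three per page, the loop stops after two requests instead of walking every page.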

I tested it, and that worked well.

Here's the current version of the Python code. There are still improvements that could be made, certainly, but I'm willing to share imperfect code:

import datetime
import html
import os
import re
from typing import NamedTuple

from googleapiclient.discovery import build
from googleapiclient.errors import HttpError


class TopVideoConfig(NamedTuple):
    """
    Configuration for fetching top videos from YouTube.

    Attributes:
        api_key (str): The API key for accessing the YouTube Data API.
        channel_id (str): The ID of the YouTube channel to fetch videos from.
        published_after (str): The start date for fetching videos (ISO 8601 format).
        published_before (str): The end date for fetching videos (ISO 8601 format).
        max_requests (int): The maximum number of API requests to make.
        max_results (int): The maximum number of video results to return.
    """
    api_key: str
    channel_id: str
    published_after: str
    published_before: str
    max_requests: int
    max_results: int


class VideoResult(NamedTuple):
    """
    One record of top videos from YouTube.
    """
    video_id: str
    title: str
    published_at: str
    view_count: int


class Retriever:
    """Fetches the most-viewed videos for the configured channel and date range."""
    def __init__(self, config: TopVideoConfig):
        self.config = config
        self.youtube = build('youtube', 'v3', developerKey=config.api_key)
        self.retrieved_ids = set()
        self.videos = []
        self.search_requests = 0
        # Strips the common "AWS re:Invent 2024 - " prefix from titles.
        self.title_pattern = r'^AWS re:Invent 2024\s*-\s*'
        self.next_token = None

    def fetch(self) -> list[VideoResult]:
        while True:
            results = self.fetch_search_page()
            print(f"Videos: {results}")
            self.videos.extend(results)
            if len(results) == 0:
                print("No more results found.")
                break
            if self.search_requests >= self.config.max_requests:
                print(f"Reached max requests limit of {self.config.max_requests}")
                break
            if not self.next_token:
                print("No more pages found.")
                break
        print(f"Retrieved {len(self.videos)} videos using {self.search_requests} requests")
        self.videos.sort(key=lambda v: v.view_count, reverse=True)
        return self.videos

    def fetch_search_page(self) -> list[VideoResult]:
        video_ids = self.fetch_video_ids()
        if len(video_ids) == 0:
            return []
        self.retrieved_ids.update(video_ids)

        video_request = self.youtube.videos().list(
            part='snippet,statistics',
            id=','.join(video_ids)
        )
        response = video_request.execute()
        return [self.transform_video(video) for video in response['items']]

    def fetch_video_ids(self) -> set[str]:
        request = self.youtube.search().list(
            part='id',
            channelId=self.config.channel_id,
            maxResults=50,
            order='viewCount',
            type='video',
            publishedAfter=self.config.published_after,
            publishedBefore=self.config.published_before,
            pageToken=self.next_token
        )
        response = request.execute()
        video_ids = {item['id']['videoId'] for item in response['items']}
        print(f"Retrieved {len(video_ids)} video ids with page token {self.next_token}")
        self.search_requests += 1
        self.next_token = response.get('nextPageToken')
        return video_ids.difference(self.retrieved_ids)

    def transform_video(self, video: dict) -> VideoResult:
        return VideoResult(
            video_id=video['id'],
            title=self.transform_title(video['snippet']['title']),
            published_at=video['snippet']['publishedAt'],
            view_count=int(video['statistics']['viewCount'])
        )

    def transform_title(self, title: str) -> str:
        return html.escape(re.sub(self.title_pattern, '', title))


# Load configuration from environment variables
def get_config() -> TopVideoConfig:
    # API Key
    api_key = os.getenv('YOUTUBE_API_KEY')

    # AWS Events channel ID
    channel_id = os.getenv('YOUTUBE_CHANNEL_ID')

    # Dates
    published_after = os.getenv('YOUTUBE_PUBLISHED_AFTER')
    published_before = os.getenv('YOUTUBE_PUBLISHED_BEFORE')
    max_requests = int(os.getenv('MAX_REQUESTS', '1'))
    max_results = int(os.getenv('MAX_RESULTS', '50'))

    return TopVideoConfig(api_key, channel_id, published_after, published_before, max_requests, max_results)


def render_videos(config: TopVideoConfig, videos: list[VideoResult]) -> str:
    response = "<html><head>"
    response += "<title>Top re:Invent 2024 Videos</title>"
    response += "<style>body { font-family: sans-serif; } li { padding-bottom: 10px; }</style>"
    response += "</head><body>"
    response += "<h1>Top re:Invent 2024 Videos</h1>\n"
    # Use a timezone-aware timestamp; %Z is empty for a naive datetime.
    response += f"<h3>(as of {datetime.datetime.now(datetime.timezone.utc).strftime('%Y-%m-%d %H:%M:%S %Z')})</h3>\n"
    response += "<ol>"
    for video in videos[:config.max_results]:
        response += f"<li><a href='https://youtube.com/watch?v={video.video_id}'>{video.title}</a>"
        response += f" ({video.view_count:,} views)</li>\n"
    response += "</ol></body></html>"
    return response


# Bootstrap for Lambda
def lambda_handler(event, context):
    config = get_config()
    try:
        videos = Retriever(config).fetch()
        response = render_videos(config, videos)
        return {
            'statusCode': 200,
            'body': response,
            'headers': {
                'Content-Type': 'text/html; charset=utf-8'
            }
        }
    except HttpError as e:
        return {
            'statusCode': e.resp.status,
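            # HttpError.content is bytes; the response body must be a string.
            'body': e.content.decode('utf-8', errors='replace'),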
            'headers': {
                'Content-Type': 'text/plain; charset=utf-8'
            }
        }


# Bootstrap for Direct Testing
if __name__ == "__main__":
    print(lambda_handler(None, None))

If I were making this maintainable, I'd probably split it up a bit more, move the title transformation into the rendering code, add some more error handling improvements, etc. The HTML could be templated, but I wanted to keep the dependencies low.
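
As a sketch of what dependency-free templating might look like, the standard library's `string.Template` is enough for a page this small. This is not the deployed code; the `PAGE` and `ITEM` templates and the `render` function are hypothetical, and the escaping happens at render time rather than in the retriever:

```python
import html
from string import Template

PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1><ol>\n$items</ol></body></html>"
)
ITEM = Template(
    "<li><a href='https://youtube.com/watch?v=$video_id'>$title</a>"
    " ($views views)</li>\n"
)


def render(title: str, videos: list[tuple[str, str, int]]) -> str:
    """Render (video_id, title, view_count) tuples as an ordered HTML list."""
    items = "".join(
        ITEM.substitute(
            video_id=html.escape(video_id),   # escapes quotes as well
            title=html.escape(video_title),
            views=f"{views:,}",
        )
        for video_id, video_title, views in videos
    )
    return PAGE.substitute(title=html.escape(title), items=items)


# Hypothetical video, for illustration only.
page = render("Top Videos", [("abc123", "Q&A <live>", 1234)])
print(page)
```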

Done?

It seems to be working and stable. There are still things I could do to improve it, but it's working well enough that I might move on to the next experiment. It's configurable if I decide to run the same experiment for the next re:Invent or another conference. I can't run too many of these in parallel without requesting a YouTube Quota increase, but I could certainly run a few.

I wouldn't mind automating the deployment of it, but there aren't any really interesting parts to that for me, because I've done lots of AWS automation and deployments, so it would mostly be to save myself time, and unless I need to deploy it frequently, it's likely to end up costing me time instead. For an experiment, I'm not sure the automation is warranted.

I'm also curious to see what the total cost of running this will end up being -- I'm expecting it to be fairly cheap, but we'll see.
