
Things I learnt when self-hosting my own podcast

Podnews hosts its own podcast (it’s called Podnews Daily, a daily news briefing about podcasting and on-demand). I’ve learnt quite a lot from self-hosting it, so I thought I’d write down some of the detail.
Here’s my setup. I use a web server on Amazon Lightsail, which produces an RSS feed. I wrote the RSS feed generation script, which uses a database. The RSS feed is written to a static file on my Lightsail server; I host the audio on Amazon S3.
Amazon CloudFront is in front of both of these. CloudFront is charged based on total requests, and total bandwidth transferred.
(Here are all the tools I use.)
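The feed generation script itself needn’t be complicated. Here’s a minimal sketch of the approach; the database schema and file paths are hypothetical, and my real script does rather more:

<?php
// Minimal sketch: build the RSS feed from a database, then write it out as a
// static file so the web server never has to generate it per-request.
// Table and column names here are hypothetical.
$db = new PDO('sqlite:/var/data/podnews.db');
$items = '';
foreach ($db->query('SELECT title, audio_url, pub_date, bytes FROM episodes ORDER BY pub_date DESC') as $ep) {
    $items .= '<item>'
            . '<title>' . htmlspecialchars($ep['title']) . '</title>'
            . '<enclosure url="' . htmlspecialchars($ep['audio_url']) . '"'
            . ' length="' . (int)$ep['bytes'] . '" type="audio/mpeg"/>'
            . '<pubDate>' . date(DATE_RSS, strtotime($ep['pub_date'])) . '</pubDate>'
            . '</item>';
}
file_put_contents('/var/www/html/rss.xml',
    '<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"><channel>'
    . '<title>Podnews Daily</title>' . $items . '</channel></rss>');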
Learning: use a content delivery service for audio
I used to think that a content delivery network like CloudFront wasn’t necessary for a podcast. Most people get their podcasts downloaded automatically overnight, for example, and high speed, or even a consistent transfer rate, isn’t necessary if you’re downloading unattended.
Things have changed, though: most notably with the advent of smart speakers and Spotify, neither of which downloads podcast audio in advance, but instead ‘streams’ it on demand. Suddenly, it’s important to burst-deliver the first part of a podcast as quickly as you can, to lessen the wait between hitting “play” and hearing the audio. Speed now matters for audio, so a CDN becomes useful.
Secondly, when using tools like WebSub or PodPing to “announce” to podcast apps that you’ve a new episode ready, the resulting traffic can be quite hard work for a server, as potentially hundreds of podcast apps all jump to download the audio as soon as they can. That can hurt any server, especially one charged based on CPU use, so CloudFront is handy.
CloudFront also keeps its own logs, and can be configured to drop them all into an Amazon S3 bucket for analysis later. By piping all my traffic through CloudFront, I get one simple set of logs regardless of where the content actually lives.
CloudFront’s “behaviours” let you direct different URL patterns to different origin servers, too. This allows me to switch audio hosting from Amazon S3 to somewhere else, if I want to, while keeping the URL the same. That has already been useful: in the early days, I served the audio from the web server rather than S3.
For the US and Europe, where the majority of my traffic is, CloudFront is also a little cheaper in terms of bandwidth. Additionally, you can keep pricing down by restricting CloudFront delivery to the US and Europe only (a cheaper “price class”); some regions are significantly more expensive.
Our podcast stats page calculates the cost of serving our audio.
The RSS feed
A podcast needs an RSS feed, of course, to function. This gets hit many, many times and is quite a large file. The RSS feed is therefore cached on CloudFront and is normally fed to clients in a compressed format.
Podnews’s main RSS feed is served just over 23,000 times a day: a total of 1.6GB of data.
I produce a different version of the RSS feed for each user agent (so that I can do some fancy monitoring). Normally that’s bad for caching, but RSS is a little different, since there’s a finite number of user agents fetching RSS feeds. In a typical day, the feed sees 460 different user agents, and 86% of feed requests are still served from CloudFront’s cache.
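If you’re doing something similar, the trick is telling the CDN to cache a separate copy per user agent. Here’s a hypothetical sketch of the origin’s response, not my exact setup:

<?php
// Hypothetical sketch: serve a per-user-agent variant of the feed while
// still letting the CDN cache it (one cached copy per user agent).
$ua = $_SERVER['HTTP_USER_AGENT'] ?? 'unknown';
header('Content-Type: application/rss+xml; charset=utf-8');
header('Cache-Control: public, max-age=300'); // let the CDN serve this copy for five minutes
header('Vary: User-Agent');                   // ...but keep one copy per user agent
$feed = file_get_contents('/var/www/html/rss.xml');
// Stamp the user agent into the feed, e.g. for per-app monitoring:
echo str_replace('</channel>', '<!-- ua: ' . htmlspecialchars($ua) . ' --></channel>', $feed);

With CloudFront specifically, the Vary header alone isn’t enough: you’d also add User-Agent to the cache key in the distribution’s cache policy.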
The RSS stats page shows when podcast aggregator apps come to check on my RSS feed. Overcast, Google and PodcastAddict check roughly every four minutes.
Learning: use WebSub and PodPing
The way RSS feeds work is that someone like Apple Podcasts comes along every so often to check whether I’ve added a new episode. While I can influence the “every so often”, I’ve no actual control over when Apple, or anyone else, comes back to check. It’s very wasteful in terms of bandwidth, too.
WebSub or PodPing are the computer equivalent of me telling you: “Stop asking me all the time whether I’ve something new for you. I’ll tell Bill over there, and Bill will give you a call, OK? Go give him your telephone number.”
So, when I publish a new episode, it appears in many podcast apps instantly, since they use WebSub. Literally: I press the publish button in my own code, it informs the hub, I look at my phone, and there’s the new episode. Because I also support PodPing, services connected to that see new episodes almost immediately too.
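The publisher’s side of WebSub is pleasingly small: when an episode goes out, you POST a ping to the hub your feed advertises, and the hub notifies every subscribed app. A sketch, where the hub and feed URLs are examples rather than my own:

<?php
// Sketch of a WebSub publish ping: tell the hub the feed has changed,
// and the hub notifies every app subscribed to it.
$ch = curl_init('https://pubsubhubbub.appspot.com/'); // whichever hub the feed advertises
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query([
    'hub.mode' => 'publish',
    'hub.url'  => 'https://example.com/rss', // the feed that just changed
]));
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_exec($ch);
curl_close($ch);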
Learning: you’re on your own with stats
Podcast stats are possible with a self-hosted solution, but they’re hard. Normal hit-based server log analysis won’t work, since it counts requests that aren’t actual downloads of audio.
Option one is to use a redirect service. Podtrac, Blubrry and others produce these. I use OP3, which is free (Podnews is a sponsor) and gives good, reliable stats. Here are Podnews’s numbers.
But I can also produce my own bespoke service; and so I did, for a while. Here’s how it works. First, I put my CloudFront logs into an Amazon S3 bucket (the logs all disappear after 90 days, in line with our privacy policy). I’ve configured Amazon Athena to treat that bucket as a giant database.
Every day, I run a cronjob that queries Athena to pull just yesterday’s podcast data into a CSV file. Here’s the current query:
SELECT
  lower(to_hex(md5(to_utf8(requestip)))) AS encoded_ip, -- hash the IP address into a token
  referrer, date, time, uri, bytes, useragent, querystring
FROM cloudfront_logs
WHERE uri LIKE '/audio/pod%'
  AND method = 'GET'
  AND SUBSTR(uri, -3, 1) = 'm' -- third-from-last character is 'm', i.e. a '.mp3' file
  AND "date" = current_date - interval '1' day
ORDER BY time
This gives me a full list of all calls to audio, with hashed tokens instead of IP addresses. I’m querying 2.5GB of data, and each query costs about $0.015.
I then store the resulting CSV file in an S3 bucket, and write its ID into a database.
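Putting those two steps together, here’s roughly what the cronjob does, sketched with the AWS SDK for PHP; the region, database name and bucket are placeholders:

<?php
// Sketch: kick off the daily Athena query from a cronjob. Athena writes its
// results as a CSV to the given S3 location, named after the query-execution
// ID that this call returns.
require 'vendor/autoload.php';

$sql = file_get_contents('daily-query.sql'); // the SELECT shown above
$athena = new Aws\Athena\AthenaClient(['region' => 'us-east-1', 'version' => 'latest']);
$result = $athena->startQueryExecution([
    'QueryString'           => $sql,
    'QueryExecutionContext' => ['Database' => 'default'],
    'ResultConfiguration'   => ['OutputLocation' => 's3://stats-bucket/athena/'],
]);
// This is the ID to write into the database; once the query completes, the
// CSV sits at s3://stats-bucket/athena/{QueryExecutionId}.csv
$queryId = $result['QueryExecutionId'];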
Then, I used to run some ugly PHP that iterated through all the log lines and did the counting and matching. It discarded any request smaller than 750KB (roughly one minute of audio), which means it wouldn’t catch clients who request audio in 250KB chunks, for example; I’m not sure there are many of those. It was fun, but OP3 does a much better job of spotting bots and nonsense traffic.
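For the curious, the counting amounted to something like this; a simplified reconstruction, not the original code:

<?php
// Simplified reconstruction: count one 'download' per listener per episode
// per day, considering only requests of at least 750KB.
$downloads = [];
$fh = fopen('yesterday.csv', 'r');
$header = fgetcsv($fh); // encoded_ip, referrer, date, time, uri, bytes, useragent, querystring
while (($row = fgetcsv($fh)) !== false) {
    $line = array_combine($header, $row);
    if ((int)$line['bytes'] < 750 * 1024) {
        continue; // under ~one minute of audio (and why 250KB-chunk clients are missed)
    }
    // The same hashed IP + user agent + episode counts once, however many requests:
    $downloads[$line['encoded_ip'] . '|' . $line['useragent'] . '|' . $line['uri']] = true;
}
fclose($fh);
echo count($downloads) . " downloads\n";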
Hope that’s interesting to some. Use the ‘contact us’ page if you’ve any questions.
