Things I learnt when self-hosting my own podcast
This article is at least a year old
I host my own podcast (it’s called Podnews podcasting news, a daily news briefing about podcasting and on-demand). I’ve learnt quite a lot about self-hosting it, so I thought I’d write down some of the detail.
Here’s my setup. I use a web server on Amazon Lightsail, which produces an RSS feed. I wrote the RSS feed generation script, which uses a database. I host the audio on Amazon S3.
Amazon Cloudfront is in front of both of these. Cloudfront is charged based on total bandwidth transferred.
(Here are all the tools I use.)
Learning: use a content delivery service for audio
I used to think that a content delivery network like Cloudfront wasn’t necessary for a podcast. Most people get their podcasts automatically downloaded overnight, for example, and fast speed or even a consistent transfer rate isn’t necessary if you’re downloading unattended.
Things have changed, though: most notably, the advent of Google Podcasts and smart speakers, both of which don’t download but rather stream podcast audio on-demand. Suddenly, it’s important to burst-deliver the first bit of a podcast as quickly as you can (to lessen the wait between hitting “play” and hearing the audio). Speed is now important for audio, so a CDN becomes useful.
Secondly, when you use tools like WebSub or PodPing to “announce” to podcast apps that you’ve a new episode ready, the resulting burst of traffic can be hard work for a server, as potentially hundreds of podcast apps all jump to download the audio as soon as they can.
Cloudfront also keeps its own logs, and can be configured to drop them all into an Amazon S3 bucket for analysis later. By piping all my traffic through Cloudfront, I have one simple logfile regardless of where all the stuff is.
Cloudfront’s “behaviours” allow you to direct different URL patterns to different origin servers, too. This allows me to switch, if I want to, from Amazon S3 to somewhere else to host the audio — and keep the URL the same. This has already been useful (in the early days, I was serving audio from the webserver, rather than S3).
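As a rough illustration of how pattern-based routing like this behaves, here is a minimal Python sketch. It assumes an ordered list of path patterns where the first match wins and a catch-all default comes last. The patterns and origin hostnames are hypothetical, not my real configuration.

```python
from fnmatch import fnmatchcase

# Ordered (path pattern, origin) pairs. All names here are hypothetical.
BEHAVIOURS = [
    ("/audio/*", "audio-bucket.s3.amazonaws.com"),
    ("*", "webserver.example.com"),  # default behaviour: matches everything else
]


def pick_origin(path, behaviours=BEHAVIOURS):
    """Return the origin of the first behaviour whose pattern matches the path."""
    for pattern, origin in behaviours:
        if fnmatchcase(path, pattern):
            return origin
    raise LookupError(f"no behaviour matches {path!r}")


# Moving the audio elsewhere only changes the origin, never the public URL.
MOVED = [("/audio/*", "new-audio-host.example.net"), ("*", "webserver.example.com")]

print(pick_origin("/audio/pod-example.mp3"))
print(pick_origin("/audio/pod-example.mp3", MOVED))
```

The listener-facing path stays the same in both calls; only the place the CDN fetches from differs.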
For the US and Europe, where the majority of my traffic is, Cloudfront is a little cheaper in terms of bandwidth, too. Additionally, you can keep pricing down by restricting Cloudfront to US/EU only. Some areas are significantly more expensive.
Our podcast stats page calculates the cost of serving our audio.
The RSS feed
A podcast needs an RSS feed, of course, to function. This gets hit many, many times and is quite a large file. The RSS feed is therefore cached on Cloudfront and is normally fed to clients in a compressed format.
Podnews’s main RSS feed is served just over 19,000 times a day; a total of 1.6GB of data.
I produce a different version of the RSS feed for each user agent (so that I can do some fancy monitoring). This is normally a bad thing for caching, but RSS is a little different, given that there’s a finite number of user agents for RSS feeds. In a typical day, that feed sees 460 different user agents, and 86% of my RSS feed requests are still served from Cloudfront’s cache.
The RSS stats page shows when podcast aggregator apps come to check on my RSS feed. Overcast, Google and PodcastAddict check roughly every four minutes.
Learning: use WebSub and PodPing
The way RSS feeds work is that someone like Apple Podcasts comes along every so often to check whether I’ve just added a new podcast. While I can influence the “every so often”, I’ve no actual control over when Apple, or anyone else, comes back to check. It’s very wasteful in terms of bandwidth, too.
WebSub or PodPing are the computer equivalent of me telling you: “Stop asking me all the time whether I’ve something new for you. I’ll tell Bill over there, and Bill will give you a call, OK? Go give him your telephone number.”
So, when I publish a new podcast, it appears on Google Podcasts instantly, since Google uses WebSub. Literally, I press the publish button in my own code, it informs the hub, I look at my phone, and there’s the new podcast. Because I also support PodPing, services connected to that also see new episodes almost immediately.
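For the curious, the “inform the hub” step can be sketched in a few lines of Python. This is a minimal sketch, not my actual publishing code. The WebSub specification leaves the publisher-to-hub notification up to each hub; the form-encoded hub.mode=publish request below is the convention many public hubs accept, so check what your hub expects. Both URLs are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_publish_ping(hub_url, feed_url):
    """Build (but do not send) the POST telling a hub that a feed has changed."""
    body = urlencode({"hub.mode": "publish", "hub.url": feed_url}).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical hub and feed URLs, for illustration only.
ping = build_publish_ping("https://hub.example.com/", "https://example.com/feed.xml")
print(ping.get_method(), ping.full_url)
print(ping.data.decode("ascii"))
# Sending it for real would be: urllib.request.urlopen(ping)
```

The feed also has to advertise its hub (a link element with rel="hub") so that apps know where to subscribe.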
Learning: produce multiple pieces of audio
I produce three versions of the podcast…
A 48kbps MP3 (mono) at -14 LUFS
An AAC-HE file at 56kbps (stereo) at -16 LUFS
An Opus file at 16kbps (mono) at -14 LUFS
(You’ll be surprised just how good the 16kbps version sounds, I’ll bet).
Apps that are on an allowlist get the AAC version. Almost everyone else gets the MP3 version, except a few apps on KaiOS, which get the Opus version. This enables me to keep my costs down (and my listeners’ costs, too).
I produce these with the audio editor I use, Hindenburg Journalist Pro, which allows you to configure more than one publishing point, and I therefore produce these three automatically at the end of the production process. I attach cover images using a complicated bit of AppleScript.
AAC is supported by virtually everything that supports podcasts these days. AAC and -16 LUFS is what Apple wants. Everyone’s happy. (I’ve written a thing about LUFS and loudness).
The MP3 version is sent to devices where I’m unsure whether they support AAC, and to devices that I’ve not recognised. The Opus file is given by default to apps on KaiOS (a special operating system for developing countries).
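The selection rule can be sketched as a small Python function: MP3 is the safe fallback, and a client only gets something else if its user agent is recognised. This is a sketch under stated assumptions. The user-agent substrings are hypothetical placeholders, and the real allowlist is not shown here.

```python
# Hypothetical user-agent substrings; the real lists are not reproduced here.
AAC_ALLOWLIST = ("ExampleAACPlayer", "AnotherAACApp")
OPUS_MARKERS = ("KaiOS",)


def pick_format(user_agent):
    """Choose which audio file to serve for a given user agent."""
    ua = user_agent or ""
    if any(marker in ua for marker in OPUS_MARKERS):
        return "opus"  # small file for low-bandwidth devices
    if any(marker in ua for marker in AAC_ALLOWLIST):
        return "aac"   # only for apps known to play it
    return "mp3"       # unknown or unrecognised clients get the safe option


print(pick_format("ExampleAACPlayer/1.0"))
print(pick_format("SomeUnknownApp/0.1"))
```

Defaulting to MP3 means a new or unrecognised app is never handed a file it cannot play.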
All versions are also in the alternateEnclosure tag, which is selectable on a few different podcast apps.
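A minimal Python sketch of what that looks like inside a feed item, using the podcast namespace’s alternateEnclosure element with a source child. The URLs, file lengths and MIME types below are placeholders for illustration, not the values from my feed.

```python
import xml.etree.ElementTree as ET

PODCAST_NS = "https://podcastindex.org/namespace/1.0"
ET.register_namespace("podcast", PODCAST_NS)


def alternate_enclosure(uri, mime_type, length_bytes, bits_per_second):
    """Build one podcast:alternateEnclosure element with a single source."""
    enclosure = ET.Element(
        f"{{{PODCAST_NS}}}alternateEnclosure",
        {
            "type": mime_type,
            "length": str(length_bytes),
            "bitrate": str(bits_per_second),
        },
    )
    ET.SubElement(enclosure, f"{{{PODCAST_NS}}}source", {"uri": uri})
    return enclosure


# Placeholder URLs and sizes, for illustration only.
item = ET.Element("item")
item.append(alternate_enclosure("https://example.com/audio/ep.mp3", "audio/mpeg", 2400000, 48000))
item.append(alternate_enclosure("https://example.com/audio/ep.m4a", "audio/aac", 2800000, 56000))
item.append(alternate_enclosure("https://example.com/audio/ep.opus", "audio/opus", 800000, 16000))

print(ET.tostring(item, encoding="unicode"))
```

Apps that understand the tag can offer the listener a choice; apps that don’t simply ignore it and use the normal enclosure.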
(And in fact, I also produce another version of the podcast: an ad-free version with a little less tech-talk in it, for use by Apple’s Siri service and by Podcast Radio in the UK, which also makes the podcast available in the UK on the Radioplayer app. That’s produced in MP3 for Podcast Radio, and in AAC for Siri.)
In total, in a typical day, I see 3,200 requests for audio, and a total of just over 7GB of data transfer.
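To put those numbers in perspective, here is a small worked calculation in Python. The request and transfer figures are the ones quoted in this article; the per-gigabyte price is an assumption for illustration only (prices vary by region and change over time), not the figure our stats page uses.

```python
FEED_GB_PER_DAY = 1.6
AUDIO_GB_PER_DAY = 7.0            # "just over 7GB", rounded down
FEED_REQUESTS_PER_DAY = 19_000
AUDIO_REQUESTS_PER_DAY = 3_200
ASSUMED_USD_PER_GB = 0.085        # assumption: check current CDN pricing


def average_kb(gigabytes, requests):
    """Average transfer per request, in kilobytes (1GB taken as 1,000,000KB)."""
    return gigabytes * 1_000_000 / requests


def daily_cost_usd(gigabytes, usd_per_gb):
    """Bandwidth cost for one day at a flat per-gigabyte price."""
    return gigabytes * usd_per_gb


feed_kb = average_kb(FEED_GB_PER_DAY, FEED_REQUESTS_PER_DAY)      # about 84KB
audio_kb = average_kb(AUDIO_GB_PER_DAY, AUDIO_REQUESTS_PER_DAY)   # about 2,190KB
cost = daily_cost_usd(FEED_GB_PER_DAY + AUDIO_GB_PER_DAY, ASSUMED_USD_PER_GB)

print(f"feed: {feed_kb:.0f}KB per request, audio: {audio_kb:.0f}KB per request")
print(f"bandwidth at the assumed price: ${cost:.2f} a day")
```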
Learning: you’re on your own with stats
Podcast stats are possible with a self-hosted solution, but are hard. Normal hit-based server log analysis won’t work, since it counts things that aren’t actual downloads of audio.
Option one is to use a redirect service. Podtrac, Blubrry and others produce these. I was using Spotify’s Chartable for some time.
However, I wanted to produce my own, bespoke, service; and so I did. Here’s how it works. First, I put my Cloudfront logs into an Amazon S3 bucket (the logs all disappear after 90 days, following our privacy policy). I’ve configured Amazon Athena to see this as a giant database.
Every day I run a cronjob to make a query to Athena, to pull just yesterday’s podcast data into a CSV file. Here’s the current query:
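SELECT
    -- hash each IP address into a token
    lower(to_hex(md5(to_utf8(requestip)))) AS encoded_ip,
    referrer, date, time, uri, bytes, useragent, querystring
FROM cloudfront_logs
WHERE
    -- GET requests for podcast audio whose third-from-last character is 'm' (as in .mp3)
    (uri LIKE '/audio/pod%'
        AND method = 'GET'
        AND SUBSTR(uri, -3, 1) = 'm')
    -- yesterday's log lines only
    AND "date" = current_date - interval '1' day
ORDER BY time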
This gives me a full list of all requests for audio, with tokens instead of IP addresses. I’m querying 2.5GB of data, and the cost is about $0.015 per query.
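The daily job could be sketched in Python like this. It is a minimal sketch, not my actual cronjob. It assumes an Athena client created with the third-party boto3 library (boto3.client("athena")) is passed in, and the database name and S3 output location are hypothetical. Athena writes the result as a CSV file named after the query execution ID, which is the ID worth storing.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=2.0):
    """Start an Athena query, wait for it to finish, and return its execution ID."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)  # still QUEUED or RUNNING


# Usage (needs boto3 and AWS credentials; names below are hypothetical):
#   import boto3
#   query_id = run_athena_query(
#       boto3.client("athena"), DAILY_SQL, "default", "s3://example-bucket/results/"
#   )
```

Passing the client in as a parameter keeps the function easy to test with a stand-in object.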
I then store the resulting CSV file in an S3 bucket, and write its ID into a database.
Then, the podcast stats page is some ugly PHP that iterates through all the log lines, and does the counting and matching. It discards requests smaller than 750KB (which is roughly one minute of audio). That currently means it won’t catch any clients who request audio in 250KB chunks, for example. I’m not sure there are too many of those, but I should look.
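The counting step might look something like this in Python (the real thing is PHP, and this is only a sketch of the idea). It assumes the CSV has the columns from the query above, treats 750KB as 750,000 bytes, and, as an assumption of mine for illustration, counts each combination of IP token, user agent and file only once per day. The sample rows are made up.

```python
import csv
import io

MIN_BYTES = 750_000  # assumption: 750KB taken as 750,000 bytes


def count_downloads(csv_text, min_bytes=MIN_BYTES):
    """Count downloads per file, ignoring small requests and repeat listeners."""
    seen = set()
    counts = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if int(row["bytes"]) < min_bytes:
            continue  # too small to count as a download
        key = (row["encoded_ip"], row["useragent"], row["uri"])
        if key in seen:
            continue  # same listener and file already counted
        seen.add(key)
        counts[row["uri"]] = counts.get(row["uri"], 0) + 1
    return counts


# Made-up sample data in the shape of the daily CSV.
SAMPLE = """encoded_ip,referrer,date,time,uri,bytes,useragent,querystring
aaa,-,2021-01-01,00:00:01,/audio/pod1.mp3,2400000,AppA,-
aaa,-,2021-01-01,00:05:00,/audio/pod1.mp3,2400000,AppA,-
bbb,-,2021-01-01,00:06:00,/audio/pod1.mp3,250000,AppB,-
ccc,-,2021-01-01,00:07:00,/audio/pod2.mp3,900000,AppC,-
"""

print(count_downloads(SAMPLE))
```

The 250,000-byte row is dropped, which is exactly the chunked-request blind spot mentioned above.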
Hope that’s interesting to some. Use the ‘contact us’ page if you’ve questions.