Things I learnt when self-hosting my own podcast
January 14, 2019 · Updated April 20, 2019 · By James Cridland · 5.8 minutes to read
I host my own podcast (it’s called Podnews, a daily news briefing about podcasting and on-demand audio). I’ve learnt quite a lot about self-hosting it, so I thought I’d write down some of the detail.
Here’s my setup: I use a web server on Amazon EC2, which produces an RSS feed. (I wrote the RSS feed generation script, which uses a database). I host the audio on Amazon S3.
Amazon Cloudfront is in front of both of these. Cloudfront is charged based on total bandwidth transferred.
Here are all the tools I use.
Learning: use a content delivery service for audio
I used to think that a content delivery network like Cloudfront wasn’t necessary for a podcast. Most people get their podcasts automatically downloaded overnight, for example, and neither fast speeds nor a consistent transfer rate matter much if you’re downloading unattended.
Things have changed, though: most notably with the advent of Google Podcasts and smart speakers, both of which don’t download in advance but stream podcast audio on demand. Suddenly, it’s important to burst-deliver the first part of a podcast as quickly as you can, to lessen the wait between hitting “play” and hearing the audio. Speed now matters for audio, so a CDN becomes useful.
Secondly, Cloudfront keeps its own logs, and can be configured to drop them all into an Amazon S3 bucket for later analysis. By piping all my traffic through Cloudfront, I get one simple logfile regardless of where everything is actually hosted, which makes analysis much simpler.
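As a sketch of what “one simple logfile” looks like: Cloudfront access logs are tab-separated, with two comment lines at the top. The field positions below follow the standard web-distribution log format, but check the #Fields: header line of your own logs before relying on them; the sample line is made up.

```python
# Minimal sketch of parsing a Cloudfront access log. Field names and
# positions are assumptions based on the standard log format; verify
# against the "#Fields:" header in your own files.
FIELDS = ["date", "time", "edge_location", "sc_bytes", "c_ip",
          "cs_method", "cs_host", "cs_uri_stem", "sc_status"]

def parse_log(text):
    rows = []
    for line in text.splitlines():
        if line.startswith("#"):          # skip the #Version / #Fields headers
            continue
        parts = line.split("\t")
        rows.append(dict(zip(FIELDS, parts)))
    return rows

# An invented sample line, for illustration only:
sample = (
    "#Version: 1.0\n"
    "#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status\n"
    "2019-01-20\t01:13:11\tSYD1\t8043543\t203.0.113.9\tGET"
    "\texample.cloudfront.net\t/audio/ep1.mp3\t200\n"
)
rows = parse_log(sample)
```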
Cloudfront’s “behaviours” allow you to direct different URL patterns to different origin servers, too. This allows me to switch, if I want to, from Amazon S3 to somewhere else to host the audio — and keep the URL the same. This has already been useful (in the early days, I was serving audio from the webserver, rather than S3).
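Conceptually, behaviours are an ordered list of URL patterns, each pointing at an origin, with a catch-all default at the end. A rough sketch, with invented origin names rather than my actual distribution config:

```python
from fnmatch import fnmatch

# Illustrative pattern/origin pairs, checked in order with a catch-all
# default; these names are placeholders, not a real Cloudfront config.
BEHAVIOURS = [
    ("/audio/*", "s3-audio-origin"),   # audio lives on S3
    ("*", "ec2-web-origin"),           # everything else: the web server
]

def pick_origin(uri):
    for pattern, origin in BEHAVIOURS:
        if fnmatch(uri, pattern):
            return origin

# pick_origin("/audio/podnews190114.mp3") routes to the S3 origin;
# pick_origin("/rss") falls through to the web server.
```

Swapping the audio host later means changing one origin entry, while the public URLs stay the same.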
Learning: set Cloudfront to US/EU edge servers only
Cloudfront has thousands of edge servers all over the world: but I don’t use them. Instead, I’ve set Cloudfront to only use US and European edge servers.
I do this for a few reasons. First, every Cloudfront edge server will connect to my server to grab the data, and I want as few computers accessing my server as possible: if the server gets overloaded, that’s bad news for all my visitors.
Second, it’ll still be fast enough: access to US and European internet is pretty speedy from anywhere in the world.
And third: Amazon charge you less. US/EU is $0.085 per GB; every other part of the world is more expensive: South Africa is 30% more, Australia 34%, South America a staggering 294% more. Ain’t nobody got time for that.
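In Cloudfront terms, “US and European edge servers only” is a price class. Here’s the relevant slice of a distribution config, as you’d pass it to something like boto3’s update_distribution; everything else in the config is omitted:

```python
# Config fragment only: PriceClass_100 restricts Cloudfront to the
# cheapest (North America and Europe) edge locations; PriceClass_200
# adds more regions, and PriceClass_All uses everything.
distribution_config = {
    "PriceClass": "PriceClass_100",
    # ... the rest of the DistributionConfig goes here ...
}
```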
The RSS feed
A podcast needs an RSS feed, of course, to function. This gets hit many, many times and is quite a large file. The RSS feed is therefore cached on Cloudfront and is normally fed to clients in a compressed format.
Since the podcast is a timely one, the RSS feed will only be cached (on Cloudfront or other servers) for five minutes. Essentially this means that if there are 4,000 users on the same Cloudfront edge server all asking for the RSS feed, my own server only produces the file once. I suspect there’s a nicer way to do this.
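The five-minute rule is just an HTTP Cache-Control header on the feed response, which Cloudfront (and any other well-behaved cache) honours. A minimal sketch; the function and its header set are illustrative, not my actual feed script:

```python
# Sketch of the response headers the RSS feed script would emit so
# that Cloudfront and downstream caches keep the file for five minutes.
CACHE_SECONDS = 5 * 60   # five minutes

def feed_headers():
    return {
        "Content-Type": "application/rss+xml; charset=utf-8",
        "Cache-Control": f"public, max-age={CACHE_SECONDS}",
    }
```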
Learning: use pub-sub
The way RSS feeds work is that someone like Apple Podcasts comes along every so often to check whether I’ve just added a new podcast. While I can influence the “every so often”, I’ve no actual control over when Apple, or anyone else, comes back to check. It’s very wasteful in terms of bandwidth, too.
Pubsubhubbub is the computer equivalent of me telling you: “Stop asking me all the time whether I’ve something new for you. I’ll tell Bill over there, and Bill will give you a call, OK?”
Bill is a “hub” server. I link to Bill in my RSS feed. If you’re running a podcast app, you can just ask Bill to let you know (“subscribe”) when I’ve “published” something new to my RSS feed. I let Bill know as soon as I update it.
The upshot of this computer gobbledegook is that when I publish a new podcast, it appears on Google Podcasts instantly, since Google uses pubsubhubbub. Literally, I press the publish button in my own code, it informs the hub, I look at my phone, and there’s the new podcast.
Pubsub is probably supported by some others, too. I’d like to learn more.
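The “tell Bill” step is a single form-encoded POST to the hub, per the Pubsubhubbub (now WebSub) spec. A minimal sketch; the feed URL is a placeholder, and the hub shown is Google’s public one:

```python
from urllib.parse import urlencode
from urllib.request import Request

HUB_URL = "https://pubsubhubbub.appspot.com/"   # Google's public hub
FEED_URL = "https://example.com/rss"            # placeholder feed URL

def publish_ping():
    # WebSub publish notification: hub.mode=publish plus the feed URL
    body = urlencode({"hub.mode": "publish", "hub.url": FEED_URL})
    return Request(HUB_URL, data=body.encode("ascii"), method="POST")

req = publish_ping()
# urllib.request.urlopen(req) would actually notify the hub
```

The publish script fires this immediately after updating the feed, and the hub fans the notification out to subscribers.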
Learning: produce multiple pieces of audio
After a bit of thought, I produce two versions of the podcast: one at 80kbps stereo AAC-HE at -16 LUFS, and one at 256kbps stereo MP3 at -14 LUFS. The audio editor I use, Hindenburg Journalist Pro, allows you to configure more than one publishing point, so I produce both automatically at the end of the production process. If I were a little cleverer, I’d get Hindenburg to do the actual uploading to Amazon S3. The fact that it doesn’t is probably helpful, to be honest, given how many times I’ve overwritten my local copy by mistake.
Most people who get the podcast will get the AAC version. It sounds very good, at considerably lower bitrates than the equivalent 128kbps MP3 would be. AAC is supported by virtually everything that supports podcasts these days. AAC and -16 LUFS is what Apple wants; -16 LUFS is also what Google wants. Everyone’s happy.
The MP3 is much higher bandwidth, and would cost me almost five times more to serve. It’s exclusively given to devices that cache or transform my content, which is currently Amazon Alexa and Spotify. As chance would have it, they both also require -14 LUFS, a slightly louder output than the -16 LUFS that Apple require.
I’ve been able to give both Amazon and Spotify different RSS feeds to enable this. Amazon has a separate RSS feed, just containing one episode; Spotify uses the same RSS feed as everyone else, but with a query string identifying it as Spotify. I currently don’t deliver anything separately based on user-agent; that’s hard to do with Cloudfront (and essentially breaks caching).
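The query-string approach can be sketched as one feed generator choosing an enclosure per request. The URLs and the parameter name below are invented for illustration; only the AAC/MP3 split reflects the setup described above:

```python
# Two renditions of the same episode; URLs are placeholders.
ENCLOSURES = {
    "aac": ("https://example.com/audio/ep1.m4a", "audio/mp4"),
    "mp3": ("https://example.com/audio/ep1.mp3", "audio/mpeg"),
}

def enclosure_for(query):
    # e.g. /rss?platform=spotify gets the louder -14 LUFS MP3;
    # everyone else gets the smaller AAC version. The "platform"
    # parameter name is hypothetical.
    variant = "mp3" if query.get("platform") == "spotify" else "aac"
    return ENCLOSURES[variant]
```

Because the variant is part of the URL rather than the user-agent, Cloudfront caches each version separately and correctly.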
Learning: stats are hard, but possible
Podcast stats are possible with a self-hosted solution, but they’re hard. Normal hit-based server log analysis won’t work, since it counts things that aren’t actual downloads of audio.
Option one is to use a redirect service. Podtrac, Blubrry and Chartable all produce these, and I’m using Chartable’s solution for now.
However, I’d prefer a more bespoke service, and am cooking one up. It turns out that if you dump all your Cloudfront logs into an Amazon S3 bucket, you can configure Amazon Athena to treat them as a giant database which you can run SQL queries against. (I’m querying 2.5GB of data, and the cost is about $0.015 per query.) So, here’s a moderately compliant SQL statement (as long as you strip bots out of it):
SELECT count(*), uri, useragent
FROM cloudfront_logs
WHERE "date" BETWEEN DATE '2019-01-20' AND DATE '2019-04-20'
  AND uri LIKE '/audio/%'
  AND bytes > 750000
GROUP BY uri, useragent, requestip
ORDER BY useragent
…this won’t catch people grabbing the podcast in little chunks, but there aren’t very many of those.
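The counting logic in that query can be sketched in plain Python over parsed log rows: keep only audio URLs where enough bytes were served to count as a real download. The row field names are assumptions about the parsed log shape, and this simplified version skips the per-IP grouping the SQL does:

```python
from collections import Counter

def count_downloads(rows, min_bytes=750_000):
    # Count requests per (uri, useragent), ignoring small responses
    # such as range probes. Unlike the Athena query, this doesn't
    # also group by requesting IP, for simplicity.
    counts = Counter()
    for row in rows:
        if row["uri"].startswith("/audio/") and row["bytes"] > min_bytes:
            counts[(row["uri"], row["useragent"])] += 1
    return counts

# Invented sample rows, for illustration:
rows = [
    {"uri": "/audio/ep1.mp3", "useragent": "AppleCoreMedia", "bytes": 8_000_000},
    {"uri": "/audio/ep1.mp3", "useragent": "AppleCoreMedia", "bytes": 8_000_000},
    {"uri": "/audio/ep1.mp3", "useragent": "Spotify", "bytes": 120},  # range probe, ignored
    {"uri": "/rss", "useragent": "Overcast", "bytes": 90_000},        # not audio
]
```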
I’d like to work with others to produce more interesting reports. I think you could do some good reporting just using a clever SQL statement — with the one above, if there’s a separate table with some regex, I reckon you could get this looking rather prettier.
And that’s the lot for now
But if you’ve any questions — since quite a few people ask about my systems — I’d love to help. There are comments just down there.
—James is the Editor of Podnews, and was first involved in podcasting in January 2005.