The 45-minute episode is done. The edit is clean, the master is at -16 LUFS, the file is uploading to the host. And the same question that follows every finished episode: "Are we going to do the clips this week?" At which point one co-host opens TikTok and starts scrolling for inspiration, and the clips don't happen.
The workflow friction isn't motivation. It's the gap between audio editing software, video editing software, caption tools, and the platform-specific export requirements that nobody memorized. This article documents a concrete workflow for getting three publishable vertical clips out of a finished episode without touching a video editor.
Understanding the platform constraints before you export
The three major vertical platforms have slightly different requirements that matter for export settings:
- TikTok: Maximum video length for standard uploads is 10 minutes (as of mid-2025 guidelines for accounts in good standing), but the algorithm heavily favors clips under 60 seconds for discovery reach on non-follower feeds. For a podcast clip, 30–90 seconds is the practical sweet spot. Format: MP4, H.264 baseline or main profile, 1080×1920 (9:16), 30fps. Audio: AAC 128 kbps minimum; higher is accepted. Maximum file size: 287.6 MB for 10-minute uploads; for a 60-second clip, you're well within limits at any reasonable bitrate.
- Instagram Reels: Maximum length 90 seconds. Same 1080×1920 frame, H.264, AAC. Instagram's playback encoder will re-compress your video at upload, so delivering at a reasonably high bitrate (2–4 Mbps video) preserves quality after their compression pass.
- YouTube Shorts: Maximum 60 seconds. Same vertical frame. YouTube's ingest accepts a wider codec range but H.264 + AAC is the safest path for consistent behavior across mobile uploads.
The common thread: 1080×1920, H.264, AAC. Any tool that exports in this format satisfies all three platforms from a single output. The length differences mean your 90-second Instagram Reel and your TikTok clip can be the same file; your YouTube Short needs to be trimmed to 60 seconds or split into a separate export.
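The length rule above can be sketched in a few lines of Python. The limits are the ones quoted in this article; the function and dictionary names are made up for illustration and do not come from any platform API.

```python
from __future__ import annotations

# Maximum clip lengths in seconds, as quoted in the list above.
MAX_SECONDS = {
    "tiktok": 600,
    "instagram_reels": 90,
    "youtube_shorts": 60,
}


def platforms_accepting(duration_s: float) -> list[str]:
    """Platforms that accept a clip of this length without trimming."""
    return [name for name, limit in MAX_SECONDS.items() if duration_s <= limit]


def export_lengths(duration_s: float) -> dict[str, float]:
    """Length each platform's file should have, trimming to the limit if needed."""
    return {name: min(duration_s, limit) for name, limit in MAX_SECONDS.items()}


def distinct_exports(duration_s: float) -> list[float]:
    """Distinct file lengths to render; one entry means one file covers everything."""
    return sorted(set(export_lengths(duration_s).values()))
```

For a 75-second clip, `distinct_exports(75)` returns two lengths (60 and 75): one shared file for TikTok and Reels, plus a trimmed file for Shorts. A clip of 60 seconds or less needs a single export.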
What makes a good podcast clip
Not every good podcast moment makes a good short-form clip. The medium requires different characteristics than the full episode:
Self-contained premise within 5 seconds. The viewer has no context for your show. A clip that starts with "So as I was saying earlier..." is already dead. The best clips start at a moment of assertion or tension — a surprising claim, a strong opinion, a counterintuitive fact — that doesn't require setup from earlier in the episode.
Audible without headphones. Roughly 40–60% of TikTok viewing happens with the phone speaker at low volume or effectively muted with captions on (platform data on muted-video behavior varies by source, but caption usage rates support designing for non-audio consumption). This means the captions, not just the audio, have to carry the content. A clip where the host makes a complex numerical argument that doesn't land visually in the caption will underperform.
Clean visual framing for the vertical safe zone. Each platform overlays UI elements at the bottom (action bar, caption area) and top (profile handle, notification area) of the 1080×1920 frame. The safe zone for important content is roughly the middle 65% of the vertical height, approximately y: 340px to y: 1580px in a 1920px tall frame. Any waveform visualization, speaker label, or face cam should sit in this zone. Text captions should not touch the top 17% or bottom 15% of the frame.
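To turn those margins into pixel values you can check a layout against, here is a minimal sketch. It assumes the 17% and 15% caption margins and the 340–1580px content band quoted above; the class and function names are illustrative only.

```python
from dataclasses import dataclass

FRAME_HEIGHT = 1920  # pixels, for a 1080x1920 vertical frame


@dataclass(frozen=True)
class Zone:
    top_px: int
    bottom_px: int

    @property
    def height_px(self) -> int:
        return self.bottom_px - self.top_px

    def contains(self, y_top: int, y_bottom: int) -> bool:
        """True if an element spanning y_top..y_bottom sits fully inside the zone."""
        return self.top_px <= y_top and y_bottom <= self.bottom_px


def caption_zone(frame_height: int = FRAME_HEIGHT,
                 top_pct: int = 17, bottom_pct: int = 15) -> Zone:
    """Pixel band left after excluding the top and bottom caption margins."""
    top = frame_height * top_pct // 100
    bottom = frame_height * (100 - bottom_pct) // 100
    return Zone(top, bottom)


# The content band quoted in the text for waveforms, labels, and face cams.
CONTENT_ZONE = Zone(340, 1580)
```

With the defaults, `caption_zone()` spans y = 326 to y = 1632. A caption block placed at y = 1400 to 1580 passes `contains`, while one at y = 1600 to 1700 runs into the bottom margin and fails.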
A two-host comedy show's clip workflow
A two-host comedy podcast — weekly episodes, around 12k monthly downloads, active on TikTok and Instagram Reels — developed a consistent clip routine that runs in under 20 minutes after episode completion. The show records remotely: one host in Nashville, one in Memphis. Their episode structure is conversational, 40–55 minutes, no fixed segment format.
Their process: immediately after the edit is signed off, one host scans the episode timeline for timestamp candidates. They're looking for three specific types: a hot take (strong opinion delivered in a single statement), a disagreement moment (brief back-and-forth where both hosts have energy), and a story beat (one host telling a specific anecdote with a clear beginning-middle-end within 60 seconds). These three types tend to work consistently across episode topics.
Once candidates are identified, the clip tool in Rebel Audio uses the session's transcript to let them navigate directly to those timestamps. They select the clip range, review the suggested caption text (the transcript segment automatically populates the caption layer), adjust the caption placement to sit above the bottom safe zone, and check that the waveform bars — which animate during speech — are not covering the caption text.
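The transcript-to-caption step can be pictured with a small sketch. This is not Rebel Audio's data format or API, which the article does not describe; the `Segment` type and `caption_segments` function are hypothetical and assume a transcript stored as timed text segments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: float  # seconds from the start of the episode (or clip, once rebased)
    end: float
    text: str


def caption_segments(transcript, clip_start, clip_end):
    """Return the transcript segments overlapping the clip, with times
    clamped to the clip range and rebased so the clip starts at 0."""
    selected = []
    for seg in transcript:
        if seg.end <= clip_start or seg.start >= clip_end:
            continue  # segment lies entirely outside the clip
        selected.append(Segment(
            start=max(seg.start, clip_start) - clip_start,
            end=min(seg.end, clip_end) - clip_start,
            text=seg.text,
        ))
    return selected
```

Selecting 1203.0 to 1209.0 seconds from an episode transcript yields only the segments that overlap that window, timed from the start of the clip, which is the form a caption layer needs.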
Export settings: TikTok/Reels export (single output covers both platforms), 1080×1920, H.264 main, 3 Mbps video, AAC 128 kbps. YouTube Shorts gets trimmed to 60 seconds and exported separately. Three clips, three exports. Total time from episode completion to clips in the upload queue: 18–22 minutes on a normal week, closer to 30 minutes when the episode had less natural clip structure.
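If you need to reproduce these settings outside a clip tool, the sketch below builds an equivalent ffmpeg command. Using ffmpeg is an assumption made for illustration, not the show's actual tooling. The sketch also assumes an already rendered 9:16 source video, and the file names are placeholders. The 30fps value comes from the platform list earlier in the article.

```python
# Bitrates from the export settings above.
VIDEO_KBPS = 3000
AUDIO_KBPS = 128


def build_export_command(src, dst, start_s, duration_s):
    """Return an ffmpeg argument list for one vertical clip export."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),        # seek to the clip start in the source
        "-t", str(duration_s),      # limit how much of the source is read
        "-i", src,
        "-vf", "scale=1080:1920",   # assumes the source is already 9:16
        "-r", "30",
        "-c:v", "libx264", "-profile:v", "main",
        "-b:v", f"{VIDEO_KBPS}k",
        "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-b:a", f"{AUDIO_KBPS}k",
        "-movflags", "+faststart",
        dst,
    ]


def estimated_size_mb(duration_s):
    """Rough file size in MB (1 MB = 1000 kB) from the combined bitrate."""
    return (VIDEO_KBPS + AUDIO_KBPS) * duration_s / 8 / 1000
```

Passing the returned list to `subprocess.run(cmd, check=True)` would run the export on a machine with ffmpeg installed. At these bitrates a 60-second clip comes to about 23.5 MB, far below the 287.6 MB ceiling mentioned earlier.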
Caption burn-in versus SRT files
Vertical clip captions exist in two forms: burned into the video as a pixel layer (hard captions) or delivered as a separate .srt file (soft captions) that the platform renders on top of the video at playback.
For TikTok and Instagram Reels, hard captions are the correct choice for organic-reach clips. Platform-rendered soft captions from SRT files are available for accessibility compliance but use the platform's default caption styling — usually a white sans-serif on a semi-transparent strip that looks generic and often misaligns with the video's visual design. Burned-in captions with branded styling (font, color, word-by-word highlight animation) consistently outperform SRT captions in watch-through metrics among creators who've tested both.
YouTube Shorts accepts both, and for Shorts the platform's auto-caption system is good enough that many creators skip manual captions entirely and let YouTube's speech recognition handle it. This is a reasonable time-saving choice for YouTube specifically — less so for TikTok, where caption accuracy and styling are more directly tied to clip performance.
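For the soft-caption path, an .srt file is plain text: a running index, a start and end timestamp in HH:MM:SS,mmm form, then the caption text, with a blank line between cues. A minimal sketch of writing one, assuming cues arrive as (start, end, text) tuples timed in seconds from the start of the clip (the function names are illustrative):

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(cues):
    """Render (start_s, end_s, text) cues as the contents of an .srt file."""
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        )
    return "\n".join(blocks)
```

The same cue list could also feed a burn-in renderer, so producing both forms from one transcript selection costs little extra.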
One note on sizing: burned-in caption text should be rendered at a size that's readable on a phone screen without scaling, typically 36–48px equivalent in a 1080px wide frame for one to three words per line and 28–36px for longer caption lines. Text too small reads as a subtitle; text too large crowds the frame and looks amateurish. The sweet spot for podcast clip captions is approximately 2–3 words per display beat, sized to fill roughly 40–50% of the frame width per line.
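The 2–3 words per display beat guideline is easy to apply mechanically. The sketch below is one possible approach, not how any particular caption tool does it: it chunks a caption into three-word beats and rebalances a trailing single word so that no beat is left with just one.

```python
def split_into_beats(text, max_words=3):
    """Split caption text into display beats of at most max_words words."""
    words = text.split()
    beats = [words[i:i + max_words] for i in range(0, len(words), max_words)]
    # Avoid a one-word final beat by borrowing a word from the previous beat.
    if len(beats) >= 2 and len(beats[-1]) == 1 and len(beats[-2]) == max_words:
        beats[-1].insert(0, beats[-2].pop())
    return [" ".join(beat) for beat in beats]
```

A seven-word line such as "Nobody memorized the platform export requirements anyway" becomes three beats of three, two, and two words. Timing each beat still depends on word-level timestamps from the transcript, which this sketch leaves out.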
The workflow trap to avoid
We are not saying that vertical clips should take priority over the episode itself. We are saying that the clips should come from the same session workflow as the episode, not from a separate tool stack that requires re-importing audio and re-editing for a different format.
The common failure mode: finishing the episode edit in a DAW, exporting a mixed-down MP3, uploading that to a distribution platform, then opening a separate video editor with the MP3, adding a static waveform animation background, rendering, and uploading to TikTok. This workflow takes 45–60 minutes and produces clips that look like podcast audiograms — static backgrounds with a bouncing line — rather than the vertical-native format that the platforms actually reward with reach.
The difference between "I'll do the clips tomorrow" and "the clips are already done" is usually not effort — it's whether the clip creation step is adjacent to the edit step or requires rebuilding from scratch. For a show releasing weekly, that adjacency compounds over 52 episodes per year.
Measuring whether the clips are working
Short-form clip performance from podcast content follows a specific pattern: most clips from a given episode will generate 200–2,000 views; a small percentage will enter the platform's recommendation system and generate 10k–100k+ views. The ratio varies widely by niche, audience size, and how well the clip matches the platform's current engagement patterns.
For a show at 12k monthly downloads, a successful clip cycle that adds new listeners equal to 5–10% of that base (per platform attribution, which is imprecise but directionally useful) means 600–1,200 additional monthly listeners from the clip channel. That's a meaningful growth rate for an indie show at that stage. The metric to track is not raw clip views but the subscribe-on-click or episode-play events that the platform attribution window captures, which requires setting up a link in the profile that points to an episode landing page or the show's RSS link.
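The arithmetic behind that range is simple enough to rerun with your own numbers. A small sketch follows, using the 5–10% range above as defaults. The function name is made up, and treating downloads as a stand-in for listeners is the same simplification the estimate above makes.

```python
def added_listeners(monthly_downloads, low_pct=5, high_pct=10):
    """Return (low, high) additional monthly listeners for a conversion range
    given in whole percentages of the current download base."""
    low = monthly_downloads * low_pct // 100
    high = monthly_downloads * high_pct // 100
    return low, high
```

For the show described here, `added_listeners(12_000)` gives (600, 1200). A show at 3,000 monthly downloads would be looking at 150 to 300 under the same assumption.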
The shows that get consistent growth from clips are not the ones with the highest production quality in the clips themselves. They're the ones who post consistently — three clips per episode, every episode, without skipping weeks when the episode "doesn't have good clip material." The algorithm rewards consistency more than any single viral moment. Getting the clip workflow under 20 minutes makes consistency possible at a weekly cadence without it consuming the host's entire production day.