marxism 5 days ago

I'm working on Video Clip Library (https://videocliplibrary.com), a desktop app for browsing, searching, and managing large collections of video files locally on your computer.

It's essentially a small search engine for videos that runs locally on your laptop. Previously, the system just extracted information about whole video files and maintained a searchable list of those files.

I'm putting the finishing touches on a major architectural change driven by a request from someone who creates highlight reels for professional soccer matches. They needed to tag and search for specific moments within videos - like every goal scored by a particular player across dozens of camera angles from dozens of matches.

This required re-engineering the entire data model to support metadata entries that point to specific segments of a file entity rather than just the file itself.

Instead of treating each file as an atomic unit, I now separate the concept of "content" from the actual files that contain it. This distinction allows me to:

1. Create "virtual clips" - time segments within a video that have their own metadata, tags, and titles - without generating new files

2. Connect multiple files that represent the same underlying content (like a match highlight with different ad insertions for YouTube vs. Twitch)

3. Associate various resolution encodings with the same logical content

For example, a content creator might have multiple versions of the same video with different ad placements for different channels. In the old system, these were just separate, unrelated files. Now, they're explicitly modeled as variants of the same logical content, making organization and search much more powerful.
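
To make the split concrete, here's a rough sketch of the shape of the data model (in TypeScript, with names invented for illustration - this isn't the app's actual schema): one logical "content" record can own many file variants and many virtual clips, and clips reference time ranges rather than files.

    // Illustrative only, not the real schema.
    interface LogicalContent {
      id: string;
      title: string;
      tags: string[];
    }

    interface FileVariant {
      id: string;
      contentId: string;   // -> LogicalContent.id
      path: string;        // the actual file on some drive
      resolution: string;  // e.g. "1080p", "4K"
      channel?: string;    // e.g. "YouTube cut", "Twitch cut"
    }

    interface VirtualClip {
      id: string;
      contentId: string;   // -> LogicalContent.id
      startMs: number;     // time segment inside the content, no new file created
      endMs: number;
      title: string;
      tags: string[];      // e.g. ["goal", "player:17"]
      notes?: string;
    }

A search for "goals by player 17" then returns VirtualClip rows, and any FileVariant of that content (any camera angle or encoding) can serve the footage.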

I also completely reworked the job system, moving from code running in the Electron main process to spawning Temporal in a child process. I know it sounds like overkill, but it's been surprisingly perfect for a desktop app that's processing hundreds of terabytes of video.
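
For the curious, the shape of it is roughly this - a minimal sketch of forking a bundled worker from the Electron main process, where temporal-worker.js is a hypothetical entry point that registers the workflows/activities and connects to the locally running Temporal server:

    import { fork, ChildProcess } from 'node:child_process';
    import path from 'node:path';

    let worker: ChildProcess | null = null;

    export function startTemporalWorker(): void {
      // Run the worker outside the Electron main process so heavy jobs can't
      // block the UI and a crash doesn't take the whole app down with it.
      worker = fork(path.join(__dirname, 'temporal-worker.js'), [], {
        stdio: 'inherit',        // surface worker logs alongside the app's
        env: { ...process.env }, // pass through local config (FFmpeg path, model server URL, ...)
      });

      worker.on('exit', (code) => {
        console.warn(`Temporal worker exited with code ${code}`);
        worker = null;
      });
    }

    export function stopTemporalWorker(): void {
      worker?.kill();
    }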

When you're running complex ML pipelines on random people's home computers, things go wrong - custom FFmpeg builds that don't support certain codecs, permission issues with mounted volumes, a local model server that crashes. With Temporal, when a job fails, I just show users a link to the Temporal Web UI where they can see exactly what failed and why. They can fix their local config and retry just that specific activity instead of starting over. It's cut my support burden dramatically.

My developer experience is so much better too. The face recognition ML pipeline has multiple stages (a small model finds faces, a bigger model does pose detection, an even larger model generates embeddings) and takes minutes to run. With Temporal, I can iterate on just the activity that's failing without rerunning the entire pipeline. Makes developing these data-intensive workflows so much more manageable.
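
Roughly what that looks like with the Temporal TypeScript SDK - activity names and types here are illustrative, not my exact code:

    import { proxyActivities } from '@temporalio/workflow';
    // Hypothetical module exporting the three pipeline stages as activities.
    import type * as activities from './activities';

    const { detectFaces, estimatePoses, generateEmbeddings } = proxyActivities<typeof activities>({
      startToCloseTimeout: '10 minutes',
      retry: { maximumAttempts: 3 },
    });

    export async function faceRecognitionPipeline(videoPath: string): Promise<void> {
      // Each stage is its own activity, so if the pose model crashes halfway
      // through, only that activity is retried or re-run during development -
      // the already-completed detection step isn't executed again.
      const faces = await detectFaces(videoPath);           // small detection model
      const poses = await estimatePoses(videoPath, faces);  // bigger pose model
      await generateEmbeddings(videoPath, faces, poses);    // largest model
    }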

escapecharacter 5 days ago

How do you think of this work relative to other organizing systems like Jellyfin or Plex? Is the major new feature you care about the time subsection clips?

marxism 4 days ago

I like the virtual clips feature because it gets out of the way of how many people actually think about their footage. Before, I was just telling people 'here's your entire video file, good luck.' If you had 5 hours of footage from a sports tournament but your kid was only doing something interesting for 10 minutes total, you were stuck with the whole file.

My perspective is all those hours of raw footage are just raw materials waiting to be shaped into stories, highlights, or presentations. The value is concentrated in a few hotspots.

Jellyfin and Plex appear to have been built on fundamentally different technical assumptions from Video Clip Library. They expect media to remain connected and accessible to the server at all times: when drives disconnect, they often purge those entries from their databases and require a full rescan when the drive is reconnected. It appears Jellyfin only fixed this in Oct 2024.

The reality for many isn't sleek network storage - it's often just a plastic container filled with labeled hard drives sitting in a closet.

Video Clip Library is architected specifically for the archival cold storage workflow where most media is physically offline. The database maintains complete metadata even when drives are disconnected. When you search for 'soccer highlights from 2018,' it not only tells you which file contains that footage but also where the physical drive holding it lives: 'the blue SSD in Alice's desk, bottom drawer'. You can upload pictures of each drive, print out barcodes, write detailed notes. Organization stuff.
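
The shape of that is simple - something like this sketch (names invented, not the actual schema), where files reference drives and drives carry the human-entered location info, so a search hit resolves to a physical location even when the drive is offline:

    interface Drive {
      id: string;
      label: string;         // e.g. "Blue SSD"
      locationNote: string;  // e.g. "Alice's desk, bottom drawer"
      online: boolean;
      photoPath?: string;    // picture of the physical drive
      barcode?: string;
    }

    interface SearchHit {
      clipTitle: string;
      filePath: string;
      drive: Drive;
    }

    function describeHit(hit: SearchHit): string {
      // Offline hits still resolve - the metadata lives in the local database,
      // not on the drive itself.
      return hit.drive.online
        ? `${hit.clipTitle} -> ${hit.filePath}`
        : `${hit.clipTitle} -> ${hit.filePath} (offline: ${hit.drive.label}, ${hit.drive.locationNote})`;
    }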

This workflow doesn't necessarily make sense for full-time professionals with dedicated workstations, but it's ideal for the long-tail use cases that originally drove me to build this software - normal people with occasional video projects. Of course, as is often the case, people bring it to their day job and start pushing for more business-oriented features. But this software was born for the individual creator, the freelancer, or small teams of auteurs collaborating on creative projects. A tool to accommodate the stop-and-start reality of passion projects. A poor man's version of editing with proxies.

escapecharacter 4 days ago

How often do you see yourself updating and editing a particular video clip over time? For a given video, do you find all the relevant clips when you first save it to disk? Or are you accumulating video clips for a source video over time? I’m generally interested in patterns of revisiting source media

marxism 4 days ago

Thanks for the questions!

So far everyone is accumulating clip annotations on video files over time.

I'm thinking of clips as essentially write-only/append-only annotations. Labels or metadata attached to sections of videos rather than new files. The system is designed to support overlapping clips and allows you to filter/view all clips for a video.
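
In rough TypeScript terms (field names assumed, not the actual schema), "viewing a video's clips" is just a filter over an ever-growing annotation list:

    interface ClipAnnotation {
      contentId: string;  // which video the annotation belongs to
      startMs: number;
      endMs: number;
      tags: string[];
      note?: string;
    }

    function clipsForVideo(all: ClipAnnotation[], contentId: string): ClipAnnotation[] {
      return all.filter((c) => c.contentId === contentId);
    }

    function clipsAt(all: ClipAnnotation[], contentId: string, atMs: number): ClipAnnotation[] {
      // Overlaps are allowed: an "emotional response" clip and a "technical
      // note" clip can cover the same moment.
      return clipsForVideo(all, contentId).filter((c) => c.startMs <= atMs && atMs <= c.endMs);
    }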

To clarify, Video Clip Library is purely a search engine - it doesn't composite or edit videos, although it will let you re-encode files to save space. I built it for scenarios like: "I have a catalog of shots from the last five years, and when working on a new project, I might want to reuse B-roll or footage I've already taken." A YouTuber making a "then and now" video can pull up footage from their first year.

For me personally, the virtual clips feature will improve my learning process. I'm not a professional videographer. Naturally I spend time studying work from more skilled creators, trying to understand what makes it effective. I'm excited to take notes on specific moments - "these are the places across many different videos where I feel afraid" or "interesting rack focus technique here" - with notes and tags scoped to their own clips. I was already taking these notes in Obsidian. But it wasn't great.

I find a beauty in the layering: I can create overlapping clips that represent different aspects of the same footage - one layer for emotional responses, another for technical observations. Note: I'm creating these manually, an hour here and an hour there over months, as time allows and interest waxes. I might only annotate a few thousand clips across a couple hundred films in my lifetime. That's ok. I don't need the computer to understand the videos perfectly frame by frame.

The professional use case that prompted this feature is different - teams collect footage, then editors assemble compilations and marketing materials months later. They will run AI models to annotate videos as they're ingested, or apply new models to existing catalogs. Then someone with a creative concept can quickly search: "Do we already have footage that supports this idea or do we need to shoot something new?"