| ▲ | peatmoss 4 hours ago |
| I recently bought a tablet for sheet music, mostly to replace a stack of jazz "Real Books" at jam sessions. The phone camera scans I made are okay, but they are fixed in size and have a lot of artifacts. It would also be great to transpose on the fly for e.g. Bb or Eb instruments, but with a scan this is obviously not possible. I got digging into the state of optical music recognition and came away concluding that music is basically a greenfield for AI wherever you look. Optical music recognition is pretty terrible. AI understanding of music theory is terrible (actually looking at music, that is; LLMs do okay at text descriptions of theory concepts where you can imagine some online texts making it in). I think the issue is that we still don't have great digital formats that encode the dots on paper that musicians read. Music notation is pretty rich. MIDI doesn't capture all of what's needed for symbolic understanding, because it was mostly made for capturing aspects relevant for playback or performance. MusicXML seems to be the closest to a digital format that encodes the information a musician would want, but there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio. I think that's because MusicXML falls short of encoding enough information to engrave music. Tools like MuseScore need to track a bunch of layout information that isn't encodable in MusicXML. Lilypond format is less verbose than MusicXML and contains a bit more information that is useful to score creators, but most people don't create sheet music in Lilypond. (As an aside, Lilypond bums me out with the state of jazz fonts. I hate looking at "legit" scores in a jazz context.) I realize this is mildly off topic, but every time I see people making incremental gains on OCR, which to my mind is pretty good, I am reminded of how abysmal OMR is. |
|
| ▲ | kwon-young 3 hours ago | parent | next [-] |
| So, the format used by musicologists and music researchers is MEI: https://music-encoding.org/ for which the reference engraver is Verovio: https://www.verovio.org/index.xhtml
Note that Verovio can engrave to SVG while keeping as much information as possible from the original MEI score, meaning that you can extract enough metadata to create an actual detection dataset for a deep learning model.
This is my horrible hacked-up script that creates a COCO dataset from a set of scores engraved with Verovio: https://github.com/kwon-young/music/blob/main/svg2pl.py
I have published a synthetic music score dataset from this: https://www.kaggle.com/datasets/kwonyoungchoi/trompa-coco/da...
If anyone wants to try fitting a detector on top of it, they are welcome to :) OMR is so neglected because most people vastly underestimate the difficulty of the task.
It has a specific blend of the most extreme shapes combined with an extremely complicated graphical grammar... |
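For readers who have not met the COCO layout mentioned above, here is a minimal sketch of what a detection dataset for engraved scores could look like. It is not the linked svg2pl.py script. The `Symbol` tuple, the file name, and the class names are hypothetical, and the boxes are assumed to have already been measured from the engraved SVG.

```python
import json
from typing import NamedTuple


class Symbol(NamedTuple):
    """One engraved glyph with its pixel bounding box (hypothetical input)."""
    label: str      # e.g. "notehead", "clef"
    x: float        # left edge
    y: float        # top edge
    width: float
    height: float


def to_coco(file_name, page_width, page_height, symbols):
    """Build a COCO-style detection dict for a single page image."""
    labels = sorted({s.label for s in symbols})
    category_id = {label: i + 1 for i, label in enumerate(labels)}
    return {
        "images": [{"id": 1, "file_name": file_name,
                    "width": page_width, "height": page_height}],
        "categories": [{"id": cid, "name": label}
                       for label, cid in category_id.items()],
        "annotations": [
            {
                "id": i + 1,
                "image_id": 1,
                "category_id": category_id[s.label],
                # COCO boxes are [x, y, width, height] from the top-left corner.
                "bbox": [s.x, s.y, s.width, s.height],
                "area": s.width * s.height,
                "iscrowd": 0,
            }
            for i, s in enumerate(symbols)
        ],
    }


# Made-up page with three symbols, purely for illustration.
page = to_coco(
    "score_page_1.png", 2100, 2970,
    [Symbol("clef", 120, 300, 60, 160),
     Symbol("notehead", 400, 340, 28, 22),
     Symbol("notehead", 470, 329, 28, 22)],
)
print(json.dumps(page["annotations"][1], indent=2))
```

The hard part of OMR is the graphical grammar that relates these boxes to each other, which a flat list of detections like this does not capture.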
| |
| ▲ | peatmoss 2 hours ago | parent [-] | | Thank you for this! Both the MEI format and the Verovio engraver are news to me. I will check them out. My first thought was whether MEI support is being added to MuseScore (the sheet music editor I use these days). It looks like it is: https://music-encoding.org/musescore-doc/ As a somewhat related aside, now that the MuseScore people own Hal Leonard and seem to be pushing integration with their cloud subscription service, I wonder if they'll see some of these directions as potentially competing with them. I don't think there's anyone who wouldn't love a transposable clean digital version of their Real Books... and if Hal Leonard is in the business of selling Real Books, I can see where good OMR might be a problem for them. I guess piracy of scanned versions is already rampant, so maybe it's a wash. |
|
|
| ▲ | indiv0 3 hours ago | parent | prev | next [-] |
| > music is basically a greenfield for AI wherever you look AIN'T THAT THE TRUTH. My girlfriend is studying musicology and she has some physical disabilities that make it difficult for her to write things down sometimes. So I try to help her by writing some AI-powered TTS/OCR/etc. apps here and there. It becomes painfully obvious that music was never considered an important part of any AI training dataset, anywhere. These days, I'm pleasantly surprised by how well Opus 4.8 understands/explains music theory (as you said). But ask him to transcribe/OCR/OMR some sheet music and he'll confidently give you the MusicXML/Lilypond equivalent of "2 + 2 = horse". I really hope this ignored area will be swept up with the rest of the rising AI wave, but it's still criminally undervalued. |
| |
| ▲ | peatmoss 2 hours ago | parent [-] | | I recently left a job where I was working with open data producers / providers across a lot of domains. A lot of data is produced and released for free by governments and nonprofits because it's either directly part of the mission, or it's a natural byproduct of the organization's mission. Occasionally, you'd have really great datasets come out of industry / commercial organizations because the data were a byproduct and releasing them didn't create an opportunity for competition. I've been thinking about what kind of organization could be self-sustaining and also produce good music AI training data as a natural byproduct. An ideal arrangement would be something that provided some incentive or benefit to musicians in exchange for their recorded interpretation of sheet music. Soundslice, mentioned by another user, seems to do that. They let both teachers and students upload recordings of music that has been turned into MusicXML. The recordings, paired with those snippets of sheet music, have to be a gold mine, assuming they have enough users. If they aren't already working on stem separation and automatic transcription, they probably should be. Still, my hope would be to figure out some kind of sustainable model where that dataset could be created and released for open model development... As a domain, I see AI in music as a boon to human creativity. I am very much a novice jazz improvisor, and a passable amateur technician on the trombone. Human instructors can do a lot for me, but there's a lot that is "grinding it out" repetition, where I think AI could be a huge aid. I heard Sam Harris on a podcast recently talk about his bullishness on the humanities (paraphrasing: people don't care if a human reads their MRI if detection is good, but people probably do care that a human wrote the novel they're reading). Music might even be a better example of the irreplaceability of people. 
While some people might bop along to a tune composed by Suno on the radio, live music is just so much more enjoyable for me. And even better than listening to a live show played by masters, is playing together with friends. To the extent that AI can patiently help us learn the skills to express our own creativity, I'm here for it! |
|
|
| ▲ | elasticdog 2 hours ago | parent | prev | next [-] |
| For just chord analysis, there's "Harte notation", which is meant to be an unambiguous representation of the notes (https://ismir2005.ismir.net/proceedings/1080.pdf). That obviously doesn't get you all of the additional information necessary for engraving and full representation of the music, but there are research datasets available that use it, like https://github.com/smashub/choco. I've also used the https://github.com/MarkGotham/When-in-Rome dataset for some analysis work, but again that's not 100% what you're looking for. You might like the "iReal Pro" app for the replacement and transposition of jazz standards on your tablet. It's pretty great for that use case versus camera scans. |
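As a rough illustration of why a symbolic chord format makes the Bb/Eb transposition from the top comment easy, here is a small sketch. It assumes the Harte-style shape `root:quality/bass`, with the bass written as a degree relative to the root (my reading of the linked paper), so only the root has to move. The regular expression is a simplification, and the fixed flat-leaning `SPELLING` table is my own assumption rather than proper enharmonic spelling.

```python
import re

# Pitch classes of the natural notes, in semitones above C.
NATURAL_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
# One fixed spelling per pitch class (assumption: prefer flats).
SPELLING = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]

HARTE = re.compile(
    r"^(?P<root>[A-G][#b]*)(?::(?P<quality>[^/]+))?(?:/(?P<bass>.+))?$"
)


def parse(chord):
    """Split a Harte-style label into (root, quality, bass degree)."""
    if chord == "N":                      # "N" marks "no chord"
        return None
    match = HARTE.match(chord)
    if match is None:
        raise ValueError(f"not a Harte-style chord label: {chord!r}")
    return match.group("root"), match.group("quality"), match.group("bass")


def root_pitch_class(root):
    """Return the pitch class (0-11) of a root such as 'Bb' or 'F#'."""
    pc = NATURAL_PC[root[0]] + root.count("#") - root.count("b")
    return pc % 12


def transpose(chord, semitones):
    """Shift the root by `semitones`; quality and bass degree stay unchanged."""
    parsed = parse(chord)
    if parsed is None:
        return chord
    root, quality, bass = parsed
    label = SPELLING[(root_pitch_class(root) + semitones) % 12]
    if quality:
        label += ":" + quality
    if bass:
        label += "/" + bass
    return label


concert = ["F:maj7", "D:min7", "G:min7", "C:7"]
# Bb instruments read a major second (2 semitones) above concert pitch,
# Eb instruments a major sixth (9 semitones) above.
for_bb = [transpose(c, 2) for c in concert]
for_eb = [transpose(c, 9) for c in concert]
print(for_bb)
print(for_eb)
```

A real tool would pick sharps or flats based on the target key instead of using a single lookup table.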
|
| ▲ | singpolyma3 4 hours ago | parent | prev | next [-] |
| What about sheet music typesetting formats like https://abcnotation.com/ ? |
| |
| ▲ | peatmoss 4 hours ago | parent | next [-] | | I forgot to mention ABC. I have seen a few LLMs look at that. There was a model / paper published a couple of years back called ChatMusician that was built around it. With the caveat that I'm not terribly fluent in ABC, it seems to me that simple things are simple, but hard things seem to be nearly pathological. And (again, maybe a lapse in my understanding) it seems like there may be a fair number of concepts that are impossible to convey in ABC? Lastly, if I understand correctly, ABC got its start and is mostly popular as a simplified format for church songbooks. I'd imagine that would, uh, influence the training corpora towards sounding a bit... church songbooky. EDIT: I may have been overly dismissive of ABC at first glance. It does seem like people have extended it quite a bit, and that it's, at least in theory, capable of encoding most of what I'd expect. And it's human readable, which is a benefit. Though readability does take a stiff penalty the more richness you add (e.g. dynamics, articulations, stacked notes, etc.) | |
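To make "simple things are simple" concrete, here is a small sketch that reads a one-line ABC tune. The tune and the helper names `split_abc` and `note_to_midi` are made up for illustration, and the parser deliberately handles only bare note letters with octave marks. Accidentals, durations, chords, and decorations are ignored, which is where the "nearly pathological" part would start.

```python
ABC_TUNE = """\
X:1
T:Example Scale
M:4/4
L:1/4
K:C
C D E F | G A B c |]
"""

NOTE_PC = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def split_abc(text):
    """Separate ABC header fields from the tune body."""
    header, body = {}, []
    in_body = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Header lines look like "K:C"; the K: (key) field ends the header.
        if not in_body and len(line) > 1 and line[1] == ":" and line[0].isalpha():
            header[line[0]] = line[2:].strip()
            if line[0] == "K":
                in_body = True
        else:
            body.append(line)
    return header, body


def note_to_midi(token):
    """Map a bare ABC note such as C, c, c' or C, to a MIDI note number."""
    letter = token[0]
    octave_shift = 12 if letter.islower() else 0   # lowercase is one octave up
    octave_shift += 12 * token.count("'") - 12 * token.count(",")
    return 60 + NOTE_PC[letter.upper()] + octave_shift  # uppercase C is middle C


header, body = split_abc(ABC_TUNE)
pitches = [
    note_to_midi(token)
    for line in body
    for token in line.split()
    if token[0].upper() in NOTE_PC      # skip bar lines such as "|" and "|]"
]
print(header)
print(pitches)
```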
| ▲ | WhitneyLand 3 hours ago | parent | prev [-] | | The simplicity is really cool. To let LLMs compose music I chose JSON for context efficiency, but this seems like it could be a better choice: simple, efficient, and already a real format. https://github.com/whitneyland/riffmcp |
|
|
| ▲ | genxy 3 hours ago | parent | prev | next [-] |
| Create a benchmark for this problem that researchers can easily run and the problem will solve itself. |
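One possible shape for such a benchmark metric, offered as an assumption rather than anything the comment specifies, is a symbol error rate: the edit distance between the recognised symbol sequence and the ground truth, divided by the ground-truth length. The symbol names below are invented for the example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two symbol sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_symbol in enumerate(reference, start=1):
        current = [i]
        for j, hyp_symbol in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_symbol != hyp_symbol)
            current.append(min(previous[j] + 1,       # symbol missed
                               current[j - 1] + 1,    # spurious symbol
                               substitution))
        previous = current
    return previous[-1]


def symbol_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the ground truth."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


truth = ["clef.G", "time.4/4", "note.C4", "note.E4", "note.G4", "barline"]
guess = ["clef.G", "note.C4", "note.F4", "note.G4", "barline"]
ser = symbol_error_rate(truth, guess)
print(f"{ser:.3f}")
```

A flat sequence metric like this ignores layout and polyphony, so it is at best a starting point for single-staff material.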
|
| ▲ | WhitneyLand 4 hours ago | parent | prev | next [-] |
| “there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio” It may not be necessary…a lot of the training pairs/data for this could probably be procedurally created via code. Would be pretty fun to work on and see it come to life. |
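A minimal sketch of the symbolic half of such procedurally generated pairs: random quarter-note melodies written out as bare-bones MusicXML using only the standard library. The function names are hypothetical, the output omits the DOCTYPE and most optional elements, and the rendering step (image or audio) is left to whichever engraver or synthesiser you pair it with.

```python
import random
import xml.etree.ElementTree as ET


def random_melody(length, seed):
    """Pick `length` quarter notes from the C major scale, reproducibly."""
    rng = random.Random(seed)
    return [(rng.choice("CDEFGAB"), rng.choice([4, 5])) for _ in range(length)]


def to_musicxml(melody, beats_per_measure=4):
    """Encode quarter notes as a minimal single-part MusicXML score."""
    score = ET.Element("score-partwise", version="4.0")
    part_list = ET.SubElement(score, "part-list")
    score_part = ET.SubElement(part_list, "score-part", id="P1")
    ET.SubElement(score_part, "part-name").text = "Melody"
    part = ET.SubElement(score, "part", id="P1")

    for start in range(0, len(melody), beats_per_measure):
        number = start // beats_per_measure + 1
        measure = ET.SubElement(part, "measure", number=str(number))
        if number == 1:
            # Key, time signature and clef are declared once, in bar 1.
            attributes = ET.SubElement(measure, "attributes")
            ET.SubElement(attributes, "divisions").text = "1"
            key = ET.SubElement(attributes, "key")
            ET.SubElement(key, "fifths").text = "0"
            time = ET.SubElement(attributes, "time")
            ET.SubElement(time, "beats").text = str(beats_per_measure)
            ET.SubElement(time, "beat-type").text = "4"
            clef = ET.SubElement(attributes, "clef")
            ET.SubElement(clef, "sign").text = "G"
            ET.SubElement(clef, "line").text = "2"
        for step, octave in melody[start:start + beats_per_measure]:
            note = ET.SubElement(measure, "note")
            pitch = ET.SubElement(note, "pitch")
            ET.SubElement(pitch, "step").text = step
            ET.SubElement(pitch, "octave").text = str(octave)
            ET.SubElement(note, "duration").text = "1"
            ET.SubElement(note, "type").text = "quarter"
    return ET.tostring(score, encoding="unicode")


melody = random_melody(8, seed=0)
xml_text = to_musicxml(melody)
print(xml_text[:80])
```

Each seed gives a different labelled example, so the ground truth is free. The open question is how well models trained on such clean renderings transfer to messy real scans and recordings.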
| |
| ▲ | peatmoss 4 hours ago | parent [-] | | I'd imagine that rendered audio that just used MIDI voices (even high quality "Real Instruments" MIDI voices) would be pretty brittle for e.g. stem separation or automatic transcription. In a best case, I think you'd start with a clean digital representation, render sheet music imagery, and then have lots of recordings by a bunch of real instrumentalists playing the same music. On the topic of stem separation, I've wondered about creating a quasi-synthetic dataset by taking chunks of recordings by real musicians, playing them back in a real space in various combinations, and recording the resulting analog-blended cacophony. Could repeat in various environments like cathedrals, basement bars, etc. for realism :-) |
|
|
| ▲ | mcbetz 4 hours ago | parent | prev | next [-] |
| I've been watching the music OCR space, and the only really good solution so far is Soundslice. You scan, review some edge cases, and get really good results. It's a paid service by a small company, well worth supporting! |
| |
| ▲ | peatmoss 3 hours ago | parent [-] | | I just signed up for a trial and uploaded a messy Real Book scan. It did very well! It missed the coda markings, but then again the directive in the Real Book was nonstandard. I guess that's a case where a multimodal model might have been able to read the text ("after solos, D.C. al coda") and do something smarter. |
|
|
| ▲ | ramses0 2 hours ago | parent | prev | next [-] |
| So I made a comment a while back about lilypond: https://news.ycombinator.com/item?id=46148831 A salient extract: ...but why is it so complicated? A novice interpretation of "music" is "a bunch of notes!" ... my amateur interpretation of "music" is "layers of notes". You can either spam 100 notes in a row, or you effectively end up with:
    melody = [ a, b, [c+d], e, ... ]
    bassline = [ b, _, b, _, ... ]
    music = melody + bassline
    score = [
      "a bunch of helper text",
      + melody,
      + bassline,
      + page_size, etc...
    ]
...so Lilypond basically made "Tex4Music", and the format serves a few dual purposes...[snip] |
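The "layers of notes" idea in the pseudocode above can be made runnable in a few lines. This is only one reading of it: the note names are placeholder strings, `REST` stands in for `_`, and `stack` is a hypothetical helper. Note that in Python `melody + bassline` would play one list after the other, so layering needs an explicit step-by-step merge.

```python
from itertools import zip_longest

REST = None  # stands in for "_" in the pseudocode

melody = ["a", "b", ["c", "d"], "e"]   # a nested list is a chord
bassline = ["b", REST, "b", REST]


def as_notes(slot):
    """Normalise one slot to a list of simultaneously sounding notes."""
    if slot is REST:
        return []
    return list(slot) if isinstance(slot, list) else [slot]


def stack(*layers):
    """Combine layers vertically: one list of sounding notes per time step."""
    return [
        [note for slot in step for note in as_notes(slot)]
        for step in zip_longest(*layers, fillvalue=REST)
    ]


music = stack(melody, bassline)
print(music)
```

Everything a real score adds on top of this (durations, voices, ties, page layout) is what makes the actual formats so much more complicated.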
|
| ▲ | aidenn0 3 hours ago | parent | prev [-] |
| As someone who has never looked at a jazz score, can you share an example of how jazz sheet music would benefit from different fonts? |
| |
| ▲ | peatmoss 2 hours ago | parent [-] | | It's just an entrenched aesthetic preference. Jazz fonts (fonts in this context refers both to the words and the music symbols) tend to be quite heavy with thick lines. I've heard that the thick hand-written style was originally to make charts more readable in dimly lit clubs, but with tablets and such, that's an anachronism now. You can look at samples of Hal Leonard's Real Book(s) on their website to get a sense of what it looks like. Again, just an aesthetic preference, but one I and many others hold nonetheless. | | |
| ▲ | elasticdog 2 hours ago | parent [-] | | I also don't love the conventional handwritten aesthetic you often see for jazz fonts. For a project I've been working on, I ended up pulling the handful of chord symbol glyphs out of MuseScore's Leland Text font and adjusting them for use in the UI since I couldn't find a suitable option out there. |
|
|