Cool, EDGAR is an amazing public service. I think they use Akamai as their CDN so the downloads are remarkably fast.

A few years ago I wrote an SGML parser for the full SEC PDS specification (super tedious). But I have trouble leveraging my own efforts for independent research because I don't have a reliable securities master to link against. I can't take a historical CUSIP from 13F filings and map it to a historical ticker and return series. Or my returns are wrong because of data errors, so I can't fit a factor model to run an event study on Form 4 data.

I think what's missing is a serious open source effort to integrate/cleanse the various cheapo data vendors into something reasonably approximating the quality you get out of a CRSP/Compustat.

jgfriedman1999 20 hours ago | parent [-]

Yep! Pretty sure it is still Akamai. Via testing I've noticed they cap downloads at ~6mbps from e.g. home internet, but not GitHub or AWS.

SGML parsing is fun! - I've open-sourced an SGML parser here: https://github.com/john-friedman/secsgml

Securities master to link against - Interesting. Here's a pipeline off the top of my head:

1. Get CUSIP, nameOfIssuer, titleOfClass using the Institutional Holdings database.
2. Use the company metadata crosswalk to link CUSIP + titleOfClass to nameOfIssuer to get cik: https://github.com/john-friedman/datamule-data/blob/master/d... (recompiled daily using GH actions)
3. Get e.g. us-gaap:EarningsPerShareBasic from the XBRL database. Link using cik. Types of stock might be a member - so e.g. Class A, Class B? Not sure there.
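
The three steps above can be sketched roughly as follows. This is a toy sketch, not the actual datamule pipeline: the rows are made up, `norm_name` and `link_holdings` are hypothetical helpers, and it assumes the crosswalk exposes an issuer name and a cik that can be matched on a normalized name.

```python
import re

def norm_name(name):
    """Crude issuer-name normalizer: uppercase, drop punctuation and common suffixes."""
    name = re.sub(r"[^A-Z0-9 ]", "", name.upper())
    name = re.sub(r"\b(INC|CORP|CORPORATION|CO|LTD|PLC|COM|NEW)\b", "", name)
    return " ".join(name.split())

def link_holdings(holdings, crosswalk, eps_by_cik):
    """Attach a cik and one XBRL fact to each 13F holding row (None when unmatched)."""
    cik_by_name = {norm_name(row["name"]): row["cik"] for row in crosswalk}
    linked = []
    for h in holdings:
        cik = cik_by_name.get(norm_name(h["nameOfIssuer"]))
        linked.append({**h, "cik": cik, "eps_basic": eps_by_cik.get(cik)})
    return linked

# Toy rows, made up for illustration only.
holdings = [
    {"cusip": "000000AA0", "nameOfIssuer": "EXAMPLE WIDGETS INC", "titleOfClass": "COM"},
    {"cusip": "000000BB0", "nameOfIssuer": "NO MATCH HOLDINGS", "titleOfClass": "CL A"},
]
crosswalk = [{"cik": 1234567, "name": "Example Widgets, Inc."}]
eps_by_cik = {1234567: 1.25}  # stand-in for us-gaap:EarningsPerShareBasic keyed by cik

linked = link_holdings(holdings, crosswalk, eps_by_cik)
for row in linked:
    print(row["cusip"], row["cik"], row["eps_basic"])
```

Exact matching on a normalized name will miss renames and abbreviations, and it ignores titleOfClass entirely, so unmatched rows (cik of None) are worth counting before trusting anything downstream.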

For form 4, not sure what you mean by event study. Would love to know!

conditionnumber 17 hours ago | parent [-]

Event study: A way to measure how returns respond to events. Popularized by Fama et al. in "The Adjustment of Stock Prices to New Information" but ubiquitous in securities litigation, academic financial economics, and equity L/S research. The canonical recipe is MacKinlay's "Event Studies in Economics and Finance". Industry people tend to just use residual returns from Axioma / Barra / an in-house risk model.

So let's say your hypothesis is "stock go up on insider buy". Event studies help you test that hypothesis and quantify how much up / when.
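
A minimal sketch of the market-model flavor of that recipe, for a single event: fit alpha and beta on a window before the event, then sum the abnormal returns around it. The window lengths, the function names, and the synthetic return series are all assumptions for illustration; a real study would use an actual risk model and average over many events.

```python
import math

def ols_alpha_beta(y, x):
    """Ordinary least squares fit of y = alpha + beta * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return my - beta * mx, beta

def cumulative_abnormal_return(stock, market, event_idx, est_len=120, gap=10, pre=1, post=5):
    """Market-model CAR over days [event_idx - pre, event_idx + post].

    alpha/beta come from est_len days ending gap days before the event,
    so the event itself does not contaminate the estimate.
    """
    est_end = event_idx - gap
    est_start = est_end - est_len
    if est_start < 0 or event_idx + post >= len(stock):
        raise ValueError("not enough data around the event")
    alpha, beta = ols_alpha_beta(stock[est_start:est_end], market[est_start:est_end])
    window = range(event_idx - pre, event_idx + post + 1)
    abnormal = [stock[t] - (alpha + beta * market[t]) for t in window]
    return sum(abnormal), abnormal

# Synthetic daily returns: the stock tracks the market exactly, plus a 2% jump on day 150.
market = [0.01 * math.sin(i) for i in range(200)]
stock = [0.0001 + 1.2 * m for m in market]
stock[150] += 0.02

car, abnormal = cumulative_abnormal_return(stock, market, event_idx=150)
print(f"CAR over [-1, +5]: {car:.4f}")
```

For the insider-buy hypothesis you would compute this per Form 4 event, average the CARs across events, and test whether the average differs from zero.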

Cool metadata table, I'm curious about the ticker source (Form4, 10K, some SEC metadata publications?).

My comment about CUSIP linking was trying to illustrate a more general issue: it's difficult to use SEC data extractions to answer empirical questions if you don't have a good securities master to link against (reference data + market data).

Broadly speaking a securities master will have 2 kinds of data: reference data (identifiers and dates when they're valid) and market data (price / volume / corporate actions... all the stuff you need to accurately compute total returns). CRSP/Compustat (~$40k/year?) is the gold standard for daily frequency US equities. With a decent securities master you can do many interesting things. Realistic backtests for the kinds of "use an LLM to code a strategy" projects you see all over the place these days. Or (my interest) a "papers with code" style repository that helps people learn the field.

What you worry about with bad data is getting a high t-stat on a plausible-sounding result that later fails to replicate when you use clean data (or worse, try to trade it). Let's say your securities master drops companies 2 weeks before they're delisted... just holding the market is going to have serious alpha. Ditto if your fundamental data reflects restatements, since the backtest then sees numbers that weren't known at the time.
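
To put rough numbers on the delisting case, here is a toy two-stock universe with made-up returns. One stock collapses over its last 10 trading days; a hypothetical vendor that truncates those days makes plain buy-and-hold look much better than it was.

```python
def buy_and_hold_return(paths):
    """Equal-weighted average of each stock's compounded return over its own path."""
    total = 0.0
    for daily in paths:
        growth = 1.0
        for r in daily:
            growth *= 1.0 + r
        total += growth - 1.0
    return total / len(paths)

# Made-up daily returns, for illustration only.
survivor = [0.001] * 20
delisted_full = [0.001] * 10 + [-0.08] * 10  # collapses in its last 2 weeks
delisted_truncated = delisted_full[:10]      # what you see if the last 2 weeks are dropped

clean = buy_and_hold_return([survivor, delisted_full])
biased = buy_and_hold_return([survivor, delisted_truncated])
print(f"clean={clean:.4f} biased={biased:.4f}")
```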

On the reference data front, the Compustat security table has (from_date, thru_date, cusip, ticker, cik, name, gics sector/industry, gvkey, iid, etc.) all lined up and ready to go. I don't think it's possible to generate this kind of time-series from cheap data vendors. I think it could be possible to do it using some of the techniques you described, and maybe others. E.g. get (company-name, cik, ticker) time-series from Form 4 or 10-K. Then get (security-name, cusip) time-series from the 13F security lists SEC publishes quarterly (pdfs). Then merge on date/fuzzy-name. Then validate. To get GICS you'd need to do something like extract industry/sector names from a broad index ETF's quarterly holdings reports, whose format will change a lot over the years. Lots of tedious but valuable work. Also a lot of surface area to leverage LLMs. I dunno, at this point it may be feasible to use LLMs to extract all this info (annually) from 10-Ks.
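
The "merge on date/fuzzy-name" step might look something like this sketch. It assumes both inputs have already been turned into rows with validity ranges; the field names, the 0.85 similarity cutoff, and the toy rows are all made up, and `difflib` is just a stand-in for whatever name matcher you would actually validate.

```python
from datetime import date
from difflib import SequenceMatcher

def overlaps(a_from, a_thru, b_from, b_thru):
    """True if two closed date ranges share at least one day."""
    return a_from <= b_thru and b_from <= a_thru

def merge_on_date_and_name(company_rows, security_rows, min_ratio=0.85):
    """Join (name, cik, ticker) ranges to (name, cusip) ranges on overlap + fuzzy name."""
    out = []
    for c in company_rows:
        for s in security_rows:
            if not overlaps(c["from"], c["thru"], s["from"], s["thru"]):
                continue
            ratio = SequenceMatcher(None, c["name"].upper(), s["name"].upper()).ratio()
            if ratio >= min_ratio:
                out.append({
                    "from_date": max(c["from"], s["from"]),
                    "thru_date": min(c["thru"], s["thru"]),
                    "cik": c["cik"], "ticker": c["ticker"], "cusip": s["cusip"],
                })
    return out

def as_of(rows, key, value, when):
    """Point-in-time lookup: rows where rows[key] == value and `when` is in range."""
    return [r for r in rows if r[key] == value and r["from_date"] <= when <= r["thru_date"]]

# Toy rows, made up for illustration: one company that changed ticker mid-2018.
company_rows = [
    {"name": "Example Widgets Inc", "cik": 1234567, "ticker": "EXW",
     "from": date(2015, 1, 1), "thru": date(2018, 6, 30)},
    {"name": "Example Widgets Inc", "cik": 1234567, "ticker": "WIDG",
     "from": date(2018, 7, 1), "thru": date(2020, 12, 31)},
]
security_rows = [
    {"name": "EXAMPLE WIDGETS INC COM", "cusip": "000000AA0",
     "from": date(2015, 1, 1), "thru": date(2020, 12, 31)},
    {"name": "OTHER CORP", "cusip": "000000BB0",
     "from": date(2015, 1, 1), "thru": date(2020, 12, 31)},
]

table = merge_on_date_and_name(company_rows, security_rows)
print(as_of(table, "cusip", "000000AA0", date(2016, 3, 1))[0]["ticker"])
print(as_of(table, "cusip", "000000AA0", date(2019, 3, 1))[0]["ticker"])
```

The nested loop is quadratic, so a real version would block on something like the first few characters of the name before scoring pairs.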

On the market data front, the vendors I've seen have random errors. They tend to be worst for dividends/corporate-actions. But I've seen BRK.A trade $300 trillion on a random Wednesday. Haven't noticed correlation across vendors, so I think this one might be easy to solve. Cheap fundamental data tends to have similar defects to cheap market data.
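
If vendor errors really are uncorrelated, a cross-vendor median is one cheap way to outvote them. A rough sketch, with hypothetical vendor names, made-up values, and an arbitrary 5% tolerance:

```python
from statistics import median

def flag_outliers(quotes_by_vendor, rel_tol=0.05):
    """Compare each vendor's value to the cross-vendor median for the same key.

    quotes_by_vendor: {vendor: {(ticker, date): value}}
    Returns (consensus, flags), where flags lists (key, vendor, value, median).
    """
    keys = set().union(*(q.keys() for q in quotes_by_vendor.values()))
    consensus, flags = {}, []
    for key in sorted(keys):
        seen = {v: q[key] for v, q in quotes_by_vendor.items() if key in q}
        if len(seen) < 3:
            continue  # with fewer than 3 sources the median cannot outvote a bad one
        med = median(seen.values())
        consensus[key] = med
        for vendor, value in seen.items():
            if abs(value - med) > rel_tol * abs(med):
                flags.append((key, vendor, value, med))
    return consensus, flags

# Made-up dollar volumes for one ticker-day from three hypothetical vendors.
key = ("XYZ", "2024-01-03")
quotes = {
    "vendor_a": {key: 1_000_000.0},
    "vendor_b": {key: 1_010_000.0},
    "vendor_c": {key: 3e14},  # the "$300 trillion on a random Wednesday" kind of print
}

consensus, flags = flag_outliers(quotes)
print(consensus[key], [f[1] for f in flags])
```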

Sorry for the long rant, I've thought about this problem for a while but never seriously worked on it. One reason I haven't undertaken the effort: validation is difficult so it's hard to tell if you're actually making progress. You can do things like make sure S&P500 member returns aggregate to SPY returns to see if you're waaay off. But detailed validation is difficult without a source of ground truth.
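
The index-aggregation sanity check could be sketched like this. It assumes you have start-of-day member weights and daily returns for both the members and the ETF; the tolerance, the helper names, and the numbers are made up, and it ignores fees, cash drag, and dividend timing, so it only catches the "waaay off" cases.

```python
def weighted_return(member_returns, weights):
    """Weighted return over members present in both dicts, with weights renormalized."""
    common = member_returns.keys() & weights.keys()
    total_w = sum(weights[t] for t in common)
    return sum(weights[t] * member_returns[t] for t in common) / total_w

def tracking_gaps(days, etf_returns, tol=0.002):
    """days: {date: (member_returns, weights)}. Returns dates where |agg - ETF| > tol."""
    gaps = []
    for d in sorted(days):
        member_returns, weights = days[d]
        diff = weighted_return(member_returns, weights) - etf_returns[d]
        if abs(diff) > tol:
            gaps.append((d, diff))
    return gaps

# Made-up two-stock "index" for illustration; the second day contains a bad print for A.
weights = {"A": 0.6, "B": 0.4}
days = {
    "2024-01-02": ({"A": 0.01, "B": -0.02}, weights),
    "2024-01-03": ({"A": 0.50, "B": 0.00}, weights),
}
etf_returns = {"2024-01-02": -0.002, "2024-01-03": 0.003}

gaps = tracking_gaps(days, etf_returns)
print(gaps)
```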

jgfriedman1999 16 hours ago | parent [-]

Love the long rant.

re: metadata table - it's constructed from the SEC's submissions.zip, which they update daily. What my script does is download the zip, decompress just the bytes where the information (ticker, SIC code, etc.) is stored, then convert it into a CSV.

And yep! Agree with most of this. Currently, I'd say my data is at the stage where it's useful for startups / PhD research and some hedge funds / quant stuff (at least that's who is using it so far!).

I've seen the trillion dollar trades, and they're hilarious! You see them every so often in Form 3, 4, 5 disclosures.

re: LLMs, this is something I'm planning to move into in a month or two. I'm mostly planning to use older NLP methods, which are cheaper and faster, while using LLMs for specific stuff like structured output. E.g. WRDS BoardEx data can be constructed from 8-K Item 5.02s.

I think the biggest difficulty wrt data is just that the raw data ingest is annoying AF. My approach has been to make each step easy -> use it to build the next step.