Digital Archivists: Protecting Public Data from Erasure (spectrum.ieee.org)
195 points by rbanffy a day ago | 45 comments
dmillar a day ago | parent | next [-]

Many criminal records, petty or otherwise, are public record. Once archived, expunged or dismissed infractions never truly become expunged or dismissed. A traffic violation or other petty misdemeanor from 20 years ago that has been expunged from the official record can still show up on a background check because companies archive public data. So, there is a flip side to this.

overfeed a day ago | parent | next [-]

Public data is incompatible with secrecy. Expunged records still appear in newspaper archives if the local reporter on the crime beat captured the proceedings. IMO, "expunged" means removed from official court records - not from the public memory, including newspapers, archived websites, police blotters and prosecutors' files.

InvOfSmallC 19 hours ago | parent | prev [-]

The fact that you get it removed from your criminal record doesn't mean it gets forgotten. Think about a newspaper writing about your crime. That will be public and archived forever.

badlibrarian a day ago | parent | prev | next [-]

There's a lot of panic and overlap in the space; a way to coordinate these efforts would be helpful.

Internet Archive et al. made noise and promises but told volunteers to stop because they couldn't actually handle the ingest.

https://www.reddit.com/r/Archiveteam/comments/1jbgycm/us_gov...

These folks made a notable effort.

https://webrecorder.net/blog/2025-03-25-govarchive-us-and-mi...

nla a day ago | parent | prev | next [-]

Best thing I ever heard from the head of archives at the BBC:

Once you format shift, you will always be format shifting.

Keep your originals whenever you can.

rippit 9 hours ago | parent | next [-]

As someone who spent the last 2 days figuring out how best to digitise my father's old Hi8, Digital8 and MiniDV tapes, I take umbrage with this!

Keep originals if you can, but make copies ASAP, as close to lossless as possible. Don't depend on the right hardware being around in the future.

pjc50 12 hours ago | parent | prev | next [-]

I can see the value in this, but... originals, and the gear to read them, do not last forever. Plus, for many formats the act of reading puts wear on the physical artifacts. So if you want to actually use the information, you have to format shift it to digital in the first place. And then you're back to the same question as the rest of us: how to maintain the bits.

anitil 20 hours ago | parent | prev [-]

I don't understand this phrase, are you able to explain it?

bell-cot 15 hours ago | parent [-]

Guess: If properly stored (physically), good-quality paper documents and photographs will last for centuries. But as soon as you digitize them - you're now chained to the treadmill of maintaining/upgrading/migrating digital archiving systems. Compared to keeping the old-fashioned Archive Storage Room dry (and fire-free), that's 100X the labor and expense. Forever.

wizzard0 11 hours ago | parent [-]

A lot of paper archives and libraries burned just recently in LA.

bell-cot 10 hours ago | parent [-]

True.

But from fire-resistant storage cabinets, to concrete-lined file rooms, to underground archives, the tech to make archives ~99.5% fire-proof is more than a century old. And if you add redundant storage sites for the high-value stuff...

Vs. anything digital is far more vulnerable to digital malice.

Damogran6 a day ago | parent | prev | next [-]

Hypothetically:
- Government leader says they're nuking data
- Mad rush to back up data through other means
- Government leader declares they've 'transferred the cost of maintaining data out of government, thus making for a smaller, more efficient, government'

I hate everything about this.

krunck a day ago | parent | next [-]

There is inherent inefficiency in government accountability efforts. I'm ok with that.

riku_iki a day ago | parent | prev [-]

In general it makes sense to shift this part to business: if the data is valuable, there will be a market and services. The problem is probably how fast they nuked it, without a grace period.

tehjoker a day ago | parent [-]

im okay with data being hosted for free or cheap by the government and not being price gouged for access to public data

riku_iki a day ago | parent [-]

I think many people are very much not OK with how the government handles data: https://news.ycombinator.com/item?id=43237352

forgetfreeman a day ago | parent [-]

Are these same people proposing private industry would do a better job? https://privacybee.com/blog/these-are-the-largest-data-broke...

riku_iki a day ago | parent [-]

Government is also regularly being hacked

tehjoker 20 hours ago | parent [-]

when was the last time we didn't hear about private companies getting hacked lmao they're terrible!!

mikrl a day ago | parent | prev | next [-]

How does this relate to dox?

Let’s say an individual posted identifying or incriminating information online, inadvertently or intentionally, in a public place.

Then a third party decides to store it, and possibly make it accessible to others.

If the original self doxxing user then pulled the original dox, but was unable to scrub the rest, would that information still be considered public, or would it be private? Was it ever truly public? Or private for that matter?

ziddoap a day ago | parent | next [-]

If you intentionally post something publicly, it's public. Full stop.

The tricky part is dealing with inadvertent or malicious (i.e. some other party), posting of private information to a public space. That's really hard to deal with on multiple levels.

For one, the archives would retain the information and scrubbing it is effectively impossible.

Secondly, legitimate things which should remain public (i.e. were posted publicly, are of public interest, etc.) can be argued to have been inadvertently or maliciously posted. So you need some way to moderate and create rulings for each individual case, which quickly becomes untenable due to the sheer volume of information being posted and the inordinate amount of time required to investigate vs. post.

calebio a day ago | parent | prev | next [-]

That's a really good question.

In my head, I'm imagining someone early in the morning posting a flyer up on a bulletin board downtown.

Throughout the day many folks walked by and took photos of the flyer with their cell phone.

At the end of the day, the original person came back and removed the flyer.

IMO, at the time that the folks took the photo of the flyer, that flyer was public information. It remains public information even after the flyer is removed[0].

This isn't a great analogy of mine, and has plenty of holes, but was interesting to me after I read your comment. I know it was in the context of doxxing, but I think it's pretty interesting philosophically.

I think something similar applies to photos taken of other people in public spaces. Both the person who took the photo and the subject of the photo are no longer in that physical public space, but the actions took place within that space.

I think something similar applies to digital "public spaces". But what does a public space even mean in the context of walled gardens[1], etc.

[0] you then run into the question of what happens if someone posts non-public information, publicly?

[1] are digital walled garden communities that different from physical communities that gate access, whether free or paid? Whether information shared within those contexts is public or private is an interesting thread as well.

sixothree a day ago | parent | prev [-]

Which data set are you thinking this might apply to?

Teever a day ago | parent | prev | next [-]

I made this related submission[0] recently but it was flagged.

This stuff is very important to talk about so I hope that this submission by rbanffy isn't also flagged.

[0] https://news.ycombinator.com/item?id=43543075

hsuduebc2 a day ago | parent | next [-]

I agree. I do not understand how this is perceived as a political issue and thus got flagged.

Climate change is also perceived as political for some reason, yet it doesn't get flagged as often.

donnachangstein a day ago | parent | prev [-]

No it isn't. It's merely a cause du jour for data hoarders to justify their hobby in light of this Chicken Little hysteria.

30 years ago it was thought collecting every issue of magazines like TV Guide was important. No one even knows what that is anymore.

No one is ever going to look at 99% of this data. In the meantime, send more hard drives for my NAS!!

hermannj314 a day ago | parent | next [-]

My wife takes thousands of photos every year, when my daughter was young she took even more.

When we were moving out of our apartment there was damage to a door hinge that we never noticed when we moved in but that had definitely been there from the onset of our two years of living in that apartment.

Guess what? I had a photo from the day after we moved in of that door hinge in a state of damage! Not because we took the photo for that intention, but because my daughter was playing in the hallway and my wife snapped a photo and it just happened to capture the damage. Saved me several hundreds of dollars in repair costs from my landlord.

You are right, 99% of the data will never be looked at. But do you know what the 1% is today? I'm guessing you don't.

donnachangstein a day ago | parent [-]

Your example of personal family photos is in no way comparable to storing terabytes of essentially unindexed data about which one has no detailed knowledge, under the notion that the government is somehow lighting a match to everything, and they're going to save it.

The government doesn't delete anything. It might be moved or inaccessible to the public but that data is somewhere in perpetuity.

It's one of the most deranged larps I've ever seen, then they pat each other on the back on BlueSky, desperately wanting to be a part of something.

These people envision themselves as folk heroes when what they really need to do is go outside and touch grass.

spookie a day ago | parent | next [-]

> The government doesn't delete anything. It might be moved or inaccessible to the public but that data is somewhere in perpetuity.

If the government is democratic and values integrity? Sure.

Otherwise I wouldn't bet on it. My own country's history books and my parents' own life stories have already warned me about how fickle democracy is. No democratic country is free from that fact. Some think "checks and balances" ought to be enough to prevent it, but I wouldn't be so sure.

alnwlsn a day ago | parent | prev | next [-]

Patently false. https://www.archives.gov/personnel-records-center/fire-1973

nancyminusone a day ago | parent | prev [-]

If it's inaccessible to the public, it might as well be deleted. What's the difference? If you can't get it, you don't have it.

squarefoot a day ago | parent | prev | next [-]

Among the deleted data there was the police accountability database. You probably won't have to deal with thugs now feeling omnipotent and immune from prosecution because of this.

https://www.police1.com/federal-law-enforcement/national-law...

squarefoot 16 hours ago | parent [-]

Typo that I can't correct anymore: that would be "won't want to deal".

dreamworld a day ago | parent | prev | next [-]

It might be of some interest to cultural historians in the future. But I think it makes more sense to take sample+curated data. But in any case if we can afford it, eh why not.

rbanffy a day ago | parent [-]

We don't know now what to curate for the future. We should preserve as much of everything we can - we don't know what will be important in 50, or 500 years.

Case in point: retrocomputing is my hobby. I buy, restore, preserve, and use old computers. Most of them are home computers, because business computers go directly from the office to the recycling facility or the landfill. Unless someone deliberately preserved, say, a Burroughs B-25 desktop, or the similar from Data General, they are gone.

Suppafly a day ago | parent [-]

My son is into retrocomputing, mostly using older hardware I have from when I was younger, and we have a stack of old compaq desktops where you can't access the bios because it requires a specific floppy that is nearly impossible to find online. This is 486/pentium era stuff, the older stuff is even harder to find.

rbanffy 11 hours ago | parent [-]

I've been looking for a DEC terminal with Sixel, Tektronix and ReGIS graphics for a while, with zero success. They weren't rare at all - they were a massive success, and, yet, it seems almost all ended up in a recycling facility or an e-waste dumpster. Many other terminals emulated them and expanded on their feature set.

peppermill a day ago | parent | prev | next [-]

I think the data being discussed is quite a bit different than old TV Guides...

NoMoreNicksLeft a day ago | parent | next [-]

I was, believe it if you wish, thinking about old TV guides just this morning and wondering how one would even go about archiving those. Most of the stumbling blocks for taking apart the glued binding for scanning have been figured out, of course, but for any given week there may have been as many as 60 or 70 editions (one for each television market, I think). None of these have proper ISSN numbers as far as I'm aware, and other than the listings they can be visually indistinguishable. Then there is the challenge of finding those, and not knowing whether this or that edition is missing (from time to time, the company would create new editions for new regions, or fold old ones back into some other area), along with even parsing the content. Many of these tv shows aren't on themoviedb or thetvdb, and if the shows are, then there won't be episode listings (there were 6000 Donahue talk show episodes, after all). On top of all of that, you can't necessarily know what was on tv at a given time and day, with federal government preemptions, commercials, unreported last-minute rescheduling, etc.

But I can also see why people might want to keep more interesting data, like when the Federal Cheese-Sniffing Agency moved offices back in 1982 and they have meticulous records of the 483 filing cabinets that had to be moved from the original location to their new home in Furrytown, Pennsylvania.

zorpner a day ago | parent | prev [-]

I wonder if those would be useful in identifying the potential contents of specific Marion Stokes tapes (my understanding is that they're sorted, but are only labeled with channel and date/time and are being archived slowly): https://libwww.freelibrary.org/blog/post/5393

thowawatp302 a day ago | parent | prev [-]

I’ve had the idea of recreating tv channels on my plex server by using tv guide data from the late 90s early 00s

The insurmountable part of that project would be getting the guide data.

You don’t know what other people will want in the future

Teever a day ago | parent [-]

That's a great idea.

There are sites that stream old content with an old tube TV UI wrapped around the video frame, but they don't have all the commercials and they don't follow the old schedules like you suggest.

I've got a friend who has hoarded digitized copies of VHS recordings of old cartoons from that era complete with the commercials, so the content is definitely out there.

hsuduebc2 a day ago | parent | prev [-]

I wonder. Maybe blockchain would actually be a useful technology for this?

jefurii 6 hours ago | parent | next [-]

git-annex is not exactly blockchain but because of the way it operates -- storing files by their hashes, the whole Git commit structure -- it gives you several useful things: It becomes easy to clone repositories while guaranteeing that clones are identical. It also becomes easy to ensure that files are not tampered with.
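To make the "storing files by their hashes" part concrete, here is a minimal Python sketch of a content-addressed store. This is not git-annex's actual implementation or key format; the function names (`sha256_of`, `store`, `verify`) and the flat directory layout are assumptions made up for illustration. The idea is only that a file's name is its SHA-256 digest, so identical content always lands under the same name and any later modification is detectable by re-hashing.

```python
import hashlib
import shutil
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store(src: Path, store_dir: Path) -> str:
    """Copy src into store_dir under its own hash (hypothetical layout)."""
    key = sha256_of(src)
    store_dir.mkdir(parents=True, exist_ok=True)
    dest = store_dir / key
    if not dest.exists():  # identical content is only stored once
        shutil.copyfile(src, dest)
    return key


def verify(store_dir: Path) -> List[str]:
    """Return the names of stored files whose content no longer matches their name."""
    return sorted(
        p.name
        for p in store_dir.iterdir()
        if p.is_file() and sha256_of(p) != p.name
    )
```

Two stores built from the same inputs end up with the same set of file names, which is roughly why comparing clones is cheap, and an empty result from `verify` means nothing in the store has been altered since it was added.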

badlibrarian a day ago | parent | prev [-]

https://blog.archive.org/2023/10/20/celebrating-1-petabyte-o...

Though given the space in general and some of the people involved it all should be audited very carefully.