remus 2 days ago [-]
That's a real shame. I am involved with some history-related projects and the number of websites which go offline is huge, and the wayback machine is incredibly helpful for unearthing these dead sites.
It is not hard to imagine a future in 50 years time where a huge percentage of this content is lost forever, or at best incredibly hard to find.
horacemorace 2 days ago [-]
This future is here already, policy makers have it locked up. Any person who remembers what microfiche is understands the magnitude of this problem of not having a trustworthy public record.
If we extended public policy from the library era, the library of congress itself would be the Internet Archive.
Terr_ 2 days ago [-]
> If we extended public policy from
Similarly and tangentially, when the US Constitution was made in an era of horseback/carriages, it explicitly authorized the creation of a public national postal service (USPS).
If we extended that older public policy with today's technological context, they would have authorized a national Internet Service Provider. (And, like with USPS, specialized private competitors would exist.)
account42 1 days ago [-]
That's at least better than other countries who have essentially privatized most of their existing public infrastructure.
VorpalWay 1 days ago [-]
Some countries have national archives that all published material must by law be submitted to, including material published online. I know at least Sweden and the UK have that. This will be available for researchers, though usually you have to physically travel to the archive to access the data, so not as convenient as IA.
(It is worth noting that at least in Sweden "published" here has a very specific meaning, that doesn't include personal websites etc, but it does include news outlets.)
AnthonyMouse 2 days ago [-]
In the walls of the cubicle there were three orifices. To the right of the
speakwrite, a small pneumatic tube for written messages, to the left, a
larger one for newspapers; and in the side wall, within easy reach of
Winston's arm, a large oblong slit protected by a wire grating. This last
was for the disposal of waste paper. Similar slits existed in thousands or
tens of thousands throughout the building, not only in every room but at
short intervals in every corridor. For some reason they were nicknamed
memory holes. When one knew that any document was due for destruction, or
even when one saw a scrap of waste paper lying about, it was an automatic
action to lift the flap of the nearest memory hole and drop it in,
whereupon it would be whirled away on a current of warm air to the enormous
furnaces which were hidden somewhere in the recesses of the building.
noufalibrahim 1 days ago [-]
I gave a talk about this when I worked for The Archive. There was an article in Scientific American about how the average lifetime of a page on the net before it 404s is about 100 days. That article is offline now and we accessed it through the wayback machine.
My own last project before I left was to ingest records from crawl dumps from the defunct cuil.com website. About 200 TB of stuff that brought back 60 billion URLs.
The nature of the internet has changed and it's become an ephemeral place for many people where you just throw things in and others mine it as "data".
random3 1 days ago [-]
I wonder what’s the motivation though. Subscriptions? At a minimum they could limit based on recency. It’s “news” after all so having a number of days of delay for archival would solve most issues, I assume.
account42 1 days ago [-]
Unfortunately the IA itself has become way less usable as their aggressive anti-bot protection means that actually doing any kind of (manual) explorative research, as opposed to pulling an individual website that you already know the exact URL of, is more than likely to get you temp banned.
hungryhobbit 2 days ago [-]
There's an incredibly simple fix: block the archive for a week. No one is paying after a week, so you let the Archive access after that.
I don't see why every news outlet doesn't just do this.
nightshift1 2 days ago [-]
Good idea, but only if the article can't be edited during that week. What's worth preserving is the version the audience actually read. Articles routinely get ninja-edited after publication, sometimes repeatedly. Changelogs should be mandatory but they're useless if we can't keep them honest.
someperson 1 days ago [-]
Block public access not archival
simonh 1 days ago [-]
The reason they're blocking archives is people can go to the archive, to bypass paywalls and avoid targeted adverts, instead of the news site. It's also to prevent AI scrapers harvesting articles.
someperson 16 hours ago [-]
I meant that news sites should provide an API for Internet Archive to scrape their articles at all times to catch changes, but not provide any public access for an indefinite period of time (as an escrow), and eventually release it once the AI scraping issue blows over.
wolvoleo 13 hours ago [-]
True, it's the main reason archive.today exists really
Ma8ee 2 days ago [-]
I'd rather let Archive block access to that specific article for a while, but still archiving from the start.
ajb 2 days ago [-]
In effect, robots.txt should have an "embargo" directive?
Bender 1 days ago [-]
I don't know if this is still the case, but if I told IA via robots.txt not to archive my site, it would still crawl it and archive it, but not display it until I shut the site down. Once robots.txt was no longer reachable they would display the archived content. The only way to stop that was to start the site back up, making robots.txt reachable, and wait for them to crawl it again.
pj_mukh 1 days ago [-]
Do these major publications charge per article? They should, but they don't. So their whole sell is that in aggregate (so access to all, including old articles) they are worth paying monthly for.
In which case archive is a major revenue slumper
alpinisme 1 days ago [-]
How would archive not be a revenue drain if there was pay per read articles? I would think the incentive to try to find a free version would increase not decrease, especially for a wide class of articles that are basically, “I’m curious but not that curious” which in aggregate I might pay money for (they add value to my subscription) but individually feel wasteful (do I really want to pay to satisfy this curiosity?)
deepfriedbits 2 days ago [-]
It's not about the paywall in this case. It's to prevent AI companies from scraping a publication's archives for training data. If AI companies want that data, they can compensate publishers, not extract it for free from the Internet Archive.
leonidasrup 1 days ago [-]
Yes, it's probably cheaper to just download the newspaper articles from Internet Archive than to buy them directly from newspapers. Training costs minimization, or should we call it stealing?
NicuCalcea 2 days ago [-]
The article is about AI companies using the Internet Archive to source training data, not about people using it to avoid paywalls. AI companies don't care that the data is one week old.
someperson 1 days ago [-]
Internet Archive can keep it escrowed until AI training kerfuffle blows over
ray_v 2 days ago [-]
Greed and spite.
foxglacier 2 days ago [-]
You people need to stop saying this. You're being greedy when you buy groceries from a cheaper supermarket. You're being greedy when you negotiate your salary or choose a job based on pay, or anything where you're trying to get more stuff for yourself. Those things are all perfectly good behaviors, they make the world more productive, so everyone wins overall. Greed isn't a problem.
Spite? No evidence of that. They probably just don't want to lose the money from paying customers and ads. You're just making up fantasy. Perhaps projecting your own spite.
pasquinelli 1 days ago [-]
> Greed isn't a problem.
you listed
1. buying the cheapest groceries you can reasonably find
2. trying to get the highest salary you can
3. literally any time you try to get more for yourself
that's a weak list from which to conclude that greed isn't a problem, especially since in the case of 1. and 2. someone's making money off you, the person who's supposedly greedy in these scenarios.
SR2Z 1 days ago [-]
The beauty of capitalism is that both parties can benefit from trade.
pasquinelli 21 hours ago [-]
why did anyone trade prior to capitalism?
SR2Z 19 hours ago [-]
Consensual trade between rational actors leads to both of them benefiting. Before capitalism, the ideology which focused on property rights and individual freedoms didn't really have a name...
pasquinelli 15 hours ago [-]
so how's it the beauty of capitalism?
SR2Z 11 hours ago [-]
'cause that's the one where you get to own things?
pasquinelli 11 hours ago [-]
if that's the one where you get to own things then there wasn't trade before capitalism, because you can't trade things without ownership. is that what you're saying?
arjie 1 days ago [-]
It's clear that people place some non-zero value on archival content. It should be unsurprising that news outlets also place some non-zero value on it. Given that they place some non-zero value on it, it is unsurprising that they do not give it away for zero. Disagreeing with their estimation of the value is understandable, but surely it's easy to see why most news outlets do what they do.
storus 2 days ago [-]
Not trying to be paranoid, but losing recorded history as it was originally reported could lead to quick AI-assisted rewrites in the archives of news outlets to fit whatever narrative du jour is in fashion, or whatever the powerful of the time want. We are already seeing it in new editions of some old books that suddenly omit some currently controversial topics. "History is written by the victors" could change to "history is rewritten by the (current) victors, as they see fit."
riffraff 1 days ago [-]
Actual physical newspaper still exists tho.
svachalek 2 days ago [-]
There really should be a micropayments setup on the internet that's not advertising based. Let these models pay a nickel to read the article, covered by the multi trillion dollar AI blank check.
AnthonyMouse 2 days ago [-]
The biggest problem with micropayments is that the buyer needs to be anonymous or you'll be creating a massive surveillance apparatus, which is the thing we're trying to get rid of. But existing laws make it difficult to build something like that which is easy for normies to use, so someone either needs to come up with a creative solution or reform the laws.
_jackdk_ 2 days ago [-]
GNU Taler[1] is an interesting middle-ground in the payments space: privacy-preserving for consumers, non-blockchain digital cash, and keeps merchant activity taxable.
I do worry about their whitepaper recommending it for a CBDC[2] (linked from [3]) which points out the state can implement negative interest rates, and that its architecture requires the issuer to get involved even in "spot your friend a $20"-level use cases. Since the issuer would presumably be required to KYC everyone, that also creates a big surveillance problem.
Well my thought is in this case the buyer is OpenAI, Anthropic, or Google.
AnthonyMouse 19 hours ago [-]
If normal people get to be anonymous then you don't know if someone accessing the site is OpenAI, Anthropic or Google, at which point why are they going to pay when anonymous access doesn't?
Also, this works pretty poorly for scrapers because people would just set up massive junk farms to collect micropayments from crawlers, and then either the amounts would be too small for real creators to get anything or anything requiring them just wouldn't get accessed. The latter is probably what a lot of the media companies want, but then if the AI companies aren't paying and the normal users aren't paying, who is?
jeroenhd 1 days ago [-]
For as long as there's free data to be downloaded, there will be an AI to ruin a public good. Why pay when you can just have your scraper generate a million trial accounts?
A pay wall at the news site would just bankrupt the internet archive, and a pay wall at the internet archive will kill most public interest in the service.
poisonfountain 2 days ago [-]
Cloudflare is trying to push for that, but every time it's mentioned people complain (because they hate Cloudflare for making them wait 2s for a captcha) and nobody proposes an alternative solution. I don't think this is going to happen, unfortunately, and the internet will get silo-ed into oblivion.
overfeed 2 days ago [-]
People don't hate micropayments because it's Cloudflare promoting them, it's because it truly is a shit idea, for many reasons.
People would equally reject Netflix, if Netflix floated the idea of replacing the subscription model with pay-per-view micropayments.
> ...nobody proposes an alternative solution
Such is the human condition - some problems simply have no satisfactory solutions.
svachalek 2 days ago [-]
I'm absolutely the opposite. I'd do a reasonable pay per view on Netflix, but I don't want 73 different streaming subscriptions draining my account every month.
overfeed 20 hours ago [-]
> I'd do a reasonable pay per view on Netflix
You know it's not going to be reasonable because Netflix management wants to show continued profit growth, ad infinitum.
Streaming micropayments will likely include dynamic pricing, timed availability, $0.99 pilots and $20 season finales, doubled pricing when you choose the ad-free option, "membership fees" that are a subscription in disguise, among other terrible (for the consumer) tactics.
The pricing model is just one small component of the picture, and it cannot fix systemic problems with the studio model, consolidation, chasing infinite growth, and IP law.
kube-system 1 days ago [-]
I do rent movies on Apple TV or Google Play for $4 rather than pay Netflix $20/mo for a smaller collection of movies.
I would never watch 6 movies in a month, and the selection with Netflix sucks by comparison to what is available when renting.
The subscription services are only a good deal if you binge them
HeWhoLurksLate 2 days ago [-]
I would gladly pay $4 to not have to watch twenty-seven ads before the main fight in an MMA match
overfeed 2 days ago [-]
If perfect information and perfect competition were attainable, the free hand of the market would deliver such a service to you. Since it's not, you'll have to bear the dudewipes ads.
lo_zamoyski 2 days ago [-]
> People would equally reject Netflix, if Netflix floated the idea of replacing the subscription model with pay-per-view micropayments.
You sure about that?
Something like over half of Netflix viewers believe their subscription isn't justified by how much they watch or else they aren't sure of it. Less than half believe the subscription cost is justified.
Whether a PPV model would actually be cheaper for the first half is a good question, but it is possible. Certainly, in my case, I do not watch $20 worth of content on Netflix a month. I would gladly take PPV.
AnthonyMouse 2 days ago [-]
> because they hate Cloudflare for making them wait 2s for a captcha
Why does it work like that anyway? Every time I open a page on some sites, their vexing box shows up to waste my time. Five minutes later I want a different page on the same site and it does the same thing. They can't do it once and cache the result?
Forgeties79 2 days ago [-]
My issue with Cloudflare is that if I run a VPN it randomly just locks up. It's a coin toss whether I can get through if I am routing through another country.
There's a river of cash flowing to the pockets of the wealthy and to the megalomaniac projects of hyperscalers, but not to drip a few pennies into the pockets of people providing such an important public service as journalists.
b65e8bee43c2ed0 2 days ago [-]
Good.
wormius 2 days ago [-]
Ugh - our local paper used to have a wonderful archive, that got limited and locked down after the pandemic. IDK if they got bought out, but it's a real shame, I think some of the problem is things that used to be public information (birthdates, families, names) in hospital admissions (I found old entries of my friends parents and my own for being "in the hospital" in the newspaper for example).
I'm sure that plays a role, but still... This obviously is about cost and money making, not security as a whole (ime)
ghaff 2 days ago [-]
A lot of those aggregated records very quickly become a very precise public record. I'm not saying if it's good or bad but a lot of people on this site probably object to having their lives be essentially an open book which is very close to being the case as soon as a relatively small number of facts are opened up.
It's more the case when the addresses and birthdates of public figures, which are often a matter of public record, enter the picture but it's easier to find out information about a lot of people with a bit of data than most people realize if anyone really cares to investigate.
O1111OOO 2 days ago [-]
Newspapers are failing at an astounding rate. Archive.org is just a (poor) scapegoat for their inability to survive. This makes the point everyone else is making even more important - that those stories need to be archived before they are lost for all time.
"Since the early 2000s, the U.S. has lost about 40% of its local newspapers and about 75% of the jobs in newspaper journalism, according to a 2025 report from the Medill School of Journalism at Northwestern University. A study published last year by Rebuild Local News and Muck Rack shows that in 2002, there were roughly 40 journalists per 100,000 people in the United States. Today, it’s down to about eight journalists."[0]
There is a future where AI companies start hiring their own reporters, and it might be sooner rather than later.
blitzar 1 days ago [-]
Based on the coverage it looks a lot like journalists and publications have already been bought and paid for.
sandeepkd 2 days ago [-]
I think it's bound to happen, and in some ways it's a good thing to happen too. The current state of AI affairs is a lot about outrightly selling someone else's intellectual property. The short term incentives are eroding the trust and goodwill among the natural knowledge actors.
The next natural thing to happen would be privatization or consolidation of the internet itself. It's already happening in the form of grabbing and consolidating IPv4 addresses.
drtz 2 days ago [-]
> The current state of AI affairs is a lot about outrightly selling someone else's intellectual property.
Blocking archiving in a flailing attempt to keep AIs away is extremely shortsighted. Archiving is important for keeping historical context, especially when it comes to news and journalism.
sandeepkd 2 days ago [-]
There is a natural flow of information that allows the information producers to make money for their work. How do you expect the information producers to be able to continue creating information when they are not getting paid anymore?
One possible solution that I can think of for the long term good could be to just allow archival, with no retrieval of the latest information for at least 6 months or a year. This should theoretically satisfy most goals.
CM30 2 days ago [-]
This is rather worrying for fact checkers and those that want to track changes to news articles. The number of times I've seen articles either get silently edited or go missing entirely is far higher than I'd have liked.
The Internet Archive at least provides one solution there, especially given the somewhat dubious practices Archive.is/today seems to be up to at the moment.
But I suspect that's probably another reason these sites don't want their work archived.
dspillett 1 days ago [-]
My cynical view is that a lot of these outlets would have liked to block the archive anyway⁰ but didn't as it could look bad to do so, and AI scraping is a convenient excuse. Much like some (but far from all) of the recent job cuts that have been announced “due to AI”.
An even more cynical view is that the information on many local news sites in recent years isn't worth archiving anyway, it is largely generic rubbish filtered down from on high because these days most local outlets are owned by large national groups¹ that use them for little more than a place to insert adverts.
For actual local news, which those outlets do sometimes still carry, archiving personal blogs, event sites, and some social media content³, would be more useful than local news outlets. It is a shame that a lot of this has moved to platforms that are more difficult to archive (distraction media providers block the archive too, discord and similar services are more difficult to easily/meaningfully archive, heck searching for non-recent information on them that you know is there can be a pain, etc.)
--------
[0] For numerous reasons, including some idea that it could affect their advertising revenue, that they don't want things which they correct [because they are actually wrong or because those higher up the ownership chain are unhappy with certain truths] to be embarrassingly preserved in their original form, etc.
[1] Like Reach Plc or Newsquest² (which owns the most prominent local rag where I live) in the UK.
[2] Which is in turn owned by the US company USA Today Co.
[3] Though this is probably largely blocked from the archive too.
dylan604 2 days ago [-]
The oldest news station in my city has partnered with a local college to house their archives going all the way back to when their content was gathered on black&white film. IIRC, using the footage is free (with proper attribution), but you have to pay for the media you choose to receive. The last time I looked, this was pre-digital play out systems, so we're talking video tape costs plus a fee to cover equipment/VTR type use. Not sure what they do now if you just want a digital file.
I'm pretty sure similar was done for the newspaper. However, the oldest paper was bought and killed decades ago, so not sure what happened there.
While not as convenient as a live website, most news sources will have an actual physical archive that you can access with some real intent.
kccqzy 2 days ago [-]
On AWS, S3 has a downloader-pay option since basically forever. Maybe a non-technical user would be baffled by this, but a HN user won’t have trouble paying for the download using their existing AWS account. This would be a fine solution if the cost of distribution is the only concern, without considering royalty for copyright.
dylan604 1 days ago [-]
Most places (all??) aren't going to be doing AWS->AWS transfers like that. They'll have a web portal where you can use a payment option that ultimately generates an expiring authorized URL.
evanjrowley 2 days ago [-]
They should allow access after the news becomes old. That's what the archive is intended for.
GCA10 2 days ago [-]
JSTOR does exactly this with scholarly journals, and it works out pretty well. Recent issues are accessible only to paying customers.
Back issues (usually at least a few years old) are available via JSTOR for free in small amounts and through subscriptions for bulk users. I'm sure there's some reason to fight about the details, but from a distance it looks like a pretty good compromise.
ok123456 2 days ago [-]
Newspapers think their archives are worth money, and that people who are interested in genealogy will pay for newspapers.com subscriptions.
smith7018 2 days ago [-]
Agreed. IA should take snapshots of the articles over time and then make them publicly available X months/years later. There's no reason to immediately publicly mirror the articles beyond people trying to get around paywalls.
blablabla123 1 days ago [-]
I'm surprised this didn't happen earlier and genuinely curious why. Decades ago the first time for me to see a redundant server setup was within a local newspaper's office. So it's likely not because people aren't tech-savvy or anything.
flippant 2 days ago [-]
Apologies for the self-promo. Downvote and I'll know not to do it again.
This trend of outright banning the Internet Archive has me extremely worried. I fear a future where news articles are memoryholed, and no one can remember exactly what was reported and how sensational it all seemed.
I've been working on this project [0] for a while. Originally, I started with a tool that would allow people to snapshot webpages in their own browser, and they could selectively share their snapshots. Then by consensus, everyone could understand what exactly had changed, and they could draw their own conclusion about why.
While working on it, I realized that an authoritative answer to "what did it look like on $DATE" can't be produced by a no-name company. It's gotta be a non-commercial entity that's got a track record of integrity. The dream would be to allow MemoryHole customers to submit their snapshots to the Internet Archive (or other non-commercial entity). It's definitely a copyright nightmare - so no clue how this could work.
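A rough Python sketch of the "by consensus" comparison described above, assuming snapshots are available as plain text. The helper names (`fingerprint`, `consensus`, `changes`) are made up for illustration and are not how MemoryHole itself works.

```python
import difflib
import hashlib
from collections import Counter


def fingerprint(text: str) -> str:
    """Hash a snapshot after collapsing whitespace, so trivial formatting differences don't count."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def consensus(snapshots: list[str]) -> tuple[str, float]:
    """Return the most common fingerprint and the share of snapshots that agree with it."""
    if not snapshots:
        raise ValueError("need at least one snapshot")
    counts = Counter(fingerprint(s) for s in snapshots)
    digest, n = counts.most_common(1)[0]
    return digest, n / len(snapshots)


def changes(old: str, new: str) -> list[str]:
    """List removed ('- ') and added ('+ ') lines between two versions of an article."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

If several independent readers captured the same page on the same date, `consensus` shows how many of them saw identical text, and `changes` shows what differs between an earlier and a later capture. It says nothing about who is authoritative, which is the harder problem raised above.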
> It's definitely a copyright nightmare - so no clue how this could work.
It could work as a decentralized free and open source system that doesn't care about copyright. Like how torrents work now, but it would be good to have it work over Tor or something. Perhaps as a DAO for the management aspect of it. I don't know how exactly. But disregarding copyright by using a centralized company is the wrong idea.
Or you can do the lawful approach and try to work within the framework of that copyright nightmare. But "fuck copyright" is an easier path.
entropie 2 days ago [-]
You - as a company - can just avoid any copyright stuff when your extension saves the stuff only on the client. I see there are many other issues then.
The torrent approach is nice. I could imagine a selfhosted way to store the data (for a group of people)
flippant 2 days ago [-]
> I could imagine a selfhosted way to store the data (for a group of people)
Linkwarden does this well. You can share a collection for a small group of people.
Is there a way to export/download my saves in a reasonable way?
flippant 2 days ago [-]
Thank you! Yes, you just get a zip file with all of your saved pages.
It looks like this:
├── files
│   └── 632daffb-2f4f-4795-bb4d-3149d24f4264
│       ├── original.html
│       ├── readerview.html
│       └── screenshot.png
├── manifest.json
└── metadata.csv
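As a sketch, an export with the layout shown above could be walked with Python's standard `zipfile` module. Only the `files/<id>/<name>` layout comes from the listing; the contents of `manifest.json` and `metadata.csv` are not described here, so this code does not try to interpret them. `group_export` and `make_demo_export` are hypothetical helper names.

```python
import io
import zipfile
from collections import defaultdict
from pathlib import PurePosixPath


def group_export(zip_source) -> dict[str, list[str]]:
    """Map each snapshot id under files/ to the sorted list of files saved for it."""
    grouped = defaultdict(list)
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            parts = PurePosixPath(name).parts
            # Keep only entries shaped like files/<snapshot-id>/<filename>.
            if len(parts) == 3 and parts[0] == "files":
                grouped[parts[1]].append(parts[2])
    return {snapshot_id: sorted(names) for snapshot_id, names in grouped.items()}


def make_demo_export() -> io.BytesIO:
    """Build a tiny in-memory zip with the same layout, for trying the function out."""
    buffer = io.BytesIO()
    snapshot_id = "632daffb-2f4f-4795-bb4d-3149d24f4264"
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in ("original.html", "readerview.html", "screenshot.png"):
            zf.writestr(f"files/{snapshot_id}/{name}", b"")
        zf.writestr("manifest.json", "{}")
        zf.writestr("metadata.csv", "")
    buffer.seek(0)
    return buffer
```

Calling `group_export(make_demo_export())` returns one entry keyed by the snapshot id, listing the three saved files.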
acidhousemcnab 2 days ago [-]
Perhaps I imagined this, however some months ago on X someone pointed out that a historical article on dailymail.co.uk related to Prince Philip and Epstein had been scrubbed, which likely would be intelligence or through D-Notices, but where instead of showing a 404 page it would redirect to an article that was similar but benign. I checked the URL on the Wayback Machine and it turned up zero results, not even the redirected article; however, the user on X had screen grabbed the original, which everyone was reading and commenting on. As of 21st May I can't find this discussion on X and Grok denies it ever existed. This is a "maximally truth-finding" AI, so I must be mistaken. Perhaps the Internet Archive cannot be trusted, so this is why 340 local news outlets need to limit access.
grosswait 2 days ago [-]
This sounds like the beginning of a story where the next odd thing is your family and friends don’t know who you are, and no one has ever heard of you.
acidhousemcnab 1 days ago [-]
Yeah, so about that - it’s called Project White Bear, involved UK GCHQ, theatre act called GAYA 4.0, AI human “psychological” experimentation, and propaganda depicting Russians as Mongolian beasts with dripping claws and fangs. Not making this up.
endofreach 2 days ago [-]
why not just agree on a release date?
while i enjoy circumventing moneyfences, i understand the wallmakers do not. i think this would be an easy deal, if someone just laid it on the table.
arjie 2 days ago [-]
It's interesting how much we lost with the end of the advertising model (though likely its death would arrive with agentic access anyway). An unsurprising reaction to that was the advent of the widespread paywall. And in a world where every paywalled article on social media, including HN, is on an archived paywall-bypass site there was going to be a natural cat-and-mouse game. The distributed payment model of online advertising was surprisingly effective. No single person was worth very much but the aggregate of attention had a probabilistic conversion that enabled a sufficient ecosystem of news.
Now most of those who spend money get access to relatively good news in comparison to those who don't. The interesting thing is that if you model the utility of a customer base as trifactorial (subscriptions, ad-supported, influence-ability) and you set ad-support to near zero you're left with this situation where those with no ability to pay are now overwhelmingly useful to the website provider only as an influenceable base.
"If you're not paying, you're not the customer, you're the product", we used to say[0]. It turns out that's true, but if you can't pay by looking at ads, you will pay by the actions you take when you believe what the actual customer wants you to believe.
0: Though sometimes you do pay and you're still "the product" haha!
internet_points 2 days ago [-]
they should make a browser extension that lets logged in readers submit the contents of their tab to IA
xp84 2 days ago [-]
> "as profit margins for news thin, it’s only become more important to news publishers to protect their intellectual property."
So their argument is that people who would be paying money at their paywalls, are going to IA to get their news for free? And if they can thwart those people, they'll show up and become monthly subscribers?
I am vaguely sympathetic to newspapers as a concept, though the actual owners of approximately all of them are just PE companies looking to extract maximum profit from this dying industry, not really trying to prolong their existence.
But I think everyone who is interested in subscribing to their newspapers' paywalls already has subscribed. Those of us who bypass paywalls with that archive.whatever site, or apparently IA (I have never tried it for this purpose) are doing so because there is zero chance we're going to (recurringly!) pay the asking price for some random out-of-town newspaper, The Verge, Bloomberg, whatever. It's fair game to call us immoral for that decision, but if (and it's a big if) this move prevents more people from being able to bypass a paywall, I predict zero incremental dollars will go to the news publishers.
munk-a 2 days ago [-]
IA is sort of caught in the middle of a conflict it didn't ask for, here. The same tools that allow IA to do its work are also used by Google to scrape and resell the news. There are ways to allow one usage without the other - but the simpler and more foolproof approach is to block both.
wolvoleo 13 hours ago [-]
One thing I don't understand is how for decades it was totally fine to subsidise a new site with a few ads. But now they're all paywalling. What's changed?
jqmccleary 2 days ago [-]
If we don't know the past we won't know it's repeating
forestingfisher 2 days ago [-]
Of course, there are other archivers that don’t care
jmclnx 2 days ago [-]
Maybe they should allow the Internet Archive access to their article after a week or 2.
But I think this will hurt them more than help as time goes on. IIRC, one news org blocked free access and their revenue fell. I think that was in Australia.
But seems they are using AI as the reason. So allowing after a week will not avoid AI access.
But what happens if an AI company subscribes to the news site using a person's name (or a fake name)? They will still get the article and avoid hassles.
celsoazevedo 2 days ago [-]
It may be easier to convince them if the Internet Archive doesn't allow access for <period of time>. Not good for the average user now, but at least it would be archived for the future. Better than having no archive at all.
fragmede 2 days ago [-]
Yeah IA needs to get their heads out of their asses and just do that. It's an archive, but if it's available at the same time as it's relevant, then it's being used as alternate access.
ranger_danger 2 days ago [-]
That sounds like a good idea to me.
One of the tests for Fair Use in the US, as I understand it, would be whether the archived work "competes" with the original.
If people start going to IA instead to read the news, the newspaper might have a claim. But if they're doing it to get around paywalls, or purely for archival/historical/research purposes, that may be allowed.
But the reality is such decisions are subjective and will be up to whatever judge happens to get such a case in front of them if this is challenged.
PaulHoule 2 days ago [-]
In general judges seem to understand that the copyright holder has some interest in these situations but do not seem to understand that the rest of the community has some rights too.
stronglikedan 2 days ago [-]
That's okay. The AI knows everything now, and forever more. Farewell, IA.
_ink_ 2 days ago [-]
Thanks, Big Tech!
phkahler 2 days ago [-]
Are they blocking due to AI scraping, or due to people using archive as a way around paywalls?
For the latter, archive could just limit access to stuff that's less than 7 days old.
archagon 2 days ago [-]
Too bad archive.today went a bit crazy. We need an anonymous archiving site that will simply not respond to takedown requests (except for CSAM and similar). The internet is too important for that.
charcircuit 2 days ago [-]
If the block is merely user agent based IA can spoof a different user agent to get these sites.
Not surprising, sites like Reddit use it to get around their paywalls.
Redditors then had the gall to pretend like it wasn’t their number one use case.
wolvoleo 13 hours ago [-]
I see archive.is / archive.today used a lot more for that.
First of all it's a lot faster than archive.org, second of all they are much better at bypassing the paywalls. In some cases the operator even seems to have taken up a subscription or at least a user account (in some cases you could see there is a user logged in). They're really clever at this.
For this site the main purpose is indeed the bypass. For the internet archive it's not.
And yeah it could easily be fixed by just publishing the pages a month later or so.
fortyseven 2 days ago [-]
Just burn the whole fucking internet down. We can't have nice things.
wolvoleo 13 hours ago [-]
Isn't this what the big money guys have been doing for the past 20 years? And the politicians with their age BS.
picsao 2 days ago [-]
[dead]
b00ty4breakfast 2 days ago [-]
Of course they are, because they are not primarily concerned with the reporting of noteworthy events. They are most worried about profit with the secondary goal of reporting but only insofar as it serves the first goal. This is a wider trend across many industries.
Obviously, a business needs to have an income but it's becoming more common for businesses to function first and foremost as revenue generators and the thing that enables that is only seen as a means to an end. When the quality of the product/service and its function as a revenue generator diverge, the product/service will always take 2nd chair.
Maybe we could argue that the primary product is the revenue, especially when there are investors involved who are looking for big returns.
psb5 2 days ago [-]
More than even that, there is more news being generated than there are 3 inch chimp brains available to digest it all (even with AI busy summarizing everything) or act on it.
There is no media theory of information of what happens when info explodes beyond capacity of the system to consume it. (UN report on Attention Economy says less than 1% is actually consumed by humans)
So media orgs, instead of coming up with one, they just keep mindlessly doing what they know how to do - generate more info. Platforms and corps subsidize this activity for their own interests.
So media orgs have no signal/warped signals of how useless what they are doing is.
no-name-here 2 days ago [-]
Among the countless local and global newspapers etc, either present or recent decades, are there any that you believe were or are primarily concerned with reporting noteworthy news?
xp84 2 days ago [-]
When it comes to the companies named here, I would argue that they have shown that reporting isn't even a secondary goal or a goal at all. Journalists don't even make that much money, but they've still gutted newsrooms very thoroughly. I assume that they already have people working on setting up an LLM connected to feeds of press releases, government announcements, public police crime reports, prominent social media accounts, etc. to create a repository of slop they can use (which will bear a vague resemblance to 'news') without having even one reporter employed. And then they'll try to sell access to that slop feed back to the AI vendor (which hopefully won't buy it).
frmersdog 2 days ago [-]
As good a time as any to remind people that the Southern Strategy was never really all that Southern:
Historically-speaking, if your local news can twist the context to make you easier to sell to (products, services, ideologies), they will do that.
khat 2 days ago [-]
Or maybe, just maybe these news sites shouldn't be shipping 40MB JS bloated, ad infected websites. You're a news station just ship the words, make people pay for the images. This keeps bandwidth down for non payers, and foots the bill for those who do use the bandwidth. You pay for what you use, and reduce the overhead while you're at it.
carlosjobim 2 days ago [-]
Bandwidth is not their concern.
khat 1 days ago [-]
Their concern is profit. Text-only pages have a small footprint both on the network and on storage. I was trying to be concise.
no-name-here 2 days ago [-]
1. Is the idea that the primary costs for such news sources are hosting costs?
2. And that if news sites offered the text for free but paywalled images they'd be more sustainable than they are now?
khat 1 days ago [-]
1. In the direct relation of the cost of the website, yes. Mainly in news sites like CNN, Fox, MSNBC. The articles are usually already written for TV.
2. As profitable? No. Sustainable? Yes.
My own last project before I left was to ingest records from crawl dumps from the defunct cuil.com website. About 200 TB of stuff that brought back 60 billion URLs.
The nature of the internet has changed and it's become an ephemeral place for many people where you just throw things in and others mine it as "data".
I don't see why every news outlet doesn't just do this.
In which case archive is a major revenue slumper
Spite? No evidence of that. They probably just don't want to lose the money from paying customers and ads. You're just making up fantasy. Perhaps projecting your own spite.
you listed
1. buying the cheapest groceries you can reasonably find 2. trying to get the highest salary you can 3. literally any time you try to get more for yourself
that's a weak list from which to conclude that greed isn't a problem, especially since in the case of 1. and 2. someone's making money off you, the person who's supposedly greedy in these scenarios.
I do worry about their whitepaper recommending it for a CBDC[2] (linked from [3]) which points out the state can implement negative interest rates, and that its architecture requires the issuer to get involved even in "spot your friend a $20"-level use cases. Since the issuer would presumably be required to KYC everyone, that also creates a big surveillance problem.
[1]: https://www.taler.net/en/index.html
[2]: https://www.snb.ch/public/asset/de/www-snb-ch/publications/r...
[3]: https://www.taler-systems.com/en/digital-currency.html
Also, this works pretty poorly for scrapers because people would just set up massive junk farms to collect micropayments from crawlers, and then either the amounts would be too small for real creators to get anything or anything requiring them just wouldn't get accessed. The latter is probably what a lot of the media companies want, but then if the AI companies aren't paying and the normal users aren't paying, who is?
A pay wall at the news site would just bankrupt the internet archive, and a pay wall at the internet archive will kill most public interest in the service.
People would equally reject Netflix, if Netflix floated the idea of replacing the subscription model with pay-per-view micropayments.
> ...nobody proposes an alternative solution
Such is the human condition - some problems simply have no satisfactory solutions.
You know it's not going to be reasonable because Netflix management wants to show continued profit growth, ad infinitum.
Streaming micropayments will likely include dynamic pricing, timed availability, $0.99 pilots and $20 season finales, doubled pricing when you choose the ad-free option, "membership fees" that are a subscription in disguise, among other terrible (for the consumer) tactics.
The pricing model is just one small component of the picture, and it cannot fix systemic problems with the studio model, consolidation, chasing infinite growth, and IP law.
I would never watch 6 movies in a month, and the selection with Netflix sucks by comparison to what is available when renting.
The subscription services are only a good deal if you binge them
You sure about that?
Something like over half of Netflix viewers believe their subscription isn't justified by how much they watch or else they aren't sure of it. Less than half believe the subscription cost is justified.
Whether a PPV model would actually be cheaper for the first half is a good question, but it is possible. Certainly, in my case, I do not watch $20 worth of content on Netflix a month. I would gladly take PPV.
Why does it work like that anyway? Every time I open a page on some sites, their vexing box shows up to waste my time. Five minutes later I want a different page on the same site and it does the same thing. They can't do it once and cache the result?
I'm sure that plays a role, but still... This obviously is about cost and money making, not security as a whole (ime)
It's more the case when the addresses and birthdates of public figures, which are often a matter of public record, enter the picture but it's easier to find out information about a lot of people with a bit of data than most people realize if anyone really cares to investigate.
"Since the early 2000s, the U.S. has lost about 40% of its local newspapers and about 75% of the jobs in newspaper journalism, according to a 2025 report from the Medill School of Journalism at Northwestern University. A study published last year by Rebuild Local News and Muck Rack shows that in 2002, there were roughly 40 journalists per 100,000 people in the United States. Today, it’s down to about eight journalists."[0]
[0] https://theconversation.com/why-the-pittsburgh-post-gazettes...
The next natural thing to happen would be privatization or consolidation of the internet itself. It's already happening in the form of grabbing and consolidating IPv4 addresses.
Blocking archiving in a flailing attempt to keep AIs away is extremely shortsighted. Archiving is important for keeping historical context, especially when it comes to news and journalism.
One possible solution that I can think of for the long-term good could be to allow archival but no retrieval of the latest information, at least for 6 months or a year. This should theoretically satisfy most goals.
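A minimal sketch of that embargo idea, assuming a hypothetical archive that records a capture time for every snapshot and only serves it once a fixed delay has passed. The names `EMBARGO` and `is_retrievable` are illustrative only and are not anything the Internet Archive is known to expose; the 180-day window is just the lower end of the range suggested above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical embargo window (assumption): roughly six months.
EMBARGO = timedelta(days=180)


def is_retrievable(captured_at: datetime, now: Optional[datetime] = None) -> bool:
    """Return True once a snapshot is old enough to be served publicly.

    The snapshot is stored at capture time regardless; only retrieval
    is delayed until the embargo window has elapsed.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - captured_at >= EMBARGO
```

The point of splitting capture from retrieval is that the historical record still gets made on day one, while the copy cannot stand in for the live article during the period when it competes with it.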
The Internet Archive at least provides one solution there, especially given the somewhat dubious practices Archive.is/today seems to be up to at the moment.
But I suspect that's probably another reason these sites don't want their work archived.
An even more cynical view is that the information on many local news sites in recent years isn't worth archiving anyway, it is largely generic rubbish filtered down from on high because these days most local outlets are owned by large national groups¹ that use them for little more than a place to insert adverts.
For actual local news, which those outlets do sometimes still carry, archiving personal blogs, event sites, and some social media content³, would be more useful than local news outlets. It is a shame that a lot of this has moved to platforms that are more difficult to archive (distraction media providers block the archive too, discord and similar services are more difficult to easily/meaningfully archive, heck searching for non-recent information on them that you know is there can be a pain, etc.)
--------
[0] For numerous reasons including some idea that it could affect their advertising revenue, that they don't want things which they correct [because they are actually wrong or because those higher up the ownership chain are happy with certain truths] to be embarrassingly preserved in their original form, etc.
[1] Like Reach Plc or Newsquest² (which owns the most prominent local rag where I live) in the UK.
[2] Which is in turn owned by the US company USA Today Co.
[3] Though this is probably largely blocked from the archive too.
I'm pretty sure similar was done for the newspaper. However, the oldest paper was bought and killed decades ago, so not sure what happened there.
While not as convenient as a live website, most news sources will have an actual physical archive that you can access with some real intent.
Back issues (usually at least a few years old) are available via JSTOR for free in small amounts and through subscriptions for bulk users. I'm sure there's some reason to fight about the details, but from a distance it looks like a pretty good compromise.
This trend of outright banning the Internet Archive has me extremely worried. I fear a future where news articles are memoryholed, and no one can remember exactly what was reported and how sensational it all seemed.
I've been working on this project [0] for a while. Originally, I started with a tool that would allow people to snapshot webpages in their own browser, and they could selectively share their snapshots. Then by consensus, everyone could understand what exactly had changed, and they could draw their own conclusion about why.
While working on it, I realized that an authoritative answer to "what did it look like on $DATE" can't be produced by a no-name company. It's gotta be a non-commercial entity that's got a track record of integrity. The dream would be to allow MemoryHole customers to submit their snapshots to the Internet Archive (or other non-commercial entity). It's definitely a copyright nightmare - so no clue how this could work.
[0] - https://memoryhole.app
It could work as a decentralized free and open source system that doesn't care about copyright. Like how torrents work now, but it would be good to have it work over Tor or something. Perhaps as a DAO for the management aspect of it. I don't know how exactly. But disregarding copyright by using a centralized company is the wrong idea.
Or you can do the lawful approach and try to work within the framework of that copyright nightmare. But "fuck copyright" is an easier path.
The torrent approach is nice. I could imagine a selfhosted way to store the data (for a group of people)
Linkwarden does this well. You can share a collection for a small group of people.
https://github.com/linkwarden/linkwarden
Tor is fine especially for onion sites. You just have to understand the limitations.
(I2P is also good.)
Is there a way to export/download my saves in a reasonable way?
It looks like this:
├── files
│ └── 632daffb-2f4f-4795-bb4d-3149d24f4264
│ ├── original.html
│ ├── readerview.html
│ └── screenshot.png
├── manifest.json
└── metadata.csv
Now most of those who spend money get access to relatively good news in comparison to those who don't. The interesting thing is that if you model the utility of a customer base as trifactorial (subscriptions, ad-supported, influence-ability) and you set ad-support to near zero you're left with this situation where those with no ability to pay are now overwhelmingly useful to the website provider only as an influenceable base.
"If you're not paying, you're not the customer, you're the product", we used to say[0]. It turns out that's true, but if you can't pay by looking at ads, you will pay by the actions you take when you believe what the actual customer wants you to believe.
0: Though sometimes you do pay and you're still "the product" haha!
So their argument is that people who would be paying money at their paywalls, are going to IA to get their news for free? And if they can thwart those people, they'll show up and become monthly subscribers?
I am vaguely sympathetic to newspapers as a concept, though the actual owners of approximately all of them are just PE companies looking to extract maximum profit from this dying industry, not really trying to prolong their existence.
https://www.uh.edu/news-events/stories/052815watchingtvracia...
https://www.mediamatters.org/legacy/video-what-happens-when-...