@PixelJones What I'd really like to do is get away from the per-item / per-use payment model, and instead think of the infostream as a distribution utility in which access rather than use is the principal consideration, and in which ability to pay (wealth & income) rather than content value is the basis on which payments are made.

The questions of both what content is made available and how that content is compensated I'm leaving somewhat vague, though in general we have systems which work for this, and which have worked for nearly a century now: broadcast & cable media, audit-based measurement (Nielsen, Arbitron, etc.), distributor-based negotiations (with individual broadcast stations or networks), and something closely approaching a common-carrier model for the actual access providers (that is, ISPs).

The points @dangillmor raised are valid: a gatekeeper monopoly is a critical hazard, and is worth addressing from a competitiveness standpoint, independent of this proposal.

Why "all you can eat"? Two principal reasons:

1: Need for information is strongly independent of capacity to pay, and is often inversely associated with it.

2: There are entirely novel capabilities afforded by access at scale which a usage-based payment model largely forecloses on. Aaron Swartz's wholesale downloading of JSTOR scientific papers, the work which led to his prosecution and suicide, is a key case in point. It's possible to look through, over, and among a corpus to find relationships not otherwise manifest. (I'm doing something along these lines with my #HackerNewsAnalytics series posted here on the Fediverse.)

The notion of an individual or household account, associated with personal mobile devices and/or household Internet service, from which pro-rata payments are then allocated amongst various providers is one option for compensation, though even that might well not be ideal. It imposes a huge surveillance component itself (who is reading, listening to, or watching what), and could well disproportionately benefit slighter works whilst starving more substantial ones. Most critically on that last: works which are far more expensive to produce at quality, such as investigative journalism or scientific research.
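To make the pro-rata idea concrete, a toy sketch of the allocation arithmetic (the fee and the usage shares are invented numbers for illustration, not a proposal):

```shell
# Illustrative arithmetic only: divide a flat household access fee
# among providers pro-rata by that household's usage shares.
# fee=20 and the shares below are made-up numbers.
awk -v fee=20 'BEGIN {
  n = split("news:0.5 music:0.3 video:0.2", parts, " ")
  for (i = 1; i <= n; i++) {
    split(parts[i], kv, ":")
    printf "%-6s $%.2f\n", kv[1], fee * kv[2]
  }
}'
```

The surveillance problem noted above is exactly the "usage shares" input: someone has to measure them.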

Some sense of local, regional, national, and global providers / publishers, within genres, funded with a specific budget and for a minimum guaranteed time period, would provide the institutional stability to provide certain classes of work: news, education, business and government publications, academic research, and of course, entertainment.

And, again, multiple revenue streams, including premium subscriptions, patronage, advertising, etc., could well be additional components. But an access-based automatic and universally-billed tier really does seem to be a possibility that's rarely mentioned or advocated.

@cobalt

#UniversalContentSyndication #PayingForJournalism

On general discussion forums and "paying for media"

One frequent dispute online is over paywalled links, and the general advisability on various grounds of sharing workarounds. I happen to have data for Hacker News (HN), so that's what I discuss here.

As I'm sitting on a trove of ~190k front page stories and the sites linked by them, I can bring some insight to this debate. As of 21 June 2023, there were 52,642 distinct sites which have made the front page alone (30 items/day). Front-page stories are roughly 3% of all submitted posts; the site tally across all submissions would be rather larger.

How many of those 52,642 sites should HN members subscribe to?

If we restrict that to only the sites with 100+ front-page submissions, that number falls to 149. Still, arguably, excessive.

Of the sites I've identified as "general news" (all sites w/ >= 17 appearances, plus a few others), that list is 146.

Those constitute 8.47% of all HN front-page posts, the second-largest overall category following blogs.

I would suggest that expecting the 600k+ active HN participants, let alone the 5 million or so total monthly users, to individually subscribe to more than a very small handful of such sites is entirely unrealistic.

Subscriptions are a concept which worked reasonably well for local newspapers serving limited areas, where some fraction of households might subscribe to one daily, and far fewer to multiple. Even then, the majority of expenses were covered by advertising.

Whatever business model people are going to suggest for online media, it's going to have to address the fact that individual people cannot and will not register many thousands, or even dozens, of subscriptions.

(Adapted from an earlier HN comment: https://news.ycombinator.com/item?id=36832354)

Edits: Rephrasing.

#HackerNews #HackerNewsAnalytics #Paywalls #Subscriptions #Journalism


HackerNews changed how it dealt with highly-active discussions around January 2009, based on evidence I see (far fewer spicy threads after that date).

I'm also seeing that spicy stories actually tend to rank slightly higher on the page (a lower "storypos", that is, story position, value), which is counter to my expectation. This may of course be due to selection bias: moderators specifically lift the flamewar penalty on some overheated stories, so that those which do survive are more appropriate to HN.

I'd like to look at semantic / sentiment elements here as well, words or phrases which seem more prevalent on high-ratio stories. Here my analytic methods work against me as the HN title of a post is often quite short and not especially descriptive, though with some examples (as with the mental health study mentioned earlier).

#HackerNews #HackerNewsAnalytics

Hacker News "Ratio": political commentary sites

Continuing my look at the comments/votes ratio, a look at sites which tend to focus on political commentary and their "spiciness". These tend to be well above mean (0.63), median (0.52), and tend to be a standard deviation or more from the mean (1 sd: 0.78, 2 sd: 0.92, 3 sd: 1.06).

Stories Votes Comments Ratio Site
2 18 57 3.167 heritage.org
4 143 224 1.566 hoover.org
9 473 603 1.275 breitbart.com
8 1724 1873 1.086 cityobservatory.org
9 364 379 1.041 mises.org
1 56 55 0.982 adamsmith.org
7 2488 2372 0.953 city-journal.org
1 92 85 0.924 manhattan-institute.org
70 13143 11614 0.884 reason.com
5 854 722 0.845 jacobinmag.com
1 204 153 0.750 theblaze.com
13 1607 1202 0.748 bostonreview.net
5 1682 1252 0.744 tribunemag.co.uk
4 629 465 0.739 nationaljournal.com
5 1907 1400 0.734 americanaffairsjournal.or
12 2164 1584 0.732 alternet.org
10 1302 871 0.669 cato.org
5 738 493 0.668 dailycaller.com
9 1387 844 0.609 dailykos.com
5 759 450 0.593 rawstory.com
10 2538 1455 0.573 rootsofprogress.org
2 552 275 0.498 theroot.com
30 7881 3850 0.489 rt.com
2 1256 467 0.372 wsws.org

Note that general news tends somewhat toward spicy, though not as much as the explicitly political sites. For the 147 sites I'd identified as "general news", the ratio statistics are:

n: 147, sum: 94.415, min: 0.092, mean: 0.642279, median: 0.605, sd: 0.433165

%-ile:

5: 0.234, 10: 0.341, 15: 0.4515,
20: 0.491, 25: 0.51, 30: 0.5305,
35: 0.5415, 40: 0.566, 45: 0.581,
55: 0.614, 60: 0.6285, 65: 0.654,
70: 0.68, 75: 0.716, 80: 0.734,
85: 0.7875, 90: 0.8715, 95: 1.1925

(As with other toots in this series, Markdown formatting is used; toot.cat's presentation may be better than your own instance's.)

#HackerNews #HackerNewsAnalytics

The 20 "spiciest" sites seem to be (using a cut-off of 20+ stories; columns are site, stories, votes, comments, ratio):

apnews.com 36 14674 17512 1.193
sfchronicle.com 25 5771 6174 1.070
variety.com 24 5479 4992 0.911
mattmaroon.com 73 3332 3023 0.907
axios.com 92 38075 34150 0.897
bizjournals.com 20 2183 1959 0.897
cnbc.com 174 59983 53056 0.885
apple.com 241 99945 88396 0.884
reason.com 70 13143 11614 0.884
nypost.com 28 5851 5088 0.870
markevanstech.com 22 290 251 0.866
macrumors.com 62 18700 16162 0.864
nikkei.com 56 17568 15174 0.864
economist.com 829 119205 102702 0.862
thewalrus.ca 30 6194 5199 0.839
techradar.com 30 7227 6053 0.838
backreaction.blogspot.com 33 7209 5968 0.828
strongtowns.org 27 8279 6857 0.828
mondaynote.com 45 7581 6268 0.827
coindesk.com 22 10236 8355 0.816

And the 20 least spicy sites are (columns as above: site, stories, votes, comments, ratio):

particletree.com 37 997 227 0.228
brendangregg.com 40 11135 2512 0.226
intruders.tv 28 324 73 0.225
aphyr.com 34 8514 1910 0.224
andrewchen.typepad.com 51 757 168 0.222
michaelnielsen.org 31 3335 723 0.217
igvita.com 38 3626 767 0.212
startuplessonslearned.blo 24 1101 232 0.211
citusdata.com 51 8361 1717 0.205
ferd.ca 21 5883 1132 0.192
ocks.org 27 6036 1120 0.186
tensorflow.org 22 5612 1020 0.182
aosabook.org 21 3899 669 0.172
ocw.mit.edu 41 8793 1500 0.171
david.weebly.com 20 1364 226 0.166
jslogan.com 24 97 16 0.165
burningdoor.com 23 149 23 0.154
linusakesson.net 26 4531 684 0.151
github.com/0xax 22 2168 121 0.056

#HackerNews #HackerNewsAnalytics

The Hacker News Ratio

One concept Hacker News uses to moderate discussions is a "flamewar detector" which, based on moderator comments over the years, is triggered when a discussion has more than 40 comments AND more comments than votes on the article.
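As a sketch, that rule is trivially expressible in shell — the 40-comment threshold and the comparison are my reading of moderator comments, not confirmed HN internals:

```shell
# Sketch of the described flamewar rule: flag a story when it has
# more than 40 comments AND more comments than votes.
is_flamewar() {
  comments=$1; votes=$2
  [ "$comments" -gt 40 ] && [ "$comments" -gt "$votes" ]
}

is_flamewar 120 80 && echo "flagged"     # spicy: 120 comments vs 80 votes
is_flamewar 35 10  || echo "not flagged" # under the 40-comment floor
is_flamewar 90 200 || echo "not flagged" # popular, but not contentious
```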

That had long struck me as questionable, but it's now something I can look at and ... it seems reasonably accurate. I've calculated ratios of all 178,882 HN Front Page stories (as of 2023-06-30), and ... do I have some ratios.

Basic stats:
n: 178882, sum: 89796.9, min: 0.00, max: 21.00, mean: 0.501990, median: 0.4, sd: 0.432899

Percentiles:
%-ile: 5: 0.08, 10: 0.13, 15: 0.17, 20: 0.21, 25: 0.24, 30: 0.27, 35: 0.3, 40: 0.33, 45: 0.37, 55: 0.44, 60: 0.48, 65: 0.53, 70: 0.58, 75: 0.64, 80: 0.72, 85: 0.82, 90: 0.96, 95: 1.22
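For the curious, the percentile step is simple once the ratios are sorted; a nearest-rank sketch over a made-up sample (nearest-rank is one of several percentile conventions, and my actual scripts may differ):

```shell
# Sketch: nearest-rank percentiles of a numeric column.
printf '%s\n' 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 > ratios.txt

sort -n ratios.txt | awk -v pcts="25 50 75 90" '
  { v[NR] = $1 }
  END {
    m = split(pcts, p, " ")
    for (i = 1; i <= m; i++) {
      r = int(p[i] / 100 * NR + 0.5)   # nearest rank
      if (r < 1) r = 1
      printf "%d: %s\n", p[i], v[r]
    }
  }
'
```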

Because of how I've parsed and processed data, it's not entirely straightforward to pull up the specific posts, though I can find those by the date and story position (ranked 1--30 on the page).

And ... yeah, the stories that tend to rate high based on this metric do tend to be sort of flamey.

The most ratioed post of all time was "juwo beta is released (at last!) Please use it and help improve it!", from 18 April 2007, at 21.0:

https://news.ycombinator.com/item?id=14253

Sometime around 2009--2010 the flamewar detector seems to have been implemented, and ratios tend to be much lower since, though there are still some pretty spicy discussions. One from the National Institutes of Health titled "Mental illness, mass shootings, and the politics of American firearms", posted on 26 May 2022 (for a story originally dating from 2015), is the highest-ratioed post after the flamewar detector came into use, at 5.99:

https://news.ycombinator.com/item?id=31511274

I find it interesting how being able to query my archive affords insights on HN which aren't available through the standard search tools. It's possible to look for specific keywords, or submissions or comments from a specific account, but searching for contentious posts isn't really A Thing.

I'm doing some further digging to see what patterns might emerge by site, though finding a good minimum number of front-page appearances is one question I'm looking at.

#HackerNews #HackerNewsAnalytics


More on "UNCLASSIFIED": there are 36,520 of those sites right now. (Despite knowing better I keep diving in and classifying more of them.)

It's not practical to list all of them. But we can randomly sample. And large-sample statistics start to apply at about n=30, so let's just grab 30 of those sites at random using sort -R | head -30:

1 sfg.io
1 extroverteddeveloper.com
2 letmego.com
1 thestrad.com
2 bombmagazine.org
1 domlaut.com
1 bootstrap.io
1 jumpdriveair.com
2 desmos.com
1 leo32345.com
1 echopen.org
1 schd.ws
1 web3us.com
7 akkartik.name
1 bcardarella.com
1 cancerletter.com
1 platinumgames.com
1 industrytap.com
2 worldoftea.org
1 motion.ai
1 vectorly.io
2 enterprise.google.com
1 lift-heavy.com
1 davidpeter.me
1 panoye.com
3 thestrategybridge.org
2 fontsquirrel.com
1 kettunen.io
1 moogfoundation.org
2 elekslabs.com

That's a few foundations, a few blogs, a corporate site (enterprise.google.com), and something about tea, all with a small number of posts (1--7).

I'm looking at some slightly larger samples (60--100) here on my own system, and can actually make some comparisons across samples (to see how much variance there is) which can give some more information on tuning what I would expect to find under the "UNCLASSIFIED" sites.
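As a sketch of the estimation arithmetic, with an invented hit count (say 4 blogs in a 30-site sample), the scale-up to the 36,520 unclassified sites and a rough binomial standard error look like:

```shell
# Sketch: scale a sample proportion up to the full population, with a
# rough standard error. hits=4 is an invented example figure, not my
# actual classification of the sample above.
awk -v hits=4 -v n=30 -v pop=36520 'BEGIN {
  p  = hits / n                  # sample proportion
  se = sqrt(p * (1 - p) / n)     # binomial standard error of p
  printf "estimate: %.0f sites, +/- %.0f at ~1 sd\n", p * pop, se * pop
}'
# -> estimate: 4869 sites, +/- 2267 at ~1 sd
```

The wide interval is the price of n=30; the 60--100 samples tighten it roughly with the square root of n.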

Which is one way of using #StatisticalMethods to make estimates where direct measurement or assessment is impractical.

#HackerNewsAnalytics #HackerNews #MediaAnalysis #RandomSampling #Statistics

So ... I'm starting to get the reporting by site classification across years down and ... it is interesting.

Preliminary and buggy code yet. Also this is highly dependent on how I've actually classified sites.

I've got a few classifications I'd wanted to keep an eye on:

  • Programming-specific sites. A lot of this is github and gitlab, basically, software projects with code. I'm distinguishing software (which is mostly about use) and programming which involves, or at least anticipates, actual development.

  • "Political commentary". I used this as a description for ... highly political sites (spot-checking to see what stories actually hit the front page, though I should be more robust in that). The list: reason.com, rt.com, bostonreview.net, alternet.org, cato.org, rootsofprogress.org, breitbart.com, dailykos.com, mises.org, dailycaller.com, jacobinmag.com, rawstory.com, tribunemag.co.uk, hoover.org, heritage.org, theroot.com, wsws.org, adamsmith.org, manhattan-institute.org, theblaze.com.

And there's "academic / science" which is mostly university and academic press / journal sites.

Anywho....

... at least from initial takes, the trend on these does not suggest a drift toward sensationalistic topics and/or sites, but the opposite. Many more programming FP stories in recent years, fewer political commentary, and more academic/science items.

Presuming this holds up as I code further.

This is one of the fun things about data analysis: stuff jumps out at you, sometimes confirming hunches, but often radically violating preconceptions.

I want to look more closely at what happens in the lead-up and follow-on to the 2016 US elections cycle in particular....

Hrm. What does spike is cryptocurrency-specific sites in 2014. Though that falls off again. (I suspect as that discussion enters more mainstream sources.)

And "general info" and "general interest" sites seem to rise in recent years.

#HackerNewsAnalytics #HackerNews #MediaAnalysis

OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.

Full breakdown:

4 20
14 19
13 18
23 17
32 16
37 15
48 14
55 13
96 12
120 11
122 10
168 9
247 8
315 7
396 6
622 5
1052 4
2016 3
5103 2
26494 1

A ... large number of sites w/ <= 20 posts are actually classified, mostly by regexp rules & patterns. Oh, hey, I can dump that breakdown as well:

35 20
27 19
47 18
31 17
33 16
41 15
51 14
45 13
42 12
29 11
46 10
46 9
47 8
91 7
138 6
178 5
269 4
524 3
1624 2
11472 1
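Both breakdowns above are the classic count-of-counts pattern; a minimal sketch, assuming a file with one line per front-page post naming its site (invented sample data):

```shell
# Sketch: posts per site, then sites per post-count -- the
# "count of counts" histogram shape shown above.
printf '%s\n' a.com a.com b.com c.com c.com c.com > posts.txt

sort posts.txt | uniq -c |        # posts per site
  awk '{print $1}' | sort -rn |   # keep just the per-site counts
  uniq -c                         # sites having each post-count
```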

I could pick up just under 4% more posts by classifying another 564 sites, but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.

Now to try to turn this into an analysis over time.

I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).

To do full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.
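One way out of that loop is awk/gawk's associative arrays: load the 52k site classifications once, then each of the ~180k story records costs a single hash lookup rather than a scan over every site. A minimal sketch with made-up filenames and records:

```shell
# Sketch: NR==FNR loads site -> class pairs from the first file into
# an associative array; each record of the second file then gets one
# constant-time lookup, no per-record loop over all sites.
printf 'github.com\tprogramming\nreason.com\tpolitical\n' > classes.tsv
printf '2023-06-21\tgithub.com\n2023-06-21\texample.org\n' > frontpage.tsv

awk -F'\t' '
  NR == FNR { class[$1] = $2; next }   # first file: build the lookup
  {
    c = ($2 in class) ? class[$2] : "UNCLASSIFIED"
    print $1, $2, c
  }
' classes.tsv frontpage.tsv
```

This also sidesteps the regex cost: the `in` test is a plain hash probe, so the regex rules need apply only once per site when building classes.tsv, not once per story.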

#HackerNewsAnalytics #HackerNews #gawk #awk #DataAnalysis #MediaAnalysis

Oh, and something that would be really useful would be a quick way of looking up a website and getting a rough classification as to what type of content it presents.

Wikipedia can offer some of this, occasionally sources such as Crunchbase, though the first is hard to parse.

The Alexa Crawl (Amazon, originally by Brewster Kahle of the Internet Archive) used to offer this as well, though I think that's no longer active.

If anyone knows of other / better sources, I'd love to know.

#DearMastomind #DearHivemind #HackerNewsAnalytics