ProleWiki RAG MCP vs WSWS' (trots) Socialism AI - Lemmygrad
How many keywords can you stuff in a title, right? I’m posting this in the
prolewiki community because we’ll be discussing ProleWiki’s own in-development
RAG for LLMs, but first: you probably saw that the WSWS, i.e. the trots, published
‘Socialism AI’. In their press release
[https://www.wsws.org/en/articles/2025/12/12/gpid-d12.html], they basically congratulate themselves about how cool this is for the workers’ movement and socialism, great victory this and great victory that, blahblahblah. You know
how trots are. Their system is usable through ai.wsws.org [http://ai.wsws.org] or something iirc. It’s a web interface, so yes, it’s cool that it comes as a package you can just use from any device without having to fiddle with anything, but there are also a lot of problems with it, especially coming from self-proclaimed communists. Though with how much of a joke trots are to everyone, I feel like I’m not really adding fuel to the fire with this post
lol. We looked into how their system works, because they give absolutely zero indication of the technical implementation, and we found several notices of
copyright in the Terms of Service. They say that the output from their AI
belongs to them, for example. Courts in the US have found that purely AI-generated output can’t be copyrighted, but sure, I guess; not really my area of expertise. We’ll get into
it.

## Understanding what WSWS did

* WSWS did not train a model from the ground up
* WSWS did not fine-tune an existing open-source model
* WSWS is not running and hosting their own model

What WSWS does (and you can find this out
from just using browser tools, i.e. F12 on their homepage) is call the OpenAI (ChatGPT) and DeepSeek APIs. Their pipeline goes like this (as far as we can ascertain from
simple browser tools): You send your prompt -> they add their own instructions
to it -> LLM fetches WSWS blog articles to answer your prompt -> LLM reads blog
articles -> LLM answers your prompt with the WSWS blog articles as sources. This
is what we call RAG, or Retrieval-Augmented Generation. The technique is legit, I’m not disputing that; it’s just that the way they did it is both inefficient and
concerning.

## The Problems I have with that way of doing things

We’ll get into
the technical problems when I detail what the ProleWiki MCP will look like. For now: their system is also very closed-source and obfuscated. Mind you, I did not create an account (too much hassle if I want to retain my privacy on it), but you have to understand that your prompt + the LLM output transit through OpenAI and DeepSeek. Firstly, then, there is no privacy when using this service; with OpenAI it goes straight to the feds.
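We can’t see their backend, but from the requests browser tools show, the shape of such a pipeline is roughly this. A minimal sketch, and to be clear: every function name, model string, and instruction in it is my assumption for illustration, not their actual code:

```python
# Hypothetical sketch of a WSWS-style RAG proxy. We can't see their backend;
# this only shows the shape: your prompt gets wrapped with their instructions
# and retrieved articles, and the whole bundle is sent to a third-party API.

def retrieve_articles(prompt: str) -> list[str]:
    # Stand-in for their retrieval step (whatever search they run over wsws.org).
    return ["WSWS article text relevant to: " + prompt]

def build_payload(user_prompt: str) -> dict:
    articles = retrieve_articles(user_prompt)
    system = (
        "You are Socialism AI. Answer only from the provided WSWS articles.\n\n"
        + "\n\n".join(articles)
    )
    # This dict is what would be POSTed to api.openai.com or api.deepseek.com,
    # i.e. your prompt (and the model's answer) transit through their servers.
    return {
        "model": "gpt-4o-mini",  # assumed model name, not confirmed
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_prompt},
        ],
    }

payload = build_payload("When was Lenin born?")
print(payload["messages"][1]["content"])  # your words, leaving your machine
```

The point is just that the instructions, your question, and the retrieved articles get shipped wholesale to a third-party API, and the answer comes back through it too.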
Secondly, they sell paid tiers, starting at $5 per month for 150 messages, which is… absolutely nothing. Thirdly, everything is closed off. They did not release any documentation on how this works or how you could run it yourself. Selling paid tiers is not a problem in itself, at least for me personally. You have to break even, and they do pay for API access to OpenAI and DeepSeek (though DeepSeek is very cheap). The problem I have is that they should at least offer an open-source implementation for people who know how to use it, or at the very least make the RAG files available. This is not the case. I’m also a proponent of paying it
forward. Yes this costs them money, but they could find a way to break even in
ways that don’t consist of just selling another SaaS (software-as-a-service).
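To put rough numbers on how cheap the API side is: here’s a back-of-the-envelope calculation, where the prices and token counts are my assumptions for illustration only (check current API pricing yourself):

```python
# Back-of-the-envelope cost of 150 RAG messages through a cheap API.
# All prices and token counts below are assumptions for illustration.

price_in_per_mtok = 0.27    # $ per million input tokens (assumed)
price_out_per_mtok = 1.10   # $ per million output tokens (assumed)

tokens_in_per_msg = 8000    # prompt + instructions + retrieved articles (assumed)
tokens_out_per_msg = 800    # answer length (assumed)

cost_per_msg = (
    (tokens_in_per_msg / 1e6) * price_in_per_mtok
    + (tokens_out_per_msg / 1e6) * price_out_per_mtok
)
cost_150 = 150 * cost_per_msg

print(f"~${cost_per_msg:.4f} per message, ~${cost_150:.2f} for 150 messages")
```

Even if you multiply these assumptions several times over, 150 messages cost well under a dollar in API fees, so a $5 tier carries a hefty margin.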
Let people pay it forward for others, or something. Accept that you will lose some money running this and cover the difference with dues, or with people in the party who have money and don’t mind maintaining the service. Accept donations. There are lots of ways to do this that are not so commercial, i.e. “if you can’t pay you must vacate the premises”.

## The technical implementation: ProleWiki MCP vs. Socialism AI

A few months ago we started working with a dev who was making the
Marxists Internet Archive available for RAG use. This project evolved and they
are now making a ProleWiki MCP with the pages we sent them. It’ll still be RAG,
but more efficient. So first, let’s look at how the Socialism AI RAG works. If
you remember the pipeline: You send your prompt -> they add their own
instructions to it -> LLM fetches WSWS blog articles to answer your prompt (<--
we are here) -> LLM reads blog articles -> LLM answers your prompt with the WSWS
blog articles as sources. The problem we’ve found lies in exactly what kind of data the LLM gets access to. Imagine it like a bin the LLM can sift through to make
an answer with. If you provide it with a link to the page, it parses that as HTML code, with all its tags, headers, script calls, etc. Imagine me giving you a page full of HTML code and asking you, “can you tell me when Lenin was born from
this info?” You can, but it’s gonna take a while, and a lot of what you read is simply unnecessary. And you only have this one page to make an answer from. If Lenin’s DOB is not neatly written on it, you have to do extra thinking to piece it together (this is the context window limit: the LLM simply can’t read through 250k WSWS articles, it has to pick and choose which articles are most likely to help answer the question). Therefore we can optimize this bin. Instead of giving you
full pages you can pick from, we can give you individual lines. In our RAG for ProleWiki, what our dev did was write some math that extracts every line from our pages on the principle of 1 line = 1 idea. Then it puts these ideas together in a matrix and sorts them by semantic closeness. What this means is that if you’re the
LLM, you don’t get a full page on the October Revolution or Lenin
[https://en.prolewiki.org/wiki/Vladimir_Lenin] to answer a question with. You
can see our page on Lenin is quite lengthy, and if you asked a question whose answer is not on this page when the LLM pulls it up before answering (for example, you can see the self-exile section is empty), it might not answer your question as well as it could. With the semantic matrix, instead of picking from
pages, it picks from lines to make a coherent answer. Instead of looking at just
Lenin’s page and filling its entire context window with it, it looks at semantic
information relating to Lenin’s self-exile on ProleWiki - or other sources you
add to the corpus, the ‘bin’ - and then builds an answer from that. This means that if
we have information about Lenin’s self-exile on say the USSR page (because why
not!), it will pull exactly that thread from that page. And this is much more
powerful than what the WSWS did, and it explains why they offer such measly usage rates. They are filling up the context window with noise tokens, because they’re sending an entire <!DOCTYPE HTML><head><meta-name>… HTML page instead of just the relevant content.

## But where does the MCP come in?

MCPs are kinda new, and were made
for AI to work with. I wouldn’t be the best person to explain them, but basically an MCP lets an LLM look at some data (websites, files, etc.) and work with that data in some way. They’re mostly used in agentic work: tools such as “view file” or “edit file” are exposed to the LLM, so it can perform these operations itself instead of having you do them and then confirm. So if you have an agent (such as crush
[https://github.com/charmbracelet/crush], our favorite here on lemmygrad), an
LLM can and will view and edit the files you tell it to. Those are an example of two tools. With an MCP, you give the LLM access to data it can read, and you can also give it its own tools. You could make a tool called “ProleWiki-fetch”: when the LLM decides it needs it (“okay, let’s use the prolewiki-fetch tool to look at data from ProleWiki to answer this question”), it communicates with the ProleWiki MCP you have installed locally, the MCP does its magic, and it sends the information back to the LLM. And not only that, but as we
said, you can also run this locally. We are still figuring out how we’ll package all of this, but most likely we’ll make the source files available so that anyone can build their own RAG or their own cloud web interface if they want. Likewise for the MCP: it will be downloadable with our source files, so that you can just add it to your agent interface and start querying the LLM and getting answers backed by ProleWiki content.

Communism is not in a position of strength currently.
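(A quick aside for the technically curious: nothing is finalized on our end, so here is only a toy sketch of what a tool like prolewiki-fetch amounts to. The names are hypothetical, the retrieval is stubbed, and a real server would speak the MCP protocol through an SDK rather than a plain dict of tools.)

```python
# Toy sketch of the tool flow: a server exposes named tools; the agent
# forwards the LLM's tool call and returns the result. Names are hypothetical,
# and the retrieval is a stand-in for the real semantic search.

LINES = {
    "lenin self-exile": "Line from ProleWiki about Lenin's years in self-exile...",
    "october revolution": "Line from ProleWiki about the October Revolution...",
}

def prolewiki_fetch(query: str) -> str:
    """Tool: return ProleWiki lines matching the query (stubbed lookup)."""
    hits = [text for key, text in LINES.items() if key in query.lower()]
    return "\n".join(hits) or "no match"

TOOLS = {"prolewiki-fetch": prolewiki_fetch}  # what the server advertises

def handle_tool_call(name: str, arguments: dict) -> str:
    # The agent sends this when the LLM decides to use a tool.
    return TOOLS[name](**arguments)

print(handle_tool_call("prolewiki-fetch",
                       {"query": "Tell me about the October Revolution"}))
```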
So, I don’t see any reason we should be trying to hide and obfuscate any of our
content. On the contrary, proletarian education demands it be accessible without
discrimination. Unlike trots, we trust the people to make the right decisions
collectively - if someone wants to use ProleWiki content to train a model and
paywall that, let them. There will be ten more who won’t. In fact, speaking of
models, our dev is also working on something there… but I was asked not to say
too much about it as it’s very experimental 🤐
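PS for the technically curious: here’s a toy version of the line-level retrieval described above. A real implementation uses a proper embedding model; this stand-in ranks lines by cosine similarity over word counts, just to show the shape of “pick lines by semantic closeness, not pages”:

```python
# Toy line-level RAG retrieval: rank individual lines (1 line = 1 idea) by
# similarity to the question, instead of stuffing whole pages into context.
# A real system would use an embedding model; word-count cosine is a stand-in.

import math
from collections import Counter

CORPUS = [  # (page, line) pairs pulled from different pages; content invented
    ("Vladimir Lenin", "Lenin was born on 22 April 1870 in Simbirsk."),
    ("Vladimir Lenin", "Lenin led the Bolsheviks during the October Revolution."),
    ("USSR", "During his self-exile, Lenin wrote extensively from abroad."),
    ("October Revolution", "The October Revolution took place in 1917."),
]

def vec(text: str) -> Counter:
    # keep letters, digits, spaces and hyphens; everything else becomes a space
    cleaned = "".join(ch if ch.isalnum() or ch in " -" else " " for ch in text.lower())
    return Counter(cleaned.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_lines(question: str, k: int = 2):
    q = vec(question)
    ranked = sorted(CORPUS, key=lambda pl: cosine(q, vec(pl[1])), reverse=True)
    return ranked[:k]

for page, line in top_lines("What did Lenin do during his self-exile?"):
    print(page, "->", line)
```

Note how the best line comes from the USSR page, not Lenin’s own page: exactly the cross-page pull described above.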