I wrote an article sharing multi-year insights into how I ingest about 800 feed sources (RSS, AlienVault OTX, and the Twitter realtime hose).

I also share some details on structuring for searchability in JanusGraph (backed by Cassandra and Lucene). All of it is relatively computationally inexpensive (one node only).

https://252.no/~tommy/writings/2021-02-16-signals-feeds.html

#gremlin #janusgraph #go #parser #aggregator #feeds #rss #otx #twitter

@tommy
Aren't you overwhelmed by information overload?
@paoloredaelli No. The program filters out irrelevant articles/objects, which means increased focus on topics of interest, probably about 10-20 "threads" each day. The ingestion pipeline handles thousands of objects each day, but most of these I never see.

I achieve this using a combination of hotwords and a graph. This makes it possible to traverse from e.g. the hotword "targeted", via an article, and then match "cyber" (see the traversal sketch below).
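For illustration, here is a minimal sketch of such a traversal using the Apache TinkerPop gremlin-go driver. The driver choice, the server endpoint, and the schema (vertex labels "hotword" and "article", a "mentions" edge, and a "value" property) are assumptions for the example, not necessarily how the actual graph is modelled:

```go
package main

import (
	"fmt"

	gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
)

func main() {
	// Connect to a Gremlin Server fronting JanusGraph (endpoint is an assumption).
	drc, err := gremlingo.NewDriverRemoteConnection("ws://localhost:8182/gremlin")
	if err != nil {
		panic(err)
	}
	defer drc.Close()

	g := gremlingo.Traversal_().WithRemote(drc)

	// Hypothetical schema: "hotword" and "article" vertices joined by "mentions"
	// edges. Start at the hotword "targeted", hop to articles that mention it,
	// then keep only paths that also reach the hotword "cyber".
	results, err := g.V().Has("hotword", "value", "targeted").
		In("mentions").HasLabel("article").
		Out("mentions").Has("hotword", "value", "cyber").
		Path().
		ToList()
	if err != nil {
		panic(err)
	}
	for _, r := range results {
		fmt.Println(r.GetString())
	}
}
```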

@tommy Are you ingesting the full Twitter firehose?
@mcg No. I use a set of 40 hotwords that are applied to the hose. Other than that, the hose is limited to the number of tweets that Twitter provides for free. This was better before they started restricting the API.
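To give a rough idea of what "applying hotwords" to the stream can look like, here is a tiny self-contained Go sketch. The word list and the plain case-insensitive substring match are illustrative assumptions; the real matcher and the Twitter client are not shown:

```go
package main

import (
	"fmt"
	"strings"
)

// An illustrative subset of hotwords; the full set is roughly 40 words.
var hotwords = []string{"targeted", "cyber", "ransomware", "0day"}

// matchHotwords returns the hotwords found in a tweet's text.
// This is a plain case-insensitive substring check; the actual
// matching logic in the pipeline isn't described here.
func matchHotwords(text string) []string {
	lower := strings.ToLower(text)
	var hits []string
	for _, w := range hotwords {
		if strings.Contains(lower, w) {
			hits = append(hits, w)
		}
	}
	return hits
}

func main() {
	tweet := "Targeted cyber operation reported against a logistics provider"
	fmt.Println(matchHotwords(tweet)) // prints: [targeted cyber]
}
```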
@tommy Thanks for clarifying. I was wondering whether you had somehow managed the full feed with little hardware...
@mcg Just so I understand you correctly: are you asking about processing, memory, bandwidth and disk performance, or post-ingestion analysis performance using Cassandra/Gremlin/Badger?
@tommy The toolchain you described in the post to ingest and process the stream. It sounded like you were using it to handle the full Twitter firehose.
@mcg I see. This is not at that scale. Looking at the current stats, the program ingests and scans 8 tweets per second, with URL retrieval and file parsing included. So hardware isn't a problem in that regard; the batch feed retrieval is more of a challenge.