I wrote an article sharing multi-year insights into how I ingest about 800 feed sources (RSS, AlienVault OTX, and the Twitter realtime hose).

I also share some details on structuring for searchability in JanusGraph (backed by Cassandra and Lucene). All of it is relatively computationally inexpensive (one node only).

https://252.no/~tommy/writings/2021-02-16-signals-feeds.html

#gremlin #janusgraph #go #parser #aggregator #feeds #rss #otx #twitter

@tommy
Aren't you overwhelmed by information overload?
@paoloredaelli No. The program filters out irrelevant articles/objects, which means increased focus on topics of interest, probably about 10-20 "threads" each day. The ingestion pipeline handles thousands of objects each day, but most of these I never see.

I achieve this using a combination of hotwords and a graph. This makes it possible to traverse from e.g. the hotword "targeted", via an article, and then match "cyber" (see the traversal sketch below).
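For illustration, here is a minimal sketch of such a traversal using the Apache TinkerPop gremlin-go driver. The driver choice, the server endpoint, and the schema (vertex labels "hotword" and "article", a "mentions" edge, and a "value" property) are assumptions for the example, not necessarily how the actual graph is modelled:

```go
package main

import (
	"fmt"

	gremlingo "github.com/apache/tinkerpop/gremlin-go/v3/driver"
)

func main() {
	// Connect to a Gremlin Server fronting JanusGraph (endpoint is an assumption).
	drc, err := gremlingo.NewDriverRemoteConnection("ws://localhost:8182/gremlin")
	if err != nil {
		panic(err)
	}
	defer drc.Close()

	g := gremlingo.Traversal_().WithRemote(drc)

	// Hypothetical schema: "hotword" and "article" vertices joined by "mentions"
	// edges. Start at the hotword "targeted", hop to articles that mention it,
	// then keep only paths that also reach the hotword "cyber".
	results, err := g.V().Has("hotword", "value", "targeted").
		In("mentions").HasLabel("article").
		Out("mentions").Has("hotword", "value", "cyber").
		Path().
		ToList()
	if err != nil {
		panic(err)
	}
	for _, r := range results {
		fmt.Println(r.GetString())
	}
}
```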

@tommy Are you ingesting the full Twitter firehose?
@mcg No. I use a set of 40 hotwords that are applied to the hose. Other than that, the hose is limited to the number of tweets that Twitter provides for free. This was better before they started restricting the API.
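To give a rough idea of what "applying hotwords" to the stream can look like, here is a tiny self-contained Go sketch. The word list and the plain case-insensitive substring match are illustrative assumptions; the real matcher and the Twitter client are not shown:

```go
package main

import (
	"fmt"
	"strings"
)

// An illustrative subset of hotwords; the full set is roughly 40 words.
var hotwords = []string{"targeted", "cyber", "ransomware", "0day"}

// matchHotwords returns the hotwords found in a tweet's text.
// This is a plain case-insensitive substring check; the actual
// matching logic in the pipeline isn't described here.
func matchHotwords(text string) []string {
	lower := strings.ToLower(text)
	var hits []string
	for _, w := range hotwords {
		if strings.Contains(lower, w) {
			hits = append(hits, w)
		}
	}
	return hits
}

func main() {
	tweet := "Targeted cyber operation reported against a logistics provider"
	fmt.Println(matchHotwords(tweet)) // prints: [targeted cyber]
}
```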
@tommy Thanks for clarifying. I was wondering whether you had somehow managed the full feed with little hardware...
@mcg Just so I understand you correctly: are you asking about processing, memory, bandwidth and disk performance, or post-ingestion analysis performance using Cassandra/Gremlin/Badger?
@tommy The toolchain you described in the post to ingest and process the stream. It sounded like you were using it to handle the full Twitter firehose.
@mcg I see. This is not at that scale. Looking at the current stats, the program ingests and scans 8 tweets per second, with URL retrieval and file parsing included. So hardware isn't a problem in that regard; the batch feed retrieval is more of a challenge.