updated the #clusterless subpop CLI build to provide a Homebrew tap for easy installation.

https://github.com/ClusterlessHQ/subpop

subpop is an experimental tool for diffing datasets from the CLI.

runs on #Linux and #macOS, but sadly it's written in #java, so no native binaries just yet.

while pondering my need for a remote compute environment, vs having random boxes littered about generating heat, I realized I could add a 'device' component concept to #Clusterless.

this concept not only complements the current model types, it will also be handy standalone.

consider an #AWS EC2/ECS instance doing some complex work and dropping files into S3 (via the new Mountpoint feature), where a clusterless DAG takes over processing when the files arrive (via the S3 put boundary).

Ryan Singel (@[email protected])

Indie creators ditching one VC-funded company for another thinking it'll be different this time. It won't. The incentives remain the same.

LiveJournal was a trap
Medium was a trap
Patreon is a trap
Substack is a trap
Etsy became a trap
Beehiiv...

The difference this time is there are real indie alternatives:

Ghost
WordPress
Transistor.fm
Fediverse
Outpost.pub (self plug)
Buttondown
Liberapay
OpenCollective
And more

The hacienda must be built #SubstackMigration #Substack

@seldo I don't know any firsthand,

but I spent the last couple of weeks exploring what a #RAG pipeline would look like, so I could write a sample application/pipeline using my #OpenSource #clusterless project

https://github.com/ClusterlessHQ

unfortunately, the idea I had ultimately wasn't suitable for RAG; it could be a simple BERT/BART summarizer pipeline without needing an OpenSearch/Elasticsearch backend or other vector db.

still looking for a fun RAG-based prototype I could build and share.

need to dig into this, but I've been doing replay (redrive) on #aws Step Functions for years with my #data pipelines

https://aws.amazon.com/blogs/big-data/build-efficient-etl-pipelines-with-aws-step-functions-distributed-map-and-redrive-feature/

replay is one feature I haven't added back to #Clusterless yet, though all the metadata is there.

https://github.com/ClusterlessHQ

@c_chep

currently all my #clusterless examples (and the scenario tester) use jsonnet, but it has weak overall support.

CUE looks interesting, but there's no Java implementation for embedding (if that were something I was considering)
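for context, a jsonnet descriptor for this kind of thing might look something like the sketch below. the field names are purely illustrative, not the actual clusterless schema:

```jsonnet
// hypothetical pipeline descriptor; all keys are made up for illustration
local region = "us-east-1";

{
  project: {
    name: "example-pipeline",
    version: "20240101",
  },
  placement: {
    region: region,
    stage: "dev",
  },
}
```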

Tessellate is now on Docker Hub

https://hub.docker.com/r/clusterless/tessellate

Tessellate is a command line tool for reading and writing #data to/from multiple locations and across multiple formats.

#clusterless

Automating #AWS CloudWatch log export into S3 is no simple task.

The next #clusterless release will have a new Component type called Activity, which is simply a scheduled task.

The first Activity will be a function that exports CloudWatch logs created within the previous interval.

As they arrive, any arc can subscribe to the data drop and do things. To simplify that task, I'll update #tessellate.

The CloudWatch log export is a delimited text file with two columns, one of which is JSON. unlike all the others in AWS!
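a minimal sketch of parsing such a line, assuming a tab delimiter between a timestamp column and a JSON payload column (the actual delimiter and layout should be verified against a real export file):

```python
import json

def parse_cw_export_line(line: str):
    """Split an exported log line into (timestamp, parsed JSON payload).

    Assumes two tab-separated columns where the second is a JSON document;
    the real export format may differ.
    """
    timestamp, payload = line.rstrip("\n").split("\t", 1)
    return timestamp, json.loads(payload)

# example line in the assumed two-column format
ts, event = parse_cw_export_line('2024-01-01T00:00:00Z\t{"level": "INFO", "msg": "started"}')
```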

ok, here's a new one for #aws users.

would anyone be interested in an automated way to extract CloudWatch logs (continuously) into an S3 bucket?

and have them converted into #parquet (etc.) for downstream custom processing? or simply partitioned, with partition updates to AWS Athena/Glue?

the challenge for users is getting the `detail` JSON field exposed, since it's app specific.

with #clusterless, devs could then inject custom processing for custom app logs into the #data pipeline
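such an injected step might look something like this sketch, assuming the JSON column carries an EventBridge-style envelope with a `detail` object (real payload shapes will vary per app):

```python
import json

def extract_detail(record: str) -> dict:
    """Pull the app-specific `detail` object out of a JSON log record.

    Assumes an envelope like {"source": ..., "detail": {...}}; a dev's
    injected step would replace this with app-aware parsing.
    """
    envelope = json.loads(record)
    return envelope.get("detail", {})

# example record in the assumed envelope shape
detail = extract_detail('{"source": "my-app", "detail": {"orderId": 42, "status": "shipped"}}')
```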

I finally wrote up some documentation on using #clusterless Tessellate for #data stuff

https://docs.clusterless.io/tessellate/1.0-wip/index.html
