Mastodawn

cody Apr 15, 2017

I need an Append only database. Okay...

cody Apr 15, 2017

Orange website post from 738 days ago tells me Kafka is a high speed append only RDBMS. I have doubts.

iximeow Apr 15, 2017

@valarauca1 uhhh... i've only used kafka as a message passing system. i guess you could use it as a relational database? it's pretty neat though, definitely heard of it used under high throughput

cody Apr 15, 2017

@iximeow I'm trying to store something like 140 million rows of data for analysis and I seem to have very few options.

cody Apr 15, 2017

@iximeow okay that's one scheme. There are a few others. It is actually only some ~10GiB of data. I'm thinking I just need to host it all in RAM and write my analysis tool.

iximeow Apr 15, 2017

@valarauca1 if you don't mind me asking, what are you analysis'ing?

cody

@iximeow ALL of github

iximeow Apr 15, 2017

@valarauca1 did you grab the dataset from google's github query thingie? i need to scrape github still (and codeplex, before it goes down..)

cody Apr 15, 2017

@iximeow yeah reducing the BigQuery dataset took about 4TB of cloud credits.

cody Apr 15, 2017

@iximeow (well not all) repoIDs, userIDs, and timestamps.

Everything is pretty much a uint64_t
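
For illustration only, here is a minimal sketch of the "host it all in RAM" idea for rows made entirely of 64-bit unsigned integers. The three-field layout (repo ID, user ID, timestamp) and the names `AppendOnlyTable` and `estimated_bytes` are assumptions made up for this example, not anything the thread specifies, and the size estimate applies only to that assumed layout.

```python
from array import array

# Hypothetical row layout: (repo_id, user_id, timestamp), each a uint64.
FIELDS = 3


class AppendOnlyTable:
    """Append-only table of fixed-width rows packed into one flat array."""

    def __init__(self):
        # Typecode 'Q' is an unsigned integer of at least 8 bytes;
        # it is exactly 8 bytes on common platforms.
        self._data = array("Q")

    def append(self, repo_id, user_id, timestamp):
        # Rows are only ever added at the end, never updated or deleted.
        self._data.extend((repo_id, user_id, timestamp))

    def __len__(self):
        return len(self._data) // FIELDS

    def row(self, i):
        if not 0 <= i < len(self):
            raise IndexError("row index out of range")
        base = i * FIELDS
        return tuple(self._data[base:base + FIELDS])

    def nbytes(self):
        # Bytes used by the packed values, excluding Python object overhead.
        return len(self._data) * self._data.itemsize


def estimated_bytes(rows, fields=FIELDS, width=8):
    """Raw size of `rows` rows of `fields` values, `width` bytes each."""
    return rows * fields * width
```

Under these assumptions, `estimated_bytes(140_000_000)` gives the raw footprint of 140 million three-field rows; a wider row or any index built on top would increase it.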

iximeow Apr 15, 2017

@valarauca1 aaahh. i need all the header files of all of the things :( fairly worried about impending size constraints

cody Apr 15, 2017

@iximeow grepping across all file contents in BigQuery is only ~3TiB, <$40. They give you a data size estimate when you type in a valid query, which you can throw into the cost calculator.

cody Apr 15, 2017

@iximeow BQ charges for network transfer, not in place filtering/joining/hashing. But its computation is STUPID SLOW. Doing a 1000 item cross join will take >20 hours.