Mastodawn

tor sparql queen ✅ Apr 12, 2017

does anyone have a good example / bit of code i can look over for using spark + preferably python to iterate over a large number of HTTP calls?
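
A minimal sketch of the kind of example being asked for, not a tested harvester. It assumes PySpark is installed and that the URLs are already known; `http_get`, `fetch_partition`, `run_with_spark`, the app name, and the one-second delay are all hypothetical choices made for illustration. The idea is to split the URL list into partitions and have each Spark task walk its partition sequentially, pausing between calls so the remote service is not hammered.

```python
import time
import urllib.request
from typing import Callable, Iterable, Iterator, Tuple


def http_get(url: str, timeout: float = 30.0) -> bytes:
    """Fetch one URL with the standard library and return the raw body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_partition(
    urls: Iterable[str],
    fetch: Callable[[str], bytes] = http_get,
    delay: float = 1.0,  # assumed politeness delay, tune to the service's terms
) -> Iterator[Tuple[str, bytes, str]]:
    """Fetch every URL in one partition, one at a time.

    Yields (url, body, error). error is "" on success, so a single
    failed request does not abort the rest of the partition.
    """
    for url in urls:
        try:
            body, err = fetch(url), ""
        except Exception as exc:  # record the failure so it can be retried later
            body, err = b"", repr(exc)
        yield (url, body, err)
        time.sleep(delay)


def run_with_spark(urls, partitions=4):
    """Spread the URL list over `partitions` Spark tasks (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily: optional dependency

    spark = SparkSession.builder.appName("marc-harvest").getOrCreate()
    rdd = spark.sparkContext.parallelize(urls, numSlices=partitions)
    # collect() is fine for a small trial run; a full harvest would more
    # likely write results out from the executors instead.
    return rdd.mapPartitions(fetch_partition).collect()
```

The number of partitions times the inverse of the delay bounds the overall request rate, so both would need to be set against whatever limits the service imposes.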

use @[email protected] Apr 12, 2017

@cm_harlow are you doing some sort of massive crawl?

tor sparql queen ✅ Apr 12, 2017

@ekansa ... yeah :-/ i'm trying to get a bunch of MARC records outta LoC via 4 diff api routes

use @[email protected] Apr 12, 2017

@cm_harlow And their terms of service are OK with that?

Sorry, I've never done anything with Spark. Sounds like you're doing a huge job though. I wonder if the LoC could just give you a big data dump?

tor sparql queen ✅ Apr 12, 2017

@ekansa terms of service for 1 of the services have a query time limit, others are open.

their data dumps are not the representation i need, a lossy representation at that, and nearly a year out of date.

and they're not willing to generate new data dumps for me at this moment; instead they send me to the various request options.

(but i could pay a vendor for what i need)

use @[email protected] Apr 12, 2017

@cm_harlow Wow! Sheesh.

So are you doing some sort of analysis on all these MARC records? Or is this to build up your own data for retrieval services? Or something else?

Sorry, I can't help but I'm super intrigued by the scale of your project!

tor sparql queen ✅ Apr 12, 2017

@ekansa heh, nw. i'm looking to retrieve a full dump of the Authorities to then serve up + manage in a few different ways - as a Git repo, as a ResourceSync source, as an IPFS repo.

tor sparql queen ✅ Apr 12, 2017

@ekansa + perform a more granular conversion to RDF, enhance with reconciliation, + serve/publish in same mechanisms.

tor sparql queen ✅ Apr 12, 2017

@ekansa the management aspect means checking on an Atom feed + other spots for notification of records added or updated, then pulling in a less intensive fashion. But I need to get over the hump of a preliminary pull of all the data.
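
A rough sketch of the "check an Atom feed for updates" step, using only the standard library. It assumes the feed follows the Atom format (RFC 4287), with an `updated` timestamp and an `id` on each entry; `changed_since` and `parse_ts` are hypothetical helper names, and the feed itself would have to be fetched separately.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace


def parse_ts(text: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 2017-04-12T08:30:00Z."""
    # Older Pythons' fromisoformat() rejects a trailing "Z", so normalise it.
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def changed_since(feed_xml: str, last_seen: datetime):
    """Return (entry id, updated time) for entries newer than last_seen."""
    root = ET.fromstring(feed_xml)
    changed = []
    for entry in root.iter(ATOM + "entry"):
        updated = parse_ts(entry.findtext(ATOM + "updated"))
        if updated > last_seen:
            changed.append((entry.findtext(ATOM + "id"), updated))
    return changed


if __name__ == "__main__":
    # Made-up sample feed, only to show the call shape.
    sample = """<feed xmlns="http://www.w3.org/2005/Atom">
      <entry><id>urn:example:1</id><updated>2017-04-10T00:00:00Z</updated></entry>
      <entry><id>urn:example:2</id><updated>2017-04-12T08:30:00Z</updated></entry>
    </feed>"""
    cutoff = datetime(2017, 4, 11, tzinfo=timezone.utc)
    for entry_id, when in changed_since(sample, cutoff):
        print(entry_id, when.isoformat())
```

Storing the newest `updated` value after each run and passing it back in as `last_seen` would keep later pulls limited to records that actually changed.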

use @[email protected] Apr 12, 2017

@cm_harlow Cool use of Atom, an old but great kind of API.

tor sparql queen ✅ Apr 12, 2017

@ekansa for sure.

i'm basically doing whatever I can to get these datasets better published + shared. I want to explore what forking of large auth. datasets could look like but can't wait on LoC to move towards something other than Voyager / Z39.50 / SRU / MarkLogic for their data services

use @[email protected] Apr 12, 2017

@cm_harlow Interesting. I've had trouble with really big Git repos before (memory issues), but you can probably divide into a number of smaller repos.
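
One possible way to divide records across smaller repos, sketched under assumptions: each record has a stable string identifier, and the repo count and naming scheme are fixed up front. `repo_for`, the `authorities` prefix, and the count of 16 are hypothetical.

```python
import hashlib


def repo_for(record_id: str, repo_count: int = 16, prefix: str = "authorities") -> str:
    """Map a record identifier to one of `repo_count` repository names.

    Uses SHA-1 rather than the built-in hash(), which is salted per
    process and would send the same record to different repos on each run.
    """
    digest = hashlib.sha1(record_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % repo_count
    return f"{prefix}-{bucket:02d}"
```

Changing `repo_count` later would reshuffle most records, so the count is worth settling before the first full pull.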

tor sparql queen ✅ Apr 12, 2017

@ekansa yeah - a la who's on first

tor sparql queen ✅ Apr 12, 2017

@ekansa i like what WOF is doing, find their kinda complete separation from other efforts in terms of data models a bit disconcerting though tbh

use @[email protected]

@cm_harlow what is WOF?

tor sparql queen ✅ Apr 12, 2017

@ekansa sorry, who's on first - https://whosonfirst.mapzen.com/