One question for #webarchiving experts: I've identified 30 WARCs in our collections that contain the highest number of records declared as a certain MIME type I'm interested in.

I've requested the extraction of these AIPs, and hopefully I'll soon have access to the 30 gzipped WARCs. Then I'll have to extract from these the records that have the said MIME type in the HTTP header, and turn them into stand-alone files. How would you perform these operations?

#digipres

@BertrandCaron I don't know about these types, but extracting the files of interest doesn't sound like it would take more than a short program to achieve.
@aarbrk @BertrandCaron yeah I'd end up writing a Python script using warcio to stream and filter records to a new file. Probably.
@anj @aarbrk OK, thank you Andy! I have no experience with WARCs, so I looked in #COPTR and only warctools is mentioned. Maybe warcio should be added!
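
For anyone finding this thread later, here is a minimal sketch of the warcio approach mentioned above. It is not Andy's actual script. The names mime_matches and filter_warc are made up for illustration, and it assumes only "response" records matter and that the MIME type should be compared without parameters such as charset.

```python
def mime_matches(content_type, wanted):
    """Return True if a Content-Type header value names the wanted MIME type.

    Parameters such as "; charset=utf-8" are ignored and the comparison is
    case-insensitive. (Hypothetical helper, not part of warcio.)
    """
    if not content_type:
        return False
    return content_type.split(";", 1)[0].strip().lower() == wanted.lower()


def filter_warc(src_path, dst_path, wanted):
    """Copy response records whose HTTP Content-Type matches `wanted`.

    Returns the number of records written to dst_path.
    """
    # warcio is a third-party package (pip install warcio). It is imported
    # here so that mime_matches() stays usable without it.
    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    kept = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        writer = WARCWriter(dst, gzip=True)
        # ArchiveIterator reads gzipped and plain WARCs alike.
        for record in ArchiveIterator(src):
            # Skip request, metadata, etc., and records without HTTP headers.
            if record.rec_type != "response" or record.http_headers is None:
                continue
            ctype = record.http_headers.get_header("Content-Type")
            if mime_matches(ctype, wanted):
                writer.write_record(record)
                kept += 1
    return kept


# Example call (file names and MIME type are placeholders):
# filter_warc("input.warc.gz", "filtered.warc.gz", "application/pdf")
```

To get stand-alone files instead of a filtered WARC, the same loop could write record.content_stream().read() to one file per record, probably named after the WARC-Target-URI or a running counter.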

@BertrandCaron @anj @aarbrk

Y, what @anj said. I'm more on the Java side, so I'd use netpreserve's jwarc.

That said, you might give Tika a try:

java -jar tika-app-3.0.0.jar -z my-file.warc

@tallison @anj @aarbrk
Oh, nice!

Thank you Tim, that's very good to know!!!