Gijs Hendriksen is presenting our work on "remote querying", which provides access to huge Web resources through de facto standard tech: Parquet files in S3, queried with #DuckDB, to facilitate IR research at very acceptable latencies.
Run your ClueWeb experiment in 10 minutes or so, and repeat your experiments on recent Web data from the #openwebsearcheu Open Web Index.
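
For anyone curious what "remote querying" might look like in practice, here is a minimal sketch using DuckDB's Python client and its httpfs extension. The bucket path, the `url` and `title` column names, and both helper functions are hypothetical, chosen only for illustration. They are not the actual layout of ClueWeb or the Open Web Index, and a real bucket may also need S3 region or credential settings.

```python
def build_query(parquet_url: str, limit: int = 10) -> str:
    """Build a DuckDB query that scans remote Parquet files in place.

    The column names (url, title) are assumptions for illustration only.
    """
    # Escape single quotes so the URL is safe inside a SQL string literal.
    safe_url = parquet_url.replace("'", "''")
    return (
        f"SELECT url, title FROM read_parquet('{safe_url}') "
        f"WHERE title ILIKE ? LIMIT {int(limit)}"
    )


def run_remote_query(parquet_url: str, term: str, limit: int = 10):
    """Run the query against S3 without downloading the dataset first."""
    import duckdb  # third-party: pip install duckdb

    con = duckdb.connect()          # in-memory database
    con.execute("INSTALL httpfs")   # extension that adds s3:// and https:// support
    con.execute("LOAD httpfs")
    # The search term is passed as a bound parameter, not interpolated.
    return con.execute(build_query(parquet_url, limit), [f"%{term}%"]).fetchall()


if __name__ == "__main__":
    # Hypothetical bucket path; replace with a real dataset location.
    print(build_query("s3://example-bucket/index/*.parquet", limit=5))
```

Because Parquet is columnar, a query like this should only need to fetch the column chunks it touches, which is plausibly where the acceptable latencies come from.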
