OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Presents an open web-scale filtered dataset of interleaved image-text documents comprising 141M web pages extracted from Common Crawl, 353M associated images, and 115B text tokens.
Aran Komatsuzaki on Twitter


