Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text
103M documents containing 585M images interleaved with 43B English tokens
GitHub - allenai/mmc4: MultimodalC4 is a multimodal extension of c4 that interleaves millions of images with text.