VirusShare 2022 Decompiled
(aka Vi-Sha-Fy Unwrapped)

Another block of 31,536,000 seconds has come to an end, but the malware never stops. Here at VirusShare we checked the graphs and queried the databases to see how our 2022 went.

We started 2022 with a new set of 12 antivirus scanning engines and added one more this past October. These 13 scanners check incoming files and add a new one to the corpus every 0.6 seconds.

The database continued to grow, adding:

12,121,716 new malware samples in 6.5 TB

and

22,650,352 new 'clean' files in 2.6 TB

The entire system uses 86 TB of storage,
hosting 55.8 million malware files,
224.5 million 'clean' files, and
many more in the hopper.

Web crawlers have always been a part of the VirusShare infrastructure, but we dedicated some time to improving them and crawled:

26,361,862 unique URLs (twice)

from all over the internet. The crawlers, along with new code to extract URLs from newly added malware samples, were a big contributor to the volume added in the latter half of the year.

As the year came to a close, we added the @VXShare account to Mastodon and converted the tweets about new malware package releases over to toots. The plan for 2023 is to use Mastodon for these notifications. This new platform also makes for a nice way to tell you all about how the year went in a single post. Many thanks to @jerry for maintaining the infosec.exchange instance.

It is important to point out some of the costs of keeping VirusShare running. In 2022, we spent:

$4,000 on electricity
$5,750 on Internet connectivity
$1,500 on hard drives
$500 on antivirus licensing

The costs were offset by income of <checks notes>:

$0.00

Which brings me to the PBS telethon portion of this year-end post:

We don't have ads.
We don't sell your info.

If you or your company has benefited from this free project, please consider giving back. If you are looking to get more from the site, like increased API calls, daily malware feeds, or support for a special feature you'd like to see, reach out to @VXShare or to @corvus4n6, the company that officially hosts and maintains the project. Improvements and daily maintenance are (still) handled by one human, a bot, and lots of code running in seemingly endless loops.

Best wishes for 2023!

@forensication & Melissa97 the bot.

@VXShare @jerry @corvus4n6 @forensication How do we access "clean" files?
@cucuman Clean files are not downloadable at the moment. They reside in an entirely different area of the system, and there is also the issue of possible copyright complaints.
@VXShare Well, this issue significantly hinders the ability to develop automated malware classification using machine learning: there is no verified benign dataset.

@cucuman What 'clean' means in this case, as well as in VirusTotal's (VT) case, is that a file did not have any detections from any AV vendors. 'Verified clean', to me, only means listed in the NIST NSRL hash lists as known software released by a vendor.

An NSRL set or lookup is doable, but we would still run into the same issue of sharing copyrighted material, which is also the reason NIST will not provide a copy of the software itself.
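A lookup of this kind is simple in principle, since NSRL distributes hash lists rather than the files themselves. A minimal sketch, assuming a hypothetical in-memory hash set (the entries and the SHA-1 choice here are illustrative, not the actual NSRL Reference Data Set format):

```python
import hashlib

# Hypothetical subset of NSRL-style known-good hashes (SHA-1, uppercase hex).
# The real NSRL RDS is published by NIST as hash lists, not as software files.
KNOWN_GOOD_SHA1 = {
    "AAF4C61DDCC5E8A2DABEDE0F3B482CD9AEA9434D",  # example entry only
}

def is_known_good(data: bytes) -> bool:
    """Return True if the file contents hash to a known-good entry."""
    digest = hashlib.sha1(data).hexdigest().upper()
    return digest in KNOWN_GOOD_SHA1

# Usage: hash a file's bytes and check membership in the set.
print(is_known_good(b"hello"))
print(is_known_good(b"not in the list"))
```

This only confirms a hash is on a known-software list; it says nothing about distributing the underlying file, which is where the copyright question above comes in.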

Happy to listen to some thoughts on this from technology/copyright lawyers, but it's a tricky space. 'Clean' samples are downloadable on VT with download access, but VT is also a subsidiary of the 17th largest company in the world.

Does providing access to a known good DLL for research purposes constitute fair use? What about a complete application?

@VXShare
For the models I built, I required anything that can be disassembled.

However, I genuinely don't understand the copyright issue. Most of the "clean" applications are accessible directly from vendors. I also don't understand why known-good samples are not bulk-shared with respected research institutes. What I know is that AI-based models most of the time require a balanced dataset in order to learn significant features; otherwise your model will be quite limited. Respected researchers (like Sophos, Booz Allen Hamilton, etc.) have also described this ground-truth benign dataset problem in their papers, yet the only solutions are building one-to-one relationships with security companies (impossible for individuals) or having a lot of money (lol). It is exactly the same issue as "should we share malware", which haunted us for years. That was solved by brave individuals, but now we have the ground-truth benign problem.

I have built many novel models easily passing 95% accuracy with low FPR, yet these models are limited to very small subsets and cannot be generalized or verified in real life. I'm about to throw them in the trash and switch to anomaly-based models because of this dataset problem, or else completely quit researching this topic.
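For readers less familiar with the two metrics mentioned, accuracy and false-positive rate both fall straight out of the confusion matrix. A minimal sketch with made-up counts (the numbers are illustrative, not from any real model):

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # Fraction of all samples classified correctly.
    return (tp + tn) / (tp + tn + fp + fn)

def false_positive_rate(fp: int, tn: int) -> float:
    # Fraction of benign samples wrongly flagged as malware.
    return fp / (fp + tn)

# Hypothetical run: 90 malware caught, 10 missed, 95 benign passed, 5 flagged.
print(accuracy(tp=90, tn=95, fp=5, fn=10))   # 0.925
print(false_positive_rate(fp=5, tn=95))      # 0.05
```

On a heavily imbalanced benign/malware split, accuracy alone can look great while the FPR on benign files is unacceptable, which is why a verified benign set matters for evaluation.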

@cucuman Copyright laws are from a different, pre-digital time, and in the current US legal system, where everyone is sued for everything and defending yourself against allegations can get expensive, it is a reasonable concern. The vendors would argue that you should purchase from them the software that contains the file you want to examine, and they would also point out the reverse-engineering prohibitions in their licensing agreements.

I am still not a lawyer and this is not legal advice, but looking at Section 107 of the Copyright Act, which defines fair use, it seems possible to work toward making 'clean' files available, because doing so would easily fall under non-profit research and education. Files would typically be used in part, and the effect on the market should be negligible, since everything would be fragmented and not exactly easy to find for someone looking to appropriate a particular software package.

You can read all about fair use and research case law at https://www.copyright.gov/fair-use/. I wonder what @eff or other digital copyright law experts have to say about sharing digital files for research purposes versus copyright law.

If I go down this road, there are changes that will need to be made to the system to enable access to those files and to make it easy for a copyright holder to report and remove their data from the system. I've also had re-scanning on my list for some time to manage false positives/negatives.
