I have two directories that must contain identical files. I can use jdupes or fdupes to find & list the files in these directories that are duplicates. How can I do the opposite & find the files that are different from each other? My original idea was to hash all the files, sort by file name so that lines would only differ by hash, then use diff to pick out the file names with wrong hashes, but I'm not sure how to process the output of diff to get this.
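A sketch of what I mean, with throwaway demo directories standing in for the real ones (sha256sum here, but sha3sum prints the same format):

```shell
# sketch of the idea; dirA/dirB & sha256sum are stand-ins
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/dirA" "$tmp/dirB"
printf 'same\n'      > "$tmp/dirA/ok.txt"
printf 'same\n'      > "$tmp/dirB/ok.txt"
printf 'original\n'  > "$tmp/dirA/bad.txt"
printf 'corrupted\n' > "$tmp/dirB/bad.txt"

hash_tree() {
    # hash every regular file under $1, sorted by file name so the two
    # listings line up & only the hash column can differ
    ( cd "$1" && find . -type f -print0 | xargs -0 sha256sum | sort -k2 )
}

hash_tree "$tmp/dirA" > "$tmp/hashes_A"
hash_tree "$tmp/dirB" > "$tmp/hashes_B"

# diff prefixes changed lines with < or >; field 3 is the file name
# (this simple version breaks on file names containing spaces)
diff "$tmp/hashes_A" "$tmp/hashes_B" | awk '/^[<>]/ { print $3 }' | sort -u
```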

#Linux

@jackemled if you have two lists of hashes and you want to find the different files, use the join command

`join -j1 -v1 -v2 <(sort hashes_A) <(sort hashes_B)`

The -v option says "show unjoinable lines from this file", and you want all the unjoinable lines from both files
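For example, with two toy listings (made-up hashes, and sha256sum-style double-space separation; join splits on runs of blanks by default, so the double spaces are fine):

```shell
# toy listings with made-up hashes; field 1 is the hash, field 2 the path
d=$(mktemp -d)
printf '%s\n' 'aaaa  ./ok.txt' 'beef  ./bad.txt' | sort > "$d/hashes_A"
printf '%s\n' 'aaaa  ./ok.txt' 'dead  ./bad.txt' | sort > "$d/hashes_B"
# -v1 -v2: print lines from each file whose field 1 has no match in the other;
# the matching ok.txt lines pair up and are suppressed
join -j1 -v1 -v2 "$d/hashes_A" "$d/hashes_B"
```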

@fl0und3r Thank you! I didn't know about join. It seems to pick seemingly random lines from the files & declare them unjoinable even though the first field is the same; I'm not sure why. If I join on the second field instead, I expect no output because the second fields are completely identical, & that is what actually happens. The lines it calls unjoinable on the first field, which is the hash, actually have the same hash in both files. At the end it says one line in each file was not in sorted order, which I don't think is true, because both were sorted the same way & have exactly the same contents except for the hashes.

The output is only about a tenth of the lines in the file, which is more than the number of files I know are corrupted or incorrect, but also not the entire file. I'm not sure what makes these lines count as unjoinable despite being completely identical.

@jackemled could it be a delimiter issue? Like if the file uses commas instead of the spaces join assumes?
@fl0und3r I used sha3sum for this & it delimits with double spaces; I think all of the SHA hashing utilities do. I'll see if manually specifying the delimiter helps, but I'm not sure. I also wonder if join is picking up on some binary differences that aren't visible, because I've seen git (in the Kate plugin, at least) be upset about that before.

@fl0und3r Weird

user@host:~$ join -t ' ' -j 1 -v 1 -v 2 copy2 original2
join: multi-character tab ‘ ’

So for some reason this makes it output nothing, as if there were no differences, which I know is not true because the files have different hashes & real differences show up in diff. Maybe join just doesn't like multi-character delimiters. I get the same result when using './', which is what every file path starts with, so I think that's it.

I wonder if it would be easier to just hash the directories again & compare one pair of files at a time. I know that diff can be used on directories, but I don't think it works like that. Some people have suggested doing a dry run with rsync. I don't think rsync dry runs work like that either, but I'll try it anyway when I'm back at the computer.
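Something like this is what I mean by one pair at a time; dirA/dirB stand in for the real directories (demo files here), & it assumes both trees contain the same file names:

```shell
# compare the two copies of each file directly, one pair at a time
set -e
tmp=$(mktemp -d)
mkdir -p "$tmp/dirA" "$tmp/dirB"
printf 'same\n'      > "$tmp/dirA/ok.txt"
printf 'same\n'      > "$tmp/dirB/ok.txt"
printf 'original\n'  > "$tmp/dirA/bad.txt"
printf 'corrupted\n' > "$tmp/dirB/bad.txt"

( cd "$tmp/dirA" && find . -type f ) | while IFS= read -r f; do
    # cmp -s compares byte by byte & stays quiet; print files that differ
    cmp -s "$tmp/dirA/$f" "$tmp/dirB/$f" || echo "$f"
done
```

Though I think `diff -rq dirA dirB` does roughly the same thing in one go, listing the files whose contents differ.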

@fl0und3r rsync of course said after the dry run that it would not have made any changes, because it believes the files it corrupted itself are correct & unmodified.
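I'll also try forcing checksums, in case rsync's default quick check is only looking at size & mtime. Demo of what I mean, with placeholder directories set up to have the same size & mtime but different bytes:

```shell
# same size, same mtime, different bytes -- the case a size-&-mtime
# quick check would miss; dirA/dirB are placeholders
t=$(mktemp -d)
mkdir -p "$t/dirA" "$t/dirB"
printf 'aaaa\n' > "$t/dirA/f"
printf 'bbbb\n' > "$t/dirB/f"
touch -r "$t/dirA/f" "$t/dirB/f"   # give both copies the same mtime

# default quick check: f matches on size & mtime, so it isn't itemized
rsync --dry-run --recursive --times --itemize-changes "$t/dirA/" "$t/dirB/"
# -c/--checksum: reads the contents & flags f as different
rsync --dry-run --recursive --times --checksum --itemize-changes "$t/dirA/" "$t/dirB/"
```

If that's what's happening, it would explain why the dry run saw nothing to change.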

I wonder if hashing is somehow including the filesystem's metadata about the file, because there were some unavoidable changes to filesystem metadata when copying, which I had to do because some of NTFS's metadata is incompatible with tar & was causing it to abort. Could sparse files cause a difference? Maybe a long run of zeros that's been truncated could be read directly instead of first being expanded back to the original file.

@jackemled unfortunately I think you're well beyond my understanding of NTFS😅. I'd think sha256sum would get all the bytes regardless of what the file system is doing (and none of the metadata), but I've been surprised before
@fl0und3r That's what I would think. This is really weird, because doesn't rsync use hashes to quickly tell whether two files are different before scanning each block of each one until it finds the difference? I wonder why sha3sum would give different hashes but rsync would see the files as identical, unless I'm astronomically unlucky & every single wrong file has corrupted in the perfect way to collide in whatever hash rsync uses by default but not in SHA3. I'm very unlucky, but I don't think I'm that unlucky!