I have two directories that must contain identical files. I can use jdupes or fdupes to find & list files that are duplicated between these directories. How can I do the opposite of this & find files that differ from each other? My original idea was to hash all the files, sort by file name so lines would only differ by hash, then use diff to pick out the file names with wrong hashes, but I'm not sure how to process the output of diff to get this.

#Linux
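A minimal sketch of the hash-then-diff idea from the original post, assuming two trees with identical relative paths (all names & contents below are invented for the example; sha256sum stands in for sha3sum, the pipeline is the same either way):

```shell
# make a tiny test fixture: one matching file, one mismatching file
mkdir -p dir_a dir_b
printf 'same\n' > dir_a/ok.txt;  printf 'same\n'      > dir_b/ok.txt
printf 'good\n' > dir_a/bad.txt; printf 'corrupted\n' > dir_b/bad.txt

# hash every file, then sort by the path field so lines pair up
(cd dir_a && find . -type f -exec sha256sum {} + | sort -k2) > hashes_a
(cd dir_b && find . -type f -exec sha256sum {} + | sort -k2) > hashes_b

# diff marks changed lines with < or >; keep just the path field
diff hashes_a hashes_b | awk '/^[<>]/ {print $3}' | sort -u
# prints ./bad.txt
```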

I've been trying to create an archive of some old family photos, but the NTFS partition they're on is causing issues with some metadata values being too large for any archiving utility to store, & they all refuse to add those files. So I copied everything to a BTRFS partition with rsync (which also complained about the same files but still copied them), & then I was able to actually archive the files. I'm double-checking everything though, & found that some files did not actually copy correctly, despite rsync supposedly verifying the files to ensure a correct copy. I don't know which files though, just that there are over 300 of them. I want to retry just those files, because rsync seems to update timestamps on all files even if I tell it not to, & I want to avoid that. Also because the copying takes three hours.
@jackemled yeah, hashing all the files and comparing hashes is how I've done this kind of thing in the past. You could do a sort and uniq on a full list of all hashes to get a list of hashes that have changed.
@chrisbier Is that the same as cat FILES… | sort -u? That might work but would have duplicate lines for each different file. I guess I could do it twice, sorting out unique lines by the second key instead of the whole line.
@chrisbier No, it includes correct files too. It keeps the first of each duplicate line instead of removing all instances.
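For what it's worth, `uniq -u` (keep only lines that occur exactly *once*) may be the variant that does what chrisbier describes. A hedged sketch with made-up hashes & names:

```shell
# two hash lists that agree on one file & disagree on the other
printf 'aaaa  ./ok.jpg\nbbbb  ./bad.jpg\n' > hashes_a
printf 'aaaa  ./ok.jpg\ncccc  ./bad.jpg\n' > hashes_b

# matching hash+name lines appear twice & get dropped by uniq -u;
# a changed file survives as two singleton lines (one per list),
# so sort -u at the end collapses it to a single path
cat hashes_a hashes_b | sort | uniq -u | awk '{print $2}' | sort -u
# prints ./bad.jpg
```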
I *think* I've done this in the past by putting rsync in some dry-run mode where it only reports differences and doesn't do any writing.
@spinda Yeah, but that would list all files, not just the ones that are different. I would still have to manually extract the list anyway & I'm trying to avoid that because it's a task I have to do a lot. I could probably do it with just AWK, but AWK is a lot to learn.

@jackemled if you have two lists of hashes and you want to find the different files, use the join command

`join -j1 -v1 -v2 <(sort hashes_A) <(sort hashes_B)`

The v option says "show unjoinable things in this file" and you want all the unjoinable stuff in both files
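A tiny worked example of the -v behaviour, with invented hashes (note both inputs have to be sorted on the join field):

```shell
# two pre-sorted hash lists: same hash for one file, different for the other
printf 'aaaa ./same.jpg\nbbbb ./broken.jpg\n' | sort > hashes_A
printf 'aaaa ./same.jpg\ncccc ./broken.jpg\n' | sort > hashes_B

# -j1 joins on the hash; -v1 -v2 prints the unpairable lines from each side
join -j1 -v1 -v2 hashes_A hashes_B
```

The aaaa lines pair up & disappear; only the two ./broken.jpg entries come out, one from each file.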

@fl0und3r Thank you! I didn't know about join. This seems to pick random lines from the files & declare them unjoinable even though the first field is the same. I'm not sure why. If I pick the second field instead I expect there to be no output because the second fields are completely identical, & that is what actually happens. The lines it says are unjoinable on the first field, which is the hash, actually have the same hash in both files. At the end it says one line in each file was not in sorted order, which I don't think is true because both were sorted the same way & have the same exact contents except for hashes.

The output is only a tenth of the lines in the file, which is more than I know are corrupted or incorrect files, but also not the entire file. I'm not sure what about these lines makes them be counted as unjoinable despite being completely identical.

@jackemled could it be a delimiter issue? Like if the file uses commas instead of spaces (like join assumes)?
@fl0und3r I used sha3sum for this & it delimits with double spaces. I think all of the SHA file hashing utilities use double spaces. I'll see if manually specifying it helps, but I'm not sure. I wonder if there are some binary differences that aren't visible, because I've seen git (in the Kate plugin at least) be upset about that before, that join is picking up on.

@fl0und3r Weird

user@host:~$ join -t ' ' -j 1 -v 1 -v 2 copy2 original2
join: multi-character tab ‘ ’

So for some reason this makes it output nothing, as if there are no differences, which I know is not true because the files have different hashes & real differences are shown in diff. Maybe join just doesn't like multicharacter delimiters. I get the same result when using `./`, which is what every file path starts with, so that's my guess.
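One thing that might help here: -t only accepts a single character, & with no -t at all, join splits on any run of blanks, so the sha3sum double space may not need special handling in the first place. If a fixed delimiter is still wanted, squeezing it to one space first is a hedged workaround (data below is invented):

```shell
# hash lists with the two-space delimiter the sha*sum tools emit
printf 'aaaa  ./ok.jpg\nbbbb  ./bad.jpg\n' | sort > copy2
printf 'aaaa  ./ok.jpg\ncccc  ./bad.jpg\n' | sort > original2

# no -t: runs of blanks are treated as a single separator
join -j1 -v1 -v2 copy2 original2

# equivalent, with the delimiter squeezed to one space via tr -s
join -j1 -v1 -v2 <(tr -s ' ' < copy2) <(tr -s ' ' < original2)
```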

I wonder if it would be easier to just hash the directories again & compare one pair of files at a time. I know that diff can be used on directories, but I don't think this is how it works. Some people have suggested doing a dry run with rsync. I don't think rsync dry runs work like that but I'll try it anyway when I'm back at the computer.
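For the record, `diff -qr` does compare two directory trees file by file & prints one line per differing pair. A cmp loop gives the same answer as bare paths, sketched here with invented directories & contents:

```shell
# fixture: one matching file, one corrupted copy
mkdir -p original copy
printf 'same\n' > original/ok.jpg;  printf 'same\n' > copy/ok.jpg
printf 'good\n' > original/bad.jpg; printf 'oops\n' > copy/bad.jpg

# cmp -s is silent & exits nonzero on the first differing byte,
# so only mismatched paths get printed
(cd copy && find . -type f -print0) |
  while IFS= read -r -d '' f; do
    cmp -s "copy/$f" "original/$f" || printf '%s\n' "$f"
  done | sort
# prints ./bad.jpg
```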

@jackemled join def does not like multi-character delimiters. You could use `cat -A` to rule out any non-printable characters in the file. I tested this using md5sum but I can retry with sha256 once I'm back at my home computer
@fl0und3r These files are almost 250,000 lines long, so manually checking them for nonprintable characters is hard even with them displayed. I could do that & then replace those characters with sed, but that might catch innocent-bystander character sequences that just happen to match. I don't think it should matter though, since both files should be the same after that. I don't know all of the characters & the way they're represented though, so I'm not sure I could actually get them all.
@jackemled what does `grep` say about the files? Or, heck, `file`? Surely grep would complain if it's binary?
@fl0und3r They are text files. Some have differences in their binary though, like when one character can be represented multiple ways for text encoding reasons. These two files should be the same encoding though.

@fl0und3r rsync of course said it would not have made any changes after the dry run, because it believes the files that it corrupted itself to be correct & unmodified.

I wonder if hashing is somehow including the filesystem's metadata about the file, because there were some unavoidable changes to filesystem metadata when copying, which I had to do because of some of NTFS's metadata being incompatible with tar & causing it to abort. Could sparse files cause a difference? Maybe a long run of zeros being truncated could be read directly instead of first being expanded to the original file when it's read.

@jackemled unfortunately I think you're well beyond my understanding of NTFS 😅. I'd think sha256sum would get all the bytes regardless of what the file system is doing (and none of the metadata), but I've been surprised before
@fl0und3r That's what I would think. This is really weird, because doesn't rsync use hashes to quickly tell if two files are different or not before then scanning each block of each one until it finds the difference? I wonder why sha3sum would give different hashes but rsync would see the files as identical, unless I'm astronomically unlucky & every single wrong file has corrupted in the perfect way to hash collide in whatever rsync uses by default but not in SHA3. I'm very unlucky, but I don't think I'm that unlucky!
@jackemled couldn't you use rsync's output using dry run? Not in front of computer atm.