I have two directories that must contain identical files. I can use jdupes or fdupes to find & list files between these directories that are duplicates. How can I do the opposite of this & find files that are different from each other? My original idea for this was to hash all files, sort by file name so lines would only differ by hash, then use diff to pick out file names with wrong hashes, but I'm not sure how to process the output of diff to get this.
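The idea above can be sketched roughly like this. This is a hedged sketch, not the thread's final answer: it assumes sha256sum, toy directory names, and file names without spaces. Putting the file name first makes both lists sort identically, so `comm -3` (print lines unique to either input) does the "opposite of fdupes" part without any diff post-processing:

```shell
# Emit "name hash" per file so both lists sort the same way by name;
# comm -3 then prints lines unique to either list, i.e. files whose
# hashes differ or that exist on only one side.
hashdir() {
    (cd "$1" && find . -type f -exec sha256sum {} + \
        | awk '{print $2, $1}' | sort)
}

# demo with two throwaway directories
mkdir -p dirA dirB
printf 'same\n' > dirA/keep.txt;   printf 'same\n' > dirB/keep.txt
printf 'one\n'  > dirA/broken.txt; printf 'two\n'  > dirB/broken.txt
comm -3 <(hashdir dirA) <(hashdir dirB)   # only ./broken.txt lines appear
```

The `awk '{print $2, $1}'` field swap is what breaks on file names containing spaces; with clean names it's fine.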

#Linux

@jackemled if you have two lists of hashes and you want to find the different files, use the join command

`join -j1 -v1 -v2 <(sort hashes_A) <(sort hashes_B)`

The -v option says "show unjoinable lines from this file", and you want all the unjoinable lines from both files.
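A toy run shows what -v does (made-up two-field lines standing in for hash + name; both inputs already sorted on field 1):

```shell
# A and B share hash "aaa"; each also has a hash the other lacks.
printf 'aaa f1\nbbb f2\n' > A
printf 'aaa f1\nccc f2\n' > B
join -j1 -v1 -v2 A B
# bbb f2
# ccc f2
```

The paired "aaa" lines are suppressed; only the unpairable lines from each side come out.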

@fl0und3r Thank you! I didn't know about join. This seems to pick random lines from the files & declare them unjoinable even though the first field is the same; I'm not sure why. If I join on the second field instead I expect no output, because the second fields are completely identical, & that is what actually happens. The lines it says are unjoinable on the first field, which is the hash, actually have the same hash in both files. At the end it says one line in each file was not in sorted order, which I don't think is true because both were sorted the same way & have the exact same contents except for the hashes.

The output is only a tenth of the lines in the file, which is more than the number of files I know are corrupted or incorrect, but also not the entire file. I'm not sure what about these lines makes them be counted as unjoinable despite being completely identical.
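One hedged guess at the "not in sorted order" mystery (an assumption, not something the thread confirms): if sort and join run under different locale collation rules, join can see the input as unsorted and mis-pair lines. Forcing bytewise order for both tools is a common fix; the toy lists here stand in for the real hash files:

```shell
# Toy hash lists; the real point is LC_ALL=C on *both* sort and join,
# so the two tools agree on what "sorted" means.
printf 'aaa ./f1\nbbb ./f2\n' > A.raw
printf 'aaa ./f1\nddd ./f2\n' > B.raw
LC_ALL=C sort A.raw > A.sorted
LC_ALL=C sort B.raw > B.sorted
LC_ALL=C join -j1 -v1 -v2 A.sorted B.sorted
# bbb ./f2
# ddd ./f2
```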

@jackemled could it be a delimiter issue? Like if the file uses commas instead of spaces (like join assumes)?
@fl0und3r I used sha3sum for this & it delimits with double spaces. I think all of the SHA file hashing utilities use double spaces. I'll see if manually specifying it helps, but I'm not sure. I wonder if there are some binary differences that aren't visible, because I've seen git (in the Kate plugin at least) be upset about that before, that join is picking up on.
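For what it's worth, join's default field splitting already treats a run of blanks as a single separator, so the double space that the sha*sum tools emit shouldn't need -t at all. A quick check with a made-up hash and file name:

```shell
# One line with a double-space separator, joined with itself;
# -o 1.2 prints field 2 of file 1, proving the split worked.
printf 'abc  ./file1\n' > demo
join -j1 -o 1.2 demo demo
# ./file1
```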

@fl0und3r Weird

```
user@host:~$ join -t ' ' -j 1 -v 1 -v 2 copy2 original2
join: multi-character tab ‘ ’
```

So for some reason this makes it output nothing, as if there are no differences, which I know is not true because the files have different hashes & real differences are shown in diff. Maybe join just doesn't like multicharacter delimiters. I get the same result when using `./`, which is what every file path starts with, so that's what I think is happening.
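Since -t only accepts a single character, one way around it is to skip -t entirely: pair the two lists on the file name instead, then keep rows where the two hashes disagree. A sketch with toy lists standing in for copy2/original2 (assumes file names without spaces):

```shell
# Two hash lists with one mismatching file.
printf 'aaa  ./f1\nbbb  ./f2\n' > listA
printf 'aaa  ./f1\nccc  ./f2\n' > listB
# Join on field 2 (the name); output is "name hashA hashB",
# so awk keeps only names whose two hashes differ.
join -1 2 -2 2 <(sort -k2 listA) <(sort -k2 listB) \
    | awk '$2 != $3 {print $1}'
# ./f2
```

Because every file should exist in both lists, joining on the name pairs every line, and the hash comparison happens in awk rather than in join.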

I wonder if it would be easier to just hash the directories again & compare one pair of files at a time. I know that diff can be used on directories, but I don't think this is how it works. Some people have suggested doing a dry run with rsync. I don't think rsync dry runs work like that but I'll try it anyway when I'm back at the computer.
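For the record, both of those suggestions do fit this job: `diff -rq` walks two directory trees and names each file that differs, and an rsync checksum dry run lists files whose contents differ without copying anything. A sketch with throwaway directories (names are placeholders):

```shell
mkdir -p dirC dirD
printf 'one\n' > dirC/f.txt
printf 'two\n' > dirD/f.txt
# -r recurse, -q only report *which* files differ, not how
diff -rq dirC dirD || true   # "Files dirC/f.txt and dirD/f.txt differ"
# -c compare by checksum, -n dry run, %n prints each differing name
command -v rsync >/dev/null && rsync -rcn --out-format='%n' dirC/ dirD/
```

The `|| true` is only there because diff exits nonzero when it finds differences.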

@jackemled join def does not like multi-character delimiters. You could use `cat -A` to rule out any non-printable characters in the file. I tested this using md5sum but I can retry with sha256sum once I'm back at my home computer
@fl0und3r These files are almost 250,000 lines long, so manually checking them for non-printable characters is hard even with them displayed. I could do that & then replace those characters with sed, but that might catch innocent-bystander sequences of characters that just happen to match. I don't think it should matter though, since both files should be the same after that. I don't know all of the characters & their representations this way though, so I'm not sure I could actually get them all.
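Rather than eyeballing 250,000 lines, grep can hunt for non-printable bytes automatically. Under `LC_ALL=C` the `[:print:]` class means printable ASCII, so anything outside it gets flagged with its line number; no output means the file is clean. A small demo with a planted UTF-8 byte pair:

```shell
# Line 2 contains a non-ASCII character (é, bytes 0xC3 0xA9 in UTF-8).
printf 'clean line\nbad \303\251 line\n' > sample
LC_ALL=C grep -n '[^[:print:][:space:]]' sample
# 2:bad ... line
```

This finds the lines without editing anything, so there's no risk of sed catching innocent bystanders.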
@jackemled what does `grep` say about the files? Or, heck, `file`? Surely grep would complain if it's binary?
@fl0und3r They are text files. Some have differences in their bytes though, like when one character can be represented multiple ways for text-encoding reasons. These two files should be the same encoding though.
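A concrete instance of that "same text, different bytes" case: é can be encoded as one code point (NFC, bytes 0xC3 0xA9) or as e plus a combining accent (NFD, bytes 0x65 0xCC 0x81). The two render identically but hash differently:

```shell
printf '\303\251\n'  > nfc.txt   # U+00E9, precomposed é
printf 'e\314\201\n' > nfd.txt   # U+0065 + U+0301 combining accent
cmp -s nfc.txt nfd.txt || echo 'bytes differ'
sha256sum nfc.txt nfd.txt        # two different hashes
```

So two "identical" text files can legitimately fail a hash comparison even when every character looks the same on screen.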