okay it's a 54-byte header.
so PAC is a lazy TAR clone
I just need to write a script to decode it. but my brain isn't working now
the weird thing is that the text file suggests the PAC files contain filenames, but I don't see them. Now, there IS a stretch of bytes that could be a filename, but I can't seem to decode it as anything sensible:
B3 A5 A3 B2 A5 B4 6E B0 A1 A3

it does decode as shift-jis (which the text file was encoded as) but turns into:
ウ・」イ・エー。」

which I don't think makes any sense

and if you decode it as utf-16, the most reasonable encoding for windows computers at the time, you end up with ꖳ늣뒥끮ꎡ, which makes even less sense.
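Both of those decode attempts can be reproduced with something like the sketch below. It assumes little-endian UTF-16 (`utf-16-le`), which is the byte order Windows uses, and passes `errors="replace"` only so the Shift-JIS attempt cannot raise.

```python
raw = bytes.fromhex("B3 A5 A3 B2 A5 B4 6E B0 A1 A3")

# Shift-JIS reads most of these bytes as single-byte katakana/punctuation.
as_sjis = raw.decode("shift_jis", errors="replace")

# UTF-16 pairs the bytes up, so 10 bytes become 5 code units.
as_utf16 = raw.decode("utf-16-le")

print(as_sjis)
print(as_utf16)
print([hex(ord(c)) for c in as_utf16])
```

Printing the code points makes it easier to see which Unicode blocks the UTF-16 reading lands in.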

I'm pretty sure they didn't name the files in their Azumanga Daioh game in a mix of Mande, Korean, and Sino-Tibetan scripts

but by matching up the filenames with the text file (azmem.txt) and what subfiles are definitely inside azending.pac, that pile of gibberish is supposed to mean "secret.pac"
wait
maybe this means something.
the "C" in "SECRET" is encoded the same as the "C" in "PAC"
And note that the A in PAC is encoded as A1, which is only 2 less than the A3 which C is encoded as.
what encoding puts ABCDEF at A1 and up, though?

answer: nothing python 3.11 can encode to.

Maybe this isn't an encoding. Maybe this is encryption.

it's just the ascii value + 64

B3 A5 A3 B2 A5 B4 6E B0 A1 A3
subtract 64 from each byte

>>> ''.join(chr(x-64) for x in [0xB3,0xA5,0xA3,0xB2,0xA5,0xB4,0x6E,0xB0,0xA1,0xA3])
'secret.pac'

also the 54-byte header thing was wrong. it's variable length, because of course it is!
okay so, PAC:
the header for the file itself is 16 bytes.
Then each chunk starts with a null-terminated string, encoded with that silly +64 ASCII mode.
Then there's another NUL byte, then 32 bytes of per-chunk header, then the raw chunk data.
ugh.
the +64 ascii string thing doesn't work for all files. some of them end up negative

34 B6?

THAT DOESN'T MAKE ANY SENSE

way too short to be a filename and it's also -12, 118 after decoding
HOW DO YOU HAVE NEGATIVE ASCII INDEXES
if we assume it loops around and thus this should be F4 76, it's not valid shift-jis, but in utf-16 it'd be 直, which... makes little sense.
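The arithmetic there, spelled out as a sketch. It assumes the wraparound is modulo 256 and that the UTF-16 reading is little-endian; nothing in the game confirms either.

```python
raw = bytes([0x34, 0xB6])

# Plain subtraction goes negative for the first byte.
plain = [b - 64 for b in raw]

# Wrapping modulo 256 keeps every value in byte range.
wrapped = bytes((b - 64) % 256 for b in raw)

print(plain)                        # the -12, 118 pair
print(wrapped.hex(" ").upper())     # the wrapped bytes in hex
print(wrapped.decode("utf-16-le"))  # one UTF-16 code unit
```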
I changed my code to ignore that sometimes the filenames make no sense, but then it errors after that: apparently the filenames not decoding ALSO breaks the variable-length headers. Interesting.

interesting: logo.pac goes "40 3F 00 00 A7 AC AF A7 AF 9F 70 71 6E B4 A9 AD 60 D4"

so my code was stopping after 40 3F.
but A7 AC AF A7 ... looks more like a filename

and it decodes as "glogo_01.tim \x94"
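That decode, as a small sketch. `minus64` is a made-up name for the subtract-64 rule (with modulo-256 wraparound assumed), and the first four bytes are split off only because they don't look like part of the name.

```python
raw = bytes.fromhex("40 3F 00 00 A7 AC AF A7 AF 9F 70 71 6E B4 A9 AD 60 D4")

def minus64(data: bytes) -> str:
    """Subtract 64 from each byte (mod 256) and map each result to one character."""
    return "".join(chr((b - 64) % 256) for b in data)

# Treat the first four bytes as "not filename" and decode the rest.
prefix, rest = raw[:4], raw[4:]

print(prefix.hex(" "))
print(repr(minus64(rest)))
```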
so I must be missing something, like some out-of-band file length indicator
got it. the first 2-4? bytes of the PAC are a list of how many 4-byte words come before the filename.
the 40 3f 00 00 before the filename in LOGO.PAC isn't part of the filename, it's part of the header.

I can't figure out how it's determining when filenames end, though.

Maybe it's assuming they all have extensions and all extensions are 3 letters long?

that makes some of the files make sense and some of the others not make sense!
oh god there is compression
not all files are compressed. but some are

found the code where it parses the PAC headers.
It's terrible as expected.

The pre-pac header stuff gives you a pointer into each header, but then the fun part is that the pointer is not to the beginning, it's to the middle. So it looks things up by indexing forward AND backward

so the filename starts at the offset of, uh, negative 28

and here's how it determines the ending: it's until it hits a 0, OR the filename ends up being 12 characters long.
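Putting those rules together, a filename reader might look roughly like the sketch below. It is not the game's code: `read_filename`, `NAME_OFFSET` and `MAX_NAME_LEN` are made-up names, the pointer is assumed to be a plain byte offset into the file, and the names are assumed to use the subtract-64 encoding with modulo-256 wraparound.

```python
NAME_OFFSET = -28   # filename starts 28 bytes before the pointer
MAX_NAME_LEN = 12   # stop here even if no zero byte was seen

def read_filename(data: bytes, pointer: int) -> str:
    start = pointer + NAME_OFFSET
    if start < 0:
        raise ValueError("pointer is too close to the start of the data")
    chars = []
    for b in data[start:start + MAX_NAME_LEN]:
        if b == 0:                      # rule 1: a zero byte ends the name
            break
        chars.append(chr((b - 64) % 256))
    return "".join(chars)               # rule 2: the slice caps it at 12 characters

def encode(name: str) -> bytes:
    """Test helper: the inverse of the subtract-64 rule."""
    return bytes((ord(c) + 64) % 256 for c in name)

# A 12-character name with no terminator, followed by unrelated bytes.
full = b"\x40\x3f\x00\x00" + encode("glogo_01.tim") + b"\x60\xd4" + bytes(32)
# A shorter name that ends at a zero byte.
short = bytes(4) + encode("secret.pac") + bytes(40)

print(read_filename(full, 4 + 28))
print(read_filename(short, 4 + 28))
```

The 12-character cap would explain why scanning for a terminator alone runs past the end of names that fill the whole field.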

FUCK

someday I'm gonna reverse engineer a game and not want to timetravel back to its creation and ask them WHAT THE FUCK at gunpoint

sometimes I won't even ask, I'll just start shooting

so I'm just gonna take all my current PAC parsing code and throw it out and replace it with the nonsense of the actual code.

that was my fatal mistake: I was writing parsing code assuming this shit made any fucking sense

also I think there's a mistake in this code OR ghidra is decoding it incorrectly.
it seems to be trying to ensure all filenames are uppercase, but because it's wrong, it is corrupting all non-lowercase characters.
they might not have noticed if they apply the same "uppercase" transformation when trying to load filenames, because both would be corrupted in the same way
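One hypothetical way to get that behaviour is an "uppercase" that subtracts 0x20 from every byte without first checking that it's a lowercase letter. This is a guess at the shape of the bug, not what the decompiled code is confirmed to do, and `broken_upper` is an invented name.

```python
def broken_upper(name: str) -> str:
    # Hypothetical bug: no "is this a lowercase letter?" check before subtracting.
    return "".join(chr((ord(c) - 0x20) % 256) for c in name)

# Name as stored when the archive is parsed, and as mangled at lookup time.
stored = broken_upper("glogo_01.tim")
requested = broken_upper("glogo_01.tim")

print(repr(stored))
print(stored == requested)
```

The letters come out uppercase, the underscore, digits and dot turn into junk, and the lookup still matches because both sides are mangled identically.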
okay so now I've got working filenames, offsets, lengths, and compressed lengths. So I can find out what files are where and if they're compressed. I can't uncompress them yet.
I have located the decompression routine.
now to try to figure out what the fuck it does

this decompression routine is big-endian.

on a little-endian system.

WHERE DID THEY GET THIS

it seems it's loading 16bit lengths, then using the top 15 bits? with the lowest bit as a flag?
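If that reading is right, pulling one of those values apart would look something like this. Sketch only: `split_control_word` is a hypothetical name, and the split itself (a 15-bit length on top, a flag in bit 0) is still a guess at this point.

```python
import struct

def split_control_word(data: bytes, offset: int = 0) -> tuple[int, bool]:
    # ">H" = big-endian unsigned 16-bit, matching the routine's byte order.
    (word,) = struct.unpack_from(">H", data, offset)
    length = word >> 1        # top 15 bits
    flag = bool(word & 1)     # lowest bit
    return length, flag

print(split_control_word(b"\x00\x0b"))
print(split_control_word(b"\x80\x00"))
```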

I don't recognize this. I don't think it's DEFLATE

so looking at this code, it doesn't seem to involve huffman encoding. there's no tables, just some look-back with a sliding (I think?) window.

So this is just a slightly fancy RLE, I think?

I'm gonna try to bypass figuring out the compression right now by just stuffing the ghidra code into a C program and calling it from python

Bingo. it works!

Mostly. my output file is always 64mb but that's because I don't have a good way to tell how big it should be

even more bingo.
I have textures now.
there's some weird-ass shit going on here. like, the datafiles have some PAC chunks with type 36.
As far as I can tell, there's no code that handles chunk-36.
So the only way that makes sense is if part of the game dynamically loads code which then registers a chunk-36 parser
so each character is stored in a PAC file under viewer\ (inside AZU.APF)
So like, the first Chiyo-chan is chi_v.pac
That PAC file contains 91 texture images and a 95 kilobyte GMD file, which seems to contain all the geometry AND animations.
so the next step is to figure out how GMD works.
Fortunately I know where the function that parses it starts

the code also seems to handle decoding two versions of the GMD format, but I can only find one in use in the datafiles.

Maybe they used the other in the One Piece games, and just never dropped support?

oh hello. I took a quick look at the second of the One Piece games, and it turns out they did the same thing as Azumanga and included the executable inside the APF file... but they did it twice, and the two don't match!
and it looks like the second One Piece game uses .TMD files, which have a completely different header than the two supported by azumanga
ahh. it looks like the GMD format gets loaded recursively, and I bet that's why there are a total of 3 (not 2, as I suspected) different versions of it.
I bet versions 1 and 0 appear inside version 2

interesting string:
"This is Technosoft\'s communication shell password"

Technosoft is the company that came before Ganbarion. Apparently they built some kind of communication shell utility for their PS1 games

ugh I'm not being paid enough for this.

so it seems there's a complex function-pointer system used so that the main loop can stay the same, and it can be overwritten by overlays loaded from the APF files

but it's a ton of function pointers being thrown around and it just makes me go crosseyed

return (int)*(short *)(**(int **)(*(int *)(param_1 + 0xc) + 0x14) + 0x10);

C is such a simple and expressive language
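Unpacked one dereference at a time, that expression is a chain of four memory reads. The sketch below walks the same chain over a fake memory image. It assumes 32-bit little-endian pointers, and the helper names and the addresses in the demo are invented for illustration.

```python
import struct

def u32(mem, addr):
    return struct.unpack_from("<I", mem, addr)[0]

def s16(mem, addr):
    return struct.unpack_from("<h", mem, addr)[0]

def follow_chain(mem, param_1):
    a = u32(mem, param_1 + 0xC)   # *(int *)(param_1 + 0xc)
    b = u32(mem, a + 0x14)        # first deref of (int **)(a + 0x14)
    c = u32(mem, b)               # second deref: the int that b points at
    return s16(mem, c + 0x10)     # *(short *)(c + 0x10), widened to int

# Fake memory: every address below is arbitrary.
mem = bytearray(0x100)
struct.pack_into("<I", mem, 0x0C, 0x20)   # param_1 + 0xc holds a = 0x20
struct.pack_into("<I", mem, 0x34, 0x40)   # a + 0x14 holds b = 0x40
struct.pack_into("<I", mem, 0x40, 0x60)   # b holds c = 0x60
struct.pack_into("<h", mem, 0x70, -2)     # c + 0x10 holds the short

print(follow_chain(mem, 0))
```

Each intermediate value is most likely a struct field, so giving `a`, `b` and `c` real struct types in the decompiler would collapse this into a few field accesses.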

@foone Why in the name of Odin’s pubes is that mess not typedef’ed?
@mos_8502 ghidra decompiled code. for all I know it is in the original

@mos_8502 @foone I strongly dislike C's function pointer syntax on a good day.

It would be interesting to see the original code, because there's no way this was explicitly constructed by human hands.

@HunterZ @foone I only ever use function pointers and block types with typedefs. The syntax is painful otherwise.
@foone ah, yes. thanks ghidra.
sometimes immense reverse engineering progress just looks like Yomi-at-an-angle
@foone damn you're only on anime 1/71? got some catching up to do

okay I have figured out where most of the controller logic for the char viewer is, and thus I know which function is called when it needs to load a new character

but reversing that will have to wait until tomorrow. I am fried

load_new_character

so fried I randomly pasted the name of the function at the end of that comment
Oh, I thought you were just tapping out for a less fried Foone to take over
these programmers did not have a compiler that was good at optimizing overlays.
I keep tracing through a complicated tree of functions in the overlay, then discovering an identical set of functions in the non-overlay.
and they're not in the part that gets overlaid: they're always visible
@foone maybe they had different people working on different overlays. Then along the way they noticed: multiple people have now implemented the same function over and over, let's include that in the non-overlay part. But then one guy just didn't bother to remove the old ones from his module.
Could be something like that.
so it seems the way this PAC format works is that there's an idea of a PAC-chunk-handler-list, which is a collection of approximately 47 callbacks for each of the chunk types.
But when a PAC file needs to be loaded for a special reason, you can pass a different pac-chunk-handler-list.
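As a rough model of that idea: a handler list is a table of callbacks indexed by chunk type, and the loader uses whichever table it's handed. Everything here is illustrative. The names are made up, 47 is only the approximate count from above, and "type 1 is a texture" is invented for the demo.

```python
from typing import Callable, Optional

Handler = Callable[[bytes], str]
NUM_CHUNK_TYPES = 47  # approximate; the exact count is unverified

def make_handler_list() -> list[Optional[Handler]]:
    return [None] * NUM_CHUNK_TYPES

def load_chunks(chunks, handlers):
    """Dispatch each (chunk_type, payload) pair through the given handler list."""
    results = []
    for chunk_type, payload in chunks:
        handler = handlers[chunk_type] if chunk_type < len(handlers) else None
        if handler is None:
            results.append(f"type {chunk_type}: no handler")
        else:
            results.append(handler(payload))
    return results

# Default list: knows about (hypothetical) type 1 only.
default_handlers = make_handler_list()
default_handlers[1] = lambda p: f"texture, {len(p)} bytes"

# A special-purpose list that also registers a handler for type 36.
viewer_handlers = list(default_handlers)
viewer_handlers[36] = lambda p: f"viewer-only chunk, {len(p)} bytes"

chunks = [(1, b"\x00" * 8), (36, b"\x00" * 4)]
print(load_chunks(chunks, default_handlers))
print(load_chunks(chunks, viewer_handlers))
```

A setup like this would also account for a chunk type that no code in the main executable seems to handle: the handler could live in a list that only an overlay supplies.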
@foone that's gotta be struct field accesses right?
@gsuberland yeah. like 4 deep
@foone I wish ghidra would comment these with clearer pseudocode alongside the syntactically accurate dereference code.
@gsuberland @foone something like that has been on my wishlist for a really long time, or at least some sort of nice UI to build struct defs :/
@foone this might just be a certified ghidra moment
@foone With that horrific piece of indirection, please give me a less expressive language so I'll be horrified about the code instead of being horrified about the design.
@foone https://cdecl.org/ is a self-described "C gibberish to English" translator