Working on an alternative std::vector implementation that's not contiguous and can be the base for my ankerl::unordered_dense::map. https://github.com/martinus/unordered_dense

The big advantage is it doesn't have allocation spikes. Surprisingly, insertion is faster because no elements have to be moved around #cpp

@martinus How does it compare to something like boost::deque/devector or libcxx's std::deque? It seems to be an alternative to those rather than to vector, whose defining characteristic is being contiguous.

@beached_whale I originally tried to use std::deque, but its operator[] performance is a lot slower. It seems deque is optimized for pushing/popping at both ends, but I need very fast indexing performance.

With my container, indexing is only ~5% slower than with std::vector in my benchmarks.

@beached_whale See here for the assembler generated for indexing: https://godbolt.org/z/vz4fvbxGd

int const& index_vector(std::vector<int> const& v, size_t idx) { return v[idx]; }
int const& index_deque(std::deque<int> const& v, size_t idx) { return v[idx]; }
int const& index_segmented_vector(ankerl::segmented_vector<int> const& v, size_t idx) { return v[idx]; }

@martinus This is why I said libcxx's deque: it uses 4k blocks, not 512-byte or 64-byte blocks as libstdc++ and MS STL do. The asm is much better: https://godbolt.org/z/r5af88hEf

@beached_whale Ah, nice. Unfortunately I can't use libc++ everywhere; it's still missing a lot of features.

@beached_whale Also, it seems the generated code is worse when using e.g. std::string as the element type: std::deque then doesn't use a power-of-2 block size, which would be faster. Although the size of the allocated blocks won't be that close to 4k anymore.

https://godbolt.org/z/PsG1zjx1s


using T = std::string;
//using T = int;

T const& index_vector(std::vector<T> const& v, size_t idx) { return v[idx]; }
T const& index_deque(std::deque<T> const& v, size_t idx) { return v[idx]; }
T const& index_segmented_vector(ankerl::segmented_vector<T> const& v, size_t idx) { return v[idx]; }

auto size_of_t() { return sizeof(T); }

@martinus If I remember correctly it's something like 4096/sizeof(T) items per block, not 4k items. Yeah, libcxx is just better than the other std implementations in terms of allocation cost (assuming push_front/pop_front are not dominant).

I wonder what boost is using for their new flat map

---
Just realized I was thinking out loud and not really making a point

@martinus What's your implementation strategy? I'm using a segmented vector for the ability to safely store pointers to elements; it uses increasing powers of two as block sizes.

@foonathan I tried that too, even with CLZ intrinsics to find the index. For some reason that was still not as fast as this approach:

I allocate fixed-size segments, e.g. 4096 bytes (it's a template argument), and make sure that exactly 2^x elements fit into one. The segments are stored in a std::vector. Indexing then becomes

return m_segments[idx >> num_bits][idx & mask];

where num_bits and mask are constexpr members.
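A minimal sketch of how I read this scheme; the names (SegmentedVectorSketch, SegmentBytes) are illustrative and not the actual ankerl::segmented_vector API:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// helpers to round the per-segment element count down to a power of two
constexpr std::size_t pow2_floor(std::size_t n) {
    std::size_t p = 1;
    while (p * 2 <= n) p *= 2;
    return p;
}
constexpr std::size_t log2_of(std::size_t p) {
    std::size_t b = 0;
    while (p > 1) { p >>= 1; ++b; }
    return b;
}

template <typename T, std::size_t SegmentBytes = 4096>
class SegmentedVectorSketch {
    // exactly 2^num_bits elements fit into one fixed-size segment
    static constexpr std::size_t num_elems = pow2_floor(SegmentBytes / sizeof(T));
    static constexpr std::size_t num_bits = log2_of(num_elems);
    static constexpr std::size_t mask = num_elems - 1;

    std::vector<std::unique_ptr<T[]>> m_segments;
    std::size_t m_size = 0;

public:
    void push_back(T const& value) {
        if (m_size == m_segments.size() * num_elems) {
            // all segments full: allocate one more fixed-size segment.
            // existing elements are never moved, so no allocation spike.
            m_segments.push_back(std::make_unique<T[]>(num_elems));
        }
        m_segments[m_size >> num_bits][m_size & mask] = value;
        ++m_size;
    }

    // indexing is one shift, one mask, and two loads
    T const& operator[](std::size_t idx) const {
        return m_segments[idx >> num_bits][idx & mask];
    }

    std::size_t size() const { return m_size; }
};
```

Because num_bits and mask are compile-time constants, the shift and mask fold into the generated code, which is why the indexing asm stays close to std::vector's.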

@martinus Ah right, I've abandoned contiguous indices and use num_bits == 21 (2 MiB) regardless of actual block size.
@foonathan I might change my implementation so that it behaves like std::vector until the block size is reached (doubling capacity and moving elements), and then switches to always allocating same-size segments. Then big segment sizes are not a problem even for containers that don't need to hold many elements, and indexing stays just as fast.
@martinus Wait, if you don't actually care about address stability, why have a segmented implementation? push_front?
@foonathan I don't want to have the memory allocation spikes when the vector gets full and relocates
@martinus Ah, I see. Yeah, then I reckon a hybrid approach might work.

> @foonathan Ah right, I've abandoned contiguous indices and use num_bits == 21 (2 MiB) regardless of actual block size.

I don't think I understand this. What I did previously was basically implement this here: https://plflib.org/matt_bentley_-_pot_blocks_bit_tricks.pdf

That way you can have contiguous indices. Finding the index of the actual bucket is more or less just a CLZ operation, and finding the index within the bucket means removing the highest set bit.

@martinus Yes, I had that before as well, but I also found it too slow. Since I don't need contiguous indices, I basically treat every block as if it were 2 MiB - so the indices are something like 0, 1, 2, 3, 2097152, ...
@foonathan Ah, thanks for explaining, that's a neat trick! I didn't think about just skipping indices
@martinus It is weird, especially when you're debugging. You normally don't see numbers that big in a debugging session for a small example... :D
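For completeness, the skipped-index scheme described above can be sketched as follows; the function names are mine, assumed for illustration:

```cpp
#include <cstddef>

// every block is treated as if it spanned 2 MiB worth of indices,
// regardless of how many elements it actually holds
constexpr std::size_t num_bits = 21;
constexpr std::size_t mask = (std::size_t{1} << num_bits) - 1;

// element `offset` of block `block` gets a (non-contiguous) index
constexpr std::size_t make_index(std::size_t block, std::size_t offset) {
    return (block << num_bits) | offset;
}

// splitting an index back up is a shift and a mask, independent of block size
constexpr std::size_t block_of(std::size_t idx)  { return idx >> num_bits; }
constexpr std::size_t offset_of(std::size_t idx) { return idx & mask; }
```

With a small first block, the indices come out as 0, 1, 2, 3, then jump to 2097152 for the first element of the next block - exactly the "big numbers in the debugger" effect mentioned above.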