Working on an alternative std::vector implementation that's not contiguous and can be the base for my ankerl::unordered_dense::map. https://github.com/martinus/unordered_dense

The big advantage is it doesn't have allocation spikes. Surprisingly, insertion is faster because no elements have to be moved around #cpp

@martinus How does it compare to something like boost::deque/devector or libcxx's std::deque? It seems to be an alternative to those rather than to vector, whose defining characteristic is being contiguous.

@beached_whale I originally tried to use std::deque, but its operator[] performance is a lot slower. It seems deque is optimized for pushing/popping at both ends, but I need very fast indexing performance.

With my container, indexing is only ~5% slower than with std::vector in my benchmarks.

@beached_whale See here for the assembler generated for indexing: https://godbolt.org/z/vz4fvbxGd

int const& index_vector(std::vector<int> const& v, size_t idx) { return v[idx]; }
int const& index_deque(std::deque<int> const& v, size_t idx) { return v[idx]; }
int const& index_segmented_vector(ankerl::segmented_vector<int> const& v, size_t idx) { return v[idx]; }

@martinus This is why I said libcxx's deque: it uses 4k blocks, not 512-byte or 64-byte blocks as libstdc++ and MS STL do. The asm is much better: https://godbolt.org/z/r5af88hEf

@beached_whale Ah, nice. Unfortunately I can't use libc++ everywhere; it's still missing a lot of features.

@beached_whale Also, it seems the generated code is worse when using e.g. std::string as the element type: std::deque then doesn't use a power-of-2 block size, which would be faster. Although the size of the allocated blocks won't be that close to 4k anymore.

https://godbolt.org/z/PsG1zjx1s


using T = std::string;
//using T = int;

T const& index_vector(std::vector<T> const& v, size_t idx) { return v[idx]; }
T const& index_deque(std::deque<T> const& v, size_t idx) { return v[idx]; }
T const& index_segmented_vector(ankerl::segmented_vector<T> const& v, size_t idx) { return v[idx]; }

auto size_of_t() { return sizeof(T); }

@martinus If I remember correctly it's something like 4096/sizeof(T) items per block, not 4k items. Yeah, libcxx is just better than the other std implementations in terms of allocation cost (assuming push_front/pop_front are not dominant).

I wonder what boost is using for their new flat map

---
Just realized I was thinking out loud and not really making a point

@martinus What's your implementation strategy? I'm using a segmented vector for the ability to safely store pointers to elements; it uses increasing powers of two as block sizes.

@foonathan I tried that too, even with CLZ intrinsics to find the index. For some reason that was still not as fast as this approach:

I allocate fixed-size segments, e.g. 4096 bytes (it's a template argument), and make sure that exactly 2^x elements fit into one. The segments are stored in a std::vector. Indexing then becomes

return m_segments[idx >> num_bits][idx & mask];

where num_bits and mask are constexpr members.
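A minimal sketch of how I read this scheme; the names (SegmentedVectorSketch, SegmentBytes) are illustrative and not the actual ankerl::segmented_vector API:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// helpers to round the per-segment element count down to a power of two
constexpr std::size_t pow2_floor(std::size_t n) {
    std::size_t p = 1;
    while (p * 2 <= n) p *= 2;
    return p;
}
constexpr std::size_t log2_of(std::size_t p) {
    std::size_t b = 0;
    while (p > 1) { p >>= 1; ++b; }
    return b;
}

template <typename T, std::size_t SegmentBytes = 4096>
class SegmentedVectorSketch {
    // exactly 2^num_bits elements fit into one fixed-size segment
    static constexpr std::size_t num_elems = pow2_floor(SegmentBytes / sizeof(T));
    static constexpr std::size_t num_bits = log2_of(num_elems);
    static constexpr std::size_t mask = num_elems - 1;

    std::vector<std::unique_ptr<T[]>> m_segments;
    std::size_t m_size = 0;

public:
    void push_back(T const& value) {
        if (m_size == m_segments.size() * num_elems) {
            // all segments full: allocate one more fixed-size segment.
            // existing elements are never moved, so no allocation spike.
            m_segments.push_back(std::make_unique<T[]>(num_elems));
        }
        m_segments[m_size >> num_bits][m_size & mask] = value;
        ++m_size;
    }

    // indexing is one shift, one mask, and two loads
    T const& operator[](std::size_t idx) const {
        return m_segments[idx >> num_bits][idx & mask];
    }

    std::size_t size() const { return m_size; }
};
```

Because num_bits and mask are compile-time constants, the shift and mask fold into the generated code, which is why the indexing asm stays close to std::vector's.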

@martinus Ah right, I've abandoned contiguous indices and use num_bits == 21 (2 MiB) regardless of actual block size.
@foonathan I might change my implementation so that it behaves like std::vector until the block size is reached (doubling capacity and moving elements), and then switches to always allocating same-size segments. Then big segment sizes are not a problem even for containers that don't need to hold many elements, and indexing stays just as fast.
@martinus Wait, if you don't actually care about address stability, why have a segmented implementation? push_front?
@foonathan I don't want to have the memory allocation spikes when the vector gets full and relocates
@martinus Ah, I see. Yeah, then I reckon a hybrid approach might work.

> @foonathan Ah right, I've abandoned contiguous indices and use num_bits == 21 (2 MiB) regardless of actual block size.

I don't think I understand this. What I did previously was basically implement this here: https://plflib.org/matt_bentley_-_pot_blocks_bit_tricks.pdf

That way you can have contiguous indices. Finding the index of the actual bucket is more or less just a CLZ operation, and finding the index within the bucket means removing the highest set bit.

@martinus Yes, I had that before as well, but I also found it too slow. Since I don't need contiguous indices, I basically treat every block as if it were 2 MiB - so the indices are something like 0, 1, 2, 3, 2097152, ...
@foonathan Ah, thanks for explaining, that's a neat trick! I didn't think about just skipping indices
@martinus It is weird, especially when you're debugging. You normally don't see numbers that big in a debugging session for a small example... :D
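For completeness, the skipped-index scheme described above can be sketched as follows; the function names are mine, assumed for illustration:

```cpp
#include <cstddef>

// every block is treated as if it spanned 2 MiB worth of indices,
// regardless of how many elements it actually holds
constexpr std::size_t num_bits = 21;
constexpr std::size_t mask = (std::size_t{1} << num_bits) - 1;

// element `offset` of block `block` gets a (non-contiguous) index
constexpr std::size_t make_index(std::size_t block, std::size_t offset) {
    return (block << num_bits) | offset;
}

// splitting an index back up is a shift and a mask, independent of block size
constexpr std::size_t block_of(std::size_t idx)  { return idx >> num_bits; }
constexpr std::size_t offset_of(std::size_t idx) { return idx & mask; }
```

With a small first block, the indices come out as 0, 1, 2, 3, then jump to 2097152 for the first element of the next block - exactly the "big numbers in the debugger" effect mentioned above.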