C++ Hash Table Implementation: Building unordered_map and unordered_set from Scratch

东予薏米

1. Encapsulating a C++ Hash Table in Practice: Implementing unordered_map and unordered_set from Scratch

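In the C++ standard library, unordered_map and unordered_set are efficient associative containers built on hash tables. This article walks through how a hash table works and builds a simplified version of both containers. Rather than a textbook-style tour of the theory, I draw on years of engineering practice to share the techniques that matter in real code and the pitfalls to avoid.

The core advantage of a hash table is lookup in O(1) time on average, which makes it a go-to data structure for large data sets. Our implementation covers the following key points: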

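Separate chaining to resolve hash collisions
Dynamic resizing with a prime-table optimization
Iterators that traverse across buckets
Type extraction and the functor design pattern
const-correctness control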

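2. Basic Structure and Design of the Hash Table

2.1 Hash Node and Basic Framework

The basic building block of the hash table is the hash node. We handle collisions with a chained structure: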

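namespace hash_bucket {
    template<class T>
    struct HashNode {
        T _data;             // stored element (a key for a set, a key/value pair for a map)
        HashNode<T>* _next;  // next node in the same bucket, or nullptr

        HashNode(const T& data)
            : _data(data)
            , _next(nullptr) 
        {}
    };
}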

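Each node holds a data field and a pointer to the next node, so the nodes in one bucket form a singly linked list. This structure is simple and efficient, and it is the classic way to handle hash collisions.

2.2 Hash Function Design

The quality of the hash function directly affects hash table performance. We provide a default implementation plus a specialization: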

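// Default hash: works for integer-like keys that convert to size_t.
template<class K>
struct HashFunc {
    size_t operator()(const K& key) const {
        return static_cast<size_t>(key);
    }
};

// Specialization for std::string
template<>
struct HashFunc<std::string> {
    size_t operator()(const std::string& key) const {
        size_t hash = 0;
        for (unsigned char ch : key) {  // unsigned char avoids sign extension for bytes >= 0x80
            hash += ch;
            hash *= 131;  // multiply by the prime 131 so that character order affects the result
        }
        return hash;
    }
};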

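From experience: multipliers such as 31 and 131 are popular for string hashing because they are odd primes that tend to spread typical strings well, and multiplying by 31 can be reduced to a shift and a subtraction (31 * h == (h << 5) - h). In production code, more sophisticated algorithms such as MurmurHash are also worth considering.

2.3 Prime-Table Resizing

The bucket index is the hash value modulo the bucket count, so choosing a prime number of buckets when the table grows helps reduce collisions: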
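To see why the multiply step matters, the sketch below compares the multiply-and-add hash above with a plain sum of characters. `bkdr_hash`, `sum_hash`, and `check_string_hash` are hypothetical names introduced for illustration; they are not part of the article's code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical free-function form of the string hash above (multiplier 131).
inline std::size_t bkdr_hash(const std::string& key) {
    std::size_t hash = 0;
    for (unsigned char ch : key) {
        hash += ch;
        hash *= 131;  // unsigned overflow wraps around, which is well defined
    }
    return hash;
}

// A naive additive hash, shown only for comparison: it ignores character order.
inline std::size_t sum_hash(const std::string& key) {
    std::size_t hash = 0;
    for (unsigned char ch : key) hash += ch;
    return hash;
}

// Small self-check; call it from main() to try the sketch.
inline void check_string_hash() {
    // Anagrams collide under a plain sum but not under the multiplicative hash.
    assert(sum_hash("ab") == sum_hash("ba"));
    assert(bkdr_hash("ab") != bkdr_hash("ba"));
    assert(bkdr_hash("ab") == 1677455);  // (97 * 131 + 98) * 131
}
```

With a plain sum, every permutation of the same characters lands in the same bucket; the multiplication is what makes position count.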

#include <algorithm>  // std::lower_bound

// Prime table in the style of the SGI STL. Names that contain a double underscore
// are reserved for the implementation, so rename them in your own code.
static const int __stl_num_primes = 28;
static const unsigned long __stl_prime_list[__stl_num_primes] = {
    53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593,
    49157, 98317, 196613, 393241, 786433, 1572869, 3145739,
    6291469, 12582917, 25165843, 50331653, 100663319,
    201326611, 402653189, 805306457, 1610612741, 3221225473,
    4294967291
};

// Returns the smallest listed prime >= n, or the largest listed prime if n exceeds them all.
inline unsigned long __stl_next_prime(unsigned long n) {
    const auto* first = __stl_prime_list;
    const auto* last = __stl_prime_list + __stl_num_primes;
    const auto* pos = std::lower_bound(first, last, n);
    return pos == last ? *(last - 1) : *pos;
}

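A resize is triggered when the load factor (element count divided by bucket count) reaches 1, which balances space utilization against lookup time.

3. Core Hash Table Implementation

3.1 Basic Framework and Insertion

The hash table class template is defined as follows: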
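As a quick illustration of the growth sequence this produces, here is a minimal sketch with a shortened prime list. `kPrimes`, `next_prime`, and `check_growth_sequence` are hypothetical stand-ins for the names above, chosen so the example compiles on its own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// A shortened copy of the prime list above, enough for this sketch.
static const unsigned long kPrimes[] = {53, 97, 193, 389, 769, 1543};
static const std::size_t kNumPrimes = sizeof(kPrimes) / sizeof(kPrimes[0]);

// Hypothetical stand-in for __stl_next_prime: smallest listed prime >= n,
// or the largest listed prime if n exceeds every entry.
inline unsigned long next_prime(unsigned long n) {
    const unsigned long* first = kPrimes;
    const unsigned long* last = kPrimes + kNumPrimes;
    const unsigned long* pos = std::lower_bound(first, last, n);
    return pos == last ? *(last - 1) : *pos;
}

// Walks the bucket counts a table would go through as it fills up.
inline void check_growth_sequence() {
    unsigned long buckets = next_prime(1);  // initial size
    assert(buckets == 53);
    buckets = next_prime(buckets + 1);      // first resize
    assert(buckets == 97);
    buckets = next_prime(buckets + 1);      // second resize
    assert(buckets == 193);
    assert(next_prime(53) == 53);           // an exact match is returned as-is
    assert(next_prime(100000) == 1543);     // clamped at the end of the list
}
```

Passing `size + 1` is what forces the lookup to move past the current bucket count; each listed prime is roughly double the previous one, so the amortized cost of rehashing stays constant per insertion.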

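// Forward declaration so HashTable can name its iterator type.
template<class K, class T, class Ref, class Ptr, class KeyOfT, class Hash>
struct HTIterator;

template<class K, class T, class KeyOfT, class Hash>
class HashTable {
    typedef HashNode<T> Node;

    // The iterator reads _tables directly when it moves between buckets.
    template<class K1, class T1, class Ref, class Ptr, class KeyOfT1, class Hash1>
    friend struct HTIterator;

    std::vector<Node*> _tables;  // bucket array; each slot is the head of a chain
    size_t _n;                   // number of stored elements

public:
    typedef HTIterator<K, T, T&, T*, KeyOfT, Hash> Iterator;

    HashTable() : _tables(__stl_next_prime(1), nullptr), _n(0) {}
    // A destructor that frees the remaining nodes is not shown in this excerpt.
    
    // Insert: returns {iterator to the element, whether a new node was added}
    std::pair<Iterator, bool> Insert(const T& data) {
        KeyOfT kot;
        if (auto it = Find(kot(data)); it != End())
            return {it, false};

        Hash hs;
        // Resize check: grow once the load factor reaches 1
        if (_n == _tables.size()) {
            std::vector<Node*> new_tables(__stl_next_prime(_tables.size() + 1), nullptr);
            // Rehash every element by relinking its node into the new bucket array
            for (size_t i = 0; i < _tables.size(); ++i) {
                Node* cur = _tables[i];
                while (cur) {
                    Node* next = cur->_next;
                    size_t hashi = hs(kot(cur->_data)) % new_tables.size();
                    cur->_next = new_tables[hashi];
                    new_tables[hashi] = cur;
                    cur = next;
                }
                _tables[i] = nullptr;
            }
            _tables.swap(new_tables);
        }
        
        // Head insertion into the target bucket
        size_t hashi = hs(kot(data)) % _tables.size();
        Node* newnode = new Node(data);
        newnode->_next = _tables[hashi];
        _tables[hashi] = newnode;
        ++_n;
        return {Iterator(newnode, this), true};
    }
};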

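Insertion uses head insertion into the bucket's chain; it runs in O(1) time on average and O(n) in the worst case. A resize has to recompute the bucket of every element, which makes it the most expensive operation on the table. The rehash loop relinks the existing nodes into the new bucket array rather than allocating new ones.

3.2 Lookup and Deletion

Iterator Find(const K& key) {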
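The insert-then-rehash flow can be tried in isolation with a stripped-down table. The sketch below is a simplification under stated assumptions: int keys, an identity hash, and doubling instead of the prime list. `IntNode`, `IntChainSet`, and `check_rehash` are hypothetical names, not the article's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct IntNode {
    int key;
    IntNode* next;
};

class IntChainSet {
public:
    explicit IntChainSet(std::size_t buckets = 4) : tables_(buckets, nullptr), n_(0) {}
    IntChainSet(const IntChainSet&) = delete;
    IntChainSet& operator=(const IntChainSet&) = delete;
    ~IntChainSet() {
        for (IntNode* head : tables_) {
            while (head) { IntNode* next = head->next; delete head; head = next; }
        }
    }

    bool contains(int key) const {
        for (IntNode* cur = tables_[index(key, tables_.size())]; cur; cur = cur->next)
            if (cur->key == key) return true;
        return false;
    }

    bool insert(int key) {
        if (contains(key)) return false;
        if (n_ == tables_.size()) rehash(tables_.size() * 2);  // load factor reached 1
        std::size_t i = index(key, tables_.size());
        tables_[i] = new IntNode{key, tables_[i]};              // head insertion
        ++n_;
        return true;
    }

    std::size_t size() const { return n_; }
    std::size_t bucket_count() const { return tables_.size(); }

private:
    static std::size_t index(int key, std::size_t buckets) {
        return static_cast<std::size_t>(key) % buckets;
    }

    // Relinks every existing node into a larger bucket array; no node is copied.
    void rehash(std::size_t new_size) {
        std::vector<IntNode*> fresh(new_size, nullptr);
        for (IntNode*& head : tables_) {
            while (head) {
                IntNode* next = head->next;
                std::size_t i = index(head->key, new_size);
                head->next = fresh[i];
                fresh[i] = head;
                head = next;
            }
        }
        tables_.swap(fresh);
    }

    std::vector<IntNode*> tables_;
    std::size_t n_;
};

// Small self-check; call it from main() to try the sketch.
inline void check_rehash() {
    IntChainSet s(4);
    for (int k = 0; k < 4; ++k) s.insert(k);
    assert(s.bucket_count() == 4);   // load factor is 1, but nothing has grown yet
    s.insert(4);                     // this insert relinks all nodes first
    assert(s.bucket_count() == 8);
    for (int k = 0; k < 5; ++k) assert(s.contains(k));
    bool inserted_again = s.insert(2);
    assert(!inserted_again);         // duplicates are rejected
    assert(s.size() == 5);
}
```

Because the nodes are relinked rather than reallocated, a resize costs one pass over the elements and a single allocation for the new bucket array.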
    KeyOfT kot;
    Hash hs;
    size_t hashi = hs(key) % _tables.size();
    Node* cur = _tables[hashi];
    
    while (cur) {
        if (kot(cur->_data) == key)
            return Iterator(cur, this);
        cur = cur->_next;
    }
    return End();
}

bool Erase(const K& key) {
    KeyOfT kot;
    Hash hs;
    size_t hashi = hs(key) % _tables.size();
    Node* prev = nullptr;
    Node* cur = _tables[hashi];
    
    while (cur) {
        if (kot(cur->_data) == key) {
            if (!prev) _tables[hashi] = cur->_next;
            else prev->_next = cur->_next;
            
            delete cur;
            --_n;
            return true;
        }
        prev = cur;
        cur = cur->_next;
    }
    return false;
}

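Pitfall: when deleting a node, always relink the predecessor's pointer (or the bucket head when the node is first in its chain) before freeing it; otherwise the chain breaks or memory leaks. In a multithreaded environment you also need to think about locking.

4. Iterator Implementation

4.1 Iterator Design

What makes a hash table iterator special is that it must be able to move from one bucket to the next: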
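One way to make the predecessor bookkeeping harder to get wrong is the pointer-to-pointer idiom. This is an alternative to the prev/cur loop in Erase, not the approach used there. The sketch assumes a bare int-keyed chain; `Node`, `erase_from_chain`, and `check_erase` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical minimal chain node, just for this sketch.
struct Node {
    int key;
    Node* next;
};

// Unlinks and frees the first node whose key matches.
// `link` always points at the pointer that leads to `cur` (either the bucket
// head or the previous node's `next`), so there is no separate head case.
inline bool erase_from_chain(Node*& head, int key) {
    for (Node** link = &head; *link != nullptr; link = &(*link)->next) {
        Node* cur = *link;
        if (cur->key == key) {
            *link = cur->next;  // bypass the node before freeing it
            delete cur;
            return true;
        }
    }
    return false;
}

// Small self-check; call it from main() to try the sketch.
inline void check_erase() {
    // Build the chain 1 -> 2 -> 3.
    Node* head = new Node{1, new Node{2, new Node{3, nullptr}}};

    bool erased_middle = erase_from_chain(head, 2);    // middle node
    assert(erased_middle && head->key == 1 && head->next->key == 3);

    bool erased_head = erase_from_chain(head, 1);      // head node
    assert(erased_head && head->key == 3);

    bool erased_missing = erase_from_chain(head, 42);  // absent key
    assert(!erased_missing);

    bool erased_last = erase_from_chain(head, 3);      // last remaining node
    assert(erased_last && head == nullptr);
}
```

Both forms do the same work; the trade-off is one fewer branch against slightly denser pointer syntax.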

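template<class K, class T, class Ref, class Ptr, class KeyOfT, class Hash>
struct HTIterator {
    typedef HashNode<T> Node;
    typedef HashTable<K, T, KeyOfT, Hash> HT;
    typedef HTIterator<K, T, Ref, Ptr, KeyOfT, Hash> Self;

    Node* _node;     // current node; nullptr marks end()
    const HT* _pht;  // owning table, needed to locate the next bucket

    HTIterator(Node* node, const HT* pht) : _node(node), _pht(pht) {}

    Ref operator*() const { return _node->_data; }
    Ptr operator->() const { return &_node->_data; }

    // Two iterators are equal when they point at the same node.
    bool operator==(const Self& other) const { return _node == other._node; }
    bool operator!=(const Self& other) const { return _node != other._node; }

    // Accessing _pht->_tables requires HTIterator to be a friend of HashTable.
    Self& operator++() {
        if (_node->_next) {
            // Case 1: the current bucket still has another node
            _node = _node->_next;
        } else {
            // Case 2: scan forward for the next non-empty bucket
            KeyOfT kot;
            Hash hs;
            size_t hashi = hs(kot(_node->_data)) % _pht->_tables.size();
            ++hashi;
            while (hashi < _pht->_tables.size()) {
                if (_pht->_tables[hashi]) {
                    _node = _pht->_tables[hashi];
                    return *this;
                }
                ++hashi;
            }
            _node = nullptr;  // no more elements: becomes end()
        }
        return *this;
    }
};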

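The iterator stores both the current node pointer and a pointer to the hash table, which is what makes cross-bucket traversal possible. operator++ handles two cases: the current bucket still has another node, or the iterator must jump to the next non-empty bucket.

4.2 Implementing begin() and end()

Iterator Begin() {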
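The two cases can be exercised on a bare bucket array without any of the template machinery. The following is a minimal sketch assuming int keys and an identity hash; `Node`, `first_node_from`, `next_node`, and `collect_keys` are hypothetical helpers, not members of the iterator above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Node {
    int key;
    Node* next;
};

// Returns the head of the first non-empty bucket at index >= start, or nullptr.
inline Node* first_node_from(const std::vector<Node*>& tables, std::size_t start) {
    for (std::size_t i = start; i < tables.size(); ++i)
        if (tables[i]) return tables[i];
    return nullptr;
}

// Advance one step: stay in the chain if possible, otherwise scan later buckets.
inline Node* next_node(const std::vector<Node*>& tables, Node* cur) {
    assert(cur != nullptr);  // advancing past the end is a caller error
    if (cur->next) return cur->next;
    // Recompute the current bucket from the key, as operator++ does.
    std::size_t bucket = static_cast<std::size_t>(cur->key) % tables.size();
    return first_node_from(tables, bucket + 1);
}

// Visits every key in bucket order, the way a begin()..end() loop would.
inline std::vector<int> collect_keys(const std::vector<Node*>& tables) {
    std::vector<int> keys;
    for (Node* cur = first_node_from(tables, 0); cur; cur = next_node(tables, cur))
        keys.push_back(cur->key);
    return keys;
}
```

Note that the visiting order depends on bucket layout and insertion history rather than on key order, which is why the containers are called unordered.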
    for (size_t i = 0; i < _tables.size(); ++i) {
        if (_tables[i]) {
            return Iterator(_tables[i], this);
        }
    }
    return End();
}

Iterator End() {
    return Iterator(nullptr, this);
}

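begin() returns the first node of the first non-empty bucket, and end() returns an iterator that wraps a null node pointer. This design follows the usual STL iterator pattern.

5. Wrapping unordered_map and unordered_set

5.1 Implementing unordered_set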

namespace xxx {
    template<class K, class Hash = HashFunc<K>>
    class unordered_set {
        // For a set, the stored element is the key itself.
        struct SetKeyOfT {
            const K& operator()(const K& key) const { return key; }
        };
        
    public:
        typedef typename HashTable<K, const K, SetKeyOfT, Hash>::Iterator iterator;
        
        iterator begin() { return _ht.Begin(); }
        iterator end() { return _ht.End(); }
        
        std::pair<iterator, bool> insert(const K& key) {
            return _ht.Insert(key);
        }
        
        // Other interfaces...
        
    private:
        HashTable<K, const K, SetKeyOfT, Hash> _ht;
    };
}

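unordered_set reuses the hash table directly and extracts the key through the SetKeyOfT functor. Note that the element type is passed as const K so that keys cannot be modified.

5.2 Implementing unordered_map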
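The key-extraction functor is what lets a single table template serve both containers. As a rough sketch of the idea, the generic routine below compares keys without knowing whether the stored element is a bare key or a key/value pair. `SetKeyOf`, `MapKeyOf`, `same_key`, and `check_key_extraction` are hypothetical names modeled on the article's functors.

```cpp
#include <cassert>
#include <string>
#include <utility>

// For a set, the stored element is already the key.
struct SetKeyOf {
    template <class K>
    const K& operator()(const K& key) const { return key; }
};

// For a map, the key is the first member of the stored pair.
struct MapKeyOf {
    template <class K, class V>
    const K& operator()(const std::pair<const K, V>& kv) const { return kv.first; }
};

// Generic routine in the style of the table's Find loop: it only ever
// touches the element through the extractor.
template <class K, class T, class KeyOfT>
bool same_key(const T& stored, const K& key) {
    KeyOfT kot;
    return kot(stored) == key;
}

// Small self-check; call it from main() to try the sketch.
inline void check_key_extraction() {
    assert((same_key<int, int, SetKeyOf>(7, 7)));
    assert(!(same_key<int, int, SetKeyOf>(7, 8)));

    std::pair<const std::string, int> kv("apple", 5);
    assert((same_key<std::string, std::pair<const std::string, int>, MapKeyOf>(
        kv, std::string("apple"))));
}
```

The table code never mentions `first` or `second`; swapping the functor is the only change needed to go from set to map.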

namespace xxx {
    template<class K, class V, class Hash = HashFunc<K>>
    class unordered_map {
        // For a map, the key is the first member of the stored pair.
        struct MapKeyOfT {
            const K& operator()(const std::pair<const K, V>& kv) const {
                return kv.first;
            }
        };
        
    public:
        typedef typename HashTable<K, std::pair<const K, V>, MapKeyOfT, Hash>::Iterator iterator;
        
        iterator begin() { return _ht.Begin(); }
        iterator end() { return _ht.End(); }
        
        std::pair<iterator, bool> insert(const std::pair<const K, V>& kv) {
            return _ht.Insert(kv);
        }
        
        // Returns a reference to the mapped value, inserting V() first if the key is absent.
        V& operator[](const K& key) {
            auto ret = _ht.Insert({key, V()});
            return ret.first->second;
        }
        
        // Other interfaces...
        
    private:
        HashTable<K, std::pair<const K, V>, MapKeyOfT, Hash> _ht;
    };
}

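unordered_map's operator[] is implemented on top of insert: if the key does not exist, a default-constructed value is inserted. This is a common implementation approach in the STL and provides a convenient access interface.

6. Testing and Verification

Test code to verify each feature: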
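The same insert-based pattern can be checked against the standard container. The helper below is a hypothetical illustration (`get_or_insert` and `check_subscript_semantics` are not part of any library); it relies only on `std::unordered_map::insert` returning an iterator to the existing element when the key is already present.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Spells out what operator[] does, using only insert().
template <class Map>
typename Map::mapped_type& get_or_insert(Map& m, const typename Map::key_type& key) {
    // insert() leaves an existing entry untouched and returns an iterator to it.
    auto ret = m.insert(typename Map::value_type(key, typename Map::mapped_type()));
    return ret.first->second;
}

// Small self-check; call it from main() to try the sketch.
inline void check_subscript_semantics() {
    std::unordered_map<std::string, int> counts;
    get_or_insert(counts, "apple") = 5;   // missing key: inserts 0, then assigns 5
    get_or_insert(counts, "apple") += 2;  // existing key: no insertion, updated in place
    assert(counts.size() == 1);
    assert(counts.at("apple") == 7);

    int pear = counts["pear"];            // the standard operator[] also inserts on a miss
    assert(pear == 0);
    assert(counts.size() == 2);
}
```

One consequence worth remembering: reading a missing key through operator[] silently grows the container, so use a find-style lookup when a miss should not insert.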

#include <iostream>
#include <string>

void test_unordered_set() {
    xxx::unordered_set<int> us;
    us.insert(3);
    us.insert(1);
    us.insert(4);
    us.insert(1);  // duplicate insert, ignored
    
    for (auto it = us.begin(); it != us.end(); ++it) {
        std::cout << *it << " ";
    }
    std::cout << std::endl;
}

void test_unordered_map() {
    xxx::unordered_map<std::string, int> word_count;
    word_count["apple"] = 5;
    word_count["banana"] = 3;
    word_count["apple"] += 2;  // modify an existing value
    
    // Structured bindings require C++17
    for (auto& [word, count] : word_count) {
        std::cout << word << ": " << count << std::endl;
    }
}

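Tests should cover the following scenarios:

Basic insertion and lookup
Duplicate key handling
Resize triggering
Iterator traversal
Boundary conditions (empty container, first and last elements, and so on)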

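7. Performance Optimization and Engineering Practice

In real projects, optimizing hash table performance involves the following considerations:

Load factor tuning: choose a load factor threshold that suits the workload, balancing space against time
Memory pool: frequent node allocation and deallocation can become a bottleneck, so consider a memory pool
Hash function optimization: design dedicated hash functions for specific data types
Concurrency safety: multithreaded environments need appropriate synchronization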
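For the load factor point, the standard containers expose the relevant knobs, which makes the effect easy to observe. The sketch below uses `std::unordered_set` rather than the table built in this article, and `count_rehashes` and `check_reserve` are hypothetical helpers. It counts how often the bucket array changes size during a run of insertions.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

// Returns how many times the bucket array changed size while inserting 0..n-1.
inline std::size_t count_rehashes(std::unordered_set<int>& s, int n) {
    std::size_t rehashes = 0;
    std::size_t buckets = s.bucket_count();
    for (int i = 0; i < n; ++i) {
        s.insert(i);
        if (s.bucket_count() != buckets) {
            ++rehashes;
            buckets = s.bucket_count();
        }
    }
    return rehashes;
}

// Small self-check; call it from main() to try the sketch.
inline void check_reserve() {
    std::unordered_set<int> reserved;
    reserved.reserve(10000);  // enough buckets for 10000 elements up front
    std::size_t rehashes = count_rehashes(reserved, 10000);
    assert(rehashes == 0);    // no rehash happens within the reserved capacity
    assert(reserved.load_factor() <= reserved.max_load_factor());
}
```

When the final size is known in advance, reserving up front removes every intermediate rehash; the hand-written table could offer a similar constructor argument for its initial bucket count.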

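A common optimization is to exploit locality of reference by placing elements that are often accessed together in adjacent memory. This can be achieved by improving the hash function or by changing the collision-resolution strategy (open addressing, for instance, stores entries in the bucket array itself instead of in separately allocated nodes).

A hash table implementation looks simple, but reaching industrial strength means attending to many details. I hope the implementation in this article gives you a useful reference; extend and optimize it as your own project requires.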