哈希表的C实现(三)---传说中的暴雪版
關于哈希表C實現,寫了兩篇學習筆記,不過似乎網上流傳最具傳奇色彩的莫過于暴雪公司的魔獸文件打包管理器里的hashTable的實現了;在沖突方面的處理方面,采用線性探測再散列。在添加和查找過程中進行了三次哈希,第一個哈希值用來查找,后兩個哈希值用來校驗,這樣可以大大減少沖突的幾率。
在網上找了相關代碼,但不知道其來源是否地道:
StringHash.h
1 #include <StdAfx.h>2 #include <string>
3
4 using namespace std;
5
6 #pragma once
7
8 #define MAXTABLELEN 1024 // 默認哈希索引表大小
9 //
10 // 哈希索引表定義
11 typedef struct _HASHTABLE
12 {
13 long nHashA;
14 long nHashB;
15 bool bExists;
16 }HASHTABLE, *PHASHTABLE ;
17
18 class StringHash
19 {
20 public:
21 StringHash(const long nTableLength = MAXTABLELEN);
22 ~StringHash(void);
23 private:
24 unsigned long cryptTable[0x500];
25 unsigned long m_tablelength; // 哈希索引表長度
26 HASHTABLE *m_HashIndexTable;
27 private:
28 void InitCryptTable(); // 對哈希索引表預處理
29 unsigned long HashString(const string &lpszString, unsigned long dwHashType); // 求取哈希值
30 public:
31 bool Hash(string url);
32 unsigned long Hashed(string url); // 檢測url是否被hash過
33 };
StringHash.cpp
#include "StdAfx.h"#include "StringHash.h"
StringHash::StringHash(const long nTableLength /*= MAXTABLELEN*/)
{
InitCryptTable();
m_tablelength = nTableLength;
//初始化hash表
m_HashIndexTable = new HASHTABLE[nTableLength];
for ( int i = 0; i < nTableLength; i++ )
{
m_HashIndexTable[i].nHashA = -1;
m_HashIndexTable[i].nHashB = -1;
m_HashIndexTable[i].bExists = false;
}
}
StringHash::~StringHash(void)
{
//清理內存
if ( NULL != m_HashIndexTable )
{
delete []m_HashIndexTable;
m_HashIndexTable = NULL;
m_tablelength = 0;
}
}
/************************************************************************/
/*函數名:InitCryptTable
/*功 能:對哈希索引表預處理
/*返回值:無
/************************************************************************/
void StringHash::InitCryptTable()
{
unsigned long seed = 0x00100001, index1 = 0, index2 = 0, i;
for( index1 = 0; index1 < 0x100; index1++ )
{
for( index2 = index1, i = 0; i < 5; i++, index2 += 0x100 )
{
unsigned long temp1, temp2;
seed = (seed * 125 + 3) % 0x2AAAAB;
temp1 = (seed & 0xFFFF) << 0x10;
seed = (seed * 125 + 3) % 0x2AAAAB;
temp2 = (seed & 0xFFFF);
cryptTable[index2] = ( temp1 | temp2 );
}
}
}
/************************************************************************/
/*函數名:HashString
/*功 能:求取哈希值
/*返回值:返回hash值
/************************************************************************/
unsigned long StringHash::HashString(const string& lpszString, unsigned long dwHashType)
{
unsigned char *key = (unsigned char *)(const_cast<char*>(lpszString.c_str()));
unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;
int ch;
while(*key != 0)
{
ch = toupper(*key++);
seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2);
seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
}
return seed1;
}
/************************************************************************/
/*函數名:Hashed
/*功 能:檢測一個字符串是否被hash過
/*返回值:如果存在,返回位置;否則,返回-1
/************************************************************************/
unsigned long StringHash::Hashed(string lpszString)
{
const unsigned long HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
//不同的字符串三次hash還會碰撞的幾率無限接近于不可能
unsigned long nHash = HashString(lpszString, HASH_OFFSET);
unsigned long nHashA = HashString(lpszString, HASH_A);
unsigned long nHashB = HashString(lpszString, HASH_B);
unsigned long nHashStart = nHash % m_tablelength,
nHashPos = nHashStart;
while ( m_HashIndexTable[nHashPos].bExists)
{
if (m_HashIndexTable[nHashPos].nHashA == nHashA && m_HashIndexTable[nHashPos].nHashB == nHashB)
return nHashPos;
else
nHashPos = (nHashPos + 1) % m_tablelength;
if (nHashPos == nHashStart)
break;
}
return -1; //沒有找到
}
/************************************************************************/
/*函數名:Hash
/*功 能:hash一個字符串
/*返回值:成功,返回true;失敗,返回false
/************************************************************************/
bool StringHash::Hash(string lpszString)
{
const unsigned long HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
unsigned long nHash = HashString(lpszString, HASH_OFFSET);
unsigned long nHashA = HashString(lpszString, HASH_A);
unsigned long nHashB = HashString(lpszString, HASH_B);
unsigned long nHashStart = nHash % m_tablelength,
nHashPos = nHashStart;
while ( m_HashIndexTable[nHashPos].bExists)
{
nHashPos = (nHashPos + 1) % m_tablelength;
if (nHashPos == nHashStart) //一個輪回
{
//hash表中沒有空余的位置了,無法完成hash
return false;
}
}
m_HashIndexTable[nHashPos].bExists = true;
m_HashIndexTable[nHashPos].nHashA = nHashA;
m_HashIndexTable[nHashPos].nHashB = nHashB;
return true;
}
關于其中的實現原理,我覺得沒有比?inside MPQ說得清楚的了,于是用我蹩腳的E文,將該文的第二節翻譯了一遍(將原文和譯文都貼出來,請高手指正):
原理
Most?of?the?advancements?throughout?the?history?of?computers?have?been?because?of?particular?problems?which?required?solving.?In?this?chapter,?we'll?take?a?look?at?some?of?these?problems?and?their?solutions?as?they?pertain?to?the?MPQ?format.
貫穿計算機發展歷史,大多數進步都是源于某些問題的解決,在這一節中,我們來看一看與MPQ?格式相關問題及解決方案;
?
Hashes
哈希表
Problem:?You?have?a?very?large?array?of?strings.?You?have?another?string?and?need?to?know?if?it?is?already?in?the?list.?You?would?probably?begin?by?comparing?each?string?in?the?list?with?the?string?other,?but?when?put?into?application,?you?would?find?that?this?method?is?far?too?slow?for?practical?use.?Something?else?must?be?done.?But?how?can?you?know?if?the?string?exists?without?comparing?it?to?all?the?other?strings?
問題:你有一個很大的字符串數組,同時,你另外還有一個字符串,需要知道這個字符串是否已經存在于字符串數組中。你可能會對數組中的每一個字符串進行比較,但是在實際項目中,你會發現這種做法對某些特殊應用來說太慢了。必須尋求其他途徑。那么如何才能在不作遍歷比較的情況下知道這個字符串是否存在于數組中呢?
?
Solution:?Hashes.?Hashes?are?smaller?data?types?(i.e.?numbers)?that?represent?other,?larger,?data?types?(usually?strings).?In?this?scenario,?you?could?store?hashes?in?the?array?with?the?strings.?Then?you?could?compute?the?hash?of?the?other?string?and?compare?it?to?the?stored?hashes.?If?a?hash?in?the?array?matches?the?new?hash,?the?strings?can?be?compared?to?verify?the?match.?This?method,?called?indexing,?could?speed?things?up?by?about?100?times,?depending?on?the?size?of?the?array?and?the?average?length?of?the?strings.
解決方案:哈希表。哈希表是通過更小的數據類型表示其他更大的數據類型。在這種情況下,你可以把哈希表存儲在字符串數組中,然后你可以計算字符串的哈希值,然后與已經存儲的字符串的哈希值進行比較。如果有匹配的哈希值,就可以通過字符串比較進行匹配驗證。這種方法叫索引,根據數組的大小以及字符串的平均長度可以約100倍。?
unsigned long HashString(char *lpszString){
unsigned long ulHash = 0xf1e2d3c4;
while (*lpszString != 0)
{
ulHash <<= 1;
ulHash += *lpszString++;
}
return ulHash;
}
?
The?previous?code?function?demonstrates?a?very?simple?hashing?algorithm.?The?function?sums?the?characters?in?the?string,?shifting?the?hash?value?left?one?bit?before?each?character?is?added?in.?Using?this?algorithm,?the?string?"arr\units.dat"?would?hash?to?0x5A858026,?and?"unit\neutral\acritter.grp"?would?hash?to?0x694CD020.?Now,?this?is,?admittedly,?a?very?simple?algorithm,?and?it?isn't?very?useful,?because?it?would?generate?a?relatively?predictable?output,?and?a?lot?of?collisions?in?the?lower?range?of?numbers.?Collisions?are?what?happen?when?more?than?one?string?hash?to?the?same?value.
上面代碼中的函數演示了一種非常簡單的散列算法。這個函數在遍歷字符串過程中,將哈希值左移一位,然后加上字符值;通過這個算法,字符串"arr\units.dat"?的哈希值是0x5A858026,字符串"unit\neutral\acritter.grp"?的哈希值是0x694CD020;現在,眾所周知的,這是一個基本沒有什么實用價值的簡單算法,因為它會在較低的數據范圍內產生相對可預測的輸出,從而可能會產生大量沖突(不同的字符串產生相同的哈希值)。
?
The?MPQ?format,?on?the?other?hand,?uses?a?very?complicated?hash?algorithm?(shown?below)?to?generate?totally?unpredictable?hash?values.?In?fact,?the?hashing?algorithm?is?so?effective?that?it?is?called?a?one-way?hash.?A?one-way?hash?is?a?an?algorithm?that?is?constructed?in?such?a?way?that?deriving?the?original?string?(set?of?strings,?actually)?is?virtually?impossible.?Using?this?particular?algorithm,?the?filename?"arr\units.dat"?would?hash?to?0xF4E6C69D,?and?"unit\neutral\acritter.grp"?would?hash?to?0xA26067F3.
MPQ格式,使用了一種非常復雜的散列算法(如下所示),產生完全不可預測的哈希值,這個算法十分有效,這就是所謂的單向散列算法。通過單向散列算法幾乎不可能通過哈希值來唯一的確定輸入值。使用這種算法,文件名?"arr\units.dat"?的哈希值是0xF4E6C69D,"unit\neutral\acritter.grp"?的哈希值是?0xA26067F3。?
unsigned long HashString(char *lpszFileName, unsigned long dwHashType){
unsigned char *key = (unsigned char *)lpszFileName;
unsigned long seed1 = 0x7FED7FED, seed2 = 0xEEEEEEEE;
int ch;
while(*key != 0)
{
ch = toupper(*key++);
seed1 = cryptTable[(dwHashType << 8) + ch] ^ (seed1 + seed2);
seed2 = ch + seed1 + seed2 + (seed2 << 5) + 3;
}
return seed1;
}
?
Hash?Tables
哈希表
Problem:?You?tried?using?an?index?like?in?the?previous?sample,?but?your?program?absolutely?demands?break-neck?speeds,?and?indexing?just?isn't?fast?enough.?About?the?only?thing?you?could?do?to?make?it?faster?is?to?not?check?all?of?the?hashes?in?the?array.?Or,?even?better,?if?you?could?only?make?one?comparison?in?order?to?be?sure?the?string?doesn't?exist?anywhere?in?the?array.?Sound?too?good?to?be?true??It's?not.
?
問題:您嘗試在前面的示例中使用相同索引,您的程序一定會有中斷現象發生,而且不夠快?。如果想讓它更快,您能做的只有讓程序不去查詢數組中的所有散列值。或者?您可以只做一次對比就可以得出在列表中是否存在字符串。?聽起來不錯,真的么???不可能的啦
?
Solution:?A?hash?table.?A?hash?table?is?a?special?type?of?array?in?which?the?offset?of?the?desired?string?is?the?hash?of?that?string.?What?I?mean?is?this.?Say?that?you?make?that?string?array?use?a?separate?array?of?fixed?size?(let's?say?1024?entries,?to?make?it?an?even?power?of?2)?for?the?hash?table.?You?want?to?see?if?the?new?string?is?in?that?table.?To?get?the?string's?place?in?the?hash?table,?you?compute?the?hash?of?that?string,?then?modulo?(division?remainder)?that?hash?value?by?the?size?of?that?table.?Thus,?if?you?used?the?simple?hash?algorithm?in?the?previous?section,?"arr\units.dat"?would?hash?to?0x5A858026,?making?its?offset?0x26?(0x5A858026?divided?by?0x400?is?0x16A160,?with?a?remainder?of?0x26).?The?string?at?this?location?(if?there?was?one)?would?then?be?compared?to?the?string?to?add.?If?the?string?at?0x26?doesn't?match?or?just?plain?doesn't?exist,?then?the?string?to?add?doesn't?exist?in?the?array.?The?following?code?illustrates?this:
?
解決:一個哈希表就是以字符串的哈希值作為下標的一類數組。我的意思是,哈希表使用一個固定長度的字符串數組(比如1024,2的偶次冪)進行存儲;當你要看看這個字符串是否存在于哈希表中,為了獲取這個字符串在哈希表中的位置,你首先計算字符串的哈希值,然后哈希表的長度取模。這樣如果你像上一節那樣使用簡單的哈希算法,字符串"arr\units.dat"?的哈希值是0x5A858026,偏移量0x26(0x5A858026?除于0x400等于0x16A160,模0x400等于0x26)。因此,這個位置的字符串將與新加入的字符串進行比較。如果0X26處的字符串不匹配或不存在,那么表示新增的字符串在數組中不存在。下面是示意的代碼:
int GetHashTablePos(char *lpszString, SOMESTRUCTURE *lpTable, int nTableSize){
int nHash = HashString(lpszString), nHashPos = nHash % nTableSize;
if (lpTable[nHashPos].bExists && !strcmp(lpTable[nHashPos].pString, lpszString))
return nHashPos;
else
return -1; //Error value
}
Now,?there?is?one?glaring?flaw?in?that?explanation.?What?do?you?think?happens?when?a?collision?occurs?(two?different?strings?hash?to?the?same?value)??Obviously,?they?can't?occupy?the?same?entry?in?the?hash?table.?Normally,?this?is?solved?by?each?entry?in?the?hash?table?being?a?pointer?to?a?linked?list,?and?the?linked?list?would?hold?all?the?entries?that?hash?to?that?same?value.
上面的說明中存在一個刺眼的缺陷。當有沖突(兩個不同的字符串有相同的哈希值)發生的時候怎么辦?顯而易見的,它們不能占據哈希表中的同一個位置。通常的解決辦法是為每一個哈希值指向一個鏈表,用于存放所有哈希沖突的值;
MPQs?use?a?hash?table?of?filenames?to?keep?track?of?the?files?inside,?but?the?format?of?this?table?is?somewhat?different?from?the?way?hash?tables?are?normally?done.?First?of?all,?instead?of?using?a?hash?as?an?offset,?and?storing?the?actually?filename?for?verification,?MPQs?do?not?store?the?filename?at?all,?but?rather?use?three?different?hashes:?one?for?the?hash?table?offset,?two?for?verification.?These?two?verification?hashes?are?used?in?place?of?the?actual?filename.?Of?course,?this?leaves?the?possibility?that?two?different?filenames?would?hash?to?the?same?three?hashes,?but?the?chances?of?this?happening?are,?on?average,?1:18889465931478580854784,?which?should?be?safe?enough?for?just?about?anyone.
MPQs使用一個存放文件名的哈希表來跟蹤文件內部,但是表的格式與通常方法有點不同,首先不像通常的做法使用哈希值作為偏移量,存儲實際的文件名。MPQs?根本不存儲文件名,而是使用了三個不同的哈希值:一個用做哈希表偏移量,兩個用作核對。這兩個核對的哈希值用于替代文件名。當然從理論上說存在兩個不同的文件名得到相同的三個哈希值,但是這種情況發送的幾率是:1:18889465931478580854784,這應該足夠安全了。
?
The?other?way?that?an?MPQ's?hash?table?differs?from?the?conventional?implementation?is?that?instead?of?using?a?linked?list?for?each?entry,?when?a?collision?occurs,?the?entry?will?be?shifted?to?the?next?slot,?and?the?process?repeated?until?a?free?space?is?found.?Take?a?look?at?the?following?illustrational?code,?which?is?basically?the?way?a?file?is?located?for?reading?in?an?MPQ:
MPQ's的哈希表的實現與傳統實現的另一個不同的地方是,相對與傳統做法(為每個節點使用一個鏈表,當沖突發生的時候,遍歷鏈表進行比較),看一下下面的示范代碼,在MPQ中定位一個文件進行讀操作:
?
int GetHashTablePos(char *lpszString, MPQHASHTABLE *lpTable, int nTableSize){
const int HASH_OFFSET = 0, HASH_A = 1, HASH_B = 2;
int nHash = HashString(lpszString, HASH_OFFSET),nHashA = HashString(lpszString, HASH_A),nHashB = HashString(lpszString, HASH_B), nHashStart = nHash % nTableSize,nHashPos = nHashStart;
while (lpTable[nHashPos].bExists)
{
if (lpTable[nHashPos].nHashA == nHashA && lpTable[nHashPos].nHashB == nHashB)
return nHashPos;
else
nHashPos = (nHashPos + 1) % nTableSize;
if (nHashPos == nHashStart)
break;
}
return -1; //Error value
}
However?convoluted?that?code?may?look,??the?theory?behind?it?isn't?difficult.?It?basically?follows?this?process?when?looking?to?read?a?file:
Compute?the?three?hashes?(offset?hash?and?two?check?hashes)?and?store?them?in?variables.
Move?to?the?entry?of?the?offset?hash
Is?the?entry?unused??If?so,?stop?the?search?and?return?'file?not?found'.
Do?the?two?check?hashes?match?the?check?hashes?of?the?file?we're?looking?for??If?so,?stop?the?search?and?return?the?current?entry.
Move?to?the?next?entry?in?the?list,?wrapping?around?to?the?beginning?if?we?were?on?the?last?entry.
Is?the?entry?we?just?moved?to?the?same?as?the?offset?hash?(did?we?look?through?the?whole?hash?table?)??If?so,?stop?the?search?and?return?'file?not?found'.
Go?back?to?step?3.
?
無論代碼看上去有多么復雜,其背后的理論并不難。讀一個文件的時候基本遵循下面這樣一個過程:
1、計算三個哈希值(一個哈希偏移量和兩個驗證值)并保存到變量中;
2、移動到哈希偏移量對應的值;
3、對應的位置是否尚未使用?如果是,則停止搜尋,并返回“文件不存在”;
4、這兩個驗證值是否與我們要找的字符串驗證值匹配,如果是,停止搜尋,并返回當前的節點;
5、移動到下一個節點,如果到了最后一個節點則返回開始;
6、Is?the?entry?we?just?moved?to?the?same?as?the?offset?hash?(did?we?look?through?the?whole?hash?table?)??If?so,?stop?the?search?and?return?'file?not?found'.
7、回到第3步;
?
If?you?were?paying?attention,?you?might?have?noticed?from?my?explanation?and?sample?code?is?that?the?MPQ's?hash?table?has?to?hold?all?the?file?entries?in?the?MPQ.?But?what?do?you?think?happens?when?every?hash-table?entry?gets?filled??The?answer?might?surprise?you?with?its?obviousness:?you?can't?add?any?more?files.?Several?people?have?asked?me?why?there?is?a?limit?(called?the?file?limit)?on?the?number?of?files?that?can?be?put?in?an?MPQ,?and?if?there?is?any?way?around?this?limit.?Well,?you?already?have?the?answer?to?the?first?question.?As?for?the?second;?no,?you?cannot?get?around?the?file?limit.?For?that?matter,?hash?tables?cannot?even?be?resized?without?remaking?the?entire?MPQ?from?scratch.?This?is?because?the?location?of?each?entry?in?the?hash?table?may?well?change?due?to?the?resizing,?and?we?would?not?be?able?to?derive?the?new?position?because?the?position?is?the?hash?of?the?file?name,?and?we?may?not?know?the?file?name.
?
如果您注意的話,您可能已經從我們的解釋和示例代碼注意到,MPQ的哈希表已經將所有的文件入口放入MPQ中;那么當哈希表的每個項都被填充的時候,會發生什么呢?答案可能會讓你驚訝:你不能添加任何文件。有些人可能會問我為什么文件數量上有這樣的限制(文件限制),是否有辦法繞過這個限制?就此而言,如果不重新創建MPQ?的項,甚至無法調整哈希表的大小。這是因為每個項在哈希表中的位置會因為跳閘尺寸而改變,而我們無法得到新的位置,因為這些位置值是文件名的哈希值,而我們根本不知道文件名是什么。
?
一連總結了3篇關于哈希表的C實現,都是來源于網絡,整理學習,以備忘之;不能說都搞得很清楚,大致知道了哈希表是怎么實現的;當然還有很多開源項目都有自己的實現,如LUA、Redis、Apache等,精力有限,先挖個坑,日后有時間再補充吧。不管怎么說,有點孔乙己的嫌疑,呵呵!
總結
以上是生活随笔為你收集整理的哈希表的C实现(三)---传说中的暴雪版的全部內容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 如何利用阿里云安全产品加强你的网站防护能
- 下一篇: 48. Rotate Image