海量数据找中位数
有幾百億的整數,分布的存儲到幾百臺通過網絡連接的計算機上,你能否開發出一個算法和系統,找出這幾百億數據的中值?就是在一組排序好的數據中居于中間的數。顯然,一臺機器是裝不下所有的數據。也盡量少用網絡帶寬。
 
m1:?
將0-2^32劃分為若干個桶,每個電腦放一個桶,每個桶內有個計數器,記錄有多少數放在桶內
從第一個桶的計數器開始加,一直加到大于n/2
所以中位數肯定在這個桶內
 
m2:
http://matpalm.com/median/distributing.html
 
http://matpalm.com/median/erlang_multi.html
 
 
利用快排的思想,但是只計數不交換位置,節省移動操作。
in the multi process implementation splits the list of numbers to consider into a sub lists
 each sub list can be handled by a seperate erlang process, potentially across different machines
 (these lists don't have to be the same size)
 
 recall there a number of operations required in?distributed?case
 includes things like determining total number of elements, determining minimum value, etc
 each of these operations can be done against each sub list individually with the results aggregated
eg to determine total number of elements
single -> length( [1,2,3,4,5,6,7,8,9] ) = 9 multi -> sum ( length([1,2,3,4]), length([5,6]), length([7,8,9]) ) = sum([4,2,3]) = 9
 eg to determine minimum value
 
other changes required
rotation
 in the single list case we pick the pivot as the first value.
 in the multi list case we pick the pivot as the first value of the first list.
 recall in the algorithm that?rotation?is sometimes required.
 this is to ensure all potential values for a pivot are explored.
 so in the multi list case the rotation needs to operate at two levels; rotate the first list and then rotate the list of lists
 
 
 
 
給你1T(10^12)的int64整數,和僅有的1GB(10^9)的內存,如何設計算法和程序來找到它們的中值
 
 
大數據處理的題,都是采用分塊的。int64bit=8bytes.所以1G內存只能表示1GB/8B=2^27個數。1.先將1G內存分成M=(2^27-1)個塊,每塊數范圍為range=(2^64)/M; 2.創建一大小為M的數組A。A[0]表示數范圍是0-(range-1)。3.從頭掃描一次,數組A元素++;3.這樣就可以縮小范圍。之后再繼續縮小范圍
 
總結
 
                            
                        - 上一篇: 基础题
- 下一篇: 给定key值,在Binary Searc
