String Edit Distance (Repost)
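Levenshtein distance (LD), also called edit distance, measures how similar two strings are. It is defined as the minimum number of basic operations needed to make the two strings equal, where the basic operations are insertion, deletion, and substitution. For example, the distance between kitten and sitting can be worked out as follows:
    1. kitten -> sitten, substitute s for k;
    2. sitten -> sittin, substitute i for e;
    3. sittin -> sitting, append g.
So the LD is 3.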
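Define the state d[i][j] = d(A[1..i], B[1..j]), the distance between the first i characters of A and the first j characters of B, so that the answer is d[m][n] = d(A[1..m], B[1..n]). It is easy to see that:
d[0][0] = 0;
d[i][0] = i;
d[0][j] = j;
d[i][j] = min( d[i-1][j-1] + (If A[i]=B[j] Then 0 Else 1 End If), //substitute a character (free on a match)
               d[i-1][j] + 1, //delete A[i]
               d[i][j-1] + 1 ) //insert B[j]
The (m+1) x (n+1) matrix can therefore be filled in row by row, and d[m][n] is the answer.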
The algorithm for computing LD, in C++:
#include <algorithm>
#include <cstdio>
#include <string>
using namespace std;

int d[1010][1010]; // fixed-size table: both strings must be shorter than 1010 characters
int dist(string a, string b){
    int m = a.size(), n = b.size(), i, j;
    for(i = 0; i <= m; ++i) d[i][0] = i;
    for(j = 0; j <= n; ++j) d[0][j] = j;
    for (i = 1; i <= m; ++i){
        for(j = 1; j <= n; ++j){
            // a and b are 0-indexed, hence a[i-1] and b[j-1]
            d[i][j] = d[i-1][j-1] + (a[i-1]==b[j-1]?0:1); // substitute a character (free on a match)
            d[i][j] = min(d[i][j], d[i-1][j] + 1); // delete a[i-1]
            d[i][j] = min(d[i][j], d[i][j-1] + 1); // insert b[j-1]
        }
    }
    for (i = 0; i <= m; ++i){ // print the matrix
        for(j = 0; j <= n; ++j)
            printf("%5d ", d[i][j]);
        printf("\n");
    }
    return d[m][n];
}
The algorithm really amounts to filling in a matrix. For example, calling dist("abcdef", "acddaf") prints:
    0     1     2     3     4     5     6
    1     0     1     2     3     4     5
    2     1     1     2     3     4     5
    3     2     1     2     3     4     5
    4     3     2     1     2     3     4
    5     4     3     2     2     3     4
    6     5     4     3     3     3     3
The bottom-right entry d[m][n] is the answer, which is 3 here.
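The filled matrix holds more than the final number: walking back from d[m][n] to d[0][0] and checking which of the three candidates produced each cell recovers one optimal sequence of edits. The following is a minimal sketch of that idea, not part of the original post. The function name edit_script and the wording of the operation strings are assumptions made for illustration, and when several optimal scripts exist it returns only one of them (ties are resolved in favour of the diagonal move).

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not in the original post): returns one optimal list of
// edits that turns `a` into `b`, recovered by walking the DP matrix backwards.
std::vector<std::string> edit_script(const std::string& a, const std::string& b) {
    const std::size_t m = a.size(), n = b.size();
    // d[i][j] = edit distance between the first i chars of a and the first j chars of b
    std::vector<std::vector<int>> d(m + 1, std::vector<int>(n + 1, 0));
    for (std::size_t i = 0; i <= m; ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= n; ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= m; ++i) {
        for (std::size_t j = 1; j <= n; ++j) {
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j - 1] + cost,  // substitute or match
                                d[i - 1][j] + 1,         // delete a[i-1]
                                d[i][j - 1] + 1});       // insert b[j-1]
        }
    }

    // Backtrace from the bottom-right corner to the top-left corner.
    std::vector<std::string> ops;
    std::size_t i = m, j = n;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            d[i][j] == d[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1)) {
            if (a[i - 1] != b[j - 1])  // a match costs nothing and is not reported
                ops.push_back(std::string("replace ") + a[i - 1] + " with " + b[j - 1]);
            --i; --j;
        } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
            ops.push_back(std::string("delete ") + a[i - 1]);
            --i;
        } else {  // only the insertion candidate is left, so j > 0 here
            ops.push_back(std::string("insert ") + b[j - 1]);
            --j;
        }
    }
    std::reverse(ops.begin(), ops.end());  // collected back to front
    return ops;
}
```

For edit_script("kitten", "sitting") this sketch yields three operations, in line with the three steps listed at the start of the article. The length of the returned list always equals the edit distance.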
Here is a version that reduces the space complexity to O(n) by using a rolling array (only two rows are kept):
#include <algorithm>
#include <cstdio>
#include <cstring>
using namespace std;

int diff(char *a, char *b){
    int *d[2], i, j;
    int m = strlen(a), n = strlen(b);
    d[0] = new int[n + 1];
    d[1] = new int[n + 1];
    int turn = 0, pre, t; // turn: row being filled, pre: previous row
    for (i = 0; i <= n; ++i) d[turn][i] = i;
    for (i = 1; i <= m; ++i){
        pre = turn;
        turn = (turn + 1) % 2; // swap the roles of the two rows
        d[turn][0] = i;
        for(int p=0;p<=n;p++)printf("%d ",d[pre][p]);printf("\n"); // print the previous row
        for(j = 1; j <= n; ++j){
            t = d[pre][j-1] + (a[i-1] == b[j-1] ? 0 : 1); // substitute a character (free on a match)
            t = min(t, d[pre][j] + 1); // delete a[i-1]
            d[turn][j] = min(t, d[turn][j-1] + 1); // insert b[j-1]
        }
    }
    for(int p=0;p<=n;p++)printf("%d ",d[turn][p]);printf("\n"); // print the last row
    t = d[turn][n];
    delete[] d[0];
    delete[] d[1];
    return t;
}
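The rolling array can be squeezed further: each cell depends only on the cell above it, the cell to its left, and the cell diagonally up-left, so a single row plus one saved value is enough. The following is a sketch of that variant under stated assumptions and is not from the original post. The name diff_one_row is hypothetical, and std::vector replaces the manual new[]/delete[] so there is nothing to free.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant (not in the original post): edit distance using one
// row of n + 1 cells instead of two.
int diff_one_row(const std::string& a, const std::string& b) {
    const std::size_t m = a.size(), n = b.size();
    std::vector<int> row(n + 1);
    for (std::size_t j = 0; j <= n; ++j) row[j] = static_cast<int>(j);  // d[0][j] = j
    for (std::size_t i = 1; i <= m; ++i) {
        int diag = row[0];             // holds d[i-1][j-1]; starts as d[i-1][0]
        row[0] = static_cast<int>(i);  // d[i][0] = i
        for (std::size_t j = 1; j <= n; ++j) {
            const int up = row[j];     // d[i-1][j], saved before it is overwritten
            const int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = std::min({diag + cost,      // substitute or match
                               up + 1,           // delete a[i-1]
                               row[j - 1] + 1}); // insert b[j-1]
            diag = up;                 // this is d[i-1][j-1] for the next column
        }
    }
    return row[n];
}
```

Calling diff_one_row("abcdef", "acddaf") returns 3, the same value as the bottom-right entry of the matrix printed earlier. Passing the shorter string as b keeps the row as small as possible, since the distance is symmetric in its two arguments.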
Reposted from: https://www.cnblogs.com/E-star/archive/2012/08/01/2618396.html