A Case Study in Applying the DFA Algorithm to Sensitive-Word Filtering
阿新 • Published: 2020-12-25
0. What is a DFA?
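Among the algorithms used for text filtering, the DFA is one of the better choices. DFA stands for Deterministic Finite Automaton: given an event and the current state, it yields the next state, i.e. event + state = nextstate. The figure below shows these state transitions.
[Figure: DFA state-transition diagram]
In the figure, the uppercase letters (S, U, V, Q) are states and the lowercase letters a and b are actions.
A sensitive-word filter has to keep computation to a minimum, and a DFA does almost no computation: it only moves from one state to the next.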
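The rule event + state = nextstate can be sketched as a lookup table. The class below is a minimal illustration, not code from the article: `SimpleDfa` and its methods are hypothetical names, and the transitions in `example()` are assumed for demonstration because the original figure is not reproduced here.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal DFA sketch: nextState = table[state][event].
// All names and transitions are hypothetical, chosen only for illustration.
class SimpleDfa {
    // Key: current state; value: map from event to next state.
    private final Map<Character, Map<Character, Character>> table = new HashMap<>();

    void addTransition(char state, char event, char next) {
        table.computeIfAbsent(state, k -> new HashMap<>()).put(event, next);
    }

    // Returns the next state, or null when no transition is defined.
    Character next(char state, char event) {
        Map<Character, Character> row = table.get(state);
        return row == null ? null : row.get(event);
    }

    // Feeds the events one by one from the start state; returns the final
    // state, or null if the machine reaches an undefined transition.
    Character run(char start, String events) {
        Character state = start;
        for (int i = 0; i < events.length() && state != null; i++) {
            state = next(state, events.charAt(i));
        }
        return state;
    }

    // Assumed transition table over states S, U, V, Q and events a, b.
    static SimpleDfa example() {
        SimpleDfa dfa = new SimpleDfa();
        dfa.addTransition('S', 'a', 'U');
        dfa.addTransition('S', 'b', 'V');
        dfa.addTransition('U', 'a', 'Q');
        dfa.addTransition('U', 'b', 'V');
        dfa.addTransition('V', 'a', 'U');
        dfa.addTransition('V', 'b', 'Q');
        dfa.addTransition('Q', 'a', 'Q');
        dfa.addTransition('Q', 'b', 'Q');
        return dfa;
    }

    public static void main(String[] args) {
        // S -a-> U -b-> V, so this prints V.
        System.out.println(example().run('S', "ab"));
    }
}
```

Each step is a single map lookup, which is what "almost no computation" amounts to in practice.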
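1. Why use a DFA?
A DFA (Deterministic Finite Automaton) is commonly used to check quickly whether any of a set of keywords occurs in a large body of text. It is widely used in this scenario because the cost of a scan is driven mainly by the length of the text rather than by the number of keywords.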
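To make this concrete, here is a minimal sketch of a word tree, the kind of structure a DFA-style keyword filter walks character by character. `KeywordTrie`, `addWord` and `findAll` are hypothetical names for illustration; this is not Hutool's implementation, and unlike a configurable matcher it always reports every occurrence, including overlapping ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical minimal word tree; each node is a state, each character an event.
class KeywordTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean endOfWord;
    }

    private final Node root = new Node();

    // Adds one keyword by creating (or reusing) a path of nodes from the root.
    void addWord(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.children.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.endOfWord = true;
    }

    // Tries every text position as a start and follows the tree until it
    // runs out of transitions, recording each keyword completed on the way.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int end = start; end < text.length(); end++) {
                node = node.children.get(text.charAt(end));
                if (node == null) {
                    break; // no keyword continues with this character
                }
                if (node.endOfWord) {
                    hits.add(text.substring(start, end + 1));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        KeywordTrie trie = new KeywordTrie();
        for (String w : new String[] {"大", "大土豆", "土豆", "剛出鍋", "出鍋"}) {
            trie.addWord(w);
        }
        // prints [大, 大土豆, 土豆, 剛出鍋, 出鍋]
        System.out.println(trie.findAll("我有一顆大土豆,剛出鍋的"));
    }
}
```

The inner loop stops after at most as many characters as the longest keyword, so adding more keywords makes the tree larger but does not add passes over the text.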
2. The DFA utility class
cn.hutool.dfa.SensitiveUtil
/**
 * Initializes the sensitive-word tree.
 *
 * @param sensitiveWords list of sensitive words
 */
public static void init(Collection<String> sensitiveWords) {
    sensitiveTree.clear();
    sensitiveTree.addWords(sensitiveWords);
    // log.debug("Sensitive init finished, sensitives: {}", sensitiveWords);
}

/**
 * Finds sensitive words and returns every one that was found.<br>
 * Dense matching: with the keywords ab and b and the text abab, the matches are [ab, b, ab].<br>
 * Greedy (longest) matching: with the keywords a and ab, the matches are [a, ab].
 *
 * @param text           the text to search
 * @param isDensityMatch whether to use dense matching
 * @param isGreedMatch   whether to use greedy (longest) matching
 * @return the sensitive words found
 */
public static List<String> getFindedAllSensitive(String text, boolean isDensityMatch, boolean isGreedMatch) {
    return sensitiveTree.matchAll(text, -1, isDensityMatch, isGreedMatch);
}
3. Performance comparison
3.1 Plain keyword filtering
The JMH code is as follows:
package com.autocoding.hutool;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.junit.Test;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(value = { Mode.All })
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class 普通關鍵字過濾Test {

    private static List<String> keywords = new ArrayList<String>();

    static {
        keywords.add("大");
        keywords.add("大土豆");
        keywords.add("土豆");
        keywords.add("剛出鍋");
        keywords.add("出鍋");
    }

    public static void main(String[] args) throws Exception {
        String name = 普通關鍵字過濾Test.class.getName();
        Options options = new OptionsBuilder().include(name).forks(1).measurementIterations(10).warmupIterations(3)
                .build();
        new Runner(options).run();
    }

    @Benchmark
    @Test
    public void 普通測試() {
        String text = "我有一顆大土豆,剛出鍋的";
        List<String> hitList = new ArrayList<>();
        // Check each keyword against the text with String.contains.
        for (String keyword : keywords) {
            if (text.contains(keyword)) {
                hitList.add(keyword);
            }
        }
        // Printing keeps the result in use, but its I/O cost is part of every measured operation.
        System.err.println(hitList);
    }
}
The JMH results are as follows:
Result "com.autocoding.hutool.普通關鍵字過濾Test.普通測試":
N = 10
mean = 0.562 ±(99.9%) 0.309 ms/op
Histogram, ms/op:
[0.300, 0.350) = 0
[0.350, 0.400) = 3
[0.400, 0.450) = 1
[0.450, 0.500) = 1
[0.500, 0.550) = 0
[0.550, 0.600) = 2
[0.600, 0.650) = 0
[0.650, 0.700) = 0
[0.700, 0.750) = 2
[0.750, 0.800) = 0
[0.800, 0.850) = 0
[0.850, 0.900) = 0
[0.900, 0.950) = 0
Percentiles, ms/op:
p(0.0000) = 0.371 ms/op
p(50.0000) = 0.518 ms/op
p(90.0000) = 0.958 ms/op
p(95.0000) = 0.981 ms/op
p(99.0000) = 0.981 ms/op
p(99.9000) = 0.981 ms/op
p(99.9900) = 0.981 ms/op
p(99.9990) = 0.981 ms/op
p(99.9999) = 0.981 ms/op
p(100.0000) = 0.981 ms/op
# Run complete. Total time: 00:00:44
Benchmark Mode Cnt Score Error Units
普通關鍵字過濾Test.普通測試 thrpt 10 5.787 ± 0.936 ops/ms
普通關鍵字過濾Test.普通測試 avgt 10 0.178 ± 0.025 ms/op
普通關鍵字過濾Test.普通測試 sample 54550 0.183 ± 0.003 ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.00 sample 0.057 ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.50 sample 0.173 ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.90 sample 0.216 ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.95 sample 0.253 ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.99 sample 0.426 ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.999 sample 2.303 ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.9999 sample 7.570 ms/op
普通關鍵字過濾Test.普通測試:普通測試·p1.00 sample 20.185 ms/op
普通關鍵字過濾Test.普通測試 ss 10 0.562 ± 0.309 ms/op
3.2 DFA keyword filtering
The JMH code is as follows:
package com.autocoding.hutool;

import java.util.List;
import java.util.concurrent.TimeUnit;

import org.junit.Test;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import cn.hutool.dfa.WordTree;

@BenchmarkMode(value = { Mode.All })
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class DFA關鍵字過濾Test {

    private static WordTree tree = new WordTree();

    static {
        tree.addWord("大");
        tree.addWord("大土豆");
        tree.addWord("土豆");
        tree.addWord("剛出鍋");
        tree.addWord("出鍋");
    }

    public static void main(String[] args) throws Exception {
        String name = DFA關鍵字過濾Test.class.getName();
        Options options = new OptionsBuilder().include(name).forks(1).measurementIterations(10).warmupIterations(3)
                .build();
        new Runner(options).run();
    }

    @Benchmark
    @Test
    public void DFA測試() {
        String text = "我有一顆大土豆,剛出鍋的";
        // Same call that SensitiveUtil makes, with dense and greedy matching both switched off.
        List<String> matchAll = tree.matchAll(text, -1, false, false);
        // Printing keeps the result in use, but its I/O cost is part of every measured operation.
        System.err.println(matchAll);
    }
}
The JMH results are as follows:
Result "com.autocoding.hutool.DFA關鍵字過濾Test.DFA測試":
N = 10
mean = 0.465 ±(99.9%) 0.255 ms/op
Histogram, ms/op:
[0.200, 0.250) = 0
[0.250, 0.300) = 1
[0.300, 0.350) = 2
[0.350, 0.400) = 1
[0.400, 0.450) = 1
[0.450, 0.500) = 2
[0.500, 0.550) = 0
[0.550, 0.600) = 0
[0.600, 0.650) = 1
[0.650, 0.700) = 0
[0.700, 0.750) = 2
[0.750, 0.800) = 0
Percentiles, ms/op:
p(0.0000) = 0.275 ms/op
p(50.0000) = 0.435 ms/op
p(90.0000) = 0.731 ms/op
p(95.0000) = 0.731 ms/op
p(99.0000) = 0.731 ms/op
p(99.9000) = 0.731 ms/op
p(99.9900) = 0.731 ms/op
p(99.9990) = 0.731 ms/op
p(99.9999) = 0.731 ms/op
p(100.0000) = 0.731 ms/op
# Run complete. Total time: 00:00:44
Benchmark Mode Cnt Score Error Units
DFA關鍵字過濾Test.DFA測試 thrpt 10 6.372 ± 0.267 ops/ms
DFA關鍵字過濾Test.DFA測試 avgt 10 0.156 ± 0.018 ms/op
DFA關鍵字過濾Test.DFA測試 sample 64804 0.154 ± 0.002 ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.00 sample 0.088 ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.50 sample 0.144 ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.90 sample 0.180 ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.95 sample 0.209 ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.99 sample 0.324 ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.999 sample 1.000 ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.9999 sample 4.551 ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p1.00 sample 30.933 ms/op
DFA關鍵字過濾Test.DFA測試 ss 10 0.465 ± 0.255 ms/op
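To put the two summary tables side by side, the relative difference can be computed from the avgt and thrpt scores reported above. The snippet below is only a sketch of that arithmetic; `BenchmarkComparison` and `percentChange` are hypothetical names. Read the result with caution: the reported error intervals of the two runs overlap, both benchmark methods include a `System.err.println` call in the measured path, and with dense and greedy matching switched off the DFA run may not return the same set of hits as the `contains` loop.

```java
import java.util.Locale;

// Hypothetical helper that compares the JMH scores quoted in this article.
class BenchmarkComparison {
    // Relative change of candidate against baseline, in percent.
    static double percentChange(double baseline, double candidate) {
        return (candidate - baseline) / baseline * 100.0;
    }

    public static void main(String[] args) {
        // Scores copied from the avgt and thrpt rows of the two summaries.
        double plainAvgt = 0.178;   // ms/op, lower is better
        double dfaAvgt = 0.156;     // ms/op
        double plainThrpt = 5.787;  // ops/ms, higher is better
        double dfaThrpt = 6.372;    // ops/ms

        // prints "avgt change: -12.4%"
        System.out.printf(Locale.ROOT, "avgt change: %.1f%%%n", percentChange(plainAvgt, dfaAvgt));
        // prints "thrpt change: 10.1%"
        System.out.printf(Locale.ROOT, "thrpt change: %.1f%%%n", percentChange(plainThrpt, dfaThrpt));
    }
}
```

On this five-keyword, twelve-character input the DFA run averages roughly 12% less time per operation; a larger keyword list and longer text would be needed to see how the gap scales.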