
A Case Study in Applying the DFA Algorithm to Sensitive-Word Filtering

Tags: Java basics, DFA, keyword filtering

Contents

0. What is a DFA?

1. Why use a DFA?

2. DFA utility class implementation

3. Performance comparison

3.1 Plain keyword filtering

3.2 DFA keyword filtering


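0. What is a DFA?

Among the algorithms used for text filtering, the DFA is one of the better-suited choices. DFA stands for Deterministic Finite Automaton: given an event and the current state, it yields the next state, i.e. event + state = nextstate. The figure below shows its state transitions.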

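In this figure, the uppercase letters (S, U, V, Q) are states, and the lowercase letters a and b are actions.
A sensitive-word filter has to keep computation to a minimum, and a DFA does almost no computation: it only moves from one state to the next.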
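
To make the rule event + state = nextstate concrete, here is a minimal sketch of a table-driven DFA. The state names S, U, V, Q and the events a, b echo the names used above, but the concrete transitions are invented for illustration and are not taken from the figure. `SimpleDfa` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal DFA sketch: the next state depends only on (current state, event).
class SimpleDfa {
    // transitions.get(state).get(event) is the next state
    private final Map<String, Map<Character, String>> transitions = new HashMap<>();

    void addTransition(String from, char event, String to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>()).put(event, to);
    }

    // Consumes the events one by one; returns the final state,
    // or null if some (state, event) pair has no transition.
    String run(String start, String events) {
        String state = start;
        for (char event : events.toCharArray()) {
            Map<Character, String> row = transitions.get(state);
            state = (row == null) ? null : row.get(event);
            if (state == null) {
                return null;
            }
        }
        return state;
    }

    // Hypothetical transition table, for illustration only.
    static SimpleDfa example() {
        SimpleDfa dfa = new SimpleDfa();
        dfa.addTransition("S", 'a', "U");
        dfa.addTransition("S", 'b', "V");
        dfa.addTransition("U", 'a', "Q");
        dfa.addTransition("U", 'b', "V");
        dfa.addTransition("V", 'a', "U");
        dfa.addTransition("V", 'b', "Q");
        dfa.addTransition("Q", 'a', "Q");
        dfa.addTransition("Q", 'b', "Q");
        return dfa;
    }

    public static void main(String[] args) {
        SimpleDfa dfa = example();
        System.out.println(dfa.run("S", "ab")); // S -a-> U -b-> V, prints V
        System.out.println(dfa.run("S", "aa")); // S -a-> U -a-> Q, prints Q
    }
}
```

Each input character costs one map lookup, which is the sense in which a DFA "only moves from state to state".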

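1. Why use a DFA?

A DFA (Deterministic Finite Automaton) is commonly used to check quickly whether any of a set of keywords occurs in a large body of text. Its efficiency makes it widely used in this scenario.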
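
As a rough illustration of why this scales to many keywords, the sketch below stores the keywords in a character trie and walks it from every position in the text, so the text is scanned once per start position regardless of how many keywords there are. This is a simplified word-tree scan, not a full Aho-Corasick automaton, and it does not claim to reproduce Hutool's internals. `KeywordTrie`, `addWord` and `findAll` are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified word-tree matcher: each trie node is a state,
// each character is an event that selects the next state.
class KeywordTrie {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal; // true if a keyword ends at this node
    }

    private final Node root = new Node();

    void addWord(String word) {
        Node node = root;
        for (int i = 0; i < word.length(); i++) {
            node = node.next.computeIfAbsent(word.charAt(i), c -> new Node());
        }
        node.terminal = true;
    }

    // Returns every keyword occurrence, including overlapping ones,
    // ordered by start position and then by length.
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        for (int start = 0; start < text.length(); start++) {
            Node node = root;
            for (int i = start; i < text.length(); i++) {
                node = node.next.get(text.charAt(i));
                if (node == null) {
                    break; // no keyword continues with this character
                }
                if (node.terminal) {
                    hits.add(text.substring(start, i + 1));
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        KeywordTrie trie = new KeywordTrie();
        for (String word : Arrays.asList("he", "she", "hers")) {
            trie.addWord(word);
        }
        System.out.println(trie.findAll("ushers")); // prints [she, he, hers]
    }
}
```

The inner loop stops as soon as no keyword can continue, so the work per start position is bounded by the length of the longest keyword rather than by the number of keywords.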

2. DFA utility class implementation

Excerpts from cn.hutool.dfa.SensitiveUtil (Hutool):

	/**
	 * Initializes the sensitive-word tree.
	 * @param sensitiveWords the list of sensitive words
	 */
	public static void init(Collection<String> sensitiveWords){
		sensitiveTree.clear();
		sensitiveTree.addWords(sensitiveWords);
//		log.debug("Sensitive init finished, sensitives: {}", sensitiveWords);
	}

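	/**
	 * Finds sensitive words and returns all that were found.<br>
	 * Dense matching: given the keywords ab and b and the text abab, the matches are [ab, b, ab].<br>
	 * Greedy (longest) matching: given the keywords a and ab, longest matching yields [a, ab].
	 *
	 * @param text the text to search
	 * @param isDensityMatch whether to use dense matching
	 * @param isGreedMatch whether to use greedy (longest) matching
	 * @return the sensitive words found
	 */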
	public static List<String> getFindedAllSensitive(String text, boolean isDensityMatch, boolean isGreedMatch){
		return sensitiveTree.matchAll(text, -1, isDensityMatch, isGreedMatch);
	}

3. Performance comparison

3.1 Plain keyword filtering

The JMH code is as follows:

package com.autocoding.hutool;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.junit.Test;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

@BenchmarkMode(value = { Mode.All })
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class 普通關鍵字過濾Test {
	 
	private static List<String> keywords = new ArrayList<String>();
	static {
		keywords.add("大");
		keywords.add("大土豆");
		keywords.add("土豆");
		keywords.add("剛出鍋");
		keywords.add("出鍋");
	}

	public static void main(String[] args) throws Exception {
		String name = 普通關鍵字過濾Test.class.getName();
		Options options = new OptionsBuilder().include(name).forks(1).measurementIterations(10).warmupIterations(3)
				.build();
		new Runner(options).run();
	}

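	// @Test is JUnit's annotation; JMH itself only looks at @Benchmark.
	@Benchmark
	@Test
	public void 普通測試() {
		String text = "我有一顆大土豆,剛出鍋的";
		// Run one separate substring search over the text per keyword.
		List<String> hitList = new ArrayList<>();
		for (String keyword : keywords) {
			if (text.contains(keyword)) {
				hitList.add(keyword);
			}
		}
		// Printing inside the benchmark means console I/O is part of the measured time.
		System.err.println(hitList);
	}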

	 
}

The JMH results are as follows:

Result "com.autocoding.hutool.普通關鍵字過濾Test.普通測試":
  N = 10
  mean =      0.562 ±(99.9%) 0.309 ms/op

  Histogram, ms/op:
    [0.300, 0.350) = 0 
    [0.350, 0.400) = 3 
    [0.400, 0.450) = 1 
    [0.450, 0.500) = 1 
    [0.500, 0.550) = 0 
    [0.550, 0.600) = 2 
    [0.600, 0.650) = 0 
    [0.650, 0.700) = 0 
    [0.700, 0.750) = 2 
    [0.750, 0.800) = 0 
    [0.800, 0.850) = 0 
    [0.850, 0.900) = 0 
    [0.900, 0.950) = 0 

  Percentiles, ms/op:
      p(0.0000) =      0.371 ms/op
     p(50.0000) =      0.518 ms/op
     p(90.0000) =      0.958 ms/op
     p(95.0000) =      0.981 ms/op
     p(99.0000) =      0.981 ms/op
     p(99.9000) =      0.981 ms/op
     p(99.9900) =      0.981 ms/op
     p(99.9990) =      0.981 ms/op
     p(99.9999) =      0.981 ms/op
    p(100.0000) =      0.981 ms/op


# Run complete. Total time: 00:00:44

Benchmark                                Mode    Cnt   Score   Error   Units
普通關鍵字過濾Test.普通測試                thrpt     10   5.787 ± 0.936  ops/ms
普通關鍵字過濾Test.普通測試                 avgt     10   0.178 ± 0.025   ms/op
普通關鍵字過濾Test.普通測試               sample  54550   0.183 ± 0.003   ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.00    sample          0.057           ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.50    sample          0.173           ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.90    sample          0.216           ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.95    sample          0.253           ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.99    sample          0.426           ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.999   sample          2.303           ms/op
普通關鍵字過濾Test.普通測試:普通測試·p0.9999  sample          7.570           ms/op
普通關鍵字過濾Test.普通測試:普通測試·p1.00    sample         20.185           ms/op
普通關鍵字過濾Test.普通測試                   ss     10   0.562 ± 0.309   ms/op

3.2 DFA keyword filtering

The JMH code is as follows:

package com.autocoding.hutool;

import java.util.List;
import java.util.concurrent.TimeUnit;

import org.junit.Test;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import cn.hutool.dfa.WordTree;

@BenchmarkMode(value = { Mode.All })
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Thread)
public class DFA關鍵字過濾Test {
	private static WordTree tree = new WordTree();
	static {
		tree.addWord("大");
		tree.addWord("大土豆");
		tree.addWord("土豆");
		tree.addWord("剛出鍋");
		tree.addWord("出鍋");
	}

	public static void main(String[] args) throws Exception {
		String name = DFA關鍵字過濾Test.class.getName();
		Options options = new OptionsBuilder().include(name).forks(1).measurementIterations(10).warmupIterations(3)
				.build();
		new Runner(options).run();
	}

 

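	// @Test is JUnit's annotation; JMH itself only looks at @Benchmark.
	@Benchmark
	@Test
	public void DFA測試() {
		String text = "我有一顆大土豆,剛出鍋的";
		// Match against the prebuilt word tree; -1 is passed as the second argument,
		// and both isDensityMatch and isGreedMatch are false.
		List<String> matchAll = tree.matchAll(text, -1, false, false);
		// Printing inside the benchmark means console I/O is part of the measured time.
		System.err.println(matchAll);
	}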

}

The JMH results are as follows:

Result "com.autocoding.hutool.DFA關鍵字過濾Test.DFA測試":
  N = 10
  mean =      0.465 ±(99.9%) 0.255 ms/op

  Histogram, ms/op:
    [0.200, 0.250) = 0 
    [0.250, 0.300) = 1 
    [0.300, 0.350) = 2 
    [0.350, 0.400) = 1 
    [0.400, 0.450) = 1 
    [0.450, 0.500) = 2 
    [0.500, 0.550) = 0 
    [0.550, 0.600) = 0 
    [0.600, 0.650) = 1 
    [0.650, 0.700) = 0 
    [0.700, 0.750) = 2 
    [0.750, 0.800) = 0 

  Percentiles, ms/op:
      p(0.0000) =      0.275 ms/op
     p(50.0000) =      0.435 ms/op
     p(90.0000) =      0.731 ms/op
     p(95.0000) =      0.731 ms/op
     p(99.0000) =      0.731 ms/op
     p(99.9000) =      0.731 ms/op
     p(99.9900) =      0.731 ms/op
     p(99.9990) =      0.731 ms/op
     p(99.9999) =      0.731 ms/op
    p(100.0000) =      0.731 ms/op


# Run complete. Total time: 00:00:44

Benchmark                                Mode    Cnt   Score   Error   Units
DFA關鍵字過濾Test.DFA測試                 thrpt     10   6.372 ± 0.267  ops/ms
DFA關鍵字過濾Test.DFA測試                  avgt     10   0.156 ± 0.018   ms/op
DFA關鍵字過濾Test.DFA測試                sample  64804   0.154 ± 0.002   ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.00    sample          0.088           ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.50    sample          0.144           ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.90    sample          0.180           ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.95    sample          0.209           ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.99    sample          0.324           ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.999   sample          1.000           ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p0.9999  sample          4.551           ms/op
DFA關鍵字過濾Test.DFA測試:DFA測試·p1.00    sample         30.933           ms/op
DFA關鍵字過濾Test.DFA測試                    ss     10   0.465 ± 0.255   ms/op
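
One way to read the two summary tables side by side is to compute the relative change between the scores and to check whether the reported score ± error ranges overlap. The sketch below does this arithmetic using only the thrpt, avgt and sample rows printed above. `ScoreInterval` is a hypothetical helper, not part of JMH, and treating overlapping ranges as "no clear difference" is a rule of thumb rather than a formal test.

```java
// Compares two JMH scores given as "score ± error".
class ScoreInterval {
    final double score;
    final double error;

    ScoreInterval(double score, double error) {
        this.score = score;
        this.error = error;
    }

    double low() {
        return score - error;
    }

    double high() {
        return score + error;
    }

    // True if the two [score - error, score + error] ranges share any point.
    boolean overlaps(ScoreInterval other) {
        return low() <= other.high() && other.low() <= high();
    }

    // Relative change from base to other, e.g. 0.10 means 10% higher.
    static double relativeChange(double base, double other) {
        return (other - base) / base;
    }

    public static void main(String[] args) {
        // Values copied from the thrpt, avgt and sample rows of the two tables.
        String[] modes = {"thrpt", "avgt", "sample"};
        ScoreInterval[] plain = {
            new ScoreInterval(5.787, 0.936),
            new ScoreInterval(0.178, 0.025),
            new ScoreInterval(0.183, 0.003)
        };
        ScoreInterval[] dfa = {
            new ScoreInterval(6.372, 0.267),
            new ScoreInterval(0.156, 0.018),
            new ScoreInterval(0.154, 0.002)
        };
        for (int i = 0; i < modes.length; i++) {
            System.out.println(modes[i]
                + ": relative change = " + relativeChange(plain[i].score, dfa[i].score)
                + ", ranges overlap = " + plain[i].overlaps(dfa[i]));
        }
    }
}
```

With these numbers, the DFA run shows roughly 10% higher throughput and roughly 12% lower average time, but the thrpt and avgt ranges of the two runs overlap; only the sample-mode ranges (0.183 ± 0.003 versus 0.154 ± 0.002 ms/op) are clearly separated. Both benchmarks also print to System.err on every invocation, so part of each score is console I/O rather than matching work.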