JAVA Hash Performance Optimization
1 Problem Description
In some Java code there is a snippet like the following: several strings are concatenated and the result is used as a map key, which is then put into the map.
public void hashCode(List<String> values) {
    long start2 = System.currentTimeMillis();
    Map<String, Object> map = new HashMap<>();
    for (int i = 0; i + 1 < values.size(); i += 2) {
        StringBuilder builder = new StringBuilder();
        builder.append(values.get(i));
        builder.append(values.get(i + 1));
        map.put(builder.toString(), new Object());
    }
    long end2 = System.currentTimeMillis();
    System.out.println("string hash cost :" + (end2 - start2));
}
A single invocation does not show the problem, but at the level of tens of millions of calls it becomes very noticeable. On my laptop (i7 HQ, 8 GB RAM) it takes 2-3 s to run ten million iterations. In theory the time goes into the string concatenation and the hashCode computation. To confirm this, let us first look for the likely hot spots from the source code.
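To get a rough feel for how the cost splits between the two steps, they can be timed separately. The sketch below is only illustrative, not a rigorous benchmark (no warm-up, and the JIT may optimize parts of it away); the helper name roughBreakdown is made up for this example.
public void roughBreakdown(List<String> values) {
    long t0 = System.currentTimeMillis();
    List<String> keys = new ArrayList<>(values.size() / 2);
    for (int i = 0; i + 1 < values.size(); i += 2) {
        keys.add(values.get(i) + values.get(i + 1)); // concatenation only
    }
    long t1 = System.currentTimeMillis();
    int sink = 0;
    for (String key : keys) {
        sink += key.hashCode(); // hashCode only; sink keeps the loop from being removed
    }
    long t2 = System.currentTimeMillis();
    System.out.println("concat: " + (t1 - t0) + " ms, hash: " + (t2 - t1) + " ms (" + sink + ")");
}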
2 Source Code Analysis
2.1 StringBuilder String Construction Source Code Analysis
First, the StringBuilder object is created. On construction, StringBuilder allocates a char array with the default capacity of 16. This is only an initial allocation and should not cost much. During append, if the required length exceeds the current capacity, a new array is allocated (in JDK 8 roughly twice the old capacity, or the required minimum if that is larger), all existing data is copied into it, and the old array is discarded. If every append just crosses the current boundary, the capacity keeps doubling and every expansion pays a full copy of the data accumulated so far.
public final class StringBuilder
        extends AbstractStringBuilder
        implements java.io.Serializable, CharSequence {
    public StringBuilder() {
        super(16);
    }
}

abstract class AbstractStringBuilder implements Appendable, CharSequence {
    AbstractStringBuilder(int capacity) {
        value = new char[capacity];
    }

    public AbstractStringBuilder append(String str) {
        if (str == null)
            return appendNull();
        int len = str.length();
        ensureCapacityInternal(count + len);
        str.getChars(0, len, value, count);
        count += len;
        return this;
    }
}

// Arrays
public static char[] copyOf(char[] original, int newLength) {
    char[] copy = new char[newLength];
    System.arraycopy(original, 0, copy, 0, Math.min(original.length, newLength));
    return copy;
}
Besides the capacity expansion, StringBuilder also has to copy the appended string's characters into its own backing array:
public void getChars(int srcBegin, int srcEnd, char dst[], int dstBegin) {
if (srcBegin < 0) {
throw new StringIndexOutOfBoundsException(srcBegin);
}
if (srcEnd > value.length) {
throw new StringIndexOutOfBoundsException(srcEnd);
}
if (srcBegin > srcEnd) {
throw new StringIndexOutOfBoundsException(srcEnd - srcBegin);
}
System.arraycopy(value, srcBegin, dst, dstBegin, srcEnd - srcBegin);
}
From the data-loading perspective, the points worth attention are: (1) when the data exceeds the allocated capacity, the array has to be expanded; (2) every append copies the appended data; (3) toString() copies the data once more into the returned String. The first point can be avoided with the standard API by pre-sizing the builder, as sketched below.
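A minimal sketch of pre-sizing (the helper joinPreSized is made up for this example, not part of the original code):
static String joinPreSized(List<String> parts) {
    int total = 0;
    for (String part : parts) {
        total += part.length();
    }
    StringBuilder builder = new StringBuilder(total); // exact capacity, no expansion copies
    for (String part : parts) {
        builder.append(part); // the per-append copy still happens
    }
    return builder.toString(); // the final copy into the String still happens
}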
2.2 Hash
HashMap uses the key's hashCode to locate the bucket. If the bucket is already occupied, the entries are chained in a list. A bucket conflict can arise in several ways: (1) equal hashCode, keys not equal, same bucket; (2) equal hashCode, keys equal, same bucket; (3) different hashCode, keys not equal, same bucket. In cases 1 and 3 the entries are chained one after another; in case 2 the previous entry is overwritten. On put, HashMap first computes the key's hashCode; when hash codes are equal, it falls back to reference equality and then to equals. The comparison logic in HashMap:
public V put(K key, V value) {
return putVal(hash(key), key, value, false, true);
}
final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
boolean evict) {
Node<K,V>[] tab; Node<K,V> p; int n, i;
if ((tab = table) == null || (n = tab.length) == 0)
n = (tab = resize()).length;
if ((p = tab[i = (n - 1) & hash]) == null)
tab[i] = newNode(hash, key, value, null);
else {
Node<K,V> e; K k;
if (p.hash == hash &&
((k = p.key) == key || (key != null && key.equals(k))))
e = p;
        // ... (rest of putVal omitted)
}
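As a quick illustration of collision cases 1 and 2 above, "Aa" and "BB" are a well-known pair of strings with identical hashCode values (this snippet is an added illustration, not part of the HashMap source):
Map<String, Object> map = new HashMap<>();
System.out.println("Aa".hashCode() == "BB".hashCode()); // true: same hash, same bucket
map.put("Aa", new Object()); // first entry in the bucket
map.put("BB", new Object()); // case 1: same hash, keys not equal, chained in the bucket
map.put("Aa", new Object()); // case 2: keys equal, the old value is replaced
System.out.println(map.size()); // prints 2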
From the putVal source above we can see that put immediately computes the key's hashCode and uses it to locate the bucket. When the bucket is already occupied, the hashCode is the first condition checked for equality; only when the hash codes match does HashMap compare by reference or by equals. So two functions deserve attention: hashCode and equals. String's hashCode algorithm is shown below: it walks the char array and, for each element, multiplies the accumulated value by 31 and adds the element. It is sometimes claimed online that this algorithm collides relatively often, but in practice the difference is negligible.
public int hashCode() {
int h = hash;
if (h == 0 && value.length > 0) {
char val[] = value;
for (int i = 0; i < value.length; i++) {
h = 31 * h + val[i];
}
hash = h;
}
return h;
}
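A small worked example of the 31 * h + c recurrence:
// h = 31 * 0  + 'a' = 97
// h = 31 * 97 + 'b' = 3007 + 98 = 3105
System.out.println("ab".hashCode()); // prints 3105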
String's equals walks the current char array and the target's char array and compares the characters one by one. One detail that may look odd at first: the while loop is controlled by the counter n, while the array elements are read through the index i. The reason is simply that n counts down the number of characters left to compare (the lengths were already checked to be equal), while i walks both arrays forward; counting n down to zero is a cheap loop condition.
public boolean equals(Object anObject) {
if (this == anObject) {
return true;
}
if (anObject instanceof String) {
String anotherString = (String)anObject;
int n = value.length;
if (n == anotherString.value.length) {
char v1[] = value;
char v2[] = anotherString.value;
int i = 0;
while (n-- != 0) {
if (v1[i] != v2[i])
return false;
i++;
}
return true;
}
}
return false;
}
3 HashKey Implementation
Based on the analysis above, a new class is introduced to serve as the map key. The optimization is mainly on the memory-copy side: the character data is copied only once. For the hash, an FNV-style algorithm is used, following implementations found online.
package org.yunzhong.test.stream;
import java.util.Arrays;
public class HashKey {
private static final int HASH_PARAM = 16777619;
private static final int HASH_INIT = (int) 2166136261L;
private int hashCode;
private char[] values;
private int count;
public HashKey() {
values = new char[64];
count = 0;
}
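// Copies each appended string's chars directly into the single backing array; the array grows (by one copy) only when needed.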
public void append(String value) {
int minLength = 0;
if ((minLength = value.length() + count) > values.length) {
values = Arrays.copyOf(values, minLength * 2);
}
value.getChars(0, value.length(), values, count);
count += value.length();
}
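// Same recurrence as String.hashCode (31 * h + c); kept here for comparison.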
public void hash1() {
for (int i = 0; i < count; ++i) {
hashCode = 31 * hashCode + values[i];
}
}
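// FNV-1a style hash over the accumulated chars, followed by extra bit-mixing steps.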
public void hash() {
hashCode = HASH_INIT;
for (int i = 0; i < count; ++i) {
hashCode = (hashCode ^ values[i]) * HASH_PARAM;
}
hashCode += hashCode << 13;
hashCode ^= hashCode >> 7;
hashCode += hashCode << 3;
hashCode ^= hashCode >> 17;
hashCode += hashCode << 5;
}
@Override
public int hashCode() {
if(this.hashCode == 0) {
hash();
}
return hashCode;
}
public int getHashCode() {
return hashCode;
}
public void setHashCode(int hashCode) {
this.hashCode = hashCode;
}
public char[] getValues() {
return values;
}
public void setValues(char[] values) {
this.values = values;
}
public int getEnd() {
return count;
}
public void setEnd(int end) {
this.count = end;
}
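// Same comparison strategy as String.equals: length check first, then char-by-char.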
@Override
public boolean equals(Object target) {
if (!(target instanceof HashKey)) {
return false;
}
HashKey key = (HashKey) target;
int length = this.count;
if (length == key.count) {
int i = 0;
char[] v1 = this.values;
char[] v2 = key.values;
while (length-- != 0) {
if (v1[i] != v2[i]) {
return false;
}
i++;
}
return true;
}
return false;
}
@Override
public String toString() {
return String.copyValueOf(this.values, 0, count);
}
}
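A minimal usage sketch of the class above: the key parts are appended into the single backing char[] and the key goes straight into the map, with no intermediate String being built (the literals are made up for the example):
HashKey key = new HashKey();
key.append("user");
key.append("42");
Map<HashKey, Object> map = new HashMap<HashKey, Object>();
map.put(key, new Object()); // hashCode() lazily runs the FNV-style hash exactly once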
4 Performance Comparison
Test with 4 million records; laptop: i7 HQ, 8 GB RAM. Overall, the average time drops, but it never reaches a several-fold improvement; with my limited knowledge this is as far as I can take it. StringBuilder test case:
@Test
public void testHashPut() {
String[] characters = new String[] { "a", "b", "c", "d", "e", "f", "j", "h", "i", "j", "k", "l", "m", "n", "o",
"p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "1", "2", "3", "4", "5", "6", "7", "8", "9" };
Random random = new Random();
List<String> values = Lists.newArrayList();
for (int i = 0; i < 4000000; i++) {
StringBuilder builder = new StringBuilder();
for (int j = 0; j < 10; j++) {
int nextInt = random.nextInt(34);
builder.append(characters[nextInt]);
}
values.add(builder.toString());
}
long start = System.currentTimeMillis();
Map<String, Object> map = new HashMap<String, Object>();
for (int i = 3; i < values.size(); i++) {
StringBuilder builder = new StringBuilder();
builder.append(values.get(i - 3));
builder.append(values.get(i - 2));
builder.append(values.get(i - 1));
builder.append(values.get(i));
map.put(builder.toString(), new Object());
}
System.out.println("hash init cost:" + (System.currentTimeMillis() - start));
}
HashKey test case
@Test
public void testHashPutOnceCopy() {
String[] characters = new String[] { "a", "b", "c", "d", "e", "f", "j", "h", "i", "j", "k", "l", "m", "n", "o",
"p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "1", "2", "3", "4", "5", "6", "7", "8", "9" };
Random random = new Random();
List<String> values = Lists.newArrayList();
for (int i = 0; i < 4000000; i++) {
StringBuilder builder = new StringBuilder();
for (int j = 0; j < 10; j++) {
int nextInt = random.nextInt(34);
builder.append(characters[nextInt]);
}
values.add(builder.toString());
}
long start = System.currentTimeMillis();
Map<HashKey, Object> map = new HashMap<HashKey, Object>(1000000);
for (int i = 3; i < values.size(); i++) {
HashKey key = new HashKey();
key.append(values.get(i - 3));
key.append(values.get(i - 2));
key.append(values.get(i - 1));
key.append(values.get(i));
map.put(key, new Object());
}
System.out.println("once hash init cost:" + (System.currentTimeMillis() - start));
}
HashKey, 2 strings concatenated
once hash init cost:7437
once hash init cost:3588
once hash init cost:3593
once hash init cost:1599
once hash init cost:4285
once hash init cost:1597
once hash init cost:1763
once hash init cost:1607
once hash init cost:1526
once hash init cost:1519
StringBuilder, 2 strings concatenated
hash init cost:4588
hash init cost:2890
hash init cost:3226
hash init cost:2963
hash init cost:1743
hash init cost:1695
hash init cost:1729
hash init cost:1748
hash init cost:1641
hash init cost:1859
HashKey, 4 strings concatenated
once hash init cost:7561
once hash init cost:4270
once hash init cost:3726
once hash init cost:4334
once hash init cost:4330
once hash init cost:1936
once hash init cost:1914
once hash init cost:2025
once hash init cost:1926
once hash init cost:2068
StringBuilder, 4 strings concatenated
hash init cost:6841
hash init cost:3479
hash init cost:3590
hash init cost:3897
hash init cost:3676
hash init cost:4806
hash init cost:3460
hash init cost:3661
hash init cost:3512
hash init cost:3466
5 Multi-threading
I did not really want to take the multi-threading route. Multi-threading means coordination between threads and competition for CPU; under heavy system load it may not help at all. Besides, initializing a map is a very small feature, and spinning up threads for it feels like using a sledgehammer to crack a nut. Finally, initializing millions of entries is a rare situation, and whether it finishes in 1 s or 10 s hardly matters for overall performance. Still, it is one possible approach, and I tested it on the same machine. With 4 million records and four strings concatenated per key (as in the code below), the test code and results are as follows:
private ExecutorService threadPool = Executors.newFixedThreadPool(8, new ThreadFactory() {
private int threadNum;
public Thread newThread(Runnable r) {
Thread th = new Thread(r);
th.setName("hashThread" + threadNum++);
return th;
}
});
@Test
public void testHashPutOnceCopyMultiThread() throws InterruptedException, ExecutionException {
String[] characters = new String[] { "a", "b", "c", "d", "e", "f", "j", "h", "i", "j", "k", "l", "m", "n", "o",
"p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "1", "2", "3", "4", "5", "6", "7", "8", "9" };
int batch = 100000;
Random random = new Random();
final List<String> values = Lists.newArrayList();
for (int i = 0; i < 4000000; i++) {
StringBuilder builder = new StringBuilder();
for (int j = 0; j < 10; j++) {
int nextInt = random.nextInt(34);
builder.append(characters[nextInt]);
}
values.add(builder.toString());
}
final Map<HashKey, Object> map = new ConcurrentHashMap<HashKey, Object>(1000000);
long start = System.currentTimeMillis();
List<Future<Object>> futures = Lists.newArrayList();
for (int j = 3; j < values.size(); j += batch) {
final int bottom = j;
final int top = values.size() > j + batch ? (j + batch) : values.size();
Future<Object> future = threadPool.submit(new Callable<Object>() {
public Object call() throws Exception {
for (int i = bottom; i < top; i++) {
HashKey key = new HashKey();
key.append(values.get(i - 3));
key.append(values.get(i - 2));
key.append(values.get(i - 1));
key.append(values.get(i));
map.put(key, new Object());
}
return null;
}
});
futures.add(future);
}
for (Future<Object> future : futures) {
future.get();
}
System.out.println("once hash init cost:" + (System.currentTimeMillis() - start));
}
Test data
once hash init cost:7832
once hash init cost:3056
once hash init cost:2762
once hash init cost:3482
once hash init cost:3611
once hash init cost:3804
once hash init cost:1185
once hash init cost:1211
once hash init cost:1189
once hash init cost:1146