Counting word frequencies in txt files with multiple threads on Linux

阿新 • Published: 2021-11-09

#include <thread>
#include <cstdio>
#include <iostream>
#include <cstdlib>
#include <cctype>
#include <mutex>
#include <condition_variable>
#include <algorithm>
#include <string>
#include <utility>
#include <unordered_set>
#include <unordered_map>
#include <vector>
#include <atomic>
#include <fstream>
using namespace std;

typedef pair<string, int> PSI;

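class Source {// Shared state: the word map and the mutex that protects it
public:
	unordered_map<string, int> mapWords;
	mutex m;
	unordered_set<string> specialWords;// Words to leave out of the tally (pronouns, prepositions, articles and so on)
} source;

atomic_int wordsCount;// Atomic integer holding the total word count; increments are thread-safe without a lock
vector<PSI> vecWords;// Used to sort the words by frequency

bool compare(const PSI& a, const PSI& b) {// Custom sort order: higher frequency first
	return a.second > b.second;
}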

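void countWords(const char* fileName) {// Count the words in one file

	FILE *inputFile = fopen(fileName, "r");

	if (inputFile == NULL) {
		// Report the problem and skip this file; calling exit() from a worker
		// thread would end the process while other threads are still running
		fprintf(stderr, "Cannot open input file: %s\n", fileName);
		return;
	}

	int ch;// int rather than char, so that EOF can be told apart from every valid byte
	do {
		string str = "";
		while ((ch = fgetc(inputFile)) != EOF) {
			if (!isalpha(ch)) {
				if (str.empty())// Not a letter and no word started yet: skip it
					continue;
				break;// Not a letter, but a word is pending: the word is complete
			}
			str += (char)tolower(ch);// A letter: append it, lowercased, to the current word
		}
		if (!str.empty()) {// Also true for a final word that runs straight into EOF
			wordsCount++;// Atomic, so it needs no lock
			lock_guard<mutex> lck(source.m);// Lock the shared map; released at the end of this block
			if (!source.specialWords.count(str))
				source.mapWords[str]++;// Add the word to the tally
		}
	} while (ch != EOF);
	fclose(inputFile);
}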

void fixSpecialWords() {
	ifstream in("./wordSet.txt");// wordSet.txt holds the words to leave out, one per line
	string str;

	if (in)
		while (getline(in, str))
			source.specialWords.insert(str);
	else
		puts("read error");
}

void showResult() {// Print the results
	cout << "Number of input words: " << wordsCount << endl;

	for (const auto& it : source.mapWords)
		vecWords.push_back(it);
	sort(vecWords.begin(), vecWords.end(), compare);

	if (vecWords.size() >= 10) {
		int cnt = vecWords[9].second;// Frequency of the 10th word; words tied with it are printed too
		for (const auto& it : vecWords) {
			if (it.second < cnt)
				break;
			cout << it.first << " : " << it.second << endl;
		}
	} else {// Fewer than 10 distinct words: print all of them
		for (const auto& it : vecWords)
			cout << it.first << " : " << it.second << endl;
	}
}
int main(int argc, char *argv[]) {
	vector<thread> th;// One thread per input file, with no fixed upper limit

	fixSpecialWords();

	for (int i = 1; i < argc; i++)// The files to count are passed to the program as command-line arguments
		th.emplace_back(countWords, argv[i]);

	for (auto& t : th)
		t.join();

	showResult();
	return 0;
}

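In C++, main receives two parameters. argc, short for "argument count", is the number of arguments passed to main, including the program itself. argv, short for "argument vector", is the list of those arguments: argv[0] is the name used to invoke the program (often, though not always, its full path), argv[1] points to the first string after the program name on the command line, argv[2] to the second, and so on up to argv[argc - 1]. argv[argc] itself is always a null pointer.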
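As a small illustration of that layout, the sketch below gathers the file names from the argument vector. The helper name inputFiles is made up for this example and does not appear in the program above.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not part of the original program): collects the input
// file names from the argument vector, skipping argv[0], the program name.
std::vector<std::string> inputFiles(int argc, char* argv[]) {
    std::vector<std::string> files;
    // Valid entries are argv[0] .. argv[argc - 1]; argv[argc] is a null
    // pointer and is never read here.
    for (int i = 1; i < argc; i++)
        files.emplace_back(argv[i]);
    return files;
}
```

Called from main as inputFiles(argc, argv) for a command line such as ./wordcount a.txt b.txt (the program name here is only an example), argc is 3 and the helper returns the two file names a.txt and b.txt.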

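In the tmux session shown in the figure, the left pane holds the single-threaded result and the right pane the multithreaded one. The output of the time command shows that the single-threaded run is faster than the multithreaded one, and that the multithreaded run's system (sys) time is far higher. The design assigns one thread to each file, so the more input files there are, the more the system spends on context switches between threads. Every word also takes the same mutex before the shared map is updated, so the threads probably spend part of their time waiting on one another as well. With a hundred or more input files, then, both versions would be expected to produce the same counts, but the multithreaded one would cost considerably more in both elapsed time and system resources.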
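One possible way to cut down the time threads spend on the shared lock, offered only as a sketch and not as the approach the program above takes, is to let each thread count into a map of its own and merge it into the shared map once at the end. The names WordMap, countLocal and mergeInto are invented for this example, and whether the change actually beats the single-threaded run has not been measured here.

```cpp
#include <cctype>
#include <mutex>
#include <string>
#include <unordered_map>

using WordMap = std::unordered_map<std::string, int>;

// Hypothetical helper: splits text into lowercase alphabetic words and counts
// them in a map owned by the calling thread, so no lock is needed here.
WordMap countLocal(const std::string& text) {
    WordMap local;
    std::string word;
    for (unsigned char ch : text) {
        if (std::isalpha(ch)) {
            word += static_cast<char>(std::tolower(ch));
        } else if (!word.empty()) {
            ++local[word];   // a non-letter ends the current word
            word.clear();
        }
    }
    if (!word.empty())
        ++local[word];       // the text ended in the middle of a word
    return local;
}

// Hypothetical helper: adds one thread's counts to the shared map, taking
// the mutex once per file instead of once per word.
void mergeInto(WordMap& shared, std::mutex& m, const WordMap& local) {
    std::lock_guard<std::mutex> lock(m);
    for (const auto& entry : local)
        shared[entry.first] += entry.second;
}
```

In this arrangement a worker thread would read its file into a string, call countLocal on it, and then call mergeInto with the shared map and its mutex. Filtering out the excluded words could be done either before the merge or when the results are printed.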
