.NET Core Console App - A Simple Scheduled Crawler Using HostedService and HttpClient
The .NET Core Hosting System
The .NET Core hosting system runs long-running services inside a managed process. An ASP.NET Core application is itself such a long-running service: once started, it opens a listener for network requests, the listener hands each request to the middleware pipeline for processing, and the pipeline produces the HTTP response to return.
Many scenarios call for service hosting, for example what this post builds: periodically crawling the like counts of articles on the Huawei developer forum.
Crawling Article Like Counts
Analysis
Take this link: https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201308791792470245&fid=23 . Opening it, it is easy to see the page is built with Angular; since it is an Angular app, the front end and back end are separated, so press F12 in the browser and inspect the network requests.
The corresponding API request looks like this:
POST https://developer.huawei.com/consumer/cn/forum/mid/partnerforumservice/v1/open/getTopicDetail? HTTP/1.1
Host: developer.huawei.com
Content-Type: application/json
Content-Length: 33
{"topicId":"0201302923811480141"}
In my testing, Content-Type and Content-Length must have exactly the values shown above, and so must the body; even a single extra space makes the request fail.
Requesting the Data with HttpClient
Let's go straight to the code. Dependency injection is used here to inject an IHttpClientFactory; a strongly typed HttpClient would also work. For details, see the documentation and this post on dudu's blog:
工廠參觀記:.NET Core 中 HttpClientFactory 如何解決 HttpClient 臭名昭著的問題
private readonly IHttpClientFactory _httpClientFactory;

public async Task<int> Crawl(string link)
{
    using (var httpClient = _httpClientFactory.CreateClient())
    {
        var uri = new Uri(link);
        uri.TryReadQueryAsJson(out var queryParams);
        var topicId = queryParams["tid"].ToString();
        int likeCount = -1;
        if (!string.IsNullOrEmpty(topicId))
        {
            var body = JsonConvert.SerializeObject(new { topicId }, Formatting.None);
            uri = new Uri(_baseUrl);
            var jsonContentType = "application/json";
            var requestMessage = new HttpRequestMessage
            {
                RequestUri = uri,
                Headers = { { "Host", uri.Host } },
                Method = HttpMethod.Post,
                Content = new StringContent(body)
            };
            requestMessage.Content.Headers.ContentType = new MediaTypeHeaderValue(jsonContentType);
            // Use the UTF-8 byte count, not the char count, so Content-Length stays correct for non-ASCII bodies
            requestMessage.Content.Headers.ContentLength = Encoding.UTF8.GetByteCount(body);
            var response = await httpClient.SendAsync(requestMessage);
            if (response.StatusCode == HttpStatusCode.OK)
            {
                dynamic data = await response.Content.ReadAsAsync<dynamic>();
                likeCount = data.result.likes;
            }
        }
        return likeCount;
    }
}
There is a more concise way to write this using _httpClient.PostAsJsonAsync(), but since we may need to customize request headers such as Content-Type, I wrote it this way for now.
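For comparison, when no custom headers are needed, the call above can be sketched with PostAsJsonAsync (available via the Microsoft.AspNet.WebApi.Client package). This is only a sketch under the article's assumptions: the same _baseUrl endpoint and a response containing a result.likes field; the method name CrawlSimple is hypothetical.

```csharp
// A minimal sketch, assuming the same _baseUrl endpoint and response shape as the full version
public async Task<int> CrawlSimple(string topicId)
{
    using (var httpClient = _httpClientFactory.CreateClient())
    {
        // PostAsJsonAsync serializes the anonymous object and sets Content-Type to application/json
        var response = await httpClient.PostAsJsonAsync(_baseUrl, new { topicId });
        if (!response.IsSuccessStatusCode)
            return -1;

        dynamic data = await response.Content.ReadAsAsync<dynamic>();
        return (int)data.result.likes;
    }
}
```

The trade-off is exactly the one noted above: the convenience method hides the request message, so there is no place to set headers such as Host or an exact Content-Length.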
Configuring the Hosting System
class Program
{
static void Main()
{
new HostBuilder()
.ConfigureServices(services =>
{
services.AddHttpClient();
services.AddHostedService<LikeCountCrawler>();
})
.Build()
.Run();
}
}
LikeCountCrawler implements the IHostedService interface.
The IHostedService interface:
public interface IHostedService
{
/// <summary>
/// Triggered when the application host is ready to start the service.
/// </summary>
/// <param name="cancellationToken">Indicates that the start process has been aborted.</param>
Task StartAsync(CancellationToken cancellationToken);
/// <summary>
/// Triggered when the application host is performing a graceful shutdown.
/// </summary>
/// <param name="cancellationToken">Indicates that the shutdown process should no longer be graceful.</param>
Task StopAsync(CancellationToken cancellationToken);
}
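As an aside, instead of implementing IHostedService directly, the same crawler could derive from the BackgroundService base class in Microsoft.Extensions.Hosting, which wires up StartAsync/StopAsync for you. A minimal sketch, assuming a Crawl method and _links field like the ones shown below (the class name LikeCountCrawlerService is hypothetical):

```csharp
// Sketch: BackgroundService replaces the timer with a cancellable delay loop
public class LikeCountCrawlerService : BackgroundService
{
    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            await Crawl(_links);                                      // run one crawl pass
            await Task.Delay(TimeSpan.FromMinutes(5), stoppingToken); // wait for the next tick
        }
    }
}
```

The article's timer-based implementation follows instead.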
In its StartAsync method, LikeCountCrawler sets up and starts a timer; each time the timer elapses, the crawl logic runs once.
private readonly Timer _timer = new Timer();
private readonly IEnumerable<string> _links = new string[]
{
"https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201308791792470245&fid=23",
"https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201303654965850166&fid=18",
"https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201294272503450453&fid=24",
"https://developer.huawei.com/consumer/cn/forum/topicview?tid=0201294189025490019&fid=17"
};
private readonly string _baseUrl = "https://developer.huawei.com/consumer/cn/forum/mid/partnerforumservice/v1/open/getTopicDetail";
...
public Task StartAsync(CancellationToken cancellationToken)
{
_timer.Interval = 5 * 60 * 1000;
_timer.Elapsed += OnTimer;
_timer.AutoReset = true;
_timer.Enabled = true;
_timer.Start();
OnTimer(null, null);
return Task.CompletedTask;
}
public Task Crawl(IEnumerable<string> links)
{
    // Parallel.ForEach does not await async lambdas (they run as async void),
    // so use Task.WhenAll to run the requests concurrently and await them all
    return Task.WhenAll(links.Select(async link =>
    {
        Console.WriteLine($"Crawling link:{link}, ThreadId:{Thread.CurrentThread.ManagedThreadId}");
        var likeCount = await Crawl(link);
        Console.WriteLine($"Succeed crawling likecount - {likeCount}, ThreadId:{Thread.CurrentThread.ManagedThreadId}");
    }));
}
private void OnTimer(object sender, ElapsedEventArgs args)
{
_ = Crawl(_links);
}
...
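One piece elided above is StopAsync; for a graceful shutdown it should stop and release the timer. A minimal sketch of what that could look like:

```csharp
public Task StopAsync(CancellationToken cancellationToken)
{
    // Stop firing and release the timer so the host can shut down cleanly
    _timer.Stop();
    _timer.Dispose();
    return Task.CompletedTask;
}
```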
Run output: