[Practical] Can't find the right programming book? I built my own search site for popular programming books (PDF book list included)

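Posted by 阿新 on 2022-04-28

Original author: Vlad Wetzel

Translated by the CDA translation team

This article is an original work of CDA Data Analyst; reprinting requires authorization.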

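Picking a programming book is hard, so a programmer in the US took every programming book recommended on Stack Overflow, the well-known programming Q&A site, and built his own website for searching popular programming books.

Choosing the right programming book is no easy task.

As a developer, your time is limited, and reading a book takes a lot of it. You could spend that time writing code, resting, or doing many other things. Instead, you spend this precious time reading and improving your skills.

So what should you read? My colleagues and I discuss this often, but I have found that our opinions of any given book differ widely.

So I decided to dig deeper into the question: how do you choose the right programming book?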

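I decided to turn to Stack Overflow (a well-known programming Q&A site), where many experienced developers recommend books. I planned to analyze the Stack Overflow data about programming books to find out which ones are recommended most.

Luckily, Stack Exchange (the parent company of Stack Overflow) had just published its data dump. On that basis I built dev-books.com, where a keyword search shows you a list of the programming books most recommended on Stack Overflow. The site now has more than 100,000 users.

Overall, if you are curious, I recommend reading Working Effectively with Legacy Code; Design Patterns: Elements of Reusable Object-Oriented Software is also a good choice. The titles may look dry, but the content is solid. You can sort books by tag (such as JavaScript, C, graphics, and so on). This is obviously not the full set of recommendations, but if you are just starting to program or want to broaden your knowledge, these two books are a good place to start.

Below I describe how the site was built.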

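Getting and importing the data

I got the Stack Exchange database from archive.org.

From the very start, I realized it would be impossible to import a 48GB XML file into a freshly created database (PostgreSQL) with common approaches such as myxml := pg_read_file('path/to/my_file.xml'), because my server did not have 48GB of RAM. So I decided to use a SAX parser.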
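A minimal sketch of why SAX fits here: the parser fires a callback for each element as it streams through the input, so memory use does not grow with file size. The handler name RowCounter and the two-row sample are made up for illustration; a real run would pass the dump file to xml.sax.parse instead of using an in-memory string.

```python
import xml.sax


class RowCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts <row> elements and remembers the last Id."""

    def __init__(self):
        super().__init__()
        self.rows = 0
        self.last_id = None

    def startElement(self, name, attributes):
        # Called once per opening tag; only the current element is held here.
        if name == 'row':
            self.rows += 1
            self.last_id = attributes.get('Id')


# Tiny made-up sample: the values live in the attributes of each row.
SAMPLE = b'<posts><row Id="1" PostTypeId="1" /><row Id="2" ParentId="1" /></posts>'

handler = RowCounter()
xml.sax.parseString(SAMPLE, handler)
print(handler.rows, handler.last_id)
```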

All the values are stored in <row> tags (as attributes, as the code below shows), so I used a Python script to parse them:

def startElement(self, name, attributes):
  if name == 'row':
    # Insert one row per <row> element; attributes that are absent become NULL.
    # Note the 'ParentID' key below: this is the typo discussed after the code.
    self.cur.execute("INSERT INTO posts (Id, Post_Type_Id, Parent_Id, Accepted_Answer_Id, Creation_Date, Score, View_Count, Body, Owner_User_Id, Last_Editor_User_Id, Last_Editor_Display_Name, Last_Edit_Date, Last_Activity_Date, Community_Owned_Date, Closed_Date, Title, Tags, Answer_Count, Comment_Count, Favorite_Count) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
      (
        (attributes['Id'] if 'Id' in attributes else None),
        (attributes['PostTypeId'] if 'PostTypeId' in attributes else None),
        (attributes['ParentID'] if 'ParentID' in attributes else None),
        (attributes['AcceptedAnswerId'] if 'AcceptedAnswerId' in attributes else None),
        (attributes['CreationDate'] if 'CreationDate' in attributes else None),
        (attributes['Score'] if 'Score' in attributes else None),
        (attributes['ViewCount'] if 'ViewCount' in attributes else None),
        (attributes['Body'] if 'Body' in attributes else None),
        (attributes['OwnerUserId'] if 'OwnerUserId' in attributes else None),
        (attributes['LastEditorUserId'] if 'LastEditorUserId' in attributes else None),
        (attributes['LastEditorDisplayName'] if 'LastEditorDisplayName' in attributes else None),
        (attributes['LastEditDate'] if 'LastEditDate' in attributes else None),
        (attributes['LastActivityDate'] if 'LastActivityDate' in attributes else None),
        (attributes['CommunityOwnedDate'] if 'CommunityOwnedDate' in attributes else None),
        (attributes['ClosedDate'] if 'ClosedDate' in attributes else None),
        (attributes['Title'] if 'Title' in attributes else None),
        (attributes['Tags'] if 'Tags' in attributes else None),
        (attributes['AnswerCount'] if 'AnswerCount' in attributes else None),
        (attributes['CommentCount'] if 'CommentCount' in attributes else None),
        (attributes['FavoriteCount'] if 'FavoriteCount' in attributes else None)
      )
    )

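After nearly three days of importing (almost half of the XML had been imported by then), I realized I had made a mistake: the ParentID attribute should have been ParentId.

I did not want to lose another week, so I moved from an AMD E-350 (2 x 1.35GHz) to an Intel G2020 (2 x 2.90GHz). Even so, this did not speed up the process.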
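A misspelled attribute name like ParentID fails silently, because a missing attribute simply becomes NULL. One inexpensive safeguard, sketched here under assumptions, is a dry run over a small sample that reports which expected attribute names never appear. The names AttributeCollector and unseen_attributes and the sample data are hypothetical, not part of the original script.

```python
import xml.sax


class AttributeCollector(xml.sax.ContentHandler):
    """Collects every attribute name seen on <row> elements."""

    def __init__(self):
        super().__init__()
        self.seen = set()

    def startElement(self, name, attributes):
        if name == 'row':
            self.seen.update(attributes.getNames())


def unseen_attributes(xml_bytes, expected):
    """Return the expected names that never occur in the sample, sorted."""
    collector = AttributeCollector()
    xml.sax.parseString(xml_bytes, collector)
    return sorted(set(expected) - collector.seen)


# Made-up sample rows; the second one carries the correctly spelled ParentId.
SAMPLE = (b'<posts><row Id="1" PostTypeId="1" />'
          b'<row Id="2" PostTypeId="2" ParentId="1" /></posts>')
print(unseen_attributes(SAMPLE, ['Id', 'PostTypeId', 'ParentID']))
```

An optional attribute can be legitimately absent from a small sample, so a reported name is a prompt to check the spelling rather than proof of a bug.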

The next decision was batch inserts:

import xml.sax
from io import StringIO

class docHandler(xml.sax.ContentHandler):
  def __init__(self, cursor):
    super().__init__()
    self.cursor = cursor
    self.queue = 0
    self.output = StringIO()
  def startElement(self, name, attributes):
    if name == 'row':
      # One tab-separated line per row in COPY text format; \N marks NULL.
      # Backslashes, newlines, carriage returns and tabs in text must be escaped.
      self.output.write(
          attributes['Id'] + '\t' +
          (attributes['PostTypeId'] if 'PostTypeId' in attributes else '\\N') + '\t' +
          (attributes['ParentId'] if 'ParentId' in attributes else '\\N') + '\t' +
          (attributes['AcceptedAnswerId'] if 'AcceptedAnswerId' in attributes else '\\N') + '\t' +
          (attributes['CreationDate'] if 'CreationDate' in attributes else '\\N') + '\t' +
          (attributes['Score'] if 'Score' in attributes else '\\N') + '\t' +
          (attributes['ViewCount'] if 'ViewCount' in attributes else '\\N') + '\t' +
          (attributes['Body'].replace('\\', '\\\\').replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t') if 'Body' in attributes else '\\N') + '\t' +
          (attributes['OwnerUserId'] if 'OwnerUserId' in attributes else '\\N') + '\t' +
          (attributes['LastEditorUserId'] if 'LastEditorUserId' in attributes else '\\N') + '\t' +
          (attributes['LastEditorDisplayName'].replace('\n', '\\n') if 'LastEditorDisplayName' in attributes else '\\N') + '\t' +
          (attributes['LastEditDate'] if 'LastEditDate' in attributes else '\\N') + '\t' +
          (attributes['LastActivityDate'] if 'LastActivityDate' in attributes else '\\N') + '\t' +
          (attributes['CommunityOwnedDate'] if 'CommunityOwnedDate' in attributes else '\\N') + '\t' +
          (attributes['ClosedDate'] if 'ClosedDate' in attributes else '\\N') + '\t' +
          (attributes['Title'].replace('\\', '\\\\').replace('\n', '\\n').replace('\r', '\\r').replace('\t', '\\t') if 'Title' in attributes else '\\N') + '\t' +
          (attributes['Tags'].replace('\n', '\\n') if 'Tags' in attributes else '\\N') + '\t' +
          (attributes['AnswerCount'] if 'AnswerCount' in attributes else '\\N') + '\t' +
          (attributes['CommentCount'] if 'CommentCount' in attributes else '\\N') + '\t' +
          (attributes['FavoriteCount'] if 'FavoriteCount' in attributes else '\\N') + '\n'
      )
      self.queue += 1
    # Load the buffered rows in batches of 100000.
    if self.queue >= 100000:
      self.queue = 0
      self.flush()
  def flush(self):
    # Rewind the buffer, bulk-load it with COPY, then start a fresh buffer.
    # flush() must be called once more after parsing to load the last partial batch.
    self.output.seek(0)
    self.cursor.copy_from(self.output, 'posts')
    self.output.close()
    self.output = StringIO()

StringIO lets you use a file-like variable with copy_from, the function that runs COPY. This way, the whole import took just one night.
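The escaping rules the handler relies on are easier to see in isolation. This is a minimal sketch of building a COPY-style text buffer without touching a database; the helper names copy_field and rows_to_buffer are illustrative assumptions, not part of the original script.

```python
from io import StringIO


def copy_field(value):
    """Encode one value for PostgreSQL COPY text format (None -> \\N)."""
    if value is None:
        return '\\N'
    # Backslash must be escaped first so the later escapes are not doubled.
    return (value.replace('\\', '\\\\')
                 .replace('\n', '\\n')
                 .replace('\r', '\\r')
                 .replace('\t', '\\t'))


def rows_to_buffer(rows):
    """Write rows as tab-separated lines into an in-memory file-like object."""
    buf = StringIO()
    for row in rows:
        buf.write('\t'.join(copy_field(v) for v in row) + '\n')
    buf.seek(0)  # rewind so a reader starts at the first row
    return buf


buf = rows_to_buffer([('1', 'line one\nline two', None)])
print(repr(buf.getvalue()))
```

With psycopg2, a buffer like this could be handed to cursor.copy_from(buf, 'posts'), which is what the handler's flush method does.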

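Next came the indexes. In theory, GiST is slower than GIN but takes up less space, so I decided to use GiST. A day later I had a 70GB index.

When I tried a few queries, I found that they took a very long time to process. The cause was time spent waiting on disk IO. A GOODRAM C40 120GB SSD helped a lot, even though it is not the fastest SSD around.

I created a brand new PostgreSQL cluster: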

initdb -D /media/ssd/postgres/data

Then I changed the path in the service configuration (I was using Manjaro):

vim /usr/lib/systemd/system/postgresql.service

Environment=PGROOT=/media/ssd/postgres
PIDFile=/media/ssd/postgres/data/postmaster.pid

Then I reloaded the configuration and started PostgreSQL:

systemctl daemon-reload
systemctl start postgresql

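This time I used GIN, and the import took only a few hours. The index took up 20GB on the SSD, and queries took less than a minute.

Extracting book information from the database

With the data finally imported, I started looking for posts that mentioned books, then copied them to a separate table using SQL: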

CREATE TABLE books_posts AS SELECT * FROM posts WHERE body LIKE '%book%';

The next step was to find all the hyperlinks in them:

CREATE TABLE http_books AS SELECT * FROM posts WHERE body LIKE '%http%';

At this point I discovered that Stack Overflow proxies all such links in this form:

rads.stackoverflow.com/[$isbn]/

I created another table with all the posts that contain these links:

CREATE TABLE rads_posts AS SELECT * FROM posts WHERE body LIKE '%http://rads.stackoverflow.com%';

Then I used regular expressions to extract all the ISBNs. I extracted the Stack Overflow tags into another table with regexp_split_to_table.
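The two extraction steps can be sketched in Python rather than SQL. Everything specific here is an assumption made for illustration: that the host is rads.stackoverflow.com, that the ISBN is a 10-character path segment after it, that tags are stored as a single string like "<c#><java>", and the function names themselves. The article does not show the actual expressions used.

```python
import re

# Assumption: an ISBN-10 (9 digits plus a digit or X) appears as a path
# segment after the proxy host, possibly preceded by other segments.
ISBN_RE = re.compile(r'rads\.stackoverflow\.com/(?:[\w.-]+/)*(\d{9}[\dXx])\b')
# Assumption: tags are stored as one string of the form "<c#><java>".
TAG_RE = re.compile(r'<([^<>]+)>')


def extract_isbns(body):
    """Return the distinct ISBNs linked in one post body, in order of appearance."""
    found = []
    for isbn in ISBN_RE.findall(body):
        isbn = isbn.upper()
        if isbn not in found:
            found.append(isbn)
    return found


def split_tags(tags):
    """Split a tag string such as '<c#><java>' into ['c#', 'java']."""
    return TAG_RE.findall(tags or '')


# Made-up post body with two proxied links in slightly different shapes.
body = ('<p>Read <a href="http://rads.stackoverflow.com/amzn/click/0131177052">'
        'this</a> and <a href="http://rads.stackoverflow.com/0201633612/">that</a>.</p>')
print(extract_isbns(body))
print(split_tags('<c#><design-patterns>'))
```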

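Once the popular tags had been extracted and counted, I could work out the 20 most recommended books (the book list is attached at the end of this article).

Next step: refining the tags.

This step takes the top 20 books for each tag and excludes books that have already been processed.

Since this was a one-off job, I decided to use PostgreSQL arrays. I wrote a script to generate the query: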

SELECT *
    , ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude ))
    , ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude )), 1) 
FROM (
   SELECT *
      , ARRAY['isbn1', 'isbn2', 'isbn3'] AS to_exclude 
   FROM (
      SELECT 
           tag
         , ARRAY_AGG(DISTINCT isbn) AS isbns
         , COUNT(DISTINCT isbn) 
      FROM (
         SELECT * 
         FROM (
            SELECT 
                 it.*
               , t.popularity 
            FROM isbn_tags AS it 
            LEFT OUTER JOIN isbns AS i on i.isbn = it.isbn 
            LEFT OUTER JOIN tags AS t on t.tag = it.tag 
            WHERE it.tag in (
               SELECT tag 
               FROM tags 
               ORDER BY popularity DESC 
               LIMIT 1 OFFSET 0
            ) 
            ORDER BY post_count DESC LIMIT 20
      ) AS t1 
      UNION ALL
      SELECT * 
      FROM (
         SELECT 
              it.*
            , t.popularity 
         FROM isbn_tags AS it 
         LEFT OUTER JOIN isbns AS i on i.isbn = it.isbn 
         LEFT OUTER JOIN tags AS t on t.tag = it.tag 
         WHERE it.tag in (
            SELECT tag 
            FROM tags 
            ORDER BY popularity DESC 
            LIMIT 1 OFFSET 1
         ) 
         ORDER BY post_count 
         DESC LIMIT 20
       ) AS t2 
       UNION ALL
       SELECT * 
       FROM (
          SELECT 
               it.*
             , t.popularity 
          FROM isbn_tags AS it 
          LEFT OUTER JOIN isbns AS i on i.isbn = it.isbn 
          LEFT OUTER JOIN tags AS t on t.tag = it.tag 
          WHERE it.tag in (
             SELECT tag 
             FROM tags 
             ORDER BY popularity DESC 
             LIMIT 1 OFFSET 2
          ) 
          ORDER BY post_count DESC 
          LIMIT 20
      ) AS t3 
...
      UNION ALL
      SELECT * 
      FROM (
         SELECT 
              it.*
            , t.popularity 
         FROM isbn_tags AS it 
         LEFT OUTER JOIN isbns AS i on i.isbn = it.isbn 
         LEFT OUTER JOIN tags AS t on t.tag = it.tag 
         WHERE it.tag in (
            SELECT tag 
            FROM tags 
            ORDER BY popularity DESC 
            LIMIT 1 OFFSET 78
         ) 
         ORDER BY post_count DESC 
         LIMIT 20
     ) AS t79
   ) AS tt 
   GROUP BY tag 
   ORDER BY max(popularity) DESC 
  ) AS ttt
) AS tttt 
ORDER BY ARRAY_UPPER(ARRAY(SELECT UNNEST(isbns) EXCEPT SELECT UNNEST(to_exclude)), 1) DESC;
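The per-tag logic of the query above (top 20 books per tag, minus the books already handled, ordered by how many remain) can be sketched in plain Python. The tuple shape (tag, isbn, post_count), the function name, and the sample rows are hypothetical. The sketch also assumes the input has already been limited to the popular tags, which the SQL does with its LIMIT 1 OFFSET n subqueries.

```python
from collections import defaultdict


def remaining_top_books(isbn_tags, to_exclude, top_n=20):
    """For each tag, take its top_n ISBNs by post_count and drop excluded ones.

    isbn_tags: iterable of (tag, isbn, post_count) tuples (hypothetical shape).
    Returns (tag, remaining_isbns) pairs, tags with most remaining books first.
    """
    by_tag = defaultdict(list)
    for tag, isbn, post_count in isbn_tags:
        by_tag[tag].append((post_count, isbn))

    excluded = set(to_exclude)
    result = []
    for tag, books in by_tag.items():
        # Highest post_count first; ties broken by ISBN for determinism.
        top = sorted(books, key=lambda b: (-b[0], b[1]))[:top_n]
        remaining = sorted({isbn for _, isbn in top} - excluded)
        result.append((tag, remaining))
    # Mirrors ORDER BY ... DESC on the size of the remaining array.
    result.sort(key=lambda item: (-len(item[1]), item[0]))
    return result


rows = [
    ('java', 'isbn1', 50), ('java', 'isbn4', 30), ('java', 'isbn5', 10),
    ('c#', 'isbn1', 40), ('c#', 'isbn2', 20),
]
print(remaining_top_books(rows, ['isbn1', 'isbn2', 'isbn3']))
```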

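With this data in hand, I started building the website.

Building the web app

Since I am neither a web developer nor a web UI expert, I decided to build a very simple single-page application based on the default Bootstrap theme.

I created a "search by tag" option, then extracted the most popular tags so that each search can be run by clicking a tag.

I displayed the search results as a bar chart. I tried Highcharts and D3, but they are better suited to dashboards. They also had some responsiveness problems and were fairly complex to configure. So I built my own responsive chart based on SVG. To keep it responsive, it has to be redrawn when the screen orientation changes: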

var w = $('#plot').width();
var bars = "";
var imgs = "";
var texts = "";
var rx = 10;
var tx = 25;
var max = Math.floor(w / 60);
var maxPop = 0;
for(var i =0; i < max; i ++){
  if(i > books.length - 1 ){
    break;
  }
  obj = books[i];
  if(maxPop < Number(obj.pop)) {
    maxPop = Number(obj.pop);
  }
}
for(var i =0; i < max; i ++){
  if(i > books.length - 1){
    break;
   }
   obj = books[i];
   h = Math.floor((180 / maxPop ) * obj.pop);
   dt = 0;
   if(('' + obj.pop + '').length == 1){
    dt = 5;
   }
   if(('' + obj.pop + '').length == 3){
    dt = -3;
   }
   var scrollTo = 'onclick="scrollTo(\'' + obj.id + '\'); return false;"';
   bars += '<rect id="rect'+ obj.id +'" x="'+ rx +'" y="' + (180 - h + 30) + '" width="50" height="' + h + '" ' + scrollTo + '>';
   bars += '<title>' + obj.name+ '</title>';
   bars += '</rect>';
   imgs += '<image height="70" x="'+ rx +'" y="220" href="img/ol/jpeg/' + obj.id + '.jpeg" onmouseout="unhoverbar('+ obj.id +');" onmouseover="hoverbar('+ obj.id +');" width="50" ' + scrollTo + '>';
   imgs += '<title>' + obj.name+ '</title>';
   imgs += '</image>';
   texts += '<text x="'+ (tx + dt) +'" y="'+ (180 - h + 20) +'"  class="bar-label"  style="font-size: 16px;" ' + scrollTo + '>' + obj.pop + '</text>';
   rx += 60;
   tx += 60;
}
$('#plot').html(
    ' <svg width="100%" height="300" aria-labelledby="title desc" role="img">'
  + '  <defs> '
  + '    <style type="text/css"><![CDATA['
  + '      .cla {'
  + '        fill: #337ab7;'
  + '      }'
  + '      .cla:hover {'
  + '        fill: #5bc0de;'
  + '      }'
  + '      ]]></style>'
  + '  </defs>'
  + '  <g>'
  + bars
  + '  </g>'
  + '  <g>'
  + imgs
  + '  </g>'
  + '  <g>'
  + texts
  + '  </g>'
  + '</svg>');
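The layout arithmetic in the script above is compact enough to restate on its own: the number of bars is limited by the plot width, and each height is scaled against the most popular book shown. This Python sketch mirrors those numbers (a 60-pixel slot, 180-pixel maximum height, 30-pixel top margin, first bar at x = 10). The function name bar_layout and the guard against a zero maximum are additions for illustration.

```python
import math


def bar_layout(pops, plot_width, slot=60, max_height=180, top_margin=30):
    """Compute (x, y, height) for each bar, mirroring the script's arithmetic."""
    # Only as many bars as fit: one bar per 60-pixel slot.
    count = min(len(pops), math.floor(plot_width / slot))
    shown = list(pops[:count])
    max_pop = max(shown, default=0)
    bars = []
    x = 10  # left offset of the first bar (rx in the script)
    for pop in shown:
        # Scale so the most popular book shown gets the full height.
        height = math.floor(max_height / max_pop * pop) if max_pop else 0
        bars.append((x, max_height - height + top_margin, height))
        x += slot
    return bars


# A 190-pixel-wide plot fits three bars, so the fourth book is dropped.
print(bar_layout([90, 45, 30, 10], plot_width=190))
```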

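Web server failure

Right after I published dev-books.com, a crowd of users started visiting the site. Apache could not serve more than 500 visitors at the same time, so I quickly set up Nginx and switched over to it. I was really surprised when the number of real-time visitors reached 800.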

Book list download:

Stack Overflow recommended book list.pdf
