Implementing a Vocabulary Memorization Plan with a Word-Root Statistics System

阿新 • Published: 2019-01-03

Life keeps changing, and bugs tend to show up where you least expect them, so we need to be extremely careful.

The word-root statistics feature is now hooked up to a crawler that fetches the explanation of each word from https://www.etymonline.com/. The endpoint is:

https://www.etymonline.com/search?q=
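
As a minimal sketch of how a lookup URL could be built from this endpoint: the word is appended to the query string, and it is safer to percent-encode it first. The class `EtymonlineUrl` and the method `searchUrl` are hypothetical names used only for illustration; they are not part of the project code below.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper for illustration; not part of the original project.
class EtymonlineUrl {
    // Search endpoint quoted in the post
    static final String SEARCH_BASE = "https://www.etymonline.com/search?q=";

    // Returns the search URL for a word, encoding characters that are unsafe in a query string.
    static String searchUrl(String word) {
        try {
            return SEARCH_BASE + URLEncoder.encode(word.trim(), "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a standard JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("bio"));     // prints https://www.etymonline.com/search?q=bio
        System.out.println(searchUrl("ad hoc"));  // prints https://www.etymonline.com/search?q=ad+hoc
    }
}
```

Encoding matters mostly for entries that contain spaces or punctuation; plain single words pass through unchanged.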

Fetching the pages with the crawler:

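import java.io.*;
import java.util.HashMap;
import java.util.Map;

// HtmlPage comes from HtmlUnit; Link, WebEntity, Craw and YeMian are project classes not shown here.
public class SkillOfWords {

    // Cache of word -> translation, loaded once from <name>_fanyi.txt
    private static Map<String, String> wordfanyicache = new HashMap<String, String>();

    // Loads the translation file into the cache if the cache is still empty.
    private static void getwordfanyicache(String name) throws IOException {
        if (wordfanyicache.size() == 0) {
            name = name + "_fanyi.txt";
            File file = new File(name);
            if (file.exists()) {
                // try-with-resources closes the reader even if reading fails
                try (BufferedReader bufferedReader =
                             new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
                    String line = null;
                    int cnt = 0;
                    while ((line = bufferedReader.readLine()) != null) {
                        String[] tmp = line.split(" ");
                        int n = tmp[0].length();
                        if (n > 0) {
                            // Drop the trailing character of the first token to get the word
                            String word = tmp[0].substring(0, n - 1);
                            if (cnt == 0) System.out.println(word); // debug: show the first word
                            String value = "";
                            if (tmp.length == 2) {
                                value = tmp[1];
                            }
                            // Compare string contents with isEmpty(), not with ==
                            if (word.isEmpty() || value.isEmpty()) continue;
                            wordfanyicache.put(word, value);
                            cnt++;
                        }
                    }
                }
                System.out.println("Finished reading");
            } else {
                System.out.println("Translation file does not exist");
            }
        }
    }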


    // Builds daily study plans of 10 words each: reads <name>.txt, looks every word up
    // on etymonline, and writes one plan file per day.
    public static void getSkill(String name) throws IOException {
        if (wordfanyicache.size() == 0) getwordfanyicache(name);
        String nametmp = name;
        name = name + ".txt";
        File file = new File(name);
        if (file.exists()) {
            // Join all lines into one string; the words are expected to be space-separated
            StringBuilder text = new StringBuilder();
            try (BufferedReader bufferedReader =
                         new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
                String line = null;
                while ((line = bufferedReader.readLine()) != null) {
                    text.append(line);
                }
            }
            String[] words = text.toString().split(" ");
            int cntword = 0;
            String jihua = "";
            int cntjihua = getJihuaTian(nametmp); // number of days already generated
            int totalwords = 0;
            for (String url : words) {
                if (url.length() == 0) continue;
                cntword++;
                totalwords++;
                // Skip the words of days that were already generated (<= so that the
                // last word of the previous day is not repeated after a restart)
                if (totalwords <= 10 * cntjihua) {
                    cntword = 0;
                    continue;
                }
                int n = cntword;
                String ans = n + ". " + url;
                String w = wordfanyicache.get(url); // null if no translation is cached
                url = Link.WORD_DETAIL_BASE.getLink() + url;
                ans = ans + ":  " + url + " meaning: " + w + "\n";
                WebEntity webEntity = new WebEntity(url);
                Craw craw = Craw.getInstance();
                HtmlPage page = craw.parsePage(webEntity);
                YeMian yeMian = YeMian.WORD_DETAIL;
                if (page != null) {
                    // TODO: 2018/12/11 parse the page and save it; every 10 words go into one file to form a day's task
                    String html = page.asXml();
                    ans = ans + LabelUtil.analyzeHTMLByString(html, yeMian);
                }
                jihua = jihua + "\n" + ans;
                if (cntword == 10) {
                    // One plan file per day: <name>/jihua_<day>.txt
                    File file1 = new File(nametmp, "jihua_" + cntjihua + ".txt");
                    file1.getParentFile().mkdirs();
                    try (BufferedOutputStream out =
                                 new BufferedOutputStream(new FileOutputStream(file1))) {
                        out.write(jihua.getBytes());
                    }
                    System.out.println("Day " + cntjihua + " generated");
                    saveJihuaTian(cntjihua + 1, nametmp);
                    jihua = "";
                    cntjihua++;
                    cntword = 0;
                }
            }
        } else {
            System.out.println("File does not exist");
        }
    }

    // Persists the number of days already generated to <name>_jihua_jilu.txt (overwrites the file).
    private static void saveJihuaTian(int jihua, String name) throws IOException {
        File file = new File(name + "_jihua_jilu.txt");
        try (BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(file))) {
            out.write(String.valueOf(jihua).getBytes());
        }
    }

    // Reads the number of days already generated; returns 0 if there is no usable record.
    private static int getJihuaTian(String name) throws IOException {
        File file = new File(name + "_jihua_jilu.txt");
        if (!file.exists()) {
            return 0;
        }
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(new FileInputStream(file)))) {
            String line = reader.readLine();
            if (line == null || line.trim().isEmpty()) {
                return 0;
            }
            return Integer.parseInt(line.trim());
        }
    }
}
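
The batching and resume logic in getSkill can be looked at in isolation. The following is only a sketch under stated assumptions: `DailyPlan`, `WORDS_PER_DAY` and `remainingDays` are hypothetical names, and `daysDone` stands for the number stored in the progress file. Like the code above, it only emits full batches of 10 words and leaves a trailing partial batch out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "10 words per day, resume where we stopped" idea.
class DailyPlan {
    static final int WORDS_PER_DAY = 10;

    // Returns the full batches for the days that have not been generated yet.
    // daysDone is the number of days already generated (0 if there is no record).
    static List<List<String>> remainingDays(List<String> words, int daysDone) {
        List<List<String>> days = new ArrayList<>();
        for (int start = daysDone * WORDS_PER_DAY;
             start + WORDS_PER_DAY <= words.size();
             start += WORDS_PER_DAY) {
            days.add(new ArrayList<>(words.subList(start, start + WORDS_PER_DAY)));
        }
        return days;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        for (int i = 1; i <= 25; i++) {
            words.add("w" + i);
        }
        // Day 0 is already done, so only w11..w20 form a full batch; w21..w25 wait for more words
        for (List<String> day : remainingDays(words, 1)) {
            System.out.println(day);
        }
    }
}
```

Computing the start index from the day counter avoids the per-word skip counter, which is where an off-by-one error can easily creep in.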

Parsing the HTML

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LabelUtil {
    // Extracts the text of the first definition section from a word detail page.
    public static String analyzeHTMLByString(String html, YeMian yeMian) {
        String ans = "";
        Document document = Jsoup.parse(html);
        if (yeMian == YeMian.WORD_DETAIL) {
            // ".word--C9UPa" is a generated class name from the site's markup and may change
            Element word = document.select(".word--C9UPa").first();
            Element section = (word == null) ? null : word.select("section").first();
            if (section != null) {
                ans = handleHtmlLabel(section.toString());
            } else {
                System.out.println("Not found");
            }
        }
        return ans;
    }

    // Strips scripts and tags and decodes numeric character references, returning plain text.
    public static String handleHtmlLabel(String html) {
        html = html.replaceAll("&amp;", "&");
        // Decode decimal (&#233;) and hexadecimal (&#xE9;) character references
        html = decodeEntities(html, "&#(\\d+);", 10);
        html = decodeEntities(html, "&#x([\\da-f]+);", 16);

        // Remove every <script>...</script> block; an unclosed <script drops the rest of the text
        String scl = "<script";
        String scr = "</script>";
        int indexl = html.indexOf(scl);
        while (indexl != -1) {
            int indexr = html.indexOf(scr, indexl);
            String before = html.substring(0, indexl);
            String after = (indexr == -1) ? "" : html.substring(indexr + scr.length());
            html = before + after;
            indexl = html.indexOf(scl);
        }

        // Remove the remaining tags and a few leftover entities
        String noHTMLString = html
                .replaceAll("&nbsp;", "")
                .replaceAll("<.*?>", "")
                .replaceAll("&(?:g|l)t;?", "");
        return noHTMLString.trim();
    }

    // Replaces every match of regex (group 1 = digits in the given radix) with the decoded character.
    private static String decodeEntities(String html, String regex, int radix) {
        Matcher m = Pattern.compile(regex, Pattern.CASE_INSENSITIVE).matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String ch = String.valueOf((char) Integer.parseInt(m.group(1), radix));
            m.appendReplacement(sb, Matcher.quoteReplacement(ch));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
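
To see what the cleanup step does without Jsoup or a live page, here is a small standard-library sketch of the same idea: drop script blocks, drop tags, then decode character references. The class `HtmlText` and its methods are hypothetical and simplified; this is not the project's LabelUtil, and a real HTML parser handles many cases that these regular expressions do not.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified HTML-to-text sketch using only the standard library.
class HtmlText {
    private static final Pattern SCRIPT =
            Pattern.compile("<script.*?</script>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // group(1): hex digits after "&#x", group(2): decimal digits after "&#"
    private static final Pattern NUMERIC_ENTITY =
            Pattern.compile("&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));", Pattern.CASE_INSENSITIVE);

    // Decodes numeric character references; invalid code points are left untouched.
    static String decodeNumericEntities(String s) {
        Matcher m = NUMERIC_ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = (m.group(1) != null)
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Removes scripts and tags, then decodes entities; "&amp;" goes last to avoid double decoding.
    static String toText(String html) {
        String text = SCRIPT.matcher(html).replaceAll("");
        text = TAG.matcher(text).replaceAll("");
        text = decodeNumericEntities(text);
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        return text.trim();
    }

    public static void main(String[] args) {
        String html = "<p>caf&#233; &amp; t&#xE9;a<script>var x = 1;</script></p>";
        System.out.println(toText(html));
    }
}
```

Stripping tags before decoding entities keeps a decoded `<` from being mistaken for the start of a tag.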


The feature is not finished yet; to be updated.
