Parsing Assorted Date Formats in a Python Crawler
阿新 • Published: 2019-02-05
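While crawling I kept running into different date formats, so here is a tally of the ones I have seen so far:
2018-6-21, 2018年6月21, 2018/6/21, 21st Jun,2018, Jun 21st,2018 (other months show up in abbreviated form as well), Jun 21,2018, 21-22 Jun 2018, Jun 21-22,2018, Thursday,21 Jun 2018 (the weekday may also be abbreviated), plus others I have not recorded yet. To bring these into one uniform format, they have to be parsed. There are few tutorials on this online, so I have been building up the code bit by bit myself.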
import re


class DateFormatHelper(object):
    # Each pattern captures the year, the month and the (first) day of one format.
    # 21-22 Jun 2018
    regex1 = re.compile(r"([0-9]{1,2})-[0-9]{1,2} *([A-Za-z]+) *([0-9]{4})")
    # 21 Jun 2018
    regex2 = re.compile(r"([0-9]{1,2}) *([A-Za-z]+) *([0-9]{4})")
    # 2018-6-21
    regex3 = re.compile(r"([0-9]{4})-([0-9]{1,2})-([0-9]{1,2})")
    # Jun 21, 2018
    regex4 = re.compile(r"([A-Za-z]+) *([0-9]{1,2}), *([0-9]{4})")
    # Jun 21-22, 2018
    regex5 = re.compile(r"([A-Za-z]+) *([0-9]{1,2})-[0-9]{1,2}, *([0-9]{4})")
    # 2018年6月21日
    regex6 = re.compile(r"([0-9]{4})年([0-9]{1,2})月([0-9]{1,2})日")
    dateformatregexs = [regex1, regex2, regex3, regex4, regex5, regex6]

    monthMap = {"sep": "9", "oct": "10", "nov": "11", "dec": "12",
                "jan": "1", "feb": "2", "aug": "8", "jul": "7",
                "jun": "6", "may": "5", "apr": "4", "mar": "3"}
    monthMap2 = {"September": "9", "October": "10", "November": "11",
                 "December": "12", "January": "1", "February": "2",
                 "August": "8", "July": "7", "June": "6", "May": "5",
                 "April": "4", "March": "3"}

    @classmethod
    def convertStandardDateFormat(cls, datestr: str) -> str:
        """
        Convert a date string to the year-month-day form.
        :param datestr: raw date string, e.g. "21 Jun 2018"
        :return: e.g. "2018-6-21", or "" if no pattern matches
        """
        res = ""
        if datestr is None:
            return res
        for i, regex in enumerate(cls.dateformatregexs):
            # match() is anchored at the start of the string.
            match = regex.match(datestr)
            if match is None:
                continue
            if i in (0, 1):
                # day (or day range), month name, year
                day, month, year = match.groups()
            elif i in (3, 4):
                # month name, day (or day range), year
                month, day, year = match.groups()
            else:
                # regex3 and regex6: numeric year, month, day
                year, month, day = match.groups()
            if i not in (2, 5):
                # Translate the month name: abbreviation first, then full name.
                name = month
                month = cls.monthMap.get(name.lower())
                if month is None:
                    month = cls.monthMap2.get(name.capitalize())
                if month is None:
                    continue  # unknown month name, try the next pattern
            # For a day range only the first day is kept.
            res = year + "-" + month + "-" + day
            break
        return res
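The code is sparsely commented, a habit I am working on, so let me describe the idea instead. You pass a date string to convertStandardDateFormat(); strictly speaking, you call this classmethod whenever you need it. The function then works out which of the predefined date formats the string matches. After that the rest is easy: once you know which part is the year, which is the month and which is the day, it is just a matter of substituting and assigning the pieces. So far the program only parses a handful of formats (the six patterns above), so I am putting it online in the hope that readers will extend it. If you do, please reply with the address of your parser and I will repost it here and keep improving it. I am new to this, so if you have questions or suggestions, leave a comment, and please go easy on me...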
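As one possible way to take up that invitation, here is a minimal sketch, not a finished parser. It is standalone and does not reuse DateFormatHelper; the names normalize_date, MONTHS, DAY and PATTERNS are made up for illustration. It assumes that an English month can be identified by its first three letters, and it additionally covers ordinal suffixes (21st), a leading weekday, slash separators and a missing 日.

```python
import re

# Month lookup by the first three letters, so "Jun", "June" and "JUNE" all map to "6".
# Assumption: an alphabetic token whose first three letters match is a month name.
MONTHS = {name: str(i + 1) for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"])}

# A day, an optional ordinal suffix, and an optional "-22" style range end
# (only the first day of a range is captured).
DAY = r"(?P<day>[0-9]{1,2})(?:st|nd|rd|th)?(?:-[0-9]{1,2}(?:st|nd|rd|th)?)?"

PATTERNS = [
    # 2018-6-21, 2018/6/21, 2018年6月21日, 2018年6月21
    re.compile(r"(?P<year>[0-9]{4})[-/年](?P<month>[0-9]{1,2})[-/月](?P<day>[0-9]{1,2})日?"),
    # 21 Jun 2018, 21st Jun,2018, 21-22 Jun 2018
    re.compile(DAY + r"[ ,]*(?P<month>[A-Za-z]+)[ ,]*(?P<year>[0-9]{4})"),
    # Jun 21,2018, Jun 21st,2018, Jun 21-22,2018
    re.compile(r"(?P<month>[A-Za-z]+)[ .]*" + DAY + r"[ ,]*(?P<year>[0-9]{4})"),
]


def normalize_date(text):
    """Return 'YYYY-M-D' for a recognised date string, or '' otherwise."""
    if not text:
        return ""
    # Drop a leading weekday ("Thursday," or "Thu ") before matching.
    text = re.sub(r"^(?:mon|tue|wed|thu|fri|sat|sun)[a-z]*[,\s]+", "",
                  text.strip(), flags=re.IGNORECASE)
    for pattern in PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        month = match.group("month")
        if not month.isdigit():
            month = MONTHS.get(month[:3].lower())
            if month is None:
                continue  # the alphabetic token is not a month name
        # int() drops leading zeros, so "06" and "6" give the same result.
        return "{}-{}-{}".format(match.group("year"), int(month),
                                 int(match.group("day")))
    return ""


print(normalize_date("Thursday,21 Jun 2018"))  # 2018-6-21
print(normalize_date("2018/6/21"))             # 2018-6-21
```

Named groups keep the year, month and day assignment in the pattern itself, so supporting another format is mostly a matter of appending one more entry to PATTERNS. The sketch does not check that the result is a real calendar date.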