Filtering HTML tags in Python

阿新 • Published: 2019-01-31

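import re

def filter_tags(htmlstr):
    """Strip HTML markup from htmlstr and return the remaining plain text."""
    # Blocks that are removed together with their content.
    # re.S lets '.' span newlines; '.*?' stops at the first closing delimiter.
    re_cdata = re.compile(r'//<!\[CDATA\[.*?//\]\]>', re.I | re.S)  # CDATA wrapped in // comments
    re_script = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.I | re.S)  # <script> blocks
    re_style = re.compile(r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', re.I | re.S)  # <style> blocks
    re_comment = re.compile(r'<!--.*?-->', re.S)  # HTML comments
    re_br = re.compile(r'<br\s*/?>', re.I)  # line breaks
    re_h = re.compile(r'</?\w+[^>]*>')  # any remaining HTML tag

    s = re_cdata.sub('', htmlstr)  # remove CDATA
    s = re_script.sub('', s)  # remove scripts
    s = re_style.sub('', s)  # remove styles
    s = re_comment.sub('', s)  # remove comments before tags, since a comment may contain tags
    s = re_br.sub('\n', s)  # turn <br> into a newline
    s = re_h.sub('', s)  # remove the remaining HTML tags
    s = re.sub(r'\n+', '\n', s)  # collapse consecutive newlines
    s = replaceCharEntity(s)  # replace character entities
    return s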

def replaceCharEntity(htmlstr):
    """Replace the character entities listed in CHAR_ENTITIES; drop all others."""
    CHAR_ENTITIES = {'nbsp': ' ', '160': ' ',
                     'lt': '<', '60': '<',
                     'gt': '>', '62': '>',
                     'amp': '&', '38': '&',
                     'quot': '"', '34': '"'}

    re_charEntity = re.compile(r'&#?(?P<name>\w+);')

    def _replace(match):
        key = match.group('name')  # entity without '&', '#' and ';', e.g. 'gt' for '&gt;'
        return CHAR_ENTITIES.get(key, '')  # unknown entities become an empty string

    # A single pass with a callback never rescans replaced text,
    # so '&amp;lt;' yields '&lt;' rather than '<'.
    return re_charEntity.sub(_replace, htmlstr)
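
For comparison, Python 3's standard library already provides `html.unescape`, which decodes all named and numeric character references rather than only the five listed in `CHAR_ENTITIES`. The following is a minimal sketch, not part of the original code; the function name `unescape_entities` is hypothetical. It behaves differently from `replaceCharEntity` in two ways: unknown entities are left in place instead of being deleted, and `&nbsp;` decodes to U+00A0, which the sketch maps back to a plain space.

```python
import html

def unescape_entities(text):
    # Hypothetical helper: html.unescape decodes named, decimal
    # and hexadecimal character references in one pass.
    decoded = html.unescape(text)
    # &nbsp; decodes to U+00A0; map it to a plain space, as CHAR_ENTITIES does.
    return decoded.replace('\xa0', ' ')

print(unescape_entities('1 &lt; 2 &amp;&amp; 3 &gt; 2&nbsp;&#34;ok&#34;'))
# prints: 1 < 2 && 3 > 2 "ok"
```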

def replace(s, re_exp, repl_string):
    # Helper: substitute repl_string for every match of the compiled pattern re_exp in s.
    return re_exp.sub(repl_string, s)

html = "<p>　　下面讓我們把掌聲和鮮花送給本屆<font color=\"#ff0000\">甲級</font><span style=\"line-height: 20.8px;\"><font color=\"#ff0000\">冠軍</font><font color=\"#ff00ff\">山西新區</font></span>、<font color=\"#ff0000\">乙級</font><span style=\"line-height: 20.8px;\"><font color=\"#ff0000\">冠軍</font><font color=\"#ff00ff\">天若有情</font></span>、<span style=\"color:#FF0000;\">"


print(filter_tags(html))
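
Regular expressions cannot cover every form HTML takes (nested quotes, `>` inside attribute values, malformed markup). As an alternative technique, and not the method used above, the standard library's `html.parser.HTMLParser` can collect the text nodes directly. The sketch below assumes Python 3; the names `TextExtractor` and `strip_tags` are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the content of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &lt; in text nodes.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1
        elif tag == 'br':
            self.parts.append('\n')  # keep line breaks as newlines

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def strip_tags(markup):
    # Hypothetical helper: feed the markup and join the collected text.
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.parts)

print(strip_tags('<p>Hello <b>world</b><br/>1 &lt; 2<script>var a = 1;</script></p>'))
```

Comments need no special handling here because the parser routes them to `handle_comment`, which does nothing by default.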
