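Python 3 Strings and Text Processing

阿新 • Published: 2018-11-12

Almost every program involves some text processing, such as splitting strings, searching, replacing, and lexical analysis. Many of these tasks can be handled easily with the built-in string methods, but more complex operations call for regular expressions.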

1. Splitting a string on any of several delimiters

In [1]: line = 'asdf fjdk; afed, fjek,asdf,    foo'
# Import the regular expression module
In [2]: import re
# re.split() can split on several different delimiters at once
In [3]: re.split(r'[;,\s]\s*',line)
Out[3]: ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
# Splitting with a capture group also includes the matched delimiters in the result
In [4]: re.split(r'(;|,|\s)\s*',line)
Out[4]: ['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']
# To keep the delimiters out of the result while still grouping, use a non-capturing group, written (?:...)
In [5]: re.split(r'(?:,|;|\s)\s*',line)
Out[5]: ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
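Capturing the delimiters is useful when the string has to be put back together later. The following is a minimal sketch of that idea, reusing the `line` string from above; the names `values`, `delimiters`, and `rebuilt` are illustrative and not part of the original session.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf,    foo'

# With a capture group, fields and delimiters alternate in the result
fields = re.split(r'(;|,|\s)\s*', line)
values = fields[::2]                # even indexes hold the fields
delimiters = fields[1::2] + ['']    # odd indexes hold the delimiters; pad to equal length

# Re-join each field with the delimiter that followed it
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))
print(rebuilt)
```

Only the captured delimiter character is kept, so the extra whitespace consumed by `\s*` does not reappear in `rebuilt`.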

2. Matching text at the start or end of a string

In [6]: url = 'https://www.baidu.com'
# Check the end of the string; returns True on a match
In [7]: url.endswith('.com')
Out[7]: True
In [8]: url.endswith('.cn')
Out[8]: False
# Check the start of the string; returns True on a match
In [10]: url.startswith('https:')
Out[10]: True
In [11]: url.startswith('http:')
Out[11]: False

In [1]: import os
In [2]: filenames = os.listdir('.')
In [3]: filenames
Out[3]: 
['.tcshrc',
 '.bash_logout',
 '.mysql_history',
 'Python-3.7.0.tgz',
 '.bash_history',
 '.cache',
 'anaconda-ks.cfg',
 '.ipython',
 '.cshrc',
 '.bashrc',
 '.viminfo',
 'Python-3.7.0',
 'mysql-boost-8.0.12.tar.gz',
 'heapq_queue.py',
 'mysql-8.0.12',
 '.bash_profile']
# Filter the names by prefix with startswith()
In [4]: [i for i in filenames if i.startswith('.')]
Out[4]: 
['.tcshrc',
 '.bash_logout',
 '.mysql_history',
 '.bash_history',
 '.cache',
 '.ipython',
 '.cshrc',
 '.bashrc',
 '.viminfo',
 '.bash_profile']
# To test against several candidates at once, pass them as a tuple
In [5]: [i for i in filenames if i.endswith(('.py','.gz','.tgz'))]
Out[5]: ['Python-3.7.0.tgz', 'mysql-boost-8.0.12.tar.gz', 'heapq_queue.py']
# Check whether the directory contains any file ending in .py
In [6]: any(i.endswith('.py') for i in filenames)
Out[6]: True
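# Example: read content from a URL or from a local text file
from urllib.request import urlopen

def read_data(name):
    # Names that start with a URL scheme are fetched over the network
    if name.startswith(('http:','https:','ftp:')):
        return urlopen(name).read().decode()
    else:
        with open(name) as f:
            return f.read()

result = read_data('test.txt')
print(result)

# A regular expression can perform the same check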
In [9]: import re

In [10]: re.match('http:|https:|ftp:','https://www.baidu.com')
Out[10]: <re.Match object; span=(0, 6), match='https:'>
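One detail worth illustrating: `startswith()` and `endswith()` accept a single string or a tuple of strings, but not a list. The sketch below uses a hypothetical `choices` list and URL to show the error and the conversion that avoids it.

```python
choices = ['http:', 'ftp:']        # candidates that happen to be stored in a list
url = 'http://www.python.org'

try:
    url.startswith(choices)        # a list is rejected
except TypeError as exc:
    print('TypeError:', exc)

matched = url.startswith(tuple(choices))   # convert to a tuple first
print(matched)
```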

3. Matching strings with shell wildcard patterns

# The fnmatch module provides fnmatch() and fnmatchcase() for wildcard matching
In [12]: from fnmatch import fnmatch,fnmatchcase

In [13]: fnmatch('foo.txt','*.txt')
Out[13]: True

In [14]: fnmatch('foo.txt','?oo.txt')
Out[14]: True

In [15]: names = ['dat01.csv','dat99.csv','config.ini','foo.py']

In [18]: [i for i in names if fnmatch(i,'dat[0-9]*.csv')]
Out[18]: ['dat01.csv', 'dat99.csv']
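# fnmatch() follows the case-sensitivity rules of the underlying operating system, so on Windows it ignores case (the True below is the Windows result). fnmatchcase() always matches case exactly as written in the pattern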
In [2]: fnmatch('foo.txt','*.TXT')
Out[2]: True

In [3]: fnmatchcase('foo.txt','*.TXT')
Out[3]: False

# The same functions can filter ordinary strings (not only filenames) in a list comprehension
In [20]: addresses = [
    ...:     '5412 N CLARK ST',
    ...:     '1060 W ADDISON ST',
    ...:     '1039 W GRANVILLE AVE',
    ...:     '2122 N CLARK ST',
    ...:     '4802 N BROADWAY',
    ...: ]

In [21]: [i for i in addresses if fnmatch(i,'*ST')]
Out[21]: ['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']

In [22]: [i for i in addresses if fnmatch(i,'*CLARK*')]
Out[22]: ['5412 N CLARK ST', '2122 N CLARK ST']
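Because `fnmatch()` behaves differently across operating systems, a small helper built on `fnmatchcase()` gives the same result everywhere. This is a sketch only; the helper name `select` and the extra `NOTES.TXT` entry are assumptions made for illustration.

```python
from fnmatch import fnmatchcase

def select(names, pattern):
    """Return the names matching pattern, always comparing case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]

names = ['dat01.csv', 'dat99.csv', 'config.ini', 'foo.py', 'NOTES.TXT']
print(select(names, 'dat[0-9]*.csv'))
print(select(names, '*.txt'))      # 'NOTES.TXT' is not matched: the case differs
```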

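4. Matching and searching for text patterns

For simple matching, the basic string methods str.find(), str.endswith(), and str.startswith() are enough. More complicated patterns require the regular expression module, re.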

In [23]: text1 = 'today is 10/19/2018.Pycon starts 3/13/2019.'

In [24]: import re

In [25]: re.match(r'\d+/\d+/\d+',text1)  # match() only matches at the start of the string, so this returns None (no output)

In [28]: re.findall(r'\d+/\d+/\d+',text1)
Out[28]: ['10/19/2018', '3/13/2019']

In [29]: text2 = '11/20/2018'

In [30]: re.match(r'\d+/\d+/\d+',text2)
Out[30]: <re.Match object; span=(0, 10), match='11/20/2018'>

In [31]: result = re.match(r'(\d+)/(\d+)/(\d+)',text2)

In [32]: result.groups()
Out[32]: ('11', '20', '2018')

In [33]: result.group(0)
Out[33]: '11/20/2018'

In [34]: result.group(1)
Out[34]: '11'

In [35]: result.group(2)
Out[35]: '20'

In [36]: result.group(3)
Out[36]: '2018'
# Extract every date with capture groups and print it in a different format
In [39]: text1 = 'today is 10/19/2018.Pycon starts 3/13/2019.'

In [40]: for month,day,year in re.findall(r'(\d+)/(\d+)/(\d+)',text1):
    ...:     print('{}-{}-{}'.format(year,month,day))
    ...:     
2018-10-19
2019-3-13

# For large texts, finditer() yields the matches one at a time as an iterator
In [43]: text1 = 'today is 10/19/2018.Pycon starts 3/13/2019.'

In [44]: for i in re.finditer(r'(\d+)/(\d+)/(\d+)',text1):
    ...:     print(i.groups())
    ...:     
('10', '19', '2018')
('3', '13', '2019')
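When the same pattern is used repeatedly, it can be compiled once with `re.compile()`. The sketch below also contrasts `match()`, which only anchors at the start, with `fullmatch()`, which requires the whole string to match; the variable names are illustrative.

```python
import re

# Compile once, reuse many times
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

loose = datepat.match('11/20/2018abcdef')        # matches the leading date only
strict = datepat.fullmatch('11/20/2018abcdef')   # None: trailing text is not allowed
exact = datepat.fullmatch('11/20/2018')

print(loose.group())
print(strict)
print(exact.groups())
```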

5. Searching and replacing text

For simple literal patterns, the str.replace() method is enough.

In [45]: text = 'abcabcabcabc'

In [46]: text.replace('a','ee')
Out[46]: 'eebceebceebceebc'

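For more complicated patterns, use the sub() function in the re module. In the replacement string, a backslash followed by a number, such as \3, refers to the capture group with that number.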

In [47]: text3 = 'today is 10/19/2018. pycon starts 3/13/2013.'

In [49]: re.sub(r'(\d+)/(\d+)/(\d+)',r'\3-\1-\2',text3)
Out[49]: 'today is 2018-10-19. pycon starts 2013-3-13.'

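For more involved substitutions, such as rewriting the month as its abbreviated name, pass a callback function as the replacement.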

In [54]: text3 = 'today is 10/19/2018. pycon starts 3/13/2013.'

In [55]: from calendar import month_abbr

In [56]: def change_date(m):
    ...:     mon_name = month_abbr[int(m.group(1))]
    ...:     return '{} {} {}'.format(m.group(2),mon_name,m.group(3))
    ...: 
    ...: 

In [57]: re.sub(r'(\d+)/(\d+)/(\d+)',change_date,text3)
Out[57]: 'today is 19 Oct 2018. pycon starts 13 Mar 2013.'
# subn() also returns how many substitutions were made
In [58]: re.subn(r'(\d+)/(\d+)/(\d+)',change_date,text3)
Out[58]: ('today is 19 Oct 2018. pycon starts 13 Mar 2013.', 2)
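Numbered backreferences become hard to read as patterns grow. As a sketch of an alternative, the groups can be named with `(?P<name>...)` and referenced in the replacement with `\g<name>`; the group names `month`, `day`, and `year` are chosen here for illustration.

```python
import re

text3 = 'today is 10/19/2018. pycon starts 3/13/2013.'

# Named groups document what each part of the pattern means
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')
result = datepat.sub(r'\g<year>-\g<month>-\g<day>', text3)
print(result)
```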

6. Searching and replacing text case-insensitively

To perform case-insensitive text operations, use the re module and pass the re.IGNORECASE flag to each operation.

In [60]: text = 'UPPER PYTHON,lower python, mixed Python'
In [61]: re.findall('python',text,flags=re.IGNORECASE)
Out[61]: ['PYTHON', 'python', 'Python']
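A plain substitution with `re.IGNORECASE` finds every variant, but the replacement text is inserted exactly as written, so the original capitalization is lost. The short sketch below shows that limitation, which is what the case-preserving `matchcase()` callback in this section addresses.

```python
import re

text = 'UPPER PYTHON,lower python, mixed Python'

# Every match is replaced by the literal string 'snake', whatever its case was
replaced = re.sub('python', 'snake', text, flags=re.IGNORECASE)
print(replaced)
```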

import re
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace
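# Replace while preserving the case of the matched text (all upper, all lower, or capitalized)
text = 'UPPER PYTHON,lower python,Mixed Python'
print(re.sub('python',matchcase('snake'),text,flags=re.IGNORECASE))
# prints: UPPER SNAKE,lower snake,Mixed Snake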

7. Regular expressions for the shortest match

str_pat = re.compile(r'\"(.*)\"')
text1 = 'computer says "no."'
str_pat.findall(text1)
Out[18]: ['no.']
text2 = 'computer says "no." phone says "yes."'
str_pat.findall(text2)  # .* is greedy, so it matches as much text as possible
Out[20]: ['no." phone says "yes.']
str_pat = re.compile(r'\"(.*?)\"')  # adding ? after * makes it non-greedy, so it matches as little text as possible
str_pat.findall(text2)
Out[22]: ['no.', 'yes.']
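A non-greedy quantifier is not the only way to get the shortest match here. As a sketch of an alternative, a negated character class states directly that the quoted text may not contain a quote; the name `quoted` is illustrative.

```python
import re

text2 = 'computer says "no." phone says "yes."'

# [^"]* matches any run of characters that are not a double quote
quoted = re.compile(r'"([^"]*)"')
found = quoted.findall(text2)
print(found)
```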

8. Regular expressions for multiline patterns

comment = re.compile(r'python(.*?)end')
text1 = 'python is ver good \n so so end'
comment.findall(text1)  # . does not match a newline by default
Out[27]: []
comment = re.compile(r'python(.*?)end',flags=re.DOTALL)  # with re.DOTALL, . matches every character, including newlines
comment.findall(text1)
Out[29]: [' is ver good \n so so ']
comment = re.compile(r'python((?:.|\n)*?)end')  # (?:.|\n) is a non-capturing group: it matches but captures nothing and is assigned no group number
comment.findall(text1)
Out[31]: [' is ver good \n so so ']

9. Normalizing Unicode text to a canonical form

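s1 = 'spicy\u00f1o'   # fully composed: ñ is the single code point U+00F1
s2 = 'spicyn\u0303o'  # decomposed: the Latin letter n followed by the combining tilde U+0303
s1 == s2   # the code point sequences differ, so the strings compare unequal
Out[35]: False
s1
Out[36]: 'spicyño'
s2
Out[37]: 'spicyño'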
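The comparison above fails because the same visible text can be stored as different code point sequences. A minimal sketch of the usual remedy follows: normalize both strings to one form with `unicodedata.normalize()` before comparing. The strings here are a composed and a decomposed spelling of the same word, defined locally for the example.

```python
import unicodedata

s1 = 'spicy\u00f1o'     # composed: single code point for the n with tilde
s2 = 'spicyn\u0303o'    # decomposed: n followed by a combining tilde

# NFC composes characters where possible; both strings become identical
nfc1 = unicodedata.normalize('NFC', s1)
nfc2 = unicodedata.normalize('NFC', s2)
print(nfc1 == nfc2, len(nfc1))

# NFD decomposes characters into base letters plus combining marks
nfd1 = unicodedata.normalize('NFD', s1)
print(len(nfd1))

# After decomposition, combining marks can be dropped to strip diacritics
plain = ''.join(c for c in nfd1 if not unicodedata.combining(c))
print(plain)
```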

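10. Stripping unwanted characters from strings

# strip() removes characters from both ends of a string; lstrip() and rstrip() strip only the left or the right side. Whitespace (including newlines) is removed by default, and other characters can be specified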
In [21]: s = ' hello world \n'                                               

In [22]: s.strip()                                                           
Out[22]: 'hello world'

In [23]: s.lstrip()                                                          
Out[23]: 'hello world \n'

In [24]: s.rstrip()                                                          
Out[24]: ' hello world'

In [25]: t = '-----hello====='                                               

In [26]: t.lstrip('-')    # strip a specific character
Out[26]: 'hello====='

In [27]: t.strip('-=')    # several characters can be given
Out[27]: 'hello'

# These methods never touch the middle of a string; for that, use replace() or a regular expression substitution
In [28]: s.replace(' ','')
Out[28]: 'helloworld\n'

In [29]: re.sub(r'\s+', '',s)
Out[29]: 'helloworld'
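Deleting all whitespace also glues the words together, which is often not what is wanted. As a sketch of a gentler clean-up, runs of whitespace can be collapsed to a single space and the ends stripped afterwards; the sample string is made up for the example.

```python
import re

s = '  hello     world \n'

stripped = s.strip()                          # ends cleaned, inner run of spaces kept
collapsed = re.sub(r'\s+', ' ', s).strip()    # inner runs collapsed to one space

print(repr(stripped))
print(repr(collapsed))
```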

11. Aligning text strings

# For basic alignment, use the string methods ljust(), rjust(), and center() for left, right, and center alignment; each accepts an optional fill character
In [31]: text = 'hello world'                                                

In [32]: text.ljust(30)                                                      
Out[32]: 'hello world                   '

In [33]: text.rjust(30)                                                      
Out[33]: '                   hello world'

In [34]: text.center(30)                                                     
Out[34]: '         hello world          '

In [35]: text.center(30,'=')                                                 
Out[35]: '=========hello world=========='

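# The format() function can align text as well: use '<', '>', or '^' for left, right, or center alignment together with the desired width. To use a fill character, put it before the alignment character: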
In [36]: format(text,'>20')                                                  
Out[36]: '         hello world'

In [37]: format(text,'<20')                                                  
Out[37]: 'hello world         '

In [38]: format(text,'^20')                                                  
Out[38]: '    hello world     '

In [39]: format(text,'=^20')                                                 
Out[39]: '====hello world====='

In [40]: format(text,'=^20s')                                                
Out[40]: '====hello world====='

In [41]: format(text,'*^20s')                                                
Out[41]: '****hello world*****'
# The format() method accepts the same codes when formatting several values at once
In [42]: '{:>10s}{:<10s}'.format('hello','world')                            
Out[42]: '     helloworld     '

In [43]: '{:#>10s} {:&<10s}'.format('hello','world')                         
Out[43]: '#####hello world&&&&&'
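The same format codes are not limited to strings or to `format()`. The sketch below, which assumes Python 3.6 or later for the f-string, applies them inside an f-string and to a floating-point number; the value `x` is invented for the example.

```python
text = 'hello world'
x = 1.2345

# An f-string takes the same specification after the colon
centered = f'{text:*^20}'

# Alignment combines with numeric formatting: width 10, two decimals, right-aligned
number = format(x, '>10.2f')

print(centered)
print(repr(number))
```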

12. Concatenating and joining strings

# When the strings to combine are in a sequence or other iterable, join() is the best choice
In [44]: data = ['I','like','is','python']                                   

In [45]: ' '.join(data)                                                      
Out[45]: 'I like is python'

In [46]: ','.join(data)                                                      
Out[46]: 'I,like,is,python'

# A generator expression can convert each item to a string on the fly while joining, with no temporary list
In [47]: ','.join(str(d) for d in data)                                      
Out[47]: 'I,like,is,python'
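The conversion with `str()` matters most when the data is not made of strings already, because `join()` refuses anything else. A small sketch with made-up mixed data:

```python
data = ['ACME', 50, 91.1]      # mixed types: str, int, float

try:
    ','.join(data)             # join() accepts strings only
except TypeError as exc:
    print('TypeError:', exc)

# Convert each item while joining
joined = ','.join(str(d) for d in data)
print(joined)
```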

13. Interpolating variables in strings

# The usual way to substitute values into a string is the format() method
In [5]: str_variable = "{name} today {num} old year"

In [6]: str_variable.format(name='zhang',num=20)
Out[6]: 'zhang today 20 old year'

# Another way is to combine format_map() with vars(), which looks the names up among the variables in the current scope
In [7]: name = 'python'

In [8]: num = 18

In [9]: str_variable.format_map(vars())
Out[9]: 'python today 18 old year'
# vars() also works on class instances
In [10]: class info:
    ...:     def __init__(self,name,num):
    ...:         self.name = name
    ...:         self.num = num
    ...:         

In [11]: a = info('shell',23)

In [12]: str_variable.format_map(vars(a))
Out[12]: 'shell today 23 old year'
# A missing value raises KeyError; to handle that, define a dict subclass with a __missing__() method
In [13]: class safesub(dict):
    ...:     def __missing__(self,key):
    ...:         return '{' + key + '}'
    ...:     

In [14]: del num

In [15]: str_variable.format_map(safesub(vars()))
Out[15]: 'python today {num} old year'
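The standard library offers a comparable facility in `string.Template`, whose `safe_substitute()` likewise leaves unknown placeholders in place. This sketch swaps in `$name` placeholders instead of `{name}`, so it is an alternative to the `safesub` class above rather than the same technique.

```python
from string import Template

tmpl = Template('$name today $num old year')

# substitute() needs every placeholder; safe_substitute() tolerates gaps
full = tmpl.substitute(name='zhang', num=20)
partial = tmpl.safe_substitute(name='python')

print(full)
print(partial)
```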

14. Reformatting text to a fixed number of columns

# The textwrap module can reformat strings in several ways:
>>> import textwrap
>>> s = "look into eyes, look into my eyes, the eyes,the eyes, \
... the eyes, not around the eyes, don't look around the eyes, \
... look into my eyes, you're under."
>>> print(textwrap.fill(s,70))
look into eyes, look into my eyes, the eyes,the eyes, the eyes, not
around the eyes, don't look around the eyes, look into my eyes, you're
under.
>>> print(textwrap.fill(s,40))     
look into eyes, look into my eyes, the
eyes,the eyes, the eyes, not around the
eyes, don't look around the eyes, look
into my eyes, you're under.
>>> print(textwrap.fill(s,40,initial_indent=' '))
 look into eyes, look into my eyes, the
eyes,the eyes, the eyes, not around the
eyes, don't look around the eyes, look
into my eyes, you're under.
>>> print(textwrap.fill(s,40,subsequent_indent=' '))
look into eyes, look into my eyes, the
 eyes,the eyes, the eyes, not around the
 eyes, don't look around the eyes, look
 into my eyes, you're under.
# os.get_terminal_size() returns the size of the terminal
>>> import os                
>>> print(textwrap.fill(s,os.get_terminal_size().columns))
look into eyes, look into my eyes, the eyes,the eyes, the eyes, not around the eyes, don't look around
the eyes, look into my eyes, you're under.
>>> print(os.get_terminal_size())   
os.terminal_size(columns=105, lines=32)
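`os.get_terminal_size()` raises `OSError` when the output is not attached to a terminal, for example when it is redirected to a file. A sketch of a more forgiving variant uses `shutil.get_terminal_size()`, which accepts a fallback size; the fallback of 80 by 24 is simply that function's documented default, written out explicitly.

```python
import shutil
import textwrap

s = ("look into eyes, look into my eyes, the eyes,the eyes, "
     "the eyes, not around the eyes, don't look around the eyes, "
     "look into my eyes, you're under.")

# Falls back to 80 columns when no terminal can be queried
width = shutil.get_terminal_size(fallback=(80, 24)).columns
wrapped = textwrap.fill(s, width)
print(wrapped)
```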

15. Handling HTML and XML entities in text

# html.escape() replaces HTML special characters such as < and > with their entities
In [1]: s = 'Elements are written aa "<tag>text</tag>".'

In [2]: import html

In [3]: s
Out[3]: 'Elements are written aa "<tag>text</tag>".'

In [4]: html.escape(s)
Out[4]: 'Elements are written aa &quot;&lt;tag&gt;text&lt;/tag&gt;&quot;.'
# Leave quotation marks unescaped
In [5]: html.escape(s,quote=False)
Out[5]: 'Elements are written aa "&lt;tag&gt;text&lt;/tag&gt;".'
# Convert HTML entities and character references back to text
# (HTMLParser.unescape() was removed in Python 3.9; html.unescape() replaces it)
In [6]: s1 = 'Spicy &quot;Jalape&#241;o&quot.'
In [7]: html.unescape(s1)
Out[7]: 'Spicy "Jalapeño".'
# Produce ASCII text, writing non-ASCII characters as character references
In [12]: s2 = html.unescape(s1)
In [13]: s2.encode('ascii',errors='xmlcharrefreplace')
Out[13]: b'Spicy "Jalape&#241;o".'
# Handle XML entities
In [14]: s3 = 'the prompt is &gt;&gt;&gt;'
In [15]: from xml.sax.saxutils import unescape
In [16]: unescape(s3)
Out[16]: 'the prompt is >>>'
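Escaping and unescaping are inverse operations, which is easy to confirm with a round trip. The sketch below uses only `html.escape()` and `html.unescape()` from the standard library; the sample strings are adapted from the ones above.

```python
import html

s = 'Elements are written as "<tag>text</tag>".'

escaped = html.escape(s)            # special characters become entities
restored = html.unescape(escaped)   # and back again

# Named entities and numeric character references are both decoded
decoded = html.unescape('Spicy &quot;Jalape&#241;o&quot;.')

print(escaped)
print(restored == s)
print(decoded)
```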

16. Tokenizing text

# Parse a string from left to right into a stream of tokens
In [17]: text = 'foo = 23 + 42 * 10'

In [18]: tokens= [('NAME','foo'),('EQ','='),('NUM','23'),('PLUS','+'),('NUM','42'),('TIMES','*'),('NUM','10')]

In [19]: import re
# Define each token as a named capture group
In [20]: NAME = r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)'

In [21]: NUM = r'(?P<NUM>\d+)'

In [22]: PLUS = r'(?P<PLUS>\+)'

In [23]: TIMES = r'(?P<TIMES>\*)'

In [24]: EQ = r'(?P<EQ>=)'

In [25]: WS = r'(?P<WS>\s+)'

In [26]: master_pat = re.compile('|'.join([NAME,NUM,PLUS,TIMES,EQ,WS]))
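# Use the pattern object's scanner() method to do the tokenizing
In [27]: scanner = master_pat.scanner('foo = 42')
# The scanner calls match() repeatedly on the given text, matching one token at a time; the steps look like this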
In [28]: scanner.match()
Out[28]: <re.Match object; span=(0, 3), match='foo'>

In [29]: _.lastgroup,_.group()
Out[29]: ('NAME', 'foo')

In [30]: scanner.match()
Out[30]: <re.Match object; span=(3, 4), match=' '>

In [31]: _.lastgroup,_.group()
Out[31]: ('WS', ' ')

In [32]: scanner.match()
Out[32]: <re.Match object; span=(4, 5), match='='>

In [33]: _.lastgroup,_.group()
Out[33]: ('EQ', '=')

In [34]: scanner.match()
Out[34]: <re.Match object; span=(5, 6), match=' '>

In [35]: _.lastgroup,_.group()
Out[35]: ('WS', ' ')

In [36]: scanner.match()
Out[36]: <re.Match object; span=(6, 8), match='42'>

In [37]: _.lastgroup,_.group()
Out[37]: ('NUM', '42')
# Package the technique as a generator function
In [40]: from collections import namedtuple

In [41]: token = namedtuple('token',['type','value'])

In [42]: def generate_tokens(pat,text):
    ...:     scanner = pat.scanner(text)
    ...:     for m in iter(scanner.match,None):
    ...:         yield token(m.lastgroup,m.group())
    ...:         

In [43]: for tok in generate_tokens(master_pat,'foo = 42'):
    ...:     print(tok)
    ...:     
token(type='NAME', value='foo')
token(type='WS', value=' ')
token(type='EQ', value='=')
token(type='WS', value=' ')
token(type='NUM', value='42')
# Filter out the whitespace tokens
In [45]: tokens = (tok for tok in generate_tokens(master_pat,text) if tok.type != 'WS')

In [46]: for tok in tokens:print(tok)
token(type='NAME', value='foo')
token(type='EQ', value='=')
token(type='NUM', value='23')
token(type='PLUS', value='+')
token(type='NUM', value='42')
token(type='TIMES', value='*')
token(type='NUM', value='10')
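The generator above stops quietly at the first character that no pattern matches, so invalid input is truncated without warning. The sketch below is one possible way to detect that, by checking that the scanner consumed the whole text; the function name `strict_tokens` and the error message are assumptions of this example, and `scanner()` is used exactly as in the session above.

```python
import re
from collections import namedtuple

Token = namedtuple('Token', ['type', 'value'])

master_pat = re.compile('|'.join([
    r'(?P<NAME>[a-zA-Z_][a-zA-Z_0-9]*)',
    r'(?P<NUM>\d+)',
    r'(?P<PLUS>\+)',
    r'(?P<TIMES>\*)',
    r'(?P<EQ>=)',
    r'(?P<WS>\s+)',
]))

def strict_tokens(pat, text):
    """Yield non-whitespace tokens; raise SyntaxError on unmatched input."""
    pos = 0
    scanner = pat.scanner(text)
    for m in iter(scanner.match, None):
        pos = m.end()                      # remember how far scanning got
        if m.lastgroup != 'WS':
            yield Token(m.lastgroup, m.group())
    if pos != len(text):                   # scanning stopped before the end
        raise SyntaxError('Unexpected character {!r} at position {}'
                          .format(text[pos], pos))

print(list(strict_tokens(master_pat, 'foo = 42')))
```

Calling `list(strict_tokens(master_pat, 'foo = 4 $ 2'))` raises `SyntaxError`, because `$` matches none of the token patterns.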

17. Writing a simple recursive descent parser

import re
import collections

# Token specifications, one named group per token type
NUM = r'(?P<NUM>\d+)'
PLUS = r'(?P<PLUS>\+)'
MINUS = r'(?P<MINUS>-)'
TIMES = r'(?P<TIMES>\*)'
DIVIDE = r'(?P<DIVIDE>/)'
LPAREN = r'(?P<LPAREN>\()'
RPAREN = r'(?P<RPAREN>\))'
WS = r'(?P<WS>\s+)'

master_pat = re.compile('|'.join([NUM,PLUS,MINUS,TIMES,DIVIDE,LPAREN,RPAREN,WS]))
Token = collections.namedtuple('Token',['type','value'])

# Tokenizer that skips whitespace
def generate_tokens(text):
    scanner = master_pat.scanner(text)
    for m in iter(scanner.match,None):
        tok = Token(m.lastgroup,m.group())
        if tok.type != 'WS':
            yield tok

class ExpressionEvaluator:
    # Recursive descent evaluator: each method below implements one grammar rule
    def parse(self,text):
        self.tokens = generate_tokens(text)
        self.nexttok = None   # lookahead token, not yet consumed
        self.tok = None       # last token consumed
        self._advance()       # load the first lookahead token
        return self.expr()

    def _advance(self):
        # Move ahead by one token
        self.tok,self.nexttok = self.nexttok,next(self.tokens,None)

    def _accept(self,toktype):
        # Consume the next token if it has the given type
        if self.nexttok and self.nexttok.type == toktype:
            self._advance()
            return True
        else:
            return False

    def _expect(self,toktype):
        # Consume the next token, or fail if it is not of the given type
        if not self._accept(toktype):
            raise SyntaxError('Expected ' + toktype)

    def expr(self):
        # expr ::= term { ('+'|'-') term }*
        exprval = self.term()
        while self._accept('PLUS') or self._accept('MINUS'):
            op = self.tok.type
            right = self.term()
            if op == 'PLUS':
                exprval += right
            elif op == 'MINUS':
                exprval -= right
        return exprval

    def term(self):
        # term ::= factor { ('*'|'/') factor }*
        termval = self.factor()
        while self._accept('TIMES') or self._accept('DIVIDE'):
            op = self.tok.type
            right = self.factor()
            if op == 'TIMES':
                termval *= right
            elif op == 'DIVIDE':
                termval /= right
        return termval

    def factor(self):
        # factor ::= NUM | '(' expr ')'
        if self._accept('NUM'):
            return int(self.tok.value)
        elif self._accept('LPAREN'):
            exprval = self.expr()
            self._expect('RPAREN')
            return exprval
        else:
            raise SyntaxError('Expected NUMBER or LPAREN')


if __name__ == '__main__':
    e = ExpressionEvaluator()
    print(e.parse('2'))                # 2
    print(e.parse('2 + 3'))            # 5
    print(e.parse('2 + 3 * 4'))        # 14
    print(e.parse('2 + (3 + 4) * 5'))  # 37

18. Performing text operations on byte strings

In [2]: data = b'hello world'

In [3]: data
Out[3]: b'hello world'
# Slicing
In [4]: data[0:5]
Out[4]: b'hello'
# Splitting
In [6]: data.split()
Out[6]: [b'hello', b'world']
# Replacing
In [7]: data.replace(b'hello',b'python')
Out[7]: b'python world'
# Indexing a single element returns an integer (the byte value), not a one-character string
In [8]: data[0]
Out[8]: 104
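Two further differences from text strings are worth a short sketch: regular expressions applied to bytes must themselves be written as bytes patterns, and the `format()` method does not exist on bytes, so formatting is done on a text string that is encoded afterwards. The sample data is invented for the example.

```python
import re

data = b'FOO:BAR,SPAM'

# The pattern must be a bytes literal when the subject is bytes
parts = re.split(rb'[:,]', data)

# Decode to get a text string, for example before printing
text = data.decode('ascii')

# Format as text first, then encode the result to bytes
line = '{:10s} {:5d}'.format('ACME', 100).encode('ascii')

print(parts)
print(text)
print(line)
```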
