leetcode187 Repeated DNA Sequences (hash table, sliding window, bit manipulation, trie)

阿新 • Published: 2021-10-08

Link: https://leetcode-cn.com/problems/repeated-dna-sequences/

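Problem

All DNA is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T', for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within it.

Write a function that finds all target substrings: substrings of length 10 that occur more than once in the DNA string s.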

Examples

Example 1:

Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:

Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]

Constraints:

0 <= s.length <= 10^5
s[i] is 'A', 'C', 'G', or 'T'

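Approach

Method 1
Looking for duplicates suggests walking over the string and storing every length-10 substring. I first wrote a trie to store the sequences, but the result was mediocre: it was slow and used a lot of memory.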

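#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

// Trie node over the 4-letter alphabet: one child slot per nucleotide.
class Trie{
public:
    vector<Trie*>next;
    int isfind;   // becomes 1 once the sequence ending here has been reported
    Trie(){
        next=vector<Trie*>(4);
        isfind =0;
    }
    ~Trie(){      // free the whole subtree so the trie does not leak
        for(Trie *child:next)
            delete child;
    }
};
class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        int n=s.size();
        if(n<11)  // a repeat needs at least two windows of length 10
            return {};
        vector<string>ans;
        Trie zd;  // root lives on the stack; its destructor frees all nodes
        unordered_map<char,int>mp;
        mp['A']=0;
        mp['C']=1;
        mp['G']=2;
        mp['T']=3;
        for(int i=0;i<=n-10;++i)
        {
            int ischange=0;   // set to 1 if this window created a new node
            Trie *ptr=&zd;
            for(int j=i;j<i+10;++j)
            {
                int thischar=mp[s[j]];
                if(ptr->next[thischar]==nullptr)
                {
                    ptr->next[thischar]=new Trie;
                    ischange=1;
                }
                ptr=ptr->next[thischar];
            }
            // no new node means the window was seen before; report it only once
            if(ischange==0&&ptr->isfind==0)
            {
                (ptr->isfind)++;
                ans.push_back(string(s.begin()+i,s.begin()+i+10));
            }
        }
        return ans;
    }
};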

Method 2
A hash table can also store the substrings directly.

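#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
    const int L = 10;
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> ans;
        unordered_map<string, int> cnt;   // window -> number of occurrences so far
        int n = s.length();
        for (int i = 0; i <= n - L; ++i) {
            string sub = s.substr(i, L);
            // report a window exactly once, at its second occurrence
            if (++cnt[sub] == 2) {
                ans.push_back(sub);
            }
        }
        return ans;
    }
};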
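
Method 2 copies every window into a new std::string before hashing it. As a rough sketch of one way to avoid those copies, the version below keys the map on std::string_view instead. It assumes a C++17 compiler, and the function name findRepeatedDnaViews is made up for this illustration; it is not part of the original solution.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>
#include <vector>

// Hypothetical variant of Method 2 (assumes C++17): the map keys are views
// into s, so a window is copied only when it is added to the answer.
std::vector<std::string> findRepeatedDnaViews(const std::string& s) {
    constexpr std::size_t L = 10;
    std::vector<std::string> ans;
    if (s.size() <= L) return ans;  // fewer than two windows: nothing can repeat

    const std::string_view sv(s);   // views stay valid for as long as s is alive
    std::unordered_map<std::string_view, int> cnt;
    for (std::size_t i = 0; i + L <= sv.size(); ++i) {
        const std::string_view sub = sv.substr(i, L);
        if (++cnt[sub] == 2) {      // report at the second occurrence only
            ans.emplace_back(sub);  // the single copy made for this window
        }
    }
    return ans;
}
```

The hash is still computed over 10 characters per window, so this mainly saves allocations rather than changing the overall complexity.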

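Method 3
The official solution is very clever.
Since there are only 4 characters, each one can be written as a 2-bit binary number: 00, 01, 10, 11.
The sequence length is 10,
so the low 20 bits of a 32-bit int are enough to store one window.
Appending the next character to the window is (x<<2) | bin[s[i+L-1]].
Dropping the oldest character from the front is x & ((1<<(L*2))-1).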
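
To make the two operations above concrete, here is a minimal sketch that encodes a window from scratch and also updates it by sliding. The helper names code, encode, and slide are made up for this illustration, and the input is assumed to contain only A/C/G/T.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: 2-bit code per nucleotide, A=00, C=01, G=10, T=11.
inline std::uint32_t code(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        default:  return 3;  // 'T' (assumes the input holds only A/C/G/T)
    }
}

// Encode the len characters starting at pos from scratch (len must be < 16).
inline std::uint32_t encode(const std::string& s, std::size_t pos, std::size_t len) {
    std::uint32_t x = 0;
    for (std::size_t i = 0; i < len; ++i) {
        x = (x << 2) | code(s[pos + i]);
    }
    return x;
}

// Slide the window one step: append `incoming` at the low end, then mask to
// 2*len bits so the character that left the window is dropped (len < 16).
inline std::uint32_t slide(std::uint32_t x, char incoming, std::size_t len) {
    const std::uint32_t mask = (1u << (2 * len)) - 1;
    return ((x << 2) | code(incoming)) & mask;
}
```

For instance, "ACGT" encodes to the bits 00 01 10 11, which is 27. Sliding the key of the window starting at i by the character s[i+len] gives the same value as re-encoding the window starting at i+1, which is why each step costs O(1) instead of O(len).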

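#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
    const int L=10;
    // 2-bit code for each nucleotide
    unordered_map<char,int>bin={{'A',0},{'C',1},{'G',2},{'T',3}};
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string>ans;
        int n =s.size();
        if(n<=L)    // fewer than two windows, so nothing can repeat
            return ans;
        int x=0;
        // pre-load the first L-1 characters; only the low 20 bits of the int are used
        for(int i=0;i<L-1;++i)
            x=(x<<2) | bin[s[i]];
        unordered_map<int ,int>cnt;
        for(int i=0;i<=n-L;++i)
        {
            // append s[i+L-1], then mask to 20 bits to drop the character that left the window
            x=((x<<2) | bin[s[i+L-1]]) &((1<<(L*2))-1);
            if(++cnt[x]==2){
                ans.push_back(s.substr(i,L));
            }
        }
        return ans;
    }
};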
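
Because the key in Method 3 always fits in 20 bits, one possible variation (not from the official solution, just a sketch) is to replace the unordered_map with a flat table of 2^20 one-byte counters, about 1 MiB. The function name findRepeatedDnaFlat and the lambda code are hypothetical, and the input is assumed to contain only A/C/G/T.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical variant of Method 3: index a flat table by the 20-bit key.
std::vector<std::string> findRepeatedDnaFlat(const std::string& s) {
    constexpr int L = 10;
    constexpr std::uint32_t MASK = (1u << (2 * L)) - 1;
    std::vector<std::string> ans;
    const int n = static_cast<int>(s.size());
    if (n <= L) return ans;  // fewer than two windows: nothing can repeat

    // 2-bit code per nucleotide (assumes the input holds only A/C/G/T).
    auto code = [](char c) -> std::uint32_t {
        switch (c) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            default:  return 3;  // 'T'
        }
    };

    // seen[key]: 0 = never seen, 1 = seen once, 2 = already reported.
    std::vector<std::uint8_t> seen(1u << (2 * L), 0);

    std::uint32_t x = 0;
    for (int i = 0; i < L - 1; ++i) {
        x = (x << 2) | code(s[i]);  // pre-load the first L-1 characters
    }
    for (int i = 0; i + L <= n; ++i) {
        x = ((x << 2) | code(s[i + L - 1])) & MASK;
        // The counter saturates at 2 so it can never overflow a byte.
        if (seen[x] < 2 && ++seen[x] == 2) {
            ans.push_back(s.substr(i, L));
        }
    }
    return ans;
}
```

Whether this beats the hash map depends on the input: the table costs a fixed 1 MiB even for short strings, but each lookup is a plain array access with no hashing.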
