LeetCode--187. Repeated DNA Sequences

阿新 • Published: 2019-01-01

Problem link: https://leetcode.com/problems/repeated-dna-sequences/

The task is to find every substring of length 10 that occurs more than once in a DNA string.

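Approach 1: Use a HashMap to store each substring seen so far together with its occurrence count. A substring is put into the HashMap on its first occurrence, appended to the final answer on its second occurrence, and on its third or later occurrence only its count is updated. The idea is fairly plain; the code is as follows: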

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        LinkedList<String> ret=new LinkedList<String>();
        HashMap<String,Integer> hs=new HashMap<String,Integer>();
        for(int i=0;i<s.length()-9;i++)
        {
            int j=i+9;
            String str=s.substring(i,j+1);
            if(hs.containsKey(str))
            {
                int frequency=hs.get(str);
                if(frequency==1)
                    ret.add(str);
                hs.put(str,frequency+1);
            }
            else
            {
                hs.put(str,1);
            }
        }
        return ret;
    }
}

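Time complexity: O(10m) = O(m), where m is the length of s (each window costs a 10-character substring plus its hash)

Space complexity: O(m)

The efficiency of this solution is only average.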

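Approach 2: The overall method is still duplicate detection. The idea above can be simplified further: use one HashSet to store every substring seen so far, and store the repeated substrings in a second HashSet. This way there is no need to track which occurrence we are looking at, because a substring that is already present cannot be inserted into a set again (add returns false). The code is as follows: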

class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        HashSet<String> seen = new HashSet<>();
        HashSet<String> reap = new HashSet<>();
        for(int i=0; i<s.length()-9;i++) {
            String temp = s.substring(i,i+10);
            if(!seen.add(temp)) {
                reap.add(temp);
            }
        }
        
        return new ArrayList<>(reap);
    }
}

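Approach 3: The approaches above all fundamentally rely on hashing, and a rolling hash can reduce the cost of computing each window's hash value. For details, see this algorithm walkthrough: https://blog.csdn.net/To_be_to_thought/article/details/85038546. That article's code is not reproduced here.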
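For readers who want something concrete, here is a minimal sketch of the rolling-hash idea applied to this problem. It is not taken from the linked article: the class name `RollingHashDna`, the base 31 and the modulus 1,000,000,007 are all assumptions chosen for illustration. Because a modular hash can collide, the sketch confirms every hash match with `String.regionMatches` before reporting a repeat.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RollingHashDna {
    private static final int L = 10;                 // window length required by the problem
    private static final long BASE = 31;             // assumed base, any small prime works
    private static final long MOD = 1_000_000_007L;  // assumed modulus

    static List<String> findRepeatedDnaSequences(String s) {
        Set<String> repeated = new LinkedHashSet<>();
        if (s == null || s.length() <= L) {
            return new ArrayList<>(repeated);
        }

        // BASE^(L-1) mod MOD: the weight of the character that leaves the window.
        long high = 1;
        for (int i = 1; i < L; i++) {
            high = high * BASE % MOD;
        }

        // Hash of the first window, computed from scratch.
        long hash = 0;
        for (int i = 0; i < L; i++) {
            hash = (hash * BASE + s.charAt(i)) % MOD;
        }

        // hash -> start indices of distinct windows that produced this hash
        Map<Long, List<Integer>> starts = new HashMap<>();
        starts.computeIfAbsent(hash, k -> new ArrayList<>()).add(0);

        for (int i = 1; i + L <= s.length(); i++) {
            // O(1) update: remove s[i-1], then append s[i+L-1].
            hash = (hash - s.charAt(i - 1) * high % MOD + MOD) % MOD;
            hash = (hash * BASE + s.charAt(i + L - 1)) % MOD;

            List<Integer> bucket = starts.computeIfAbsent(hash, k -> new ArrayList<>());
            boolean seen = false;
            for (int start : bucket) {
                // Guard against collisions by comparing the actual characters.
                if (s.regionMatches(start, s, i, L)) {
                    seen = true;
                    break;
                }
            }
            if (seen) {
                repeated.add(s.substring(i, i + L));
            } else {
                bucket.add(i);
            }
        }
        return new ArrayList<>(repeated);
    }
}
```

With a window of only 10 characters the saving over rehashing each substring is modest; the technique pays off more as the window length grows.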

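Approach 4: Because only four letters can occur, we can re-encode them compactly as a four-letter alphabet. Observe:

ASCII assigns each character a numeric code that fits in one byte (code points 0-127):
         A in binary: 0100 0001
         C in binary: 0100 0011
         G in binary: 0100 0111
         T in binary: 0101 0100
From the encoding, the last three bits are enough to tell the four characters apart (A = 001, C = 011, G = 111, T = 100). So 10 consecutive characters need 30 bits, and an int is four bytes, i.e. 32 bits, which is enough to hold the encoding of 10 consecutive letters. The encoding is kept in the low 30 bits of the integer, which requires the mask 0x3fffffff, and extracting the last three bits of each character requires the mask 7 (binary 0111).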
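The bit manipulation described above can be tried in isolation with a small sketch like the one below. The class `DnaBitPacking` and its helpers `code`, `pack` and `slide` are hypothetical names introduced only for illustration; the masks 7 and 0x3FFFFFFF are the ones named in the text.

```java
class DnaBitPacking {
    // Low three bits of the ASCII code: A = 1, C = 3, G = 7, T = 4.
    static int code(char c) {
        return c & 7;
    }

    // Packs the 10 characters starting at `start` into the low 30 bits of an int.
    static int pack(String s, int start) {
        int packed = 0;
        for (int i = start; i < start + 10; i++) {
            packed = (packed << 3) | code(s.charAt(i));
        }
        return packed;
    }

    // Slides the window one character to the right:
    // the mask drops the oldest 3 bits, then the new character's 3 bits are appended.
    static int slide(int packed, char next) {
        return ((packed << 3) & 0x3FFFFFFF) | code(next);
    }
}
```

Sliding a packed window by one character gives the same value as packing the next window from scratch, which is what lets the solution update its key in constant time.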

The code is as follows:

class Solution {
    
    public List<String> findRepeatedDnaSequences( String s)
    {
        LinkedList<String> ret=new LinkedList<>();
        if(s==null || s.length()<=9)
            return ret;
        int hash=0;

        HashMap<Integer,Integer> map=new HashMap<>();

        int mask=0x3FFFFFFF;
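        // Pre-load the first 9 characters; the loop below completes each 10-character window.
        // (The 9-character prefix is not a window, so it is not recorded in the map.)
        for(int i=0;i<9;i++)
            hash=(hash<<3) | (s.charAt(i) & 7);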
        for(int i=9;i<s.length();i++)
        {
            hash= (hash<<3) & mask | ( s.charAt(i) & 7);
            if(map.containsKey(hash))
            {
                int p=map.get(hash);
                if(p==1)
                    ret.add(s.substring(i-9,i+1));
                map.put(hash,++p);
            }
            else
                map.put(hash,1);
        }
        return ret;
    }
}

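I later looked at the most efficient solution, which is also based on the re-encoding idea of Approach 4. The difference is that a four-letter alphabet only needs four integers (0-3): the four letters are first mapped to 0, 1, 2, 3, so the original 30-bit encoding becomes a 20-bit encoding, and the mask becomes 0xfffff (binary 00000000 00001111 11111111 11111111). The code is as follows: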

class Solution {
    
    public List<String> findRepeatedDnaSequences( String s){
        
        List<String> result = new ArrayList<>();
        if (s == null || s.length() < 10)
            return result;
        int[] map = new int[26];
        map['A'-'A'] = 0;
        map['C'-'A'] = 1;
        map['G'-'A'] = 2;
        map['T'-'A'] = 3;
        int mask = 0xfffff;
        int hash = 1;
        for( int i= 0; i < 9; i ++ )
        {
            hash = (hash << 2 ) | map[s.charAt(i)-'A'];
        }
        byte[] set = new byte[1<<20];
        for( int i = 9; i < s.length(); i ++ )
        {
            hash = ((hash << 2) & mask ) | map[s.charAt(i)-'A'];
            if( set[hash] == 1 )
            {
                result.add( s.substring( i-9, i + 1));
            }
            if( set[hash] < 2)
            {
                set[hash] ++;
            }
        }
        return result;
    }
}
