Leetcode: Repeated DNA Sequence

阿新 • Published: 2017-12-02


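Method 2: A further improvement is to use a HashSet. Scan the string once in O(N) time, taking the length-10 substring at each position, and add any substring that repeats to result. This needs O(N) space, though: more precisely O(N*10 bytes), and since a Java char is 2 bytes, O(N*20 bytes). Once the input string gets large, this hits the memory limit (MLE).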
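Method 2 can be sketched roughly as follows. This is a minimal illustration, not the post's own code: the class name SubstringSetSketch and the method name findRepeated are made up for the example, and the input is assumed to be a non-null string.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SubstringSetSketch {
    // Returns every length-10 substring that occurs more than once in s.
    static List<String> findRepeated(String s) {
        Set<String> seen = new HashSet<>();      // every window seen so far
        Set<String> repeated = new HashSet<>();  // windows seen at least twice
        for (int i = 0; i + 10 <= s.length(); i++) {
            String window = s.substring(i, i + 10);
            // add() returns false when the window is already in the set
            if (!seen.add(window)) {
                repeated.add(window);
            }
        }
        return new ArrayList<>(repeated);
    }

    public static void main(String[] args) {
        // The order of the printed list is not guaranteed, since it comes from a HashSet.
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```

The seen set holds one 10-character String per position, which is where the O(N*20 bytes) estimate above comes from.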

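Optimal solution: build on Method 2 with bit operations. The idea is to map each substring to an integer, then use shifts and bitwise AND on that integer to obtain the value for the next substring. Bit operations are cheap, so this approach also saves computation time.

First, encode A, C, G, and T in binary: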

A -> 00

C -> 01

G -> 10

T -> 11

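Under this encoding, every 10-character string corresponds to one number, and a 10-character string takes 20 bits. An int is normally 4 bytes, or 32 bits, so a single int is enough to represent a 10-character string. For example: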

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000
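The two examples above can be reproduced with a small sketch. The names DnaEncodeSketch, code, encode, and toBits are hypothetical helpers introduced only for illustration, and the input is assumed to contain only A, C, G, and T.

```java
class DnaEncodeSketch {
    // 2-bit code for one nucleotide, following the table above.
    static int code(char c) {
        switch (c) {
            case 'A': return 0; // 00
            case 'C': return 1; // 01
            case 'G': return 2; // 10
            case 'T': return 3; // 11
            default: throw new IllegalArgumentException("not a nucleotide: " + c);
        }
    }

    // Packs a string of up to 16 nucleotides into an int, 2 bits per character.
    static int encode(String seq) {
        int value = 0;
        for (int i = 0; i < seq.length(); i++) {
            value = (value << 2) | code(seq.charAt(i));
        }
        return value;
    }

    // Renders the low 20 bits of value as a zero-padded binary string.
    static String toBits(int value) {
        String bits = Integer.toBinaryString(value & ((1 << 20) - 1));
        StringBuilder sb = new StringBuilder();
        for (int i = bits.length(); i < 20; i++) {
            sb.append('0');
        }
        return sb.append(bits).toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits(encode("ACGTACGTAC"))); // prints 00011011000110110001
        System.out.println(toBits(encode("AAAAAAAAAA"))); // prints 00000000000000000000
    }
}
```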

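Each time the window slides one character to the right, shift the string's int value left by 2 bits, set its lowest 2 bits to the code of the new character, and finally clear the 2 bits that moved past bit 20, so that only the low 20 bits remain.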
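The update step can be written as a single expression. This is a sketch under the encoding above; RollingWindowSketch, MASK, and slide are names made up for the example.

```java
class RollingWindowSketch {
    // Low 20 bits set: keeps exactly 10 characters at 2 bits each.
    static final int MASK = (1 << 20) - 1;

    // window: 20-bit code of the current 10 characters.
    // newCode: 2-bit code (0..3) of the character entering on the right.
    static int slide(int window, int newCode) {
        // shift left by 2, put the new code in the low 2 bits, drop the oldest character
        return ((window << 2) | newCode) & MASK;
    }

    public static void main(String[] args) {
        int acgtacgtac = 0b00011011000110110001;
        // Sliding in G (10) turns ACGTACGTAC into CGTACGTACG.
        int next = slide(acgtacgtac, 2);
        System.out.println(next == 0b01101100011011000110); // prints true
    }
}
```

Without the mask, the oldest character's bits would stay in the int above bit 20, and two identical windows could end up with different values.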

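Cost analysis:

Time complexity: O(N). Each position costs one shift, one add, and one mask.

Space saving: 10 chars originally took 10 bytes (20 bytes as Java chars); encoded, the same 10 chars take 20 bits in total, for O(N*20 bits) overall.

Space complexity: a 20-bit binary number has at most 2^20 combinations, so the HashSet holds at most 2^20 entries, i.e. 1024 * 1024, which is O(1).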

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;

public class Solution {
    public List<String> findRepeatedDnaSequences(String s) {
        ArrayList<String> res = new ArrayList<String>();
        // a repeat needs two different length-10 windows, i.e. at least 11 characters
        if (s==null || s.length()<=10) return res;
        // 2-bit code for each nucleotide
        HashMap<Character, Integer> dict = new HashMap<Character, Integer>();
        dict.put('A', 0);
        dict.put('C', 1);
        dict.put('G', 2);
        dict.put('T', 3);
        HashSet<Integer> set = new HashSet<Integer>(); // codes of the windows seen so far
        HashSet<String> result = new HashSet<String>(); // adding directly to the ArrayList may not avoid duplicates, so collect in a HashSet first
        int hashcode = 0;
        for (int i=0; i<s.length(); i++) {
            // shift in the 2-bit code of the current character
            hashcode = (hashcode<<2) + dict.get(s.charAt(i));
            if (i >= 9) {
                // keep only the low 20 bits, i.e. the last 10 characters
                hashcode &= (1<<20) - 1;
                if (!set.contains(hashcode)) {
                    set.add(hashcode);
                }
                else {
                    // duplicate hashcode: take the window s[i-9..i] and add it to result
                    String temp = s.substring(i-9, i+1);
                    result.add(temp);
                }
            }
        }
        res.addAll(result);
        return res;
    }
}
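Because the window code never exceeds 2^20, the set of seen codes can also be kept in a fixed-size bit array. The variant below is not from the original post: it swaps the HashSet for java.util.BitSet purely to make the O(1) bound concrete. The class and method names are hypothetical, and the input is assumed to contain only A, C, G, and T.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

class BitSetVariantSketch {
    static List<String> findRepeated(String s) {
        List<String> res = new ArrayList<>();
        if (s == null || s.length() <= 10) return res;
        BitSet seen = new BitSet(1 << 20);     // bit k set: code k has appeared
        BitSet reported = new BitSet(1 << 20); // bit k set: code k is already in res
        int mask = (1 << 20) - 1;
        int window = 0;
        for (int i = 0; i < s.length(); i++) {
            // "ACGT".indexOf yields 0..3, matching A=00, C=01, G=10, T=11
            window = ((window << 2) | "ACGT".indexOf(s.charAt(i))) & mask;
            if (i < 9) continue; // the first full window ends at index 9
            if (!seen.get(window)) {
                seen.set(window);
            } else if (!reported.get(window)) {
                reported.set(window);
                res.add(s.substring(i - 9, i + 1));
            }
        }
        return res;
    }

    public static void main(String[] args) {
        System.out.println(findRepeated("AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"));
    }
}
```

Each BitSet takes 2^20 bits, about 128 KB, regardless of the input length. The trade-off is that this memory is allocated up front even for short inputs.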
