[LeetCode 187] Repeated DNA Sequences

Hash Table Bit Manipulation

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T,
for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA. Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Diffculty
Medium

Similar Problems

Analysis

Naive 方法就是两层循环，外层for(int i=0; i<=s.length()-10; i++), 内层for(int j=i+1; j<=s.length()-10; j++), 比较两个字符串s.substring(i, i+10)和s.substring(j, j+10)是否equal, 是的话，加入到result里，这样两层循环再加equals()时间复杂度应该到O(N^3)了，TLE了

方法2：进一步的方法是用HashSet, 每次取长度为10的字符串，O(N)时间遍历数组，重复就加入result，但这样需要O(N)的space。String一大就MLE

方法3：在方法2基础上用bit operation，大概思想是把字符串映射为整数，对整数进行移位以及位与操作，以获取相应的子字符串。众所周知，位操作耗时较少，所以这种方法能节省运算时间。所以是最优解法。

首先考虑将ACGT进行二进制编码 A -> 00 C -> 01 G -> 10 T -> 11

在编码的情况下，每10位字符串的组合即为一个数字，且10位的字符串有20位；一般来说int有4个字节，32位，即可以用于对应一个10位的字符串。例如 ACGTACGTAC -> 00011011000110110001 AAAAAAAAAA -> 00000000000000000000

每次向右移动1位字符，相当于字符串对应的int值左移2位，再将其最低2位置为新的字符的编码值，最后将高2位置0。 Cost分析：

时间复杂度O（N）空间复杂度：20位的二进制数，至多有2^20种组合，因此HashSet的大小为2^20，即1024 * 1024，O(1) 常数空间

class Solution { public: vector findRepeatedDnaSequences(string s) { vector> res; if (s.empty() || s.length() <= 10) return res; std::unordered_map dict; dict['A'] = 0; dict['C'] = 1; dict['G'] = 2; dict['T'] = 3; std::unordered_set st; std::unordered_set result; //directly use arraylist to store result may not avoid duplicates, so use hashset to preselect int hashcode = 0; for (int i=0; i<s.length(); i++) { if (i < 9) { hashcode = (hashcode<<2) + dict.get(s[i]); } else { hashcode = (hashcode<<2) + dict.get(s[i]); hashcode &= (1<<20) - 1; if (!set.contains(hashcode)) { set.add(hashcode); } else { //duplicate hashcode, decode the hashcode, and add the string to result string temp = s.substr(i - 9, i + 1); result.add(temp); } } } for (auto item : result) { res.add(item); } return res; } };

// naive方法： public class Solution { public List findRepeatedDnaSequences(String s) { ArrayList res = new ArrayList(); if (s==null || s.length()<=10) return res; for (int i=0; i<=s.length()-10; i++) { String cur = s.substring(i, i+10); for (int j=i+1; j<=s.length()-10; j++) { String comp = s.substring(j, j+10); if (cur.equals(comp)) { res.add(cur); break; } } } return res; } }

4.5 (187) Repeated D N a Sequences

[LeetCode 187] Repeated DNA Sequences

Analysis

results matching ""

No results matching ""