LeetCode - 187. Repeated DNA Sequences

Category: Algorithm
Difficulty: Medium
Tags: Hash Table, Bit Manipulation
Discription: https://leetcode.com/problems/repeated-dna-sequences/#/description

這題給一組DNA字串，我們要檢查其中出現兩次以上的10字母長度之序列。
概念就是雜湊表，如何製作良好的雜湊表即為本題的演算關鍵。


class Solution {

public:

    vector<string> findRepeatedDnaSequences(string s) {

        vector<string> result = {};

        //for(vector<string>::iterator it=result.begin(); it!=result.end(); it++) cout<<*it<<endl;

        unordered_map<int, int> hash_table; // unordered_map is faster than map

      

        int checklen = 10;

        int slen = s.length();

        int index;

      

        for(int i=0; i<slen-10+1; i++){

            index=my_hash(s.substr(i, checklen));

            hash_table[index]++;

            if(hash_table[index]==2){

                result.push_back(s.substr(i, checklen));

            }

        }

        return result;

    }

  

    int my_hash(string ss){

        int result_value=0;

        result_value += (int(ss[0])&7);

        for(int i=1; i<ss.length(); i++){

            result_value = result_value << 3;

            result_value += (int(ss[i])&7);

        }

        return result_value;

    }

};

雜湊表的製作最重要的就是「不可碰撞」，其次才是算式的優化。
以下為優化過的的計算模式，依ASCII表得到各個字母的十進位值，再模除7得出：
A：65 mod 7 = 2
C：67 mod 7 = 4
G：71 mod 7 = 1
T：73 mod 7 = 3
如此一來，便可以最低30位元製作10字母之基因序列雜湊表，
例如：ACGTAACGCT之雜湊為010 100 001 011 010 010 100 001 100 011
為了最佳化演算程序，因此以位元的方式進行計算，
輸入三個位元，再左移三個位元(乘以2的3次方)，再輸入三個位元，再左移‧‧‧‧‧‧
如此反覆迴圈製作出雜湊表，並以此表記錄相同雜湊的次數，
最後在檢視雜湊表中重複出現的雜湊，推入堆疊輸出即可。

搜尋此網誌

TIME RECORD

How to Implement a Website with MVC Architecture using PHP

LeetCode - 187. Repeated DNA Sequences

留言

張貼留言