
609. Find Duplicate File in System(python+cpp)

Problem:

Given a list of directory info including directory path, and all the files with contents in this directory, you need to find out all the groups of duplicate files in the file system in terms of their paths.
A group of duplicate files consists of at least two files that have exactly the same content.
A single directory info string in the input list has the following format:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"


It means there are n files (f1.txt, f2.txt ... fn.txt with content f1_content, f2_content ... fn_content, respectively) in directory root/d1/d2/.../dm. Note that n >= 1 and m >= 0. If m = 0, it means the directory is just the root directory.
The output is a list of group of duplicate file paths. For each group, it contains all the file paths of the files that have the same content. A file path is a string that has the following format:
"directory_path/file_name.txt"
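
To make the two formats concrete, here is a minimal Python sketch that turns one directory info string into (full path, content) pairs. The helper name parse_directory_info is hypothetical and not part of the problem statement; the sketch assumes, as the problem's note guarantees, that names and contents contain only letters and digits, so "(" and ")" appear only as delimiters.

```python
def parse_directory_info(info):
    """Split one directory info string into (full_path, content) pairs.

    parse_directory_info is a hypothetical helper used for illustration only.
    """
    # The first token is the directory path; the rest are "name.txt(content)".
    directory, *files = info.split(" ")
    pairs = []
    for entry in files:
        name, _, rest = entry.partition("(")
        content = rest[:-1]  # drop the trailing ")"
        pairs.append((directory + "/" + name, content))
    return pairs


print(parse_directory_info("root/a 1.txt(abcd) 2.txt(efgh)"))
# [('root/a/1.txt', 'abcd'), ('root/a/2.txt', 'efgh')]
```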

Example 1:

Input: ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"] 
Output:   [["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]] 

Note:
 No order is required for the final output.
 You may assume the directory name, file name and file content only has letters and digits, and the length of file content is in the range of [1,50].
 The number of files given is in the range of [1,20000].
 You may assume no files or directories share the same name in the same directory.
 You may assume each given directory info represents a unique directory.
 Directory path and file info are separated by a single blank space.
Follow-up beyond contest:
 Imagine you are given a real file system, how will you search files? DFS or BFS?
 If the file content is very large (GB level), how will you modify your solution?
 If you can only read the file by 1kb each time, how will you modify your solution?
 What is the time complexity of your modified solution?
 What is the most time-consuming part and memory consuming part of it?
 How to optimize?
 How to make sure the duplicated files you find are not false positive?
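
The follow-up questions are left open by the problem. One common approach, sketched here under stated assumptions rather than as the intended answer, is to group real files by size first and only then compare a digest computed in 1 KB chunks, so no file is ever loaded into memory whole. The names digest_in_chunks and find_duplicates_on_disk are hypothetical, SHA-256 is an arbitrary choice of hash, and os.walk simply stands in for whichever traversal (DFS or BFS) is preferred.

```python
import hashlib
import os
from collections import defaultdict


def digest_in_chunks(path, chunk_size=1024):
    """Return the SHA-256 hex digest of a file, reading chunk_size bytes at a time."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.hexdigest()


def find_duplicates_on_disk(root):
    """Group files under root that have the same size and the same digest."""
    by_size = defaultdict(list)
    for directory, _subdirs, filenames in os.walk(root):
        for filename in filenames:
            full_path = os.path.join(directory, filename)
            by_size[os.path.getsize(full_path)].append(full_path)

    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate, so skip hashing
        by_digest = defaultdict(list)
        for full_path in same_size:
            by_digest[digest_in_chunks(full_path)].append(full_path)
        groups.extend(group for group in by_digest.values() if len(group) > 1)
    return groups
```

In this sketch, hashing dominates the running time because every byte of each candidate file is read once, while memory stays bounded by the chunk size plus the path lists. A matching digest is strong evidence rather than proof, so a final byte-by-byte comparison within each group would be needed to rule out false positives.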

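Explanation:
Find the paths of all files that have identical content, and put the paths of each such set of files into one group.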
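The solution uses a dictionary, specifically collections.defaultdict(). With a plain dict, looking up a key that does not exist raises KeyError; defaultdict instead supplies a default value for a missing key. That is convenient here because a plain dict lets you assign to a new key with "=", but you cannot call append on a key that does not exist yet, whereas with defaultdict(list) you can. _dict maps each key to a list, i.e. it is a (key, []) style dictionary. A plain dict also works with a little more code: dupli_file[content] = dupli_file.get(content, []) + [di+"/"+filename], which relies on dict.get() and list concatenation.
Reading defaultdict['missing key'] returns the default value, much like dict.get(missing key, default), except that defaultdict also stores the missing key with that default.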
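
The difference between the two dictionary styles can be checked in a few lines of Python. This is only an illustrative sketch; the keys and paths are made-up sample values.

```python
from collections import defaultdict

plain = {}
try:
    plain["abcd"].append("root/a/1.txt")  # key is missing, so this raises
except KeyError:
    print("plain dict: KeyError")

# Plain-dict alternative: get() with a default plus list concatenation.
plain["abcd"] = plain.get("abcd", []) + ["root/a/1.txt"]
plain["abcd"] = plain.get("abcd", []) + ["root/c/3.txt"]

grouped = defaultdict(list)
grouped["abcd"].append("root/a/1.txt")  # a missing key starts out as []
grouped["abcd"].append("root/c/3.txt")

print(plain["abcd"] == grouped["abcd"])  # True

# Unlike get(), indexing a defaultdict stores the missing key.
print(plain.get("zzz", []), "zzz" in plain)  # [] False
print(grouped["zzz"], "zzz" in grouped)      # [] True
```
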
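The partition() method splits a string at a given separator.
If the string contains the separator, it returns a 3-tuple: the substring to the left of the separator, the separator itself, and the substring to the right of it. The difference from split is that split breaks the string at every occurrence of the separator, while partition breaks it only at the first one.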
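
A short Python comparison of partition and split on the kind of token this problem produces. The sample strings are chosen for illustration only.

```python
entry = "1.txt(abcd)"

print(entry.partition("("))  # ('1.txt', '(', 'abcd)')
print(entry.split("("))      # ['1.txt', 'abcd)']

# With several separators, partition stops at the first one.
path = "root/c/d"
print(path.partition("/"))   # ('root', '/', 'c/d')
print(path.split("/"))       # ['root', 'c', 'd']

# Separator absent: the whole string comes back first, then two empty strings.
print(entry.partition("#"))  # ('1.txt(abcd)', '', '')
```
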
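The text above is the set of notes I wrote the first time I solved this problem. Looking at it now, emmmmm, doing more problems really does improve your coding ability.
Python code: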

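from collections import defaultdict
class Solution(object):
    def findDuplicate(self, paths):
        """
        :type paths: List[str]
        :rtype: List[List[str]]
        """
        # Maps file content -> full paths of all files with that content.
        _dict=defaultdict(list)
        for path in paths:
            # First token is the directory, the rest are "name.txt(content)".
            path=path.split()
            root=path[0]
            for _file in path[1:]:
                # root2 is the file name; content still ends with ')'.
                root2,content=_file.split('(')
                _dict[content[:-1]].append(root+'/'+root2)
        # Keep only contents shared by at least two files.
        return [value for value in _dict.values() if len(value)>1]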

C++ code:

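#include <map>
#include <sstream>
#include <string>
#include <vector>
using namespace std;
class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        // Maps file content -> full paths of all files with that content.
        map<string,vector<string>> _map;
        for (const auto& path:paths)
        {
            istringstream _path(path);
            string word="";
            string root="";
            // Tokens are separated by single spaces; the first one is the directory.
            while(getline(_path,word,' '))
            {
                if (root=="")
                    root=word;
                else
                {
                    // Split "name.txt(content)" at '(': the name first, then "content)".
                    string root2="",content="";
                    string tmp="";
                    istringstream _word(word);
                    while(getline(_word,tmp,'('))
                    {
                        if (root2=="")
                            root2=tmp;
                        else
                            content=tmp;
                    }
                    // Drop the trailing ')' before using the content as the key.
                    _map[content.substr(0,content.size()-1)].push_back(root+"/"+root2);
                }
            }
        }
        vector<vector<string>> result;
        // Keep only contents shared by at least two files.
        for(const auto& item:_map)
        {
            if (item.second.size()>1)
                result.push_back(item.second);
        }
        return result;
    }
};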

Summary:
Still looking for a simpler way to split strings in C++…