How to Crawl Data from Google Map

阿新 • Published: 2019-02-04

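In this post I describe, from an experimental point of view and starting from the difficulties involved, how to crawl map data from Google Map. The post is meant for experimentation, not for commercial use. Google Map holds the rights to its own data; I hope readers will treat this post as a learning exercise and will not use any crawled data commercially. I accept no responsibility for disputes arising from such use. I also hope that, after reading this post, those concerned can further improve the security of their data.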

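A GIS project in my lab needed map data of a certain order of magnitude. With no better option available, I started crawling Google Map to collect a certain amount of data for experiments.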

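How crawling Google Map works

Google Map uses the Mercator coordinate system. What is the Mercator coordinate system? See {連結} for details. Google Map also organizes its tile files as a pyramid model. How its back end processes, stores, or names those files I cannot know; I can only analyze the URLs and similar clues to work out roughly how the map files are organized. In the pyramid model the map is divided into levels, and each level has 4 times the resolution of the level above it (2 times horizontally and 2 times vertically). The resolution of a level is therefore enormous and grows exponentially with depth. Returning a whole level to the user as a single file would be beyond what network transfer, CPU processing, and memory can handle, and in any case the user only ever views one region of one level. So the map data is generally cut into tiles of equal resolution. It follows that each level contains 4 times as many files as the level above it.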
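To get a feel for how quickly the pyramid grows, the "4 times as many files per level" rule can be turned into a small calculation. This is only a sketch: the class and method names are made up for illustration, and it assumes the top level (level 0) holds exactly one tile.

```java
// Hypothetical helper, not part of the original crawler.
final class TilePyramid {

    /** Number of tiles at a level: 2^level columns times 2^level rows. */
    static long tilesAtLevel(int level) {
        return 1L << (2 * level);
    }

    /** Total number of tiles from minLevel to maxLevel, inclusive. */
    static long tilesInRange(int minLevel, int maxLevel) {
        long total = 0;
        for (int z = minLevel; z <= maxLevel; z++) {
            total += tilesAtLevel(z);
        }
        return total;
    }

    public static void main(String[] args) {
        for (int z = 0; z <= 3; z++) {
            System.out.println("level " + z + ": " + tilesAtLevel(z) + " tiles");
        }
        System.out.println("levels 0..10: " + tilesInRange(0, 10) + " tiles");
    }
}
```

Under that assumption, levels 0 through 10 already add up to well over a million tiles, which is why the crawl has to be batched, parallelized, and resumable.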

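I used Google Chrome to inspect the Resources loaded by Google Map, as shown in the figure below:

We can clearly see that Google Map does not load the map as one whole image. It is split into tiles, each with a resolution of 256*256. We also get the address of each tile, for example http://mt0.google.com/[email protected]&hl=zh-CN&src=app&x=1&y=1&z=1&s=Ga.png. Here x and y are the parameters that determine the position of the tile's top-left corner, and z determines the level the tile belongs to. By sending requests to the Google Map server we find that level 0 has 1 tile. Level `level` therefore has 2^level*2^level tiles, that is, x and y range over [0, 2^level-1]. In the author's simplified model, each tile at level `level` spans 360/2^level degrees of longitude horizontally and 180/2^level degrees of latitude vertically.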

x=0&y=0&z=0

x=0&y=0&z=1	x=1&y=0&z=1

x=0&y=1&z=1	x=1&y=1&z=1

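So the tile with x=xx, y=yy, z=zz lies on level zz. Each tile on that level spans 360/2^zz degrees of longitude and 180/2^zz degrees of latitude, and the longitude and latitude of its top-left corner are (360/2^zz*xx-180, 180/2^zz*yy-90). Conversely, from the top-left longitude and latitude of a tile we can work back to its x and y on level zz. This is the principle behind crawling data from Google Map. Note that the linear latitude relation is a simplification: in a Mercator projection the tile rows are evenly spaced in projected y rather than in degrees of latitude, so only the longitude relation holds exactly.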
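For reference, the conversion between tile indices and coordinates can be sketched as below. This is not the article's linear formula: it uses the widely used Web Mercator ("slippy map") tile formulas, which is an assumption about the tiling scheme on my part. The class and method names are hypothetical, and row 0 is taken to be the northernmost row, whereas the article's own code keeps rows bottom-up and flips them (the `yyg` variable) before building the request.

```java
// Hypothetical helper using the common Web Mercator tile formulas.
final class TileMath {

    /** Longitude (degrees) of the left edge of tile column x at zoom z. */
    static double tileToLon(int x, int z) {
        return x / Math.pow(2, z) * 360.0 - 180.0;
    }

    /** Latitude (degrees) of the top edge of tile row y at zoom z (row 0 = north). */
    static double tileToLat(int y, int z) {
        double n = Math.PI * (1.0 - 2.0 * y / Math.pow(2, z));
        return Math.toDegrees(Math.atan(Math.sinh(n)));
    }

    /** Tile column containing the given longitude at zoom z. */
    static int lonToTileX(double lon, int z) {
        int n = 1 << z;
        int x = (int) Math.floor((lon + 180.0) / 360.0 * n);
        return Math.max(0, Math.min(n - 1, x)); // clamp to the valid range
    }

    /** Tile row containing the given latitude at zoom z. */
    static int latToTileY(double lat, int z) {
        int n = 1 << z;
        double rad = Math.toRadians(lat);
        double y = (1.0 - Math.log(Math.tan(rad) + 1.0 / Math.cos(rad)) / Math.PI) / 2.0 * n;
        return Math.max(0, Math.min(n - 1, (int) Math.floor(y))); // clamp to the valid range
    }

    public static void main(String[] args) {
        System.out.println(tileToLon(0, 0) + ", " + tileToLat(0, 0));
        System.out.println(lonToTileX(0.0, 1) + ", " + latToTileY(0.0, 1));
    }
}
```

With these formulas the top edge of the single level-0 tile sits near 85.05 degrees north rather than 90, which is the practical difference from the linear model.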

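What makes crawling Google Map difficult?

1. In mainland China, for political and other reasons, connections to Google's servers get interrupted.

2. Google's web servers, or Google's firewall, keep statistics on the requests from each client. If the number of requests within a period exceeds some threshold, later requests are simply ignored. Reportedly, once a single IP has sent more than 7000 requests in one day, all further requests from it are ignored.

3. A single thread is too slow; with multiple threads the throughput improves greatly.

4. Google's servers inspect every request to decide whether it comes from a browser or from a crawler.

5. Files that have already been downloaded must not be downloaded again, so the crawler needs a "resume" capability. Progress must not be lost because of a network interruption or a manual stop.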

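Solutions to these difficulties

1. For the first difficulty, we can use a server outside China as a proxy and request Google Map through it. As far as the famous GFW is concerned, we are not connecting to Google's servers but to some other server. As long as that server is not blocked, we can keep downloading.

2. For the second difficulty, we can again use proxies. Once a download fails, the proxy's IP may already have been blocked by Google Map, and we need to switch to another proxy. If the proxy is slow, or downloads through it time out often, the network link between the proxy and our machine may be poor, or the proxy may be heavily loaded. In that case we also need to switch proxies.

3. A single thread is too slow, so we need multiple threads. Because each file is very small, the threading scheme lets each thread download a batch of files, and the batches handled by different threads do not overlap.

4. For the fourth difficulty, we can set "User-Agent" when creating the HTTP connection, for example:

httpConnection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");

5. For the fifth difficulty, before downloading each file we first check whether it has already been completed. There are many ways to do this; here I use file.exists(). Compared with downloading a file, checking whether a file exists on the file system is far cheaper.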
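The resume check from item 5 can be sketched as follows. This is an illustration under assumptions rather than the article's code: it uses java.nio instead of file.exists(), the class and method names are hypothetical, and it writes each tile to a temporary path first so that an interrupted download never leaves a half-written file where the check would mistake it for a finished one (the article's crawler follows the same idea with its tmp and download directories).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper illustrating the "skip what is already there" check.
final class ResumableStore {

    /** A tile needs downloading only if its final file is absent. */
    static boolean needsDownload(Path finalFile) {
        return !Files.exists(finalFile);
    }

    /** Moves a fully written temporary file into its final place. */
    static void commit(Path tmpFile, Path finalFile) throws IOException {
        Files.createDirectories(finalFile.getParent());
        Files.move(tmpFile, finalFile, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("tiles");
        Path tmp = root.resolve("tmp").resolve("0_0.png");
        Path dst = root.resolve("download").resolve("0").resolve("0_0.png");

        System.out.println("needs download before: " + needsDownload(dst));
        Files.createDirectories(tmp.getParent());
        Files.write(tmp, new byte[] {1, 2, 3}); // stand-in for the downloaded bytes
        commit(tmp, dst);
        System.out.println("needs download after: " + needsDownload(dst));
    }
}
```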

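Improvements and implementation

1. Obtaining proxies

There are many ways to obtain proxies. But if all the proxies are configured once at startup, the system can no longer run when every one of them has stopped working. We also do not want the trouble of replacing proxies by hand over and over. I am a lazy man, so the computer should replace them itself. Here I use www.18daili.com, which publishes the proxies it has collected as a web page. We can download that page and parse it to get the latest usable proxies. I use Dom4J to parse the page.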
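As an illustration of turning such a page into a list of proxies, here is a minimal sketch. It is not the article's approach: instead of Dom4J it uses a regular expression from the standard library, and it assumes the IP address and port sit in two adjacent `<td>` cells. The sample markup and the class name are invented for the example, and a real page may be laid out differently.

```java
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser; the table layout it expects is an assumption.
final class ProxyListParser {

    // Two adjacent table cells: an IPv4 address followed by a port number.
    private static final Pattern ROW = Pattern.compile(
            "<td>\\s*(\\d{1,3}(?:\\.\\d{1,3}){3})\\s*</td>\\s*<td>\\s*(\\d{1,5})\\s*</td>");

    static List<Proxy> parse(String html) {
        List<Proxy> proxies = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            int port = Integer.parseInt(m.group(2));
            if (port < 1 || port > 65535) {
                continue; // skip rows whose port is out of range
            }
            // createUnresolved avoids a DNS lookup while parsing
            proxies.add(new Proxy(Proxy.Type.HTTP,
                    InetSocketAddress.createUnresolved(m.group(1), port)));
        }
        return proxies;
    }

    public static void main(String[] args) {
        String sample = "<table id=\"proxyListTable\"><tr><th>IP</th><th>Port</th></tr>"
                + "<tr><td>192.0.2.10</td><td>8080</td></tr>"
                + "<tr><td>198.51.100.7</td><td>3128</td></tr></table>";
        System.out.println(parse(sample));
    }
}
```

A regular expression tolerates malformed HTML better than a strict XML reader, at the cost of being tied to one particular cell layout.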

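2. Architecture

The system is divided into three modules: Downloader, DownloadThread, and ProxyConfig. Downloader initializes the thread pool that holds the DownloadThread instances. Each DownloadThread is responsible for downloading its own batch of tiles. It gets a proxy from ProxyConfig, checks the file system to see whether a given file has already been downloaded, and saves completed files to the file system according to a fixed naming rule. ProxyConfig refreshes the current proxy list from www.18daili.com; in my system it refreshes once every 1024 proxy lookups.

Source code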
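The way Downloader hands out work can be sketched in isolation: every (x, y) pair of a level is placed into exactly one batch, so no two threads ever download the same tile. This is an illustrative sketch with hypothetical names, not the Downloader class itself; it keeps the article's {y, x, z} ordering inside each request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper showing how a level is split into disjoint batches.
final class TileBatches {

    /** Splits all tiles of one level into batches of at most batchSize requests. */
    static List<int[][]> partition(int level, int batchSize) {
        int n = 1 << level; // tiles per axis
        List<int[][]> batches = new ArrayList<>();
        List<int[]> current = new ArrayList<>();
        for (int x = 0; x < n; x++) {
            for (int y = 0; y < n; y++) {
                current.add(new int[] {y, x, level});
                if (current.size() == batchSize) {
                    batches.add(current.toArray(new int[0][]));
                    current = new ArrayList<>();
                }
            }
        }
        if (!current.isEmpty()) {
            batches.add(current.toArray(new int[0][])); // last, partially filled batch
        }
        return batches;
    }

    public static void main(String[] args) {
        List<int[][]> batches = partition(2, 5);
        for (int i = 0; i < batches.size(); i++) {
            System.out.println("batch " + i + ": " + batches.get(i).length + " tiles");
        }
    }
}
```

Each batch could then be wrapped in a worker and submitted to a fixed-size thread pool.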

Downloader:

package ??;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Downloader {

	private static int minLevel = 0;
	private static int maxLevel = 10;
	private static String dir = "D:\\data\\google_v\\";
	private static int maxRunningCount = 16;
	private static int maxRequestLength = 100;

	public static void download() {
		ExecutorService pool = Executors.newFixedThreadPool(maxRunningCount);

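		for (int z = minLevel; z <= maxLevel; z++) {
			int batch = 0; // sequence number of the batch within this level, used in the thread name
			int curDt = 0; // number of requests collected in the current batch
			int[][] requests = new int[maxRequestLength][3];
			int maxD = 1 << z; // number of tiles along each axis at level z
			for (int x = 0; x < maxD; x++) {
				for (int y = 0; y < maxD; y++) {
					requests[curDt][0] = y;
					requests[curDt][1] = x;
					requests[curDt][2] = z;
					curDt++;
					if (curDt == maxRequestLength) {
						// The batch is full: hand it to the pool and start a new one
						pool.execute(new DownloadThread("dt_" + z + "_" + batch++, dir, requests));
						requests = new int[maxRequestLength][3];
						curDt = 0;
					}
				}
			}
			if (curDt > 0) {
				// Submit the last, partially filled batch, trimmed to its real length
				// so that no empty {0, 0, 0} entries are requested
				pool.execute(new DownloadThread("dt_" + z + "_" + batch, dir, java.util.Arrays.copyOf(requests, curDt)));
			}
		}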

		pool.shutdown();
	}

	public static void main(String[] strs) {
		download();
	}
}

DownloadThread:

package ??;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.Date;


public class DownloadThread extends Thread {
	private static int BUFFER_SIZE = 1024 * 8;// buffer size
	private static int MAX_TRY_DOWNLOAD_TIME = 128;
	private static int CURRENT_PROXY = 0;
	private String threadName = "";
	private String dir;
	// private int level;
	private String tmpDir;
	private Proxy proxy;
	private int[][] requests;
	private String ext = ".png";
	private static SimpleDateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

	public DownloadThread(String threadName, String dir, int[][] requests) {
		this.threadName = threadName;
		this.dir = dir;

		this.requests = requests;
	}

	@Override
	public void run() {
		Date now = new Date();
		System.out.println(dateFormat.format(now) + "\t" + threadName + ":\tstarted");
		long t1 = System.currentTimeMillis();
		long totalLength = download();
		long t2 = System.currentTimeMillis();
		double speed = (double) totalLength / (t2 - t1);
		now = new Date();
		if (speed < 0.5) {
			CURRENT_PROXY++;
		}
		System.out.println(dateFormat.format(now) + "\t" + threadName + ":\tfinished\t" + speed + "kB/s");
	}

	public long download() {
		long totalLength = 0;
		if (requests == null) {
			return 0;
		}
		//System.out.println(requests.length);
		for (int i = 0; i < requests.length; i++) {
			int yy = requests[i][0];

			int xx = requests[i][1];
			int zz = requests[i][2];
			int yyg = (int) (Math.pow(2, zz) - 1 - requests[i][0]);
			this.tmpDir = dir + "/tmp/" + zz + "/";
			File tmpDirFile = new File(tmpDir);
			if (tmpDirFile.exists() == false) {
				tmpDirFile.mkdirs();
			}
			String dirStr = dir + "/download/" + zz + "/" + yy + "/";
			File fileDir = new File(dirStr);
			if (fileDir.exists() == false) {
				fileDir.mkdirs();
			}
			String fileStr = dirStr + yy + "_" + xx + ext;
			File file = new File(fileStr);
			// double lat1 = (yy) * dDegree - 90;
			// double lat2 = (yy + 1) * dDegree - 90;
			// NOTE: "[email protected]" below appears to be what e-mail obfuscation left of the
			// original layer parameter; the real value cannot be recovered from this copy.
			String url = "http://mt0.google.com/vt/[email protected]&hl=zh-CN&src=app&x=" + xx + "&y=" + yyg + "&z=" + zz
					+ "&s=";
			// System.out.println(url);
			if (file.exists() == false) {
				// Download to the temporary directory first, then move into place
				String tmpFileStr = tmpDir + yy + "_" + xx + ext;
				boolean r = saveToFile(url, tmpFileStr);
				if (r == true) {
					totalLength += cut(tmpFileStr, fileStr);
					Date now = new Date();
					System.out.println(dateFormat.format(now) + "\t" + threadName + ":\t" + zz + "\\" + yy + "_" + xx + ext + "\t"+proxy+"\tdone!");
				} else {
					Date now = new Date();
					System.out.println(dateFormat.format(now) + "\t" + threadName + ":\t" + zz + "\\" + yy + "_" + xx + ext + "\t"+proxy+"\tfailed!");
				}
			} else {
				Date now = new Date();
				System.out.println(dateFormat.format(now) + "\t" + threadName + ":\t" + zz + "\\" + yy + "_" + xx + ext + " already downloaded!");
			}
		}
		return totalLength;
	}

	public static long cut(String srcFileStr, String descFileStr) {
		File srcFile = new File(srcFileStr);
		File descFile = new File(descFileStr);
		try {
			if (srcFile.exists()) { // only copy when the temporary file exists
				// Both streams are closed automatically, even if the copy fails
				try (InputStream is = new FileInputStream(srcFile); // read the source file
						FileOutputStream os = new FileOutputStream(descFile)) {
					byte[] buffer = new byte[1024 * 32];
					int byteread;
					while ((byteread = is.read(buffer)) != -1) {
						os.write(buffer, 0, byteread);
					}
				}
			}
			// Remove the temporary file and report the size of the final one
			srcFile.delete();
			return descFile.length();
		} catch (Exception e) {
			System.out.println("Failed to copy the file");
			e.printStackTrace();
		}
		return 0;
	}

	public boolean saveToFile(String destUrl, String fileName) {
		int currentTime = 0;
		while (currentTime < MAX_TRY_DOWNLOAD_TIME) {
			currentTime++;
			HttpURLConnection httpConnection = null;
			try {
				// Open the connection, through the current proxy if there is one
				URL url = new URL(destUrl);
				proxy = ProxyConfig.getProxy(CURRENT_PROXY);
				if (proxy == null) {
					httpConnection = (HttpURLConnection) url.openConnection();
				} else {
					httpConnection = (HttpURLConnection) url.openConnection(proxy);
				}

				httpConnection.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)");
				httpConnection.setConnectTimeout(60000);
				httpConnection.setReadTimeout(60000);

				// Connect to the resource
				httpConnection.connect();

				// Copy the network input stream into the file; both streams are
				// closed even if the copy fails part-way
				try (BufferedInputStream bis = new BufferedInputStream(httpConnection.getInputStream());
						FileOutputStream fos = new FileOutputStream(fileName)) {
					byte[] buf = new byte[BUFFER_SIZE];
					int size;
					while ((size = bis.read(buf)) != -1) {
						fos.write(buf, 0, size);
					}
				}
				// Success is reported explicitly, so it also counts on the last attempt
				return true;
			} catch (Exception e) {
				// This attempt failed: move on to the next proxy and retry
				CURRENT_PROXY++;
			} finally {
				if (httpConnection != null) {
					httpConnection.disconnect();
				}
			}
		}
		return false;
	}

}

ProxyConfig:

package org.gfg.downloader.google.vctor;

import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.Proxy.Type;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

public class ProxyConfig {

	private static List<Proxy> proxies;

	private static int getTime = 0;

	@SuppressWarnings("unchecked")
	public static void inital() {
		// if (proxies == null) {
		proxies = null;
		proxies = new ArrayList<Proxy>();
		// } else {
		// proxies.clear();
		// }
		try {

			URL url = new URL("http://www.18daili.com/");
			URLConnection urlConnection = url.openConnection();
			urlConnection.setConnectTimeout(30000);
			urlConnection.setReadTimeout(30000);
			SAXReader reader = new SAXReader();
			// System.out.println(url);
			reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
			Document doc = reader.read(urlConnection.getInputStream());
			if (doc != null) {
				Element root = doc.getRootElement();

				Element proxyListTable = getElementById(root, "proxyListTable");
				// System.out.println(proxyListTable.asXML());
				Iterator<Element> trs = proxyListTable.elementIterator();
				trs.next();
				while (trs.hasNext()) {
					Element tr = trs.next();
					Iterator<Element> tds = tr.elementIterator();
					String ip = tds.next().getText();
					String port = tds.next().getText();
					// System.out.println(ip+":"+port);
					Proxy proxy = new Proxy(Type.HTTP, new InetSocketAddress(ip, Integer.valueOf(port)));
					proxies.add(proxy);
					System.out.println("Added proxy\t" + proxy);
				}
			}
		} catch (Exception e) {
			// e.printStackTrace();
		}

	}

	private static Element getElementById(Element element, String id) {
		Element needElement = null;
		Iterator<Element> subElements = element.elementIterator();
		while (subElements.hasNext()) {
			Element subElement = subElements.next();
			String getId = subElement.attributeValue("id");
			if (getId != null && getId.equals(id)) {
				needElement = subElement;
				break;
			} else {
				needElement = getElementById(subElement, id);
				if (needElement != null) {
					break;
				}
			}
		}
		return needElement;
	}

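	synchronized public static Proxy getProxy(int i) {
		getTime++;
		// Rebuild the proxy list on first use and then once every 1024 lookups
		if (getTime % 1024 == 0 || proxies == null) {
			inital();
			getTime = 0;
			System.out.println("Proxy list regenerated!");
			System.out.println("There are currently " + proxies.size() + " proxies!");
		}
		// Every eighth index, and whenever no proxy could be fetched, connect directly
		if (i % 8 == 0 || proxies.isEmpty()) {
			return null;
		}
		int index = i % proxies.size();
		index = Math.abs(index);
		return proxies.get(index);
	}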

	public static void main(String... str) {
		inital();
	}

}

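Release and results

All posts on this blog are original work by the author (Jairus Chan).

If you have any comments or suggestions about this post, please contact the author (JairusChan).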
