How to scrape data from the web with MATLAB
阿新 • Published: 2019-01-03
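Many of you have probably heard of web scraping with Python. Today's topic is web scraping with MATLAB. No prior background is needed; the aim is to give you a hands-on taste of grabbing data yourself.
1. Open your installed copy of MATLAB.
2. Create a new script file, then copy and paste the following code into it.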
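clc;
clear;
warning off;
year = 2015;
for season = 1:4
    fprintf('Fetching data for %d, quarter %d... ', year, season)
    % Pass both year and season so that each pass requests a different quarter
    url = sprintf('http://vip.stock.finance.sina.com.cn/corp/go.php/vMS_MarketHistory/stockid/000001/type/S.phtml?year=%d&season=%d', year, season);
    [sourcefile, status] = urlread(url);
    if ~status % status is 0 when the page could not be read
        error('Something went wrong. Please check.')
    end
    expr1 = '\s+(\d\d\d\d-\d\d-\d\d)\s*'; % pattern to extract; the part inside () is what gets captured
    % 'match' returns the whole matched text, 'tokens' returns the text captured by (); both are cell arrays
    [datefile, date_tokens] = regexp(sourcefile, expr1, 'match', 'tokens');
    dates = cell(size(date_tokens)); % cell array of the same size (named dates to avoid shadowing the built-in date)
    for idx = 1:length(date_tokens)
        dates{idx} = date_tokens{idx}{1}; % store the date
    end
    expr2 = '<div align="center">(\d*\.?\d*)</div>';
    [datafile, data_tokens] = regexp(sourcefile, expr2, 'match', 'tokens'); % pull the target numbers out of the page source
    data = zeros(size(data_tokens)); % zeros, one per matched number
    for idx = 1:length(data_tokens)
        data(idx) = str2double(data_tokens{idx}{1}); % convert the text to a number and store it in data
    end
    data = reshape(data, 6, length(data)/6)'; % rearrange: as the page source shows, the values come in groups of six, one column each
    items = {'Date' 'Open' 'High' 'Close' 'Low' 'Volume' 'Turnover'};
    sheet = sprintf('Q%d', season); % worksheet name
    xlswrite('D:/data', items, sheet)
    xlswrite('D:/data', dates', sheet, 'A2'); % write the dates into the first column
    cellRange = sprintf('B2:%s%d', char(double('B')+size(data,2)-1), size(data,1)+1); % cell range that receives the extracted numbers
    xlswrite('D:/data', data, sheet, cellRange);
    fprintf('done!\n')
end
fprintf('All done! The data is saved in the data spreadsheet on drive D. Go take a look!\n')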
3. Wait quietly. Once the script reports that it has finished, open the data spreadsheet on drive D and check your results.
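To make the layout of the spreadsheet easier to follow, here is a minimal sketch of the reshape and cell-range steps from the script, run on made-up numbers instead of values fetched from the page. The names flat, table6, lastCol and lastRow are illustrative and do not appear in the original script.

```matlab
% Made-up numbers standing in for the values matched on the page:
% two trading days, six values per day, in page order.
flat = [10.1 10.5 10.3 10.0 1200 12345, ...
        10.3 10.8 10.6 10.2 1500 15900];

% reshape fills column by column, so build a 6-by-N matrix first
% and transpose it to get one row per trading day.
table6 = reshape(flat, 6, numel(flat)/6)';

% Same range arithmetic as in the script: the block starts at B2,
% spans size(table6,2) columns and size(table6,1) rows.
lastCol = char(double('B') + size(table6, 2) - 1);  % 'B' + 5 columns -> 'G'
lastRow = size(table6, 1) + 1;                      % data starts in row 2
cellRange = sprintf('B2:%s%d', lastCol, lastRow);

disp(table6)
disp(cellRange)   % B2:G3
```

Note that the letter arithmetic only works while the last column stays within A to Z, which is enough for the six data columns used here.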
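Curious readers may also want to explore how this actually works, which takes some real effort. Here are a few suggestions:
- Learn to read the code, and search online for anything you don't understand. Working things out yourself is the best way to learn, so I won't go into detail here.
- Once you understand the code and have learned regular expressions, open the page source.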
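As a warm-up for reading a page source, the two patterns from the script can be tried on a small hand-written fragment. This is only a sketch: the fragment and its numbers are made up for illustration and merely imitate the `<div align="center">` cells the script searches for, so the real page may be laid out differently. The variable names are also illustrative.

```matlab
% Hypothetical fragment; it is NOT copied from the real page.
fragment = sprintf([ ...
    '<a href="#">\n 2015-03-31 </a>\n' ...
    '<div align="center">3710.61</div>\n' ...
    '<div align="center">3835.57</div>\n']);

% The two patterns used in the script.
expr1 = '\s+(\d\d\d\d-\d\d-\d\d)\s*';
expr2 = '<div align="center">(\d*\.?\d*)</div>';

% 'match' gives the whole matched text, 'tokens' only the part inside ().
[dateMatch, dateTokens] = regexp(fragment, expr1, 'match', 'tokens');
[~, dataTokens] = regexp(fragment, expr2, 'match', 'tokens');

% Each token is itself a 1-by-1 cell, hence the {k}{1} indexing.
firstDate = dateTokens{1}{1};
values = cellfun(@(t) str2double(t{1}), dataTokens);

disp(firstDate)            % the date without surrounding whitespace
disp(numel(dateMatch{1}))  % longer than the token: the match keeps the whitespace
disp(values)
```

The whole match still carries the whitespace around the date, while the token holds only the date itself, which is why the script stores the tokens rather than the matches.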
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
<title>上證綜合指數(000001)_歷史交易_新浪網</title>
<meta name="Keywords" content="上證綜合指數,000001,行情" />
<meta name="Description" content="上證綜合指數的實時行情" />
<link media="all" rel="stylesheet" href="/corp/view/css/style.css" />
<link media="all" rel="stylesheet" href="/corp/view/css/newstyle.css" />
<link media="all" rel="stylesheet" href="/corp/view/css/tables.css" />
<link media="all" rel="stylesheet" href="/corp/view/css/style4.css" />
<style type="text/css">
body,ul,ol,li,p,h1,h2,h3,h4,h5,h6,form,fieldset,table,td,img,div{margin:0;padding:0;border:0;}
body,ul,ol,li,p,form,fieldset,table,td{font-family:"宋體";}
body{background:#fff;color:#000;}
td,p,li,select,input,textarea,div{font-size:12px;}
ul{list-style-type:none;}
select,input{vertical-align:middle; padding:0; margin:0;}
.f14 {font-size:14px;}
.lh20 {line-height:20px;}
.lh23{line-height:23px;}
.b1{border:1px #fcc solid;}
a{text-decoration: underline;color:#009}
a:visited{color:#333333;}
a:hover{color:#f00;}
.f14links{line-height:23px;}
.f14links,.f14links a{font-size:14px;color:#009;}
.f14links a:hover{color:#F00;}
.f14links li{padding-left:13px;background:url(http://image2.sina.com.cn/dy/legal/2006index/news_law_hz_012.gif) no-repeat 3px 45%;}
.clearit{clear:both;font-size:0;line-height:0;height:0;}
.STYLE2 {font-size: 14px; font-weight: bold; }
/*杜邦分析用到的css begin*/
.bottom_line {border-bottom:1px solid #999999}
.f14 {font-size:14px}
.f12 {font-size:12px}
.l15{line-height:150%}
.l13{line-height:130%}
.lh19{line-height:19px;}
/*杜邦分析用到的css end*/
</style>
<!--[if IE]>
<link media="all" rel="stylesheet" href="http://www.sinaimg.cn/cj/realstock/css/ie.css" />
<![endif]-->
<script language="javascript" type="text/javascript">
<!--//--><![CDATA[//><!--
var fullcode="sh000001";
var chart_img_alt = "上證綜合指數 000001 行情圖";
/* comment */
var cmnt_channel = "gg";
var cmnt_newsid = "sh-000001";
var cmnt_group = 1;
var detailcache = new Array();
//--><!]]>
</script>
<script type="text/javascript" src="/corp/view/js/all.js"></script>
<script type="text/javascript" src="/corp/view/js/tables.js"></script>
<script type="text/javascript" src="http://finance.sina.com.cn/iframe/hot_stock_list.js"></script>
<script type="text/javascript" src="http://hq.sinajs.cn/list=sh000001,s_sh000001,s_sh000300,s_sz399001,s_sz399106,s_sz395099"></script>
<script type="text/javascript" src="http://image2.sina.com.cn/home/sinaflash.js"></script>
<script type="text/javascript" src="/corp/view/js/corp_fenshi_zs.js"></script>
</head>
<body>
<div id="wrap">
<!-- 標準二級導航_財經 begin -->
<style type="text/css">
.secondaryHeader{height:33px;overflow:hidden;background:url(http://i2.sinaimg.cn/dy/images/header/2008/standardl2nav_bg.gif) repeat-x #fff;color:#000;font-size:12px;font-weight:100;}
.secondaryHeader a,.secondaryHeader a:visited{color:#000;text-decoration:none;}
.secondaryHeader a:hover,.secondaryHeader a:active{color:#c00;text-decoration:underline;}
.sHBorder{border:1px #e3e3e3 solid;padding:0 10px 0 12px;overflow:hidden;zoom:1;}
.sHLogo{float:left;height:31px;line-height:31px;overflow:hidden;}
.sHLogo span,.sHLogo span a,.sHLogo span a:link,.sHLogo span a:visited,.sHLogo span a:hover{display:block;*float:left;display:table-cell;vertical-align:middle;*display:block;*font-size:27px;*font-family:Arial;height:31px;}
.sHLogo span,.sHLogo span a img,.sHLogo span a:link img,.sHLogo span a:visited img,.sHLogo span a:hover img{vertical-align:middle;}
.sHLinks{float:right;line-height:31px;}
#level2headerborder{background:#fff; height:5px; overflow:hidden; clear:both; width:950px;}
</style>
<div id="level2headerborder"></div>
<div class="secondaryHeader">
<div class="sHBorder">
<div class="sHLogo"><span><a href="http://www.sina.com.cn/"><img src="http://i1.sinaimg.cn/dy/images/header/2009/standardl2nav_sina_new.gif" alt="新浪網" /></a><a href="http://finance.sina.com.cn/"><img src="http://i1.sinaimg.cn/dy/images/header/2009/standardl2nav_finance.gif" alt="新浪財經" /></a></span></div>
<div class="sHLinks"><a href="http://finance.sina.com.cn/">財經首頁</a> | <a href="http://www.sina.com.cn/">新浪首頁</a> | <a href="http://news.sina.com.cn/guide/">新浪導航</a></div>
</div>
</div>
<div id="level2headerborder"></div>
<!-- 標準二級導航_財經 end -->
<!-- banner begin -->
<div style="float:left; width:950px;">
<!-- 頂部廣告位 begin -->
<div style="float:left; width:750px; height:90px;">
<iframe marginheight="0" marginwidth="0" src="http://finance.sina.com.cn/iframe/ad/PDPS000000004094.html" frameborder="0" height="90" scrolling="no" width="750"></iframe><!--<script type="text/javascript" src="http://finance.sina.com.cn/pdps/js/PDPS000000004094.js"></script> -->
</div>
<!-- 頂部廣告位 end -->
<div style="float:right;width:188px; height:88px; border:1px solid #DEDEDE;">
<ul>
<li style="background:url(http://www.sinaimg.cn/bb/article/con_ws_001.gif);line-height:15px;text-align:center;color:#F00">熱點推薦</li>
<li style="line-height:20px; margin-top:5px;">·<a href="http://vip.stock.finance.sina.com.cn/portfolio/main.php" style="color:#F00">自選股-輕鬆管理您的千隻股票</a></li>
<li style="line-height:20px;">·<a href="http://finance.sina.com.cn/money/mall.shtml">金融e路通-理財投資更輕鬆</a></li>
<li style="line-height:20px;">·<a href="http://biz.finance.sina.com.cn/hq/">行情中心-通往財富之門</a></li>
</ul>
</div>
<div style="clear:both"></div>
</div>
<!-- banner end -->
<div class="HSpace-1-5"></div>
<!-- 導航 begin -->
<div class="nav">
<ul>
<li class="navRedLi"><a href="http://finance.sina.com.cn/" target="_blank">財經首頁</a></li>
<li id="nav01"><a href="http://finance.sina.com.cn/stock/index.shtml" target="_blank">股票</a></li>
<li id="nav02"><a href="http://finance.sina.com.cn/fund/index.shtml" target="_blank">基金</a></li>
<li id="nav03"><a href="http://finance.sina.com.cn/stock/roll.shtml" target="_blank">滾動</a></li>
<li id="nav04"><a href="http://vip.stock.finance.sina.com.cn/corp/view/vCB_BulletinGather.php" target="_blank">公告</a></li>
<li id="nav05"><a href="http://finance.sina.com.cn/column/jsy.html" target="_blank">大盤</a></li>
<li id="nav06"><a href="http://finance.sina.com.cn/column/ggdp.html" target="_blank">個股</a></li>
<li id="nav07"><a href="http://finance.sina.com.cn/stock/newstock/index.shtml" target="_blank">新股</a></li>
<li id="nav08"><a href="http://finance.sina.com.cn/stock/warrant/index.shtml" target="_blank">權證</a></li>
<li id="nav09"><a href="http://finance.sina.com.cn/stock/reaserchlist.shtml" target="_blank">報告</a></li>
<li id="nav10"><a href="http://finance.sina.com.cn/money/globalindex/index.shtml" target="_blank">環球市場</a></li>
<li id="nav11"><a href="http://blog.sina.com.cn/lm/finance/index.html" target="_blank">部落格</a></li>
<li id="nav12"><a href="http://finance.sina.com.cn/bar/" target="_blank">股票吧</a></li>
<li id="nav13"><a href="http://finance.sina.com.cn/stock/hkstock/index.shtml" target="_blank">港股</a></li>
<li id="nav14"><a href="http://finance.sina.com.cn/stock/usstock/index.shtml" target="_blank">美股</a></li>
<li id="nav15"><a href="http://biz.finance.sina.com.cn/hq/" target="_blank">行情中心</a></li>
<li id="nav16"><a href="http://vip.stock.finance.sina.com.cn/portfolio/main.php" target="_blank">自選股</a></li>
</ul>
</div>
<!-- 導航 end -->
<!-- 導航下 begin -->
<div class="navbtm">
<div class="navbtmblk1"><span id="idxsh000001"><a href="http://finance.sina.com.cn/realstock/company/sh000001/nc.shtml" target="_blank">上證指數</a>: 0000.00 0.00 00.00億元</span> | <span id="idxsz399001"><a href="http://finance.sina.com.cn/realstock/company/sz399001/nc.shtml" target="_blank">深圳成指</a>: 0000.00 0.00 00.00億元</span> | <span id="idxsh000300"><a href="http://finance.sina.com.cn/realstock/company/sh000300/nc.shtml" target="_blank">滬深300</a>: 0000.00 0.00 00.00億元</span></div>
<div class="navbtmmaquee">
<script type="text/javascript" src="http://finance.sina.com.cn/286/20061129/3.js"></script>
<script type="text/javascript" language="javascript">
<!--//--><![CDATA[//><!--
if(!document.layers) {
with (document.getElementsByTagName("marquee")[0]) {
scrollDelay = 50;
scrollAmount = 2;
onmouseout = function () {
this.scrollDelay = 50;
};
}
}
//--><!]]>
</script>
</div>
</div>
<!-- 導航下 end -->
<div class="HSpace-1-6"></div>
<div id="main">
<!-- 左側 begin -->
<div id="left">
<!-- 最近訪問股|我的自選股 begin -->
<div class="LBlk01">
<!-- 標籤 begin -->
<ul class="LTab01">
<li class="Menu01On" id="m01-0">最近訪問股</li>
<li class="Menu01Off" id="m01-1">我的自選股</li>
</ul>
<!-- 標籤 end -->
<!-- 內容 begin -->
<div id="con01-0"></div>
<div id="con01-1" style="display:none;"></div>
<!-- 內容 end -->
</div>
<!-- 最近訪問股|我的自選股 end -->
<div class="HSpace-1-10"></div>
<!-- 選單 begin -->
<div class="Menu-Ti" id="navlf00"><img src="http://www.sinaimg.cn/cj/realstock/image2/finance_in_ws_010.gif" alt="" id="tImg0"/><span class="capname">每日必讀</span></div>
<div class="Menu-Con" id="item0" style="display:block;">
<table cellspacing="0">
<tr>
<td>·<a href="http://stock.finance.sina.com.cn/" target="_self">股市必察</a></td>
<td>·<a href="http://biz.finance.sina.com.cn/stock/company/notice.php?kind=daily" target="_self" class="incolor">每日提示</a></td>
</tr>
<tr>
<td>·<a href="/corp/go.php/vRPD_QuickView/.phtml" target="_self">公司快報</a></td>
<td>·<a href="/corp/go.php/vRPD_NewStockIssue/page/1.phtml" target="_self">新股上市</a></td>
</tr>
<tr>
<td>·<a href="http://vip.stock.finance.sina.com.cn/q/go.php/vInvestConsult/kind/lhb/index.phtml" target="_self">龍虎榜</a></td>
<td>·<a href="http://vip.stock.finance.sina.com.cn/q/go.php/vIR_EndRise/index.phtml" target="_self" class="incolor">每日熱股</a></td>
</tr>
<tr>
<td colspan='2'>·<a href="http://finance.sina.com.cn/realstock/income_statement/2012-06-30/issued_pdate_de_1.html" target="_self" class="incolor">中報速遞</a></td>
</tr>
</table>
</div>
<!--<div class="HSpace-1-10"></div> -->
<div class="Menu-Ti" id="navlf01"><img src="http://www.sinaimg.cn/cj/realstock/image2/finance_in_ws_010.gif" alt="" id="tImg1"/><span class="capname">指數資料</span></div>
<div class="Menu-Con" id="item1" style="display:block;">
<table cellspacing="0">
<tr>
<td>·<a href="/corp/go.php/vII_BasicInfo/indexid/000001.phtml" target="_self">基本屬性</a></td>
<td>·<a href="/corp/go.php/vII_NewestComponent/indexid/000001.phtml" target="_self">最新成分</a></td>
</tr>
<tr>
<td>·<a href="/corp/go.php/vII_HistoryComponent/indexid/000001.phtml" target="_self">歷史成分</a></td>
<