
gethostbyname() and the related data-handling flow

gethostbyname() -- look up an IP address from a domain name or host name

    #include <netdb.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>
    #include <stdlib.h>
    #include <stdio.h>

    struct hostent *gethostbyname(const char *name);

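The argument is a domain name or host name, for example "www.google.cn". The return value is a pointer to a hostent structure. If the call fails, the function returns NULL.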

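    struct hostent
    {
        char    *h_name;        /* canonical (official) host name */
        char    **h_aliases;    /* NULL-terminated list of aliases */
        int     h_addrtype;     /* address family: AF_INET or AF_INET6 */
        int     h_length;       /* length of each address, in bytes */
        char    **h_addr_list;  /* NULL-terminated list of addresses */
        #define h_addr h_addr_list[0]   /* first address in the list */
    };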

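hostent->h_name

The canonical (official) name of the host. For example, the canonical name of www.google.com is actually www.l.google.com.

hostent->h_aliases

The aliases of the host, as a NULL-terminated list. www.google.com is itself an alias for Google's host. A host may have several aliases, which are usually extra names chosen so that users can remember the site more easily.

hostent->h_addrtype

The type of the host's IP address: IPv4 (AF_INET) or IPv6 (AF_INET6).

hostent->h_length

The length of the host's IP address in bytes: 4 for AF_INET, 16 for AF_INET6.

hostent->h_addr_list

The host's IP addresses, as a NULL-terminated list. Note that each entry is a binary address stored in network byte order, not a C string. Never print an entry directly with printf and %s; the result will be garbage. To actually print an address, call inet_ntop().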

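    const char *inet_ntop(int af, const void *src, char *dst, socklen_t cnt);

This function converts the network address structure src, of address family af, into a human-readable text string and stores it in the buffer dst, whose length is cnt bytes. It returns a pointer to dst. If the call fails (for example, because the buffer is too small), it returns NULL and sets errno.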

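#include <arpa/inet.h>   /* inet_ntop() */
#include <netdb.h>       /* gethostbyname(), struct hostent */
#include <netinet/in.h>  /* INET6_ADDRSTRLEN */
#include <stdio.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    char   *ptr, **pptr;
    struct hostent *hptr;
    char   str[INET6_ADDRSTRLEN];   /* large enough for IPv4 and IPv6 text */

    if(argc < 2)
    {
        fprintf(stderr, "usage: %s hostname\n", argv[0]);
        return 1;
    }
    ptr = argv[1];

    if((hptr = gethostbyname(ptr)) == NULL)
    {
        printf(" gethostbyname error for host:%s\n", ptr);
        return 1;
    }

    printf("official hostname:%s\n",hptr->h_name);

    /* h_aliases is a NULL-terminated list of strings */
    for(pptr = hptr->h_aliases; *pptr != NULL; pptr++)
        printf(" alias:%s\n",*pptr);

    switch(hptr->h_addrtype)
    {
        case AF_INET:
        case AF_INET6:
            /* h_addr_list holds binary addresses; convert before printing */
            pptr=hptr->h_addr_list;
            for(; *pptr!=NULL; pptr++)
                printf(" address:%s\n",
                       inet_ntop(hptr->h_addrtype, *pptr, str, sizeof(str)));
            printf(" first address: %s\n",
                       inet_ntop(hptr->h_addrtype, hptr->h_addr, str, sizeof(str)));
        break;
        default:
            printf("unknown address type\n");
        break;
    }

    return 0;
}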

Compile and run
-----------------------------
# gcc test.c
# ./a.out www.baidu.com
official hostname:www.a.shifen.com
alias:www.baidu.com
address:121.14.88.11
address:121.14.89.11
first address: 121.14.88.11

Note:

On Unix/Linux, gethostbyname is commonly used to query DNS for the IP address of a domain name. Because a DNS lookup may involve recursive queries, a gethostbyname call can take a very long time to return. Unlike connect or read, it cannot be given a timeout through setsockopt or select, so it often becomes the bottleneck of a program. One proposed workaround is to set a timer signal with alarm and, if the timer fires, use setjmp and longjmp to jump out of gethostbyname (I have not tried this approach and do not know how well it works).

In a multithreaded program gethostbyname has an even more serious problem: if one thread blocks in gethostbyname, the other threads block in gethostbyname as well. I ran into this while writing a crawler, and it puzzled me for a long time. All the crawler threads were blocked in gethostbyname, which made the crawler very slow. I searched Google for a long time without finding an answer. Today I happened to find an e-book in my lab's Google Group, "Mining the Web - Discovering Knowledge from Hypertext Data", whose discussion of crawlers contains the following passages:

Many clients for DNS resolution are coded poorly. Most UNIX systems provide an implementation of gethostbyname (the DNS client API - application program interface), which cannot concurrently handle multiple outstanding requests. Therefore, the crawler cannot issue many resolution requests together and poll at a later time for completion of individual requests, which is critical for acceptable performance. Furthermore, if the system-provided client is used, there is no way to distribute load among a number of DNS servers. For all these reasons, many crawlers choose to include their own custom client for DNS name resolution. The Mercator crawler from Compaq System Research Center reduced the time spent in DNS from as high as 87% to a modest 25% by implementing a custom client. The ADNS asynchronous DNS client library is ideal for use in crawlers.

In spite of these optimizations, a large-scale crawler will spend a substantial fraction of its network time not waiting for Http data transfer, but for address resolution. For every hostname that has not been resolved before (which happens frequently with crawlers), the local DNS may have to go across many network hops to fill its cache for the first time. To overlap this unavoidable delay with useful work, prefetching can be used. When a page that has just been fetched is parsed, a stream of HREFs is extracted. Right at this time, that is, even before any of the corresponding URLs are fetched, hostnames are extracted from the HREF targets, and DNS resolution requests are made to the caching server. The prefetching client is usually implemented using UDP instead of TCP, and it does not wait for resolution to be completed. The request serves only to fill the DNS cache so that resolution will be fast when the page is actually needed later on.

In short, the Unix gethostbyname cannot handle multiple outstanding requests concurrently, and this is an inherent limitation of the interface that cannot be changed. Large crawlers usually do not use gethostbyname and instead implement their own custom DNS client. This makes it possible to balance load across DNS servers, and asynchronous resolution greatly speeds up DNS lookups. Such a client is usually built on UDP and can resolve the IP address for a URL before the crawler fetches the page. The text also mentions an open-source asynchronous DNS library, adns, whose home page is http://www.chiark.greenend.org.uk/~ian/adns/

From all of the above, gethostbyname is not suitable for multithreaded programs or for other programs that need fast DNS resolution.