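Device interface layer: registering and unregistering a net_device
A network device is represented by struct net_device, and the kernel only treats it as a valid network device once the driver has registered that structure. This note records how a driver registers a network device with the kernel and how it performs the reverse operation, unregistration.
1. Creating a net_device
First, the driver has to create a struct net_device and fill in its required members. Only then can it be registered with the kernel.
There are many kinds of network devices, but from the software side they have a lot in common, and struct net_device is the kernel's abstraction of those shared properties. The structure already has a large number of members, yet individual devices often have extra characteristics of their own, so a NIC driver is free to define its own structure on top of it, usually in this form: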
struct new_net_device
{
// struct net_device must be the first member
struct net_device dev;
...
};
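We will not discuss the device-specific features that such structures carry; there are too many of them and they are not our concern. struct net_device is the kernel's abstraction, and the kernel provides a standard interface for allocating it, which is what we focus on here.
1.1 Memory allocation
A driver can call alloc_netdev() to allocate a struct net_device.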
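The "first member" rule in the layout above is what lets a driver turn a pointer to the generic structure back into a pointer to its own wrapper. Below is a minimal userspace sketch of that idea. The types fake_net_device and my_nic, their fields, and both helper functions are hypothetical stand-ins invented for illustration; they are not kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct net_device so the sketch builds in userspace. */
struct fake_net_device {
    char name[16];
    unsigned int mtu;
};

/* Hypothetical driver-specific wrapper: the generic part comes first. */
struct my_nic {
    struct fake_net_device dev;   /* must be the first member */
    unsigned int rx_ring_size;    /* driver-private field (illustrative) */
};

/* Recover the wrapper from the generic pointer. Valid only because dev is at offset 0. */
static struct my_nic *to_my_nic(struct fake_net_device *dev)
{
    assert(offsetof(struct my_nic, dev) == 0);
    return (struct my_nic *)dev;
}

/* Fill in a wrapper and return the generic pointer that would be passed around. */
static struct fake_net_device *my_nic_init(struct my_nic *nic, const char *name)
{
    memset(nic, 0, sizeof(*nic));
    strncpy(nic->dev.name, name, sizeof(nic->dev.name) - 1);
    nic->dev.mtu = 1500;
    nic->rx_ring_size = 256;
    return &nic->dev;
}
```

Code that only knows about the generic structure works with the pointer returned by my_nic_init(), while the driver calls to_my_nic() whenever it needs its private fields back.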
#define alloc_netdev(sizeof_priv, name, setup) \
alloc_netdev_mq(sizeof_priv, name, setup, 1)
/**
* alloc_netdev_mq - allocate network device
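* @sizeof_priv: size in bytes of the private data area
* @name: name to give the interface
* @setup: initialization function supplied by the caller; it is called after the memory has been allocated to initialize the net_device
* @queue_count: number of subqueues; subqueues are not discussed here, and alloc_netdev() passes 1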
*
* Allocates a struct net_device with private data area for driver use
* and performs basic initialization. Also allocates subqueue structs
* for each queue on the device at the end of the netdevice.
*/
struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name,
void (*setup)(struct net_device *), unsigned int queue_count)
{
void *p;
struct net_device *dev;
int alloc_size;
// The interface name may be at most 15 characters: dev->name[] holds IFNAMSIZ (16) bytes, including the trailing NUL
BUG_ON(strlen(name) >= sizeof(dev->name));
/* ensure 32-byte alignment of both the device and private area */
// Compute the total allocation size while guaranteeing 32-byte alignment
alloc_size = (sizeof(*dev) + NETDEV_ALIGN_CONST +
(sizeof(struct net_device_subqueue) * (queue_count - 1))) &
~NETDEV_ALIGN_CONST;
alloc_size += sizeof_priv + NETDEV_ALIGN_CONST;
// Allocate the (zeroed) memory
p = kzalloc(alloc_size, GFP_KERNEL);
if (!p) {
printk(KERN_ERR "alloc_netdev: Unable to allocate device.\n");
return NULL;
}
// dev points at the first 32-byte-aligned address in the block, so the allocation may start with some padding
dev = (struct net_device *)
(((long)p + NETDEV_ALIGN_CONST) & ~NETDEV_ALIGN_CONST);
// Record the size of the leading padding in padded; it is needed when the memory is freed
dev->padded = (char *)dev - (char *)p;
dev->nd_net = &init_net;
// dev->priv points at the start of the private data area
if (sizeof_priv) {
dev->priv = ((char *)dev +
((sizeof(struct net_device) +
(sizeof(struct net_device_subqueue) *
(queue_count - 1)) + NETDEV_ALIGN_CONST)
& ~NETDEV_ALIGN_CONST));
}
dev->egress_subqueue_count = queue_count;
// internal_stats() is the default handler for fetching device statistics
dev->get_stats = internal_stats;
// Initialize netpoll support (not discussed here)
netpoll_netdev_init(dev);
// Call the driver-supplied setup function for further initialization. It is called unconditionally, so the driver must provide it
setup(dev);
// Store the interface name in dev->name
strcpy(dev->name, name);
return dev;
}
EXPORT_SYMBOL(alloc_netdev_mq);
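As the code shows, allocation also initializes some fields of struct net_device. More importantly, it invokes the driver-supplied setup() callback, which gives the driver a chance to initialize the newly allocated net_device. Since alloc_netdev() returns the net_device pointer to the caller (usually the driver), the caller can also perform, after the function returns, initialization that would otherwise go in setup().
2. Registering a net_device
Once the net_device has been allocated and initialized, the driver registers it with the kernel through register_netdev(). The code is as follows: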
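The pointer arithmetic in alloc_netdev_mq() is easier to follow in isolation. The sketch below is a userspace model of the same layout: over-allocate by the alignment mask, round the raw pointer up to a 32-byte boundary, remember the padding, and place the private area right after the aligned device structure. The names fake_net_device, fake_alloc_netdev() and fake_free_netdev() are hypothetical, subqueues are left out, and calloc()/free() stand in for the kernel allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define ALIGN_CONST 31u   /* mask for 32-byte alignment, in the role of NETDEV_ALIGN_CONST */

/* Hypothetical stand-in for struct net_device. */
struct fake_net_device {
    char name[16];
    unsigned short padded;   /* bytes skipped at the start of the raw allocation */
    void *priv;              /* start of the driver-private area, or NULL */
};

/* Allocate a zeroed device plus private area, both aligned to 32 bytes. */
static struct fake_net_device *fake_alloc_netdev(size_t sizeof_priv)
{
    /* Size of the device part, rounded up to a multiple of 32. */
    size_t dev_size = (sizeof(struct fake_net_device) + ALIGN_CONST) & ~(size_t)ALIGN_CONST;
    /* Extra ALIGN_CONST bytes leave room to shift the start forward. */
    size_t alloc_size = dev_size + sizeof_priv + ALIGN_CONST;
    char *p = calloc(1, alloc_size);
    struct fake_net_device *dev;

    if (!p)
        return NULL;
    /* Round the raw pointer up to the next 32-byte boundary. */
    dev = (struct fake_net_device *)(((uintptr_t)p + ALIGN_CONST) & ~(uintptr_t)ALIGN_CONST);
    dev->padded = (unsigned short)((char *)dev - p);
    if (sizeof_priv)
        dev->priv = (char *)dev + dev_size;
    return dev;
}

/* Undo the padding to recover the pointer calloc() returned, then free it. */
static void fake_free_netdev(struct fake_net_device *dev)
{
    free((char *)dev - dev->padded);
}
```

Because dev_size is a multiple of 32, the private area starts on a 32-byte boundary as well, and the padded field is the only thing needed to get from the aligned pointer back to the address that must be handed to the allocator on release.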
/**
* register_netdev - register a network device
* @dev: device to register
*
* Take a completed network device structure and add it to the kernel
* interfaces. A %NETDEV_REGISTER message is sent to the netdev notifier
* chain. 0 is returned on success. A negative errno code is returned
* on a failure to set up the device, or if the name is a duplicate.
*
* This is a wrapper around register_netdevice that takes the rtnl semaphore
* and expands the device name if you passed a format string to
* alloc_netdev.
*/
int register_netdev(struct net_device *dev)
{
int err;
// Take the RTNL mutex
rtnl_lock();
/*
* If the name is a format string the caller wants us to do a
* name allocation.
*/
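// If the driver allocated the device with a name containing %d, assign a unique ID here
// to build the final device name. Name selection is not analyzed any further in this note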
if (strchr(dev->name, '%')) {
err = dev_alloc_name(dev, dev->name);
if (err < 0)
goto out;
}
// Do the rest of the registration work
err = register_netdevice(dev);
out:
// Release the lock
rtnl_unlock();
return err;
}
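The note does not look inside dev_alloc_name(), but the effect of passing a template such as "eth%d" can be illustrated with a small userspace model. The sketch below is an assumption-laden simplification, not the kernel's implementation: pick_name(), name_in_use(), the existing[] table and the max_units limit are all hypothetical. It expands the template with the lowest unit number whose resulting name is not already taken.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define NAME_LEN 16   /* same size as dev->name, including the trailing NUL */

/* Return 1 if name already appears in the existing[] table. */
static int name_in_use(const char *name, const char *const existing[], int count)
{
    for (int i = 0; i < count; i++)
        if (strncmp(existing[i], name, NAME_LEN) == 0)
            return 1;
    return 0;
}

/*
 * Expand a template such as "eth%d" into the lowest-numbered free name.
 * Returns the chosen unit number, or -1 if the template is unusable or
 * no unit below max_units is free.
 */
static int pick_name(const char *fmt, const char *const existing[], int count,
                     int max_units, char out[NAME_LEN])
{
    const char *p = strchr(fmt, '%');

    /* Accept exactly one conversion, and it must be %d. */
    if (!p || p[1] != 'd' || strchr(p + 1, '%'))
        return -1;
    for (int unit = 0; unit < max_units; unit++) {
        /* Build "<prefix><unit><suffix>" without using fmt as a format string. */
        int n = snprintf(out, NAME_LEN, "%.*s%d%s", (int)(p - fmt), fmt, unit, p + 2);
        if (n < 0 || n >= NAME_LEN)
            return -1;   /* the expanded name would not fit */
        if (!name_in_use(out, existing, count))
            return unit;
    }
    return -1;
}
```

With existing names "lo", "eth0", "eth1" and "eth3", the template "eth%d" yields "eth2" in this model, since 2 is the lowest unit that is still free.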
The RTNL mutex serializes system-wide operations that read or modify net_device state. The lock itself is implemented as follows:
static DEFINE_MUTEX(rtnl_mutex);
void rtnl_lock(void)
{
mutex_lock(&rtnl_mutex);
}
void __rtnl_unlock(void)
{
mutex_unlock(&rtnl_mutex);
}
2.1 register_netdevice()
This function does the actual registration work. Its logic is as follows:
/**
* register_netdevice - register a network device
* @dev: device to register
*
* Take a completed network device structure and add it to the kernel
* interfaces. A %NETDEV_REGISTER message is sent to the netdev notifier
* chain. 0 is returned on success. A negative errno code is returned
* on a failure to set up the device, or if the name is a duplicate.
*
* Callers must hold the rtnl semaphore. You may want
* register_netdev() instead of this.
*
* BUGS:
* The locking appears insufficient to guarantee two parallel registers
* will not get the same name.
*/
int register_netdevice(struct net_device *dev)
{
struct hlist_head *head;
struct hlist_node *p;
int ret;
struct net *net;
// The device interface layer must already be initialized, i.e. net_dev_init() has finished
BUG_ON(dev_boot_phase);
ASSERT_RTNL();
might_sleep();
/* When net_device's are persistent, this will be fatal. */
// The registration state must be UNINITIALIZED, which is the state of a freshly allocated struct net_device
BUG_ON(dev->reg_state != NETREG_UNINITIALIZED);
BUG_ON(!dev->nd_net);
net = dev->nd_net;
// Initialize several locks in net_device; their purpose is covered in the notes on data transmission
spin_lock_init(&dev->queue_lock);
spin_lock_init(&dev->_xmit_lock);
netdev_set_lockdep_class(&dev->_xmit_lock, dev->type);
dev->xmit_lock_owner = -1;
spin_lock_init(&dev->ingress_lock);
dev->iflink = -1;
/* Init, if this function is available */
// If the driver provides init(), call it. A return value of 0 lets registration continue; anything else fails it
if (dev->init) {
ret = dev->init(dev);
if (ret) {
if (ret > 0)
ret = -EIO;
goto out;
}
}
// Check that the interface name contains no illegal characters and is not too long
if (!dev_valid_name(dev->name)) {
ret = -EINVAL;
goto err_uninit;
}
// Assign the device an unused interface index (ifindex)
dev->ifindex = dev_new_index(net);
if (dev->iflink == -1)
dev->iflink = dev->ifindex;
/* Check for existence of name */
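// Check whether an interface with the same name already exists; if so, registration fails with -EEXIST.
// This shows that two devices in the same network namespace cannot share a name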
head = dev_name_hash(net, dev->name);
hlist_for_each(p, head) {
struct net_device *d
= hlist_entry(p, struct net_device, name_hlist);
if (!strncmp(d->name, dev->name, IFNAMSIZ)) {
ret = -EEXIST;
goto err_uninit;
}
}
/* Fix illegal checksum combinations */
// The following blocks correct the features field set by the driver. Features describe the device's capabilities; we skip the details for now
if ((dev->features & NETIF_F_HW_CSUM) &&
(dev->features & (NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))) {
printk(KERN_NOTICE "%s: mixed HW and IP checksum settings.\n",
dev->name);
dev->features &= ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM);
}
if ((dev->features & NETIF_F_NO_CSUM) &&
(dev->features & (NETIF_F_HW_CSUM|NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM))) {
printk(KERN_NOTICE "%s: mixed no checksumming and other settings.\n",
dev->name);
dev->features &= ~(NETIF_F_IP_CSUM|NETIF_F_IPV6_CSUM|NETIF_F_HW_CSUM);
}
/* Fix illegal SG+CSUM combinations. */
if ((dev->features & NETIF_F_SG) &&
!(dev->features & NETIF_F_ALL_CSUM)) {
printk(KERN_NOTICE "%s: Dropping NETIF_F_SG since no checksum feature.\n",
dev->name);
dev->features &= ~NETIF_F_SG;
}
/* TSO requires that SG is present as well. */
if ((dev->features & NETIF_F_TSO) &&
!(dev->features & NETIF_F_SG)) {
printk(KERN_NOTICE "%s: Dropping NETIF_F_TSO since no SG feature.\n",
dev->name);
dev->features &= ~NETIF_F_TSO;
}
if (dev->features & NETIF_F_UFO) {
if (!(dev->features & NETIF_F_HW_CSUM)) {
printk(KERN_ERR "%s: Dropping NETIF_F_UFO since no "
"NETIF_F_HW_CSUM feature.\n",
dev->name);
dev->features &= ~NETIF_F_UFO;
}
if (!(dev->features & NETIF_F_SG)) {
printk(KERN_ERR "%s: Dropping NETIF_F_UFO since no "
"NETIF_F_SG feature.\n",
dev->name);
dev->features &= ~NETIF_F_UFO;
}
}
// Register the network device with the device model
ret = netdev_register_kobject(dev);
if (ret)
goto err_uninit;
// Mark the device as registered
dev->reg_state = NETREG_REGISTERED;
/*
* Default initial state at registry is that the
* device is present.
*/
// Mark the device as "present"
set_bit(__LINK_STATE_PRESENT, &dev->state);
// Initialize the transmit queueing discipline; see the note on traffic control
dev_init_scheduler(dev);
// Initialization is essentially done; take a reference on the device
dev_hold(dev);
// Link the device into the three structures the system maintains: the name table, the index table and the device list
list_netdevice(dev);
/* Notify protocols, that a new device appeared. */
// Send the NETDEV_REGISTER event to other modules
ret = call_netdevice_notifiers(NETDEV_REGISTER, dev);
ret = notifier_to_errno(ret);
// If a notifier reports failure, roll back all the registration steps above and set the state to UNREGISTERED
if (ret) {
rollback_registered(dev);
dev->reg_state = NETREG_UNREGISTERED;
}
out:
return ret;
err_uninit:
if (dev->uninit)
dev->uninit(dev);
goto out;
}
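The feature corrections in register_netdevice() amount to a small set of dependency rules, which can be restated as a pure function over a bitmask. The sketch below follows the checks in the listing one by one. The F_* bit values are hypothetical and chosen only for this example (the kernel's NETIF_F_* constants have different values), and F_ALL_CSUM is assumed to be the union of the four checksum flags.

```c
#include <assert.h>

/* Hypothetical bit values for this sketch; they are not the kernel's NETIF_F_* values. */
enum {
    F_SG        = 1 << 0,
    F_IP_CSUM   = 1 << 1,
    F_NO_CSUM   = 1 << 2,
    F_HW_CSUM   = 1 << 3,
    F_IPV6_CSUM = 1 << 4,
    F_TSO       = 1 << 5,
    F_UFO       = 1 << 6,
    /* Assumption: "any checksum feature" means any of the four flags below. */
    F_ALL_CSUM  = F_IP_CSUM | F_NO_CSUM | F_HW_CSUM | F_IPV6_CSUM
};

/* Return the feature mask with illegal combinations removed. */
static unsigned int fix_features(unsigned int f)
{
    /* Generic hardware checksumming makes the protocol-specific flags redundant. */
    if ((f & F_HW_CSUM) && (f & (F_IP_CSUM | F_IPV6_CSUM)))
        f &= ~(unsigned int)(F_IP_CSUM | F_IPV6_CSUM);
    /* "No checksum needed" cannot be combined with any checksum offload. */
    if ((f & F_NO_CSUM) && (f & (F_HW_CSUM | F_IP_CSUM | F_IPV6_CSUM)))
        f &= ~(unsigned int)(F_HW_CSUM | F_IP_CSUM | F_IPV6_CSUM);
    /* Scatter-gather is kept only when some checksum feature is present. */
    if ((f & F_SG) && !(f & F_ALL_CSUM))
        f &= ~(unsigned int)F_SG;
    /* TSO depends on scatter-gather. */
    if ((f & F_TSO) && !(f & F_SG))
        f &= ~(unsigned int)F_TSO;
    /* UFO depends on both generic hardware checksumming and scatter-gather. */
    if ((f & F_UFO) && (!(f & F_HW_CSUM) || !(f & F_SG)))
        f &= ~(unsigned int)F_UFO;
    return f;
}
```

The order of the checks matters: a driver that sets only F_SG | F_TSO first loses scatter-gather because no checksum feature is present, and then loses TSO because scatter-gather is gone.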
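3. Unregistering a device
When a device is unplugged, or the network device has to be removed from the kernel for some other reason, the driver calls unregister_netdev(). The code is as follows: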
/**
* unregister_netdev - remove device from the kernel
* @dev: device
*
* This function shuts down a device interface and removes it
* from the kernel tables.
*
* This is just a wrapper for unregister_netdevice that takes
* the rtnl semaphore. In general you want to use this and not
* unregister_netdevice.
*/
void unregister_netdev(struct net_device *dev)
{
// Take the RTNL lock first. As described in an earlier note, it protects the global
// net_device bookkeeping structures, which unregistration has to modify
rtnl_lock();
unregister_netdevice(dev);
rtnl_unlock();
}
It takes the lock and then calls unregister_netdevice().
/**
* unregister_netdevice - remove device from the kernel
* @dev: device
*
* This function shuts down a device interface and removes it
* from the kernel tables.
*
* Callers must hold the rtnl semaphore. You may want
* unregister_netdev() instead of this.
*/
void unregister_netdevice(struct net_device *dev)
{
ASSERT_RTNL();
// Undo the steps performed during registration
rollback_registered(dev);
/* Finish processing unregister after unlock */
// Add the device to the system todo list; the remaining cleanup runs at rtnl_unlock()
net_set_todo(dev);
}
/* Delayed registration/unregisteration */
// The global net_todo_list exists only to defer the final steps of unregistration.
// The registration path shown earlier never uses it
static LIST_HEAD(net_todo_list);
static void net_set_todo(struct net_device *dev)
{
list_add_tail(&dev->todo_list, &net_todo_list);
}
3.1 rollback_registered()
As the logic above shows, the actual unregistration work is done by rollback_registered().
static void rollback_registered(struct net_device *dev)
{
// Preconditions: 1) the device interface layer has been initialized; 2) the RTNL lock is held
BUG_ON(dev_boot_phase);
ASSERT_RTNL();
// A device that was never registered cannot be unregistered
if (dev->reg_state == NETREG_UNINITIALIZED) {
printk(KERN_DEBUG "unregister_netdevice: device %s/%p never "
"was registered\n", dev->name, dev);
WARN_ON(1);
return;
}
// At this point the registration state must be REGISTERED
BUG_ON(dev->reg_state != NETREG_REGISTERED);
// The device may still be up, so close it first
/* If device is running, close it first. */
dev_close(dev);
// Remove the device from the name table, the index table and the device list
unlist_netdevice(dev);
// Set the registration state to UNREGISTERING, i.e. unregistration is in progress
dev->reg_state = NETREG_UNREGISTERING;
// Wait until other CPUs that may still be using the device have finished with it
synchronize_net();
/* Shutdown queueing discipline. */
// Shut down the device's transmit queue
dev_shutdown(dev);
/* Notify protocols, that we are about to destroy
this device. They should clean all the things.
*/
// Send the UNREGISTER notification to every module interested in this event
call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
/*
* Flush the unicast and multicast chains
*/
// Discard all addresses configured on the device
dev_addr_discard(dev);
// Call the driver's uninit() hook, if it has one
if (dev->uninit)
dev->uninit(dev);
/* Notifier chain MUST detach us from master device. */
BUG_TRAP(!dev->master);
// Remove the device from the unified device model
netdev_unregister_kobject(dev);
synchronize_net();
// Drop the reference taken at registration. This only decrements the counter; reaching zero does not free the memory by itself
dev_put(dev);
}
3.2 netdev_run_todo()
As the code above shows, unregister_netdevice() ends by calling net_set_todo(), which adds the device being unregistered to net_todo_list. That list is processed by netdev_run_todo(), which is called from rtnl_unlock().
/* The sequence is:
*
* rtnl_lock();
* ...
* register_netdevice(x1);
* register_netdevice(x2);
* ...
* unregister_netdevice(y1);
* unregister_netdevice(y2);
* ...
* rtnl_unlock();
* free_netdev(y1);
* free_netdev(y2);
*
* We are invoked by rtnl_unlock().
* This allows us to deal with problems:
* 1) We can delete sysfs objects which invoke hotplug
* without deadlocking with linkwatch via keventd.
* 2) Since we run with the RTNL semaphore not held, we can sleep
* safely in order to wait for the netdev refcnt to drop to zero.
*
* We must not return until all unregister events added during
* the interval the lock was held have been completed.
*/
void netdev_run_todo(void)
{
struct list_head list;
// Hold the RTNL lock for as short a time as possible: snapshot net_todo_list into list, then release the lock
/* Snapshot list, allow later requests */
list_replace_init(&net_todo_list, &list);
__rtnl_unlock();
// Walk the snapshot of net_todo_list and remove each device from the system
while (!list_empty(&list)) {
struct net_device *dev
= list_entry(list.next, struct net_device, todo_list);
list_del(&dev->todo_list);
// Check the registration state: anything other than UNREGISTERING at this point is an error
if (unlikely(dev->reg_state != NETREG_UNREGISTERING)) {
printk(KERN_ERR "network todo '%s' but state %d\n",
dev->name, dev->reg_state);
dump_stack();
continue;
}
// Set the registration state to UNREGISTERED
dev->reg_state = NETREG_UNREGISTERED;
on_each_cpu(flush_backlog, dev, 1);
// Wait until every reference to the device has been released
netdev_wait_allrefs(dev);
/* paranoia */
BUG_ON(atomic_read(&dev->refcnt));
WARN_ON(dev->ip_ptr);
WARN_ON(dev->ip6_ptr);
WARN_ON(dev->dn_ptr);
// Call the driver's destructor, if it provides one
if (dev->destructor)
dev->destructor(dev);
// Release the associated device-model structures
/* Free network device */
kobject_put(&dev->dev.kobj);
}
}
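The core idea of netdev_run_todo() is "detach the whole list while the lock is still held, then do the slow work unlocked". The sketch below models that pattern in a single thread. Everything in it is a hypothetical stand-in: fake_dev, the lock_held flag that plays the role of the RTNL mutex, and the hand-rolled singly linked list that plays the role of net_todo_list. For brevity, fake_unregister() also sets the UNREGISTERING state, which in the kernel happens inside rollback_registered().

```c
#include <assert.h>
#include <stddef.h>

enum reg_state { UNINITIALIZED, REGISTERED, UNREGISTERING, UNREGISTERED };

/* Hypothetical stand-in for struct net_device. */
struct fake_dev {
    enum reg_state state;
    struct fake_dev *todo_next;
};

static int lock_held;                              /* models the RTNL mutex */
static struct fake_dev *todo_head;                 /* models net_todo_list */
static struct fake_dev **todo_tail = &todo_head;   /* where the next entry is appended */

static void fake_lock(void)
{
    assert(!lock_held);
    lock_held = 1;
}

/* Queue a device for deferred cleanup; the caller must hold the lock. */
static void fake_unregister(struct fake_dev *dev)
{
    assert(lock_held);
    assert(dev->state == REGISTERED);
    dev->state = UNREGISTERING;
    dev->todo_next = NULL;
    *todo_tail = dev;
    todo_tail = &dev->todo_next;
}

/* Unlock, then finish every unregistration queued while the lock was held. */
static int fake_unlock(void)
{
    struct fake_dev *snapshot = todo_head;   /* detach the whole list first */
    int finished = 0;

    assert(lock_held);
    todo_head = NULL;
    todo_tail = &todo_head;
    lock_held = 0;                           /* the slow part runs unlocked */

    while (snapshot) {
        struct fake_dev *dev = snapshot;
        snapshot = dev->todo_next;
        if (dev->state != UNREGISTERING)
            continue;                        /* unexpected state: skip this entry */
        dev->state = UNREGISTERED;
        finished++;
    }
    return finished;
}
```

Because the shared list is emptied before the lock is dropped, new unregistrations queued by other callers land on a fresh list and are handled by their own unlock, while the current caller works only on its private snapshot.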
3.2.1 netdev_wait_allrefs()
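Nothing in the unregistration path above frees the struct net_device. Whether it can be freed is governed by its refcnt member, the reference count, which is adjusted with dev_hold() and dev_put(). Registration takes one reference (the dev_hold() call in register_netdevice()) and unregistration drops it (the dev_put() at the end of rollback_registered()), so the count cannot reach zero unless the device is unregistered.
Freeing the net_device has to wait until the reference count drops to zero. At unregistration time, however, there is no guarantee that other modules have already released their references, so a mechanism is needed that reliably gets the device released: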
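- First, unregistration sends out a NETDEV_UNREGISTER event, and any module that holds a reference to the device is expected to listen for it;
- Even with the notification, unregistration still has to wait, because there is no guarantee that other subsystems handle the event promptly.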
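The first point, a reference holder giving its reference back when it sees the unregister event, can be sketched in userspace as follows. All names here (fake_dev, fake_hold(), fake_put(), fake_subsys, subsys_event() and the two event constants) are hypothetical stand-ins for dev_hold(), dev_put() and a notifier callback; the sketch only illustrates the bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

enum fake_event { EV_REGISTER, EV_UNREGISTER };

/* Hypothetical stand-in for struct net_device. */
struct fake_dev {
    int refcnt;
};

/* Take a reference on the device. */
static void fake_hold(struct fake_dev *dev)
{
    dev->refcnt++;
}

/* Give a reference back; the count must never go negative. */
static void fake_put(struct fake_dev *dev)
{
    assert(dev->refcnt > 0);
    dev->refcnt--;
}

/* A subsystem that caches a device pointer and therefore pins the device. */
struct fake_subsys {
    struct fake_dev *cached;
};

/* Notifier callback: take a reference on REGISTER, give it back on UNREGISTER. */
static void subsys_event(struct fake_subsys *s, enum fake_event ev, struct fake_dev *dev)
{
    if (ev == EV_REGISTER && s->cached == NULL) {
        fake_hold(dev);
        s->cached = dev;
    } else if (ev == EV_UNREGISTER && s->cached == dev) {
        s->cached = NULL;   /* forget the pointer before dropping the reference */
        fake_put(dev);
    }
}
```

The callback checks s->cached before dropping the reference, so receiving the unregister event a second time does nothing. That matters because the event may be delivered more than once while the unregistering side waits.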
Putting this together, the kernel implements the design in netdev_wait_allrefs().
/*
* netdev_wait_allrefs - wait until all references are gone.
*
* This is called when unregistering network devices.
*
* Any protocol or device that holds a reference should register
* for netdevice notification, and cleanup and put back the
* reference if they receive an UNREGISTER event.
* We can get stuck here if buggy protocols don't correctly
* call dev_put.
*/
static void netdev_wait_allrefs(struct net_device *dev)
{
unsigned long rebroadcast_time, warning_time;
rebroadcast_time = warning_time = jiffies;
// Loop until the reference count drops to 0
while (atomic_read(&dev->refcnt) != 0) {
// Re-send the NETDEV_UNREGISTER notification once more than 1 s has passed since the last one
if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
rtnl_lock();
/* Rebroadcast unregister notification */
call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
// Link-state handling is covered in a separate note
if (test_bit(__LINK_STATE_LINKWATCH_PENDING, &dev->state)) {
/* We must not have linkwatch events
* pending on unregister. If this
* happens, we simply run the queue
* unscheduled, resulting in a noop
* for this device.
*/
linkwatch_run_queue();
}
__rtnl_unlock();
rebroadcast_time = jiffies;
}
// Sleep for 250 ms
msleep(250);
// Print a warning each time the wait exceeds another 10 s
if (time_after(jiffies, warning_time + 10 * HZ)) {
printk(KERN_EMERG "unregister_netdevice: "
"waiting for %s to become free. Usage "
"count = %d\n",
dev->name, atomic_read(&dev->refcnt));
warning_time = jiffies;
}
}
}
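Because netdev_wait_allrefs() sleeps while it waits, unregistration can block for some time and must therefore never be called from atomic context.
4. Destroying a net_device
After unregistering the device from the kernel, the driver can call free_netdev() to release it.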
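The timing behaviour of the wait loop can be explored without a kernel by replacing jiffies and msleep() with a simulated millisecond clock. The sketch below is such a model: wait_allrefs(), fake_dev, the notify_fn callback type, the max_ms cut-off and the two example holders are all hypothetical, and the three intervals are taken from the listing above (1 s rebroadcast, 250 ms sleep, 10 s warning).

```c
#include <assert.h>

#define REBROADCAST_MS 1000   /* re-send the unregister event after this long */
#define POLL_MS         250   /* simulated sleep between reference count checks */
#define WARN_MS       10000   /* count a warning after this long without progress */

/* Hypothetical stand-in for struct net_device, plus counters for the sketch. */
struct fake_dev {
    int refcnt;
    int rebroadcasts;   /* how many times the event was re-sent */
    int warnings;       /* how many warnings would have been printed */
};

/* Called on every rebroadcast; a well-behaved holder drops its reference here. */
typedef void (*notify_fn)(struct fake_dev *dev);

/*
 * Poll until refcnt reaches zero or max_ms of simulated time has passed.
 * Returns the simulated time spent waiting, in milliseconds.
 */
static long wait_allrefs(struct fake_dev *dev, notify_fn notify, long max_ms)
{
    long now = 0, rebroadcast_time = 0, warning_time = 0;

    while (dev->refcnt != 0 && now < max_ms) {
        if (now > rebroadcast_time + REBROADCAST_MS) {
            notify(dev);
            dev->rebroadcasts++;
            rebroadcast_time = now;
        }
        now += POLL_MS;   /* stands in for msleep(250) */
        if (now > warning_time + WARN_MS) {
            dev->warnings++;
            warning_time = now;
        }
    }
    return now;
}

/* Example holders: one reacts to the first event, the other never does. */
static void well_behaved(struct fake_dev *dev) { dev->refcnt = 0; }
static void buggy(struct fake_dev *dev)        { (void)dev; }
```

In this model a holder that only reacts to the rebroadcast keeps the caller waiting for well over a second, and a holder that never releases its reference keeps it waiting indefinitely (the max_ms cut-off exists only so the sketch terminates). This is the blocking behaviour that rules out atomic context.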
/**
* free_netdev - free network device
* @dev: device
*
* This func