linux kernel原始碼剖析共享記憶體部分 IPC 虛擬記憶體對映 VMA shmget shmat shmdt shmctl

阿新 • • 發佈：2018-12-26

實現程序間通訊的方法很多，比如：

管道（同一臺機器上兩個程序雙向通訊）

套接字（不同機器上的兩個程序間的通訊）

System IPC機制（同一機器上，許多程序相互通訊）IPC有以下三個機制：

1訊息佇列：

用於資訊傳遞頻繁且內容較少的程序間通訊

2 訊號量：

用於實現程序間同步通訊

3 共享記憶體：

用於資訊內容比較多的程序間通訊

核心中實現IPC子系統的原始碼組織結構：

其中msg.h msg.c組成第一種IPC機制訊息佇列，sem.h sem.c組成訊號量，shm.h,shm.c組成共享記憶體機制。util.h,util.c將三種獨立的機制結合起來，統一初始化，統一管理標識。

IPC是一組系統呼叫，且每個

IPC資源是一個32位的關鍵字（key），和一個32的唯一標識（identifer）。每個IPC資源都有一個ipc_perm的資料結構記錄了IPC關鍵字，建立者和所有者的UID，GID，讀寫許可權，seq為IPC識別符號的當前序號資訊。

struct ipc_perm
{
__kernel_key_tkey;
__kernel_uid_tuid;
__kernel_gid_tgid;
__kernel_uid_tcuid;
__kernel_gid_tcgid;
__kernel_mode_tmode; 
unsigned shortseq;
};

建立機制的函式分別為msget（）、semget

（）、shmget（），建立成功後都會返回一個唯一的IPC標識，使用完成後可根據這個標識釋放資源。

共享記憶體API：

shmget（）建立一塊共享記憶體

shmat（）將已經存在的共享記憶體對映到程序地址空間

shmdt（）取消該對映

shmctl（）根據操作管理共享記憶體

int shmget（key_t key,int size,int shmflg）

成功返回共享記憶體標示符，失敗返回-1；

key唯一標識鍵值，有兩種情況會建立新的共享記憶體;

a.key值設為IPC_PRIVATE。

b.key值為0，但是shmflag為CREATE

size建立共享記憶體的長度

記憶體的分配以頁為單位，

size為1~4096之間都會分配一頁4k，頁數的整數倍。

shmflag IPC_CREATE IPC_EXECL

每建立好一塊共享記憶體。都會用shmid_ds去維護：

struct shmid_ds {
struct ipc_permshm_perm; /* operation perms */
int shm_segsz;/* size of segment (bytes) */
__kernel_time_tshm_atime; /* last attach time */
__kernel_time_tshm_dtime; /* last detach time */
__kernel_time_tshm_ctime; /* last change time */
__kernel_ipc_pid_tshm_cpid; /* pid of creator */
__kernel_ipc_pid_tshm_lpid; /* pid of last operator */
unsigned shortshm_nattch; /* no. of current attaches */
unsigned short shm_unused; /* compatibility */
void *shm_unused2;/* ditto - used by DIPC */
void *shm_unused3;/* unused */
};

void *shmat（int shmid,const void *shmaddr,int shmflg）

成功返回連結的地址，失敗返回-1；呼叫完成後，shm_nattch += 1；

shmid

shmget返回的記憶體標識

shmaddr

把shmid標識的記憶體連線到程序資料段，根據shmaddr值可分為：

shmaddr = 0;核心從1~1.5G範圍從高位到低位自動選擇一塊空閒的未被對映的記憶體，並返回地址。

shmaddr != 0,shmflg未設定SHM_RND，以shmaddr為連線地址

shmaddr = 0，shmflag設定了SHM_RND，shmaddr自動規整為SHMLAB的整數倍。

shmflg

設定SHM_RND同時可設定SHM_RDONLY標誌位，改變讀寫許可權，以只讀方式繫結。

int shmdt（const void *shmaddr）

返回值跟shmat相似。呼叫完成後，shm_nattch -= 1；

int shmctl（int shmid, int cmd, struct shmid_ds *buf）

cmd 將要控制的操作

IPC_STAT 將shmid_ds複製到緩衝區buf。

IPC_RMID標記共享記憶體已銷燬，最後一個連線分離也即shm_nattch = 0；時銷燬。

IPC機制中共享記憶體的工作流程

IPC中系統建立共享記憶體的函式和函式之間的呼叫關係。

在util.c裡面有ipc_init函式：

static int __init ipc_init(void)
{
sem_init();
msg_init();
shm_init();
return 0;
}

初始化了三種IPC機制，其中shm_init是初始化共享記憶體的。以下進入shm_init函式詳細介紹。

void __init shm_init (void)
{
ipc_init_ids(&shm_ids, 1);
ipc_init_proc_interface("sysvipc/shm",
"       key      shmid perms       size  cpid  lpid nattch   uid   gid  cuid  cgid      atime      dtime      ctime\n",
&shm_ids,
sysvipc_shm_proc_show);
}

1、先看的shm_ids結構體

static struct ipc_ids shm_ids;
struct ipc_ids {
int in_use;
int max_id;
unsigned short seq;
unsigned short seq_max;
struct semaphore sem;
struct ipc_id_ary nullentry;
struct ipc_id_ary* entries;
};

ipc_ids裡面包括了指向儲存IPC資源的指標陣列p

struct ipc_id_ary {
int size;
struct kern_ipc_perm *p[0];
};

p是kern_ipc_perm*型別

/* used by in-kernel data structures */
struct kern_ipc_perm
{
spinlock_t lock;
int deleted;
key_t key;
uid_t uid;
gid_t gid;
uid_t cuid;
gid_t cgid;
mode_t mode; 
unsigned long seq;
void *security;
};

kern儲存資源的基本資訊，包括key值，建立者、擁有者的UID GID以及訪問許可權，當前唯一標識的順序編號seq。

其中spinlock_t是一個自旋鎖，型別如下：

typedef struct {
raw_spinlock_t raw_lock;
#if defined(CONFIG_PREEMPT) && defined(CONFIG_SMP)
unsigned int break_lock;
#endif
#ifdef CONFIG_DEBUG_SPINLOCK
unsigned int magic, owner_cpu;
void *owner;
#endif
} spinlock_t;

2、跟蹤到ipc_init_ids函式

設定ids結構體中的引數，開闢資源。

/**
 * ipc_init_ids- initialise IPC identifiers
 * @ids: Identifier set
 * @size: Number of identifiers
 *
 * Given a size for the ipc identifier range (limited below IPCMNI)
 * set up the sequence range to use then allocate and initialise the
 * array itself. 
 */
 
void __init ipc_init_ids(struct ipc_ids* ids, int size)
{
int i;
sema_init(&ids->sem,1);


if(size > IPCMNI)
size = IPCMNI;
ids->in_use = 0;
ids->max_id = -1;
ids->seq = 0;
{
int seq_limit = INT_MAX/SEQ_MULTIPLIER;
if(seq_limit > USHRT_MAX)
ids->seq_max = USHRT_MAX;
else
ids->seq_max = seq_limit;
}


ids->entries = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*size +
    sizeof(struct ipc_id_ary));


if(ids->entries == NULL) {
printk(KERN_ERR "ipc_init_ids() failed, ipc service disabled.\n");
size = 0;
ids->entries = &ids->nullentry;
}
ids->entries->size = size;
for(i=0;i<size;i++)
ids->entries->p[i] = NULL;
}

為rcu頭結構和物件分配記憶體，返回物件的指標,分配失敗返回NULL。

/**
 * ipc_rcu_alloc- allocate ipc and rcu space 
 * @size: size desired
 *
 * Allocate memory for the rcu header structure +  the object.
 * Returns the pointer to the object.
 * NULL is returned if the allocation fails. 
 */
 
void* ipc_rcu_alloc(int size)
{
void* out;
/* 
* We prepend the allocation with the rcu struct, and
* workqueue if necessary (for vmalloc). 
*/
if (rcu_use_vmalloc(size)) {
out = vmalloc(HDRLEN_VMALLOC + size);
if (out) {
out += HDRLEN_VMALLOC;
container_of(out, struct ipc_rcu_hdr, data)->is_vmalloc = 1;
container_of(out, struct ipc_rcu_hdr, data)->refcount = 1;
}
} else {
out = kmalloc(HDRLEN_KMALLOC + size, GFP_KERNEL);
if (out) {
out += HDRLEN_KMALLOC;
container_of(out, struct ipc_rcu_hdr, data)->is_vmalloc = 0;
container_of(out, struct ipc_rcu_hdr, data)->refcount = 1;
}
}


return out;
}

3、跟蹤ipc_init_proc_interface函式

在進函式之前，先看引數seq_file結構體

struct seq_file {
char *buf;
size_t size;
size_t from;
size_t count;
loff_t index;
loff_t version;
struct semaphore sem;
struct seq_operations *op;
void *private;
};

設定引數，建立介面函式ipc_init_proc_interface

/**
 * ipc_init_proc_interface-  Create a proc interface for sysipc types
 *   using a seq_file interface.
 * @path: Path in procfs
 * @header: Banner to be printed at the beginning of the file.
 * @ids: ipc id table to iterate.
 * @show: show routine.
 */
void __init ipc_init_proc_interface(const char *path, const char *header,
   struct ipc_ids *ids,
   int (*show)(struct seq_file *, void *))
{
struct proc_dir_entry *pde;
struct ipc_proc_iface *iface;


iface = kmalloc(sizeof(*iface), GFP_KERNEL);
if (!iface)
return;
iface->path = path;
iface->header = header;
iface->ids = ids;
iface->show = show;


pde = create_proc_entry(path,
S_IRUGO,        /* world readable */
NULL            /* parent dir */);
if (pde) {
pde->data = iface;
pde->proc_fops = &sysvipc_proc_fops;
} else {
kfree(iface);
}
}

在此函式中呼叫了三個函式，kmalloc,create_proc_entry,kfree.

kmalloc是記憶體分配函式

static inline void *kmalloc(size_t size, gfp_t flags)
{
if (__builtin_constant_p(size)) {
int i = 0;
#define CACHE(x) \
if (size <= x) \
goto found; \
else \
i++;
#include "kmalloc_sizes.h"
#undef CACHE
{
extern void __you_cannot_kmalloc_that_much(void);
__you_cannot_kmalloc_that_much();
}
found:
return kmem_cache_alloc((flags & GFP_DMA) ?
malloc_sizes[i].cs_dmacachep :
malloc_sizes[i].cs_cachep, flags);
}
return __kmalloc(size, flags);
}

create_proc_entry是一個行內函數。

static inline struct proc_dir_entry *create_proc_entry(const char *name,
mode_t mode, struct proc_dir_entry *parent) { return NULL; }

第一個引數path，path是從ipc_init_proc_interface函式接收來的路徑，在該路徑下建立目錄。

第二個引數 mode，在呼叫的時候傳入的引數為S_IRUGO表示全域性可讀，也即許可權是400 040 004.

第三個引數 parent，標識是否有父目錄。

kfree釋放記憶體

/**
 * kfree - free previously allocated memory
 * @objp: pointer returned by kmalloc.
 *
 * If @objp is NULL, no operation is performed.
 *
 * Don't free memory not originally allocated by kmalloc()
 * or you will run into trouble.
 */
void kfree(const void *objp)
{
kmem_cache_t *c;
unsigned long flags;


if (unlikely(!objp))
return;
local_irq_save(flags);
kfree_debugcheck(objp);
c = page_get_cache(virt_to_page(objp));
__cache_free(c, (void*)objp);
local_irq_restore(flags);
}
EXPORT_SYMBOL(kfree);

上述ipc_init_proc_interface函式中呼叫的pro_dir_entry結構體

struct proc_dir_entry {
unsigned int low_ino;
unsigned short namelen;
const char *name;
mode_t mode;
nlink_t nlink;
uid_t uid;
gid_t gid;
unsigned long size;
struct inode_operations * proc_iops;
struct file_operations * proc_fops;
get_info_t *get_info;
struct module *owner;
struct proc_dir_entry *next, *parent, *subdir;
void *data;
read_proc_t *read_proc;
write_proc_t *write_proc;
atomic_t count;/* use count */
int deleted; /* delete flag */
void *set;
};

inode_opreations結構體

struct inode_operations {
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
int (*link) (struct dentry *,struct inode *,struct dentry *);
int (*unlink) (struct inode *,struct dentry *);
int (*symlink) (struct inode *,struct dentry *,const char *);
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
void * (*follow_link) (struct dentry *, struct nameidata *);
void (*put_link) (struct dentry *, struct nameidata *, void *);
void (*truncate) (struct inode *);
int (*permission) (struct inode *, int, struct nameidata *);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *,const void *,size_t,int);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, char *, size_t);
int (*removexattr) (struct dentry *, const char *);
};

其中目錄資訊的結構體dentry

struct dentry {
atomic_t d_count;
unsigned int d_flags;/* protected by d_lock */
spinlock_t d_lock;/* per dentry lock */
struct inode *d_inode;/* Where the name belongs to - NULL is
* negative */
/*
* The next three fields are touched by __d_lookup.  Place them here
* so they all fit in a cache line.
*/
struct hlist_node d_hash;/* lookup hash list */
struct dentry *d_parent;/* parent directory */
struct qstr d_name;


struct list_head d_lru;/* LRU list */
struct list_head d_child;/* child of parent list */
struct list_head d_subdirs;/* our children */
struct list_head d_alias;/* inode alias list */
unsigned long d_time;/* used by d_revalidate */
struct dentry_operations *d_op;
struct super_block *d_sb;/* The root of the dentry tree */
void *d_fsdata;/* fs-specific data */
  struct rcu_head d_rcu;
struct dcookie_struct *d_cookie; /* cookie, if any */
int d_mounted;
unsigned char d_iname[DNAME_INLINE_LEN_MIN];/* small names */
};
dentry_opreations結構體
struct dentry_operations {
int (*d_revalidate)(struct dentry *, struct nameidata *);
int (*d_hash) (struct dentry *, struct qstr *);
int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
int (*d_delete)(struct dentry *);
void (*d_release)(struct dentry *);
void (*d_iput)(struct dentry *, struct inode *);
};


/* the dentry parameter passed to d_hash and d_compare is the parent
 * directory of the entries to be compared. It is used in case these
 * functions need any directory specific information for determining
 * equivalency classes.  Using the dentry itself might not work, as it
 * might be a negative dentry which has no information associated with
 * it */


/*
locking rules:
big lock dcache_lock d_lock   may block
d_revalidate: nono no       yes
d_hash no no no       yes
d_compare: no yes yes      no
d_delete: no yes no       no
d_release: no no no       yes
d_iput: no no no       yes
 */

結構體ipc_proc_iface

struct ipc_proc_iface*iface;

struct ipc_proc_iface {
const char *path;
const char *header;
struct ipc_ids *ids;
int (*show)(struct seq_file *, void *);
};

其中涉及的ipc_ids和seq_file在上面已提到。

跟蹤到sysvipc_shm_proc_show函式

static int sysvipc_shm_proc_show(struct seq_file *s, void *it)
{
struct shmid_kernel *shp = it;
char *format;


#define SMALL_STRING "%10d %10d  %4o %10u %5u %5u  %5d %5u %5u %5u %5u %10lu %10lu %10lu\n"
#define BIG_STRING   "%10d %10d  %4o %21u %5u %5u  %5d %5u %5u %5u %5u %10lu %10lu %10lu\n"


if (sizeof(size_t) <= sizeof(int))
format = SMALL_STRING;
else
format = BIG_STRING;
return seq_printf(s, format,
 shp->shm_perm.key,
 shp->id,
 shp->shm_flags,
 shp->shm_segsz,
 shp->shm_cprid,
 shp->shm_lprid,
 is_file_hugepages(shp->shm_file) ? (file_count(shp->shm_file) - 1) : shp->shm_nattch,
 shp->shm_perm.uid,
 shp->shm_perm.gid,
 shp->shm_perm.cuid,
 shp->shm_perm.cgid,
 shp->shm_atim,
 shp->shm_dtim,
 shp->shm_ctim);
}

傳入標示符seq結構體資訊，並顯示system V IPC資源資訊。並用seq_printf打印出來。

int seq_printf(struct seq_file *m, const char *f, ...)
{
va_list args;
int len;


if (m->count < m->size) {
va_start(args, f);
len = vsnprintf(m->buf + m->count, m->size - m->count, f, args);
va_end(args);
if (m->count + len < m->size) {
m->count += len;
return 0;
}
}
m->count = m->size;
return -1;
}
EXPORT_SYMBOL(seq_printf);

這裡seq_printf是一個可變參函式。傳入多個引數，並呼叫vsnprintf函式列印。

系統API：

shm API呼叫整體的流程圖如下：

四個函式都是被sys_ipc函式管理：

/*
 * sys_ipc() is the de-multiplexer for the SysV IPC calls..
 *
 * This is really horribly ugly.
 */
asmlinkage int sys_ipc(uint call, int first, int second, int third,
      void __user *ptr, long fifth)
{
int version, ret;


version = call >> 16; /* hack for backward compatibility */
call &= 0xffff;


switch (call) {
case SEMOP:
return sys_semtimedop (first, (struct sembuf __user *)ptr, second, NULL);
case SEMTIMEDOP:
return sys_semtimedop(first, (struct sembuf __user *)ptr, second,
(const struct timespec __user *)fifth);


case SEMGET:
return sys_semget (first, second, third);
case SEMCTL: {
union semun fourth;
if (!ptr)
return -EINVAL;
if (get_user(fourth.__pad, (void __user * __user *) ptr))
return -EFAULT;
return sys_semctl (first, second, third, fourth);
}


case MSGSND:
return sys_msgsnd(first, (struct msgbuf __user *) ptr, 
 second, third);
case MSGRCV:
switch (version) {
case 0: {
struct ipc_kludge tmp;
if (!ptr)
return -EINVAL;
if (copy_from_user(&tmp,(struct ipc_kludge __user *)ptr,
  sizeof (tmp)))
return -EFAULT;
return sys_msgrcv (first, tmp.msgp, second,
  tmp.msgtyp, third);
}
default:
return sys_msgrcv (first,
  (struct msgbuf __user *) ptr,
  second, fifth, third);
}
case MSGGET:
return sys_msgget ((key_t) first, second);
case MSGCTL:
return sys_msgctl(first, second, (struct msqid_ds __user *)ptr);


case SHMAT:
switch (version) {
default: {
ulong raddr;
ret = do_shmat(first, (char __user *)ptr, second, &raddr);
if (ret)
return ret;
return put_user(raddr, (ulong __user *)third);
}
case 1: /* Of course, we don't support iBCS2! */
return -EINVAL;
}
case SHMDT: 
return sys_shmdt ((char __user *)ptr);
case SHMGET:
return sys_shmget (first, second, third);
case SHMCTL:
return sys_shmctl (first, second,
  (struct shmid_ds __user *) ptr);
default:
return -ENOSYS;
}
}

一直不懂irix是什麼東東，維基百科的解釋：

IRIX是由Silicon Graphics（SGI）開發的在其MIPS工作站和伺服器上本機執行的停產作業系統。它基於具有BSD擴充套件的UNIX系統V。 IRIX是第一個包含XFS檔案系統的作業系統。

sys_shmget

asmlinkage long sys_shmget (key_t key, size_t size, int shmflg)
{
struct shmid_kernel *shp;
int err, id = 0;


down(&shm_ids.sem);
if (key == IPC_PRIVATE) {
err = newseg(key, shmflg, size);
} else if ((id = ipc_findkey(&shm_ids, key)) == -1) {
if (!(shmflg & IPC_CREAT))
err = -ENOENT;
else
err = newseg(key, shmflg, size);
} else if ((shmflg & IPC_CREAT) && (shmflg & IPC_EXCL)) {
err = -EEXIST;
} else {
shp = shm_lock(id);
if(shp==NULL)
BUG();
if (shp->shm_segsz < size)
err = -EINVAL;
else if (ipcperms(&shp->shm_perm, shmflg))
err = -EACCES;
else {
int shmid = shm_buildid(id, shp->shm_perm.seq);
err = security_shm_associate(shp, shmflg);
if (!err)
err = shmid;
}
shm_unlock(shp);
}
up(&shm_ids.sem);


return err;
}

newseg建立新的鍵值為key，大小為size的共享記憶體，並返回建立好的共享記憶體的標示符。

static int newseg (key_t key, int shmflg, size_t size)
{
int error;
struct shmid_kernel *shp;
int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT;
struct file * file;
char name[13];
int id;


if (size < SHMMIN || size > shm_ctlmax)
return -EINVAL;


if (shm_tot + numpages >= shm_ctlall)
return -ENOSPC;


shp = ipc_rcu_alloc(sizeof(*shp));
if (!shp)
return -ENOMEM;


shp->shm_perm.key = key;
shp->shm_flags = (shmflg & S_IRWXUGO);
shp->mlock_user = NULL;


shp->shm_perm.security = NULL;
error = security_shm_alloc(shp);
if (error) {
ipc_rcu_putref(shp);
return error;
}


if (shmflg & SHM_HUGETLB) {
/* hugetlb_zero_setup takes care of mlock user accounting */
file = hugetlb_zero_setup(size);
shp->mlock_user = current->user;
} else {
int acctflag = VM_ACCOUNT;
/*
* Do not allow no accounting for OVERCOMMIT_NEVER, even
* if it's asked for.
*/
if  ((shmflg & SHM_NORESERVE) &&
sysctl_overcommit_memory != OVERCOMMIT_NEVER)
acctflag = 0;
sprintf (name, "SYSV%08x", key);
file = shmem_file_setup(name, size, acctflag);
}
error = PTR_ERR(file);
if (IS_ERR(file))
goto no_file;


error = -ENOSPC;
id = shm_addid(shp);
if(id == -1) 
goto no_id;


shp->shm_cprid = current->tgid;
shp->shm_lprid = 0;
shp->shm_atim = shp->shm_dtim = 0;
shp->shm_ctim = get_seconds();
shp->shm_segsz = size;
shp->shm_nattch = 0;
shp->id = shm_buildid(id,shp->shm_perm.seq);
shp->shm_file = file;
file->f_dentry->d_inode->i_ino = shp->id;


/* Hugetlb ops would have already been assigned. */
if (!(shmflg & SHM_HUGETLB))
file->f_op = &shm_file_operations;


shm_tot += numpages;
shm_unlock(shp);
return shp->id;


no_id:
fput(file);
no_file:
security_shm_free(shp);
ipc_rcu_putref(shp);
return error;
}

尋找一個ID，這個ID是陣列P的下標，這個陣列管理者所有的共享記憶體。

/**
 * ipc_findkey- find a key in an ipc identifier set
 * @ids: Identifier set
 * @key: The key to find
 * 
 * Requires ipc_ids.sem locked.
 * Returns the identifier if found or -1 if not.
 */
 
int ipc_findkey(struct ipc_ids* ids, key_t key)
{
int id;
struct kern_ipc_perm* p;
int max_id = ids->max_id;


/*
* rcu_dereference() is not needed here
* since ipc_ids.sem is held
*/
for (id = 0; id <= max_id; id++) {
p = ids->entries->p[id];
if(p==NULL)
continue;
if (key == p->key)
return id;
}
return -1;
}

shmat函式

asmlinkage long sys_shmat(int shmid, char __user *shmaddr, int shmflg)
{
unsigned long ret;
long err;


err = do_shmat(shmid, shmaddr, shmflg, &ret);
if (err)
return err;
force_successful_syscall_return();
return (long)ret;
}

主要是在do_shmat函式中實現掛載

/*
 * Fix shmaddr, allocate descriptor, map shm, add attach descriptor to lists.
 *
 * NOTE! Despite the name, this is NOT a direct system call entrypoint. The
 * "raddr" thing points to kernel space, and there has to be a wrapper around
 * this.
 */
long do_shmat(int shmid, char __user *shmaddr, int shmflg, ulong *raddr)
{
struct shmid_kernel *shp;
unsigned long addr;
unsigned long size;
struct file * file;
int    err;
unsigned long flags;
unsigned long prot;
unsigned long o_flags;
int acc_mode;
void *user_addr;


if (shmid < 0) {
err = -EINVAL;
goto out;
} else if ((addr = (ulong)shmaddr)) {
if (addr & (SHMLBA-1)) {
if (shmflg & SHM_RND)
addr &= ~(SHMLBA-1);  /* round down */
else
#ifndef __ARCH_FORCE_SHMLBA
if (addr & ~PAGE_MASK)
#endif
return -EINVAL;
}
flags = MAP_SHARED | MAP_FIXED;
} else {
if ((shmflg & SHM_REMAP))
return -EINVAL;


flags = MAP_SHARED;
}


if (shmflg & SHM_RDONLY) {
prot = PROT_READ;
o_flags = O_RDONLY;
acc_mode = S_IRUGO;
} else {
prot = PROT_READ | PROT_WRITE;
o_flags = O_RDWR;
acc_mode = S_IRUGO | S_IWUGO;
}
if (shmflg & SHM_EXEC) {
prot |= PROT_EXEC;
acc_mode |= S_IXUGO;
}


/*
* We cannot rely on the fs check since SYSV IPC does have an
* additional creator id...
*/
shp = shm_lock(shmid);
if(shp == NULL) {
err = -EINVAL;
goto out;
}
err = shm_checkid(shp,shmid);
if (err) {
shm_unlock(shp);
goto out;
}
if (ipcperms(&shp->shm_perm, acc_mode)) {
shm_unlock(shp);
err = -EACCES;
goto out;
}


err = security_shm_shmat(shp, shmaddr, shmflg);
if (err) {
shm_unlock(shp);
return err;
}

file = shp->shm_file;
size = i_size_read(file->f_dentry->d_inode);
shp->shm_nattch++;
shm_unlock(shp);


down_write(¤t->mm->mmap_sem);
if (addr && !(shmflg & SHM_REMAP)) {
user_addr = ERR_PTR(-EINVAL);
if (find_vma_intersection(current->mm, addr, addr + size))
goto invalid;
/*
* If shm segment goes below stack, make sure there is some
* space left for the stack to grow (at least 4 pages).
*/
if (addr < current->mm->start_stack &&
   addr > current->mm->start_stack - size - PAGE_SIZE * 5)
goto invalid;
}

user_addr = (void*) do_mmap (file, addr, size, prot, flags, 0);


invalid:
up_write(¤t->mm->mmap_sem);


down (&shm_ids.sem);
if(!(shp = shm_lock(shmid)))
BUG();
shp->shm_nattch--;
if(shp->shm_nattch == 0 &&
  shp->shm_flags & SHM_DEST)
shm_destroy (shp);
else
shm_unlock(shp);
up (&shm_ids.sem);


*raddr = (unsigned long) user_addr;
err = 0;
if (IS_ERR(user_addr))
err = PTR_ERR(user_addr);
out:
return err;
}

先檢查傳入的shmid共享記憶體是否存在，存在則去檢查它的訪問許可權。修改該共享記憶體shmid_kernel中的資料（比如連線數shm_nattch加1）

down_write開始對映，把共享記憶體連線到程序虛擬地址空間。

shmdt函式

asmlinkage long sys_shmdt(char __user *shmaddr)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *next;
unsigned long addr = (unsigned long)shmaddr;
loff_t size = 0;
int retval = -EINVAL;


down_write(&mm->mmap_sem);


/*
* This function tries to be smart and unmap shm segments that
* were modified by partial mlock or munmap calls:
* - It first determines the size of the shm segment that should be
*   unmapped: It searches for a vma that is backed by shm and that
*   started at address shmaddr. It records it's size and then unmaps
*   it.
* - Then it unmaps all shm vmas that started at shmaddr and that
*   are within the initially determined size.
* Errors from do_munmap are ignored: the function only fails if
* it's called with invalid parameters or if it's called to unmap
* a part of a vma. Both calls in this function are for full vmas,
* the parameters are directly copied from the vma itself and always
* valid - therefore do_munmap cannot fail. (famous last words?)
*/
/*
* If it had been mremap()'d, the starting address would not
* match the usual checks anyway. So assume all vma's are
* above the starting address given.
*/
vma = find_vma(mm, addr);


while (vma) {
next = vma->vm_next;


/*
* Check if the starting address would match, i.e. it's
* a fragment created by mprotect() and/or munmap(), or it
* otherwise it starts at this address with no hassles.
*/
if ((vma->vm_ops == &shm_vm_ops || is_vm_hugetlb_page(vma)) &&
(vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff) {




size = vma->vm_file->f_dentry->d_inode->i_size;
do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
/*
* We discovered the size of the shm segment, so
* break out of here and fall through to the next
* loop that uses the size information to stop
* searching for matching vma's.
*/
retval = 0;
vma = next;
break;
}
vma = next;
}


/*
* We need look no further than the maximum address a fragment
* could possibly have landed at. Also cast things to loff_t to
* prevent overflows and make comparisions vs. equal-width types.
*/
while (vma && (loff_t)(vma->vm_end - addr) <= size) {
next = vma->vm_next;


/* finding a matching vma now does not alter retval */
if ((vma->vm_ops == &shm_vm_ops || is_vm_hugetlb_page(vma)) &&
(vma->vm_start - addr)/PAGE_SIZE == vma->vm_pgoff)


do_munmap(mm, vma->vm_start, vma->vm_end - vma->vm_start);
vma = next;
}


up_write(&mm->mmap_sem);
return retval;
}

shmdt斷開程序與虛擬地址shmaddr的共享記憶體的連線。

首先找到程序虛擬空間中地址為shmaddr的VMA（linux管理程序虛擬地址空間的資料結構）。

/*
 * Look up the first VMA which satisfies  addr < vm_end,  NULL if none
 */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
struct vm_list_struct *vml;


for (vml = mm->context.vmlist; vml; vml = vml->next)
if (addr >= vml->vma->vm_start && addr < vml->vma->vm_end)
return vml->vma;


return NULL;
}


EXPORT_SYMBOL(find_vma);

這個函式的返回值是vm_area_struct型別，vm_area_struct結構體儲存程序虛擬地址資訊。

/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task.  A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
struct mm_struct * vm_mm;/* The address space we belong to. */
unsigned long vm_start;/* Our start address within vm_mm. */
unsigned long vm_end;/* The first byte after our end address
  within vm_mm. */


/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next;


pgprot_t vm_page_prot;/* Access permissions of this VMA. */
unsigned long vm_flags;/* Flags, listed below. */


struct rb_node vm_rb;


/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap prio tree, or
* linkage to the list of like vmas hanging off its node, or
* linkage of vma in the address_space->i_mmap_nonlinear list.
*/
union {
struct {
struct list_head list;
void *parent; /* aligns with prio_tree_node parent */
struct vm_area_struct *head;
} vm_set;


struct raw_prio_tree_node prio_tree_node;
} shared;


/*
* A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
* list, after a COW of one of the file pages.  A MAP_SHARED vma
* can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
* or brk vma (with NULL file) can only be in an anon_vma list.
*/
struct list_head anon_vma_node;/* Serialized by anon_vma->lock */
struct anon_vma *anon_vma;/* Serialized by page_table_lock */


/* Function pointers to deal with this struct. */
struct vm_operations_struct * vm_ops;


/* Information about our backing store: */
unsigned long vm_pgoff;/* Offset (within vm_file) in PAGE_SIZE
  units, *not* PAGE_CACHE_SIZE */
struct file * vm_file;/* File we map to (can be NULL). */
void * vm_private_data;/* was vm_pte (shared mem) */
unsigned long vm_truncate_count;/* truncate_count or restart_addr */


#ifndef CONFIG_MMU
atomic_t vm_usage;/* refcount (VMAs shared if !MMU) */
#endif
#ifdef CONFIG_NUMA
struct mempolicy *vm_policy;/* NUMA policy for the VMA */
#endif
};

接下來，呼叫do_mumap函式，斷開連線

int do_munmap(struct mm_struct *mm, unsigned long addr, size_t len)
{
struct vm_list_struct *vml, **parent;
unsigned long end = addr + len;


#ifdef DEBUG
printk("do_munmap:\n");
#endif


for (parent = &mm->context.vmlist; *parent; parent = &(*parent)->next)
if ((*parent)->vma->vm_start == addr &&
   ((len == 0) || ((*parent)->vma->vm_end == end)))
goto found;


printk("munmap of non-mmaped memory by process %d (%s): %p\n",
      current->pid, current->comm, (void *) addr);
return -EINVAL;


 found:
vml = *parent;


put_vma(vml->vma);


*parent = vml->next;
realalloc -= kobjsize(vml);
askedalloc -= sizeof(*vml);
kfree(vml);


update_hiwater_vm(mm);
mm->total_vm -= len >> PAGE_SHIFT;


#ifdef DEBUG
show_process_blocks();
#endif


return 0;
}

shmctl函式

asmlinkage long sys_shmctl (int shmid, int cmd, struct shmid_ds __user *buf)
{
struct shm_setbuf setbuf;
struct shmid_kernel *shp;
int err, version;


if (cmd < 0 || shmid < 0) {
err = -EINVAL;
goto out;
}


version = ipc_parse_version(&cmd);


switch (cmd) { /* replace with proc interface ? */
case IPC_INFO:
{
struct shminfo64 shminfo;


err = security_shm_shmctl(NULL, cmd);
if (err)
return err;


memset(&shminfo,0,sizeof(shminfo));
shminfo.shmmni = shminfo.shmseg = shm_ctlmni;
shminfo.shmmax = shm_ctlmax;
shminfo.shmall = shm_ctlall;


shminfo.shmmin = SHMMIN;
if(copy_shminfo_to_user (buf, &shminfo, version))
return -EFAULT;
/* reading a integer is always atomic */
err= shm_ids.max_id;
if(err<0)
err = 0;
goto out;
}
case SHM_INFO:
{
struct shm_info shm_info;


err = security_shm_shmctl(NULL, cmd);
if (err)
return err;


memset(&shm_info,0,sizeof(shm_info));
down(&shm_ids.sem);
shm_info.used_ids = shm_ids.in_use;
shm_get_stat (&shm_info.shm_rss, &shm_info.shm_swp);
shm_info.shm_tot = shm_tot;
shm_info.swap_attempts = 0;
shm_info.swap_successes = 0;
err = shm_ids.max_id;
up(&shm_ids.sem);
if(copy_to_user (buf, &shm_info, sizeof(shm_info))) {
err = -EFAULT;
goto out;
}


err = err < 0 ? 0 : err;
goto out;
}
case SHM_STAT:
case IPC_STAT:
{
struct shmid64_ds tbuf;
int result;
memset(&tbuf, 0, sizeof(tbuf));
shp = shm_lock(shmid);
if(shp==NULL) {
err = -EINVAL;
goto out;
} else if(cmd==SHM_STAT) {
err = -EINVAL;
if (shmid > shm_ids.max_id)
goto out_unlock;
result = shm_buildid(shmid, shp->shm_perm.seq);
} else {
err = shm_checkid(shp,shmid);
if(err)
goto out_unlock;
result = 0;
}
err=-EACCES;
if (ipcperms (&shp->shm_perm, S_IRUGO))
goto out_unlock;
err = security_shm_shmctl(shp, cmd);
if (err)
goto out_unlock;
kernel_to_ipc64_perm(&shp->shm_perm, &tbuf.shm_perm);
tbuf.shm_segsz= shp->shm_segsz;
tbuf.shm_atime= shp->shm_atim;
tbuf.shm_dtime= shp->shm_dtim;
tbuf.shm_ctime= shp->shm_ctim;
tbuf.shm_cpid = shp->shm_cprid;
tbuf.shm_lpid = shp->shm_lprid;
if (!is_file_hugepages(shp->shm_file))
tbuf.shm_nattch= shp->shm_nattch;
else
tbuf.shm_nattch = file_count(shp->shm_file) - 1;
shm_unlock(shp);
if(copy_shmid_to_user (buf, &tbuf, version))
err = -EFAULT;
else
err = result;
goto out;
}
case SHM_LOCK:
case SHM_UNLOCK:
{
shp = shm_lock(shmid);
if(shp==NULL) {
err = -EINVAL;
goto out;
}
err = shm_checkid(shp,shmid);
if(err)
goto out_unlock;


if (!capable(CAP_IPC_LOCK)) {
err = -EPERM;
if (current->euid != shp->shm_perm.uid &&
   current->euid != shp->shm_perm.cuid)
goto out_unlock;
if (cmd == SHM_LOCK &&
   !current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur)
goto out_unlock;
}


err = security_shm_shmctl(shp, cmd);
if (err)
goto out_unlock;

if(cmd==SHM_LOCK) {
struct user_struct * user = current->user;
if (!is_file_hugepages(shp->shm_file)) {
err = shmem_lock(shp->shm_file, 1, user);
if (!err) {
shp->shm_flags |= SHM_LOCKED;
shp->mlock_user = user;
}
}
} else if (!is_file_hugepages(shp->shm_file)) {
shmem_lock(shp->shm_file, 0, shp->mlock_user);
shp->shm_flags &= ~SHM_LOCKED;
shp->mlock_user = NULL;
}
shm_unlock(shp);
goto out;
}
case IPC_RMID:
{
/*
* We cannot simply remove the file. The SVID states
* that the block remains until the last person
* detaches from it, then is deleted. A shmat() on
* an RMID segment is legal in older Linux and if 
* we change it apps break...
*
* Instead we set a destroyed flag, and then blow
* the name away when the usage hits zero.
*/
down(&shm_ids.sem);
shp = shm_lock(shmid);
err = -EINVAL;
if (shp == NULL) 
goto out_up;
err = shm_checkid(shp, shmid);
if(err)
goto out_unlock_up;


if (current->euid != shp->shm_perm.uid &&
   current->euid != shp->shm_perm.cuid && 
   !capable(CAP_SYS_ADMIN)) {
err=-EPERM;
goto out_unlock_up;
}


err = security_shm_shmctl(shp, cmd);
if (err)
goto out_unlock_up;


if (shp->shm_nattch){
shp->shm_flags |= SHM_DEST;
/* Do not find it any more */
shp->shm_perm.key = IPC_PRIVATE;
shm_unlock(shp);
} else
shm_destroy (shp);
up(&shm_ids.sem);
goto out;
}


case IPC_SET:
{
if (copy_shmid_from_user (&setbuf, buf, version)) {
err = -EFAULT;
goto out;
}
if ((err = audit_ipc_perms(0, setbuf.uid, setbuf.gid, setbuf.mode)))
return err;
down(&shm_ids.sem);
shp = shm_lock(shmid);
err=-EINVAL;
if(shp==NULL)
goto out_up;
err = shm_checkid(shp,shmid);
if(err)
goto out_unlock_up;
err=-EPERM;
if (current->euid != shp->shm_perm.uid &&
   current->euid != shp->shm_perm.cuid && 
   !capable(CAP_SYS_ADMIN)) {
goto out_unlock_up;
}


err = security_shm_shmctl(shp, cmd);
if (err)
goto out_unlock_up;

shp->shm_perm.uid = setbuf.uid;
shp->shm_perm.gid = setbuf.gid;
shp->shm_flags = (shp->shm_flags & ~S_IRWXUGO)
| (setbuf.mode & S_IRWXUGO);
shp->shm_ctim = get_seconds();
break;
}


default:
err = -EINVAL;
goto out;
}


err = 0;
out_unlock_up:
shm_unlock(shp);
out_up:
up(&shm_ids.sem);
goto out;
out_unlock:
shm_unlock(shp);
out:
return err;
}

shmctl函式根據傳入的cmd值的不同，進行不同的操作。例如SHM_STAT/IPC_STAT拷貝標示符為shmid的共享記憶體資訊或者IPC資訊到緩衝區buf。

先判斷標示符為shmid的標示符是否存在，然後根據cmd請求呼叫shm_buildid函式重新得到該共享記憶體標示符並返回。接著判斷讀取許可權，將共享記憶體資訊拷貝到臨時緩衝區tbuf，呼叫copy_shmid_to_user將tbuf拷貝到buf。

linux kernel原始碼剖析共享記憶體部分 IPC 虛擬記憶體對映 VMA shmget shmat shmdt shmctl

linux kernel原始碼剖析共享記憶體部分 IPC 虛擬記憶體對映 VMA shmget shmat shmdt shmctl

共享記憶體linux C/C++程式碼實戰------順便玩下ipcs, ipcrm, shmget, shmat, shmdt, shmctl

【原始碼剖析】MemoryPool —— 簡單高效的記憶體池 allocator 實現

CentOS 安裝 linux kernel 原始碼

linux kernel 原始碼安裝

為什麼linux下多執行緒程式如此消耗虛擬記憶體

什麼是記憶體(二)：虛擬記憶體

作業系統——記憶體管理和虛擬地址對映那些事

百度開源分散式id生成器uid-generator原始碼剖析偽共享（false sharing），併發程式設計無聲的效能殺手一個Java物件到底佔用多大記憶體？寫Java也得了解CPU--偽共享

Linux基於Live555從共享記憶體獲取rstp實時H264視訊流並轉發附原始碼

linux kernel(二)原始碼淺析

Redis原始碼剖析--記憶體管理zmalloc

hadoop2.6.0原始碼剖析-客戶端(第二部分--讀(open)HDFS檔案)

hadoop2.6.0原始碼剖析-客戶端(第二部分--DFSClient)下(HA代理)

hadoop2.6.0原始碼剖析-客戶端(第二部分--DFSClient)上(非HA代理)

hadoop2.6.0原始碼剖析-客戶端(第二部分--DistributedFileSystem)

hadoop2.6.0原始碼剖析-客戶端(第二部分--Configuration)

linux下poll和epoll核心原始碼剖析

hadoop2.6.0原始碼剖析-客戶端(第二部分--讀(read)HDFS檔案)

linux中幾種共享記憶體

linux kernel原始碼剖析 共享記憶體部分 IPC 虛擬記憶體對映 VMA shmget shmat shmdt shmctl

相關推薦

linux kernel原始碼剖析共享記憶體部分 IPC 虛擬記憶體對映 VMA shmget shmat shmdt shmctl