Real-Time 3D Face Tracking with Deep Learning

Snapchat was made popular by putting funny dog ears on people's heads, swapping faces and other tricks that, beyond being funny, look impossible, even magical. I am in the digital visual effects industry, so I am familiar with that magic… and with the desire to understand how it works behind the scenes.

Behind the magic

Modifying people's faces is routine work in Hollywood visual effects; it's a well-understood craft nowadays, but it typically requires tens of digital artists to achieve a photorealistic face transformation. How can we automate that?

Here’s a simplified breakdown of the steps these artists follow:

  1. Tracking the position, shape and movement of the face relative to the camera in 3D

  2. Animation of the 3D models to snap onto the tracked face (e.g. a dog nose)

  3. Lighting and rendering of the 3D models into 2D images

  4. Compositing of the rendered CGI images with the live action footage

Automating steps 2 and 3 is not very different from what happens in video games; it's relatively straightforward. Compositing can be simplified to rendering the 3D foreground over the live background, easy. The challenge is the tracking: how can a program 'see' the complex motion of a human head?

Tracking faces with Artificial Intelligence

The Computer Science community has been trying to track faces automatically for a long time, and it's hard. In recent years, Machine Learning came to the rescue, and many Deep Learning papers are published every year on the topic. I spent a while looking for the "state of the art" and realised doing this in real time is VERY HARD! A good reason to try and tackle the challenge (and it would work nicely with the AR beauty mode I have implemented).

“trying to track faces… it’s hard… doing this in real-time is VERY HARD!”

Here’s how I did it.

Designing the network

Convolutional Neural Networks are popular for visual analysis of images and commonly used for applications such as object detection and image recognition.

[Image from this publication⁹]

For a deep neural network to be evaluated in real time (at least 30 times per second), a compact network is desired¹. With the popularity of Machine Learning and smartphones, new models are published every year that push the limit of efficiency, offering a trade-off between precision and computational overhead. Among such models, MobileNet, SqueezeNet and ShuffleNet are popular for applications on mobile devices, thanks to their compactness.

[Image: architecture of ShuffleNet V2 at different levels of complexity (from the authors¹)]

ShuffleNet V2¹ was recently introduced and offers state-of-the-art performance, coming in various sizes that balance speed against accuracy. It ships with PyTorch, one more reason to pick that model.

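As a minimal sketch of how such a backbone can be repurposed for landmark regression (assuming the torchvision implementation of ShuffleNet V2 and a 68-point landmark convention, which the article does not specify), the ImageNet classification head is simply replaced with a layer that outputs one 3D coordinate per landmark:

```python
import torch
import torch.nn as nn
import torchvision.models as models

NUM_LANDMARKS = 68  # assumption: the common 68-point facial landmark scheme

# ShuffleNet V2 2x, the largest variant, ships with torchvision
backbone = models.shufflenet_v2_x2_0()

# Replace the classification head with a regression head:
# 3 coordinates (x, y, z) per landmark.
backbone.fc = nn.Linear(backbone.fc.in_features, NUM_LANDMARKS * 3)

# A forward pass on an image crop yields a flat coordinate vector.
x = torch.randn(1, 3, 224, 224)
landmarks = backbone(x).view(-1, NUM_LANDMARKS, 3)
print(landmarks.shape)  # torch.Size([1, 68, 3])
```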

Choosing the features to learn

[Image from this paper²]

Now I need to find what features the CNN should learn. A common approach is defining a list of anchor points for different key parts of the face, also called ‘facial landmarks’.

The points are numbered and placed strategically around the eyes, eyebrows, nose, mouth and jawline. I want to train the network to identify the coordinates of each point, so I can later reconstruct masks or geometric meshes based on them.

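For reference, here is how such a landmark set can be laid out in memory. The grouping below follows the widely used 68-point iBUG numbering; the exact point count and index ranges are an illustration of the idea, not taken from the article:

```python
import numpy as np

# One face = 68 landmarks, each with (x, y, z) coordinates.
landmarks = np.zeros((68, 3), dtype=np.float32)

# Index ranges of the standard 68-point iBUG annotation scheme.
REGIONS = {
    "jawline":       range(0, 17),
    "right_eyebrow": range(17, 22),
    "left_eyebrow":  range(22, 27),
    "nose":          range(27, 36),
    "right_eye":     range(36, 42),
    "left_eye":      range(42, 48),
    "mouth":         range(48, 68),
}

# e.g. the 3D points outlining the mouth, for a mask or mesh later on
mouth_points = landmarks[list(REGIONS["mouth"])]
```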

Building a training dataset

Because I want to augment videos with 3D effects, I looked for a dataset with 3D landmark coordinates. 300W-LP is one of the few datasets that come with 3D positions; it's pretty large and, as a bonus, offers a good diversity of face angles. I want to benchmark my solution against the state of the art: recent publications test their models on AFLW2000-3D, so I go for 300W-LP for training and test on AFLW2000-3D for comparison.

[Images: samples from 300W-LP³; the profile views are generated mathematically]

A note on these datasets: they are meant for the research community and are generally not free for commercial use.

Augmenting the dataset

Dataset augmentation improves the accuracy of the training by adding even more variation to the set than it already has. I apply the following transformations, each by a random amount, to every image and its landmarks to create new samples: rotation up to ±40° around the centre, translation and scale up to 10%, and horizontal flip. For additional augmentation, I apply a different random transformation in memory to each image on every learning pass (epoch), as sketched below.

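A minimal sketch of that per-sample augmentation, assuming the image is a numpy array and the landmarks an (N, 3) array; the helper below is illustrative, not the article's actual code:

```python
import numpy as np
import cv2

def random_augment(image, landmarks):
    """Apply a random rotation/translation/scale/flip to an image
    and its (N, 3) landmarks. Illustrative sketch only."""
    landmarks = landmarks.copy()
    h, w = image.shape[:2]
    angle = np.random.uniform(-40, 40)          # rotation around the centre
    scale = 1.0 + np.random.uniform(-0.1, 0.1)  # up to 10% scale
    tx = np.random.uniform(-0.1, 0.1) * w       # up to 10% translation
    ty = np.random.uniform(-0.1, 0.1) * h

    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += (tx, ty)
    image = cv2.warpAffine(image, M, (w, h))

    # Transform the x, y coordinates (depth z is only scaled).
    xy = landmarks[:, :2]
    landmarks[:, :2] = xy @ M[:, :2].T + M[:, 2]
    landmarks[:, 2] *= scale

    if np.random.rand() < 0.5:                  # horizontal flip
        image = image[:, ::-1].copy()
        landmarks[:, 0] = w - landmarks[:, 0]
        # a real pipeline must also swap left/right landmark indices here

    return image, landmarks
```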

It's also necessary to crop the input image close to the bounding box of the landmarks, so that the CNN recognises the landmarks at consistent relative locations. That's done as a preprocess, to save on load time from disk during training.

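The crop itself is simple; a sketch, where the margin value and helper name are my own illustrative choices:

```python
def crop_to_landmarks(image, landmarks, margin=0.25):
    """Crop the image around the landmark bounding box, with a margin,
    and shift the landmarks into the crop's coordinate frame. Sketch only."""
    x_min, y_min = landmarks[:, :2].min(axis=0)
    x_max, y_max = landmarks[:, :2].max(axis=0)
    pad_x = (x_max - x_min) * margin
    pad_y = (y_max - y_min) * margin
    x0 = max(int(x_min - pad_x), 0)
    y0 = max(int(y_min - pad_y), 0)
    x1 = min(int(x_max + pad_x), image.shape[1])
    y1 = min(int(y_max + pad_y), image.shape[0])

    crop = image[y0:y1, x0:x1]
    landmarks = landmarks.copy()
    landmarks[:, 0] -= x0
    landmarks[:, 1] -= y0
    return crop, landmarks
```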

Designing the loss function

[Image from the publication⁴]

Typically an L2 loss function is used to measure the prediction error for landmark positions. A recent publication⁴ describes a so-called Wing loss function that performs better for this application, which I could verify. I parametrise it with w = 10 and ε = 2, as suggested by the authors, and sum the result over all landmark coordinates.

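In code, the Wing loss behaves logarithmically for small errors and linearly for large ones, with a constant C joining the two pieces continuously. A small PyTorch version as I understand it from the paper⁴ (my sketch, not the authors' reference implementation):

```python
import math
import torch

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Wing loss⁴, summed over all landmark coordinates."""
    x = (pred - target).abs()
    # C makes the log and linear pieces join continuously at |x| = w.
    C = w - w * math.log(1.0 + w / eps)
    loss = torch.where(x < w,
                       w * torch.log(1.0 + x / eps),  # small errors
                       x - C)                          # large errors
    return loss.sum()
```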

Training the network

Training a deep neural network is a very expensive operation that requires powerful computers. Using my laptop would literally have taken weeks for one training phase, and building a decent setup costs thousands of dollars. I decided to leverage the cloud, so I can pay just for the compute power I need.

I chose Genesis Cloud, which offers very competitive prices and $50 of free credit to get started. I built a Linux VM with a GeForce GTX 1080 Ti, prepared an OS and storage image where I set up PyTorch, and uploaded my code and the datasets, all through ssh. Once the system is set up, it can be started and shut down on demand, and creating a snapshot allows me to resume the work where I left it.

[Plot: mean error for each epoch]

The inner training loop processes mini-batches of 32 images to maximise the parallel computation on the GPU. A learning pass (epoch) processes the entire set of about 60,000 images and takes about 4 minutes. The training converges around 70 epochs, so I let it run overnight for 100 epochs to be safe.

I use the popular Adam optimiser, which automatically adapts the learning rate, starting with a rate of 0.001. I found that setting the initial learning rate right is critical: if it's too small, the training converges too early to a sub-optimal solution; if it's too large, it has difficulties converging at all. I found the value through trial and error, which is time-consuming… and actually costly when paying for the cloud per use!

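Putting those choices together, the training loop looks roughly like this (dataset wiring omitted; `train_set` is a placeholder for the cropped, augmented 300W-LP samples, and `backbone` and `wing_loss` come from the sketches above):

```python
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda")
model = backbone.to(device)            # ShuffleNet V2 with the regression head
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loader = DataLoader(train_set, batch_size=32, shuffle=True, num_workers=4)

for epoch in range(100):               # converges around epoch 70
    total = 0.0
    for images, landmarks in loader:
        images = images.to(device)
        landmarks = landmarks.to(device)

        optimizer.zero_grad()
        pred = model(images).view(landmarks.shape)
        loss = wing_loss(pred, landmarks)
        loss.backward()
        optimizer.step()
        total += loss.item()
    print(f"epoch {epoch}: mean loss {total / len(loader):.4f}")
```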

Evaluation

All these efforts paid off: with the bigger network, ShuffleNet V2 2x, I obtain a Normalised Mean Error (NME) of 2.796 on AFLW2000-3D. That's better, by a good margin, than the state-of-the-art model⁵ on that dataset and its NME of 3.07, despite that model being much heavier!
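
For completeness, the NME as commonly reported on this benchmark is the mean point-to-point landmark error normalised by the face's bounding box size; the sketch below uses the square root of the box area as the normaliser, which I understand to be the usual convention for AFLW2000-3D, and reports the result as a percentage:

```python
import numpy as np

def nme(pred, gt):
    """Normalised Mean Error for one face: mean landmark distance,
    normalised by the ground-truth bounding box size. Sketch only."""
    w = gt[:, 0].max() - gt[:, 0].min()
    h = gt[:, 1].max() - gt[:, 1].min()
    norm = np.sqrt(w * h)
    errors = np.linalg.norm(pred - gt, axis=1)
    return errors.mean() / norm * 100  # as a percentage
```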