The Curse of Dimensionality, Minus the Curse of Jargon

The curse of dimensionality! What on earth is that? Besides being a prime example of shock-and-awe names in machine learning jargon (which often sound far fancier than they are), it’s a reference to the effect that adding more features has on your dataset. In a nutshell, the curse of dimensionality is all about loneliness.

In a nutshell, the curse of dimensionality is all about loneliness.

Before I explain myself, let’s get some basic jargon out of the way. What’s a feature? It’s the machine learning word for what other disciplines might call a predictor / (independent) variable / attribute / signal. Information about each datapoint, in other words. Here’s a jargon intro if none of those words felt familiar.

Data social distancing is easy: just add a dimension. But for some algorithms, you may find that this is a curse…

When a machine learning algorithm is sensitive to the curse of dimensionality, it means the algorithm works best when your datapoints are surrounded in space by their friends. The fewer friends they have around them in space, the worse things get. Let’s take a look.

One dimension

Imagine you’re sitting in a large classroom, surrounded by your buddies.

You’re a datapoint, naturally. Let’s put you in one dimension by making the room dark and shining a bright light from the back of the room at you. Your shadow is projected onto a line on the front wall. On that line, it’s not lonely at all. You and your crew are sardines in a can, all lumped together. It’s cozy in one dimension! Perhaps a little too cozy.

Two dimensions

To give you room to breathe, let’s add a dimension. We’re in 2D and the plane is the floor of the room. In this space, you and your friends are more spread out. Personal space is a thing again.

Note: If you prefer to follow along in an imaginary spreadsheet, think of adding/removing a dimension as inserting/deleting a column of numbers.
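
If you’d like that spreadsheet picture in runnable form, here’s a minimal sketch in Python with pandas (the column names are invented purely for illustration):

    import pandas as pd

    # Each row is a student (datapoint); each column is a dimension (feature).
    students = pd.DataFrame({
        "seat_row": [3, 7, 1],
        "seat_col": [4, 2, 9],
    })

    # Adding a dimension = inserting a column.
    students["floor"] = [2, 5, 1]

    # Removing a dimension = deleting a column.
    students = students.drop(columns="floor")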

Three dimensions

Let’s add a third dimension by randomly sending each of you to one of the floors of the 5-floor building you were in.

All of a sudden, you’re not so densely surrounded by friends anymore. It’s lonely around you. If you enjoyed the feeling of a student in nearly every seat, chances are you’re now mournfully staring at quite a few empty chairs. You’re beginning to get misty eyed, but at least one of your buddies is probably still near you…

Four dimensions

Not for long! Let’s add another dimension. Time.

The students are spread among 60min sections of this class (on various floors) at various times — let’s limit ourselves to 9 sessions because lecturers need sleep and, um, lives. So, if you were lucky enough to still have a companion for emotional support before, I’m fairly confident you’re socially distanced now. If you can’t be effective when you’re lonely, boom! We have our problem. The curse of dimensionality has struck!

MOAR dimensions

As we add dimensions, you get lonely very, very quickly. If we want to make sure that every student is just as surrounded by friends as they were in 2D, we’re going to need students. Lots of them.

The most important idea here is that we have to recruit more friends exponentially, not linearly, to keep your blues at bay.

If we add two dimensions, we can’t simply compensate with two more students… or even two more classrooms’ worth of students. If we started with 50 students in the room originally and we added 5 floors and 9 classes, we need 5x9=45 times more students to keep one another as much company as 50 could have done. So, we need 45x50=2,250 students to avoid loneliness. That’s a whole lot more than one extra student per dimension! Data requirements go up quickly.
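
If you like seeing that arithmetic spelled out, here’s a tiny sketch of the calculation. It assumes, as the story above does, that every new dimension multiplies the number of spots to fill (5 floors, then 9 sessions):

    # Toy back-of-envelope for exponential data requirements.
    baseline = 50                 # students that made 2D feel cozy
    levels_per_new_dim = [5, 9]   # 5 floors, then 9 class sessions

    needed = baseline
    for levels in levels_per_new_dim:
        needed *= levels          # each new dimension multiplies, not adds

    print(needed)                 # 50 * 5 * 9 = 2250 students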

When you add dimensions, minimum data requirements can grow rapidly.

We need to recruit many, many more students (datapoints) every time we go up a dimension. If data are expensive for you, this curse is really no joke!

Dimensional divas

Not all machine learning algorithms get so emotional when confronted with a bit of me-time. Methods like k-NN are complete divas, of course. It’s hardly a surprise for a method whose name abbreviation stands for k-Nearest Neighbors — it’s about computing things about neighboring datapoints, so it’s rather important that the datapoints are neighborly.
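
You can watch k-NN’s loneliness problem directly: hold the number of datapoints fixed, crank up the dimensions, and measure how far away the nearest neighbor drifts. A minimal sketch with scikit-learn, on random stand-in data (uniform noise, purely for illustration):

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(0)
    n_points = 500  # fixed "class size"

    for dim in [1, 2, 3, 10, 100]:
        X = rng.uniform(size=(n_points, dim))
        # Ask for 2 neighbors because each point's closest neighbor is itself.
        nn = NearestNeighbors(n_neighbors=2).fit(X)
        distances, _ = nn.kneighbors(X)
        print(dim, distances[:, 1].mean())  # average distance to nearest friend

The average distance climbs steadily with dimension even though the number of datapoints never changed; that’s the loneliness k-NN is sulking about.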

Other methods are a lot more robust when it comes to dimensions. If you’ve taken a class on linear regression, for example, you’ll know that once you have a respectable number of datapoints, gaining or dropping a dimension isn’t going to make anything implode catastrophically. There’s still a price; it’s just more affordable.*

*Which doesn’t mean it is resilient to all abuse! If you’ve never known the chaos that including a single outlier or adding one near-duplicate feature can unleash on the least squares approach (the Napoleon of crime, Multicollinearity, strikes again!) then consider yourself warned. No method is perfect for every situation. And, yes, that includes neural networks.
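
If you’d like to see the multicollinearity villain in action, here’s a small sketch: fabricate a near-duplicate feature and watch least squares produce wild coefficients, even though its predictions stay sensible. (Random stand-in data, purely for illustration.)

    import numpy as np

    rng = np.random.default_rng(1)
    n = 100
    x1 = rng.normal(size=n)
    y = 3 * x1 + rng.normal(scale=0.1, size=n)

    # A near-duplicate feature: x1 plus a whisper of noise.
    x2 = x1 + rng.normal(scale=1e-6, size=n)

    X = np.column_stack([x1, x2])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(coef)  # huge opposite-signed coefficients that still sum to roughly 3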

What should you do about it?

What are you going to do about the curse of dimensionality in practice? If you’re a machine learning researcher, you’d better know if your algorithm has this problem… but I’m sure you already do. You’re probably not reading this article, so we’ll just talk about you behind your back, shall we? But yeah, you might like to think about whether it’s possible to design the algorithm you’re inventing to be less sensitive to dimension. Many of your customers like their matrices on the full-figured side**, especially if things are getting textual.

**Conventionally, we arrange data in a matrix so that the rows are examples and the columns are features. In that case, a tall and skinny matrix has lots of examples spread over few dimensions.

If you’re an applied data science enthusiast, you’ll do what you always do: get a benchmark of the algorithm’s performance using just one or a few promising features before attempting to throw the kitchen sink at it. (I’ll explain why you need that habit in another post; if you want a clue in the meantime, look up the term overfitting.)
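
That habit might look something like this sketch (the dataset is synthetic and the model choice is just a placeholder): score a model on one promising feature, then on the whole kitchen sink, and compare on held-out data before trusting the extra dimensions.

    from sklearn.datasets import make_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # Stand-in dataset: 100 rows, 30 columns, only 3 of them informative.
    X, y = make_regression(n_samples=100, n_features=30, n_informative=3,
                           noise=10.0, shuffle=False, random_state=0)

    one_feature = cross_val_score(LinearRegression(), X[:, :1], y, cv=5).mean()
    kitchen_sink = cross_val_score(LinearRegression(), X, y, cv=5).mean()
    print(one_feature, kitchen_sink)  # cross-validated R^2 for each setup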

Some methods only work well on tall, skinny datasets, so you might need to put your dataset on a diet if you’re feeling cursed.

If your method works decently on a limited number of features and then blows a raspberry at you when you increase the dimensions, that’s your cue to either stick to a few features you handpick (or even stepwise-select if you’re getting crafty) or first make a few superfeatures out of your original kitchen sink by running some cute feature engineering techniques (you could try anything from old school things like principal component analysis (PCA) — still relevant today, eigenvectors never go out of fashion — to more modern things like autoencoders and other neural network funtimes). You don’t really need to know the term curse of dimensionality to get your work done because your process — start small and build up the complexity — should take care of it for you, but if it was bothering you… now you can shrug off the worry.
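
In case you want to try the PCA flavor of that superfeature move, here’s a minimal scikit-learn sketch; the choice of 5 components is an arbitrary knob for illustration, not a recommendation:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 50))  # stand-in kitchen sink: 200 rows, 50 features

    # Squash 50 features down to 5 "superfeatures" (principal components).
    shrink = make_pipeline(StandardScaler(), PCA(n_components=5))
    X_small = shrink.fit_transform(X)
    print(X_small.shape)  # (200, 5)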

To summarize: As you add more and more features (columns), you need an exponentially-growing amount of examples (rows) to overcome how spread out your datapoints are in space. Some methods only work well on long skinny datasets, so you might need to put your dataset on a diet if you’re feeling cursed.

Image caption (partially recovered): “…spherical cow, er, I mean, meow-emitter is… and more a matter of how many packing peanuts it is covered in.” Image: SOURCE

Thanks for reading! Liked the author?

If you’re keen to read more of my writing, most of the links in this article take you to my other musings. Can’t choose? Try this one:

Translated from: https://towardsdatascience.com/the-curse-of-dimensionality-minus-the-curse-of-jargon-520da109fc87
