Measuring Change

Let's do a thought experiment. Imagine that the world around you just froze. Nothing is changing. Everything is at a standstill. Is there anything meaningful? Anything you can learn? Nothing. A stopped world is equivalent to no world! The world exists because it “changes”. Rotations and revolutions of planets, the evolution of species, economic growth or decline, population dynamics, the rise and fall of stock markets... there is a notion of “change” at the heart of just about every phenomenon.

And if you take a big leap of generalization, you can see that all human knowledge is about the recognition, understanding, measurement, and manipulation of changes that occur in nature. Any knowledge about a phenomenon roughly tries to answer the following questions:

Is there a change? Which things are changing?

Does the change in one thing affect others?

How can this change be measured? Is there a rule that can explain and express this change? Can it be used to predict the change in the future?

When you are observing or studying a phenomenon, you focus on the things that are involved. These things are called “variables” because they “vary” or “change”. The kind of “changes” or “variations” they undergo depends on the kind of variables they are.

Quantitative or Qualitative

Fundamentally, all variables (or all data) can only be of 2 types: Quantitative or Qualitative.

There are things for which you can measure the “size” or “value” like speed, distance, volume, weight, height, income, etc. These are numerical values and it makes sense to perform arithmetic operations such as addition or subtraction on such variables. Quantitative variables typically have measurement units, such as kilograms, dollars, years, and so on.


On the other hand, there are variables that describe something, e.g., the name of a person or a place, gender, race, category of animal, and so on. Arithmetic operations on such variables do not make sense. You cannot add or subtract genders or categories. These variables typically do not have units. Note that qualitative variables can also take numerical values, such as PIN codes or Social Security Numbers (SSNs), but you would not want to add PIN codes or SSNs. It would not make sense.

Now think of any data value about anything. This value will be either Quantitative or Qualitative. Quantitative variables are also often called “Continuous” variables. Inherently, they represent a “quantity” that is being measured, e.g., weight. It's “continuous” because the value can be anything within a range: 50 kg, 50.1 kg, 52 kg, 52.5 kg... just about anything.

Qualitative variables are also known as “Categorical” variables. Inherently, they represent a “group” or a “category”, and the value comes from a fixed set of such groups or categories, e.g., Gender (Male or Female) or Result of a Medical Test (Positive or Negative).

The type of variable (Quantitative or Qualitative) will determine the kind of analysis that can be done.


Let's look at some of the operations that can be done to describe or understand the changes in these variables.


Measures of Central Tendency

Image source: https://www.instagram.com/gauravkantgoel/

Whenever something is changing, what is our natural inclination to express or summarize these changes? On average, people will tend to find the “average”. (pun intended :)


Calculating the “average” is the most intuitive thing to do when we are analyzing data. Let's say you want to know how good a batsman is (in the game of cricket). You have been given the runs he scored in the last 10 matches:

5, 2, 12, 32, 21, 3, 12, 0, 15, 9

To measure his performance, or to judge how good or bad this player is, don't you have an instant urge to calculate the “average” runs scored over the 10 matches? You cannot make a decision or inference just by looking at the score of each match in isolation. You need one representative number to reflect it. And the natural tendency is to find the “average”.

Many of us don't realize it, but there are many ways to measure the “average”. These measurements are known as “Measures of Central Tendency”.

Mean:

Most of the time, when people talk about the average, they actually mean the “Mean” :)

Mean is simply the sum of all the given numbers divided by how many numbers there are.

In our example, the mean is (5+2+12+32+21+3+12+0+15+9)/10, which comes out to be 11.1.
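As a quick check of that arithmetic, here is a minimal Python sketch. It assumes only the ten run values quoted in the text; the variable names are illustrative.

```python
import statistics

# Runs scored in the last 10 matches (values taken from the text above).
runs = [5, 2, 12, 32, 21, 3, 12, 0, 15, 9]

# Mean = sum of all the values divided by how many values there are.
mean_by_hand = sum(runs) / len(runs)

# The standard library computes the same quantity.
mean_stdlib = statistics.mean(runs)

print(mean_by_hand, mean_stdlib)
```

Both numbers should agree with the 11.1 worked out by hand.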

You can say that “mean” is one type of “average”. Are there other types?


Median:

Assume that this player had another match in which he scored a whopping 400 runs!

The mean now comes out to be 46.45.

In this case, the mean can be misleading because, in every match but one, the player has scored much less than 46.45. Because of the single match where he scored 400, the overall summary is distorted. Such data points are called outliers. In such cases, a better measure of average is the “Median”, which is the middle point of the data values. In our example, we sort all 11 data values and choose the middle one (the 6th):

0, 2, 3, 5, 9, 12, 12, 15, 21, 32, 400

The median is 12, which is a more appropriate measure of “average” in this case.
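To see how a single outlier pulls the mean but barely moves the median, here is a small sketch using the same runs plus the 400-run match. The helper `median_by_hand` is an illustrative name, not something from the original post.

```python
import statistics

# The original 10 scores plus the extra 400-run match described above.
runs = [5, 2, 12, 32, 21, 3, 12, 0, 15, 9]
runs_with_outlier = runs + [400]


def median_by_hand(values):
    """Sort the values and pick the middle one (or average the two middle ones)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


mean_value = statistics.mean(runs_with_outlier)
median_value = statistics.median(runs_with_outlier)

print(round(mean_value, 2), median_value, median_by_hand(runs_with_outlier))
```

The mean jumps to about 46.45, while the median stays at 12.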

Mode:

A lesser-known measure of “average” is the Mode, which is the most frequently occurring value in the data set. In our example, ‘12’ occurs 2 times while every other score occurs once, so the mode is 12. Mean and Median are used for Quantitative variables, while Mode comes in handy for Qualitative variables. It makes no sense to calculate the mean or median for Qualitative variables like Gender, Colour, etc. Let's say, in our example, we have a column that tells whether a particular match was won or lost by the team.

The mode of the variable “Match Result” in this case is “Won”, since “Won” occurs 7 times while “Lost” occurs only 3 times.

We can say that on “average”, the team is winning.

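A short sketch of the mode for both kinds of variable. The run values come from the text; for the match results only the counts (7 won, 3 lost) are given, so the ordering of the `match_results` list below is an assumption made purely for illustration.

```python
import statistics
from collections import Counter

# Quantitative variable: runs from the original 10 matches.
runs = [5, 2, 12, 32, 21, 3, 12, 0, 15, 9]

# Qualitative variable: only the counts are known from the text;
# the order of wins and losses here is hypothetical.
match_results = ["Won"] * 7 + ["Lost"] * 3

runs_mode = statistics.mode(runs)              # most frequent score
result_mode = statistics.mode(match_results)   # most frequent category
result_counts = Counter(match_results)         # how often each category occurs

print(runs_mode, result_mode, dict(result_counts))
```

Note that the mode works on the category labels directly, with no arithmetic on them.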

“Mean”, “Median” and “Mode” are the three kinds of measurements of average which give an idea about the “centrality” of data.


But does the average provide enough information when analyzing or summarizing data? There is more that we can do.

Measures of Variability

In addition to finding the average of a data set, we can also look at how much the data is spread out, or in other words, how much “variability” there is in the data set. Let us understand it with an example. Until now, we were looking at the scores of a single player. Let's say we have been given the scores of 2 players across 10 matches. We have to compare these scores and get an idea of how these players have performed relative to each other.

Below is the data set (scores sorted in ascending order):

Player 1: 0 0 0 0 40 50 50 50 100 200
Player 2: 30 34 46 50 50 50 54 56 60 70

Both players have a mean (average) of about 50. So how can we compare them? One subtle thing to check is “how” these players have averaged over the matches. Is one player more consistent in scoring than the other? Is a player scoring huge runs in a few matches while scoring very few runs in the others? How are the scores “varying” across matches for these players?

Range

A very basic metric to check how the scores vary is the range of scores for each player.

Range is simply the difference between the largest and smallest observation in the data.

Range of Player 1 = Highest Score - Lowest Score = 200 - 0 = 200

Range of Player 2 = 70 - 30 = 40
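The same calculation as a minimal sketch, using the two players' scores as they are listed in this article. The function name `score_range` is illustrative.

```python
# Scores of the two players, as listed in the article (sorted ascending).
player1 = [0, 0, 0, 0, 40, 50, 50, 50, 100, 200]
player2 = [30, 34, 46, 50, 50, 50, 54, 56, 60, 70]


def score_range(scores):
    """Range = largest observation minus smallest observation."""
    return max(scores) - min(scores)


range_p1 = score_range(player1)
range_p2 = score_range(player2)

print(range_p1, range_p2)
```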

Player 1 has a wider range of scores than Player 2, which means that the scores of Player 1 vary more than those of Player 2. If we look at the data of Player 1 carefully, there are outliers present. In some matches this player scores exceptionally high, like 200, and in some matches he scores nothing (0). Range can measure how far the values are spread out, but it is sensitive to outliers.

Inter-quartile Range (IQR):

One way to overcome this shortcoming of Range is to exclude the extreme values and construct a ‘mini-range’. One such metric is the “Inter-quartile range”. We sort the data in ascending order and divide the data values into 4 groups (quartiles). We then consider only the middle 50% of the values. The spread of that middle half is the Inter-quartile range.

IQR for Player 1:

0 0 0 0 40 50 50 50 100 200

We have n = 10 scores. The position of the Lower Quartile is n/4 = 2.5, which rounds up to 3.

The position of the Upper Quartile is 3n/4 = 7.5, which rounds up to 8.

IQR = (Score at Position 8) - (Score at Position 3)

IQR = 50 - 0 = 50

IQR for Player 2:

30 34 46 50 50 50 54 56 60 70

We have n = 10 scores. The position of the Lower Quartile is n/4 = 2.5, which rounds up to 3.

The position of the Upper Quartile is 3n/4 = 7.5, which rounds up to 8.

IQR = (Score at Position 8) - (Score at Position 3)

IQR = 56 - 46 = 10
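The position rule used above can be sketched in Python as follows. Rounding the positions up with `math.ceil` is an assumption that reproduces the positions 3 and 8 used in the text; statistics libraries usually interpolate between neighbouring values instead, so their quartiles may differ slightly from this simple rule.

```python
import math

player1 = [0, 0, 0, 0, 40, 50, 50, 50, 100, 200]
player2 = [30, 34, 46, 50, 50, 50, 54, 56, 60, 70]


def iqr_by_position(scores):
    """IQR using the simple position rule from the text.

    Positions are 1-based: lower quartile at ceil(n/4),
    upper quartile at ceil(3n/4).
    """
    ordered = sorted(scores)
    n = len(ordered)
    lower_pos = math.ceil(n / 4)
    upper_pos = math.ceil(3 * n / 4)
    # Convert 1-based positions to 0-based list indices.
    return ordered[upper_pos - 1] - ordered[lower_pos - 1]


iqr_p1 = iqr_by_position(player1)
iqr_p2 = iqr_by_position(player2)

print(iqr_p1, iqr_p2)
```

`math.ceil` is used rather than the built-in `round` because Python's `round(2.5)` gives 2, not 3.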

A special graph called a “Box and Whisker” plot is used to visualize ranges.

For a given data attribute it shows 5 metrics:

  1. Minimum Value
  2. Maximum Value
  3. Median
  4. Quartile 1
  5. Quartile 3
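These five metrics can be computed directly, without drawing anything. The sketch below assumes the article's listed scores and the same round-up position rule for the quartiles; plotting libraries typically interpolate quartiles, so the box edges in an actual plot may differ a little. The function name and dictionary keys are illustrative.

```python
import math
import statistics

player1 = [0, 0, 0, 0, 40, 50, 50, 50, 100, 200]
player2 = [30, 34, 46, 50, 50, 50, 54, 56, 60, 70]


def five_number_summary(scores):
    """Return the five metrics a box and whisker plot is built from."""
    ordered = sorted(scores)
    n = len(ordered)
    # Quartiles by the simple 1-based position rule (assumption: round up).
    quartile_1 = ordered[math.ceil(n / 4) - 1]
    quartile_3 = ordered[math.ceil(3 * n / 4) - 1]
    return {
        "minimum": ordered[0],
        "quartile_1": quartile_1,
        "median": statistics.median(ordered),
        "quartile_3": quartile_3,
        "maximum": ordered[-1],
    }


summary_p1 = five_number_summary(player1)
summary_p2 = five_number_summary(player2)

print(summary_p1)
print(summary_p2)
```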

Let's compare the scores of the two players by drawing box and whisker plots:

[Figure: box and whisker plots of the scores of Player 1 and Player 2]

As evident from the above graph, Player 1 has more variation than Player 2. The median and mean of the two players are roughly the same, but the scores of Player 1 differ from the mean a lot more than those of Player 2. We can say that Player 1 is less “consistent” than Player 2. Though the highest score of Player 1 (200) is much higher than the highest score of Player 2 (70), he is not consistent in scoring runs.

A more effective way to judge the performance of a player is to find out how much the player deviates from his “mean” score. The greater the deviation, the less consistent the player is. The smaller the deviation, the more consistent the player is.

Variance

Variance is simply a metric that expresses the “deviation” of a data attribute from its mean. You don't have to memorize the formula for variance.

Remember, we are trying to find out how much variation there is in a data attribute. So, we can take the following simple steps:

  1. Calculate the mean of the data attribute.
  2. Find the difference between each data point and the mean. This tells us how much a data value varies from the mean. This difference can be either positive or negative (because the data value can be bigger or smaller than the mean). To keep negative and positive differences from cancelling each other, we square each difference.
  3. Take the sum of all the squared differences.
  4. Divide by the number of data values. This gives us the average squared variation, which is the variance.
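The four steps translate almost line by line into code. This is a minimal sketch using Player 2's scores as listed in the article; the function name `variance_by_steps` is illustrative, and `statistics.pvariance` (the population variance, which also divides by the number of values) is used only as a cross-check.

```python
import statistics

player2 = [30, 34, 46, 50, 50, 50, 54, 56, 60, 70]


def variance_by_steps(values):
    # Step 1: calculate the mean.
    mean = sum(values) / len(values)
    # Step 2: difference of each data point from the mean, squared.
    squared_diffs = [(x - mean) ** 2 for x in values]
    # Step 3: sum of all the squared differences.
    total = sum(squared_diffs)
    # Step 4: divide by the number of data values.
    return total / len(values)


variance_p2 = variance_by_steps(player2)

print(variance_p2, statistics.pvariance(player2))
```

Some textbooks divide by one less than the number of values when the data is a sample; the steps above divide by the full count.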

Standard Deviation

Standard deviation is simply the square root of the variance.

The square root is taken to undo the effect of the squaring done in Step 2 while calculating the variance, so the standard deviation is expressed in the same units as the data.
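Written out in symbols, the variance steps and the square root look as follows. The notation is assumed here rather than taken from the original post: $N$ is the number of data values, $x_i$ the individual values, $\mu$ their mean, $\sigma^2$ the variance and $\sigma$ the standard deviation. This is the population form, which divides by $N$ as in Step 4.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,
\qquad
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} \left(x_i - \mu\right)^2,
\qquad
\sigma = \sqrt{\sigma^2}
```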

Standard Deviation is a measure of the spread of data about the mean. It measures roughly how far the entries are from their average. The larger the SD, the more spread out the data is. Since it is the square root of an average of squared differences, it can't be negative.

In our example:

The standard deviation of Player 1 = 59.32

The standard deviation of Player 2 = 11.06
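A minimal sketch of how such figures can be computed. It assumes the scores as listed in the IQR section and the population form (dividing by the number of values), which reproduces Player 2's 11.06. On the listed Player 1 scores the result comes out near 59.24 rather than the 59.32 quoted above, so the original data table may have differed slightly from the listed values. The function name `population_std` is illustrative.

```python
import math
import statistics

player1 = [0, 0, 0, 0, 40, 50, 50, 50, 100, 200]
player2 = [30, 34, 46, 50, 50, 50, 54, 56, 60, 70]


def population_std(values):
    """Square root of the average squared deviation from the mean."""
    mean = sum(values) / len(values)
    variance = sum((x - mean) ** 2 for x in values) / len(values)
    return math.sqrt(variance)


std_p1 = population_std(player1)
std_p2 = population_std(player2)

# statistics.pstdev computes the same population standard deviation.
print(round(std_p1, 2), round(std_p2, 2), round(statistics.pstdev(player2), 2))
```

Whatever the exact figure for Player 1, the comparison is unchanged: his standard deviation is several times larger than Player 2's.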


Player 2 has a much smaller standard deviation than Player 1. Hence we can say that Player 2 is much more consistent. There is less variation in his performance. Though he is not scoring big runs, he is fairly consistent across all his matches.

Think of any variable in nature. If the standard deviation of this variable is large, you can say that its values are widely spread out and its behavior is not consistent. But if it has a low standard deviation, then you can assume that the behavior of this variable is fairly consistent.

Have you got a friend who is calm almost all the time, and another one who is very moody (sometimes angry, sometimes composed, sometimes indifferent)? You now know whose “standard deviation” is bigger.

If a variable has a constant value, what would be its standard deviation? You don't need the formula. It will be 0.


“Average” and “Deviations from Average” are 2 things that should be kept in mind while studying a data set.

Translated from: https://towardsdatascience.com/measuring-change-54bb44e26a14