
Assessing and Comparing Classifier Performance with ROC Curves

The most commonly reported measure of classifier performance is accuracy: the percent of correct classifications obtained.

This metric has the advantage of being easy to understand and makes comparison of the performance of different classifiers trivial, but it ignores many of the factors which should be taken into account when honestly assessing the performance of a classifier.

What Is Meant By Classifier Performance?

Classifier performance is more than just a count of correct classifications.

Consider, for interest, the problem of screening for a relatively rare condition such as cervical cancer, which has a prevalence of about 10%. If a lazy Pap smear screener were to classify every slide they see as “normal”, they would achieve 90% accuracy. Very impressive! But that figure completely ignores the fact that the 10% of women who do have the disease have not been diagnosed at all.

Some Performance Metrics

In a previous blog post we discussed some of the other performance metrics which can be applied to the assessment of a classifier. To review:

Most classifiers produce a score, which is then thresholded to decide the classification. If a classifier produces a score between 0.0 (definitely negative) and 1.0 (definitely positive), it is common to consider anything over 0.5 as positive.

However, any threshold applied to a dataset (in which PP is the positive population and NP is the negative population) is going to produce true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) (Figure 1). We need a method which will take into account all of these numbers.


Figure 1. Overlapping datasets will always generate false positives and negatives as well as true positives and negatives
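To make that bookkeeping concrete, here is a minimal Python sketch (the scores and labels are invented for illustration) of how a single threshold splits a set of scored cases into the four counts:

    # Hypothetical classifier scores (0.0 = definitely negative, 1.0 = definitely positive)
    # and the true class for each case (1 = positive, 0 = negative).
    scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.20, 0.90, 0.55]
    labels = [0,    0,    1,    1,    1,    0,    1,    0]

    threshold = 0.5   # anything over 0.5 is called positive

    tp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    tn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)

    print(tp, fp, tn, fn)   # the four counts always add up to the total number of cases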

Once you have numbers for all of these measures, some useful metrics can be calculated.

  • Accuracy = (1 – Error) = (TP + TN)/(PP + NP) = Pr(C), the probability of a correct classification.
  • Sensitivity = TP/(TP + FN) = TP/PP = the ability of the test to detect disease in a population of diseased individuals.
  • Specificity = TN/(TN + FP) = TN / NP = the ability of the test to correctly rule out the disease in a disease-free population.

Let’s calculate these metrics for some reasonable real-world numbers. If we have 100,000 patients, of which 200 (0.2%) actually have cancer, we might see the following test results (Table 1):

                      Cancer present    Cancer absent      Total
    Test positive          160 (TP)      29,940 (FP)      30,100
    Test negative           40 (FN)      69,860 (TN)      69,900
    Total                       200           99,800     100,000

Table 1. Illustration of diagnostic test performance for “reasonable” values for Pap smear screening

For this data:

  • Sensitivity = TP/(TP + FN) = 160 / (160 + 40) = 80.0%
  • Specificity = TN/(TN + FP) = 69,860 / (69,860 + 29,940) = 70.0%

In other words, our test will correctly identify 80% of people with the disease, but 30% of healthy people will incorrectly test positive. By only considering the sensitivity (or accuracy) of the test, potentially important information is lost.
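As a quick check on this arithmetic, the same metrics can be computed directly from the counts behind Table 1 (a small Python sketch using only the numbers quoted above):

    # Counts from Table 1: 100,000 patients, 200 of whom actually have cancer.
    tp, fn = 160, 40            # diseased patients: correctly detected vs missed
    fp, tn = 29_940, 69_860     # healthy patients: falsely flagged vs correctly cleared

    accuracy    = (tp + tn) / (tp + fp + tn + fn)   # 0.7002, about 70%
    sensitivity = tp / (tp + fn)                    # 0.80
    specificity = tn / (tn + fp)                    # 0.70

    print(f"accuracy={accuracy:.1%}, sensitivity={sensitivity:.1%}, specificity={specificity:.1%}")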

By considering our wrong results as well as our correct ones we get much greater insight into the performance of the classifier.

One way to overcome the problem of having to choose a cutoff is to start with a threshold of 0.0, so that every case is classified as positive. We correctly classify all of the positive cases, and incorrectly classify all of the negative cases. We then move the threshold over every value between 0.0 and 1.0, progressively decreasing the number of false positives and increasing the number of true negatives.

The true positive rate (sensitivity) can then be plotted against the false positive rate (1 – specificity) for each threshold used. The resulting graph is called a Receiver Operating Characteristic (ROC) curve (Figure 2). ROC curves were developed for signal detection in radar returns in the 1950s, and have since been applied to a wide range of problems.


Figure 2. Examples of ROC curves
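As a rough sketch of that construction in plain Python (reusing hypothetical scores and labels like those above, and sweeping over the observed score values rather than every value between 0.0 and 1.0):

    def roc_points(scores, labels):
        """Return (FPR, TPR) pairs, one for each candidate threshold."""
        pos = sum(labels)            # PP: number of truly positive cases
        neg = len(labels) - pos      # NP: number of truly negative cases
        points = []
        for t in sorted(set(scores)) + [max(scores) + 1.0]:
            tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
            fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
            points.append((fp / neg, tp / pos))   # (1 - specificity, sensitivity)
        return sorted(points)

    # Plotting these points with FPR on the X axis and TPR on the Y axis draws the ROC curve.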

For a perfect classifier the ROC curve rises straight up the Y axis to the top-left corner and then runs horizontally along the top of the plot. A classifier with no discriminating power sits on the diagonal, whilst most classifiers fall somewhere in between.

ROC analysis provides tools to select possibly optimal models and to discard suboptimal ones, independently of (and prior to specifying) the cost context or the class distribution.

Using ROC Curves

Threshold Selection

It is immediately apparent that a ROC curve can be used to select a threshold for a classifier which maximises the true positives, while minimising the false positives.

However, different types of problems have different optimal classifier thresholds. For a cancer screening test, for example, we may be prepared to put up with a relatively high false positive rate in order to get a high true positive rate, since it is most important to identify possible cancer sufferers.

For a follow-up test after treatment, however, a different threshold might be more desirable, since we want to minimise false negatives: we don’t want to tell a patient they’re clear if this is not actually the case.
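One way to make that choice explicit is to search the ROC operating points under an application-specific constraint. The sketch below is hypothetical: it assumes a list of (fpr, tpr, threshold) triples, i.e. a version of the sweep shown earlier that carries the threshold value along with each point.

    def pick_threshold(points, max_fpr=1.0, min_tpr=0.0):
        """From (fpr, tpr, threshold) triples, pick a threshold subject to an upper
        limit on the false positive rate and/or a lower limit on the true positive
        rate (i.e. a cap on the false negatives we are prepared to tolerate)."""
        ok = [p for p in points if p[0] <= max_fpr and p[1] >= min_tpr]
        # Among acceptable operating points, prefer the highest sensitivity,
        # breaking ties in favour of the lowest false positive rate.
        best = max(ok, key=lambda p: (p[1], -p[0]))
        return best[2]

    # A screening programme might accept a 30% false positive rate to catch more cancers:
    #   pick_threshold(points, max_fpr=0.30)
    # A post-treatment follow-up test might instead insist on, say, 99% sensitivity:
    #   pick_threshold(points, min_tpr=0.99)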

Performance Assessment

ROC curves also give us the ability to assess the performance of the classifier over its entire operating range. The most widely-used measure is the area under the curve (AUC). As you can see from Figure 2, the AUC for a classifier with no power, essentially random guessing, is 0.5, because the curve follows the diagonal. The AUC for that mythical being, the perfect classifier, is 1.0. Most classifiers have AUCs that fall somewhere between these two values.

An AUC of less than 0.5 might indicate that something interesting is happening. A very low AUC might indicate that the problem has been set up wrongly: the classifier is finding a relationship in the data which is, essentially, the opposite of that expected. In such a case, inspection of the entire ROC curve might give some clues as to what is going on: have the positives and negatives been mislabelled?
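For completeness, here is a small sketch of how the AUC can be computed from the swept (FPR, TPR) points using the trapezoidal rule, together with the usual quick check for a “flipped” classifier:

    def auc(points):
        """Trapezoidal area under a list of (fpr, tpr) points, sorted by FPR."""
        pts = sorted(points)
        area = 0.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
            area += (x1 - x0) * (y0 + y1) / 2.0
        return area

    # If the AUC comes out well below 0.5, the scores are behaving "backwards":
    # re-running the sweep with the scores negated (or the labels swapped) mirrors
    # the curve about the diagonal and gives an AUC above 0.5.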

Classifier Comparison

The AUC can be used to compare the performance of two or more classifiers. A single threshold can be selected and the classifiers’ performance at that point compared, or the overall performance can be compared by considering the AUC.

Most published reports compare AUCs in absolute terms: “Classifier 1 has an AUC of 0.85, and classifier 2 has an AUC of 0.79, so classifier 1 is clearly better.” It is, however, possible to calculate whether differences in AUC are statistically significant. For full details, see the Hanley & McNeil (1982) paper listed below.
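As a sketch of what such a calculation involves, the following implements the AUC standard error given by Hanley & McNeil (1982) and the corresponding z statistic for two AUCs estimated on independent samples. The function names are invented for illustration, and note that AUCs measured on the same set of cases are correlated and need an additional correction term.

    from math import sqrt

    def hanley_mcneil_se(auc, n_pos, n_neg):
        """Standard error of an AUC estimate (Hanley & McNeil, 1982)."""
        q1 = auc / (2 - auc)
        q2 = 2 * auc ** 2 / (1 + auc)
        return sqrt((auc * (1 - auc)
                     + (n_pos - 1) * (q1 - auc ** 2)
                     + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg))

    def auc_z_statistic(auc1, se1, auc2, se2):
        """z statistic for the difference of two AUCs from independent samples."""
        return (auc1 - auc2) / sqrt(se1 ** 2 + se2 ** 2)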

ROC Curve Analysis Tutorials
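Most statistics packages and machine learning libraries provide ROC tools out of the box. As one minimal, self-contained sketch (assuming scikit-learn and a synthetic dataset, purely for illustration), an end-to-end analysis looks something like this:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    # Synthetic two-class data standing in for a real screening dataset.
    X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1],
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

    # Any classifier that outputs a score (here, a predicted probability) will do.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]

    fpr, tpr, thresholds = roc_curve(y_test, scores)   # the points of the ROC curve
    print("AUC =", roc_auc_score(y_test, scores))

The fpr, tpr and thresholds arrays returned by roc_curve are exactly the quantities discussed above, so threshold selection and curve plotting follow directly from them.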

When To Use ROC Curve Analysis

In this post I have used a biomedical example, and ROC curves are widely used in the biomedical sciences. The technique is, however, applicable to any classifier producing a score for each case, rather than a binary decision.

Neural networks and many statistical algorithms are examples of appropriate classifiers, while approaches such as decision trees are less suited. Problems which have only two possible outcomes (such as the cancer / no cancer example used here) are most suited to this approach.

Any sort of data which can be fed into appropriate classifiers can be subjected to ROC curve analysis.

Further Reading

  • Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143(1): 29–36.
