Classic Datasets in Artificial Intelligence

MNIST : most commonly used sanity check. Dataset of 28x28, centered, B&W handwritten digits. It is an easy task - just because something works on MNIST does not mean it works elsewhere.

CIFAR 10 & CIFAR 100 : 32x32 color images. Not commonly used anymore, though once again, can be an interesting sanity check.

ImageNet : the de-facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1000 category WordNet hierarchy from ImageNet.

LSUN : Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.

PASCAL VOC : Generic image segmentation / classification - not terribly useful for building real-world image annotation, but great for baselines.

SVHN : House numbers from Google Street View. Think of this as recurrent MNIST in the wild.

MS COCO : Generic image understanding / captioning, with an associated competition.

Visual Genome : Very detailed visual knowledge base with deep captioning of ~ 100K images.

Labeled Faces in the Wild : Cropped faces (using Viola-Jones) that have been labeled with a name identifier. A subset of the people present have two images in the dataset - it's quite common for people to train face matching systems here.

Natural Language

Text Classification Datasets (Google Drive Link) from Zhang et al., 2015 : An extensive set of eight datasets for text classification. These are the most commonly reported baselines for new text classification models. Sample size of 120K to 3.6M, ranging from binary to 14 class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG.

WikiText: large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.

Question Pairs: first dataset release from Quora containing duplicate / semantic similarity labels.

SQuAD : The Stanford Question Answering Dataset - broadly useful question answering and reading comprehension dataset, where every answer to a question is posed as a span, or segment of text.
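To make the "answer as a span" idea concrete, here is a minimal sketch. The `SpanAnswer` class, the `find_span` helper, and the toy passage are all invented for illustration and are not the dataset's actual file schema; the sketch only assumes that an answer can be stored as start and end character offsets into the context.

```python
from dataclasses import dataclass


@dataclass
class SpanAnswer:
    """An answer stored as a character span inside the context passage."""
    start: int  # index of the first character of the answer
    end: int    # index one past the last character of the answer

    def text(self, context: str) -> str:
        """Recover the answer string by slicing the context."""
        return context[self.start:self.end]


def find_span(context: str, answer_text: str) -> SpanAnswer:
    """Locate answer_text in context and return it as a span.

    Raises ValueError if the answer does not occur verbatim in the context.
    """
    start = context.index(answer_text)
    return SpanAnswer(start=start, end=start + len(answer_text))


# Toy passage and question, made up for illustration (not taken from SQuAD).
context = "The library opens at nine and closes at five."
question = "When does the library close?"
span = find_span(context, "five")
print(question, "->", span.text(context))
```

Because the answer is only a pair of offsets, a model for this kind of task can be trained to predict where the answer starts and ends rather than to generate free text.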

CMU Q / A Dataset: Manually-generated factoid question / answer pairs with difficulty ratings from Wikipedia articles.

Maluuba Datasets : Sophisticated, human-generated datasets for stateful natural language understanding research.

Billion Words: large, general purpose language modeling dataset. Often used to train distributed word representations such as word2vec or GloVe.

Common Crawl : Petabyte-scale crawl of the web - most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset for its crawl of the WWW.

bAbi : synthetic reading comprehension and question answering dataset from Facebook AI Research (FAIR).

The Children's Book Test : Baseline of (question + context, answer) pairs extracted from children's books available through Project Gutenberg. Useful for question answering, reading comprehension, and factoid look-up.

Stanford Sentiment Treebank : standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence's parse tree.
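As an illustration of what "annotations at every node of the parse tree" means, the sketch below parses a bracketed tree in which every node carries its own label and then lists the phrase under each node. The bracketed string is a hypothetical example, and the 0 (very negative) to 4 (very positive) label scale is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: int                      # sentiment label attached to this node
    word: Optional[str] = None      # set only on leaf nodes
    children: List["Node"] = field(default_factory=list)

    def phrase(self) -> str:
        """Return the text covered by this node."""
        if self.word is not None:
            return self.word
        return " ".join(child.phrase() for child in self.children)


def parse(tokens: List[str], pos: int = 0):
    """Parse one '( label ... )' subtree starting at tokens[pos]."""
    assert tokens[pos] == "("
    node = Node(label=int(tokens[pos + 1]))
    pos += 2
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            node.children.append(child)
        else:
            node.word = tokens[pos]
            pos += 1
    return node, pos + 1  # step past the closing ')'


def labeled_phrases(node: Node):
    """Yield (phrase, label) for every node in the tree, root first."""
    yield node.phrase(), node.label
    for child in node.children:
        yield from labeled_phrases(child)


# Hypothetical tree, invented for illustration; labels assume a 0-4 scale.
toy = "(3 (2 The) (3 (2 film) (3 (2 is) (4 great))))"
tokens = toy.replace("(", " ( ").replace(")", " ) ").split()
tree, _ = parse(tokens)
for phrase, label in labeled_phrases(tree):
    print(label, phrase)
```

A single sentence therefore yields one training example per node, not just one per sentence, which is what makes the annotations fine-grained.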

20 Newsgroups : one of the classic datasets for text classification, usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm.

Reuters : older, purely classification based dataset with text from the newswire. Commonly used in tutorials.

IMDB : an older, relatively small dataset for binary sentiment classification. Fallen out of favor for benchmarks in the literature, replaced by larger datasets.

UCI's Spambase : Older, classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.

Speech

Most speech recognition datasets are proprietary - the data holds a lot of value for the company that curates it. Most datasets available in the field are quite old.

2000 HUB5 English: English-only speech data used most recently in the Deep Speech paper from Baidu.

LibriSpeech : Audio book dataset of text and speech. Nearly 500 hours of clean speech from various audio books read by multiple speakers, organized by chapters of the book, containing both the text and the speech.

VoxForge : Clean speech dataset of accented English, useful for instances in which you expect to need robustness to different accents or intonations.

TIMIT: English-only speech recognition dataset.

CHIME : Noisy speech recognition challenge dataset. Dataset contains real, simulated and clean voice recordings: real being actual recordings of 4 speakers in nearly 9000 recordings over 4 noisy locations, simulated being generated by combining multiple environments with speech utterances, and clean being non-noisy recordings.
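The idea of simulated noisy data can be sketched as mixing a clean utterance with a background recording at a chosen signal-to-noise ratio. This is a generic illustration, not the challenge's actual simulation pipeline; the `rms` and `mix_at_snr` helpers and the toy waveforms are assumptions made for this example.

```python
import math


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise ratio equals snr_db, then add it."""
    gain = rms(speech) / (rms(noise) * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy waveforms, made up for illustration: a sine "utterance" and a
# square-wave "background" standing in for a noisy environment.
speech = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [0.3 if (t // 7) % 2 == 0 else -0.3 for t in range(100)]
mixed = mix_at_snr(speech, noise, snr_db=5.0)
```

Repeating the mix over several background recordings and SNR levels turns one clean utterance into many noisy training examples.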

TED-LIUM : Audio transcription of TED talks. 1495 TED talk audio recordings along with full text transcriptions of those recordings.

Recommendation and ranking systems

Netflix Challenge : first major Kaggle style data challenge. Only available unofficially, as privacy issues arose.

MovieLens : various sizes of movie review data - commonly used for collaborative filtering baselines.
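A collaborative filtering baseline of the kind commonly run on rating data can be sketched as follows. This is one simple user-based variant chosen for illustration, not a reference method for the dataset; the `ratings` table, user names and movie ids are invented.

```python
from math import sqrt

# Toy user -> {movie: rating} table, made up for illustration.
ratings = {
    "ann": {"A": 5.0, "B": 4.0, "C": 1.0},
    "bob": {"A": 4.0, "B": 5.0, "D": 2.0},
    "cat": {"A": 1.0, "C": 5.0, "D": 4.0},
}


def cosine(u: dict, v: dict) -> float:
    """Cosine similarity computed over the movies both users rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[m] * v[m] for m in shared)
    norm_u = sqrt(sum(u[m] ** 2 for m in shared))
    norm_v = sqrt(sum(v[m] ** 2 for m in shared))
    return dot / (norm_u * norm_v)


def predict(user: str, movie: str) -> float:
    """Similarity-weighted average of other users' ratings for the movie."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or movie not in their:
            continue
        sim = cosine(ratings[user], their)
        num += sim * their[movie]
        den += sim
    if den == 0.0:
        raise ValueError("no similar user has rated this movie")
    return num / den


# ann rates like bob, so the prediction leans toward bob's rating of D.
print(round(predict("ann", "D"), 2))
```

On real data the same idea is usually evaluated by holding out some ratings and measuring prediction error on them.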

Million Song Dataset : large, metadata-rich, open source dataset on Kaggle that can be good for people experimenting with hybrid recommendations systems.

Last.fm : music recommendation dataset with access to underlying social network and other metadata that can be useful for hybrid systems.

Networks and Graphs

Amazon Co-Purchasing and Amazon Reviews : crawled data from the "users who bought this also bought ..." section of Amazon, as well as Amazon review data for related products. Good for experimenting with recommender systems in networks.
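As a rough sketch of how co-purchase data can drive recommendations on a graph, the example below stores "also bought" links as a directed adjacency map and ranks products that are two hops away. The edge list and product ids are hypothetical, and the two-hop counting rule is just one simple heuristic chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical directed edges: (product, product listed under "also bought").
edges = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p2", "p4"), ("p3", "p4")]

also_bought = defaultdict(set)
for src, dst in edges:
    also_bought[src].add(dst)


def recommend(product: str, k: int = 3):
    """Rank products reachable in two hops that are not direct neighbours."""
    direct = also_bought.get(product, set())
    counts = Counter()
    for mid in direct:
        for cand in also_bought.get(mid, set()):
            if cand != product and cand not in direct:
                counts[cand] += 1
    return [item for item, _ in counts.most_common(k)]


# p4 is reachable from p1 through both p2 and p3, so it is ranked first.
print(recommend("p1"))
```

Review text or ratings for the same products could then be used to re-rank the candidates, which is where the review data becomes useful alongside the graph.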

Friendster Social Network Dataset : Before their pivot as a gaming website, Friendster released anonymized data in the form of friends lists for 103,750,348 users.

Geospatial data

OpenStreetMap : Vector data for the entire planet under a free license. It includes (an older version of) the US Census Bureau's TIGER data.

Landsat8 : Satellite images of the entire Earth surface, updated every several weeks.

NEXRAD : Doppler radar scans of atmospheric conditions in the US.
