With the rise of AI, the importance of data is being recognized again, and a wide variety of data is now openly released and easy to use.
Open data is essential for machine learning, which learns from huge amounts of data, so you need to select and use open data that suits the AI you want to build.
On the other hand, open data alone cannot produce an AI with a competitive advantage. It is important to build a unique AI by combining macro-level open data with micro-level data that you collect yourself.
When building a service that uses open data, the keys are improving the UX (user experience) of the service and finding ways to obtain unique data, so don't rely too heavily on open data.
For this article, we classified 100 open datasets into 6 categories. Look for data that you can use to build your own service or AI.
- Top 100 open data datasets
- Dataset Catalog/Dataset Summary
- image
- movie
- audio
- text
- economy and finance
- in conclusion
Top 100 open data datasets
Dataset Catalog/Dataset Summary
- DATA GO JP
A data catalog site for open data published by the Japanese government. It guides users to public data that is available for secondary use and supports cross-searching of that data.
- National Informatics Research Data Repository
A joint-use dataset project operated by the Dataset Joint Research and Development Center (DSC) of the National Institute of Informatics (NII). It provides researchers with data from private companies and university researchers.
- Link Data
A site that supports converting and publishing various kinds of tabular data.
- Kaggle
Kaggle is a platform for predictive modeling and analytics competitions, with a variety of datasets available for download.
- Harvard Dataverse
A dataset repository published by Harvard University. Nearly 500 datasets that can be used for machine learning have been released.
- UC Irvine Machine Learning Repository
A dataset repository published by the University of California, Irvine. About 400 datasets have been published.
- Data.gov (US)
Open data from various US government agencies. You can download data across 14 topics.
- e-stat
This is a government statistics portal site where you can view Japanese statistics.
- r/datasets
A community for sharing datasets, hosted on reddit.com.
- Rakuten dataset
Datasets published by Rakuten Institute of Technology, including Rakuten product reviews and annotated text images.
- Google Dataset Search
Google’s dataset search service. Officially released in 2020.
- facebook research Dataset
Datasets published by Facebook Research.
- Registry of Open Data on AWS
Public datasets hosted on AWS, including datasets that can be used for training image classification and natural language processing models. The data can also be used together with AWS services.
- Microsoft Research Open Data
You can search for and download open data from Microsoft. The data can also be used together with Azure.
- arXivTimes datasets
A summary, organized by category, of datasets that can be used for machine learning.
- awesome-public-datasets
Another summary of datasets that can be used for machine learning.
- Core Data
A curated collection of high-quality, easy-to-use open datasets.
- Arizona State University
A collection of datasets compiled by Arizona State University.
- Network Repository
A collection of datasets that can be used for network analysis and similar tasks.
- Yahoo Labs Webscope
Datasets published by Yahoo! Labs.
image
- LabelMe
An annotated image dataset for semantic segmentation.
- MegaFace
A large face recognition dataset that mixes in distractor (noise) data, used in an open face recognition algorithm competition run by the University of Washington.
- MNIST
A dataset of handwritten digit images, often said to be the first dataset that machine learning beginners use.
- CIFAR-10 / CIFAR-100
CIFAR-10 consists of 60000 32×32 color images in 10 classes, with 6000 images per class, 50000 training images and 10000 test images. CIFAR-100 has 100 classes of 600 images each, with 500 training and 100 test images per class.
- Pascal VOC Dataset
Provides standardized image datasets for object class recognition, along with a common set of tools for accessing the datasets and annotations.
- Fashion-MNIST
A dataset of fashion images labeled with 10 categories, such as T-shirts, pants, dresses, and sneakers.
- Deep Fashion
A fashion image dataset consisting of over 800,000 images in 50 categories.
- Food 101
A dataset containing 101,000 food images labeled with 101 categories.
- Google Open Images V4
A Google-published dataset of up to 9 million images annotated with image-level labels, object bounding boxes, object segmentation masks, and visual relationships.
- ImageNet
A dataset of more than 14 million images. Searching for a keyword shows the classes that match it, which makes the data easy to obtain.
- CoPhIR
An image dataset of more than 100 million images collected from the photo-sharing site Flickr.
- Flickr Logos dataset
Annotated city image data and logo image data published by the National Technical University of Athens.
- Tiny Images Dataset
The TinyImages dataset consists of 79,302,017 images, each of which is a 32×32 color image.
- SUN dataset
The Scene UNderstanding (SUN) database, with 899 categories and 130,519 images for scene recognition and classification.
- COCO – Common Objects in Context
COCO is a large-scale object detection, segmentation, and captioning dataset.
- Daimler Urban Segmentation Dataset
A dataset recorded in urban traffic. It consists of 5000 rectified stereo image pairs with a resolution of 1024 × 440. 500 frames (every 10th frame of the sequence) have pixel-level semantic class annotations for 5 classes (ground, building, vehicle, pedestrian, sky).
- DAGM 2007
A competition dataset from a symposium in Germany. It is an image dataset for detecting defects, such as scratches, on the surfaces of industrial products.
- Natural Adversarial Examples
A dataset of examples deliberately chosen so that machine learning models make mistakes on them.
- rois-codh/kmnist
Unlike the handwritten digits of MNIST, this is a dataset of kuzushiji (cursive Japanese script) characters, covering hiragana and kanji.
- The Oxford-IIIT Pet Dataset
A 37-category pet image dataset containing about 200 images per class.
- Stanford Drone Dataset
An image dataset captured from a drone, published by Stanford University.
- CelebA Dataset
A large face attribute dataset containing over 200,000 celebrity images with 40 attribute annotations.
- Face Forensics
A dataset for detecting fake images of people generated by Deepfakes, Face2Face, etc.
- Indoor Scene Recognition
An indoor image dataset for indoor scene recognition models. It contains 67 indoor categories and a total of 15620 images.
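As a small illustration of working with the 32×32 color images described in the CIFAR-10 entry above, the sketch below unpacks one flat image row into pixels. It assumes the layout used by the Python version of CIFAR-10, where each image is a row of 3072 bytes (1024 red, then 1024 green, then 1024 blue, each plane in row-major order). The function name `row_to_image` and the synthetic row are hypothetical stand-ins; real rows would come from the downloaded batch files.

```python
def row_to_image(row, size=32):
    """Convert one flat row (R plane, G plane, B plane) into rows of (r, g, b) pixels."""
    plane = size * size
    if len(row) != 3 * plane:
        raise ValueError(f"expected {3 * plane} values, got {len(row)}")
    red = row[:plane]
    green = row[plane:2 * plane]
    blue = row[2 * plane:]
    return [
        [(red[y * size + x], green[y * size + x], blue[y * size + x])
         for x in range(size)]
        for y in range(size)
    ]


# Synthetic stand-in for one image row (not real CIFAR data).
fake_row = bytes(i % 256 for i in range(1024)) + bytes([20] * 1024) + bytes([30] * 1024)
image = row_to_image(fake_row)
print(len(image), len(image[0]), image[1][2])
```

Keeping the conversion in plain Python makes the channel layout explicit; in practice an array library would do the same reshaping far faster.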
movie
- YouTube-8M Dataset
A dataset published by the Google research team of 8 million YouTube videos tagged with 4800 Knowledge Graph entities. It also includes human-verified labels on about 237K segments in 1000 classes.
- YouTube-Bounding Boxes Dataset
A large dataset of videos labeled with bounding boxes. It consists of approximately 380,000 video segments of 15-20 seconds each, extracted from 240,000 different publicly available YouTube videos and automatically selected to show objects in natural settings without editing or post-processing.
- Kinetics
A video dataset published by DeepMind in which approximately 650,000 videos are labeled with human actions, including human-object interactions such as playing musical instruments and human-human interactions such as handshakes.
- UCF101-Action Recognition Data Set
Provided by the University of Central Florida, UCF101 is an action recognition video dataset with 101 action categories collected from YouTube. It extends the UCF50 dataset, which has 50 action categories. The videos in each of the 101 action categories are divided into 25 groups, and each group contains 4-7 videos of the action.
- 20BN-JESTER DATASET V1
A video dataset with hand gesture labels published by TwentyBN. About 150,000 videos are labeled with 27 hand gestures.
- Moments in Time Dataset
A joint research project by MIT and IBM: a video dataset in which action labels are attached to 3-second videos.
- EPIC KITCHENS
A video dataset in which kitchen tasks are labeled with actions. Published by research teams at the University of Bristol, the University of Toronto, and the University of Catania.
- STAIR Actions
A video captioning dataset. It contains 399,233 Japanese captions describing the content of 79,822 videos.
- BDD100K: A Large-scale Diverse Driving Video Database
A driving video dataset published by the Berkeley Artificial Intelligence Research lab (BAIR) at the University of California, Berkeley. The frame at the 10th second of each video is labeled with road object bounding boxes, drivable areas, and lane markings.
- Atomic Visual Actions (AVA)
A dataset published by Google for recognizing human actions. 80 types of action labels, such as walking and flying, are attached to 57,000 videos.
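The UCF101 entry above notes that each action's videos are divided into 25 groups. Clips within a group may share background or performers, so a common precaution is to keep a whole group on one side of a train/test split. The sketch below does that under the assumption that file names follow the pattern `v_<Action>_g<group>_c<clip>.avi`; the helper `split_by_group`, the sample file list, and the choice of held-out group are all hypothetical.

```python
import re

# Assumed file name pattern: v_<Action>_g<group>_c<clip>.avi
NAME_RE = re.compile(r"^v_(?P<action>.+)_g(?P<group>\d+)_c(?P<clip>\d+)\.avi$")


def split_by_group(filenames, test_groups):
    """Send every clip whose group number is in test_groups to the test set."""
    train, test = [], []
    for name in filenames:
        match = NAME_RE.match(name)
        if match is None:
            raise ValueError(f"unexpected file name: {name}")
        target = test if int(match["group"]) in test_groups else train
        target.append(name)
    return train, test


# Made-up file list for illustration only.
files = [
    "v_ApplyEyeMakeup_g01_c01.avi",
    "v_ApplyEyeMakeup_g01_c02.avi",
    "v_ApplyEyeMakeup_g02_c01.avi",
    "v_Archery_g01_c01.avi",
    "v_Archery_g03_c01.avi",
]
train_files, test_files = split_by_group(files, test_groups={1})
print(len(train_files), len(test_files))
```

Splitting on the group number rather than on individual clips avoids near-duplicate clips appearing in both sets, which would otherwise inflate test accuracy.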
audio
- The NES Music Database
A dataset for building automatic music composition systems, published by a postdoctoral fellow at Stanford University. It contains a total of 5278 songs from 397 titles.
- The Largest MIDI Collection on the Internet
A large dataset of MIDI data published on reddit.
- NSynth Dataset
A data set containing about 300,000 single notes from 1,006 instruments, published by the open source research project Magenta.
- AudioSet
A dataset released by Google in which 10-second sound clips are labeled with categories such as human voices, animal sounds, and musical instruments.
- ToyADMOS-dataset
A dataset of machine operating sounds. It contains approximately 540 hours of normal machine operating sounds and over 12,000 samples of anomalous sounds, collected with four microphones at a sampling rate of 48 kHz.
- The Flickr Audio Caption Corpus
The Flickr 8k Audio Captions Corpus contains 40,000 audio captions on 8,000 natural images and is a dataset for unsupervised audio pattern discovery.
- The Spoken Wikipedia Corpora
A corpus of aligned English, German, and Dutch Wikipedia article audio files published by the University of Hamburg, Germany. Contains hundreds of hours of aligned audio. Annotations can be mapped to the original HTML.
- Speech Commands Dataset
A speech dataset for speech recognition with TensorFlow, published by Google. It contains 65,000 one-second utterances of 30 short words.
- MUSDB18
A dataset for music source separation. It provides 150 music tracks of various genres, along with isolated stems such as drums, bass, and vocals.
- Mozilla Common Voice
Approximately 1,400 hours of voice data in 18 languages from 42,000 contributors has been released by Common Voice, the voice dataset collection project run by Mozilla.
- Voice Actor Statistics Society of Japan
A speech corpus published by the Voice Actor Statistics Society of Japan. It consists of independently constructed phoneme-balanced sentences and recordings of them read aloud by three professional female voice actors in three styles.
- Speech Resource Consortium
A page that lists various speech corpora.
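Several entries above describe audio by clip length and sampling rate, for example the one-second utterances in the Speech Commands Dataset. A quick sanity check before training is to confirm clip durations. The sketch below does this with Python's standard `wave` module, assuming the clips are uncompressed WAV files. The 16 kHz rate and the silent in-memory clip are illustrative assumptions, not values taken from any of the datasets.

```python
import io
import wave


def wav_duration_seconds(source):
    """Return the duration in seconds of a WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_silent_wav(seconds, sample_rate):
    """Build an in-memory 16-bit mono WAV of silence, used only as stand-in data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    buffer.seek(0)
    return buffer


clip = make_silent_wav(seconds=1, sample_rate=16000)
duration = wav_duration_seconds(clip)
print(duration)
```

With real data, `wav_duration_seconds` would be called on each file path, and clips that deviate from the expected length could be padded, trimmed, or excluded.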
text
- Resources for Natural Language Processing
Tools and datasets for natural language processing published by the Kurohashi, Kawahara, and Murawaki Laboratory at Kyoto University.
- Aozora Bunko
Publishes the text of works whose copyright has expired and works that the author has permitted to be released.
- Aozora Bunko Morphological Analysis Data Collection
Provides CSV data produced by morphological analysis of the works in Aozora Bunko.
- Japanese translation data
Parallel corpora, bilingual dictionaries, and other resources that can be used to build machine translation systems.
- SNOW T15: Easy Japanese Corpus
A parallel corpus in which 50,000 sentences are rewritten in easy Japanese (using simple Japanese vocabulary). It also includes English translations, so each sentence is aligned across English, Japanese, and easy Japanese.
- Twitter Japanese Sentiment Analysis Dataset
Reputation (sentiment) information in tweets is analyzed by crowdsourcing, and the results are published.
- SNOW D18 Japanese Emotion Dictionary
A dictionary of Japanese emotional expressions. About 2,000 expressions are labeled with 48 emotion categories, such as fun and familiarity, respect/preciousness, excitement, joy, and sadness.
- livedoor news corpus
A corpus of livedoor news articles that are covered by a Creative Commons license.
- Cookpad dataset
Data on 1.72 million recipes and menus posted on Cookpad.
- DBpedia English
A community project that extracts information from Wikipedia and publishes it as LOD (Linked Open Data).
- Web data: Amazon reviews
About 35 million Amazon reviews.
- niconico dataset
Metadata of approximately 16.7 million videos posted from the start of Nico Nico Douga to November 8, 2018, along with approximately 3.8 billion comment data.
- Wikipedia Links data
The full text of Wikipedia is published as a dataset.
- Common Crawl
Crawl data for over 5 billion web pages.
- Web Data Commons
Structured data extracted from the Common Crawl.
- bAbI
A set of datasets from Facebook AI Research for tasks such as question answering, dialogue, and language modeling.
- PAWS
A paraphrase identification dataset built around sentence pairs whose meaning changes when the word order or syntactic structure differs only slightly.
- Tokyo Metropolitan University Natural Language Processing Laboratory (Komachi Lab)
Corpora, dictionaries, and evaluation datasets provided by the Tokyo Metropolitan University Natural Language Processing Laboratory (Komachi Lab).
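The PAWS entry above concerns sentences whose meaning changes with word order. The sketch below shows why such pairs are hard for models that ignore order: two sentences can contain exactly the same words and still mean different things. The sentence pair and the helper names `bag_of_words` and `word_overlap` are illustrative assumptions, not part of the dataset's tooling.

```python
from collections import Counter


def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count words; word order is discarded."""
    return Counter(sentence.lower().split())


def word_overlap(a, b):
    """Shared word count divided by combined word count (1.0 means identical bags)."""
    bag_a, bag_b = bag_of_words(a), bag_of_words(b)
    shared = sum((bag_a & bag_b).values())  # per-word minimum counts
    total = sum((bag_a | bag_b).values())   # per-word maximum counts
    return shared / total


s1 = "flights from New York to Florida"
s2 = "flights from Florida to New York"

# The two sentences use the same words, so a bag-of-words measure cannot
# tell them apart even though the direction of travel is reversed.
print(word_overlap(s1, s2))
```

A model evaluated on this kind of pair has to use word order or syntax, which is the behavior the dataset is meant to test.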
economy and finance
- Quandl
A wide variety of financial and economic datasets can be retrieved, and many articles describe how to acquire the data in Python.
- Bitcoin Historical Data
Bitcoin data at 1-minute intervals from January 2012 to August 2019 published on Kaggle.
- World Bank Open Data
You can easily search and download World Bank data.
- US macroeconomic data
It publishes data on employment, economic output, and other macroeconomic variables.
- US Stock Data
US stock market data since 2009.
- finance-vix
A CBOE Volatility Index (VIX) time series dataset containing daily open, close, high, and low values. The VIX measures expected volatility implied by S&P 500 index option prices.
- Dow Jones Index Data Set
Published by the University of California, Irvine. It provides weekly stock price data for the Dow Jones Industrial Average.
- Econ Data
Economic time series data published by the University of Maryland. It publishes thousands of economic time series produced by numerous US government agencies and distributed in a variety of formats and media.
- Ministry of Finance Government Bond Interest Rate Information
Japan’s government bond interest rate information (since 1974) is publicly available.
- Asset Macro
It publishes over 20,000 macroeconomic indicators for over 120 countries.
- Eurostat Comext
Eurostat compiles materials and statistics related to the EU, and trade data can be obtained from Comext.
- Nikkei Average Profile
Published by Nihon Keizai Shimbun. Data for the Nikkei Stock Average, Nikkei Asian Index, JPX Nikkei Index 400, and other indexes can be obtained.
- IMF DATA
It publishes a series of time-series data on IMF lending, exchange rates, and other economic and financial indicators.
- Google Finance
Provides securities information such as stock prices and exchange rates. The data can be retrieved with a Google Sheets function.
- Financial Data Finder
Provides various financial data such as stocks, exchange rates, and bonds. Published by Ohio State University.
- EDINET
An electronic disclosure system for disclosure documents based on the Financial Instruments and Exchange Act. It links the host computer used by the Cabinet Office, the computers used by submitting companies, and the computers of the financial instruments exchanges (and the Financial Instruments Firms Association), and makes available securities reports, securities registration statements, large-volume holding reports, and other disclosure documents.
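Fine-grained price data, such as the 1-minute Bitcoin data listed above, is often aggregated into coarser bars before modeling. The sketch below rolls minute bars up into hourly open/high/low/close bars using only the standard library. It assumes each record is a tuple of Unix timestamp, open, high, low, and close, sorted by time. That record layout, the function name `aggregate_hourly`, and the sample prices are hypothetical and not taken from the dataset.

```python
from datetime import datetime, timezone


def aggregate_hourly(bars):
    """Aggregate (unix_ts, open, high, low, close) minute bars into hourly bars.

    Bars are assumed to be sorted by timestamp.
    """
    hourly = {}
    for ts, open_, high, low, close in bars:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).replace(
            minute=0, second=0, microsecond=0
        )
        if hour not in hourly:
            # First bar of the hour sets the open and the initial extremes.
            hourly[hour] = {"open": open_, "high": high, "low": low, "close": close}
        else:
            bar = hourly[hour]
            bar["high"] = max(bar["high"], high)
            bar["low"] = min(bar["low"], low)
            bar["close"] = close  # last bar seen in the hour sets the close
    return hourly


# Synthetic minute bars for illustration only (not real prices).
minute_bars = [
    (1325376000, 4.0, 4.5, 3.9, 4.2),  # 2012-01-01 00:00 UTC
    (1325376060, 4.2, 4.8, 4.1, 4.6),  # 2012-01-01 00:01 UTC
    (1325379600, 4.6, 4.7, 4.0, 4.1),  # 2012-01-01 01:00 UTC
]
hourly = aggregate_hourly(minute_bars)
for hour, bar in hourly.items():
    print(hour.isoformat(), bar)
```

The same grouping idea applies to daily or weekly bars by changing how the timestamp is truncated.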
In conclusion
New datasets will continue to be released, so I want to keep building the knowledge and technical skills needed to use data well.
For those who are learning machine learning in particular, actually building a model with open data is a good idea.