# Awesome List of Datasets

## Dataset

* [Stanford AIMI Shared Datasets](https://stanfordaimi.azurewebsites.net/)
* [VisualData Discovery - Search Engine for Computer Vision Datasets](https://visualdata.io/discovery)

## Art Dataset

* [Art Data—Artnome](https://www.artnome.com/art-data)

## Dataset

* [The Open-Source Movement Comes to Medical Datasets](https://hai.stanford.edu/news/open-source-movement-comes-medical-datasets)
* [Mozilla Foundation - Mozilla Common Voice Adds 16 New Languages and 4,600 New Hours of Speech](https://foundation.mozilla.org/en/blog/mozilla-common-voice-adds-16-new-languages-and-4600-new-hours-of-speech/)

## Drug Dataset

* [DrugBank Online - Database for Drug and Drug Target Info](https://go.drugbank.com/)
  * [Muler](https://muler.pythonanywhere.com/), [PizzaMyHeart/muler: A search engine for drug information built with Flask.](https://github.com/PizzaMyHeart/muler)

## Dataset Zoo

* [Deeplite/deeplite-torch-zoo](https://github.com/Deeplite/deeplite-torch-zoo) : PyTorch

## Dataset

* [Recommended Data Repositories - Scientific Data](https://www.nature.com/sdata/policies/repositories)

## Dataset

* [CatMeows: A Publicly-Available Dataset of Cat Vocalizations - Zenodo](https://zenodo.org/record/4008297)
* [Home - BBC Programme Index](https://genome.ch.bbc.co.uk/)

## Dataset

* [Dataset Search](https://datasetsearch.research.google.com/)

## Dataset

* [google-research-datasets/wit: WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.](https://github.com/google-research-datasets/wit)
* [PUBLIC DATA: 2021 AI Index Report - Google Drive](https://drive.google.com/drive/folders/1YY9rj8bGSJDLgIq09FwmF2y1k_FazJUm)

## Dataset Tools

* [Scale AI: The Data Platform for AI](https://scale.com/) : High-quality training and validation data for AI applications
* [Aquarium - Data Management For ML](https://www.aquariumlearning.com/) : ML data management platform
* [Labelbox: The leading training data platform for data labeling](https://labelbox.com/) : Save time by creating and managing your training data, people, and processes in a single place

## Cell Tower Dataset

* [Cellular Tower and Signal Map](https://www.cellmapper.net/map)
* [OpenCelliD - Largest Open Database of Cell Towers & Geolocation - by Unwired Labs](https://www.opencellid.org/#zoom=11&lat=-7.7798&lon=109.1333)

## Twitter Dataset

* [Hedonometer](https://hedonometer.org/timeseries/en_all/?from=2019-11-02&to=2021-05-01)

## Dataset

* [EleutherAI](https://www.eleuther.ai/) : EleutherAI is a grassroots AI research group aimed at democratizing and open-sourcing AI research.
* [The Pile](https://pile.eleuther.ai/) : The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets combined together.

## Hugging Face Dataset API

* [huggingface/datasets-server: Integrate into your apps over 10,000 datasets via simple HTTP requests, with pre-processed responses and scalability built-in.](https://github.com/huggingface/datasets-server)
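
As a rough illustration of the "simple HTTP requests" idea, the sketch below builds a request URL for a page of dataset rows and fetches it as JSON. The base URL, the `/rows` endpoint, and the query parameter names are assumptions based on the public deployment of this service and may change; `rows_url`, `fetch_rows`, and the dataset name in the usage note are hypothetical names chosen for this example.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumption: base URL of the public datasets-server deployment.
BASE_URL = "https://datasets-server.huggingface.co"


def rows_url(dataset, config, split, offset=0, length=10):
    """Build a URL for the (assumed) /rows endpoint.

    The parameter names dataset, config, split, offset and length are
    assumptions; check the service documentation before relying on them.
    """
    query = urlencode(
        {
            "dataset": dataset,
            "config": config,
            "split": split,
            "offset": offset,
            "length": length,
        }
    )
    return f"{BASE_URL}/rows?{query}"


def fetch_rows(dataset, config, split, offset=0, length=10):
    """Fetch one page of rows and return the decoded JSON response."""
    url = rows_url(dataset, config, split, offset, length)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_rows("user/some-dataset", "default", "train", length=5)` would request the first five rows of a placeholder dataset. Keeping URL construction separate from the network call makes the request easy to inspect or log before it is sent.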

## Dataset

* [Harvard Dataverse](https://dataverse.harvard.edu/)

## Open Food Data

* [Open Food Facts - World](https://world.openfoodfacts.org/)