# Awesome List of Datasets

## Dataset

* [Stanford AIMI Shared Datasets](https://stanfordaimi.azurewebsites.net/)
* [VisualData Discovery - Search Engine for Computer Vision Datasets](https://visualdata.io/discovery)

## Art Dataset

* [Art Data—Artnome](https://www.artnome.com/art-data)

## Dataset

* [The Open-Source Movement Comes to Medical Datasets](https://hai.stanford.edu/news/open-source-movement-comes-medical-datasets)
* [Mozilla Foundation - Mozilla Common Voice Adds 16 New Languages and 4,600 New Hours of Speech](https://foundation.mozilla.org/en/blog/mozilla-common-voice-adds-16-new-languages-and-4600-new-hours-of-speech/)

## Drug Dataset

* [DrugBank Online - Database for Drug and Drug Target Info](https://go.drugbank.com/)
  * [Muler](https://muler.pythonanywhere.com/), [PizzaMyHeart/muler: A search engine for drug information built with Flask.](https://github.com/PizzaMyHeart/muler)

## Dataset Zoo

* [Deeplite/deeplite-torch-zoo](https://github.com/Deeplite/deeplite-torch-zoo) : PyTorch

## Dataset

* [Recommended Data Repositories - Scientific Data](https://www.nature.com/sdata/policies/repositories)

## Dataset

* [CatMeows: A Publicly-Available Dataset of Cat Vocalizations - Zenodo](https://zenodo.org/record/4008297)
* [Home - BBC Programme Index](https://genome.ch.bbc.co.uk/)

## Dataset

* [Dataset Search](https://datasetsearch.research.google.com/)

## Dataset

* [google-research-datasets/wit: WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.](https://github.com/google-research-datasets/wit)
* [PUBLIC DATA: 2021 AI Index Report - Google Drive](https://drive.google.com/drive/folders/1YY9rj8bGSJDLgIq09FwmF2y1k_FazJUm)

## Dataset Tools

* [Scale AI: The Data Platform for AI](https://scale.com/) : High quality training and validation data for AI applications
* [Aquarium - Data Management For ML](https://www.aquariumlearning.com/) : ML data management platform
* [Labelbox: The leading training data platform for data labeling](https://labelbox.com/) : Save time by creating and managing your training data, people, and processes in a single place

## Cell Tower Dataset

* [Cellular Tower and Signal Map](https://www.cellmapper.net/map)
* [OpenCelliD - Largest Open Database of Cell Towers & Geolocation - by Unwired Labs](https://www.opencellid.org/#zoom=11&lat=-7.7798&lon=109.1333)

## Twitter Dataset

* [Hedonometer](https://hedonometer.org/timeseries/en_all/?from=2019-11-02&to=2021-05-01)

## Dataset

* [EleutherAI](https://www.eleuther.ai/) : EleutherAI is a grassroots AI research group aimed at democratizing and open sourcing AI research.
* [The Pile](https://pile.eleuther.ai/) : The Pile is an 825 GiB diverse, open source language modelling data set that consists of 22 smaller, high-quality datasets combined together.

## Hugging Face Dataset API

* [huggingface/datasets-server: Integrate into your apps over 10,000 datasets via simple HTTP requests, with pre-processed responses and scalability built-in.](https://github.com/huggingface/datasets-server)
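
As a rough illustration of what "simple HTTP requests" can look like, here is a minimal Python sketch using only the standard library. It assumes the hosted service is reachable at `https://datasets-server.huggingface.co` and exposes a `/rows` endpoint taking `dataset`, `config`, `split`, `offset`, and `length` query parameters. These details are not stated on this page, so check them against the repository's documentation. The helper names `build_rows_url` and `fetch_rows` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of the hosted service; verify against the project docs.
BASE_URL = "https://datasets-server.huggingface.co"


def build_rows_url(dataset, config, split, offset=0, length=10):
    """Build the URL for a slice of rows (hypothetical helper).

    The endpoint name and parameter names are assumptions, not taken
    from this page.
    """
    query = urlencode(
        {
            "dataset": dataset,
            "config": config,
            "split": split,
            "offset": offset,
            "length": length,
        }
    )
    return f"{BASE_URL}/rows?{query}"


def fetch_rows(dataset, config, split, offset=0, length=10):
    """Send the GET request and decode the JSON response body."""
    url = build_rows_url(dataset, config, split, offset, length)
    with urlopen(url, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_rows("<owner>/<dataset>", "default", "train", length=5)` with a real dataset identifier would issue one GET request and return the parsed JSON. Keeping URL construction in its own function makes it easy to inspect the request before any network call is made.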

## Dataset

* [Harvard Dataverse](https://dataverse.harvard.edu/)

## Open Food Data

* [Open Food Facts - World](https://world.openfoodfacts.org/)

