Hawke Eye #5: Datasets are Taking the Center Stage of AI Wars
In the AI gold rush, don't sell picks and shovels. Sell the GOLD!
The textbook advice in a gold rush is to sell picks and shovels rather than dig for gold. The AI gold rush flips that proposition: models and tooling are fast becoming the commodity, while the data itself, the gold, is what holds the value.
Commodity datasets are no longer good enough. They are OK at best.
If everyone uses the same data to train their models, the models will eventually converge to provide similar performance. Not a blue ocean strategy.
A list of the most popular datasets appears at the bottom of this issue. Although carefully curated, these datasets are lab-grade at best and mainly useful for smaller projects testing out new algorithms.
Commodity vs. Proprietary Datasets
Both commodity datasets and proprietary datasets have their advantages and disadvantages, and their utility will largely depend on the specific application, objectives, and competitive landscape.
Commodity Datasets
Advantages:
Accessibility: Generally easy to access and often free.
Benchmarking: Industry standards for comparison of models.
Community Support: A lot of pre-trained models and open-source solutions are available.
Initial Prototyping: Good for initial experimentation and proof of concept.
Disadvantages:
Generic: May not capture the specifics of your application.
Overfitting Risk: There's a risk of overfitting your model to a well-known dataset, making it less effective on real-world, specific data.
Competitive Disadvantage: Since everyone has access to the same data, there’s less of a competitive edge.
Proprietary Datasets
Advantages:
Competitive Edge: Unique data can provide a significant competitive advantage.
Specificity: Data can be tailored to the exact needs of the business or problem.
Exclusivity: Being the sole possessor of a dataset can make your solution inimitable.
Disadvantages:
Cost: Gathering unique data can be expensive and time-consuming.
Data Quality: Requires a robust pipeline for data cleaning and preparation (a minimal cleaning sketch follows this list).
Legal and Ethical Concerns: Ownership and usage of data can entail complicated legal responsibilities, including around user privacy.
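To make the Data Quality point concrete, here is a minimal cleaning-and-validation sketch in Python with pandas. The file name, column names, and thresholds are placeholders chosen for illustration, not a reference implementation of any particular pipeline.

```python
import pandas as pd

# Hypothetical raw export from an internal system (path and columns are placeholders).
df = pd.read_csv("raw_events.csv")

# 1. Drop exact duplicates, which are common in log-style exports.
df = df.drop_duplicates()

# 2. Enforce a minimal schema: required columns must exist and be non-null.
required = ["user_id", "timestamp", "label"]
missing_cols = [c for c in required if c not in df.columns]
if missing_cols:
    raise ValueError(f"Missing expected columns: {missing_cols}")
df = df.dropna(subset=required)

# 3. Normalize types so downstream training code sees consistent values.
df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce")
df = df.dropna(subset=["timestamp"])

# 4. Record what the cleaning step kept before handing off to training.
print(f"{len(df)} rows retained after cleaning")
df.to_csv("clean_events.csv", index=False)
```

In practice this grows into a versioned, tested pipeline; the point is that the cleaning logic itself encodes knowledge of your own data, which is exactly what competitors cannot copy.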
Most popular datasets
Currently, the most popular datasets, by training category, are as follows.
Images:
ImageNet: This dataset has over a million labeled images and 1000 different categories. It's the benchmark for many algorithms in image classification. (Source: https://www.image-net.org/download.php)
COCO (Common Objects in Context): Used for object detection, segmentation, and captioning, it contains 330K images with over 200K labeled (a loading sketch follows this category). (Source: https://cocodataset.org/#download)
Aerial Imagery: Datasets like UC Merced Land Use Dataset and NWPU-RESISC45 provide aerial imagery for land use classification. (Source: http://weegee.vision.ucmerced.edu/datasets/landuse.html)
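As promised above, here is a sketch of how a commodity image dataset like COCO is commonly loaded through torchvision. The directory layout is an assumption based on the standard 2017 download, and pycocotools is needed to parse the annotations.

```python
from torchvision import datasets, transforms

# Assumed on-disk layout from the standard COCO 2017 download (paths are assumptions):
#   coco/train2017/                            - images
#   coco/annotations/instances_train2017.json  - detection annotations
coco_train = datasets.CocoDetection(
    root="coco/train2017",
    annFile="coco/annotations/instances_train2017.json",
    transform=transforms.ToTensor(),
)

image, targets = coco_train[0]  # image tensor plus a list of annotation dicts
print(image.shape, len(targets))
for ann in targets[:3]:
    print(ann["category_id"], ann["bbox"])  # COCO boxes are [x, y, width, height]
```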
Natural Language Processing:
Wikipedia dump: A large corpus used for various NLP tasks. (Source: https://en.wikipedia.org/wiki/Wikipedia:Database_download)
Common Crawl: A massive web crawl dataset that can be used for a variety of NLP applications. (Source: https://commoncrawl.org/)
Book Corpus: Contains books in different genres for unsupervised learning. (Source: https://paperswithcode.com/dataset/bookcorpus)
SQuAD (Stanford Question Answering Dataset): A reading comprehension dataset where models are tasked with answering questions based on given paragraphs. (Source: https://huggingface.co/datasets/squad)
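Many of the NLP corpora above are mirrored on the Hugging Face Hub, so a first look rarely takes more than a few lines. A minimal sketch for SQuAD, assuming the datasets library is installed:

```python
from datasets import load_dataset

# Downloads and caches SQuAD from the Hugging Face Hub on first use.
squad = load_dataset("squad")

print(squad)                      # DatasetDict with "train" and "validation" splits
sample = squad["train"][0]
print(sample["question"])
print(sample["context"][:200])
print(sample["answers"])          # {"text": [...], "answer_start": [...]}
```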
Speech and Audio:
LibriSpeech: An extensive corpus of read English speech derived from LibriVox audiobooks (a loading sketch follows this category). (Source: https://www.openslr.org/12)
VCTK: Contains English spoken by 109 native speakers with various accents. (Source: https://paperswithcode.com/dataset/vctk)
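As noted above, LibriSpeech also has a ready-made wrapper in torchaudio. A minimal sketch, assuming torchaudio is installed and downloading the train-clean-100 split is acceptable:

```python
import torchaudio

# Downloads the train-clean-100 split of LibriSpeech into ./data on first use.
librispeech = torchaudio.datasets.LIBRISPEECH(
    root="./data", url="train-clean-100", download=True
)

# Each item is (waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id).
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = librispeech[0]
print(waveform.shape, sample_rate)
print(transcript)
```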
Medical Imaging:
DICOM (Digital Imaging and Communications in Medicine): The standard format for storing and transmitting medical images; the linked Cancer Imaging Archive hosts many imaging collections in DICOM format. (Source: https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=80969777)
MIMIC-III: A dataset containing de-identified health data associated with over 40,000 critical care patients. (Source: https://physionet.org/content/mimiciii/1.4/)
Video:
YouTube-8M: A large-scale labeled video dataset with millions of YouTube video IDs and associated labels. (Source: https://research.google.com/youtube8m/)
UCF101: Contains 13,320 videos from 101 action categories. (Source: https://www.crcv.ucf.edu/data/UCF101.php)
Autonomous Vehicles and Drones:
Waymo Open Dataset: Waymo’s self-driving dataset, offering high-resolution sensor data. (Source: https://waymo.com/open/)
KITTI: A benchmark suite for autonomous driving tasks. (Source: https://www.cvlibs.net/datasets/kitti/)
DJI Drone Detection Dataset: For detecting drones in various scenarios. (Source: https://www.kaggle.com/datasets/sshikamaru/drone-yolo-detection)
Recommender Systems:
MovieLens: Movie recommendation datasets of various sizes (a loading sketch follows this category). (Source: https://grouplens.org/datasets/movielens/)
Amazon product review datasets: Useful for recommendation systems, sentiment analysis, etc. (Source: https://jmcauley.ucsd.edu/data/amazon/)
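As referenced above, MovieLens ships as plain CSV files, so a first look needs nothing beyond pandas. The path below assumes the ml-latest-small archive has been downloaded and unzipped locally:

```python
import pandas as pd

# ratings.csv in ml-latest-small has columns: userId, movieId, rating, timestamp.
# (The path is an assumption about where the archive was unzipped.)
ratings = pd.read_csv("ml-latest-small/ratings.csv")

# Simple popularity baseline: mean rating per movie, keeping only frequently rated titles.
stats = ratings.groupby("movieId")["rating"].agg(["mean", "count"])
popular = stats[stats["count"] >= 50].sort_values("mean", ascending=False)
print(popular.head(10))
```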
Tabular Data and Time Series:
UCI Machine Learning Repository: Hosts datasets for various domains including tabular data, image data, and more (a loading sketch follows this category). (Source: https://archive.ics.uci.edu/)
Kaggle: Offers a variety of datasets for different tasks, including time-series forecasting, tabular data competitions, etc. (Source: https://www.kaggle.com/datasets/)
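Many classic UCI tables are also mirrored on OpenML, so scikit-learn can pull them straight into a DataFrame. A minimal sketch, assuming the Adult census table is available on OpenML under the name "adult" (the name and version are assumptions about the OpenML catalog):

```python
from sklearn.datasets import fetch_openml

# Fetches the Adult census income table from OpenML as pandas objects.
# (Dataset name and version are assumptions about the OpenML catalog.)
X, y = fetch_openml("adult", version=2, return_X_y=True, as_frame=True)

print(X.shape)           # feature table
print(y.value_counts())  # income classes
```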
Translation:
WMT (Workshop on Statistical Machine Translation): Offers datasets for machine translation tasks between various language pairs. (Source: https://paperswithcode.com/dataset/wmt-2020)
Specialized Tasks:
CelebA: Contains over 200,000 celebrity images, each with 40 attribute labels, commonly used for face-related tasks. (Source: https://www.kaggle.com/datasets/jessicali9530/celeba-dataset)
CASIA-WebFace: Used for face recognition tasks. (Source: https://www.kaggle.com/datasets/ntl0601/casia-webface)
MS MARCO: A dataset specifically for deep learning-based search engines. (Source: https://microsoft.github.io/msmarco/)
Conclusion
The collection and curation of new data, and the ability to turn it into new, clean datasets, will be the only real differentiating factor for the next generation of incubators and startups entering the data and AI wars.


