Machine Learning Tools, Platforms, and Datasets: Getting Started
This post covers the essential Python libraries, web platforms, and public datasets for experimenting with Machine Learning.
Machine Learning Libraries and APIs: scikit-learn, TensorFlow, PyTorch, XGBoost
The core Python libraries for Machine Learning are scikit-learn (traditional ML), TensorFlow by Google, PyTorch by Facebook/Meta, and Keras, a high-level API that makes the underlying deep learning backends easier to use. These are the most mature tools providing Machine Learning algorithms. Keras originally ran on top of Theano or TensorFlow (Aesara is a later fork of Theano, not a Keras backend); since Keras 3 it supports TensorFlow, JAX, and PyTorch as backends. PyTorch is often considered more user-friendly than TensorFlow. Nevertheless, TensorFlow is older and has a large community and extensive production tooling.
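scikit-learn is an open-source Python library that provides regression, classification, clustering, feature selection, metrics, and preprocessing functionality. scikit-learn offers only basic multi-layer perceptron models and has no GPU-accelerated deep learning support. Scikit Flow (skflow), once a scikit-learn-style wrapper for building neural networks with TensorFlow, was merged into TensorFlow and is no longer maintained. To use deep learning within a scikit-learn-style API today, consider a wrapper such as SciKeras (for Keras models) or skorch (for PyTorch models).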
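To make the list of scikit-learn features more concrete, here is a minimal sketch that chains preprocessing, feature selection, classification, and a metric in one pipeline. The choice of the built-in iris dataset, the logistic regression model, and all parameter values are my assumptions for illustration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, so nothing has to be downloaded.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Preprocessing -> feature selection -> classification in one estimator.
model = make_pipeline(
    StandardScaler(),             # scale each feature to zero mean, unit variance
    SelectKBest(f_classif, k=2),  # keep the 2 features with the highest ANOVA F-score
    LogisticRegression(max_iter=200),
)
model.fit(X_train, y_train)

# Evaluate on the held-out split with a standard metric.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the steps in a pipeline means the scaler and the feature selector are fitted on the training split only, which avoids leaking information from the test split.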
XGBoost (Extreme Gradient Boosting) uses boosted decision trees to build regression, classification, ranking, and other predictive models, and it parallelises training across CPU cores. It is a popular choice for structured tabular data competitions on Kaggle.
Machine Learning Platforms: Kaggle and OpenML for Experimentation
Kaggle is one of the largest public Machine Learning competition and dataset platforms, and it is a good first step for hands-on ML experimentation.
Kaggle has posts, discussions, shared code, datasets, and competitions on topics from simple regression to computer vision and NLP. Kaggle also has courses in Python, Machine Learning, data manipulation and visualisation, SQL, and others.
OpenML is a Machine Learning platform for sharing experiments and datasets to facilitate reproducibility in research. OpenML contributors provide a Python API, OpenML-Python, that integrates with ML libraries such as scikit-learn [1]. OpenML is widely used as a reference platform for reproducible ML benchmarks in academic research.
Public Machine Learning Datasets: Sources and APIs
Keep in mind that datasets on platforms such as Kaggle and OpenML are usually already preprocessed and ready for experiments. For real-life training purposes, it is good to try collecting your own data, for instance through web scraping or publicly available APIs. For example, the Twitter (now X) streaming API has been a good source of real-life data, although its access terms have changed, so check the current conditions before relying on it. I have shared the tweet collection code in my GitHub repository if you would like to experiment with Twitter data.
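To give a feel for what collecting your own data involves, here is a minimal sketch of the parsing step of web scraping, using only the Python standard library. The HTML snippet, the `title` class name, and the `TitleExtractor` helper are all hypothetical; in practice you would first download a real page (respecting the site's terms of use) and adapt the selectors to its markup.

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this first.
SAMPLE_HTML = """
<html><body>
  <h2 class="title">First post</h2>
  <p>Intro text</p>
  <h2 class="title">Second post</h2>
  <h2 class="other">Sidebar</h2>
</body></html>
"""


class TitleExtractor(HTMLParser):
    """Collects the text of every <h2 class="title"> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True
            self.titles.append("")

    def handle_data(self, data):
        # Accumulate text only while inside a matching <h2>.
        if self._in_title:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False


parser = TitleExtractor()
parser.feed(SAMPLE_HTML)
titles = [t.strip() for t in parser.titles]
print(titles)
```

Unlike the ready-made datasets, data collected this way still needs cleaning, deduplication, and labelling before it can be used for training.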
References
[1] Matthias Feurer, Jan N. van Rijn, Arlind Kadra, Pieter Gijsbers, Neeratyoy Mallik, Sahithya Ravi, Andreas Mueller, Joaquin Vanschoren, Frank Hutter. OpenML-Python: an extensible Python API for OpenML. arXiv:1911.02490 [cs.LG], 2019