Azure Open Datasets are curated public datasets that you can use to add scenario-specific features to machine learning solutions, for more accurate models. Open Datasets are available in the cloud, on Microsoft Azure. They're integrated into Azure Machine Learning and readily available to Azure Databricks. You can also access the datasets through APIs, and you can use them in other products, such as Power BI and Azure Data Factory.
Datasets include public-domain data for weather, census, holidays, public safety, and location that help you train machine learning models and enrich predictive solutions. You can also share your public datasets through Azure Open Datasets.
Curated, prepared datasets
Curated open public datasets in Azure Open Datasets are optimized for consumption in machine learning workflows.
For more information about the available datasets, visit the Azure Open Datasets Catalog resource.
Data scientists often spend most of their time cleaning and preparing data for advanced analytics. To save you time, Open Datasets are copied to the Azure cloud and then preprocessed. At regular intervals, data is pulled from the sources, for example over an FTP connection to the National Oceanic and Atmospheric Administration (NOAA). Next, the data is parsed into a structured format, and then enriched as needed with features such as ZIP Code or the locations of the nearest weather stations.
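To make the enrichment step concrete, here is a minimal sketch of attaching the nearest weather station to each record. It is not the actual Open Datasets pipeline: the function names, record fields, station IDs, and coordinates are all hypothetical, and the great-circle (haversine) distance is just one reasonable way to pick the nearest station.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (
        sin((lat2 - lat1) / 2) ** 2
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def add_nearest_station(records, stations):
    """Return copies of the records with a 'nearest_station' field added."""
    enriched = []
    for rec in records:
        nearest = min(
            stations,
            key=lambda s: haversine_km(rec["lat"], rec["lon"], s["lat"], s["lon"]),
        )
        enriched.append({**rec, "nearest_station": nearest["id"]})
    return enriched


# Hypothetical sample data (approximate coordinates, for illustration only).
stations = [
    {"id": "CENTRAL_PARK", "lat": 40.779, "lon": -73.969},
    {"id": "JFK", "lat": 40.639, "lon": -73.762},
]
trips = [
    {"trip_id": 1, "lat": 40.758, "lon": -73.985},
    {"trip_id": 2, "lat": 40.660, "lon": -73.800},
]

enriched = add_nearest_station(trips, stations)
```

A linear scan over stations is fine for a handful of stations. For large station lists, a spatial index would likely be a better fit.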
Datasets are cohosted with cloud compute in Azure, to make access and manipulation easier.
Here are examples of available datasets:
Transportation
| Dataset | Description |
|---|---|
| NYC Taxi & Limousine Commission - yellow taxi trip records | The yellow taxi trip records include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. |
| NYC Taxi & Limousine Commission - green taxi trip records | The green taxi trip records include pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. |
Labor and economics
| Dataset | Description |
|---|---|
| US Labor Force Statistics | US Labor Force Statistics provides labor force statistics, labor force participation rates, and the civilian noninstitutional population by age, gender, race, and ethnic group in the United States. |
| US National Employment Hours and Earnings | The Current Employment Statistics (CES) program produces detailed industry estimates of nonfarm employment, hours, and earnings of workers on payrolls in the United States. |
Access to datasets
With an Azure account, you can access open datasets through code or through the Azure service interface. The data is colocated with Azure cloud compute resources for use in your machine learning solutions.
Open Datasets are available through the Azure Machine Learning UI and SDK. Open Datasets also provide Azure Notebooks and Azure Databricks notebooks that can connect data to Azure Machine Learning and Azure Databricks. Datasets can also be accessed through a Python SDK.
However, you don't need an Azure account to access Open Datasets; you can access them from any Python environment with or without Spark.
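As a rough illustration of access from a plain Python environment, the sketch below loads one month of the yellow taxi trip records into a pandas DataFrame. It assumes the `azureml-opendatasets` package is installed and that its `NycTlcYellow` class accepts `start_date` and `end_date` and exposes `to_pandas_dataframe()`; check the package documentation for your installed version. The helper names `month_window` and `load_yellow_taxi` are hypothetical.

```python
from datetime import datetime


def month_window(year, month):
    """Return (start, end) datetimes spanning one calendar month."""
    start = datetime(year, month, 1)
    if month == 12:
        end = datetime(year + 1, 1, 1)
    else:
        end = datetime(year, month + 1, 1)
    return start, end


def load_yellow_taxi(year, month):
    """Load one month of NYC yellow taxi trips as a pandas DataFrame.

    Assumes `pip install azureml-opendatasets`. The import is done lazily
    so that this module can be imported without the package present.
    """
    from azureml.opendatasets import NycTlcYellow

    start, end = month_window(year, month)
    return NycTlcYellow(start_date=start, end_date=end).to_pandas_dataframe()
```

Calling `load_yellow_taxi(2018, 5)` downloads data over the network, so expect it to take some time and memory. Narrowing the date window is the simplest way to keep the download small.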
Request or contribute datasets
If you can't find the data you want, email us to request or contribute a dataset.