The Complete List of Datasets for data-driven content

Federal Election Commission

Government Free

Created by Congress in 1975, the FEC is an independent regulatory authority whose purpose is to disclose campaign finance information. You can find data on US campaign finance sources. The FEC has downloadable data sets so you can slice and dice the data for your own analysis, or you can use ready-made maps and charts that already break down the data for you.

Website
http://www.fec.gov/pindex.shtml

GovTrack

Government Free

GovTrack.us is a completely independent entity which tracks the status of federal legislation, information about your representative and senators in Congress, as well as voting records and original research. GovTrack helps Americans understand what is going on in their national legislature.

Website
https://www.govtrack.us/

UCR Data Tool

Government Free

The FBI’s Uniform Crime Reporting (UCR) Program collects statistics on violent and property crime. The FBI, in cooperation with the Bureau of Justice Statistics, allows users to build their own customized data tables on this site. By using the table-building tool, users can choose these options: offenses, locality (city, county, state), and year(s).

Website
https://www.ucrdatatool.gov

Bureau of Labor Statistics

Free

The Bureau of Labor Statistics’ Public Data API (Version 1.0 and Version 2.0) gives the public access to economic data from all of its programs.

Website
https://www.bls.gov/

Data.Gov

Government aggregator Free

This is the U.S. Government’s official data portal. It provides numerous datasets on the nation’s demography, businesses and trade.

Website
https://www.data.gov/

American FactFinder (US Census)

Government aggregator Free

American FactFinder provides access to data collected from surveys and censuses regarding the United States, Puerto Rico and the Island Areas.

Website
https://factfinder.census.gov/

NICAR

aggregator Paid

The National Institute for Computer-Assisted Reporting (NICAR) is a program of Investigative Reporters and Editors (IRE), based at the Missouri School of Journalism. It is dedicated to excellence in journalism, particularly with regard to data journalism.

Website
https://www.ire.org/nicar/database-library/

Graphiq

Freemium

Graphiq visualizations are used to enrich editorial content and increase support of third-party applications. More than 10 billion visualizations are already listed in the Graphiq library, and thousands more are added daily.

Website
https://www.graphiq.com

Atlas

Free

Atlas is a platform whose goal is to let everyone discover and share great charts. Chart creators, especially researchers, analysts and journalists, can use Atlas’ platform to create, share and embed their data visualizations.

Website
https://www.theatlas.com/

ProPublica

aggregator Free

ProPublica is a non-profit investigative news outlet which offers up hyperlocal data on an array of important issues, such as abuses of power and public trust issues found in government and businesses. The premium data products provide data, analysis, and practical documentation.

Website
https://www.propublica.org/data/

Google Public Data Explorer

aggregator Free

The Google Public Data Explorer makes it easy to review and use large datasets which are displayed as line graphs, bar graphs, cross sectional plots or on maps. The platform provides past and current public data and predictions from numerous international organizations, such as the World Bank, OECD, Eurostat and the University of Denver.

Website
https://www.google.com/publicdata/directory

DataPortals

aggregator Free

This is an extensive and comprehensive directory of open-data portals world-wide. It is managed by a group of leading open data experts, including representatives from local, regional and national governments, many NGOs, and international organisations.

Website
http://dataportals.org/

Net Data Directory

aggregator Free

The Net Data Directory collects and shares information on a wide range of Internet-related topics, including freedom of expression, broadband, social media, cybersecurity and more. This database makes it easier for researchers to search, sort, and filter records that are important to their work, and many of the datasets are open and available to the public.

Website
https://netdatadirectory.org/

OpenSecrets.org

Freemium

OpenSecrets.org is the nation’s premier website tracking the influence of money on U.S. politics. The site offers clear and unbiased information which details how money affects not only government policy, but also the lives of US citizens and residents.

Website
http://www.opensecrets.org/

Censorship Explorer

Free

Censorship Explorer has a proxy list that is regularly updated by scraping free online proxy lists. Each URL you enter is requested through each selected proxy, so you can check whether a URL is censored in a particular country by using proxies located around the world.

Website
https://wiki.digitalmethods.net/Dmi/ToolCensorshipExplorer
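
The per-proxy check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Censorship Explorer's own code: the function names `fetch_via_proxy` and `check_url`, and the country-to-proxy mapping, are assumptions made for this example.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional


def fetch_via_proxy(url: str, proxy: str, timeout: float = 10.0) -> int:
    """Request `url` through `proxy` and return the HTTP status code."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.status


def check_url(
    url: str,
    proxies: Dict[str, str],
    fetch: Callable[[str, str], int] = fetch_via_proxy,
) -> Dict[str, Optional[int]]:
    """Request `url` once per proxy.

    `proxies` maps a label (for example a country code) to a proxy address.
    The result maps each label to an HTTP status code, or to None when the
    request failed at the network level.
    """
    results: Dict[str, Optional[int]] = {}
    for label, proxy in proxies.items():
        try:
            results[label] = fetch(url, proxy)
        except urllib.error.HTTPError as err:
            # The server answered, but with an error status such as 403.
            results[label] = err.code
        except OSError:
            # Covers connection errors and timeouts.
            results[label] = None
    return results
```

A `None` or an error status for one country is only a hint: free proxies are often slow or offline, so a failed request does not by itself show that the URL is censored there.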

CrocTail

aggregator Free

CrocTail provides an interface for browsing information about several hundred thousand U.S. publicly traded corporations and their foreign subsidiaries. Information from company filings with the U.S. Securities and Exchange Commission (SEC) has been parsed and annotated by CorpWatch to provide a way for Crocodyl.org users to research and add issues related to corporate subsidiaries. CrocTail also serves as a demonstration of the features and data available through the CorpWatch API.

Website
http://croctail.corpwatch.org/#cw_391,cw_391,2016

Crowd Voice

Social Media Freemium

The new Crowdvoice.by is an open-source tool that can be used to collect, organize and share information about causes that are important to you. It is an interactive platform where users can encourage change by raising awareness, and it can be easily customized and embedded into your page.

Website
http://crowdvoice.org/

Dat

Free

Dat is a secure, open-source, decentralized data-sharing tool for syncing changes to data. Dat acts as a package manager for data, and has a JavaScript library and a powerful command-line tool that makes it easy to share and version-control data.

Website
https://datproject.org/

Data Stringer

aggregator Free

Datastringer is a tool for hacker-journos. Datastringer can help you subscribe to data sources, and it will contact you when patterns arise or thresholds are broken. It can provide you with re-usable tools (written in JavaScript and Node.js) that you will partly configure through a graphical interface.

Website
https://bbc-news-labs.github.io/datastringer/
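
To make the "thresholds are broken" idea concrete, here is a minimal Python sketch of a threshold alert. It does not use Datastringer's own JavaScript API; the names `find_breaches` and `notify` and the sample readings are hypothetical and exist only for this illustration.

```python
from typing import Callable, Iterable, List, Tuple

Reading = Tuple[str, float]  # (label such as a month, observed value)


def find_breaches(readings: Iterable[Reading], threshold: float) -> List[Reading]:
    """Return the readings whose value is strictly above `threshold`."""
    return [(label, value) for label, value in readings if value > threshold]


def notify(breaches: Iterable[Reading], send: Callable[[str], None] = print) -> None:
    """Send one alert message per breach; `send` defaults to printing."""
    for label, value in breaches:
        send(f"Threshold broken: {label} = {value}")


if __name__ == "__main__":
    # Made-up monthly figures, checked against a threshold of 100.
    sample = [("2016-01", 90.0), ("2016-02", 120.5), ("2016-03", 100.0)]
    notify(find_breaches(sample, threshold=100.0))
```

In a real setup the `send` callable would be swapped for something that emails or messages the reporter, and the readings would come from a subscribed data source rather than a hard-coded list.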

CDC

Government Health Free

The Centers for Disease Control and Prevention (CDC) offers data and statistics by topic, along with tools and resources for numerous diseases and health-related subjects.

Website
https://www.cdc.gov/DataStatistics/

Find the Data

Freemium

Deep insights from reference data. Knowledge delivered. Find the Data is a reference site that uses Graphiq’s semantic technology to deliver deep insights via data-driven articles, visualizations and research tools.

Website
http://www.findthedata.com/

Open Data by ArcGIS

Freemium

Share live ArcGIS open data in minutes. As part of your ArcGIS Online subscription, you can use ArcGIS Open Data to share your live authoritative open data. Esri-hosted ArcGIS Open Data gives you a quick way to set up public-facing websites where people can easily find and download your open data in a variety of open formats.

Website
http://opendata.arcgis.com/

Hall of Justice

Legal Free

Criminal Justice data transparency

Website
http://hallofjustice.sunlightfoundation.com/

Follow the Money

aggregator Free

The nation’s only free, nonpartisan, verifiable archive of contributions to political campaigns in all 50 states.

Website
http://followthemoney.org/

FOIA Machine

aggregator Paid

FOIA Machine is an open-source platform that empowers citizens and journalists to easily prepare, file and track multiple public records requests to various governmental and public agencies worldwide. This site helps users access government documents and data that are covered by Freedom of Information Act (FOIA) laws allowing citizens to obtain information vital to the workings of their government.

Website
https://www.foiamachine.org/

Kaggle Datasets

aggregator Free

The best place to discover and seamlessly analyze open data. Execute, share, and comment on code for any open dataset with our in-browser analytics tool, Kaggle Kernels. You can also download datasets in an easy-to-read format.

Website
https://www.kaggle.com/datasets

MapLight

Free

Tracking campaign contributions. MapLight is a nonpartisan research organization that reveals money’s influence on politics. We research and compile data about the sources of campaign contributions in U.S. presidential, congressional, state, and local ballot and candidate elections. We provide journalists and citizens with transparency tools that connect data on campaign contributions, politicians, legislative votes, industries, companies, and more to show patterns of influence never before possible to see. These tools allow users to gain unique insights into how campaign contributions affect policy so they can draw their own conclusions about how money influences our political system.

Website
http://maplight.org/

Re3data

Free

Re3data.org is a global registry of research data repositories covering different academic disciplines. It presents repositories for the permanent storage and access of data sets to researchers, funding bodies, publishers and scholarly institutions.

Website
http://www.re3data.org/

Awesome Public Data Sets

aggregator Social Media Freemium

Public data sources are collected from blogs, answers, and user responses and turned into an organized and awesome list of ongoing, high-quality open datasets in public domains. Some are free, some are not.

Website
https://github.com/caesar0301/awesome-public-datasets

Google Trends

World Social Media Free

Website
https://trends.google.com/trends/

Index Mundi

World aggregator Free

Global data on population, demographics, trade and more. IndexMundi contains detailed country statistics, charts, and maps compiled from multiple sources. You can explore and analyze thousands of indicators organized by region, country, topic, industry sector, and type.

Website
http://www.indexmundi.com/

Google Correlate

Free

Find searches that correlate with real-world data

Website
https://www.google.com/trends/correlate

Google Public Data

Free

This dataset contains the World Development Indicators (WDI).

Website
https://www.google.com/publicdata/directory

Social Explorer

aggregator Social Media Freemium

Use our interactive tools to easily create and share maps, presentations and tables, or compare and analyse data and discover amazing facts.

Website
http://www.socialexplorer.com/

Data USA

aggregator Free

Data USA is the most comprehensive visualization of U.S. public data. It provides an open, easy-to-use platform that turns data into knowledge for use by all sectors and occupations.

Website
https://datausa.io/

Google Data Studio

aggregator Free

Aggregates data from a variety of global public data sources for fast analysis, comparison and exploration.

Website
https://datastudio.google.com/

Data World

World Social Media Freemium

Data World is a social network platform for people who need to have access to a vast array of high-quality open data. It’s easy to share and connect with other problem-solvers, thereby accelerating and improving decision-making, knowledge transfer and more.

Website
https://data.world/

World Bank Open Data

aggregator Free

Free and open access to global development data

Website
http://data.worldbank.org/

AWS Public Datasets

Scientific Economic Free

Public Datasets on AWS provides a condensed storehouse of public datasets that can be smoothly integrated into AWS cloud-based applications. AWS is hosting the public datasets at no charge for the community, and users need only pay for the compute and storage they use for their own applications.

Website
https://aws.amazon.com/datasets/

Enigma

aggregator Freemium

Enigma’s Public Data Explorer contains massive troves of scraped data on almost any topic of public import. It allows for dynamic filtering, querying and search throughout every record and row of every dataset.

Website
http://enigma.io/

Data Bulletin

aggregator Free

The Data Bulletin is a central channel for the publication and analysis of data stories, and it is continuously updated with a stream of newly released government and private sector datasets that are available for download.

Website
http://databullet.in/

Uber Movement

Free

Uber Movement data has been used to examine holiday traffic trends in Manila, measure road network performance in Australia, and understand the impact of Washington DC’s Metrorail shutdown.

Website
http://datadrivenjournalism.net/resources/uber_movement

NDC Explorer

Free

A one-stop-shop for exploring national climate action plans.

Website
http://klimalog.die-gdi.de/ndc/#NDCExplorer/worldMap

Mapzen Mobility Explorer

Free

Mapzen Mobility Explorer helps you understand transportation networks around the world. Mapzen is an open, sustainable, and accessible mapping platform whose tools let you display, search, and navigate your world.

Website
https://mapzen.com/mobility/explorer/

IIAG Data Portal

World Social Media Free

The Ibrahim Index of African Governance (IIAG) data portal has a mandate to strengthen the availability and use of data in Africa. The new portal is freely available online and serves as an interactive platform for in-depth exploration of governance performance for each of the 54 countries.

Website
http://iiag.online/

Afrobarometer Online Data Analysis

Free

Afrobarometer’s online data analysis (ODA) tool provides free and open data about Africans’ views on a number of issues, including democracy and governance. The tool gives easy access to quality data on Africa.

Website
http://www.afrobarometer.org/online-data-analysis

Weather Data

Free

weatherData is a collection of functions that fetch weather data (temperature, pressure, humidity, etc.) from the web for you as a clean data frame. It also has pre-loaded data sets that you can use.

Website
http://ram-n.github.io/weatherData/builtin.html

Lumen

Free

The Lumen database collects and analyzes legal complaints and requests for removal of online materials, helping Internet users to know their rights and understand the law. These data enable us to study the prevalence of legal threats and let Internet users see the source of content removals.

Website
https://www.lumendatabase.org/

SpikeCharts

Freemium

A macroeconomic news analytics tool that provides historical Forex market data as chart snapshots centered on market-moving economic news announcements.

Website
http://next.newsimpact.com/

Asian Data by Asian Development Bank

Economic Free

Asian Development Bank supports a free visualization tool for mobile devices that presents the latest macroeconomic and social indicators for Asia. The tool augments the stockpile of knowledge of developing member countries and the region and spreads it, so that Asia’s policies can be strengthened based on key data.

Website
https://www.adb.org/data/main

OpenAIRE

Freemium

OpenAIRE is an EC-funded initiative that supports the Open Access policy of the European Commission via a technical infrastructure. The project aims to promote open scholarship and substantially improve the discoverability and reusability of research publications and data. To this end, it offers a data repository platform that allows users to host and retrieve research data.

Website
https://www.openaire.eu/

UN-Habitat Urban Data Portal

Freemium

UN-Habitat has launched a new web portal featuring a wealth of city data based on its repository of research on urban trends.

Website
http://urbandata.unhabitat.org/

Open Spending

World Freemium

“Only by understanding how governments spend money in our name can we have a say in how that money will affect our own lives.
The journey starts here.”

Website
https://openspending.org/

Zanran

Freemium

Zanran helps you to find ‘semi-structured’ data on the web. This is the numerical data that people have presented as graphs and tables and charts. For example, the data could be a graph in a PDF report, or a table in an Excel spreadsheet, or a barchart shown as an image in an HTML page. This huge amount of information can be difficult to find using conventional search engines, which are focused primarily on finding text rather than graphs, tables and bar charts.

Website
http://www.zanran.com

Statista

Freemium

Immediate access to over one million statistics and facts

Website
https://www.statista.com/

PewResearch Data

aggregator Free

Pew Research Center is a nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping America and the world. It conducts public opinion polling, demographic research, media content analysis and other empirical social science research. Pew Research Center does not take policy positions. It is a subsidiary of The Pew Charitable Trusts.

Website
http://www.pewresearch.org/data/

U.S. Department of Agriculture’s Plants Database

aggregator Free

The PLANTS Database provides standardized information about the vascular plants, mosses, liverworts, hornworts, and lichens of the U.S. and its territories. It includes names, plant symbols, checklists, distributional data, species abstracts, characteristics, images, crop information, automated tools, onward Web links, and references. This information primarily promotes land conservation in the United States and its territories, but academic, educational, and general use is encouraged. PLANTS reduces government spending by minimizing duplication and making information exchange possible across agencies and disciplines.

Website
http://www.plants.usda.gov/dl_all.html

Biology

Free

An awesome list of high-quality open datasets in public domains (on-going). By everyone, for everyone!
This list of public data sources is collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. Other amazingly awesome lists can be found in the awesome-awesomeness and sindresorhus’s awesome list.

Website
https://github.com/caesar0301/awesome-public-datasets#biology

1000 Genomes

Scientific Free

Data from the 1000 Genomes Project is available worldwide to the scientific community, and it is freely accessible through public databases. The developers are working on a new data portal which will facilitate finding and browsing data in IGSR.

Website
http://www.internationalgenome.org/data

American Gut (Microbiome Project)

Scientific Health Free

The American Gut project sheds light on the many connections between the human microbiome and health, and lifestyle factors. The repository is meant to be used as a project/repo, and all de-identified data is made freely available.

Website
https://github.com/biocore/American-Gut

Broad Cancer Cell Line Encyclopedia (CCLE)

Scientific Health Free

The Cancer Cell Line Encyclopedia (CCLE) project gives public access to genomic data, analysis and visualization for about 1,000 cell lines. The project began in order to conduct a detailed genetic and pharmacologic characterization of a wide panel of human cancer models, as well as to develop integrated computational analyses that link distinct pharmacologic vulnerabilities to genomic patterns. In addition, the project is used to translate cell line integrative genomics into cancer patient stratification.

Website
https://portals.broadinstitute.org/ccle/home

Broad Bioimage Benchmark Collection (BBBC)

Scientific Health Free

The Broad Bioimage Benchmark Collection (BBBC) is a collection of annotated biological image sets for testing and validation. This collection of freely downloadable microscopy image sets includes images, a description of the biological application, and a type of expected results.

Website
https://data.broadinstitute.org/bbbc/

Cell Image Library

Scientific Health Free

The Cell Image Library™ is a freely accessible, easy-to-search, public repository of reviewed and annotated images, videos, and animations of cells from a variety of organisms. The images show cell architecture, intracellular functionalities, and both normal and abnormal processes. This database is meant to promote research, education, and training, with the goal of improving human health.

Website
http://www.cellimagelibrary.org/

Complete Genomics Public Data

Scientific Free

Complete Genomics Analysis Tools (CGA™ Tools) are a set of open source software tools for downstream analysis of sequencing data which focus on multi-genome comparisons and format conversion. These tools can be used to conduct various family-based or case-control analysis.

Website
http://www.completegenomics.com/public-data/69-genomes/

EBI ArrayExpress

Scientific Health Free

The ArrayExpress Archive of Functional Genomics Data stores data from high-throughput functional genomics experiments and provides these data for reuse to the research community. ArrayExpress is one of the best-known repositories for archiving functional genomics data in support of reproducible research.

Website
http://www.ebi.ac.uk/arrayexpress/

EBI Protein Data Bank in Europe

Scientific Free

The Electron Microscopy Data Bank (EMDB) covers a variety of techniques, including electron (2D) crystallography, electron tomography, and single-particle analysis. It is a public repository for electron microscopy density maps of subcellular structures and macromolecular complexes.

Website
http://www.ebi.ac.uk/pdbe/emdb/index.html/

Electron Microscopy Public Image Archive (EMPIAR)

Scientific Tech Free

The Electron Microscopy Public Image Archive (EMPIAR) is built on input from the EM community, specifically input from two key workshops organized by the Protein Data Bank in Europe. It is a public resource for raw, 2D electron microscopy images where you can browse, upload, download and reprocess the thousands of raw, 2D images used to build a 3D structure.

Website
http://www.ebi.ac.uk/pdbe/emdb/empiar/

ENCODE project

Free

The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.

Website
https://www.encodeproject.org/

Gene Expression Omnibus (GEO)

aggregator Free

GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles.

Website
https://www.ncbi.nlm.nih.gov/geo/

Gene Ontology (GO)

Free

The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products across databases.
Annotation is the practice of capturing the activities and localization of a gene product with GO terms, providing references and indicating what kind of evidence is available to support the annotations. More information on how this is done can be found in the Guide to GO Annotation Policies. Members of the GO Consortium make their annotation data freely available to the public as part of the data accessed by AmiGO 2, the GO browser and search engine. Annotation data sets from individual databases can be found on the GO annotations page.

Website
http://geneontology.org/page/download-annotations

Global Biotic Interactions (GloBI)

Free

GloBI contains code to normalize and integrate existing species-interaction datasets and export the resulting integrated interaction dataset. The mission of this project is to find efficient ways to normalize and integrate species-interaction data. By making this data readily available, GloBI will enable researchers and enthusiasts to answer questions about localized, one-to-one species interactions and big-picture changes in species interactions over time. For example, GloBI can answer which species an Angel Shark (Squatina squatina) eats in the Gulf of Mexico, or return the results of a query for the number of Angel Sharks feeding in the Gulf of Mexico between 2005 and 2010.

Website
https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data

Harvard Medical School (HMS) LINCS Project

aggregator Freemium

The Harvard Medical School (HMS) LINCS Center is funded by NIH grant U54 HL127365 and is part of the NIH Library of Integrated Network-based Cellular Signatures (LINCS) Program. The overall goals of this program are to collect and disseminate data and analytical tools needed to understand how human cells respond to perturbation by drugs, the environment, and mutation. Further information about LINCS and other participating Centers is available at the program website.
HMS LINCS publications provide descriptions of key findings, links to relevant datasets in the HMS LINCS Database, and custom data visualization tools. These and other tools are available via our software page.

Website
http://lincs.hms.harvard.edu/

Human Genome Diversity Project

Scientific Freemium

A group of scientists at Stanford University collaborated on a large study to understand genetic diversity in human populations. They analyzed genomic DNA from 1,043 individuals from around the world, determining their genotypes at more than 650,000 SNP loci with the Illumina BeadStation technology. Genomic DNA samples from these fully consenting individuals were collected by the Human Genome Diversity Project (HGDP), in a collaboration with the Centre d’Etude du Polymorphisme Humain (CEPH) in Paris. The collection they tested is referred to as the “HGDP-CEPH Human Genome Diversity Cell Line Panel”.

Website
http://www.hagsc.org/hgdp/files.html

Human Microbiome Project (HMP)

aggregator Free

“Welcome to the Data Analysis and Coordination Center (DACC) for the National Institutes of Health (NIH) Common Fund supported Human Microbiome Project (HMP). This site is the central repository for all HMP data. The aim of the HMP is to characterize microbial communities found at multiple human body sites and to look for correlations between changes in the microbiome and human health. More information can be found in the menus above and on the NIH Common Fund site. All software, online resources and standard operating protocols used in, or developed as part of the HMP, will be accessible here as they become available.

If you have a protocol or software package that you would like to post on this site, or would like more information on the currently available content, please contact us via the feedback form.”

Website
http://www.hmpdacc.org/reference_genomes/reference_genomes.php

100+ Interesting Data Sets for Statistics

aggregator Free

This site provides over 100 data sets on various interesting topics.

Website
http://rs.io/100-interesting-data-sets-for-statistics/

10k US Adult Faces Database

Scientific Health Free

This database contains more than 10,000 natural face photographs and measures for over 2000 of the faces, predicting the memorability of faces using computer vision features.

Website
http://wilmabainbridge.com/facememorability2.html

3.5B Web Pages from CommonCrawl 2012

Scientific aggregator Free

This page provides a large collection of web pages and hyperlinks for public download, similar to the data held by companies such as Google, Yahoo, and Microsoft. The graph has been extracted from the Common Crawl 2012 web corpus and has 3.5 billion web pages and 128 billion hyperlinks.

Website
http://www.bigdatanews.com/profiles/blogs/big-data-set-3-5-billion-web-pages-made-available-for-all-of-us

53.5B Web clicks of 100K users in Indiana Univ.

Scientific aggregator Free

This dataset is intended to encourage and support the study of the structure and dynamics of Web traffic networks. It provides about 53.5 billion HTTP requests from users at Indiana University.

Website
http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/

A list of cities and countries contributed by community

Government Free

Website
https://github.com/caesar0301/awesome-public-datasets/blob/master/Government.rst

Academic Torrents of data sharing from UMB

aggregator Free

This service is designed to facilitate storage of all the data used in research, including datasets as well as publications. The journal focuses on its core mission of providing world-class research, and this technology allows a group of editors to seed their own peer-reviewed published articles with just a torrent client.

Website
http://academictorrents.com/

ACLED (Armed Conflict Location & Event Data Project)

aggregator Free

This is a project that collates data on political violence in developing states, in regions such as Africa and Asia. ACLED (Armed Conflict Location and Event Data Project) aims to supplement the study of civil war with models and periods of instability, public protest and regime breakdown.

Website
http://www.acleddata.com/

Actuaries Climate Index

Scientific Free

The Actuaries Climate Index (ACI) is an educational weather and climate monitoring tool designed to inform actuaries and public policymakers of the impact of a changing climate on the United States and Canada. It is available for the USA and Canada and more than 10 of their subregions.

Website
http://actuariesclimateindex.org/data/

Affective Image Classification

Free

In order to facilitate the study of age and gender recognition, we provide a data set and benchmark of face photos. The data included in this collection is intended to be as true as possible to the challenges of real-world imaging conditions. In particular, it attempts to capture all the variations in appearance, noise, pose, lighting and more, that can be expected of images taken without careful preparation or posing.

Website
http://www.openu.ac.il/home/hassner/Adience/data.html

Airlines OD Data 1987-2008

aggregator Free

Airlines OD Data is a large dataset that consists of more than 100 million records of flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008. Brief introductions to useful tools are provided: Linux command-line tools and SQLite, a simple SQL database.

Website
http://stat-computing.org/dataexpo/2009/the-data.html
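
As a rough illustration of the SQLite approach mentioned above, the following Python sketch loads a few CSV rows into an in-memory SQLite database and computes the mean arrival delay per carrier. The sample rows are made up, and the column names (`Year`, `Month`, `UniqueCarrier`, `ArrDelay`) and table layout are assumptions for this example; check them against the downloaded files before reusing the code.

```python
import csv
import io
import sqlite3
from typing import List, Tuple

# Made-up sample rows standing in for one of the yearly CSV files.
SAMPLE_CSV = """Year,Month,UniqueCarrier,ArrDelay
2008,1,WN,10
2008,1,WN,-4
2008,1,AA,30
"""


def load_flights(conn: sqlite3.Connection, csv_text: str) -> None:
    """Create a `flights` table and fill it from CSV text."""
    conn.execute(
        "CREATE TABLE flights "
        "(year INTEGER, month INTEGER, carrier TEXT, arr_delay REAL)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (int(r["Year"]), int(r["Month"]), r["UniqueCarrier"], float(r["ArrDelay"]))
        for r in reader
    ]
    conn.executemany("INSERT INTO flights VALUES (?, ?, ?, ?)", rows)


def mean_delay_by_carrier(conn: sqlite3.Connection) -> List[Tuple[str, float]]:
    """Return (carrier, mean arrival delay) pairs, sorted by carrier."""
    cursor = conn.execute(
        "SELECT carrier, AVG(arr_delay) FROM flights "
        "GROUP BY carrier ORDER BY carrier"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_flights(connection, SAMPLE_CSV)
    print(mean_delay_by_carrier(connection))
```

For the full dataset you would likely write to a database file on disk instead of `:memory:`, read each CSV from disk in batches, and handle missing delay values before converting them to numbers.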

Allen Institute Datasets

Scientific Free

The Allen Institute works to answer important questions in neuroscience. Through public releases of new data, knowledge and tools, it supports research worldwide.

Website
http://www.brain-map.org/

AWS Amazon

Mathematics

Public Datasets on AWS provides a central location of public datasets that can be quickly and easily processed with elastic computing resources.

Website
https://aws.amazon.com/datasets/

American Economic Association (AEA)

Economic Free

The American Economic Association’s (AEA) mission is the dissemination of economics data, which is available online to professionals, teachers, students and the general public without any subscription.

Website
https://www.aeaweb.org/resources/data

AMiner Citation Network Dataset

Free

The AMiner Citation Network Dataset’s information is taken from DBLP, ACM, and other sources, and is meant for research purposes only. The first version contains over 600,000 papers which include their title, abstract, authors, year, venue, etc., and it also has more than 600,000 citations.

Website
http://aminer.org/citation

AMPds

aggregator Free

The AMPds dataset is designed to help eco-feedback researchers and load disaggregation/NILM researchers to test their prototypes, algorithms, systems and models.

Website
http://ampds.org/

OpenDataMonitor

aggregator Free

OpenDataMonitor is an overview of the many European open datasets available today. People can use this platform and its new technologies to make better use of the existing data catalogues.

Website
http://opendatamonitor.eu/frontend/web/index.php?r=dashboard%2Findex

Ancestry.com Forum Dataset

Free

The Ancestry.com Forum Dataset contains data gathered from the online forum boards.ancestry.com, from July 2010. This message board has had active participation for over ten years, and holds more than 22 million messages by over 3.5 million authors. The dataset was created to support research on information retrieval, language technologies, and social network analysis.

Website
http://www.cs.cmu.edu/~jelsas/data/ancestry.com/

Animals with Attributes

Free

This dataset consists of over 30,000 images of 50 animal classes, with six pre-extracted feature representations for each image. The platform includes benchmark transfer-learning algorithms, in particular attribute-based classification.

Website
http://attributes.kyb.tuebingen.mpg.de/

AQUASTAT – Global water resources and uses

Health Social Media Free

By 2050, the world’s highest rates of population growth are expected to occur in areas that have deficiencies in the agriculture sector. The AQUASTAT Main Database is provided free of charge to all users, and allows researchers to benefit from information gathered worldwide by the Food and Agriculture Organization of the United Nations (FAO).

Website
http://www.fao.org/nr/water/aquastat/data/query/index.html?lang=en

ArcGIS Open Data

World aggregator Free

ArcGIS Open Data uses the ArcGIS Online groups you already have, in order to integrate with other open data platforms, to identify open data sources, and to allow you to quickly publish or remove your open data. Your open datasets automatically sync with the latest version of your sources.

Website
http://opendata.arcgis.com/

Archive-it (from Internet Archive)

World aggregator Free

In 1996, Internet Archive was created as a non-profit digital library and is the world’s largest public web archive. It has focused on ensuring all collections are freely and publicly accessible at www.archive.org. By creating this digital library to permanently store digital content from all over the world, the data within it is available to everyone who wishes to view it.

Website
https://www.archive-it.org/explore?show=Collections

Climate Data Online – Australian Weather

Government Scientific Free

Climate Data Online is a platform on The Bureau of Meteorology’s agency website for tracking Australia’s national weather, climate and water. It allows for the use of the Text or Map search to view daily and monthly statistics, historical weather observations, rainfall, temperature and solar tables, graphs and data. In addition, the Daily Weather Observations tool is a part of the Climate Data Online platform.

Website
http://www.bom.gov.au/climate/dwo/

Aviation Weather Center

Government Free

The Aviation Weather Center delivers consistent, timely and accurate weather information for the world airspace system. We are a team of highly skilled people dedicated to working with customers and partners to enhance safe and efficient flight.

Website
https://aviationweather.gov/adds/dataserver

Basketball (NBA/NCAA/Euro) Player Database and Statistics

Entertainment Free

DraftExpress LLC is a professional scouting, statistics and analytics service that has been featured on several US sports and media outlets. The goal is to expand their reach worldwide, so that the Draft Express tools can provide comprehensive and trustworthy data to scouting professionals, fans and media.

Website
http://www.draftexpress.com/stats.php

Bay Area Bike Share Data

Entertainment Free

The Bay Area Bike Share’s trip data is based on the use of the company’s bike sharing system. The data combines the travel data of 700 bikes and 70 stations across the area, including San Francisco and San Jose. This data set is great for anyone interested in these stats, and designers and developers, too.

Website
http://www.bayareabikeshare.com/open-data

Betfair Historical Exchange Data

Economic aggregator Free

Users can replay Betfair markets in real-time after the market has been settled, because fully time-stamped historical Betfair price data is now available to Betfair users. The data is collected using the existing live Betfair API and it is a proper representation of what Betfair users already experience on the website while viewing the Betfair market.

Website
http://data.betfair.com/

British Oceanographic Data Center (BODC) – Marine data of ~22K vars

Scientific aggregator Free

Publicly accessible marine data is collected by using a variety of instruments and samplers, and the data is collated from many resources. The British Oceanographic Data Center (BODC) maintains databanks of almost 22,000 different oceanographic variables, including physical, chemical, biological and geophysical data. BODC makes data available under a licence agreement.

Website
https://www.bodc.ac.uk/data/

Brain Catalogue

Scientific Free

The Brain Catalogue is a data set for gathering and disseminating information regarding the diversity of the vertebrate brain. Its goal is to make high-quality data open and freely available to everyone.

Website
https://braincatalogue.org/

Brainomics

Health Free

Project Brainomics combines questionnaire data, genetics and imaging, and the Brainomics/Localizer online database serves a subset of the Functional Localizer dataset.

Website
http://brainomics.cea.fr/localizer

Brazilian Weather – Historical data (In Portuguese)

World Free

SINDA is the Mission Center responsible for processing data collected remotely by Data Collection Platforms (PCDs) in Brazil. A network of PCDs and Receiving Stations installed in Brazil forms the Brazilian Data Collection System, which relies on a wide array of satellites that carry the DCS (data collection transponder) system on board. SINDA manages the function, storage and dissemination of data to users.

Website
http://sinda.crn2.inpe.br/PCD/SITE/novo/site/

Adience Unfiltered faces for gender and age classification

aggregator Freemium

Adience Unfiltered provides a data set and benchmark of face photos. In this collection the data included is intended to be as true as possible to the challenges of real-world imaging conditions, especially images taken without careful preparation or posing.

Website
http://www.openu.ac.il/home/hassner/Adience/data.html

Center for Applied Internet Data Analysis (CAIDA) Internet Datasets

Scientific aggregator Freemium

CAIDA aggregates multiple types of data at geographically and topologically diverse locations, and makes this data available to the research community while keeping the anonymity of the donors and companies intact. This is an overview of both the public and the private datasets that are available.

Website
http://www.caida.org/data/overview/

Cambridge, MA, US, GIS data on GitHub

Government Free

Cambridge GIS has posted many of its data sets on this official City of Cambridge site, as the city is dedicated to giving developers and the public access to its building-data repositories.

Website
http://cambridgegis.github.io/gisdata.html

Canadian Legal Information Institute (CanLII)

Legal Free

CanLII provides free access to legal information collected from all Canadian jurisdictions. It gives access to court regulations, judgments, statutes, and tribunal decisions. In addition, CanLII Connects is a database of daily case commentary and case summaries presented by lawyers and other legal analysis professionals.

Website
https://www.canlii.org/en/index.php

Canadian Meteorological Centre

Scientific Free

This GRIB2 format database has free data, made available by the Meteorological Service of Canada. The database contributes information that is used by academics, private sector meteorologists, and the general public. It contains data from analysis systems and the Canadian Meteorological Centre’s Numerical Weather Prediction (NWP) models.

Website
http://weather.gc.ca/grib/index_e.html

CBOE Futures Exchange (CFE)

Economic Free

The CBOE Futures Exchange (CFE) is an all-electronic, open access market model. It has dedicated market makers and market participants providing liquidity, and the Data Service is a high-availability, low latency streaming data feed. CFE’s CSV files are typically updated daily on the evening of the same trading day, or the following business morning.

Website
http://cfe.cboe.com/Data/

Center for Systemic Peace Datasets – Conflict Trends, Polities, State Fragility, etc

World Free

CSP research focuses on finding true possibilities for a global systemic peace. This dataset tracks conditions and trends in societal-system performance at the global, regional and state levels, and includes data on sustainable human/physical development, governance, and social conflict.

Website
http://www.systemicpeace.org/

CERN Open Data Portal

Scientific Free

The CERN Open Data portal allows access to research activities performed at CERN, and it includes the necessary software and documentation required in order to understand and analyse the shared data. The products are shared under open licenses and they are citable.

Website
http://opendata.cern.ch/

Challenges in Machine Learning

Scientific Free

Machine Learning is the science of building hardware or software that can achieve tasks by learning from examples. Numerous challenges are listed along with website information and end results from the challenges.

Website
http://www.chalearn.org/

Chars74K dataset, Character Recognition in Natural Images

Free

This is a dataset for character recognition, a classic pattern recognition problem, in Latin script. Recognizing characters in images with common fonts and a uniform background is simple, but images taken with cameras and other devices are considerably more difficult, as seen in this dataset.

Website
http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/

Cheng-Caverlee-Lee September 2009 – January 2010 Twitter Scrape

Social Media Free

This dataset is a collection of messages gathered from September 2009 to January 2010. The data is made up of scraped public Twitter updates that were used in an academic project to study geolocation data in relation to Twitter usage.

Website
https://archive.org/details/twitter_cikm_2010

Climate Data from University of East Anglia (UEA)

Government Free

HadCRUT4 is a global temperature dataset developed by the Climatic Research Unit (University of East Anglia). It provides gridded temperature anomalies across the world, including separate averages for the hemispheres and the globe. CRUTEM4 is the land component and HadSST3 is the ocean component of this overall dataset, and they are expected to be updated monthly.

Website
https://crudata.uea.ac.uk/cru/data/temperature/#datter and ftp://ftp.cmdl.noaa.gov/

CLiPS Stylometry Investigation Corpus

aggregator Free

The CSI corpus is a corpus of student texts in two genres: reviews and essays. While other applications are possible, it is meant mainly for stylometric research. The meta-data includes various details about the authors and their documents.

Website
http://www.clips.uantwerpen.be/datasets/csi-corpus

ClueWeb09 – 1B web pages

Scientific Free

The ClueWeb09 dataset consists of about 1 billion web pages in ten different languages. It uses data from January and February 2009, and was created to support research on information retrieval and related human language technologies.

Website
http://lemurproject.org/clueweb09/

ClueWeb09 FACC

aggregator Free

Freebase Annotations of the ClueWeb Corpora, v1. Researchers at Google automatically, and therefore imperfectly, annotated English-language Web pages from the ClueWeb09 and ClueWeb12 corpora. Still, the annotations are of reasonably high quality, and for each entity they recognized with high confidence, they provide two confidence levels, its Freebase identifier (mid), and the beginning/end byte offsets.

Website
http://lemurproject.org/clueweb09/FACC1/

ClueWeb12 – 733M web pages

aggregator Free

The ClueWeb12 dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 733,019,372 English web pages, collected between February 10, 2012 and May 10, 2012. ClueWeb12 is a companion or successor to the ClueWeb09 web dataset. Distribution of ClueWeb12 began in January 2013.

Website
http://lemurproject.org/clueweb12/

Collaborative Research in Computational Neuroscience (CRCNS)

Scientific Health Free

The Collaborative Research in Computational Neuroscience (CRCNS) supports the integration of experimental and theoretical neuroscience research projects. These projects are collaborative and normally involve up to five senior investigators.

Website
http://crcns.org/data-sets

ClueWeb12 FACC1

aggregator Free

Freebase Annotations of the ClueWeb Corpora, v1 (FACC1). The ClueWeb12 dataset has over 733,000,000 English web pages, and it was developed to support research on information retrieval and related human language technologies. The information was collected between February 10, 2012 and May 10, 2012.

Website
http://lemurproject.org/clueweb12/FACC1/

CMU Enron Email of 150 users

aggregator Free

The CALO Project (A Cognitive Assistant that Learns and Organizes) prepared this dataset which contains a total of about 0.5M messages and data from about 150 users – mainly senior management of Enron. This information was posted on the internet during an investigation by the Federal Energy Regulatory Commission.

Website
http://www.cs.cmu.edu/~enron/

CMU JASA data archive

Mathematics Free

The Journal of the American Statistical Association maintains the JASA data archive which contains contributed datasets from its published articles.

Website
http://lib.stat.cmu.edu/jasadata/

College Scorecard Data

Government Free

The College Scorecard’s function is to increase transparency regarding college qualities, so that students can see how well the different schools can serve them, and so that others can see where the colleges need improvements.

Website
https://collegescorecard.ed.gov/data/

COMBED

aggregator Free

The COMBED data set comes with a loader that easily plugs into nilmtk, and it is the first energy-related data set from a commercial building where the data is sampled more than once every minute.

Website
http://combed.github.io/

CommonCrawl Web Data over 7 years

aggregator Free

The Common Crawl Foundation is a non-profit which strives to establish an open repository of web crawl data that is considered accessible and analyzable by all. Open access to web data that is cheap and easy will provide information that allows for greater innovation in many sectors.

Website
http://commoncrawl.org/the-data/get-started/

Complementary Collections

Scientific aggregator Free

The Complementary Collections list includes: Data Packaged Core Datasets; Database of Scientific Code Contributions; DataWrangling; Inside-r; OpenDataMonitor; Quora; RS.io; and StaTrek, among many others.

Website
https://github.com/caesar0301/awesome-public-datasets#complementary-collections; https://github.com/caesar0301/awesome-public-datasets#id32

Complex Networks

aggregator Free

The Complex Networks collection includes: AMiner Citation Network Dataset; CrossRef DOI URLs; DBLP Citation dataset; NBER Patent Citations; and Network Repository with Interactive Exploratory Analysis Tools, among others.

Website
https://github.com/caesar0301/awesome-public-datasets#complex-networks , https://github.com/caesar0301/awesome-public-datasets#id5

Computer Networks

aggregator Free

The Computer Networks collection includes: 3.5B Web Pages from CommonCrawl 2012; 53.5B Web clicks of 100K users in Indiana Univ; CAIDA Internet Datasets; ClueWeb09 – 1B web pages; ClueWeb12 – 733M web pages; and more.

Website
https://github.com/caesar0301/awesome-public-datasets#computer-networks , https://github.com/caesar0301/awesome-public-datasets#id6

Correlates of War Project

World Free

Key principles of Correlates of War (COW) include the free and timely public release of reliable data sets to the research community. COW seeks to collect, use and distribute accurate data about international relations.

Website
http://www.correlatesofwar.org/

CRAWDAD Wireless datasets from Dartmouth Univ.

aggregator Free

Community Resource for Archiving Wireless Data At Dartmouth (CRAWDAD) is a wireless network data resource for the research community. This archive stores wireless trace data from many locations, and their staff designs and improves tools for the collecting, analyzing and anonymizing of the data.

Website
https://crawdad.cs.dartmouth.edu/

Cricsheet Matches (cricket)

Entertainment Free

Cricsheet is a dataset for cricket. It has ball-by-ball data for all Indian Premier League seasons, Men’s and Women’s Test Matches, One-day Internationals, Twenty20 Internationals, and some other international T20s.

Website
http://cricsheet.org/

Criteo click-through data

aggregator Free

Criteo combines hundreds of billions of dollars of actual sales data with an extensive network of global publishers, so that it can understand digital user behavior and deliver pertinent, personalized ads that propel incremental sales.

Website
http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/

CrossRef DOI URLs

aggregator Free

This dataset includes the URLs of almost 50 million journal articles which originate from CrossRef’s OAI-PMH server.

Website
https://archive.org/details/doi-urls

CrowdANALYTIX dataX

Scientific Tech Freemium

This platform, CrowdANALYTIX, is designed for crowdsourcing and deploying AI, NLP & Machine Learning solutions. Optimized algorithms, which are built by a crowdsourcing community of over 15,000 data scientists, are utilized and sustained on dataX.ai.

Website
http://data.crowdanalytix.com/

Cryptome Conspiracy Theory Items

World aggregator Free

The Cryptome Archive keeps 102,600 files dating between June 1996 and January 8, 2017. There is growing censor-tamper-implant-bowdlerize-redact-tag-track of archives, torrents, drops, shares, wikis and disclosure sites, and Cryptome welcomes documents for publication that are otherwise forbidden by all governments. In particular, it encourages material on freedom of expression, privacy, cryptology, dual-use technologies, national security, intelligence, and secret governance (open, secret and classified documents), but the list is not limited to those.

Website
http://cryptome.org/

Crystallography Open Database

Scientific aggregator Freemium

Open Data principles have strong supporters in crystallography, who provide full access to crystal data for free on the internet; other essential crystallography databases, however, are available only with a paid subscription.

Website
http://www.crystallography.net/

D4D Challenge of Orange

aggregator Free

Data for Development (D4D) Senegal is an open innovation challenge on ICT Big Data. It was designed in 2013 for the purposes of societal development, as well as data on the hours of sunshine. Anonymous data is extracted from the mobile network in Senegal, and the Orange Group and Sonatel are making the data available to international research laboratories.

Website
http://www.d4d.orange.com/en/home

Data Challenges

aggregator Freemium

This Data Challenges dataset includes: Challenges in Machine Learning; CrowdANALYTIX dataX; D4D Challenge of Orange; DrivenData Competitions for Social Good; and ICWSM Data Challenge (since 2009), etc.

Website
https://github.com/caesar0301/awesome-public-datasets#data-challenges , https://github.com/caesar0301/awesome-public-datasets#id8

Data Packaged Core Datasets

aggregator Free

The Data Packaged Core Datasets are important, commonly used datasets. They are available in open form as easy-to-use, high-quality data packages.

Website
https://github.com/datasets/

Data360

Entertainment Free

Data360’s goal is to tell compelling and data-driven stories about important events and subjects. Data360 does reserve the right to adjust editorial permissions as it sees fit, in support of their purpose and principles.

Website
http://www.data360.org/index.aspx

Databanks International Cross National Time Series Data Archive

Scientific Economic Free

The Cross-National Time-Series Data Archive is a data set for over 200 countries. It contains annual data from the year 1815 and onwards. Its 196 variables are used by media, academia, finance and government agencies.

Website
http://www.cntsdata.com/

Database of Scientific Code Contributions

Scientific Free

This dataset is a collection of open source, web-based tools designed to help you do better science.

Website
https://mozillascience.org/collaborate

Datacards

World Free

DataCards is a structured collection tool that tracks irregular warfare and socio-cultural topics to support assessment, analysis, modeling, and other applications. The tool indexes relevant data sources, each with a summary description and an evaluation of its content, and divides them into portals according to the Area of Operations (AO) of each geographic COCOM.

Website
http://datacards.org/

Datahub.io

aggregator Free

The Datahub provides free access to many of CKAN’s (an open-source DMS) central features. You can create and manage groups of datasets, search for data, and get updates from datasets and groups. It’s accessible by the web interface or the CKAN API.

Website
https://datahub.io/dataset

Dataport

aggregator Freemium

Dataport offers a mix of free and subscription tools. These tools are great for utility analysts, university researchers and research institutions. Dataport’s research tools allow you to analyze, visualize and create custom reports from a vast database of original and curated data.

Website
https://dataport.pecanstreet.org/

DBLP Citation dataset

aggregator Free

The Proximity DBLP database presents information on computer science publications listed in the DBLP Computer Science Bibliography. The data in this dataset were derived from a snapshot of the bibliography as of April 12, 2006. The Proximity DBLP dataset maps each entry in the original DBLP data to one of six types of objects representing different types of publications. It includes links from publications to their authors and editors and from papers to the journal, proceedings, or book in which they appear.

Website
https://kdl.cs.umass.edu/display/public/DBLP

DBpedia – 4.58M things with 583M facts

World Free

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in some new interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.

Website
http://wiki.dbpedia.org/Datasets

Delve Datasets for classification and regression (Univ. of Toronto)

Tech Free

Each of the Delve datasets and families has a brief overview page, and many of them have detailed documentation. The datasets are categorized: primarily assessment, development or historical. Each category also distinguishes the datasets as regression or classification, depending on how their prototasks have been created.

Website
http://www.cs.toronto.edu/~delve/data/datasets.html

DIMACS Road Networks Collection

Scientific Mathematics Free

Algorithms for “shortest path” problems have been studied since the 1950s and remain an active area of research, because these problems are among the most fundamental combinatorial optimization problems and have many applications. One DIMACS goal is to make it possible for current researchers to compare their codes with each other.

Website
http://www.dis.uniroma1.it/challenge9/download.shtml
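
To make the "shortest path" problem mentioned above concrete, here is a minimal sketch of Dijkstra's algorithm, one classic approach for graphs with non-negative edge weights. The toy graph and the function name are illustrative assumptions; this is not the challenge's reference code and it does not read the DIMACS file format.

```python
import heapq


def dijkstra(graph, source):
    """Return shortest distances from source to every reachable node.

    graph maps each node to a list of (neighbor, weight) pairs.
    All weights are assumed to be non-negative.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry; a shorter path was already found
        for neighbor, weight in graph.get(node, []):
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


# Hypothetical four-node road network with made-up edge lengths.
toy_graph = {
    "A": [("B", 4), ("C", 1)],
    "B": [("D", 3)],
    "C": [("B", 2), ("D", 7)],
    "D": [],
}
distances = dijkstra(toy_graph, "A")
```

On road networks with millions of nodes, researchers typically compare faster variants against a baseline like this one, which is the kind of comparison the collection is meant to support.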

DRED

Scientific Mathematics Free

The DRED dataset is made available to the research community. It is meant for testing the performance of energy disaggregation algorithms, deriving appliance usage behavior, and analyzing demand response algorithms.

Website
http://www.st.ewi.tudelft.nl/~akshay/dred/

Climate/Weather Datasets

aggregator Free

This set of datasets on GitHub for Climate/Weather sources includes: Actuaries Climate Index, Australian Weather, Canadian Meteorological Centre, Climate Data from UEA, European Climate Assessment & Dataset, and many more.

Website
https://github.com/caesar0301/awesome-public-datasets#climateweather

DrivenData Competitions for Social Good

World Scientific Freemium

DrivenData provides data science to organizations that are using it to solve challenges, for positive social impact. DrivenData then runs online modeling competitions for data scientists, to develop the best models to solve them.

Website
http://www.drivendata.org/

Earth Models

Scientific Free

This dataset includes observational and virtual data, as well as processing and simulation software. This data comes mainly from geodesy, tectonics, geodynamics and seismology.

Website
http://www.earthmodels.org/

Earth Science

Scientific Free

The Earth Science collection includes: AQUASTAT – Global water resources and uses; BODC – marine data of ~22K vars; Earth Models; EOSDIS – NASA’s earth observing system data; and Marinexplore – Open Oceanographic Data.

Website
https://github.com/caesar0301/awesome-public-datasets#earth-science , https://github.com/caesar0301/awesome-public-datasets#id9

eBay Online Auctions (2012)

aggregator Free

Website
http://www.modelingonlineauctions.com/datasets

ECO

Scientific Free

The ECO data set is a comprehensive data set for non-intrusive load monitoring and occupancy detection research which was collected over a period of 8 months from 6 Swiss households.

Website
http://www.vs.inf.ethz.ch/res/show.html?what=eco-data

EconData from UMD

Economic Free

Economic data has been made publicly available through the EconData site, and it has been put into a standard, easy-to-use form for personal computers. These dataset series include current business indicators, national income and product accounts (NIPA), labor statistics, price indices, and industrial production.

Website
http://inforumweb.umd.edu/econdata/econdata.html

Economic Freedom of the World Data

World Economic Free

Fraser Institute’s Economic Freedom of North America index (EFNA) has illustrated that economic freedom is one of the main drivers of prosperity. Use their dataset and filters to research worldwide economic stats and other details.

Website
http://www.freetheworld.com/datasets_efw.html

Economics

Economic Free

The Economics collection includes: American Economic Association (AEA); EconData from UMD; Economic Freedom of the World Data; Historical MacroEconomic Statistics; and International Trade Statistics.

Website
https://github.com/caesar0301/awesome-public-datasets#economics , https://github.com/caesar0301/awesome-public-datasets#id10

EDRM Enron EMail of 151 users, hosted on S3

aggregator Legal Paid

The Enron email data was publicly released as part of FERC’s Western Energy Markets investigation. It was converted to industry standard formats by EDRM. The data set consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. The emails are provided in several formats: Microsoft PST, IETF MIME, and EDRM XML.

Website
https://aws.amazon.com/datasets/enron-email-data/

Education

aggregator Free

This Education dataset includes the College Scorecard Data and the Student Data from Free Code Camp.

Website
https://github.com/caesar0301/awesome-public-datasets#education , https://github.com/caesar0301/awesome-public-datasets#id11

EIA

Government Free

Data for utility plants are available from 1970, and data for non-utility plants from 1999. The EIA-906, EIA-920, EIA-923 and predecessor forms provide monthly and annual data, specifically on generation and fuel consumption at the power plant and prime mover level. In addition, a subset of plants, such as steam-electric plants of 10 MW and above, also provide data at the boiler level and generator level.

Website
http://www.eia.gov/electricity/data/eia923/

Energy

Economic Free

The Energy collection includes: AMPds; BLUEd; COMBED; Dataport; and DRED.

Website
https://github.com/caesar0301/awesome-public-datasets#energy; https://github.com/caesar0301/awesome-public-datasets#id12

EOSDIS – NASA’s earth observing system data

Scientific Free

The Earth Observing System Data and Information System (EOSDIS) is a key core capability in NASA’s Earth Science Data Systems Program. It provides end-to-end capabilities for managing NASA’s Earth science data from various sources – satellites, aircraft, field measurements, and various other programs.

Website
http://sedac.ciesin.columbia.edu/data/sets/browse

Ergast Formula 1, from 1950 up to date (API)

aggregator Free

The Ergast Developer API is an experimental web service which provides a historical … The API provides data for the Formula One series, from the beginning of the world championships in 1950. … The number of results that are returned can be controlled using a limit query parameter, up to a maximum value of 1000.

Website
http://ergast.com/mrd/db
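
The limit query parameter described above can be illustrated with a short sketch that builds paged request URLs and clamps limit to the stated maximum of 1000. The base URL, the path layout and the offset parameter are hypothetical placeholders chosen for illustration; consult the API documentation for the real endpoints and paging parameters.

```python
from urllib.parse import urlencode

MAX_LIMIT = 1000  # maximum value of `limit` stated in the description

# Hypothetical base URL and path layout, for illustration only.
BASE_URL = "http://example.org/api/f1"


def build_results_url(season, limit=30, offset=0):
    """Build a results URL, clamping limit to the range 1..MAX_LIMIT."""
    limit = max(1, min(int(limit), MAX_LIMIT))
    offset = max(0, int(offset))  # `offset` is an assumed paging parameter
    query = urlencode({"limit": limit, "offset": offset})
    return f"{BASE_URL}/{season}/results.json?{query}"


def page_urls(season, total, page_size=MAX_LIMIT):
    """Return one URL per page needed to cover `total` results."""
    page_size = max(1, min(int(page_size), MAX_LIMIT))
    return [
        build_results_url(season, limit=page_size, offset=start)
        for start in range(0, total, page_size)
    ]


clamped_url = build_results_url(1950, limit=5000)
pages = page_urls(1950, total=2500)
```

Clamping on the client side keeps a request within the documented maximum instead of relying on the server to reject or truncate it.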

European Climate Assessment & Dataset

World aggregator Free

The European Climate Assessment and Dataset (ECA&D) is a database of daily meteorological station observations across Europe and is gradually being extended to countries in the Middle East and North Africa. ECA&D has attained the status of Regional Climate Centre for high-resolution observation data in World Meteorological Organization Region VI (Europe and the Middle East).

Website
http://eca.knmi.nl/

European Social Survey

Free

The European Social Survey runs a programme of research to support and enhance the methodology that underpins the high standards it pursues in every aspect of survey design, data collection and archiving.

Website
http://www.europeansocialsurvey.org/data/

Face Recognition Benchmark

Tech Free

A face recognition system is a computer application capable of identifying or verifying a person from a digital image or a video frame from a video source. One of the ways to do this is by comparing selected facial features from the image and a face database.

Website
http://www.face-rec.org/databases/

Factual Global Location Data

aggregator Free

Data is increasingly critical to driving innovation and no one should be at a data disadvantage. We at Factual believe that data should be accessible to every developer, entrepreneur, business, or organization – anyone who needs it to build a better app, provide a better search result, make smarter software – anyone who needs data to make a better decision or help others make better decisions.

Website
https://www.factual.com/

FBI Hate Crime 2013 – aggregated data

aggregator Free

A hate crime (also known as a bias-motivated crime) is a prejudice-motivated crime, which occurs when a perpetrator targets a victim because of his or her membership (or perceived membership) in a certain social group.

Website
https://github.com/emorisse/FBI-Hate-Crime-Statistics/tree/master/2013

Finance

Free

The Finance collection includes: CBOE Futures Exchange; Google Finance; Google Trends; NASDAQ; and OANDA.

Website
https://github.com/caesar0301/awesome-public-datasets#finance; https://github.com/caesar0301/awesome-public-datasets#id13

Flickr Personal Taxonomies

Social Media Free

In addition to allowing users to organize content by tagging it with descriptive labels, several social media sites also allow users to organize content hierarchically within personal taxonomies. Delicious, for example, lets users group related tags into bundles. Flickr lets users group related photos into sets and related sets within collections.

Website
http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html

Football/Soccer resources (data and APIs)

Entertainment Paid

There are three main ways to get data. You can parse/scrape it from a hobbyist project/website, you can pay for it or you can try to collect it yourself.

Website
http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/

Foursquare from UMN/Sarwat (2013)

aggregator Free

This data set contains 2,153,471 users, 1,143,092 venues, 1,021,970 check-ins, 27,098,490 social connections, and 2,809,581 ratings that users assigned to venues, all extracted from the Foursquare application through the public API. All user information has been anonymized, including users’ geolocations. Each user is represented by an id and a geospatial location, and the same holds for venues. The data are contained in five files: users.dat, venues.dat, checkins.dat, socialgraph.dat, and ratings.dat.

Website
https://archive.org/details/201309_foursquare_dataset_umn

Fragile States Index

World Free

We are pleased to present the twelfth annual Fragile States Index. The FSI focuses on the indicators of risk and is based on thousands of articles and reports that are processed by our CAST Software from electronically available sources. We encourage others to utilize the Fragile States Index to develop ideas for promoting greater stability worldwide. We hope the Index will spur conversations, encourage debate, and most of all help guide strategies for sustainable security.

Website
http://fsi.fundforpeace.org/data

Freebase.com of people, places, and things

aggregator Free

Freebase is an open database of the world’s information. It is built by the community and for the community, free for anyone to query, contribute to, build applications on top of, or integrate into their websites.

Website
http://www.freebase.com/

Gapminder World demographic databases

World Free

Gapminder is an independent Swedish foundation with no political, religious or economic affiliations. Gapminder is a fact tank, not a think tank. Gapminder fights devastating misconceptions about global development. Gapminder produces free teaching resources making the world understandable based on reliable statistics. Gapminder promotes a fact-based worldview everyone can understand. Gapminder collaborates with universities, UN, public agencies and non-governmental organizations. All Gapminder activities are governed by the board. We do not award grants. Gapminder Foundation is registered at Stockholm County Administration Board.

Website
http://www.gapminder.org/data/

GDELT Global Events Database

World Free

GDELT is the largest, most comprehensive, and highest resolution open database of human society ever created. Creating a platform that monitors the world’s news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages, every moment of every day and that stretches back to January 1, 1979 through present day, with daily updates, required an unprecedented array of technical and methodological innovations, partnerships, and whole new mindsets to bring this all together and make it a reality.

Website
http://gdeltproject.org/data.html

General Social Survey (GSS) since 1972

World Scientific Free

The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

Website
http://gss.norc.org/

Geo Spatial Data from ASU

Scientific Free

Geospatial analysis, or just spatial analysis, is an approach to applying statistical analysis and other analytic techniques to data which has a geographical or spatial aspect. Such analysis would typically employ software capable of rendering maps, processing spatial data and applying analytical methods to terrestrial or geographic datasets, including the use of geographic information systems and geomatics.

Website
http://geodacenter.asu.edu/datalist/

Geo Wiki Project – Citizen-driven Environmental Monitoring

Scientific Free

The Geo-Wiki Project is a global network of volunteers who wish to help improve the quality of global land-cover maps. Because large differences occur between existing global land-cover maps, current ecosystem and land-use science lacks crucial accurate data (for example, to determine the potential of additional agricultural land available to grow crops in Africa).

Website
http://geo-wiki.org/

GeoFabrik – OSM data extracted to a variety of formats and areas

aggregator Free

The OpenStreetMap (OSM) project was founded in the United Kingdom in 2004 and is aimed at creating a free, world-wide geographic data set. OpenStreetMap wants to be for geodata what Wikipedia is for encyclopedic knowledge. The focus is mainly on transport infrastructure (streets, paths, railways, rivers), but OpenStreetMap also collects a multitude of points of interest, buildings, natural features and landuse information, as well as coastlines and administrative boundaries.

Website
http://download.geofabrik.de/

GeoLife GPS Trajectory from Microsoft Research

aggregator Free

This GPS trajectory dataset was collected in the (Microsoft Research Asia) Geolife project by 182 users over a period of more than five years (from April 2007 to August 2012). A GPS trajectory in this dataset is represented by a sequence of time-stamped points, each of which contains latitude, longitude and altitude. The dataset contains 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours. These trajectories were recorded by different GPS loggers and GPS phones, and have a variety of sampling rates. 91 percent of the trajectories are logged in a dense representation, e.g. every 1~5 seconds or every 5~10 meters per point.

Website
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/
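
The description above treats a trajectory as a sequence of time-stamped latitude/longitude/altitude points and quotes a total distance. As a rough sketch of how such a distance could be computed, the Python below sums great-circle (haversine) distances between consecutive points. The tuple layout (timestamp, lat, lon, alt) and the function names are assumptions made for illustration, not the dataset's actual file format.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; altitude is ignored in this sketch


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) pairs in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trajectory_length_km(points):
    """Sum the distances between consecutive points.

    `points` is a list of (timestamp, lat, lon, alt) tuples; this layout is
    hypothetical and chosen only for the example.
    """
    return sum(
        haversine_km(p[1], p[2], q[1], q[2])
        for p, q in zip(points, points[1:])
    )


# Three made-up points along the equator, one degree of longitude apart.
demo = [(0, 0.0, 0.0, 0.0), (5, 0.0, 1.0, 0.0), (10, 0.0, 2.0, 0.0)]
demo_length = trajectory_length_km(demo)
```

Summing per-segment distances like this is sensitive to GPS noise at dense sampling rates, so a real analysis would probably filter or smooth the points first.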

GeoNames Worldwide

aggregator Free

The GeoNames database contains over 10,000,000 geographical names corresponding to over 7,500,000 unique features. All features are categorized into one of nine feature classes and further subcategorized into one of 645 feature codes. Beyond names of places in various languages, data stored include latitude, longitude, elevation, population, administrative subdivision and postal codes.

Website
http://www.geonames.org/

German Social Survey

aggregator Free

The German General Social Survey is a national data generation program in Germany, which is similar to the American General Social Survey (GSS). Its mission is to collect and disseminate high quality statistical surveys on attitudes, behavior, and social structure in Germany.

Website
http://www.gesis.org/en/home/

GIS

aggregator Free

The GIS category includes: ArcGIS Open Data portal; Cambridge, MA, US, GIS data on GitHub; Factual Global Location Data; Geo Spatial Data from ASU; Geo Wiki Project – Citizen-driven Environmental Monitoring.

Website
https://github.com/caesar0301/awesome-public-datasets#gis; https://github.com/caesar0301/awesome-public-datasets#id14

GitHub Collaboration Archive

World aggregator Free

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

Website
https://www.githubarchive.org/

Global Administrative Areas Database (GADM)

World aggregator Free

GADM is a spatial database of the location of the world’s administrative areas (or administrative boundaries) for use in GIS and similar software. Administrative areas in this database are countries and lower level subdivisions such as provinces, departments, bibhag, bundeslander, daerah istimewa, fivondronana, krong, landsvæðun, opština, sous-préfectures, counties, and thana.

Website
http://www.gadm.org/

Global Climate Data Since 1929

World Free

Climate information for every country in the world, with historical data that in some cases dates back to 1929. You can look up past conditions at any of the more than 9,000 stations that have information, including annual averages, monthly averages and extended information for a single day.

Website
http://en.tutiempo.net/climate

Global Religious Futures Project

World Scientific Free

Pew Research Center is a nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping America and the world. We conduct public opinion polling, demographic research, content analysis and other data-driven social science research. We do not take policy positions.

Website
http://www.globalreligiousfutures.org/

Google Books Ngrams (2.2TB)

aggregator Paid

N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.

Website
https://aws.amazon.com/datasets/google-books-ngrams/
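
To make the sliding-window idea above concrete, here is a minimal Python sketch that emits every n-gram of a token list and counts repeats. The whitespace tokenization and the sample sentence are assumptions for illustration; the corpus itself was built with its own tokenizer and record format.

```python
from collections import Counter


def ngrams(tokens, n):
    """Slide a window of width n over the tokens and return each window as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Naive whitespace tokenization, used only for this example.
tokens = "the quick brown fox jumps over the lazy dog".split()

five_grams = ngrams(tokens, 5)            # nine tokens yield five 5-grams
bigram_counts = Counter(ngrams(tokens, 2))  # frequency of each 2-gram
```

A sequence shorter than n simply produces no n-grams, which is why the window count is len(tokens) - n + 1.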

Google Finance

Economic

Get real-time stock quotes & charts, financial news, currency conversions, or track your portfolio with Google Finance.

Website
https://www.google.com/finance

Google Trends

World

Google Trends is a public web facility of Google Inc., based on Google Search, that shows how often a particular search-term is entered relative to the total search-volume across various regions of the world, and in various languages. The horizontal axis of the main graph represents time (starting from 2004), and the vertical is how often a term is searched for relative to the total number of searches, globally. Below the main graph, popularity is broken down by countries, regions, cities and language. Note that what Google calls “language”, however, does not display the relative results of searches in different languages for the same term(s).

Website
http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0

Google Web 5gram (1TB, 2006)

aggregator Free

Web 1T 5-gram Version 1, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. This data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.

Website
https://catalog.ldc.upenn.edu/LDC2006T13

Government

Government aggregator Free

The Government category includes: OpenDataSoft’s list of 1,600 open data; Open Data for Africa; A list of cities and countries contributed by community.

Website
https://github.com/caesar0301/awesome-public-datasets#government; https://github.com/caesar0301/awesome-public-datasets#id15

Gutenberg eBooks List

Tech Free

An electronic book (or e-book) is a book publication made available in digital form, consisting of text, images, or both, readable on the flat-panel display of computers or other electronic devices. Although sometimes defined as “an electronic version of a printed book”, some e-books exist without a printed equivalent. Commercially produced and sold e-books are usually intended to be read on dedicated e-reader devices. However, almost any sophisticated computer device that features a controllable viewing screen can also be used to read e-books, including desktop computers, laptops, tablets and smartphones.

Website
http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs

Hansards text chunks of Canadian Parliament

Government aggregator Free

This release contains 1.3 million pairs of aligned text chunks (sentences or smaller … from the official records (Hansards) of the 36th Canadian Parliament.

Website
http://www.isi.edu/natural-language/download/hansard/

Hard Drive Failure Rates

Scientific Free

The 4TB Seagate drives are our workhorse drives today and their 2.8% annualized failure rate is more than acceptable for us. Their low failure rate roughly translates to an average of one drive failure per Storage Pod per year.

Website
https://www.backblaze.com/hard-drive-test-data.html
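
The arithmetic behind "a 2.8% annualized failure rate is roughly one drive failure per Storage Pod per year" can be sketched as follows. Two things here are assumptions not stated in the listing: the formula (failures divided by drive-years, a common way to define an annualized rate) and the figure of 45 drives per pod.

```python
def annualized_failure_rate(failures, drive_days):
    """Failures per drive-year, as a percentage (assumed definition)."""
    drive_years = drive_days / 365.0
    return failures / drive_years * 100.0


def expected_failures_per_year(afr_percent, drives):
    """Expected number of failures in one year for a group of drives."""
    return afr_percent / 100.0 * drives


DRIVES_PER_POD = 45  # assumption: pod capacity is not given in the description

# 28 failures over 365,000 drive-days (made-up numbers) is a 2.8% rate.
example_afr = annualized_failure_rate(28, 365_000)

# A 2.8% rate across one pod gives a little over one failure per year.
per_pod = expected_failures_per_year(2.8, DRIVES_PER_POD)
```

Under these assumptions the result is about 1.26 failures per pod per year, consistent with the "roughly one" quoted above.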

Harvard Dataverse Network of scientific data

Scientific aggregator Free

Dataverse is an open source web application to share, preserve, cite, explore, and analyze research data. It facilitates making data available to others, and allows you to replicate others’ work more easily. Researchers, data authors, publishers, data distributors, and affiliated institutions all receive academic credit and web visibility. Dataverse provides a robust infrastructure for data stewards to host data.

Website
https://dataverse.harvard.edu/

Healthcare

Health aggregator Free

The Healthcare category includes: EHDP Large Health Data Sets; Gapminder World demographic databases; Medicare Coverage Database (MCD), U.S.; Medicare Data Engine of medicare.gov Data; Medicare Data File.

Website
https://github.com/caesar0301/awesome-public-datasets#healthcare; https://github.com/caesar0301/awesome-public-datasets#id16

Heart Rate Time Series from MIT

Scientific Health Free

Heart rate is the speed of the heartbeat measured by the number of contractions of the heart per minute (bpm). The heart rate can vary according to the body’s physical needs, including the need to absorb oxygen and excrete carbon dioxide. It is usually equal or close to the pulse measured at any peripheral point. Activities that can provoke change include physical exercise, sleep, anxiety, stress, illness, and ingestion of drugs.

Website
http://ecg.mit.edu/time-series/

HES

Government Scientific Free

The Household Electricity Use Study monitored domestic electrical appliances in a total of 251 owner-occupier households across England over the period of April 2010 to April 2011. Twenty-six of these households were monitored for a full year, whilst the remaining 225 were monitored for the duration of one month on a rolling basis throughout the trial period.

Website
http://randd.defra.gov.uk/Default.aspx?Menu=Menu&Module=More&Location=None&ProjectID=17359&FromSearch=Y&Publisher=1&SearchText=EV0702&SortString=ProjectCode&SortOrder=Asc&Paging=10#Description

HFED

Scientific aggregator Free

HFED is a high-frequency EMI dataset containing traces taken from a signal analyser and a USRP. Our data processing and visualization script is open source and is accessible on GitHub.

Website
http://hfed.github.io/

High-Resolution Contact Networks from Wearable Sensors

aggregator Free

This data set contains the temporal network of contacts between individuals measured in an office building in France, from June 24 to July 3, 2013. This page provides a collection of datasets obtained through the SocioPatterns sensing platform.

Website
http://www.sociopatterns.org/datasets/

Historical MacroEconomic Statistics

aggregator Mathematics Free

Historical Statistics argues for trans-historical reformulations of the basic economic concepts of production, work and consumption. Important issues concern how to deal with violence, double counting of transaction costs, human capital formation, non-market activities and causation of final consumption. Production, work and consumption are defined as relations between events, the subject matter and the agent. Eight different definitions of GDP are presented.

Website
http://www.historicalstatistics.org/

Homeland Infrastructure Foundation-Level Data

Legal Free

HIFLD (Homeland Infrastructure Foundation-Level Data) provides National foundation-level geospatial data within the open public domain that can be useful to support community preparedness, resiliency, research, and more. The data is available for download as CSV, KML, Shapefile, and accessible via web services to support application development and data visualization.

Website
https://hifld-dhs-gii.opendata.arcgis.com/

Hubway Million Rides in MA

aggregator Freemium

Data geeks of all stripes! Students, professors, designers, artists, data nerds by profession and those who just do it for fun.
Visualizations, animations, maps, info graphics that tell us something new or illustrate the awesomeness of more than half a million bike trips in one year. Winning entries were both smart and beautiful, and included interactive data analysis tools, animations, artistic representations, and even a video game.

Website
http://hubwaydatachallenge.org/trip-history-data/

Human Connectome Project

Scientific aggregator Free

The HCP (Human Connectome Project) is mapping the human connectome as accurately as possible in a large number of normal adults and is making this data freely available to the scientific community using a powerful, user-friendly informatics platform.

Website
http://www.humanconnectome.org/data/

Humanitarian Data Exchange

aggregator Social Media Free

The Humanitarian Data Exchange. Find, share and use humanitarian data all in one place.

Website
https://data.hdx.rwlabs.org/

iAWE

Tech Free

Indian Dataset for Ambient Water and Energy.

Website
http://iawe.github.io/

ICOS PSP Benchmark

aggregator Free

The ICOS PSP benchmarks repository contains an adjustable real-world family of benchmarks suitable for testing the scalability of classification/regression methods. When we test a machine learning method we usually choose a test suite containing datasets with a broad set of characteristics, as we are interested in knowing how the learning method reacts to a variety of scenarios. The PSP field provides us with a whole family of real-world classification/regression problems that can be adjusted almost arbitrarily in terms of number of variables, number of classes, class balance, etc. Thus, these datasets are an ideal benchmark suite for data mining methods.

Website
http://ico2s.org/datasets/psp_benchmark.html

ICPSR (UMICH)

Scientific aggregator Social Media Free

ICPSR advances and expands social and behavioral research, acting as a global leader in data stewardship and providing rich data resources and responsive educational opportunities for present and future generations.
ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community.

Website
http://www.icpsr.umich.edu/icpsrweb/ICPSR/index.jsp

Image Processing

aggregator Free

This list of public data sources is collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. Other amazingly awesome lists can be found in the awesome-awesomeness and sindresorhus’s awesome list.

Website
http://icwsm.cs.umbc.edu/

ImageNet (in WordNet hierarchy)

aggregator Freemium

ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. Currently we have an average of over five hundred images per node. We hope ImageNet will become a useful resource for researchers, educators, students and all of you who share our passion for pictures.
On this page, you will find some useful information about the database, the ImageNet community, and the background of this project.

Website
http://www.image-net.org/

IMDb Database

Entertainment Social Media Freemium

This page describes various alternate ways to access IMDb locally by holding copies of the data directly on your system.

Website
http://www.imdb.com/interfaces

Indoor Scene Recognition

aggregator Free

This database contains 67 indoor categories and more than 15,000 images. The number of images varies across categories, but there are at least 100 images per category. All images are in jpg format. The images provided here are for research purposes only.

Website
http://web.mit.edu/torralba/www/indoor.html

Infochimps

aggregator Freemium

Infochimps Cloud is a suite of robust, scalable cloud services that make it faster and far less complex to develop and deploy enterprise Big Data applications. Whether you need real-time analytics on multi-source streaming data, a scalable NoSQL database or an elastic, cloud-based Hadoop cluster, Infochimps Cloud is your easiest step to Big Data.

Website
http://www.infochimps.com/

INFORM Index for Risk Management

Economic Free

INFORM is a global, open-source risk assessment for humanitarian crises and disasters. It can support decisions about prevention, preparedness and response. INFORM’s user-friendly interface allows policymakers to prioritize countries by multiple dimensions of risk and visualize disaster risk. The results of INFORM are also available for the past 5 years, so trends can be analyzed as well. It is a powerful tool that actors involved in disaster prevention, preparedness and response can use to collaborate, plan efficiently, and save lives.

Website
http://www.inform-index.org/Results/Global

Institute for Demographic Studies

Scientific aggregator Free

The Institute for Demographic Studies, or INED, is a public research institute specializing in population studies that works in partnership with the academic and research communities at national and international levels.

Website
http://www.ined.fr/en/

Institute of Education Sciences

Government Free

The Institute of Education Sciences (IES) is the independent, non-partisan statistics, research, and evaluation arm of the U.S. Department of Education. IES’ stated mission is to provide scientific evidence on which to ground education practice and policy and to share this information in formats that are useful and accessible to educators, parents, policymakers, researchers, and the public.

Website
http://eric.ed.gov/

Integrated Marine Observing System (IMOS) – roughly 30TB of ocean measurements; on S3

Scientific aggregator Free

IMOS has been routinely operating a wide range of observing equipment throughout Australia’s coastal and open oceans, making all of its data accessible to the marine and climate science community, other stakeholders and users, and international collaborators. IMOS is designed to be a fully-integrated, national system, observing at ocean-basin and regional scales, and covering physical, chemical and biological variables.

Website
https://imos.aodn.org.au/ , http://imos-data.s3-website-ap-southeast-2.amazonaws.com/

International Affective Picture System, UFL

Scientific Free

The International Affective Picture System (IAPS) is being developed to provide a set of normative emotional stimuli for experimental investigations of emotion and attention. The goal is to develop a large set of standardized, emotionally-evocative, internationally-accessible, color photographs that includes contents across a wide range of semantic categories.

Website
http://csea.phhp.ufl.edu/media/iapsmessage.html

International Economics Database; various data tools

Economic aggregator Free

The purpose of the Widukind project is to provide a single website accessible to all users, allowing them to download, free of charge, public economic data as released by national producers (national institutes of statistics, central banks) as well as international ones (IMF, World Bank, OECD, Eurostat, ECB).

Website
http://widukind.cepremap.org/ ; https://github.com/Widukind

International HapMap Project

Government Free

The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap is used to find genetic variants affecting health, disease and responses to drugs and environmental factors.

Website
http://hapmap.ncbi.nlm.nih.gov/downloads/index.html.en

International Networks Archive

aggregator Free

The International Networks Archive aims to assemble data sets relevant to empirical research on mapping the global web in a central location and to standardize them so the various indicators can be combined. Given the immense amount of work that defining a global web involves, we argue for disseminating the raw data as widely as possible so as to recruit the largest possible number of collaborators.

Website
http://www.princeton.edu/~ina/

International Social Survey Program ISSP

Scientific aggregator Free

The ISSP is a continuing annual programme of cross-national collaboration on surveys covering topics important for social science research. It brings together pre-existing social science projects and coordinates research goals, thereby adding a cross-national, cross-cultural perspective to the individual national studies. The ISSP researchers develop questions which are meaningful and relevant to all countries and which can be expressed in an equivalent manner in different languages.

Website
http://www.issp.org/

International Studies Compendium Project

World aggregator Free

The International Studies Compendium Project, published in association with the International Studies Association (ISA), is available as an online reference or as a 12-volume set in print. This resource is the most comprehensive reference work of its kind for the fields of international studies and international relations. It comprises a series of literature review essays, refereed and rigorous, comprehensive, and neutral in tone, each of which details fruitful lines of research up to the current “state of the art”. As such, the essays provide an invaluable resource for students and scholars new to a particular area of research who need an overview that maps the existing scholarship in a useful way.

Website
http://www.isacompendium.com/public/

International Trade Statistics

aggregator Mathematics Free

Firms scanning the world market for opportunities to diversify products, markets and suppliers, and trade support institutions (TSIs) setting priorities in terms of trade promotion, sectoral performance, partner countries and trade development strategies must have detailed statistical information on international trade flows in order to utilize resources effectively.

Website
http://www.econostatistics.co.za/

Internet Product Code Database

aggregator Free

First of all, the term UPC has been deprecated; the new term is UCC-12. But the world has moved beyond that. As of January 2005, retailers in the U.S. are supposed to be able to support the EAN/UCC-13 code (the rest of the world has done this for years), which uses similar symbology and one additional digit.

Website
http://www.upcdatabase.com/
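
As an illustration of how the 12-digit and 13-digit codes mentioned above relate, the Python sketch below pads a 12-digit code with a leading zero and validates the result using the standard mod-10 check digit (weights 1 and 3 alternating from the left). The function names are hypothetical and this is not the site's own lookup API.

```python
def ean13_check_digit(first12):
    """Compute the check digit for the first 12 digits of a 13-digit code."""
    if len(first12) != 12 or not first12.isdigit():
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(code):
    """True if the 13-digit string ends with the correct check digit."""
    return (
        len(code) == 13
        and code.isdigit()
        and ean13_check_digit(code[:12]) == int(code[12])
    )


def upc_to_ean13(upc):
    """Convert a 12-digit code to its 13-digit form by prefixing a zero."""
    if len(upc) != 12 or not upc.isdigit():
        raise ValueError("expected exactly 12 digits")
    return "0" + upc


# Sample codes commonly used in check-digit examples.
sample_ean = "4006381333931"
sample_upc = "036000291452"
```

Because the leading zero carries no weight in the sum, a valid 12-digit code stays valid after conversion, so a single validator can cover both formats.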

James McGuire Cross National Data

Economic Health Free

The contents include: Health and Health Care Data; Infant, Child, and Maternal Mortality; Economic Affluence; Democracy, Civil and Political Rights, Women in Parliament; Water and Sanitation.

Website
http://jmcguire.faculty.wesleyan.edu/welcome/cross-national-data/

Joint External Debt Data Hub

World aggregator Free

The Joint External Debt Hub (JEDH)—jointly developed by the Bank for International Settlements (BIS), the International Monetary Fund (IMF), the Organization for Economic Cooperation and Development (OECD) and the World Bank (WB)—brings together external debt data and selected foreign assets from international creditor/market and national debtor sources.

Website
http://www.jedh.org/

Journal of Cell Biology DataViewer

Tech aggregator Free

The JCB DataViewer is a web-based, multi-dimensional image data-viewing application. It is a tool for visualization and simple analysis of original image data files associated with JCB articles. Image data are archived by the Journal and may be freely accessed by readers using the JCB DataViewer. Download of author-provided image data and associated metadata in OME-TIFF format is also possible with author permission, allowing for independent analysis of image data irrespective of acquisition or viewing software. Although the JCB DataViewer is designed to host and facilitate sharing and analysis of original microscopy image data, authors may also upload other types of original image data as supplements to their manuscripts, including histology and electron micrographs and digital scans of gels or blots.

Website
http://jcb-dataviewer.rupress.org/

Kaggle Competition Data

aggregator Free

Kaggle is a platform for data science competitions. We help you solve difficult problems, recruit strong teams, and amplify the power of your data science talent.

Website
https://www.kaggle.com/

KDD Cup by Tencent 2012

aggregator Free

The dataset represents a sampled snapshot of Tencent Weibo users’ preferences for various items: the recommendation of items to users and the history of users’ ‘following’. It is of a larger scale than other publicly available datasets. It also provides richer information in multiple domains such as user profiles, social graph, and item category, which may hopefully evoke deeply thoughtful ideas and methodology.

Website
http://www.kddcup2012.org/

KDNuggets Data Collections

Scientific aggregator Free

KDnuggets is a leading site on Business Analytics, Big Data, Data Mining, and Data Science.

Website
http://www.kdnuggets.com/datasets/index.html

Keel Repository for classification, regression and time series

aggregator Free

KEEL aims at providing machine learning researchers with a set of benchmarks to analyze the behavior of learning methods. Concretely, it is possible to find benchmarks already formatted in KEEL format for classification (such as standard, multi-instance or imbalanced data), semi-supervised classification, regression, time series and unsupervised learning. In several domains, such as statistics, signal processing or econometrics, a time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. Time series data have a natural temporal ordering. This makes time series analysis distinct from other common data analysis problems, in which there is no natural ordering of the observations.

Website
http://sci2s.ugr.es/keel/datasets.php

Labeled Faces in the Wild (LFW)

Tech Free

A Database for Studying Face Recognition in Unconstrained Environments. Most face databases have been created under controlled conditions to facilitate the study of specific parameters on the face recognition problem. These parameters include such variables as position, pose, lighting, background, camera quality, and gender. While there are many applications for face recognition technology in which one can control the parameters of image acquisition, there are also many applications in which the practitioner has little or no control over such parameters. This database, Labeled Faces in the Wild, is provided as an aid in studying the latter, unconstrained, recognition problem. The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life.

Website
http://vis-www.cs.umass.edu/lfw/

Lahman’s Baseball Database

Entertainment Free

Sean Lahman is an award-winning database journalist and author. He develops interactive databases and data driven stories for the Rochester Democrat and Chronicle and other Gannett newspapers and websites. He also writes a weekly column on emerging technology and innovation.

Website
http://www.seanlahman.com/baseball-archive/statistics/

Landsat 8 on AWS

Tech aggregator Paid

Landsat 8 data is available for anyone to use via Amazon S3. All Landsat 8 scenes from 2015 are available along with a selection of cloud-free scenes from 2013 and 2014. All new Landsat 8 scenes are made available each day, often within hours of production.

Website
https://aws.amazon.com/public-data-sets/landsat/

Lending Club Loan Data

Economic Free

Lending Club is the world’s largest online marketplace connecting borrowers and investors. Lending Club’s platform has the potential to profoundly transform traditional banking over the next decade. Lending Club is helping reinvent the consumer lending industry. All loans facilitated by Lending Club are issued by a bank and subject to the same consumer protection, fair lending, and disclosure requirements as any other bank loan.

Website
https://www.lendingclub.com/info/download-data.action

Leveraging open data to understand urban lives

Scientific aggregator Free

Data mining, one of the hottest topics in the media in recent years, offers a new way to help companies, organizations, and even ordinary people make plans and decisions for the near future. We are convinced by the knowledge derived from data, mostly because data recording historical events is more solid and reliable than people’s experience, which is influenced by so many random factors in reality.

Website
http://xiaming.me/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/

List of all countries in all languages

World Free

Umpirsky Country List: List of all countries with names and ISO 3166-1 codes in all languages and all data formats.

Website
https://github.com/umpirsky/country-list

Localytics Data Visualization Challenge

Scientific aggregator Free

Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data, meaning information that has been abstracted in some schematic form, including attributes or variables for the units of information.

Website
https://github.com/localytics/data-viz-challenge

Machine Comprehension Test (MCTest) of text from Microsoft Research

Scientific Tech Free

Understanding unstructured text is a major goal within natural language processing. Comprehension tests pose questions based on short text passages to evaluate such understanding. In this work, we investigate machine comprehension on the challenging MCTest benchmark. Partly because of its limited size, prior work on MCTest has focused mainly on engineering better features.

Website
http://research.microsoft.com/en-us/um/redmond/projects/mctest/index.html

Machine Learning

aggregator Free

This Machine Learning list includes: Delve Datasets for classification and regression (Univ. of Toronto); Discogs Monthly Data; eBay Online Auctions (2012); IMDb Database; Keel Repository for classification, regression and time series.

Website
https://github.com/caesar0301/awesome-public-datasets#machine-learning; https://github.com/caesar0301/awesome-public-datasets#id18

Machine Learning Data Set Repository

aggregator Free

This repository manages the following types of objects. Data Sets: raw data as a collection of similarly structured objects. Material and Methods: descriptions of the computational pipeline. Learning Tasks: learning tasks defined on raw data.

Website
http://mldata.org/

Machine Translation of European languages

World Free

We provide training data for four European language pairs, and a common framework (including a baseline system). The task is to improve current methods. This can be done in many ways. For instance, participants could try to improve word alignment quality, phrase extraction, or phrase scoring, or add new components to the open source software of the baseline system.

Website
http://statmt.org/wmt11/translation-task.html#download

MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste

Scientific Free

NSD is the Data Protection Official for Research for all the Norwegian universities, university colleges and several hospitals and research institutes. The Data Protection Official scheme implies that the requirement for obtaining licenses from the Data Inspectorate for a greater part of research projects is replaced by a notification requirement where NSD is the last instance for reviewing applications for licenses. This means that the Data Inspectorate has delegated part of its responsibility to NSD in relation to the Personal Data Act and Health Register Act.

Website
http://nsd.uib.no/

Marine Traffic – ship tracks, port calls and more

World Free

MarineTraffic maintains a database of real-time and historical ship positions sourced from the largest station network and satellite constellation.

Website
http://www.marinetraffic.com/de/ais-api-services

Medicare Coverage Database (MCD), U.S

Government Free

The Medicare Coverage Database (MCD) contains all National Coverage Determinations (NCDs) and Local Coverage Determinations (LCDs), local articles, and proposed NCD decisions. The database also includes several other types of National Coverage policy related documents, including National Coverage Analyses (NCAs), Coding Analyses for Labs (CALs), Medicare Evidence Development & Coverage Advisory Committee (MEDCAC) proceedings, and Medicare coverage guidance documents.

Website
https://www.cms.gov/medicare-coverage-database/

Medicare Data Engine of medicare.gov Data

Government Free

These data allow you to compare the quality of care at every Medicare and Medicaid-certified nursing home in the country, including over 15,000 nationwide.

Website
https://data.medicare.gov/

Medicare Data File

Government Free

The Centers for Medicare & Medicaid Services (CMS) makes identifiable data files (IDFs) available to certain stakeholders as allowed by federal laws and regulations as well as CMS policy. IDFs contain protected health information (PHI) and/or personally identifiable information (PII) and CMS is committed to ensuring this information is protected.

Website
http://go.cms.gov/19xxPN4

MeSH, the vocabulary thesaurus used for indexing articles for PubMed

Government Free

MeSH is the National Library of Medicine’s controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity.

Website
https://www.nlm.nih.gov/mesh/filelist.html

Microsoft Data Science for Research

Tech Free

Microsoft Research provides a continuously refreshed collection of free datasets, tools and resources designed to advance the state of the art of academic research in many areas of computer science, such as natural language processing and computer vision. In addition, you can browse datasets and apply for cloud-based compute cycles available under the Azure for Research program.

Website
http://aka.ms/Data-Science

Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)

Tech Free

Microsoft Machine Reading Comprehension (MS MARCO) is a new large scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human-generated wherever an answer could be summarized from the passages.

Website
http://www.msmarco.org/dataset.aspx

Million Song Dataset

Entertainment Free

The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Website
http://labrosa.ee.columbia.edu/millionsong/

Minneapolis Institute of Arts metadata

Free

A collection of metadata associated with the collection of the Minneapolis Institute of Art.

Website
https://github.com/artsmia/collection

Minnesota Population Center

World Free

The Minnesota Population Center (MPC) is a University-wide interdisciplinary cooperative for demographic research. The MPC serves more than 80 faculty members and research scientists from eight colleges and institutes at the University of Minnesota. As a leading developer and disseminator of demographic data, we also serve a broader audience of some 60,000 demographic researchers worldwide.

Website
https://www.ipums.org/

MIT Cancer Genomics Data

Health Freemium

Estimating Dataset Size Requirements for Classifying DNA Microarray Data.

Website
http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi

MIT Reality Mining Dataset

Scientific Tech aggregator Social Media Freemium

This experiment explored the capabilities of smartphones that enable social scientists to investigate human interactions beyond traditional survey-based or simulation-based methodologies. These data sets were collected with tools developed in the MIT Human Dynamics Lab.

Website
http://realitycommons.media.mit.edu/realitymining.html

MNIST database of handwritten digits, 70,000 examples

aggregator Free

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST.
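As a rough illustration of how little preprocessing is involved, the sketch below reads the header of an image file. It assumes the files use the IDX layout: a big-endian header of four 32-bit integers (magic number 2051 for image files, then image count, rows, and columns). The function name is hypothetical, and the header here is built in memory rather than read from a downloaded file.

```python
import struct
from typing import Tuple


def parse_idx_image_header(header: bytes) -> Tuple[int, int, int]:
    """Return (image_count, rows, cols) from the first 16 bytes of an IDX image file."""
    if len(header) < 16:
        raise ValueError("an IDX image header is 16 bytes long")
    # ">IIII" = four big-endian unsigned 32-bit integers.
    magic, count, rows, cols = struct.unpack(">IIII", header[:16])
    if magic != 2051:  # assumed magic number for image files (0x00000803)
        raise ValueError(f"unexpected magic number {magic}")
    return count, rows, cols


# Synthetic header shaped like the training-image file (not read from disk).
example_header = struct.pack(">IIII", 2051, 60000, 28, 28)
image_count, rows, cols = parse_idx_image_header(example_header)
```

With a real file, the same function could be applied to the first 16 bytes after decompression; the pixel data that follows is one unsigned byte per pixel.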

Website
http://yann.lecun.com/exdb/mnist/

Mobile Social Networks from UMASS

aggregator Free

The Proximity Mobile Social Networks database is based on data collected by the Privacy, Internetworking, Security, and Mobile Systems laboratory. The data provide a record of successful mote-to-mote connections over the course of each trial.

Website
https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks

More Song Datasets

Entertainment aggregator Free

The goal is to be able to train on the whole dataset, and then easily compare the results with previous publications. All files have been uploaded to the Echo Nest API. The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Website
http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets

MovieLens Data Sets

aggregator Free

The data sets were collected over various periods of time, depending on the size of the set. Before using these data sets, please review their README files for the usage licenses and other details.

Website
http://grouplens.org/datasets/movielens/

Multi-Domain Sentiment Dataset (version 2.0)

aggregator Freemium

The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Some domains (books and dvds) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed.
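One possible way to convert star ratings into binary labels is sketched below. The thresholds (2 stars or fewer is negative, 4 stars or more is positive, anything in between is dropped as ambiguous) are an assumption for illustration, not a rule stated by the dataset, and the function name is hypothetical.

```python
from typing import List, Optional


def stars_to_binary(
    stars: float, negative_max: float = 2.0, positive_min: float = 4.0
) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative), or None (ambiguous)."""
    if not 1.0 <= stars <= 5.0:
        raise ValueError("star ratings are expected to lie between 1 and 5")
    if stars >= positive_min:
        return 1
    if stars <= negative_max:
        return 0
    return None  # middle ratings carry no clear polarity under these thresholds


# Hypothetical ratings, one per review.
ratings: List[float] = [5, 1, 3, 4, 2]
labels = [stars_to_binary(s) for s in ratings]
```

Reviews that map to `None` can simply be filtered out before training a binary classifier.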

Website
http://www.cs.jhu.edu/~mdredze/datasets/sentiment/

Museums

aggregator Free

An awesome list of high-quality open datasets in public domains (on-going). By everyone, for everyone!

Website
https://github.com/caesar0301/awesome-public-datasets#museums; https://github.com/caesar0301/awesome-public-datasets#id19

NASA Exoplanet Archive

Scientific aggregator Freemium

The first space mission to search for Earth-sized and smaller planets in the habitable zone of other stars in our neighborhood of the galaxy.

Website
http://exoplanetarchive.ipac.caltech.edu/

NASA Global Imagery Browse Services

Scientific aggregator Freemium

The Global Imagery Browse Services (GIBS) system is a core EOSDIS component which provides a scalable, responsive, highly available, and community standards based set of imagery services. These services are designed with the goal of advancing user interactions with EOSDIS’ inter-disciplinary data through enhanced visual representation and discovery.

Website
https://wiki.earthdata.nasa.gov/display/GIBS

NASDAQ

aggregator Paid

The Nasdaq Stock Market is an American stock exchange. It is the second-largest exchange in the world by market capitalization.

Website
https://data.nasdaq.com/default.aspx

National Weather Service GIS Data Portal

Government aggregator Free

This page contains links to data that are distributed via web server technology in the Open Geospatial Consortium (OGC). In addition, some of the NWS data is available as geo-referenced image files such as geo-gifs. NWS provides access to watches, warnings, advisories, and other similar products in the Common Alerting Protocol (CAP) and Atom Syndication Format (ATOM).

Website
http://www.nws.noaa.gov/gis/

Natural Earth – Vectors and Rasters of the World

aggregator Free

Natural Earth is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales. Featuring tightly integrated vector and raster data, with Natural Earth you can make a variety of visually pleasing, well-crafted maps with cartography or GIS software.

Website
http://www.naturalearthdata.com/

Natural History Museum (London) Data Portal

aggregator Free

The Museum is committed to open access and open science, and has launched the Data Portal to make its research and collections datasets available online. It allows anyone to explore, download and reuse the data for their own research.

Website
http://data.nhm.ac.uk/dataset

NBER Patent Citations

Economic aggregator Freemium

These data comprise detailed information on almost 3 million U.S. patents granted between January 1963 and December 1999, all citations made to these patents between 1975 and 1999 (over 16 million), and a reasonably broad match of patents to Compustat (the data set of all firms traded in the U.S. stock market).

Website
http://nber.org/patents/

NCBI Proteins

Government aggregator Free

A collection of genomics, functional genomics, and genetics studies and links to their resulting datasets. This resource describes project scope, material, and objectives and provides a mechanism to retrieve datasets that are often difficult to find due to inconsistent annotation, multiple independent submissions, and the varied nature of diverse data types which are often stored in different databases. A database that includes protein sequence records from a variety of sources, including GenPept, RefSeq, Swiss-Prot, PIR, PRF, and PDB.

Website
https://www.ncbi.nlm.nih.gov/

NCBI Taxonomy

Government Free

The Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. This currently represents about 10% of the described species of life on the planet.

Website
https://www.ncbi.nlm.nih.gov/taxonomy

NCI Genomic Data Commons

Scientific Health Free

The NCI Genomic Data Commons (GDC) is a unified knowledge base that promotes sharing of genomic and clinical data between researchers and facilitates precision medicine in oncology.

Website
https://gdc-portal.nci.nih.gov/

NDAR

aggregator Free

The National Database for Autism Research (NDAR) is an NIH-funded research data repository that aims to accelerate progress in autism spectrum disorders (ASD) research through data sharing, data harmonization, and the reporting of research results. NDAR also serves as a scientific community platform and portal to multiple other research repositories, allowing for aggregation and secondary analysis of data.

Website
https://ndar.nih.gov/

Netflix Prize

Entertainment Free

The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest.
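To make the task concrete, the sketch below fits a simple mean-offset baseline: the global average rating plus a per-user and a per-film offset. This is a much simpler stand-in for collaborative filtering, not the method used by any contest entry. The tuple layout, the function names, the 1 to 5 rating range, and the sample ratings are all assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rating = Tuple[int, int, float]  # (user_id, film_id, rating), ids are plain numbers
Model = Tuple[float, Dict[int, float], Dict[int, float]]


def fit_baseline(ratings: List[Rating]) -> Model:
    """Compute the global mean and the average deviation of each user and film from it."""
    global_mean = sum(r for _, _, r in ratings) / len(ratings)
    user_dev: Dict[int, List[float]] = defaultdict(list)
    film_dev: Dict[int, List[float]] = defaultdict(list)
    for user, film, r in ratings:
        user_dev[user].append(r - global_mean)
        film_dev[film].append(r - global_mean)
    user_offset = {u: sum(d) / len(d) for u, d in user_dev.items()}
    film_offset = {f: sum(d) / len(d) for f, d in film_dev.items()}
    return global_mean, user_offset, film_offset


def predict(model: Model, user: int, film: int) -> float:
    """Predict a rating; unseen users or films contribute an offset of zero."""
    global_mean, user_offset, film_offset = model
    raw = global_mean + user_offset.get(user, 0.0) + film_offset.get(film, 0.0)
    return min(5.0, max(1.0, raw))  # clip to the assumed 1-5 rating range


# Hypothetical previous ratings.
history = [(1, 10, 5.0), (1, 11, 3.0), (2, 10, 4.0), (2, 12, 2.0), (3, 11, 1.0)]
model = fit_baseline(history)
guess = predict(model, user=3, film=10)
```

The baseline uses nothing but previous ratings and numeric ids, which matches the information available in the contest setting.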

Website
http://netflixprize.com/leaderboard.html

Network Repository with Interactive Exploratory Analysis Tools

aggregator Free

The first interactive data and network repository with real-time analytics. Network Repository is not only the first interactive repository, but also the largest network and graph data repository, with 500+ donations. This large comprehensive collection of network graph data is useful for making significant research findings as well as benchmark data sets for a wide variety of applications and domains (e.g., network science, bioinformatics, machine learning, data mining, physics, and social science) and includes relational, attributed, heterogeneous, streaming, spatial, and time series data as well as non-relational machine learning data. All data sets are easily downloaded into a standard consistent format.

Website
http://networkrepository.com/

Network Twitter Data

aggregator Free

Website
http://snap.stanford.edu/data/higgs-twitter.html

NeuroData

Scientific aggregator Free

Our goal is to work together with neuroexperimentalists to discover fundamental principles governing the relationship between mind and brain, via building and deploying open source data-driven tools that run at scale on open access data. This includes analytics, databases, cloud computing, and Web-services applied to both big neuroimages and big neurographs.

Website
http://neurodata.io/

Neuroelectro

aggregator Free

The goal of the NeuroElectro Project is to extract information about the electrophysiological properties (e.g. resting membrane potentials and membrane time constants) of diverse neuron types from the existing literature and place it into a centralized database.

Website
http://neuroelectro.org/

NIMH Data Archive

aggregator Free

The National Institute of Mental Health Data Archive (NDA) makes available human subjects data collected from hundreds of research projects across many scientific domains. The NDA provides infrastructure for sharing research data, tools, methods, and analyses enabling collaborative science and discovery. De-identified human subjects data, harmonized to a common standard, are available to qualified researchers. Summary data is available to all.

Website
https://data-archive.nimh.nih.gov/

NIST complex networks data collection

aggregator Free

In analyzing large-scale complex networks, it is important to establish a standard dataset from which algorithms and claims can be compared and verified. Currently, it is often difficult to track down the original data used for computational experiments. Much of it is floating around in various formats throughout the net, embedded in papers, and often difficult to get from the authors. Moreover, the datasets are often modified (filtered) by research groups interested in different attributes, so that even when the name and descriptions match a citation in a paper, there is no guarantee that the data is identical.

Website
http://math.nist.gov/~RPozo/complex_datasets.html

NOAA Bering Sea Climate

aggregator Free

There is an explosion of interest in Northern Hemisphere climate, and new science programs are highlighting the importance of recent changes in the Arctic on mid-latitude climate impacts. The Bering Sea is one of the world’s major fisheries, and fisheries from Alaskan waters represent half of the landed U.S. catch of fish and shellfish. Because of the changes going on in the Arctic, future evolution of the Bering Sea climate/ecosystem is more uncertain. This is a symmetric problem: climate change impacts ecosystems, and ecosystems serve as indicators for climate change.

Website
http://www.beringclimate.noaa.gov/

NOAA Climate Datasets

World Government Free

NCEI is the world’s largest provider of weather and climate data. Land-based, marine, model, radar, weather balloon, satellite, and paleoclimatic are just a few of the types of datasets available.

Website
http://www.ncdc.noaa.gov/data-access/quick-links

NOAA Realtime Weather Models

World Free

Numerical Weather Prediction (NWP) data are the form of weather model data we are most familiar with on a day-to-day basis. NWP focuses on taking current observations of weather and processing these data with computer models to forecast the future state of weather. Knowing the current state of the weather is just as important as the numerical computer models processing the data. Current weather observations serve as input to the numerical computer models through a process known as data assimilation to produce outputs of temperature, precipitation, and hundreds of other meteorological elements from the oceans to the top of the atmosphere.

Website
http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction

NOAA SURFRAD Meteorology and Radiation Datasets

World Free

NOAA/ESRL’s Global Monitoring Division (formerly CMDL) of the National Oceanic and Atmospheric Administration conducts sustained observations and research related to source and sink strengths, trends and global distributions of atmospheric constituents that are capable of forcing change in the climate of Earth through modification of the atmospheric radiative environment, those that may cause depletion of the global ozone layer, and those that affect baseline air quality.

Website
https://www.esrl.noaa.gov/gmd/grad/stardata.html

Notre Dame Global Adaptation Index (ND-GAIN)

World Free

The Notre Dame Global Adaptation Initiative (ND-GAIN) is part of the Climate Change Adaptation Program of the University of Notre Dame’s Environmental Change Initiative (ND-ECI). The ND-GAIN Country Index follows a data-driven approach to show which countries are best prepared to deal with global changes brought about by overcrowding, resource constraints and climate disruption. The Index aims to unlock global adaptation solutions in the corporate and development communities to save lives and improve livelihoods while strengthening market positions.

Website
http://index.gain.org/about/download

NSSDC (NASA) data of 550 spacecraft

World Government Free

The NASA Space Science Data Coordinated Archive serves as the permanent archive for NASA space science mission data. “Space science” means astronomy and astrophysics, solar and space plasma physics, and planetary and lunar science. As a permanent archive, NSSDCA teams with NASA’s discipline-specific space science “active archives” which provide access to data to researchers and, in some cases, to the general public.

Website
http://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html

NYC Taxi Trip Data 2009-

Paid

The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.

Website
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

NYC Taxi Trip Data 2013 (FOIA/FOILed)

Free

Website
https://archive.org/details/nycTaxiTripData2013

NYC Uber trip data April 2014 to September 2014

Paid

This directory contains data on over 4.5 million Uber pickups in New York City from April to September 2014, and 14.3 million more Uber pickups from January to June 2015. Trip-level data on 10 other for-hire vehicle (FHV) companies, as well as aggregated data for 329 FHV companies, is also included. All the files are as they were received on August 3, Sept. 15 and Sept. 22, 2015.

Website
https://github.com/fivethirtyeight/uber-tlc-foil-response

OANDA

World Free

OANDA Corporation provides Internet-based foreign exchange (forex) trading and currency information services to individuals, corporations, portfolio managers, and financial institutions worldwide. The company provides fxTrade, a forex trading platform that enables users to access charting features, multiple sub-accounts to test various trading strategies and financial news, and market analysis; fxTrade Mobile, a forex trading platform for iPhone, iPad, and Android devices that provides charting features, financial news and market analysis, and more; MetaTrader 4 (MT4), a Windows-based electronic trading platform with automated trading capabilities; MT4 Hedging Compatibility,

Website
http://www.oanda.com/

OASIS

aggregator Free

The Open Access Series of Imaging Studies (OASIS) is a project aimed at making MRI data sets of the brain freely available to the scientific community. By compiling and freely distributing MRI data sets, we hope to facilitate future discoveries in basic and clinical neuroscience. OASIS is made available by the Washington University Alzheimer’s Disease Research Center, Dr. Randy Buckner at the Howard Hughes Medical Institute (HHMI) at Harvard University, the Neuroinformatics Research Group (NRG) at Washington University School of Medicine, and the Biomedical Informatics Research Network (BIRN).

Website
http://www.oasis-brains.org/

OONI: Open Observatory of Network Interference – Internet censorship data

aggregator Free

The Open Observatory of Network Interference (OONI) is a free software project under the Tor Project which aims to detect internet censorship, traffic manipulation and signs of surveillance around the world through the collection and processing of network measurements. Since late 2012, OONI has collected millions of network measurements across more than 90 countries around the world, shedding light on multiple cases of network interference.

Website
https://ooni.torproject.org/data/

Open Crime and Policing Data in England, Wales and Northern Ireland

Legal Free

Individual crime and anti-social behaviour (ASB) incidents, including street-level location information and subsequent police and court outcomes associated with the crime.

Website
https://data.police.uk/data/

Open Data Certificates (beta)

aggregator Free

Open Data Certificate is a free online tool developed and maintained by the Open Data Institute, to assess and recognise the sustainable publication of quality open data. It assesses the legal, practical, technical and social aspects of publishing open data using best practice guidance.

Website
https://certificates.theodi.org/en/datasets

Open Data for Africa

aggregator Free

The AfDB Statistical Data Portal has been developed in response to the increasing demand for statistical data and indicators relating to African Countries. The Portal provides multiple customized tools to gather indicators, analyze them, and export them into multiple formats. With the Data Portal, you can visualize Socio-Economic indicators over a period of time, gain access to presentation-ready graphics and perform comprehensive analysis on a Country and Regional level.

Website
http://opendataforafrica.org/

Open Library Data Dumps

Free

Open Library provides dumps of all the data in various formats. Currently these dumps are generated every month.

Website
https://openlibrary.org/developers/dumps

Open Mobile Data by MobiPerf

aggregator Free

MobiPerf is an open source application for measuring network performance on mobile platforms. You can measure your network’s throughput and latency, as well as other useful network metrics. MobiPerf also performs measurements at regular intervals in the background. The data is collected either anonymously or from your selected account, which allows you to see your own data. The user credentials collected are not shared outside of this site, and any data used in research projects in universities are anonymized before use.

Website
https://console.cloud.google.com/storage/browser/openmobiledata_public/?pli=1

Open Multilingual Wordnet

World Free

The individual wordnets have been made by many different projects and vary greatly in size and accuracy. We have (i) extracted and normalized the data, (ii) linked it to Princeton WordNet 3.0 and (iii) put it in one place. The Open Multilingual Wordnet and its components are open: they can be freely used, modified, and shared by anyone for any purpose. There is a fuller list of wordnets at the Global Wordnet Association’s Wordnets in the World page.

Website
http://compling.hss.ntu.edu.sg/omw/

Open Traffic collection

Government Free

This collection includes:

Open Traffic Data project
OpenTraffic.io
MDM-Portal
Datex2 Portal

Website
https://github.com/graphhopper/open-traffic-collection

Open-ODS (structure of the UK NHS)

Tech Free

The Organisation Data Service (ODS) is responsible for publishing organisation and practitioner codes, along with related national policies and standards. We’re also responsible for the ongoing maintenance of the organisation and person nodes of the Spine Directory Service, the central data repository used within various NHS systems and services.

Website
https://digital.nhs.uk/home

OpenAddresses

World Free

OpenAddresses.io is “a global collection of address data sources, open and free to use”, created by OSM contributors ToeBee, Ingalls and Lxbarth, among others. It started out as a spreadsheet of government address datasets maintained by ToeBee, but now has an aggregated download, an API, and a website.

Website
http://openaddresses.io/

OpenCorporates Database of Companies in the World

aggregator

OpenCorporates is the largest open database of companies and company data in the world, with in excess of 100 million companies in a similarly large number of jurisdictions. Our primary goal is to make information on companies more usable and more widely available for the public benefit, particularly to tackle the use of companies for criminal or anti-social purposes, for example corruption, money laundering and organised crime.

Website
https://opencorporates.com/

OpenDataNetwork

aggregator Freemium

A search engine of all Socrata powered data portals. Publish data and share. Find data and build. Answer questions.

Website
https://www.opendatanetwork.com/

OpenDataSoft’s list of 1,600 open data

aggregator Free

We rolled up our sleeves and started aggregating all of the Open Data portals we could get our hands on. We are thrilled to present you the first version of our comprehensive list of 2600+ Open Data portals around the world.

Website
https://www.opendatasoft.com/a-comprehensive-list-of-all-open-data-portals-around-the-world/

OpenFlights – airport, airline and route data

aggregator Free

OpenFlights is a tool that lets you map your flights around the world, search and filter them in all sorts of interesting ways, calculate statistics automatically, and share your flights and trips with friends and the entire world (if you wish). It’s also the name of the open-source project to build the tool.

Website
http://openflights.org/data.html

OpenfMRI

Scientific Health aggregator Freemium

The OpenfMRI project has provided a resource for researchers to make their MRI data openly available to the research community.

Website
https://openfmri.org/

OpenPaymentsData, Healthcare financial relationship data

Health aggregator Free Freemium

Open Payments is the federally run transparency program that collects information about financial relationships between companies and healthcare providers and makes it available to you. These relationships can involve money for research activities, gifts, speaking fees, meals, or travel. One of the ways we provide this data to the public is through this search tool, which allows you to search for a doctor, teaching hospital, or company that has made payments. Exploring this information, and discussing the results you find with your healthcare provider, can help you make more informed healthcare decisions.
Use the Data Explorer tool to view the full data sets and create visualizations such as charts and graphs to get an in-depth look at the data submitted by applicable manufacturers and GPOs.

Website
https://openpaymentsdata.cms.gov/

OpenSNP genotypes data

aggregator Freemium

With openSNP you can share stories about your genetic variations and phenotypes, and discover the stories of other users. openSNP gets the latest open access journal articles on genetic variations from the Public Library of Science. Phenotypes are the observable characteristics of your body, such as height, eye color or preference for coffee. Share your phenotype with other openSNP users, and find others with similar characteristics and traits.
Your data may help scientists discover new genetic associations!

Website
https://opensnp.org/

OpenStreetMap (OSM)

aggregator Free

OpenStreetMap is a project that creates and distributes free geographic data for the world. We started it because most maps you think of as free actually have legal or technical restrictions on their use, holding back people from using them in creative, productive, or unexpected ways. OpenStreetMap is a federative project, which means that a lot of essential resources are provided by third-party providers.

Website
http://wiki.openstreetmap.org/wiki/Downloading_data

OSU Financial data

aggregator Freemium

We provide a vibrant research and teaching atmosphere, characterized by extensive collaboration with a shared goal of conducting leading-edge research and providing students with the skills they need to succeed in the field of finance.

Website
https://fisher.osu.edu/academic-departments/department-finance

Our World in Data

aggregator Free

Our World in Data (OWID) is an online publication that shows how living conditions are changing. The aim is to give a global overview and to show changes over the very long run, so that we can see where we are coming from and where we are today. We have a list of all current and future data-entries that shows which topics we will cover in this publication. There will be 275 entries. Offline we are constantly collecting material for the future entries; this catalogue includes much more than ten thousand references to visualisations, data sources, and research papers.

Website
https://ourworldindata.org/

Pathguide – Protein-Protein Interactions Catalog

aggregator Free

Pathguide contains information about more than 500 resources related to biological pathways and molecular interactions.

Website
http://www.pathguide.org/

Personae Corpus

aggregator Freemium

The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students (BA in Linguistics and Literature at the University of Antwerp, Belgium). Each student also took an online MBTI personality test, allowing personality prediction experiments. The corpus was controlled for topic, register, genre, age, and education level.

We make available the original texts, a syntactically annotated version of the texts, and the metadata.

Website
http://www.clips.uantwerpen.be/datasets/personae-corpus

PewResearch Society Data Collection

Government aggregator Freemium

Pew Research Center makes its data available to the public for secondary analysis after a period of time.

Website
http://www.pewresearch.org/data/download-datasets/

Physics

Freemium

An awesome list of high-quality open datasets in public domains (on-going). By everyone, for everyone!

Website
https://github.com/caesar0301/awesome-public-datasets#physics; https://github.com/caesar0301/awesome-public-datasets#id22

Pinhooker: Thoroughbred Bloodstock Sale Data

aggregator Freemium

An R package to compile data sets of historic results from thoroughbred sales.

Website
https://github.com/phillc73/pinhooker

PLAID

Tech aggregator Freemium

A public dataset of high-resolution electrical appliance measurements for load identification research.

Website
http://plaidplug.com/

Pleiades – Gazetteer and graph of ancient places

aggregator Free

Pleiades is a community-built gazetteer and graph of ancient places. It publishes authoritative information about ancient places and spaces, providing unique services for finding, displaying, and reusing that information under open license. It publishes not just for individual human users, but also for search engines and for the widening array of computational research and visualization tools that support humanities teaching and research.

Website
https://pleiades.stoa.org/

Protein Data Bank

Health aggregator Freemium

The RCSB PDB builds upon the data in the Protein Data Bank archive by creating tools and resources for research and education in molecular biology, structural biology, computational biology, and beyond.

Website
http://www.rcsb.org/pdb/home/home.do

Protein-protein interaction network

aggregator Free

Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology. The dataset consists of the protein-protein interaction network described and analyzed in (1) and available as an example in the software package PIN (2).

Website
http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm

Psychiatric Genomics Consortium

aggregator Paid

The purpose of the Psychiatric Genomics Consortium (PGC) is to unite investigators around the world to conduct meta- and mega-analyses of genome-wide genomic data for psychiatric disorders. This website provides information about the organization, implementation, and results of the PGC.

Website
https://www.med.unc.edu/pgc/acl_users/credentials_cookie_auth/require_login?came_from=https%3A//www.med.unc.edu/pgc/old-pages/downloads

Psychology/Cognition

aggregator Free

An awesome list of high-quality open datasets in public domains (on-going). By everyone, for everyone!

Website
https://github.com/caesar0301/awesome-public-datasets#psychology-cognition

PubChem Project

aggregator Free

PubChem is a database of chemical molecules and their activities against biological assays.

Website
https://pubchem.ncbi.nlm.nih.gov/#

PubGene (now Coremine Medical)

Health aggregator Freemium

Explore connections – Build your biomedical mindmap

Website
http://www.pubgene.org/

Public Domains

aggregator Free

An awesome list of high-quality open datasets in public domains (on-going). By everyone, for everyone!

Website
https://github.com/caesar0301/awesome-public-datasets#public-domains

PyPI and Maven Dependency Network

aggregator Free

As time is always running out, I don’t think I’ll have the time for a while to work again on the data I collected for the last three articles: Going offline with Maven, State of the Maven/Java dependency graph, and State of the PyPi/Python dependency graph.

Website
https://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/

Quandl

aggregator Freemium

The world’s most powerful data lives on Quandl. Designed for professionals, Quandl delivers financial, economic and alternative data to over 150,000 people worldwide. Our customers include the world’s top hedge funds, asset managers and investment banks.

Website
https://www.quandl.com/

Rapid7 Sonar Internet Scans

aggregator Freemium

Project Sonar is a community effort to improve security through the active analysis of public networks. This includes running scans across public internet-facing systems, organizing the results, and sharing the data with the information security community. The three components to this project are tools, datasets, and research.

Website
https://sonar.labs.rapid7.com/

RDataMining

aggregator Free

RDataMining.com is a leading website on R and data mining, providing examples, documents, tutorials, resources and training on data mining and analytics with R.

Website
http://www.rdatamining.com/data

REDD

aggregator Free

REDD is a data set for energy disaggregation. The data contains power consumption from real homes, for the whole house as well as for each individual circuit in the house (labeled by the main type of appliance on that circuit). The data is intended for use in developing disaggregation methods, which can predict, from only the whole-home signal, which devices are being used (though any other uses are of course encouraged as well).

Website
http://redd.csail.mit.edu/
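
To make the idea of disaggregation concrete, here is a minimal sketch of a naive combinatorial baseline: given one whole-home power reading, it searches for the combination of appliances whose summed draw is closest to that reading. The appliance names and wattages below are hypothetical values chosen for illustration; they are not taken from REDD, and this is not the method used by the REDD authors.

```python
from itertools import combinations

# Hypothetical steady-state power draws in watts (illustrative only, not REDD data).
APPLIANCE_WATTS = {
    "fridge": 150,
    "microwave": 1100,
    "lighting": 60,
    "dishwasher": 1300,
}


def disaggregate(total_watts, appliance_watts=APPLIANCE_WATTS):
    """Return (appliances, error): the subset whose summed draw is closest to total_watts."""
    names = sorted(appliance_watts)
    # Start from the empty subset: nothing is on, so the error is the full reading.
    best_subset, best_error = (), abs(total_watts)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            error = abs(total_watts - sum(appliance_watts[n] for n in subset))
            if error < best_error:
                best_subset, best_error = subset, error
    return set(best_subset), best_error


if __name__ == "__main__":
    print(disaggregate(1250))
```

The exhaustive search grows exponentially with the number of appliances, so it is only practical for a handful of circuits. It also ignores time, which real disaggregation methods rely on.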

Reddit Comments

aggregator Free

Website
https://www.reddit.com/r/datasets/comments/3bxlg7/i_have_every_publicly_available_reddit_comment/

Restaurants Health Score Data in San Francisco

Health Free

Website
http://missionlocal.org/san-francisco-restaurant-health-inspections/

Retrosheet Baseball Statistics

aggregator Free

Retrosheet is a non-profit organization whose website features major league baseball box scores and play-by-play narratives for almost every contest from 1871–1872, 1874, 1911 National League, and since 1913. It also includes scores from every Major League Baseball game played since the 1871 season (what is officially the inception of Major League Baseball history), as well as all All-Star, League Championship Series and World Series games.

Website
http://www.retrosheet.org/game.htm

Revolution Analytics Collection

aggregator Free

The Revolution Analytics collection contains some of the data sets we use at Revolution to show off the Parallel External Memory Algorithms in our RevoScaleR package. The collection includes easily accessible “tarred-up” versions of the Airlines Data Set, Census5PCT2000 data set and an artificial set of mortgage default data.

Website
http://packages.revolutionanalytics.com/datasets/

Rijksmuseum Historical Art Collection

aggregator Free

The Rijksmuseum is a Dutch national museum dedicated to arts and history in Amsterdam. The museum has on display 8,000 objects of art and history, from their total collection of 1 million objects from the years 1200–2000, among which are some masterpieces by Rembrandt, Frans Hals, and Johannes Vermeer. The museum also has a small Asian collection, which is on display in the Asian pavilion.

Website
https://www.rijksmuseum.nl/en/api

RITA Airline On-Time Performance data

Government Free

Website
http://www.transtats.bts.gov/Tables.asp?DB_ID=120

RITA/BTS transport data collection (TranStats)

Government Free

Website
https://www.transtats.bts.gov/DataIndex.asp

Sample R data sets

aggregator Free

This package contains a variety of datasets.

Website
http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html

Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)

Health Free

COSMIC, the Catalogue Of Somatic Mutations In Cancer, is the world’s largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer.

Website
http://cancer.sanger.ac.uk/cosmic

Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)

Health Free

The Genomics of Drug Sensitivity in Cancer Project is part of a Wellcome Trust funded collaboration between The Cancer Genome Project at the Wellcome Trust Sanger Institute (UK) and the Center for Molecular Therapeutics, Massachusetts General Hospital Cancer Center (USA). This collaboration integrates the expertise at both sites toward the goal of identifying cancer biomarkers that can be used to identify genetically defined subsets of patients most likely to respond to cancer therapies.

Website
http://www.cancerrxgene.org/

SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)

aggregator Free

The dataset currently contains 31,030 Arabic articles (with a total number of 8,758,976 words). The articles were extracted from the following Saudi newspapers (sorted by number of articles): Al-Riyadh; Al-Jazirah; Al-Yaum; Al-Eqtisadiya.

Website
https://github.com/ParallelMazen/SaudiNewsNet

SciencesPo World Trade Gravity Datasets

aggregator Free

Website
http://econ.sciences-po.fr/thierry-mayer/data

Scopus Citation Database

aggregator Free

Scopus is the largest abstract and citation database of peer-reviewed literature: scientific journals, books and conference proceedings.

Website
https://www.elsevier.com/solutions/scopus

Search Engines

Free

The list's categories include Agriculture, Biology, Climate/Weather, and Complex Networks.

Website
https://github.com/caesar0301/awesome-public-datasets#search-engines

Sequence Read Archive (SRA)

aggregator Free

The Sequence Read Archive (SRA) stores raw sequence data from “next-generation” sequencing technologies including Illumina, 454, IonTorrent, Complete Genomics, PacBio and Oxford Nanopore. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence.

Website
http://www.ncbi.nlm.nih.gov/Traces/sra/

Skytrax’ Air Travel Reviews Dataset

aggregator Free

This Skytrax User Reviews Dataset includes 41,396 airline reviews, 17,721 airport reviews, 1,258 seat reviews, and 2,264 lounge reviews.

Website
https://github.com/quankiquanki/skytrax-reviews-dataset

Sloan Digital Sky Survey (SDSS) – Mapping the Universe

Scientific Free

The Sloan Digital Sky Survey has created the most detailed three-dimensional maps of the Universe ever made, with deep multi-color images of one third of the sky, and spectra for more than three million astronomical objects. Learn and explore all phases and surveys—past, present, and future—of the SDSS.

Website
http://www.sdss.org/

Small Network Data

aggregator Free

Website
http://www-personal.umich.edu/~mejn/netdata/

Smithsonian Institution Global Volcano and Eruption Database

World aggregator Free

The mission of the Global Volcanism Program (GVP) is to document, understand, and disseminate information about global volcanic activity. We do this through four core functions: reporting, archiving, research, and outreach. The data systems that lie at our core have been in development since 1968 when GVP began documenting the eruptive histories of volcanoes.

Website
http://volcano.si.edu

SMS Spam Collection in English

aggregator Free

The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It has one collection composed of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.

Website
http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
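
As a rough illustration of working with ham/spam labeled messages, the sketch below parses lines into (label, message) pairs and counts each class. It assumes each line holds a label, a tab character, and then the message text; that layout is an assumption, not something stated in this listing, so check the downloaded file first. The sample messages are invented for the example.

```python
from collections import Counter


def parse_sms_lines(lines):
    """Split each 'label<TAB>message' line into a (label, message) pair."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        label, _, message = line.partition("\t")
        if label not in ("ham", "spam"):
            raise ValueError(f"unexpected label: {label!r}")
        records.append((label, message))
    return records


# Invented sample lines in the assumed layout (not taken from the collection).
SAMPLE = [
    "ham\tSee you at lunch?\n",
    "spam\tYou won a prize! Reply now\n",
    "ham\tRunning late, sorry\n",
]

records = parse_sms_lines(SAMPLE)
label_counts = Counter(label for label, _ in records)
print(label_counts["ham"], label_counts["spam"])
```

To run this on a real file, pass an open file handle to `parse_sms_lines` in place of `SAMPLE`.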

Social Networks

Social Media Free

This social networks list includes: 72 hours #gamergate Twitter Scrape; Ancestry.com Forum Dataset over 10 years; Cheng-Caverlee-Lee September 2009 – January 2010 Twitter Scrape; CMU Enron Email of 150 users.

Website
https://github.com/caesar0301/awesome-public-datasets#social-networks

Social Sciences

Scientific Free

The main social sciences include economics, political science, human geography, demography, psychology, and sociology. In a wider sense, social science also includes some fields in the humanities such as anthropology, archaeology, jurisprudence, history, and linguistics.

Website
https://github.com/caesar0301/awesome-public-datasets#social-networks; https://github.com/caesar0301/awesome-public-datasets#id26