Federal Election Commission
Created by Congress in 1975, the FEC is an independent regulatory agency whose purpose is to disclose campaign finance information. You can find data on US campaign finance sources. The FEC has downloadable data sets so you can slice and dice the data for your own analysis, or you can use ready-made maps and charts that break down the data for you.

Website
http://www.fec.gov/pindex.shtml
GovTrack
GovTrack.us is a completely independent entity that tracks the status of federal legislation and publishes information about your representative and senators in Congress, along with voting records and original research. GovTrack helps Americans understand what is going on in their national legislature.

Website
https://www.govtrack.us/
UCR Data Tool
The FBI’s Uniform Crime Reporting (UCR) Program collects statistics on violent and property crime. The FBI, in cooperation with the Bureau of Justice Statistics, allows users to build their own customized data tables on this site. By using the table-building tool, users can choose these options: offenses, locality (city, county, state), and year(s).

Website
https://www.ucrdatatool.gov
Bureau of Labor Statistics
The Bureau of Labor Statistics’ Public Data API (Version 1.0 and Version 2.0) gives the public access to economic data from all of its programs.

Website
https://www.bls.gov/
Data.Gov
This is the U.S. Government’s official data portal. It provides numerous datasets on the nation’s demography, businesses and trade.

Website
https://www.data.gov/
American FactFinder (US Census)
American FactFinder provides access to data collected from surveys and censuses regarding the United States, Puerto Rico and the Island Areas.

NICAR
The National Institute for Computer-Assisted Reporting (NICAR) is a program of Investigative Reporters and Editors (IRE) and the Missouri School of Journalism. It is dedicated to excellence in journalism, particularly with regard to data journalism.

Graphiq
Graphiq Visualizations are used to enrich editorial content and increase support of third-party applications. There are more than 10 billion visualizations already listed in the Graphiq library and thousands are getting added daily.

Website
https://www.graphiq.com
Atlas
Atlas is a platform whose goal is to let everyone discover and share great charts. Chart creators, especially researchers, analysts and journalists, can use the Atlas platform to create, share and embed their data visualizations.

Website
https://www.theatlas.com/
ProPublica
ProPublica is a non-profit investigative news outlet which offers up hyperlocal data on an array of important issues, such as abuses of power and public trust issues found in government and businesses. The premium data products provide data, analysis, and practical documentation.

Google Public Data Explorer
The Google Public Data Explorer makes it easy to review and use large datasets which are displayed as line graphs, bar graphs, cross sectional plots or on maps. The platform provides past and current public data and predictions from numerous international organizations, such as the World Bank, OECD, Eurostat and the University of Denver.

DataPortals
This is an extensive and comprehensive directory of open-data portals world-wide. It is managed by a group of leading open data experts, including representatives from local, regional and national governments, many NGOs, and international organisations.

Website
http://dataportals.org/
Net Data Directory
The Net Data Directory collects and shares information on a wide range of Internet-related topics, including freedom of expression, broadband, social media, cybersecurity and more. This database makes it easier for researchers to search, sort, and filter records that are important to their work, and many of the datasets are open and available to the public.

OpenSecrets.org
OpenSecrets.org is the nation’s premier website tracking the influence of money on U.S. politics. The site offers clear and unbiased information detailing how money affects not only government policy but also the lives of US citizens and residents.

Website
http://www.opensecrets.org/
Censorship Explorer
Censorship Explorer has a proxy list that is regularly updated by scraping free online proxy lists. Each URL you enter is requested through each selected proxy, so you can check whether a URL is censored in a particular country by using proxies located around the world.

Website
https://wiki.digitalmethods.net/Dmi/ToolCensorshipExplorer
CrocTail
CrocTail provides an interface for browsing information about several hundred thousand U.S. publicly traded corporations and their foreign subsidiaries. Information from company filings with the U.S. Securities and Exchange Commission (SEC) has been parsed and annotated by CorpWatch to provide a way for Crocodyl.org users to research and add issues related to corporate subsidiaries. CrocTail also serves as a demonstration of the features and data available through the CorpWatch API.

Crowd Voice
The new Crowdvoice.by is an open-source project tool that can be used to collect, organize and share information about causes that are important to you. This tool, which is an interactive platform where users can invite and encourage change by raising awareness, can be easily customized and embedded into your page.

Website
http://crowdvoice.org/
Dat
Dat is a secure, open-source, decentralized data sharing tool for syncing changes to data. Dat is the package manager for data, and has a JavaScript library and a powerful command line tool that make it easy to share and version-control data.

Website
https://datproject.org/
Data Stringer
Datastringer is a tool for hacker-journos. It lets you subscribe to data sources and contacts you when patterns arise or thresholds are broken. It provides re-usable tools (written in JavaScript and Node.js) that you partly configure through a graphical interface.

CDC
The Centers for Disease Control and Prevention (CDC) offers data and statistics by topic, along with tools and resources for numerous diseases and health-related subjects.

Find the Data
Deep insights from reference data. Knowledge delivered. Find the Data is a reference site that uses Graphiq’s semantic technology to deliver deep insights via data-driven articles, visualizations and research tools.

Website
http://www.findthedata.com/
Open Data by Arcgis
Share live ArcGIS open data in minutes. As part of your ArcGIS Online subscription, you can use ArcGIS Open Data to share your live, authoritative open data. Esri-hosted ArcGIS Open Data gives you a quick way to set up public-facing websites where people can easily find and download your open data in a variety of open formats.

Website
http://opendata.arcgis.com/
Hall of Justice
Follow the Money
The Nation’s only free, nonpartisan, verifiable archive of contributions to political campaigns in all 50 states.

Website
http://followthemoney.org/
FOIA Machine
FOIA Machine is an open-source platform that empowers citizens and journalists to easily prepare, file and track multiple public records requests to various governmental and public agencies worldwide. This site helps users access government documents and data that are covered by Freedom of Information Act (FOIA) laws allowing citizens to obtain information vital to the workings of their government.

Website
https://www.foiamachine.org/
Kaggle Datasets
The best place to discover and seamlessly analyze open data. Execute, share, and comment on code for any open dataset with our in-browser analytics tool, Kaggle Kernels. You can also download datasets in an easy-to-read format.

Map Light
Tracking campaign contributions. MapLight is a nonpartisan research organization that reveals money’s influence on politics. We research and compile data about the sources of campaign contributions in U.S. presidential, congressional, state, and local ballot and candidate elections. We provide journalists and citizens with transparency tools that connect data on campaign contributions, politicians, legislative votes, industries, companies, and more to show patterns of influence never before possible to see. These tools allow users to gain unique insights into how campaign contributions affect policy so they can draw their own conclusions about how money influences our political system.

Website
http://maplight.org/
Re3data
Re3data.org is a global registry of research data repositories covering different academic disciplines. It presents repositories for the permanent storage and access of data sets to researchers, funding bodies, publishers and scholarly institutions.

Website
http://www.re3data.org/
Awesome Public Data Sets
Public data sources are collected from blogs, answers, and user responses and turned into an organized and awesome list of ongoing, high-quality open datasets in public domains. Some are free, some are not.

Website
https://github.com/caesar0301/awesome-public-datasets
Index Mundi
Global data on population, demographics, trade and more. IndexMundi contains detailed country statistics, charts, and maps compiled from multiple sources. You can explore and analyze thousands of indicators organized by region, country, topic, industry sector, and type.

Website
http://www.indexmundi.com/
Google Correlate
Google Public Data
This dataset contains the World Development Indicators (WDI).

Social Explorer
Use our interactive tools to easily create and share maps, presentations and tables, or compare and analyse data and discover amazing facts.

Data USA
Data USA is the most comprehensive visualization of U.S. public data. It provides an open, easy-to-use platform that turns data into knowledge for use by all sectors and occupations.

Website
https://datausa.io/
Google Data Studio
Aggregates data from a variety of global public data sources for fast analysis, comparison and exploration.

Data World
Data World is a social network platform for people who need to have access to a vast array of high-quality open data. It’s easy to share and connect with other problem-solvers, thereby accelerating and improving decision-making, knowledge transfer and more.

Website
https://data.world/
World Bank Open Data
AWS Public Datasets
Public Datasets on AWS provides a condensed storehouse of public datasets that can be smoothly integrated into AWS cloud-based applications. AWS is hosting the public datasets at no charge for the community, and users need only pay for the compute and storage they use for their own applications.

Enigma
Enigma’s Public Data Explorer contains massive troves of scraped data on almost any topic of public import. It allows dynamic filtering, querying and searching across every record and row of every dataset.

Website
http://enigma.io/
Data Bulletin
The Data Bulletin is a central channel for the publication and analysis of data stories, and it is continuously updated with a stream of newly released government and private sector datasets that are available for download.

Website
http://databullet.in/
Uber Movement
Uber Movement data has been used to examine holiday traffic trends in Manila, measure road network performance in Australia, and understand the impact of Washington DC’s Metrorail shutdown.

Website
http://datadrivenjournalism.net/resources/uber_movement
NDC Explorer
A one-stop-shop for exploring national climate action plans.

Mapzen Mobility Explorer
Use Mapzen Mobility Explorer to understand transportation networks around the world. Mapzen is an open, sustainable, and accessible mapping platform.
Our tools let you display, search, and navigate your world.

IIAG Data Portal
The Ibrahim Index of African Governance (IIAG) data portal has a mandate to strengthen the availability and use of data in Africa. The new portal is freely available online and serves as an interactive platform for in-depth exploration of governance performance for each of the 54 countries.

Website
http://iiag.online/
Afrobarometer Online Data Analysis
Afrobarometer is an online data analysis tool (ODA) that provides free and open data about Africans’ views on a number of issues including democracy and governance. The tool gives easy access to quality data on Africa.

Weather Data
Weather Data is a collection of functions that fetch weather data (temperature, pressure, humidity, etc.) from the web for you as a clean data frame. It also has pre-loaded data sets that you can use.

Lumen
The Lumen database collects and analyzes legal complaints and requests for removal of online materials, helping Internet users to know their rights and understand the law. These data enable us to study the prevalence of legal threats and let Internet users see the source of content removals.

SpikeCharts
A macroeconomic news analytics tool that provides historical Forex market data in the form of chart snapshots based around market-moving economic news announcements.

Website
http://next.newsimpact.com/
Asian Data by Asian Development Bank
Asian Development Bank supports a free visualization tool for mobile devices that presents the latest macroeconomic and social indicators for Asia. The tool augments the stockpile of knowledge of developing member countries and the region and spreads it, so that Asia’s policies can be strengthened based on key data.

OpenAIRE
OpenAIRE is an EC-funded initiative that supports the Open Access policy of the European Commission via a technical infrastructure. The project aims to promote open scholarship and substantially improve the discoverability and reusability of research publications and data. To this end, it offers a data repository platform that allows users to host and retrieve research data.

Website
https://www.openaire.eu/
UN-Habitat Urban Data Portal
UN-Habitat has launched a new web portal featuring a wealth of city data based on its repository of research on urban trends.

Open Spending
“Only by understanding how governments spend money in our name can we have a say in how that money will affect our own lives.
The journey starts here.”

Website
https://openspending.org/
Zanran
Zanran helps you to find ‘semi-structured’ data on the web. This is the numerical data that people have presented as graphs and tables and charts. For example, the data could be a graph in a PDF report, or a table in an Excel spreadsheet, or a barchart shown as an image in an HTML page. This huge amount of information can be difficult to find using conventional search engines, which are focused primarily on finding text rather than graphs, tables and bar charts.

Website
http://www.zanran.com
Statista
PewResearch Data
Pew Research Center is a nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping America and the world. It conducts public opinion polling, demographic research, media content analysis and other empirical social science research. Pew Research Center does not take policy positions. It is a subsidiary of The Pew Charitable Trusts.

U.S. Department of Agriculture’s Plants Database
The PLANTS Database provides standardized information about the vascular plants, mosses, liverworts, hornworts, and lichens of the U.S. and its territories. It includes names, plant symbols, checklists, distributional data, species abstracts, characteristics, images, crop information, automated tools, onward Web links, and references. This information primarily promotes land conservation in the United States and its territories, but academic, educational, and general use is encouraged. PLANTS reduces government spending by minimizing duplication and making information exchange possible across agencies and disciplines.

Biology
An awesome list of high-quality open datasets in public domains (on-going). By everyone, for everyone!
This list of public data sources is collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free; however, some are not. Other amazingly awesome lists can be found in the awesome-awesomeness and sindresorhus’s awesome list.
Website
https://github.com/caesar0301/awesome-public-datasets#biology
1000 Genomes
Data from the 1000 Genomes Project is available worldwide to the scientific community, and it is freely accessible through public databases. The developers are working on a new data portal which will facilitate finding and browsing data in IGSR.

American Gut (Microbiome Project)
The American Gut project sheds light on the many connections between the human microbiome, health, and lifestyle factors. The repository is meant to be used as a project/repo, and all de-identified data is made freely available.

Broad Cancer Cell Line Encyclopedia (CCLE)
The Cancer Cell Line Encyclopedia (CCLE) project gives public access to genomic data, analysis and visualization for about 1000 cell lines. The project began in order to conduct a detailed genetic and pharmacologic characterization of a wide panel of human cancer models, as well as to develop integrated computational analyses that link distinct pharmacologic vulnerabilities to genomic patterns. In addition, the project is used to translate cell line integrative genomics into cancer patient stratification.

Broad Bioimage Benchmark Collection (BBBC)
The Broad Bioimage Benchmark Collection (BBBC) is a collection of annotated biological image sets for testing and validation. This collection of freely downloadable microscopy image sets includes images, a description of the biological application, and a type of expected results.

Cell Image Library
The Cell Image Library™ is a freely accessible, easy-to-search, public repository of reviewed and annotated images, videos, and animations of cells from a variety of organisms. The images show cell architecture, intracellular functionalities, and both normal and abnormal processes. This database is meant to promote research, education, and training, with the goal of improving human health.

Complete Genomics Public Data
Complete Genomics Analysis Tools (CGA™ Tools) are a set of open source software tools for downstream analysis of sequencing data, focused on multi-genome comparisons and format conversion. These tools can be used to conduct various family-based or case-control analyses.

Website
http://www.completegenomics.com/public-data/69-genomes/
EBI ArrayExpress
The ArrayExpress Archive of Functional Genomics Data stores data from high-throughput functional genomics experiments. It provides these data for reuse to the research community, and it is one of the best known repositories for archiving functional genomics data to support reproducible research.

EBI Protein Data Bank in Europe
The Electron Microscopy Data Bank (EMDB) covers a variety of techniques, including electron (2D) crystallography, electron tomography, and single-particle analysis. It is a public repository for electron microscopy density maps of subcellular structures and macromolecular complexes.

Electron Microscopy Public Image Archive (EMPIAR)
The Electron Microscopy Public Image Archive (EMPIAR) is built on input from the EM community, specifically input from two key workshops organized by the Protein Data Bank in Europe. It is a public resource for raw, 2D electron microscopy images where you can browse, upload, download and reprocess the thousands of raw, 2D images used to build a 3D structure.

ENCODE project
The ENCODE (Encyclopedia of DNA Elements) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active.

Gene Expression Omnibus (GEO)
GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. Array- and sequence-based data are accepted. Tools are provided to help users query and download experiments and curated gene expression profiles.

Gene Ontology (GO)
The Gene Ontology (GO) project is a collaborative effort to address the need for consistent descriptions of gene products across databases.
Annotation is the practice of capturing the activities and localization of a gene product with GO terms, providing references and indicating what kind of evidence is available to support the annotations. More information on how this is done can be found in the Guide to GO Annotation Policies. Members of the GO Consortium make their annotation data freely available to the public as part of the data accessed by AmiGO 2, the GO browser and search engine. Annotation data sets from individual databases can be found on the GO annotations page.

Global Biotic Interactions (GloBI)
GloBI contains code to normalize and integrate existing species-interaction datasets and export the resulting integrated interaction dataset. The mission of this project is to find efficient ways to normalize and integrate species-interaction data. By making this data readily available, GloBI will enable researchers and enthusiasts to answer questions about localized, one-to-one species interactions and big-picture changes in species interactions over time. For example, GloBI can answer which species an Angel Shark (Squatina squatina) eats in the Gulf of Mexico, or return the results of a query for the number of Angel Sharks feeding in the Gulf of Mexico between 2005 and 2010.

Website
https://github.com/jhpoelen/eol-globi-data/wiki#accessing-species-interaction-data
Harvard Medical School (HMS) LINCS Project
The Harvard Medical School (HMS) LINCS Center is funded by NIH grant U54 HL127365 and is part of the NIH Library of Integrated Network-based Cellular Signatures (LINCS) Program. The overall goals of this program are to collect and disseminate data and analytical tools needed to understand how human cells respond to perturbation by drugs, the environment, and mutation. Further information about LINCS and other participating Centers is available at the program website.
HMS LINCS publications provide descriptions of key findings, links to relevant datasets in the HMS LINCS Database, and custom data visualization tools. These and other tools are available via our software page.

Human Genome Diversity Project
A group of scientists at Stanford University collaborated on a large study to understand genetic diversity in human populations. They analyzed genomic DNA from 1,043 individuals from around the world, determining their genotypes at more than 650,000 SNP loci with the Illumina BeadStation technology. Genomic DNA samples from these fully consenting individuals were collected by the Human Genome Diversity Project (HGDP), in a collaboration with the Centre d’Etude du Polymorphisme Humain (CEPH) in Paris. The collection they tested is referred to as the “HGDP-CEPH Human Genome Diversity Cell Line Panel”.

Human Microbiome Project (HMP)
“Welcome to the Data Analysis and Coordination Center (DACC) for the National Institutes of Health (NIH) Common Fund supported Human Microbiome Project (HMP). This site is the central repository for all HMP data. The aim of the HMP is to characterize microbial communities found at multiple human body sites and to look for correlations between changes in the microbiome and human health. More information can be found in the menus above and on the NIH Common Fund site. All software, online resources and standard operating protocols used in, or developed as part of the HMP, will be accessible here as they become available.
If you have a protocol or software package that you would like to post on this site, or would like more information on the currently available content, please contact us via the feedback form.”

Website
http://www.hmpdacc.org/reference_genomes/reference_genomes.php
100+ Interesting Data Sets for Statistics
This site provides over 100 data sets on various interesting topics.

Website
http://rs.io/100-interesting-data-sets-for-statistics/
10k US Adult Faces Database
This database contains more than 10,000 natural face photographs and measures for over 2000 of the faces, predicting the memorability of faces using computer vision features.

3.5B Web Pages from CommonCrawl 2012
This page provides a large collection of web pages and hyperlinks for public download, comparable to the data held by companies such as Google, Yahoo, and Microsoft. The graph has been extracted from the Common Crawl 2012 web corpus and has 3.5 billion web pages and 128 billion hyperlinks.

53.5B Web clicks of 100K users in Indiana Univ.
This dataset is intended to encourage and support the study of the structure and dynamics of Web traffic networks. It provides a large dataset of about 53.5 billion HTTP requests from the users of Indiana University.

Website
http://cnets.indiana.edu/groups/nan/webtraffic/click-dataset/
A list of cities and countries contributed by community
Academic Torrents of data sharing from UMB
This service is designed to facilitate storage of all the data used in research, including datasets as well as publications. The journal focuses on its core mission of providing world class research, and this technology allows a group of editors to seed their own peer-reviewed published articles with just a torrent client.

Website
http://academictorrents.com/
ACLED (Armed Conflict Location & Event Data Project)
This is a project that collates data on political violence in developing states, in regions such as Africa and Asia. ACLED (Armed Conflict Location and Event Data Project) aims to supplement the study of civil war with models and periods of instability, public protest and regime breakdown.

Website
http://www.acleddata.com/
Actuaries Climate Index
The Actuaries Climate Index (ACI) is an educational weather and climate monitoring tool designed to inform actuaries and public policymakers about the impact of a changing climate on the United States and Canada. It is available for the USA and Canada and more than 10 of their subregions.

Affective Image Classification
In order to facilitate the study of age and gender recognition, we provide a data set and benchmark of face photos. The data included in this collection is intended to be as true as possible to the challenges of real-world imaging conditions. In particular, it attempts to capture all the variations in appearance, noise, pose, lighting and more, that can be expected of images taken without careful preparation or posing.

Website
http://www.openu.ac.il/home/hassner/Adience/data.html
Airlines OD Data 1987-2008
Airlines OD Data is a large dataset that consists of more than 100 million records of flight arrival and departure details for all commercial flights within the USA from October 1987 to April 2008. Brief introductions to useful tools (Linux command line tools and SQLite, a simple SQL database) are provided.

Website
http://stat-computing.org/dataexpo/2009/the-data.html
Allen Institute Datasets
The Allen Institute works to answer important questions in neuroscience. With public releases of new data, knowledge and tools, it advances research worldwide.

Website
http://www.brain-map.org/
AWS Amazon
Public Datasets on AWS provides a central location of public datasets that can be quickly and easily processed with elastic computing resources.

American Economic Association (AEA)
The mission of the American Economic Association (AEA) is the dissemination of economics data, which is available online to professionals, teachers, students and the general public without any subscription.

AMiner Citation Network Dataset
The AMiner Citation Network Dataset’s information is taken from DBLP, ACM, and other sources, and is meant for research purposes only. The first version contains over 600,000 papers which include their title, abstract, authors, year, venue, etc., and it also has more than 600,000 citations.

Website
http://aminer.org/citation
AMPds
The AMPds dataset is designed to help eco-feedback researchers and load disaggregation/NILM researchers to test their prototypes, algorithms, systems and models.

Website
http://ampds.org/
OpenDataMonitor
OpenDataMonitor is an overview of the many European open datasets available today. People can use this platform and its new technologies to make better use of the existing data catalogues.

Website
http://opendatamonitor.eu/frontend/web/index.php?r=dashboard%2Findex
Ancestry.com Forum Dataset
The Ancestry.com Forum Dataset uses data accumulated on online forum boards.ancestry.com, from July 2010. This message board has had active participation for over ten years, and holds more than 22 million messages by over 3.5 million authors. The dataset was created to support research on information retrieval, language technologies, and social network analysis.

Animals with Attributes
This dataset consists of over 30,000 images of 50 animal classes, with six pre-extracted feature representations for each image. The platform includes benchmark transfer-learning algorithms, in particular attribute-based classification.

AQUASTAT – Global water resources and uses
By 2050, the world’s highest rates of population growth are expected to occur in areas that have deficiencies in the agriculture sector. The AQUASTAT Main Database is provided free of charge to all users, and allows for researchers to benefit from information gathered worldwide by the Food and Agriculture Organization of the United Nations (FAO).

Website
http://www.fao.org/nr/water/aquastat/data/query/index.html?lang=en
ArcGIS Open Data
ArcGIS Open Data uses the ArcGIS Online groups you already have, in order to integrate with other open data platforms, to identify open data sources, and to allow you to quickly publish or remove your open data. Your open datasets automatically sync with the latest version of your sources.

Website
http://opendata.arcgis.com/
Archive-it (from Internet Archive)
In 1996, Internet Archive was created as a non-profit digital library and is the world’s largest public web archive. It has focused on ensuring all collections are freely and publicly accessible at www.archive.org. By creating this digital library to permanently store digital content from all over the world, the data within it is available to everyone who wishes to view it.

Climate Data Online – Australian Weather
Climate Data Online is a platform on The Bureau of Meteorology’s agency website for tracking Australia’s national weather, climate and water. It allows for the use of the Text or Map search to view daily and monthly statistics, historical weather observations, rainfall, temperature and solar tables, graphs and data. In addition, the Daily Weather Observations tool is a part of the Climate Data Online platform.

Aviation Weather Center
The Aviation Weather Center delivers consistent, timely and accurate weather information for the world airspace system. We are a team of highly skilled people dedicated to working with customers and partners to enhance safe and efficient flight.

Basketball (NBA/NCAA/Euro) Player Database and Statistics
DraftExpress LLC is a professional scouting, statistics and analytics service that has been featured on several US sports and media outlets. The goal is to expand their reach worldwide, so that the Draft Express tools can provide comprehensive and trustworthy data to scouting professionals, fans and media.

Bay Area Bike Share Data
The Bay Area Bike Share’s trip data is based on the use of the company’s bike sharing system. The data combines the travel data of 700 bikes and 70 stations across the area, including San Francisco and San Jose. This data set is great for anyone interested in these stats, and designers and developers, too.

Betfair Historical Exchange Data
Users can replay Betfair markets in real-time after the market has been settled, because fully time-stamped historical Betfair price data is now available to Betfair users. The data is collected using the existing live Betfair API and it is a proper representation of what Betfair users already experience on the website while viewing the Betfair market.

Website
http://data.betfair.com/
British Oceanographic Data Centre (BODC) – Marine data of ~22K vars
Publicly accessible marine data is collected using a variety of instruments and samplers and collated from many sources. The British Oceanographic Data Centre (BODC) maintains databanks of almost 22,000 different oceanographic variables, including physical, chemical, biological and geophysical data. BODC makes data available under a licence agreement.

Website
https://www.bodc.ac.uk/data/
Brain Catalogue
The Brain Catalogue is a data set for gathering and disseminating information regarding the diversity of the vertebrate brain. Its goal is to make high-quality data open and freely available to everyone.

Website
https://braincatalogue.org/
Brainomics
Project Brainomics combines questionnaire data, genetics and imaging, and the Brainomics/Localizer online database serves a subset of the Functional Localizer dataset.

Brazilian Weather – Historical data (In Portuguese)
SINDA is the Mission Center responsible for processing data collected remotely by Data Collection Platforms (PCDs) in Brazil. A network of PCDs and Receiving Stations installed in Brazil, together with a wide array of satellites that carry the DCS (data collection transponder) system on board, forms the Brazilian Data Collection System. SINDA manages the function, storage and dissemination of data to users.

Adience Unfiltered faces for gender and age classification
Adience Unfiltered provides a data set and benchmark of face photos. In this collection the data included is intended to be as true as possible to the challenges of real-world imaging conditions, especially images taken without careful preparation or posing.

Website
http://www.openu.ac.il/home/hassner/Adience/data.html
Center for Applied Internet Data Analysis (CAIDA) Internet Datasets
CAIDA aggregates multiple types of data at geographically and topologically diverse locations, and makes this data available to the research community while keeping the anonymity of the donors and companies intact. This is an overview of both the public and private datasets that are available.

Cambridge, MA, US, GIS data on GitHub
Cambridge GIS has posted many of its data sets on this official City of Cambridge site, as the city is dedicated to giving developers and the public access to its building-data repositories.

Canadian Legal Information Institute (CanLII)
CanLII provides free access to legal information collected from all Canadian jurisdictions. It gives access to court regulations, judgments, statutes, and tribunal decisions. In addition, CanLII Connects is a database of daily case commentary and case summaries presented by lawyers and other legal analysis professionals.

Canadian Meteorological Centre
This GRIB2 format database has free data, made available by the Meteorological Service of Canada. The database contributes information that is used by academics, private sector meteorologists, and the general public. It contains data from analysis systems and the Canadian Meteorological Centre’s Numerical Weather Prediction (NWP) models.

CBOE Futures Exchange (CFE)
The CBOE Futures Exchange (CFE) is an all-electronic, open access market model. It has dedicated market makers and market participants providing liquidity, and the Data Service is a high-availability, low latency streaming data feed. CFE’s CSV files are typically updated daily on the evening of the same trading day, or the following business morning.

Website
http://cfe.cboe.com/Data/
Center for Systemic Peace Datasets – Conflict Trends, Polities, State Fragility, etc
CSP research focuses on finding realistic possibilities for global systemic peace. This dataset tracks conditions and trends in societal-system performance at the global, regional and state levels, and includes data on sustainable human/physical development, governance, and social conflict.

CERN Open Data Portal
The CERN Open Data portal allows access to research activities performed at CERN, and it includes the necessary software and documentation required in order to understand and analyse the shared data. The products are shared under open licenses and they are citable.

Website
http://opendata.cern.ch/
Challenges in Machine Learning
Machine Learning is the science of building hardware or software that can achieve tasks by learning from examples. Numerous challenges are listed along with website information and end results from the challenges.

Website
http://www.chalearn.org/
Chars74K dataset, Character Recognition in Natural Images
This dataset addresses character recognition, a classic pattern recognition problem, for Latin script. Recognizing characters in images that contain common fonts and a uniform background is simple, but images taken with cameras and other devices are considerably more difficult, as seen in this dataset.

Cheng-Caverlee-Lee September 2009 – January 2010 Twitter Scrape
This dataset is a collection of messages gathered from September 2009 to January 2010. The data is made up of scraped public Twitter updates that were used in an academic project to study geolocation data in relation to Twitter usage.

Climate Data from University of East Anglia (UEA)
HadCRUT4 is a global temperature dataset developed by the Climatic Research Unit (University of East Anglia). It provides gridded temperature anomalies across the world, as well as separate averages for the hemispheres and the globe. CRUTEM4 is the land component and HadSST3 is the ocean component of this overall dataset, and they are expected to be updated monthly.

Website
https://crudata.uea.ac.uk/cru/data/temperature/#datter
ftp://ftp.cmdl.noaa.gov/
CLiPS Stylometry Investigation Corpus
The CSI corpus is a corpus of student texts in two genres: reviews and essays. While other applications are possible, it is meant mainly for stylometric research. The metadata includes various details about the authors and their documents.

ClueWeb09 – 1B web pages
The ClueWeb09 dataset consists of about 1 billion web pages in ten different languages. It uses data collected in January and February 2009, and was created to support research on information retrieval and related human language technologies.

ClueWeb09 FACC
Freebase Annotations of the ClueWeb Corpora, v1. Researchers at Google automatically, and therefore imperfectly, annotated English-language Web pages from the ClueWeb09 and ClueWeb12 corpora. Still, the annotations are of reasonably high quality, and for each entity they recognized with high confidence, they provide two confidence levels, its Freebase identifier (mid), and the beginning/end byte offsets.

ClueWeb12 – 733M web pages
The ClueWeb12 dataset was created to support research on information retrieval and related human language technologies. The dataset consists of 733,019,372 English web pages, collected between February 10, 2012 and May 10, 2012. ClueWeb12 is a companion or successor to the ClueWeb09 web dataset. Distribution of ClueWeb12 began in January 2013.

Collaborative Research in Computational Neuroscience (CRCNS)
The Collaborative Research in Computational Neuroscience (CRCNS) supports the integration of experimental and theoretical neuroscience research projects. These projects are collaborative and normally involve up to five senior investigators.

Website
http://crcns.org/data-sets
ClueWeb12 FACC1
Freebase Annotations of the ClueWeb Corpora, v1 (FACC1). The ClueWeb12 dataset has over 733,000,000 English web pages, and it was developed to support research on information retrieval and related human language technologies. The information was collected between February 10, 2012 and May 10, 2012.

CMU Enron Email of 150 users
The CALO Project (A Cognitive Assistant that Learns and Organizes) prepared this dataset which contains a total of about 0.5M messages and data from about 150 users – mainly senior management of Enron. This information was posted on the internet during an investigation by the Federal Energy Regulatory Commission.

CMU JASA data archive
The Journal of the American Statistical Association maintains the JASA data archive which contains contributed datasets from its published articles.

College Scorecard Data
The College Scorecard’s function is to increase transparency regarding college qualities, so that students can see how well the different schools can serve them, and so that others can see where the colleges need improvements.

COMBED
The COMBED data set comes with a loader that easily plugs into nilmtk, and it is the first energy-related data set in which data from a commercial building is sampled more than once every minute.

Website
http://combed.github.io/
CommonCrawl Web Data over 7 years
The Common Crawl Foundation is a non-profit which strives to establish an open repository of web crawl data that is considered accessible and analyzable by all. Open access to web data that is cheap and easy will provide information that allows for greater innovation in many sectors.

Complementary Collections
This Complementary Collection includes: Data Packaged Core Datasets; Database of Scientific Code Contributions; DataWrangling; Inside-r; OpenDataMonitor; Quora; RS.io; and StaTrek, among many others.

Complex Networks
The Complex Networks category includes: AMiner Citation Network Dataset; CrossRef DOI URLs; DBLP Citation dataset; NBER Patent Citations; Network Repository with Interactive Exploratory Analysis Tools, among others.

Computer Networks
Computer Networks includes: 3.5B Web Pages from CommonCrawl 2012; 53.5B Web clicks of 100K users in Indiana Univ; CAIDA Internet Datasets; ClueWeb09 – 1B web pages; ClueWeb12 – 733M web pages, and more.

Correlates of War Project
Key principles of Correlates of War (COW) include the free and timely public release of reliable data sets to the research community. COW seeks to collect, use, and distribute accurate data about international relations.

CRAWDAD Wireless datasets from Dartmouth Univ.
Community Resource for Archiving Wireless Data At Dartmouth (CRAWDAD) is a wireless network data resource for the research community. This archive stores wireless trace data from many locations, and their staff designs and improves tools for the collecting, analyzing and anonymizing of the data.

Cricsheet Matches (cricket)
Cricsheet is a dataset for cricket. It has ball-by-ball data for all Indian Premier League seasons, Men’s and Women’s Test Matches, One-day Internationals, Twenty20 Internationals, and some other international T20s.

Website
http://cricsheet.org/
Criteo click-through data
Criteo compiles hundreds of billions of dollars of actual sales data, along with an incomparable network of global publishers, so that it can understand digital user behavior and deliver pertinent, personalized ads that propel incremental sales.

Website
http://labs.criteo.com/2015/03/criteo-releases-its-new-dataset/
CrossRef DOI URLs
This dataset includes the URLs of almost 50 million journal articles which originate from CrossRef’s OAI-PMH server.

CrowdANALYTIX dataX
This platform, CrowdANALYTIX, is designed for crowdsourcing and deploying AI, NLP & Machine Learning solutions. Optimized algorithms, which are built by a crowdsourcing community of over 15,000 data scientists, are utilized and sustained on dataX.ai.

Cryptome Conspiracy Theory Items
The Cryptome Archive keeps 102,600 files dating between June 1996 and January 8, 2017. There is growing censor-tamper-implant-bowdlerize-redact-tag-track of archives, torrents, drops, shares, wikis, and disclosure sites, and Cryptome welcomes documents for publication that are otherwise forbidden by all governments. More specifically, it encourages material on freedom of expression, privacy, cryptology, dual-use technologies, national security, intelligence, and secret governance, including open, secret and classified documents, but the list is not limited to those.

Website
http://cryptome.org/
Crystallography Open Database
The Open Data principles have strong supporters in crystallography, who provide full access to crystal data for free on the internet. However, other essential crystallography databases are only available with a paid subscription.

D4D Challenge of Orange
Data for Development (D4D) Senegal is an innovation challenge open on ICT Big Data. It was designed in 2013 for the purposes of societal development, as well as data on the hours of sunshine. Anonymous data is extracted from the mobile network in Senegal, and the Orange Group and Sonatel are making the data available to international research laboratories.

Data Challenges
This Data Challenges dataset includes: Challenges in Machine Learning; CrowdANALYTIX dataX; D4D Challenge of Orange; DrivenData Competitions for Social Good; and ICWSM Data Challenge (since 2009), etc.

Data Packaged Core Datasets
The Data Packaged Core Datasets are important, commonly used datasets. They are available in open form as easy-to-use, high-quality data packages.

Website
https://github.com/datasets/
Data360
Data360’s goal is to tell compelling, data-driven stories about important events and subjects. Data360 reserves the right to adjust editorial permissions as it sees fit, in support of its purpose and principles.

Databanks International Cross National Time Series Data Archive
The Cross-National Time-Series Data Archive is a data set for over 200 countries. It contains annual data from the year 1815 and onwards. Its 196 variables are used by media, academia, finance and government agencies.

Website
http://www.cntsdata.com/
Database of Scientific Code Contributions
This dataset is a collection of open source, web-based tools designed to help you do better science.

Datacards
DataCards is a structured collection tool that tracks irregular warfare and socio-cultural topics to support assessment, analysis, modeling, and other applications. The tool indexes data sources that relate to DataCards with a summary description and evaluation of the content. The DataCards are divided into portals according to the Area of Operations (AO) of every geographic COCOM.

Website
http://datacards.org/
Datahub.io
The Datahub provides free access to many of CKAN’s (an open-source DMS) central features. You can create and manage groups of datasets, search for data, and get updates from datasets and groups. It’s accessible by the web interface or the CKAN API.

Website
https://datahub.io/dataset
Dataport
Dataport offers a mix of free and subscription tools. These tools are great for utility analysts, university researchers and research institutions. Dataport’s research tools allow you to analyze, visualize and create custom reports from a vast database of original and curated data.

DBLP Citation dataset
The Proximity DBLP database presents information on computer science publications listed in the DBLP Computer Science Bibliography. The data in this dataset were derived from a snapshot of the bibliography as of April 12, 2006. The Proximity DBLP dataset maps each entry in the original DBLP data to one of six types of objects representing different types of publications. It includes links from publications to their authors and editors and from papers to the journal, proceedings, or book in which they appear.

DBpedia – 4.58M things with 583M facts
DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia and make this information available on the Web. DBpedia allows you to ask sophisticated queries against Wikipedia, and to link the different data sets on the Web to Wikipedia data. We hope that this work will make it easier for the huge amount of information in Wikipedia to be used in some new interesting ways. Furthermore, it might inspire new mechanisms for navigating, linking, and improving the encyclopedia itself.

Delve Datasets for classification and regression (Univ. of Toronto)
Each of the Delve datasets and families has a brief overview page, and many of them have detailed documentation. The datasets are categorized as primarily assessment, development or historical. Each category also distinguishes the datasets as regression or classification, depending on how their prototasks have been created.

DIMACS Road Networks Collection
Algorithms for “shortest path” problems have been studied since the 1950s and remain an active area of research, because these problems are among the most fundamental combinatorial optimization problems and have many applications. One DIMACS goal is to make it possible for current researchers to compare their codes with each other.
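
As a small illustration of the kind of code such comparisons involve, the sketch below runs Dijkstra's algorithm on a tiny graph written in the DIMACS shortest-path text format ("p sp" header line, "a u v w" arc lines). The sample graph and the function names are made up for illustration; this is not code from DIMACS, and real road-network files are far larger.

```python
import heapq

# Tiny made-up graph in the DIMACS shortest-path (.gr) text format:
# "c" = comment, "p sp <nodes> <arcs>" = header, "a <u> <v> <w>" = directed arc.
SAMPLE_GR = """\
c illustrative graph, not real road data
p sp 4 5
a 1 2 7
a 1 3 2
a 3 2 3
a 2 4 1
a 3 4 8
"""

def parse_gr(text):
    """Return an adjacency list {node: [(neighbor, weight), ...]}."""
    graph = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or parts[0] != "a":
            continue  # skip comments, the header and blank lines
        u, v, w = int(parts[1]), int(parts[2]), int(parts[3])
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, [])  # make sure sink nodes appear too
    return graph

def dijkstra(graph, source):
    """Shortest distances from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

distances = dijkstra(parse_gr(SAMPLE_GR), source=1)
print(distances)
```

A binary heap keeps the sketch short; implementations tuned for full road networks typically use more specialized data structures.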

DRED
The DRED dataset is made available to the research community. It is meant to support testing the performance of energy disaggregation algorithms, deriving appliance usage behavior, and analyzing demand response algorithms.

Climate/Weather Datasets
This set of datasets on GitHub for Climate/Weather sources includes: Actuaries Climate Index, Australian Weather, Canadian Meteorological Centre, Climate Data from UEA, European Climate Assessment & Dataset, and many more.

Website
https://github.com/caesar0301/awesome-public-datasets#climateweather
DrivenData Competitions for Social Good
DrivenData provides data science to organizations that are using it to solve challenges, for positive social impact. DrivenData then runs online modeling competitions for data scientists, to develop the best models to solve them.

Website
http://www.drivendata.org/
Earth Models
This dataset includes observational and virtual data, as well as processing and simulation software. This data comes mainly from geodesy, tectonics, geodynamics and seismology.

Website
http://www.earthmodels.org/
Earth Science
This Earth Science dataset includes: AQUASTAT – Global water resources; BODC – marine data of ~22K vars; Earth Models; EOSDIS – NASA’s earth observing system data; and Marinexplore – Open Oceanographic Data.

ECO
The ECO data set is a comprehensive data set for non-intrusive load monitoring and occupancy detection research which was collected over a period of 8 months from 6 Swiss households.

Website
http://www.vs.inf.ethz.ch/res/show.html?what=eco-data
EconData from UMD
Economic data has been made publicly available through the EconData site, and it has been put into a standard, easy-to-use form for personal computers. These dataset series include current business indicators, national income and product accounts (NIPA), labor statistics, price indices, and industrial production.

Economic Freedom of the World Data
Fraser Institute’s Economic Freedom of North America index (EFNA) has illustrated that economic freedom is one of the main drivers of prosperity. Use their dataset and filters to research worldwide economic stats and other details.

Economics
The Economics category includes: American Economic Association (AEA); EconData from UMD; Economic Freedom of the World Data; Historical MacroEconomic Statistics; and International Trade Statistics.

EDRM Enron EMail of 151 users, hosted on S3
The Enron email data was publicly released as part of FERC’s Western Energy Markets investigation. It was converted to industry standard formats by EDRM. The data set consists of 1,227,255 emails with 493,384 attachments covering 151 custodians. The emails are provided in several formats: Microsoft PST, IETF MIME, and EDRM XML.

Education
This Education dataset includes the College Scorecard Data and the Student Data from Free Code Camp.

EIA
Data for utility plants are available from 1970, and data for non-utility plants from 1999. The EIA-906, EIA-920, EIA-923 and predecessor forms provide monthly and annual data, specifically on generation and fuel consumption at the power plant and prime mover level. In addition, a subset of plants, such as 10 MW and above steam-electric plants, also provides data for the boiler level and generator level.

Energy
The Energy category includes: AMPds; BLUEd; COMBED; Dataport; and DRED.

EOSDIS – NASA’s earth observing system data
The Earth Observing System Data and Information System (EOSDIS) is a key core capability in NASA’s Earth Science Data Systems Program. It provides end-to-end capabilities for managing NASA’s Earth science data from various sources – satellites, aircraft, field measurements, and various other programs.

Ergast Formula 1, from 1950 up to date (API)
The Ergast Developer API is an experimental web service which provides a historical … The API provides data for the Formula One series, from the beginning of the world championships in 1950. … The number of results that are returned can be controlled using a limit query parameter, up to a maximum value of 1000.

Website
http://ergast.com/mrd/db
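
The limit parameter described above can be sketched as follows. This is a minimal, offline example that only builds a request URL and makes no network call. The base path http://ergast.com/api/f1, the results.json suffix and the helper name are assumptions for illustration and should be checked against the Ergast documentation.

```python
from urllib.parse import urlencode

MAX_LIMIT = 1000  # maximum value of the limit parameter, per the description above

# Assumed base path, for illustration only.
BASE_URL = "http://ergast.com/api/f1"

def build_results_url(season, limit=30):
    """Build a results URL for one season, clamping limit to the range 1..MAX_LIMIT."""
    clamped = max(1, min(int(limit), MAX_LIMIT))
    query = urlencode({"limit": clamped})
    return f"{BASE_URL}/{season}/results.json?{query}"

# Asking for more than the maximum is clamped to 1000.
url = build_results_url(1950, limit=5000)
print(url)
```
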
European Climate Assessment & Dataset
The European Climate Assessment and Dataset (ECA&D) is a database of daily meteorological station observations across Europe and is gradually being extended to countries in the Middle East and North Africa. ECA&D has attained the status of Regional Climate Centre for high-resolution observation data in World Meteorological Organization Region VI (Europe and the Middle East).

Website
http://eca.knmi.nl/
European Social Survey
The European Social Survey runs a programme of research to support and enhance the methodology that underpins the high standards it pursues in every aspect of survey design, data collection and archiving.

Face Recognition Benchmark
A face recognition system is a computer application capable of identifying or verifying a person from a digital image or a video frame from a video source. One of the ways to do this is by comparing selected facial features from the image and a face database.

Factual Global Location Data
Data is increasingly critical to driving innovation and no one should be at a data disadvantage. We at Factual believe that data should be accessible to every developer, entrepreneur, business, or organization – anyone who needs it to build a better app, provide a better search result, make smarter software – anyone who needs data to make a better decision or help others make better decisions.

Website
https://www.factual.com/
FBI Hate Crime 2013 – aggregated data
A hate crime (also known as a bias-motivated crime) is a prejudice-motivated crime, which occurs when a perpetrator targets a victim because of his or her membership (or perceived membership) in a certain social group.

Website
https://github.com/emorisse/FBI-Hate-Crime-Statistics/tree/master/2013
Finance
The Finance category includes: CBOE Futures Exchange; Google Finance; Google Trends; NASDAQ; and OANDA.

Flickr Personal Taxonomies
In addition to allowing users to organize content by tagging it with descriptive labels, several social media sites also allow users to organize content hierarchically within personal taxonomies. Delicious, for example, lets users group related tags into bundles. Flickr lets users group related photos into sets and related sets within collections.

Website
http://www.isi.edu/~lerman/downloads/flickr/flickr_taxonomies.html
Football/Soccer resources (data and APIs)
There are three main ways to get data. You can parse/scrape it from a hobbyist project/website, you can pay for it or you can try to collect it yourself.

Website
http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/
Foursquare from UMN/Sarwat (2013)
This data set contains 2,153,471 users, 1,143,092 venues, 1,021,970 check-ins, 27,098,490 social connections, and 2,809,581 ratings that users assigned to venues, all extracted from the Foursquare application through the public API. All user information has been anonymized, including users’ geolocations. Each user is represented by an id and a geospatial location, and the same holds for venues. The data are contained in five files: users.dat, venues.dat, checkins.dat, socialgraph.dat, and ratings.dat.

Website
https://archive.org/details/201309_foursquare_dataset_umn
Fragile States Index
We are pleased to present the twelfth annual Fragile States Index. The FSI focuses on the indicators of risk and is based on thousands of articles and reports that are processed by our CAST Software from electronically available sources. We encourage others to utilize the Fragile States Index to develop ideas for promoting greater stability worldwide. We hope the Index will spur conversations, encourage debate, and most of all help guide strategies for sustainable security.

Freebase.com of people, places, and things
Freebase is an open database of the world’s information. It is built by the community and for the community, and is free for anyone to query, contribute to, build applications on top of, or integrate into their websites.

Website
http://www.freebase.com/
Gapminder World demographic databases
Gapminder is an independent Swedish foundation with no political, religious or economic affiliations. Gapminder is a fact tank, not a think tank. Gapminder fights devastating misconceptions about global development. Gapminder produces free teaching resources making the world understandable based on reliable statistics. Gapminder promotes a fact-based worldview everyone can understand. Gapminder collaborates with universities, UN, public agencies and non-governmental organizations. All Gapminder activities are governed by the board. We do not award grants. Gapminder Foundation is registered at Stockholm County Administration Board.

GDELT Global Events Database
GDELT is the largest, most comprehensive, and highest resolution open database of human society ever created. Creating a platform that monitors the world’s news media from nearly every corner of every country in print, broadcast, and web formats, in over 100 languages, every moment of every day and that stretches back to January 1, 1979 through present day, with daily updates, required an unprecedented array of technical and methodological innovations, partnerships, and whole new mindsets to bring this all together and make it a reality.

General Social Survey (GSS) since 1972
The GSS gathers data on contemporary American society in order to monitor and explain trends and constants in attitudes, behaviors, and attributes. The GSS contains a standard core of demographic, behavioral, and attitudinal questions, plus topics of special interest. Among the topics covered are civil liberties, crime and violence, intergroup tolerance, morality, national spending priorities, psychological well-being, social mobility, and stress and traumatic events.

Website
http://gss.norc.org/
Geo Spatial Data from ASU
Geospatial analysis, or just spatial analysis, is an approach to applying statistical analysis and other analytic techniques to data that has a geographical or spatial aspect. Such analysis typically employs software capable of rendering maps, processing spatial data, and applying analytical methods to terrestrial or geographic datasets, including the use of geographic information systems and geomatics.

Geo Wiki Project – Citizen-driven Environmental Monitoring
The Geo-Wiki Project is a global network of volunteers who wish to help improve the quality of global land-cover maps. Because large differences occur between existing global land-cover maps, current ecosystem and land-use science lacks crucial accurate data (for example, to determine the potential of additional agricultural land available to grow crops in Africa).

Website
http://geo-wiki.org/
GeoFabrik – OSM data extracted to a variety of formats and areas
The OpenStreetMap (OSM) project was founded in the United Kingdom in 2004 and is aimed at creating a free, world-wide geographic data set. OpenStreetMap wants to be for geodata what Wikipedia is for encyclopedic knowledge. The focus is mainly on transport infrastructure (streets, paths, railways, rivers), but OpenStreetMap also collects a multitude of points of interest, buildings, natural features and landuse information, as well as coastlines and administrative boundaries.

GeoLife GPS Trajectory from Microsoft Research
This GPS trajectory dataset was collected in the Microsoft Research Asia GeoLife project by 182 users over a period of more than five years (from April 2007 to August 2012). A GPS trajectory in this dataset is represented by a sequence of time-stamped points, each of which contains latitude, longitude and altitude. The dataset contains 17,621 trajectories with a total distance of about 1.2 million kilometers and a total duration of 48,000+ hours. These trajectories were recorded by different GPS loggers and GPS phones, and have a variety of sampling rates. 91 percent of the trajectories are logged in a dense representation, e.g. one point every 1~5 seconds or every 5~10 meters.

Website
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/
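
To make the trajectory representation concrete, the sketch below computes the length and duration of a short sequence of time-stamped points. The sample points, the tuple layout and the function names are made up for illustration and do not reflect the dataset's actual file format. Distance is approximated with the haversine formula and altitude is ignored.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def trajectory_length_km(points):
    """Sum the distances between consecutive (time, lat, lon, alt) points."""
    return sum(
        haversine_km(p[1], p[2], q[1], q[2])
        for p, q in zip(points, points[1:])
    )

# Made-up sample points, 5 seconds apart: (timestamp, latitude, longitude, altitude)
trajectory = [
    (datetime(2008, 10, 23, 2, 53, 0), 39.9847, 116.3184, 50.0),
    (datetime(2008, 10, 23, 2, 53, 5), 39.9848, 116.3186, 50.0),
    (datetime(2008, 10, 23, 2, 53, 10), 39.9850, 116.3189, 51.0),
]

total_km = trajectory_length_km(trajectory)
duration = trajectory[-1][0] - trajectory[0][0]
print(f"{total_km:.3f} km in {duration.total_seconds():.0f} s")
```
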
GeoNames Worldwide
The GeoNames database contains over 10,000,000 geographical names corresponding to over 7,500,000 unique features. All features are categorized into one of nine feature classes and further subcategorized into one of 645 feature codes. Beyond names of places in various languages, data stored include latitude, longitude, elevation, population, administrative subdivision and postal codes.

Website
http://www.geonames.org/
German Social Survey
The German General Social Survey is a national data generation program in Germany, which is similar to the American General Social Survey (GSS). Its mission is to collect and disseminate high quality statistical surveys on attitudes, behavior, and social structure in Germany.

GIS
The GIS category includes: ArcGIS Open Data portal; Cambridge, MA, US, GIS data on GitHub; Factual Global Location Data; Geo Spatial Data from ASU; and Geo Wiki Project – Citizen-driven Environmental Monitoring.

GitHub Collaboration Archive
GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.

Global Administrative Areas Database (GADM)
GADM is a spatial database of the location of the world’s administrative areas (or administrative boundaries) for use in GIS and similar software. Administrative areas in this database are countries and lower level subdivisions such as provinces, departments, bibhag, bundeslander, daerah istimewa, fivondronana, krong, landsvæðun, opština, sous-préfectures, counties, and thana.

Website
http://www.gadm.org/
Global Climate Data Since 1929
Climate information for every country in the world, with historical data that in some cases dates back to 1929. You can look up past conditions at any of the more than 9,000 stations that have information, and view annual averages, monthly averages and extended information for a single day.

Global Religious Futures Project
Pew Research Center is a nonpartisan fact tank that informs the public about the issues, attitudes and trends shaping America and the world. We conduct public opinion polling, demographic research, content analysis and other data-driven social science research. We do not take policy positions.

Google Books Ngrams (2.2TB)
N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
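
The sliding-window idea can be sketched in a few lines. This is a generic illustration on a made-up sentence, not the pipeline used to build the dataset, and the function name is chosen here for convenience.

```python
def ngrams(tokens, n):
    """Slide a window of size n over tokens and return each n-gram as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "the quick brown fox jumps over the lazy dog".split()
five_grams = ngrams(words, 5)

for gram in five_grams:
    print(" ".join(gram))
```

A sequence shorter than n yields no n-grams, so the function returns an empty list in that case.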

Google Finance
Get real-time stock quotes & charts, financial news, currency conversions, or track your portfolio with Google Finance.

Google Trends
Google Trends is a public web facility of Google Inc., based on Google Search, that shows how often a particular search-term is entered relative to the total search-volume across various regions of the world, and in various languages. The horizontal axis of the main graph represents time (starting from 2004), and the vertical is how often a term is searched for relative to the total number of searches, globally. Below the main graph, popularity is broken down by countries, regions, cities and language. Note that what Google calls “language”, however, does not display the relative results of searches in different languages for the same term(s).

Website
http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0
Google Web 5gram (1TB, 2006)
Web 1T 5-gram Version 1, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. This data is expected to be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.

Government
The Government category includes: OpenDataSoft’s list of 1,600 open data; Open Data for Africa; A list of cities and countries contributed by the community.

Gutenberg eBooks List
An electronic book (or e-book) is a book publication made available in digital form, consisting of text, images, or both, readable on the flat-panel display of computers or other electronic devices. Although sometimes defined as “an electronic version of a printed book”, some e-books exist without a printed equivalent. Commercially produced and sold e-books are usually intended to be read on dedicated e-reader devices. However, almost any sophisticated computer device that features a controllable viewing screen can also be used to read e-books, including desktop computers, laptops, tablets and smartphones.

Website
http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs
Hansards text chunks of Canadian Parliament
This release contains 1.3 million pairs of aligned text chunks (sentences or smaller … from the official records (Hansards) of the 36th Canadian Parliament.

Website
http://www.isi.edu/natural-language/download/hansard/
Hard Drive Failure Rates
The 4TB Seagate drives are our workhorse drives today and their 2.8% annualized failure rate is more than acceptable for us. Their low failure rate roughly translates to an average of one drive failure per Storage Pod per year.
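
To make the arithmetic behind that statement concrete, here is a minimal sketch. It assumes the common definition of annualized failure rate (failures divided by drive-years of operation) and a pod size of 45 drives; neither the formula nor the pod size is stated above, and the failure and drive-day counts are hypothetical values chosen to reproduce 2.8%.

```python
def annualized_failure_rate(failures: int, drive_days: float) -> float:
    """Failures per drive-year of operation, as a fraction (0.028 means 2.8%)."""
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return failures / drive_years


def expected_failures_per_year(afr: float, drive_count: int) -> float:
    """Expected number of failures in one year for a group of drives."""
    return afr * drive_count


DRIVES_PER_POD = 45  # assumption: pod size is not given in the text above

# Hypothetical counts: 28 failures over 365,000 drive-days is 2.8% annualized.
afr = annualized_failure_rate(failures=28, drive_days=365_000)
per_pod = expected_failures_per_year(afr, DRIVES_PER_POD)

print(f"AFR: {afr:.1%}, expected failures per pod per year: {per_pod:.2f}")
```

Under these assumptions the result is a little over one failure per pod per year, which is consistent with the "roughly one" figure quoted above.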

Harvard Dataverse Network of scientific data
Dataverse is an open source web application to share, preserve, cite, explore, and analyze research data. It facilitates making data available to others, and allows you to replicate others’ work more easily. Researchers, data authors, publishers, data distributors, and affiliated institutions all receive academic credit and web visibility. Dataverse provides a robust infrastructure for data stewards to host data.

Healthcare
The Healthcare category includes: EHDP Large Health Data Sets; Gapminder World demographic databases; Medicare Coverage Database (MCD), U.S.; Medicare Data Engine of medicare.gov Data; Medicare Data File.

Heart Rate Time Series from MIT
Heart rate is the speed of the heartbeat measured by the number of contractions of the heart per minute (bpm). The heart rate can vary according to the body’s physical needs, including the need to absorb oxygen and excrete carbon dioxide. It is usually equal or close to the pulse measured at any peripheral point. Activities that can provoke change include physical exercise, sleep, anxiety, stress, illness, and ingestion of drugs.

HES
The Household Electricity Use Study monitored domestic electrical appliances in a total of 251 owner-occupier households across England over the period of April 2010 to April 2011. Twenty six of these households were monitored for a full year; whilst the remaining 225 were monitored for the duration of one month on a rolling basis throughout the trial period.

HFED
HFED is a high-frequency EMI dataset with traces taken from a signal analyser and a USRP. Our data processing and visualization script is open source and is accessible on GitHub.

Website
http://hfed.github.io/
High-Resolution Contact Networks from Wearable Sensors
This data set contains the temporal network of contacts between individuals measured in an office building in France, from June 24 to July 3, 2013. This page provides a collection of datasets obtained through the SocioPatterns sensing platform.

Historical MacroEconomic Statistics
Historical Statistics argues for trans-historical reformulations of the basic economic concepts of production, work and consumption. Important issues concern how to deal with violence, double counting of transaction costs, human capital formation, non-market activities and causation of final consumption. Production, work and consumption are defined as relations between events, the subject matter and the agent. Eight different definitions of GDP are presented.

Homeland Infrastructure Foundation-Level Data
HIFLD (Homeland Infrastructure Foundation-Level Data) provides National foundation-level geospatial data within the open public domain that can be useful to support community preparedness, resiliency, research, and more. The data is available for download as CSV, KML, Shapefile, and accessible via web services to support application development and data visualization.

Hubway Million Rides in MA
Data geeks of all stripes! Students, professors, designers, artists, data nerds by profession and those who just do it for fun.
Visualizations, animations, maps, info graphics that tell us something new or illustrate the awesomeness of more than half a million bike trips in one year. Winning entries were both smart and beautiful, and included interactive data analysis tools, animations, artistic representations, and even a video game.

Human Connectome Project
The HCP (Human Connectome Project) is mapping the human connectome as accurately as possible in a large number of normal adults and is making this data freely available to the scientific community using a powerful, user-friendly informatics platform.

Humanitarian Data Exchange
The Humanitarian Data Exchange. Find, share and use humanitarian data all in one place.

Website
https://data.hdx.rwlabs.org/
ICOS PSP Benchmark
The ICOS PSP benchmarks repository contains an adjustable real-world family of benchmarks suitable for testing the scalability of classification/regression methods. When we test a machine learning method we usually choose a test suite containing datasets with a broad set of characteristics, as we are interested in knowing how the learning method reacts to a variety of scenarios. The PSP field provides us with a whole family of real-world classification/regression problems that can be adjusted almost arbitrarily in terms of number of variables, number of classes, class balance, etc. Thus, these datasets are an ideal benchmark suite for data mining methods.

ICPSR (UMICH)
ICPSR advances and expands social and behavioral research, acting as a global leader in data stewardship and providing rich data resources and responsive educational opportunities for present and future generations.
ICPSR provides leadership and training in data access, curation, and methods of analysis for the social science research community.

Image Processing
This list of public data sources are collected and tidied from blogs, answers, and user responses. Most of the data sets listed below are free, however, some are not. Other amazingly awesome lists can be found in the awesome-awesomeness and sindresorhus’s awesome list.

Website
http://icwsm.cs.umbc.edu/
ImageNet (in WordNet hierarchy)
ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. Currently we have an average of over five hundred images per node. We hope ImageNet will become a useful resource for researchers, educators, students and all of you who share our passion for pictures.
On this page, you will find some useful information about the database, the ImageNet community, and the background of this project.

Website
http://www.image-net.org/
IMDb Database
This page describes various alternate ways to access IMDb locally by holding copies of the data directly on your system.

Indoor Scene Recognition
This database contains 67 indoor categories and more than 15,000 images. The number of images varies across categories, but there are at least 100 images per category. All images are in JPG format. The images provided here are for research purposes only.

Infochimps
Infochimps Cloud is a suite of robust, scalable cloud services that make it faster and far less complex to develop and deploy enterprise Big Data applications. Whether you need real–time analytics on multi–source streaming data, a scalable NoSQL database or an elastic, cloud-based Hadoop cluster — Infochimps Cloud is your easiest step to Big Data.

Website
http://www.infochimps.com/
INFORM Index for Risk Management
INFORM is a global, open-source risk assessment for humanitarian crises and disasters. It can support decisions about prevention, preparedness and response. INFORM’s user-friendly interface allows policymakers to prioritize countries by multiple dimensions of risk and visualize disaster risk. The results of INFORM are also available for the past 5 years, so trends can be analyzed as well. It is a powerful tool that actors involved in disaster prevention, preparedness and response can use to collaborate, plan efficiently, and save lives.

Institute for Demographic Studies
The Institute for Demographic Studies or INED, is a public research institute specialized in population studies that works in partnership with the academic and research communities at national and international levels.

Website
http://www.ined.fr/en/
Institute of Education Sciences
The Institute of Education Sciences (IES) is the independent, non-partisan statistics, research, and evaluation arm of the U.S. Department of Education. IES’ stated mission is to provide scientific evidence on which to ground education practice and policy and to share this information in formats that are useful and accessible to educators, parents, policymakers, researchers, and the public.

Website
http://eric.ed.gov/
Integrated Marine Observing System (IMOS) – roughly 30TB of ocean measurements; on S3
IMOS has been routinely operating a wide range of observing equipment throughout Australia’s coastal and open oceans, making all of its data accessible to the marine and climate science community, other stakeholders and users, and international collaborators. IMOS is designed to be a fully-integrated, national system, observing at ocean-basin and regional scales, and covering physical, chemical and biological variables.

Website
https://imos.aodn.org.au/ , http://imos-data.s3-website-ap-southeast-2.amazonaws.com/
International Affective Picture System, UFL
The International Affective Picture System (IAPS) is being developed to provide a set of normative emotional stimuli for experimental investigations of emotion and attention. The goal is to develop a large set of standardized, emotionally-evocative, internationally-accessible, color photographs that includes contents across a wide range of semantic categories.

International Economics Database; various data tools
The purpose of the Widukind project is to provide a unique website accessible for all users, allowing them to free download public economic data as released by national producers (national institutes of statistics, central banks) as well as international ones (IMF, World Bank, OECD, Eurostat, ECB).

Website
http://widukind.cepremap.org/ ; https://github.com/Widukind
International HapMap Project
The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap is used to find genetic variants affecting health, disease and responses to drugs and environmental factors.

Website
http://hapmap.ncbi.nlm.nih.gov/downloads/index.html.en
International Networks Archive
The International Networks Archive aims to assemble data sets relevant to empirical research on mapping the global web in a central location and to standardize them so the various indicators can be combined. Given the immense amount of work that defining a global web involves, we argue for disseminating the raw data as widely as possible so as to recruit the largest possible number of collaborators.

International Social Survey Program ISSP
The ISSP is a continuing annual programme of cross-national collaboration on surveys covering topics important for social science research. It brings together pre-existing social science projects and coordinates research goals, thereby adding a cross-national, cross-cultural perspective to the individual national studies. The ISSP researchers develop questions which are meaningful and relevant to all countries which can be expressed in an equal manner in different languages. The results of the surveys provide a cross-national and cross-cultural perspective to individual national studies.

Website
http://www.issp.org/
International Studies Compendium Project
The International Studies Compendium Project, published in association with the International Studies Association (ISA), is available as an online reference or as a 12-volume set in print. This resource is the most comprehensive reference work of its kind for the fields of international studies and international relations. It comprises a series of literature review essays that are refereed, rigorous, comprehensive, and neutral in tone; each details fruitful lines of research up to the current “state of the art”. As such, the essays provide an invaluable resource for students and scholars new to a particular area of research who need an overview that maps the existing scholarship in a useful way.

International Trade Statistics
Firms scanning the world market for opportunities to diversify products, markets and suppliers, and trade support institutions (TSIs) setting priorities in terms of trade promotion, sectoral performance, partner countries and trade development strategies must have detailed statistical information on international trade flows in order to utilize resources effectively.

Internet Product Code Database
First of all, the term UPC has been deprecated; the new term is UCC-12. But the world has moved beyond that. As of January 2005, retailers in the U.S. are supposed to be able to support the EAN/UCC-13 code (the rest of the world has done this for years), which uses similar symbology and one additional digit.
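
As an illustration of how the 12-digit and 13-digit codes relate, the sketch below pads a 12-digit code with a leading zero and verifies the standard mod-10 check digit (weights 1 and 3 alternating from the left). The function names are hypothetical and the sample code is just a commonly cited example number; this is not code from the database itself.

```python
def ean13_check_digit(first12: str) -> int:
    """Compute the EAN-13 check digit for the first 12 digits (weights 1,3,1,3,...)."""
    if len(first12) != 12 or not (first12.isascii() and first12.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def upc_to_ean13(upc: str) -> str:
    """Turn a 12-digit code into its 13-digit form by prefixing a zero.

    The leading zero carries weight 1 and contributes nothing to the sum,
    so the original check digit stays valid; it is verified here anyway.
    """
    if len(upc) != 12 or not (upc.isascii() and upc.isdigit()):
        raise ValueError("expected exactly 12 ASCII digits")
    ean = "0" + upc
    if ean13_check_digit(ean[:12]) != int(ean[-1]):
        raise ValueError("check digit does not match")
    return ean


print(upc_to_ean13("036000291452"))
```

A code with a wrong final digit raises `ValueError`, which is a cheap way to catch typing or scanning errors before a lookup.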

Website
http://www.upcdatabase.com/
James McGuire Cross National Data
The contents include: Health and Health Care Data; Infant, Child, and Maternal Mortality; Economic Affluence; Democracy, Civil and Political Rights, Women in Parliament; Water and Sanitation.

Website
http://jmcguire.faculty.wesleyan.edu/welcome/cross-national-data/
Joint External Debt Data Hub
The Joint External Debt Hub (JEDH)—jointly developed by the Bank for International Settlements (BIS), the International Monetary Fund (IMF), the Organization for Economic Cooperation and Development (OECD) and the World Bank (WB)—brings together external debt data and selected foreign assets from international creditor/market and national debtor sources.

Website
http://www.jedh.org/
Journal of Cell Biology DataViewer
The JCB DataViewer is a web-based, multi-dimensional image data-viewing application. It is a tool for visualization and simple analysis of original image data files associated with JCB articles. Image data are archived by the Journal and may be freely accessed by readers using the JCB DataViewer. Download of author-provided image data and associated metadata in OME-TIFF format is also possible with author permission, allowing for independent analysis of image data irrespective of acquisition or viewing software. Although the JCB DataViewer is designed to host and facilitate sharing and analysis of original microscopy image data, authors may also upload other types of original image data as supplements to their manuscripts, including histology and electron micrographs and digital scans of gels or blots.

Kaggle Competition Data
Kaggle is a platform for data science competitions. We help you solve difficult problems, recruit strong teams, and amplify the power of your data science talent.

Website
https://www.kaggle.com/
KDD Cup by Tencent 2012
The dataset represents a sampled snapshot of Tencent Weibo users’ preferences for various items: the recommendation of items to users and the history of users’ ‘following’ actions. It is larger in scale than other publicly available datasets released to date. It also provides richer information in multiple domains such as user profiles, social graph, and item category, which may hopefully evoke deeply thoughtful ideas and methodology.

Website
http://www.kddcup2012.org/
KDNuggets Data Collections
KDnuggets is a leading site on Business Analytics, Big Data, Data Mining, and Data Science.

Keel Repository for classification, regression and time series
KEEL aims to provide machine learning researchers with a set of benchmarks to analyze the behavior of learning methods. Concretely, it is possible to find benchmarks already formatted in KEEL format for classification (such as standard, multi-instance or imbalanced data), semi-supervised classification, regression, time series and unsupervised learning. In several domains, such as statistics, signal processing or econometrics, a time series is a sequence of data points, measured typically at successive times spaced at uniform time intervals. Time series data have a natural temporal ordering. This makes time series analysis distinct from other common data analysis problems, in which there is no natural ordering of the observations.

Labeled Faces in the Wild (LFW)
A Database for Studying Face Recognition in Unconstrained Environments Most face databases have been created under controlled conditions to facilitate the study of specific parameters on the face recognition problem. These parameters include such variables as position, pose, lighting, background, camera quality, and gender. While there are many applications for face recognition technology in which one can control the parameters of image acquisition, there are also many applications in which the practitioner has little or no control over such parameters. This database, Labeled Faces in the Wild, is provided as an aid in studying the latter, unconstrained, recognition problem. The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life.

Lahman’s Baseball Database
Sean Lahman is an award-winning database journalist and author. He develops interactive databases and data driven stories for the Rochester Democrat and Chronicle and other Gannett newspapers and websites. He also writes a weekly column on emerging technology and innovation.

Website
http://www.seanlahman.com/baseball-archive/statistics/
Landsat 8 on AWS
Landsat 8 data is available for anyone to use via Amazon S3. All Landsat 8 scenes from 2015 are available along with a selection of cloud-free scenes from 2013 and 2014. All new Landsat 8 scenes are made available each day, often within hours of production.

Lending Club Loan Data
Lending Club is the world’s largest online marketplace connecting borrowers and investors. Lending Club’s platform has the potential to profoundly transform traditional banking over the next decade. Lending Club is helping reinvent the consumer lending industry. All loans facilitated by Lending Club are issued by a bank and subject to the same consumer protection, fair lending, and disclosure requirements as any other bank loan.

Website
https://www.lendingclub.com/info/download-data.action
Leveraging open data to understand urban lives
Data mining, one of the hottest topics in the media in recent years, offers a new way to help companies, organizations and even ordinary people make plans and decisions for the near future. We are convinced by the knowledge derived from data, mostly because data recording historical events is more solid and reliable than people’s experience, which is influenced by so many random factors in reality.

Website
http://xiaming.me/posts/2014/10/23/leveraging-open-data-to-understand-urban-lives/
List of all countries in all languages
Umpirsky Country List: List of all languages with names and ISO 639-1 codes in all languages and all data formats.

Localytics Data Visualization Challenge
Data visualization is viewed by many disciplines as a modern equivalent of visual communication. It involves the creation and study of the visual representation of data, meaning information that has been abstracted in some schematic form, including attributes or variables for the units of information.

Machine Comprehension Test (MCTest) of text from Microsoft Research
Understanding unstructured text is a major goal within natural language processing. Comprehension tests pose questions based on short text passages to evaluate such understanding. In this work, we investigate machine comprehension on the challenging MCTest benchmark. Partly because of its limited size, prior work on MCTest has focused mainly on engineering better features.

Website
http://research.microsoft.com/en-us/um/redmond/projects/mctest/index.html
Machine Learning
The Machine Learning category includes: Delve Datasets for classification and regression (Univ. of Toronto); Discogs Monthly Data; eBay Online Auctions (2012); IMDb Database; Keel Repository for classification, regression and time series.

Machine Learning Data Set Repository
This repository manages the following types of objects. Data Sets: raw data as a collection of similarly structured objects. Material and Methods: descriptions of the computational pipeline. Learning Tasks: learning tasks defined on raw data.

Website
http://mldata.org/
Machine Translation of European languages
We provide training data for four European language pairs, and a common framework (including a baseline system). The task is to improve current methods. This can be done in many ways. For instance, participants could try to improve word alignment quality, phrase extraction, or phrase scoring, or add new components to the open source software of the baseline system.

Website
http://statmt.org/wmt11/translation-task.html#download
MacroData Guide by Norsk samfunnsvitenskapelig datatjeneste
NSD is the Data Protection Official for Research for all the Norwegian universities, university colleges and several hospitals and research institutes. The Data Protection Official scheme implies that the requirement for obtaining licenses from the Data Inspectorate for a greater part of research projects is replaced by a notification requirement where NSD is the last instance for reviewing applications for licenses. This means that the Data Inspectorate has delegated part of its responsibility to NSD in relation to the Personal Data Act and Health Register Act.

Website
http://nsd.uib.no/
Marine Traffic – ship tracks, port calls and more
MarineTraffic maintains a database of real-time and historical ship positions sourced from the largest station network and Satellite constellation.

Medicare Coverage Database (MCD), U.S
The Medicare Coverage Database (MCD) contains all National Coverage Determinations (NCDs) and Local Coverage Determinations (LCDs), local articles, and proposed NCD decisions. The database also includes several other types of National Coverage policy related documents, including National Coverage Analyses (NCAs), Coding Analyses for Labs (CALs), Medicare Evidence Development & Coverage Advisory Committee (MEDCAC) proceedings, and Medicare coverage guidance documents.

Medicare Data Engine of medicare.gov Data
These data allow you to compare the quality of care at every Medicare and Medicaid-certified nursing home in the country, including over 15,000 nationwide.

Website
https://data.medicare.gov/
Medicare Data File
The Centers for Medicare & Medicaid Services (CMS) makes identifiable data files (IDFs) available to certain stakeholders as allowed by federal laws and regulations as well as CMS policy. IDFs contain protected health information (PHI) and/or personally identifiable information (PII) and CMS is committed to ensuring this information is protected.

Website
http://go.cms.gov/19xxPN4
MeSH, the vocabulary thesaurus used for indexing articles for PubMed
MeSH is the National Library of Medicine’s controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity.

Microsoft Data Science for Research
Microsoft Research provides a continuously refreshed collection of free datasets, tools and resources designed to advance the state of the art of academic research in many areas of computer science, such as natural language processing and computer vision. In addition, you can browse datasets and apply for cloud-based compute cycles available under the Azure for Research program.

Website
http://aka.ms/Data-Science
Microsoft MAchine Reading COmprehension Dataset (or MS MARCO)
Microsoft Machine Reading Comprehension (MS MARCO) is a new large scale dataset for reading comprehension and question answering. In MS MARCO, all questions are sampled from real anonymized user queries. The context passages, from which answers in the dataset are derived, are extracted from real web documents using the most advanced version of the Bing search engine. The answers to the queries are human generated if they could summarize the answer.

Million Song Dataset
The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Minneapolis Institute of Arts metadata
A collection of metadata associated with the collection of the Minneapolis Institute of Art.

Minnesota Population Center
The Minnesota Population Center (MPC) is a University-wide interdisciplinary cooperative for demographic research. The MPC serves more than 80 faculty members and research scientists from eight colleges and institutes at the University of Minnesota. As a leading developer and disseminator of demographic data, we also serve a broader audience of some 60,000 demographic researchers worldwide.

Website
https://www.ipums.org/
MIT Cancer Genomics Data
Estimating Dataset Size Requirements for Classifying DNA Microarray Data.

Website
http://portals.broadinstitute.org/cgi-bin/cancer/datasets.cgi
MIT Reality Mining Dataset
This experiment explored the capabilities of smartphones that enable social scientists to investigate human interactions beyond the traditional survey-based or simulation-based methodologies. These data sets were collected with tools developed in the MIT Human Dynamics Lab.

Website
http://realitycommons.media.mit.edu/realitymining.html
MNIST database of handwritten digits, 70,000 examples
It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST.

Mobile Social Networks from UMASS
The Proximity Mobile Social Networks database is based on data collected by the Privacy, Internetworking, Security, and Mobile Systems. The data provide a record of successful mote-to-mote connections over the course of each trial.

Website
https://kdl.cs.umass.edu/display/public/Mobile+Social+Networks
More Song Datasets
The goal is to be able to train on the whole dataset, and then easily compare the results with previous publications. All files have been uploaded to the Echo Nest API. The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Website
http://labrosa.ee.columbia.edu/millionsong/pages/additional-datasets
MovieLens Data Sets
The data sets were collected over various periods of time, depending on the size of the set. Before using these data sets, please review their README files for the usage licenses and other details.

Multi-Domain Sentiment Dataset (version 2.0)
The Multi-Domain Sentiment Dataset contains product reviews taken from Amazon.com from many product types (domains). Some domains (books and dvds) have hundreds of thousands of reviews. Others (musical instruments) have only a few hundred. Reviews contain star ratings (1 to 5 stars) that can be converted into binary labels if needed.
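
One way to do the conversion to binary labels is sketched below. The cut-off is an assumption made for illustration (more than 3 stars counts as positive, fewer than 3 as negative, and 3-star reviews are dropped as ambiguous); the function name and the sample reviews are hypothetical and not taken from the dataset.

```python
from typing import List, Optional, Tuple


def stars_to_label(stars: int) -> Optional[int]:
    """Map a 1-5 star rating to 1 (positive), 0 (negative) or None (ambiguous)."""
    if not 1 <= stars <= 5:
        raise ValueError("star rating must be between 1 and 5")
    if stars > 3:
        return 1
    if stars < 3:
        return 0
    return None  # assumption: 3-star reviews are treated as neutral and skipped


# Hypothetical reviews as (text, stars) pairs.
reviews = [
    ("Great read", 5),
    ("Fell apart in a week", 1),
    ("It was fine", 3),
    ("Solid purchase", 4),
]

labeled: List[Tuple[str, int]] = []
for text, stars in reviews:
    label = stars_to_label(stars)
    if label is not None:
        labeled.append((text, label))

print(labeled)
```

Dropping the middle rating keeps the two classes cleanly separated; if every review must be kept, a different threshold would be needed.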

Museums
An awesome list of high-quality open datasets in public domains (on-going). By everyone, for everyone!

NASA Exoplanet Archive
The first space mission to search for Earth-sized and smaller planets in the habitable zone of other stars in our neighborhood of the galaxy.

NASA Global Imagery Browse Services
The Global Imagery Browse Services (GIBS) system is a core EOSDIS component which provides a scalable, responsive, highly available, and community standards based set of imagery services. These services are designed with the goal of advancing user interactions with EOSDIS’ inter-disciplinary data through enhanced visual representation and discovery.

NASDAQ
The Nasdaq Stock Market is an American stock exchange. It is the second-largest exchange in the world by market capitalization.

National Weather Service GIS Data Portal
This page contains links to data that are distributed via web server technology in the Open Geospatial Consortium (OGC). In addition, some of the NWS data is available as geo-referenced image files such as geo-gifs. NWS provides access to watches, warnings, advisories, and other similar products in the Common Alerting Protocol (CAP) and Atom Syndication Format (ATOM).

Website
http://www.nws.noaa.gov/gis/
Natural Earth – Vectors and Rasters of the World
Natural Earth is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales. Featuring tightly integrated vector and raster data, with Natural Earth you can make a variety of visually pleasing, well-crafted maps with cartography or GIS software.

Natural History Museum (London) Data Portal
The Museum is committed to open access and open science, and has launched the Data Portal to make its research and collections datasets available online. It allows anyone to explore, download and reuse the data for their own research.

NBER Patent Citations
These data comprise detailed information on almost 3 million U.S. patents granted between January 1963 and December 1999, all citations made to these patents between 1975 and 1999 (over 16 million), and a reasonably broad match of patents to Compustat (the data set of all firms traded in the U.S. stock market).

Website
http://nber.org/patents/
NCBI Proteins
A collection of genomics, functional genomics, and genetics studies and links to their resulting datasets. This resource describes project scope, material, and objectives and provides a mechanism to retrieve datasets that are often difficult to find due to inconsistent annotation, multiple independent submissions, and the varied nature of diverse data types which are often stored in different databases. A database that includes protein sequence records from a variety of sources, including GenPept, RefSeq, Swiss-Prot, PIR, PRF, and PDB.

NCBI Taxonomy
The Taxonomy Database is a curated classification and nomenclature for all of the organisms in the public sequence databases. This currently represents about 10% of the described species of life on the planet.

NCI Genomic Data Commons
The NCI Genomic Data Commons (GDC) is a unified knowledge base that promotes sharing of genomic and clinical data between researchers and facilitates precision medicine in oncology.

NDAR
The National Database for Autism Research (NDAR) is an NIH-funded research data repository that aims to accelerate progress in autism spectrum disorders (ASD) research through data sharing, data harmonization, and the reporting of research results. NDAR also serves as a scientific community platform and portal to multiple other research repositories, allowing for aggregation and secondary analysis of data.

Website
https://ndar.nih.gov/
Netflix Prize
The Netflix Prize was an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings without any other information about the users or films, i.e. without the users or the films being identified except by numbers assigned for the contest.
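
To illustrate the kind of prediction task described above, here is a minimal sketch of a simple baseline predictor (global mean plus per-user and per-film offsets). It is not the contest data format or any contestant's method: the `(user_id, film_id, rating)` tuples, the function names, and the sample ratings are all hypothetical, and the 1 to 5 star scale is assumed.

```python
from collections import defaultdict


def fit_baseline(ratings):
    """Fit a global mean plus per-user and per-film offsets.

    `ratings` is a list of (user_id, film_id, rating) tuples; the IDs are
    plain numbers, mirroring the anonymised setup described above.
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)

    # Average deviation of each user's ratings from the global mean.
    user_dev = defaultdict(list)
    for user, _, r in ratings:
        user_dev[user].append(r - mu)
    user_bias = {u: sum(d) / len(d) for u, d in user_dev.items()}

    # Average residual of each film after removing the user offset.
    film_dev = defaultdict(list)
    for user, film, r in ratings:
        film_dev[film].append(r - mu - user_bias[user])
    film_bias = {f: sum(d) / len(d) for f, d in film_dev.items()}

    return mu, user_bias, film_bias


def predict(model, user, film, lo=1.0, hi=5.0):
    """Predict a rating, falling back to 0 offset for unseen users or films."""
    mu, user_bias, film_bias = model
    estimate = mu + user_bias.get(user, 0.0) + film_bias.get(film, 0.0)
    return min(hi, max(lo, estimate))  # clamp to the assumed 1-5 star scale


# Hypothetical toy data: two users, two films.
sample = [(1, 10, 5), (1, 11, 3), (2, 10, 4), (2, 11, 2)]
model = fit_baseline(sample)
```

Calling `predict(model, 2, 11)` then returns an estimate for user 2 and film 11 using nothing but the numeric IDs and earlier ratings.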

Network Repository with Interactive Exploratory Analysis Tools
The first interactive data and network repository with real-time analytics. Network Repository is not only the first interactive repository, but also the largest network and graph data repository, with more than 500 donations. This large, comprehensive collection of network graph data is useful for making significant research findings as well as for benchmarking across a wide variety of applications and domains (e.g., network science, bioinformatics, machine learning, data mining, physics, and social science). It includes relational, attributed, heterogeneous, streaming, spatial, and time series data as well as non-relational machine learning data. All data sets are easily downloaded in a standard, consistent format.

NeuroData
Our goal is to work together with neuroexperimentalists to discover fundamental principles governing the relationship between mind and brain, via building and deploying open source data-driven tools that run at scale on open access data. This includes analytics, databases, cloud computing, and Web-services applied to both big neuroimages and big neurographs.

Website
http://neurodata.io/
NeuroElectro
The goal of the NeuroElectro Project is to extract information about the electrophysiological properties (e.g. resting membrane potentials and membrane time constants) of diverse neuron types from the existing literature and place it into a centralized database.

Website
http://neuroelectro.org/
NIMH Data Archive
The National Institute of Mental Health Data Archive (NDA) makes available human subjects data collected from hundreds of research projects across many scientific domains. The NDA provides infrastructure for sharing research data, tools, methods, and analyses enabling collaborative science and discovery. De-identified human subjects data, harmonized to a common standard, are available to qualified researchers. Summary data is available to all.

NIST complex networks data collection
In analyzing large-scale complex networks, it is important to establish a standard dataset against which algorithms and claims can be compared and verified. Currently, it is often difficult to track down the original data used for computational experiments. Much of it is floating around in various formats throughout the net, embedded in papers, and often difficult to get from the authors. Moreover, the datasets are often modified (filtered) by research groups interested in different attributes, so that even when the name and descriptions match a citation in a paper, there is no guarantee that the data is identical.

NOAA Bering Sea Climate
There is an explosion of interest in Northern Hemisphere climate, and new science programs are highlighting the importance of recent changes in the Arctic on mid-latitude climate impacts. The Bering Sea is one of the world’s major fisheries, and fisheries from Alaskan waters represent half of the landed U.S. catch of fish and shellfish. Because of the changes going on in the Arctic, the future evolution of the Bering Sea climate/ecosystem is more uncertain. This is a symmetric problem: climate change impacts ecosystems, and ecosystems serve as indicators for climate change.

NOAA Climate Datasets
NCEI is the world’s largest provider of weather and climate data. Land-based, marine, model, radar, weather balloon, satellite, and paleoclimatic data are just a few of the types of datasets available.

NOAA Realtime Weather Models
Numerical Weather Prediction (NWP) data are the form of weather model data we are most familiar with on a day-to-day basis. NWP focuses on taking current observations of weather and processing these data with computer models to forecast the future state of weather. Knowing the current state of the weather is just as important as the numerical computer models processing the data. Current weather observations serve as input to the numerical computer models through a process known as data assimilation to produce outputs of temperature, precipitation, and hundreds of other meteorological elements from the oceans to the top of the atmosphere.

Website
http://www.ncdc.noaa.gov/data-access/model-data/model-datasets/numerical-weather-prediction
NOAA SURFRAD Meteorology and Radiation Datasets
The Global Monitoring Division (formerly CMDL) of NOAA/ESRL, part of the National Oceanic and Atmospheric Administration, conducts sustained observations and research related to the source and sink strengths, trends, and global distributions of atmospheric constituents: those capable of forcing change in the climate of Earth through modification of the atmospheric radiative environment, those that may cause depletion of the global ozone layer, and those that affect baseline air quality.

Notre Dame Global Adaptation Index (ND-GAIN)
The Notre Dame Global Adaptation Initiative (ND-GAIN) is part of the Climate Change Adaptation Program of the University of Notre Dame’s Environmental Change Initiative (ND-ECI). The ND-GAIN Country Index follows a data-driven approach to show which countries are best prepared to deal with global changes brought about by overcrowding, resource constraints and climate disruption. The Index aims to unlock global adaptation solutions in the corporate and development communities to save lives and improve livelihoods while strengthening market positions.

NSSDC (NASA) data of 550 spacecraft
The NASA Space Science Data Coordinated Archive (NSSDCA) serves as the permanent archive for NASA space science mission data. “Space science” means astronomy and astrophysics, solar and space plasma physics, and planetary and lunar science. As the permanent archive, NSSDCA teams with NASA’s discipline-specific space science “active archives”, which provide access to data to researchers and, in some cases, to the general public.

NYC Taxi Trip Data 2009-
The yellow and green taxi trip records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations, trip distances, itemized fares, rate types, payment types, and driver-reported passenger counts. The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data.

Website
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
NYC Uber trip data April 2014 to September 2014
This directory contains data on over 4.5 million Uber pickups in New York City from April to September 2014, and 14.3 million more Uber pickups from January to June 2015. Trip-level data on 10 other for-hire vehicle (FHV) companies, as well as aggregated data for 329 FHV companies, is also included. All the files are as they were received on August 3, Sept. 15 and Sept. 22, 2015.

Website
https://github.com/fivethirtyeight/uber-tlc-foil-response
OANDA
OANDA Corporation provides Internet-based foreign exchange (forex) trading and currency information services to individuals, corporations, portfolio managers, and financial institutions worldwide. The company provides fxTrade, a forex trading platform that enables users to access charting features, multiple sub-accounts to test various trading strategies and financial news, and market analysis; fxTrade Mobile, a forex trading platform for iPhone, iPad, and Android devices that provides charting features, financial news and market analysis, and more; MetaTrader 4 (MT4), a Windows-based electronic trading platform with automated trading capabilities; MT4 Hedging Compatibility,

Website
http://www.oanda.com/
OASIS
The Open Access Series of Imaging Studies (OASIS) is a project aimed at making MRI data sets of the brain freely available to the scientific community. By compiling and freely distributing MRI data sets, we hope to facilitate future discoveries in basic and clinical neuroscience. OASIS is made available by the Washington University Alzheimer’s Disease Research Center, Dr. Randy Buckner at the Howard Hughes Medical Institute (HHMI) at Harvard University, the Neuroinformatics Research Group (NRG) at Washington University School of Medicine, and the Biomedical Informatics Research Network (BIRN).

Website
http://www.oasis-brains.org/
OONI: Open Observatory of Network Interference – Internet censorship data
The Open Observatory of Network Interference (OONI) is a free software project under the Tor Project which aims to detect internet censorship, traffic manipulation and signs of surveillance around the world through the collection and processing of network measurements. Since late 2012, OONI has collected millions of network measurements across more than 90 countries around the world, shedding light on multiple cases of network interference.

Open Crime and Policing Data in England, Wales and Northern Ireland
Individual crime and anti-social behaviour (ASB) incidents, including street-level location information and subsequent police and court outcomes associated with the crime.

Website
https://data.police.uk/data/
Open Data Certificates (beta)
Open Data Certificate is a free online tool, developed and maintained by the Open Data Institute, to assess and recognise the sustainable publication of quality open data. It assesses the legal, practical, technical and social aspects of publishing open data using best practice guidance.

Open Data for Africa
The AfDB Statistical Data Portal has been developed in response to the increasing demand for statistical data and indicators relating to African Countries. The Portal provides multiple customized tools to gather indicators, analyze them, and export them into multiple formats. With the Data Portal, you can visualize Socio-Economic indicators over a period of time, gain access to presentation-ready graphics and perform comprehensive analysis on a Country and Regional level.

Open Library Data Dumps
Open Library provides dumps of all the data in various formats. Currently these dumps are generated every month.

Open Mobile Data by MobiPerf
MobiPerf is an open source application for measuring network performance on mobile platforms. You can measure your network’s throughput and latency, as well as other useful network metrics. MobiPerf also performs measurements at regular intervals in the background. The data is collected either anonymously or from your selected account, which allows you to see your own data. The user credentials collected are not shared outside of this site, and any data used in research projects in universities are anonymized before use.

Website
https://console.cloud.google.com/storage/browser/openmobiledata_public/?pli=1
Open Multilingual Wordnet
The individual wordnets have been made by many different projects and vary greatly in size and accuracy. We have (i) extracted and normalized the data, (ii) linked it to Princeton WordNet 3.0 and (iii) put it in one place. The Open Multilingual Wordnet and its components are open: they can be freely used, modified, and shared by anyone for any purpose. There is a fuller list of wordnets at the Global Wordnet Association’s Wordnets in the World page.

Open Traffic collection
This collection includes:
Open Traffic Data project
OpenTraffic.io
MDM-Portal
Datex2 Portal

Website
https://github.com/graphhopper/open-traffic-collection
Open-ODS (structure of the UK NHS)
The Organisation Data Service (ODS) is responsible for publishing organisation and practitioner codes, along with related national policies and standards. We’re also responsible for the ongoing maintenance of the organisation and person nodes of the Spine Directory Service, the central data repository used within various NHS systems and services.

Website
https://digital.nhs.uk/home
OpenAddresses
OpenAddresses.io is “a global collection of address data sources, open and free to use”, which was created by OSMers User:ToeBee, User:Ingalls and User:Lxbarth, among others. It started out as a spreadsheet of government address datasets maintained by ToeBee, but now has an aggregated download, an API, and a website.

Website
http://openaddresses.io/
OpenCorporates Database of Companies in the World
OpenCorporates is the largest open database of companies and company data in the world, with in excess of 100 million companies in a similarly large number of jurisdictions. Our primary goal is to make information on companies more usable and more widely available for the public benefit, particularly to tackle the use of companies for criminal or anti-social purposes, for example corruption, money laundering and organised crime.

Website
https://opencorporates.com/
OpenDataNetwork
A search engine of all Socrata powered data portals. Publish data and share. Find data and build. Answer questions.

OpenDataSoft’s list of 2,600+ open data portals
We rolled up our sleeves and started aggregating all of the Open Data portals we could get our hands on. We are thrilled to present you the first version of our comprehensive list of 2600+ Open Data portals around the world.

Website
https://www.opendatasoft.com/a-comprehensive-list-of-all-open-data-portals-around-the-world/
OpenFlights – airport, airline and route data
OpenFlights is a tool that lets you map your flights around the world, search and filter them in all sorts of interesting ways, calculate statistics automatically, and share your flights and trips with friends and the entire world (if you wish). It’s also the name of the open-source project to build the tool.

OpenfMRI
The OpenfMRI project has provided a resource for researchers to make their MRI data openly available to the research community.

Website
https://openfmri.org/
OpenPaymentsData, Healthcare financial relationship data
Open Payments is the federally run transparency program that collects information about financial relationships between healthcare providers (doctors and teaching hospitals) and the companies that make payments to them, and makes it available to you. These relationships can involve money for research activities, gifts, speaking fees, meals, or travel. One of the ways we provide this data to the public is through this search tool, which allows you to search for a doctor, teaching hospital, or company that has made payments. Exploring this information, and discussing the results you find with your healthcare provider, can help you make more informed healthcare decisions.
Use the Data Explorer tool to view the full data sets and create visualizations such as charts and graphs to get an in-depth look at the data submitted by applicable manufacturers and GPOs.

OpenSNP genotypes data
With openSNP you can share stories about your genetic variations and phenotypes, and discover the stories of other users. openSNP gets the latest open access journal articles on genetic variations from the Public Library of Science. Phenotypes are the observable characteristics of your body, such as height, eye color or preference for coffee. Share your phenotype with other openSNP users, and find others with similar characteristics and traits.
Your data may help scientists discover new genetic associations!

Website
https://opensnp.org/
OpenStreetMap (OSM)
OpenStreetMap is a project that creates and distributes free geographic data for the world. We started it because most maps you think of as free actually have legal or technical restrictions on their use, holding back people from using them in creative, productive, or unexpected ways. OpenStreetMap is a federative project. That means that a lot of essential resources are provided by third-party providers.

OSU Financial data
We provide a vibrant research and teaching atmosphere, characterized by extensive collaboration with a shared goal of conducting leading-edge research and providing students with the skills they need to succeed in the field of finance.

Website
https://fisher.osu.edu/academic-departments/department-finance
Our World in Data
Our World in Data (OWID) is an online publication that shows how living conditions are changing. The aim is to give a global overview and to show changes over the very long run, so that we can see where we are coming from and where we are today. We have a list of all current and future data-entries that shows which topics we will cover in this publication. There will be 275 entries. Offline we are constantly collecting material for the future entries; this catalogue includes much more than ten thousand references to visualisations, data sources, and research papers.

Website
https://ourworldindata.org/
Pathguide – Protein-Protein Interactions Catalog
Pathguide contains information about more than 500 biological pathway related resources and molecular interaction related resources.

Website
http://www.pathguide.org/
Personae Corpus
The Personae corpus was collected for experiments in Authorship Attribution and Personality Prediction. It consists of 145 Dutch-language essays, written by 145 different students (BA in Linguistics and Literature at the University of Antwerp, Belgium). Each student also took an online MBTI personality test, allowing personality prediction experiments. The corpus was controlled for topic, register, genre, age, and education level.
We make available the original texts, a syntactically annotated version of the texts, and the metadata.

Website
http://www.clips.uantwerpen.be/datasets/personae-corpus
PewResearch Society Data Collection
Pew Research Center makes its data available to the public for secondary analysis after a period of time.

Physics
An awesome list of high-quality open datasets in public domains (on-going). By everyone, for everyone!

Pinhooker: Thoroughbred Bloodstock Sale Data
An R package to compile data sets of historic results from thoroughbred sales.

PLAID
Pleiades – Gazetteer and graph of ancient places
Pleiades is a community-built gazetteer and graph of ancient places. It publishes authoritative information about ancient places and spaces, providing unique services for finding, displaying, and reusing that information under open license. It publishes not just for individual human users, but also for search engines and for the widening array of computational research and visualization tools that support humanities teaching and research.

Website
https://pleiades.stoa.org/
Protein Data Bank
The RCSB PDB builds upon the data by creating tools and resources for research and education in molecular biology, structural biology, computational biology, and beyond.

Protein-protein interaction network
Interaction detection methods have led to the discovery of thousands of interactions between proteins, and discerning relevance within large-scale data sets is important to present-day biology. The dataset consists of the protein-protein interaction network described and analyzed in (1) and available as an example in the software package PIN (2).

Website
http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
Psychiatric Genomics Consortium
The purpose of the Psychiatric Genomics Consortium (PGC) is to unite investigators around the world to conduct meta- and mega-analyses of genome-wide genomic data for psychiatric disorders. This website provides information about the organization, implementation, and results of the PGC.

Psychology/Cognition
An awesome list of high-quality open datasets in public domains (on-going). By everyone, for everyone!

Website
https://github.com/caesar0301/awesome-public-datasets#psychology-cognition
PubChem Project
PubChem is a database of chemical molecules and their activities against biological assays.

PubGene (now Coremine Medical)
Public Domains
An awesome list of high-quality open datasets in public domains (on-going). By everyone, for everyone!

Website
https://github.com/caesar0301/awesome-public-datasets#public-domains
PyPI and Maven Dependency Network
As time is always running out, I don’t think I’ll have the time in a while to work again on the data I collected for the last three articles: Going offline with Maven, State of the Maven/Java dependency graph, and State of the PyPi/Python dependency graph.

Website
https://ogirardot.wordpress.com/2013/01/31/sharing-pypimaven-dependency-data/
Quandl
The world’s most powerful data lives on Quandl. Designed for professionals, Quandl delivers financial, economic and alternative data to over 150,000 people worldwide. Our customers include the world’s top hedge funds, asset managers and investment banks.

Website
https://www.quandl.com/
Rapid7 Sonar Internet Scans
Project Sonar is a community effort to improve security through the active analysis of public networks. This includes running scans across public internet-facing systems, organizing the results, and sharing the data with the information security community. The three components to this project are tools, datasets, and research.

RDataMining
RDataMining.com is a leading website on R and data mining, providing examples, documents, tutorials, resources and training on data mining and analytics with R.

REDD
REDD, a data set for energy disaggregation. The data contains power consumption from real homes, for the whole house as well as for each individual circuit in the house (labeled by the main type of appliance on that circuit). The data is intended for use in developing disaggregation methods, which can predict, from only the whole-home signal, which devices are being used (though any other uses are of course encouraged as well).

Website
http://redd.csail.mit.edu/
Reddit Comments
Restaurants Health Score Data in San Francisco
Retrosheet Baseball Statistics
Retrosheet is a non-profit organization whose website features major league baseball box scores and play-by-play narratives for almost every contest from 1871–1872, 1874, 1911 National League, and since 1913. It also includes scores from every Major League Baseball game played since the 1871 season (what is officially the inception of Major League Baseball history), as well as all All-Star, League Championship Series and World Series games.

Revolution Analytics Collection
The Revolution Analytics collection contains some of the data sets we use at Revolution to show off the Parallel External Memory Algorithms in our RevoScaleR package. The collection includes easily accessible “tarred-up” versions of the Airlines Data Set, Census5PCT2000 data set and an artificial set of mortgage default data.

Rijksmuseum Historical Art Collection
The Rijksmuseum is a Dutch national museum dedicated to arts and history in Amsterdam. The museum has on display 8,000 objects of art and history, from their total collection of 1 million objects from the years 1200–2000, among which are some masterpieces by Rembrandt, Frans Hals, and Johannes Vermeer. The museum also has a small Asian collection, which is on display in the Asian pavilion.

RITA Airline On-Time Performance data
RITA/BTS transport data collection (TranStat)
Sample R data sets
This package contains a variety of datasets.

Website
http://stat.ethz.ch/R-manual/R-patched/library/datasets/html/00Index.html
Sanger Catalogue of Somatic Mutations in Cancer (COSMIC)
COSMIC, the Catalogue Of Somatic Mutations In Cancer, is the world’s largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer.

Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC)
The Genomics of Drug Sensitivity in Cancer Project is part of a Wellcome Trust funded collaboration between The Cancer Genome Project at the Wellcome Trust Sanger Institute (UK) and the Center for Molecular Therapeutics, Massachusetts General Hospital Cancer Center (USA). This collaboration integrates the expertise at both sites toward the goal of identifying cancer biomarkers that can be used to identify genetically defined subsets of patients most likely to respond to cancer therapies.

Website
http://www.cancerrxgene.org/
SaudiNewsNet Collection of Saudi Newspaper Articles (Arabic, 30K articles)
The dataset currently contains 31,030 Arabic articles (with a total number of 8,758,976 words). The articles were extracted from the following Saudi newspapers (sorted by number of articles): Al-Riyadh; Al-Jazirah; Al-Yaum; Al-Eqtisadiya.

SciencesPo World Trade Gravity Datasets
Scopus Citation Database
Scopus is the largest abstract and citation database of peer-reviewed literature: scientific journals, books and conference proceedings.

Search Engines
This Search Engines section includes: Agriculture; Biology; Climate/Weather; Complex Networks.

Website
https://github.com/caesar0301/awesome-public-datasets#search-engines
Sequence Read Archive (SRA)
The Sequence Read Archive (SRA) stores raw sequence data from “next-generation” sequencing technologies including Illumina, 454, IonTorrent, Complete Genomics, PacBio and Oxford Nanopore. In addition to raw sequence data, SRA now stores alignment information in the form of read placements on a reference sequence.

Skytrax’ Air Travel Reviews Dataset
This Skytrax User Reviews Dataset includes: 41396 Airline Reviews; 17721 Airport Reviews; 1258 Seat Reviews; 2264 Lounge Reviews.

Website
https://github.com/quankiquanki/skytrax-reviews-dataset
Sloan Digital Sky Survey (SDSS) – Mapping the Universe
The Sloan Digital Sky Survey has created the most detailed three-dimensional maps of the Universe ever made, with deep multi-color images of one third of the sky, and spectra for more than three million astronomical objects. Learn and explore all phases and surveys—past, present, and future—of the SDSS.

Website
http://www.sdss.org/
Smithsonian Institution Global Volcano and Eruption Database
The mission of GVP is to document, understand, and disseminate information about global volcanic activity. We do this through four core functions: reporting, archiving, research, and outreach. The data systems that lie at our core have been in development since 1968 when GVP began documenting the eruptive histories of volcanoes.

Website
http://volcano.si.edu
SMS Spam Collection in English
The SMS Spam Collection v.1 is a public set of labeled SMS messages that have been collected for mobile phone spam research. It consists of a single collection of 5,574 real, non-encoded English messages, each tagged as either legitimate (ham) or spam.

Website
http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
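
As a rough illustration of working with ham/spam labels like those described above, here is a minimal sketch that parses labeled messages and counts each class. It assumes one message per line with the label first, separated from the text by a tab; that layout, the function names, and the sample messages are assumptions made for this example, so check the downloaded file before relying on it.

```python
from collections import Counter

VALID_LABELS = ("ham", "spam")


def parse_sms_line(line):
    """Split one 'label<TAB>message' line into (label, text).

    The tab-separated layout is an assumption for this sketch.
    """
    label, _, text = line.rstrip("\n").partition("\t")
    if label not in VALID_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return label, text


def label_counts(lines):
    """Count ham and spam messages, skipping blank lines."""
    return Counter(parse_sms_line(line)[0] for line in lines if line.strip())


# Hypothetical sample messages, not taken from the real collection.
sample_lines = [
    "ham\tSee you at lunch?\n",
    "spam\tWIN a free prize now!\n",
    "ham\tOk\n",
]
counts = label_counts(sample_lines)
```

With a real file, the same `label_counts` function could be passed an open file handle instead of the sample list.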
Social Networks
This Social Networks section includes: 72 hours #gamergate Twitter Scrape; Ancestry.com Forum Dataset over 10 years; Cheng-Caverlee-Lee September 2009 – January 2010 Twitter Scrape; CMU Enron Email of 150 users.

Website
https://github.com/caesar0301/awesome-public-datasets#social-networks
Social Sciences
The main social sciences include economics, political science, human geography, demography, psychology, and sociology. In a wider sense, social science also includes some fields in the humanities such as anthropology, archaeology, jurisprudence, history, and linguistics.
