This project provides a sortable list of datasets that may be of use to journalists. Most of the datasets and descriptions come from Jeremy Singer-Vine’s Data Is Plural, “a weekly newsletter highlighting useful and curious datasets”.
You can save a dataset for future reference by clicking the "+" in the top right-hand corner of its card.
The original Data Is Plural spreadsheet and this remixed project are published under a Creative Commons Attribution-ShareAlike 4.0 International license.
You can find the GitHub repo for this project here.
The Global Investigative Journalism Network has published a collection of resources for finding and working with data. They also have resources available in Arabic, Bangla, Chinese, French, Portuguese, Russian and Spanish.
Voice of America does not endorse and has not verified these datasets. This page was created as a resource for journalists to find potentially useful data to help report stories.
VOA provides trusted and objective news and information in over 45 languages to a measured weekly audience of more than 275.2 million people around the world. For over 75 years, VOA journalists have told American stories and supplied content that many people cannot get locally: objective news and information about the US, their specific region and the world.
Data scientist George Ho has compiled a dataset of 589,000+ clues to cryptic crosswords, “collected from various blogs and publicly available digital archives.” The collection, released earlier this month, is available to download and also explore online. (For example.) Its “datasheet” describes the motivation, collection process, composition, and more. — : September 22, 2021
Tags: entertainment, games
The Spanish Electoral Archive, published this summer, provides detailed results of all municipal, regional, general, and European Parliament elections in Spain since the country’s transition to democracy in the late 1970s. The project’s datasets standardize records from various official sources that, in many cases, drill down to the level of individual ballot boxes. — : September 22, 2021
Civic technologist Forest Gregg has begun filing FOIA requests to the National Labor Relations Board to collect newly-available data on employers’ voluntary recognition of employee unions, drawn from the agency’s relevant notification form. The records so far include 70+ recognitions in late 2020 and early 2021, plus nearly 1,000 from a prior reporting program between 2007 and 2009; they list the employer, union, case number, relevant dates, and more. Previously: Union election results (DIP 2021.05.05). — : September 22, 2021
The US Geological Survey’s National Water Information System provides data on the “occurrence, quantity, quality, distribution, and movement of surface and underground waters” around the country. The surface water measurements — mainly streamflow and gage height — come from tens of thousands of monitoring sites. (Here’s a site near Baton Rouge before and after hurricanes Ida and Nicholas.) There’s an API for accessing the records, including daily summaries and real-time measurements. Previously: NOAA’s water-level data (DIP 2016.03.23). [h/t Michael Allen] — : September 22, 2021
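As a rough illustration of working with the API mentioned above, the sketch below only builds a request URL for daily streamflow values; it does not send the request. The endpoint, the query parameter names, the parameter codes (`00060` for streamflow, `00065` for gage height) and the placeholder site number are assumptions that do not appear in the entry, so check them against the USGS documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed base URL for the NWIS daily-values service (not stated in the entry above).
NWIS_DAILY_VALUES = "https://waterservices.usgs.gov/nwis/dv/"


def build_daily_values_url(site, start, end, parameter_cd="00060"):
    """Return a daily-values query URL for one monitoring site.

    parameter_cd is assumed to be "00060" for streamflow
    and "00065" for gage height.
    """
    query = urlencode({
        "format": "json",
        "sites": site,
        "startDT": start,   # ISO dates, e.g. "2021-08-25"
        "endDT": end,
        "parameterCd": parameter_cd,
    })
    return f"{NWIS_DAILY_VALUES}?{query}"


# "00000000" is a placeholder, not a real site number.
example_url = build_daily_values_url("00000000", "2021-08-25", "2021-09-20")
print(example_url)
```

Swapping in a real site number and passing the URL to an HTTP client of your choice would be the next step.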
The COVID-19 School Data Hub, which launched last week, is “a central database for educators, researchers, and policymakers to understand how the COVID-19 pandemic shaped students’ modes of learning in 2020-21.” The project’s team, led by economist Emily Oster, has gathered data on learning models (in-person, virtual, or hybrid) used by public schools and districts at various points in time, their masking policies, and reported COVID-19 cases. The datasets can be downloaded in bulk or by state. The coverage and granularity vary by topic and state; the project’s documentation describes the collection methods and availability. — : September 22, 2021
A UK House of Commons Library research briefing in July included a spreadsheet of “MPs who have left the Chamber voluntarily, been asked to withdraw, or who have been suspended,” along with the date, reason, and suspension period. Another briefing, published the same day, “attempts to capture all instances where an apology has been made on the floor of the House of Commons since 1979.” [h/t Andi Fugard] — : September 8, 2021
Tags: government, politics
The ScanCode LicenseDB provides information about 1,700+ software licenses, ranging from the common (e.g., MIT License) to the idiosyncratic (SQLite Blessing) to the obscure (Ubuntu Font License). The records, which are part of a broader license-detection toolkit, list each license’s core phrasing, general category, custodian, relevant URLs, and other details. [h/t Philippe Ombredanne] — : September 8, 2021
Tags: technology
The Trivedi Centre for Political Data has published a dataset of “all parties that have contested national and state elections in India since 1962,” with an eye toward unifying the information across name changes. For each legislative level and state, the dataset indicates each party’s first and last year contesting elections, number of seats won, number of female and Scheduled Caste/Tribe candidates fielded, and more. [h/t Vijdan Mohammad Kawoosa + Gilles Verniers] — : September 8, 2021
The US Bureau of Transportation Statistics’ summer update to its National Transportation Atlas Database adds a dataset on “alternative fuel corridors” — stretches of highway with a sufficient frequency of fueling stations. (Electric vehicle corridors must, for instance, have charging stations at least every 50 miles.) The release covers electric, hydrogen, propane, compressed natural gas, and liquefied natural gas infrastructure, and complements a prior dataset of 56,000+ such stations. Related: The Department of Energy’s maps and datasets of alternative fueling stations and corridors. [h/t Morgan Stevens] — : September 8, 2021
Tags: energy, transportation
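The 50-mile spacing rule for electric vehicle corridors described in the entry above can be checked with a few lines of code. This is a minimal sketch under simplifying assumptions: stations are represented by hypothetical mileposts along a single road, only the gaps between consecutive stations are measured (the stretches before the first and after the last station are ignored), and the function names are made up for illustration. It is not the agency's actual designation method.

```python
def largest_gap(mileposts):
    """Return the longest distance, in miles, between consecutive stations."""
    ordered = sorted(mileposts)
    # Pair each station with the next one along the road.
    # The default covers the case of fewer than two stations.
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0)


def meets_spacing_rule(mileposts, limit_miles=50):
    """True if no two neighboring stations are more than limit_miles apart."""
    return largest_gap(mileposts) <= limit_miles


# Hypothetical mileposts, not taken from the dataset.
sparse_corridor = [0, 42, 95, 130]   # gaps: 42, 53, 35
dense_corridor = [0, 42, 88, 130]    # gaps: 42, 46, 42

print(meets_spacing_rule(sparse_corridor))  # False, because of the 53-mile gap
print(meets_spacing_rule(dense_corridor))   # True
```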
OpenSanctions, an open-source project that launched its website last week, is “an international database of persons and companies of political, criminal, or economic interest.” It combines and standardizes data from 20+ sources, such as the US Treasury’s sanctions lists (DIP 2018.02.21), Interpol’s Red Notices, members of EU parliament, and the CIA’s index of world leaders. The project uses a detailed schema to represent the particulars of each entity, including aliases, known cryptocurrency wallets, aircraft registrations, sanction dates, and more. You can download the data with those detailed representations or in simpler formats. [h/t Friedrich Lindenberg] — : September 8, 2021
The website Botwiki “was created in July 2015 by Stefan Bohacek with the goal of preserving examples of interesting and creative online bots” and providing tutorials for building them. Bohacek has curated a dataset of 70+ popular examples running on Twitter, drawn from Botwiki and from Tully Hansen’s Omnibots list. Among them: @year_progress, @nyt_first_said, and @tiny_star_field. — : September 1, 2021
China Labour Bulletin, founded in 1994 as a monthly newsletter, is a Hong Kong–based organization “that supports and actively engages with the workers’ movement in China.” Its map and dataset of worker strikes and protests provides details on 13,000+ events since 2011, including their location, date, and description; industry categories and ownership types; employee demands; and authorities’ response. [h/t The China Data Lab] — : September 1, 2021
Ting Zhang et al. have trained an algorithm to identify wind turbines in coastal satellite imagery, and have used it to build a dataset listing the location and construction month of 6,924 turbines offshore of 14 countries between 2015 and 2019. To test the algorithm’s accuracy, the researchers compared its results to other sources, including the US Wind Turbine Database (DIP 2018.04.25), the UK’s Renewable Energy Planning Database, the European Marine Observation and Data Network, and Open Power System Data (DIP 2019.08.14). — : September 1, 2021
Tags: energy
ClaimReview is an open standard for adding structured information to fact-check articles, such as the specific claim reviewed, where it appeared, the fact-checking organization, and the reviewer’s rating. The schema has been adopted by a range of big-name publishers, including the Washington Post, PolitiFact, and Univision, as well as smaller outlets around the world. The structured-data website Data Commons hosts a feed of 29,000+ ClaimReview-tagged fact-checks, as well as a curated subset. — : September 1, 2021
Tags: journalism, media
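To make the ClaimReview entry above more concrete, here is a hedged sketch of what one such record might look like when built in Python and serialized as JSON. The property names follow the schema.org ClaimReview vocabulary as best understood here; every value (URLs, organization name, claim text, rating) is invented for illustration and does not come from the Data Commons feed.

```python
import json

# One hypothetical fact-check record. All values are placeholders.
claim_review = {
    "@context": "https://schema.org",
    "@type": "ClaimReview",
    "url": "https://factchecker.example/reviews/123",
    # The specific claim reviewed.
    "claimReviewed": "Example claim text goes here.",
    # Where the claim appeared.
    "itemReviewed": {
        "@type": "Claim",
        "appearance": {
            "@type": "CreativeWork",
            "url": "https://news.example/original-statement",
        },
    },
    # The fact-checking organization.
    "author": {"@type": "Organization", "name": "Example Fact-Check Desk"},
    # The reviewer's rating, given here as a textual label.
    "reviewRating": {"@type": "Rating", "alternateName": "False"},
}

serialized = json.dumps(claim_review, indent=2)
print(serialized)
```

Publishers typically embed a record like this in the fact-check article's HTML so that it can be read by machines.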
The Aid Worker Security Database is “a global compilation of reports on major security incidents involving deliberate acts of violence affecting aid workers,” with more than 3,200 records since 1997. Researchers gather, evaluate, and categorize information from official reports, partnerships with humanitarian agencies, news media, and other sources. For each incident, the database indicates its date and location; the number of workers killed, wounded, or kidnapped; their general affiliations; the type of attacker and means of attack; a brief description; and more. [h/t The Costs of War Project] — : September 1, 2021
The Ergast Developer API provides seven decades of Formula One racing results, with details on each season, race, and result since 1950, each lap time since 1996, each pit stop since 2012, and more. In addition to querying the API, you can also explore the data online and download it in full. As seen in: FiveThirtyEight’s “Who’s The Best Formula One Driver Of All Time?” [h/t Eric Gardner + Cameron Yick + David Ortiz] — : August 25, 2021
The US Patent and Trademark Office has built a series of machine-learning models to identify patents that involve AI technologies, such as natural language processing or computer vision. Its Artificial Intelligence Patent Dataset, released in June, focuses on eight of these technologies and provides predictions of their presence (or absence) in 13.2 million granted patents and patent applications since 1976, finding hits in 11% of the documents. [h/t Nicholas Rada] — : August 25, 2021
Tags: history, technology
Open Buildings, a project led by Google Research’s Ghana office, has published a dataset of 516 million building footprints in Africa, estimated from satellite imagery. The dataset, which you can explore online and download as CSVs, spans roughly 64% of the continent. It describes each estimated footprint’s coordinates, shape, and area, plus the detection algorithm’s degree of confidence. Previously: Footprints of buildings in the US (DIP 2018.07.18), and in Canada and New Zealand (DIP 2019.09.25). — : August 25, 2021
Tags: architecture, mapping
UNU-WIDER’s Government Revenue Dataset “aims to present a complete picture of government revenue and tax trends over time.” The project, updated this month, currently covers 196 countries and goes back, in most cases, to the early 1980s. It draws on data from OECD and IMF reports and includes dozens of variables, such as total revenue, natural resource taxes, and foreign grants received. Previously: The OECD’s Global Revenue Statistics Database (DIP 2018.08.01). [h/t Lisa Chauvet & Marin Ferry + Erik Feiring] — : August 25, 2021
Tags: government, money
US Congress members and candidates must report all stock purchases and sales exceeding $1,000, as well as those of their spouses and dependent children. Those records are technically available through the House’s and Senate’s financial disclosure portals, but neither provides bulk data. Software engineer Tim Carabat’s Senate Stock Watcher and House Stock Watcher websites fill that gap by making the transactions available to browse, query, and download. In the case of the House, where reports are still provided as PDFs, Carabat also coordinates the manual transcription of those files. — : August 25, 2021
Nestflix, a new website by designer/developer Lynn Fisher, catalogs more than 400 fictional films and TV shows that appear within actual films and TV shows. For instance: 30 Rock’s The Rural Juror and Home Alone’s Angels with Filthy Souls. The project is open-source; the data files for each item include the title, a description, a quotation, the parent show/film, and more. — : August 18, 2021
MEAD, as the publication acronymizes itself, “provides sweet, intoxicating data for your investigations of early North America and the Atlantic World.” The initiative, affiliated with the University of Pennsylvania, hosts a few dozen datasets on a range of topics; many focus on Pennsylvania and on slavery, while other subjects include George Washington’s shipping invoices and the 19th century children’s book industry. [h/t Noah Veltman] — : August 18, 2021
The Upworthy Research Archive describes 32,000+ headline-testing experiments conducted in 2013–15 by Upworthy, the online publication that popularized a once-ubiquitous style of headline. The dataset, contributed by the publication to a team of academics, is split into three tranches for use in different phases of research. In total, it covers 150,000+ headline-plus-image permutations; for each, it provides the headline, an image identifier, the number of viewers assigned to see it, the number who clicked, and other details. — : August 18, 2021
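Since each record in the archive described above pairs the number of viewers assigned with the number who clicked, a natural first calculation is the click rate per headline variant. The sketch below assumes a simplified record layout; the field names (`test_id`, `headline`, `viewers`, `clicks`) and the sample numbers are hypothetical and may not match the archive's actual columns.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    test_id: str    # which experiment the variant belongs to
    headline: str
    viewers: int    # number of viewers assigned to see this variant
    clicks: int     # number of those viewers who clicked

    @property
    def click_rate(self) -> float:
        # Guard against division by zero for variants nobody saw.
        return self.clicks / self.viewers if self.viewers else 0.0


def best_variant_per_test(variants):
    """Map each test_id to its variant with the highest click rate."""
    best = {}
    for variant in variants:
        current = best.get(variant.test_id)
        if current is None or variant.click_rate > current.click_rate:
            best[variant.test_id] = variant
    return best


# Invented sample data, not taken from the archive.
sample = [
    Variant("t1", "Headline A", viewers=3000, clicks=45),
    Variant("t1", "Headline B", viewers=3100, clicks=62),
    Variant("t2", "Headline C", viewers=2500, clicks=30),
]

for test_id, winner in best_variant_per_test(sample).items():
    print(test_id, winner.headline, round(winner.click_rate, 4))
```

A higher raw click rate does not by itself show a statistically meaningful difference; a proper analysis would also account for sample size.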
Incarceration Transparency, a project undertaken by law students and faculty at Loyola University New Orleans, has compiled data on more than 830 deaths in Louisiana jails, prisons, and juvenile detention centers, primarily between 2015 and 2019, based on 130+ public records requests. The information includes each decedent’s name, age, sex, race, and trial status; the date, facility, and cause of death; and other factors. Read more: The New Yorker’s recent profile of the project and the professor leading it. Previously: Deaths in US jails, via Reuters (DIP 2020.10.21). — : August 18, 2021
Michael A. Allen et al. are gathering and standardizing data on the United States military’s global presence. The project’s CSV files and R package include annual, country-level troop counts between 1950 and 2020, drawn from prior work by economist Tim Kane and from the government’s Defense Manpower Data Center. They also include a listing of US military bases abroad, primarily sourced from Base Nation, a book by political anthropologist David Vine (who, disclosure, is a cousin of mine). Related: Vine also maintains various lists of US military bases abroad since 1776 and has published a follow-up book, The United States of War. [h/t u/smurfyjenkins] — : August 18, 2021
In 2004, Seattle historian Rob Ketcherside began a quest to find every public clock in the city, past and present. In 2015, he gathered his findings into a dataset that identifies each clock, owner, and piece of supporting evidence, which he continues to update. Ketcherside has also compiled datasets of drive-in public markets, Seattle street renamings, and the city’s new buildings in 1890. — : August 11, 2021
Tags: miscellaneous
Projects at the Observatory for Political Conflict and Democracy analyze newspaper articles to construct datasets about election campaigns, protest events, and public debates across a range of European countries. The projects’ codebooks describe how they select the articles — often spanning multiple decades — and categorize the people, parties, actions, and issues in them. [h/t Neil Dullaghan] — : August 11, 2021
The Atrocity Forecasting Project has constructed (and recently updated) a dataset of targeted mass killings, which it defines as “the direct killing of noncombatant members of a group by an organized armed force or collective with the intent of destroying the group, or intimidating the group by creating a perception of imminent threat to its survival.” The dataset includes 207 such episodes from 1946 to 2020; it lists each atrocity’s timing and location, targeted groups, type of perpetrator, intent, severity, and other aspects. — : August 11, 2021
The Legal Services Corporation, a nonprofit that Congress has directed to study local eviction procedures and their effects, has partnered with Temple University to compile a database of eviction laws in all US states and territories, and in a sample of 30 cities. It lists the causes for which landlords can evict tenants, the remedies available to unlawfully evicted tenants, filing fees, service requirements, and much more, with pointers to the relevant sections of each law. Previously: Eviction rates from the Eviction Lab (DIP 2018.04.18). [h/t Morgan Stevens] — : August 11, 2021
Tags: law
A collaboration led by Stanford University’s Big Local News has gathered (and standardized) recent enrollment figures from 33 state education departments. The resulting dataset, which spans ~70,000 public schools, can be downloaded in bulk and explored online. Most states provided data down to the grade level; some also provided student counts by gender, race, ethnicity, ELL status, homelessness, economic status, and/or disability. The timeframes vary, but include at least the 2019–20 and 2020–21 school years for each state. See the documentation for details. As seen in: “The Kindergarten Exodus” (NYT), “How going remote led to dramatic drops in public school students” (EdSource), and a new academic study. [h/t Simon Willison + Cheryl Phillips] — : August 11, 2021
The World Wide Lightning Location Network uses radio sensors to detect the location and power of 200+ million lightning strokes per year. Access to WWLLN’s raw, detailed data costs money, but earth scientists Jed O. Kaplan and Katie Hong-Kiu Lau have converted it into a few public-access gridded-globe timeseries of lightning activity from 2010 to 2020 — daily and monthly strokes per km², and monthly stroke power. [h/t Robin Sloan] — : July 14, 2021
Tags: climate
Since 1973, the National Bureau of Economic Research’s 1,500+ affiliated researchers have published 29,000+ (pre-peer review) articles through the organization’s working papers series. It provides structured information about each paper, using a template from the RePEc project. PhD student Ben Davies has converted those files into CSV tables and an R package listing each paper’s title, publication month, ID, and authors. [h/t Alex Albright] — : July 14, 2021
Tags: economics
PhD candidate Semra Sevi recently compiled a dataset of 44,000+ candidates for Canadian federal office from 1867 to 2019 (and similar for Ontario provincial candidates). It lists each candidate’s name, gender, birth year, occupation, party, and incumbency status, plus the election’s date, riding, and outcome. And a new dataset from Anna Johnson et al. delves into the demographics of 4,516 Canadian federal candidates from 2008 to 2019, including their gender, age, race, Indigenous background, occupational category, and more. [h/t Marina Smailes + Erin Tolley] — : July 14, 2021
The Arizona OpenGIS Initiative for Deceased Migrants is a collaboration between Pima County’s Office of the Medical Examiner and Humane Borders, a nonprofit that maintains water stations in the Sonoran Desert. “Although each organization has a distinct mission, both are committed to the common vision of raising awareness about migrant deaths and lessening the suffering of families by helping to provide closure through the identification of the deceased and the return of remains.” The initiative’s maps and dataset provide details on 3,700+ deaths since 1990, including the deceased’s name, sex, age, and cause of death; their body’s location and condition; and the date reported. [h/t Olaya Argüeso Pérez] — : July 14, 2021
Tags: death, immigration, refugees
Axel Dreher et al. have published person-level data on 2.5+ million refugees who arrived in the US between 1975 and 2008. The anonymized records, obtained from the National Archives and originally collected by the Office of Refugee Resettlement, indicate each refugee’s country and date of birth, marital and family status, education level and English proficiency, date of US arrival, US city of resettlement, and more. The researchers also combined these records with public reports from the Bureau of Population, Refugees, and Migration (DIP 2015.11.25) to create a geocoded dataset of annual resettlements by citizenship and destination city from 1975 to 2018. [h/t Chris Parsons] — : July 14, 2021
For their recent paper, “Geography of roadkills within the Tropical Andes Biodiversity Hotspot,” ecologists Pablo Medrano-Vizcaíno and Santiago Espinosa surveyed three 33-kilometer road segments dozens of times in 2014. Their publication datasets provide details about the 445 dead vertebrates they found, including one new-to-science species of snake. [h/t Christian Miles + The Economist] — : July 7, 2021
VIRION is “an open atlas of the vertebrate virome” that launched in May. It represents the associations between 9,000+ viruses and 3,700+ host species, drawing from a range of sources, including USAID’s PREDICT project and the GloBI project, which organizes data about inter-species interactions. [h/t Timothée Poisot] — : July 7, 2021
The North Carolina Rural Health Research Program maintains a list of rural hospital closures in the US since 2005. The dataset contains 181 entries through June 2021, each representing a complete closure or a conversion from inpatient care to other services, and indicates the hospital, number of beds, Medicare payment program, month of closure, and the location’s Rural-Urban Commuting Area classification. [h/t Betsy Ladyzhets] — : July 7, 2021
Tags: healthcare
QuantGov, an open-source project of the free-market-oriented Mercatus Center, “solves the problem of quantifying large amounts of policy text for research and comprehension by using machine learning and natural language processing.” Its RegData initiative applies this approach to government regulations over time — quantifying their length and linguistic complexity and trying to identify their relevant NAICS-classified industries. Its datasets examine rules enacted by the federal government and most US states, as well as federal and subnational regulations from Australia, Canada, and India. [h/t Aaron Staples et al.] — : July 7, 2021
Tags: government
Oregon State University’s International Water Events Database summarizes 7,000+ episodes from 1948 to 2008 that “concern water as a scarce or consumable resource or as a quantity to be managed,” categorizing their intensities and indicating the basins and countries involved. The Pacific Institute’s Water Conflict Chronology documents 926 events (including a few legends) between 3000 BC and 2019 that involved violence or the threat of it. Bernhauer et al.’s Water-Related Intrastate Conflict and Cooperation dataset details 10,000+ events in the Middle East, Mediterranean, and Sahel between 1997 and 2009. Related: Takeshi Wada’s “Geographic Distribution of Water Conflicts Worldwide: A Comparative Analysis of Four Databases” (pdf) discusses these resources plus GDELT, a broader-scope event database. — : July 7, 2021
For a 2019 study, University of Milan researchers collected and analyzed 440 recordings of “meows emitted by cats in different contexts”: when brushed by their owners, when isolated in an unfamiliar environment, and when waiting for food. [h/t Duncan Geere] — : June 30, 2021
Developer Gautham Prakash has built a dataset of the device permissions requested by more than 1 million Android apps in the Google Play marketplace. The permissions include the ability to make calls, read the phone’s contacts, record audio, get the phone’s precise location, know what other apps are running, and dozens more. — : June 30, 2021
Tags: technology
More than 1,000 people died in the inter-communal violence that erupted in Gujarat, India, in early 2002. A team of political and computational scientists recently trained students to annotate 21,000+ sentences from 1,257 contemporaneous articles about the events published in the Times of India, asking them to categorize whether police officers used force, killed someone, made arrests, failed to intervene, and/or took any other action. The resulting dataset includes the raw annotations as well as final sentence- and document-level classifications. [h/t Katherine A. Keith] — : June 30, 2021
FairVote, an organization that advocates for ranked-choice voting, has gathered the results of hundreds of elections that used those rules. The spreadsheets capture both single-winner and multi-winner elections in 26 jurisdictions between 2001 and 2021 — not yet including last week’s New York City primaries, whose results won’t be finalized until all absentee ballots are processed. Previously: ranked.vote, which provides detailed diagrams and data on a smaller number of elections (DIP 2020.05.27). — : June 30, 2021
Tags: elections
US law requires state Medicaid agencies to report the quarterly number of outpatient prescriptions, total units, and reimbursement costs for each permutation of each drug they’ve covered. The federal Medicaid program’s State Drug Utilization Data makes those records — which spanned nearly 5 million rows in 2020 alone — available as state-level and national files going back to 1991. Related: The National Drug Code Directory, which “contains product listing data submitted for all finished drugs including prescription and over-the-counter drugs, approved and unapproved drugs and repackaged and relabeled drugs.” [h/t Michael Q. Maguire] — : June 30, 2021
Tags: healthcare, history
StudyForrest is “a one-of-a-kind resource for studying high-level cognition in the human brain under complex, natural stimulation.” Specifically: while watching Forrest Gump. In addition to fMRI scans and eye-tracking measurements, the project’s datasets include extensive annotations of the film itself, such as the location and timing of each shot. — : June 23, 2021
Many members of the New York City Council use CouncilStat to track their constituents’ requests, complaints, and other inquiries. The tool’s public dataset contains 260,000+ anonymized entries going back to 2015. It identifies each inquiry’s topic (e.g., tax preparation, citizenship, affordable housing, street resurfacing), district, and dates opened and closed. As seen in: A pre-election analysis of the requests by The City’s Ann Choi. — : June 23, 2021
Tags: government
Lisa Grossman Liu et al. have developed the Medical Abbreviation and Acronym Meta-Inventory, a database that maps 104,000+ medical abbreviations and acronyms to 170,000+ different meanings. To build it, the authors standardized data from eight sources, including the Unified Medical Language System, Wikipedia, and ADAM: Another Database of Abbreviations in MEDLINE. Related: “At our urban academic medical center, acronyms constituted 30–50% of the words in a typical medicine admission note.” — : June 23, 2021
Tags: healthcare, language
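The inventory described above is a one-to-many mapping: a single short form can stand for several meanings. A minimal sketch of that data structure follows, assuming the input arrives as simple (abbreviation, meaning) pairs. The function name and the handful of sample pairs are illustrative and are not drawn from the Meta-Inventory itself.

```python
from collections import defaultdict


def build_inventory(pairs):
    """Group (abbreviation, meaning) pairs into abbreviation -> set of meanings."""
    inventory = defaultdict(set)
    for short_form, long_form in pairs:
        inventory[short_form].add(long_form)
    return inventory


# Illustrative pairs only.
sample_pairs = [
    ("MS", "multiple sclerosis"),
    ("MS", "mitral stenosis"),
    ("MS", "morphine sulfate"),
    ("RA", "rheumatoid arthritis"),
    ("RA", "right atrium"),
]

inventory = build_inventory(sample_pairs)

# Abbreviations with more than one meaning are the ambiguous ones.
ambiguous = {abbr: sorted(meanings)
             for abbr, meanings in inventory.items()
             if len(meanings) > 1}
print(ambiguous)
```

Using a set per abbreviation means that duplicate pairs collected from different sources collapse into one meaning.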
The UN-affiliated Civilian Impact Monitoring Project conducts “real-time collection, analysis and dissemination of open source data on the civilian impact from armed violence in Yemen.” Its public datasets include the monthly incident and casualty counts and the incidents per region damaging various types of civilian infrastructure. Related: ACAPS, a humanitarian analysis group, is aggregating data on a range of “key drivers” and outcomes of the crisis (such as fuel prices, malnutrition, and internal displacement) in each district and governorate. Previously: The Yemen Data Project (DIP 2019.04.03). [h/t Sadam Al-Adwar] — : June 23, 2021
Tags: conflict
To build the Princeton Corpus of Political Emails, researchers auto-subscribed to thousands of mailing lists run by candidates, political parties, and other groups participating in the 2020 US election cycle. They’ve received 400,000+ messages so far. Since October, you’ve been able to search the corpus online; as of last month, you can request access to v1.0 of its bulk dataset, which contains 300,000+ emails received through Election Day. For each, it provides the subject and body text, sender, office sought, and more. Previously: Congressional e-newsletters via DCinbox (DIP 2021.03.03), and political emails gathered by The Markup and by FiveThirtyEight (DIP 2020.03.04). [h/t Samantha Guss] — : June 23, 2021
The London Fire Brigade responds to hundreds of requests to rescue animals each year. Its monthly-updated spreadsheet of such events goes back to 2009; it lists the location and type of property, the kind of animal and rescue, hours spent, a (very) brief description, and more. [h/t Soph Warnes] — : June 16, 2021
Tags: miscellaneous
Gerard de Melo’s Etymological Wordnet provides structured data on the relationship of words to one another, mostly mined from their semi-structured descriptions on Wiktionary as of 2013. The dataset includes hundreds of thousands of word-origin associations, among other connections. As seen in: “Surprising shared word etymologies,” a recent blog post by Daniel de Haas. [h/t Michael Allen] — : June 16, 2021
Tags: language
Maria C. Escobar-Lemmon et al. have assembled a dataset on the representation of women in constitutional, supreme, and highest-appellate courts around the world. It lists the number and percentage of women on those courts in 175 countries (from 1970 to 2013), and indicates the first year a woman was appointed to each court (updated through 2020). [h/t Alice J. Kang] — : June 16, 2021
Stanford University’s Educational Opportunity Project uses restricted-access data on standardized test results to estimate trends in academic performance and learning rates in grades 3–8 across US schools, school districts, counties, states, and other geographies, and with respect to race, gender, and economic status. Last week the project released v4.1 of their public dataset, adding estimates for Native American students and Bureau of Indian Education schools. As seen in: “The Bureau of Indian Education Hasn’t Told the Public How Its Schools Are Performing. So We Did It Instead,” from ProPublica and the Arizona Republic, which compiled data for the new estimates. [h/t Otis Anderson] — : June 16, 2021
Tags: education
“Maybe you’ve seen it in the media: that map of the U.S. painted with blobs of yellow, orange and red. It shows drought – but how do we know which colors go where?” US Drought Monitor, a collaboration between the University of Nebraska and two federal agencies, describes the process behind its weekly maps. Its authors, who take shifts drawing the drought-intensity boundaries, synthesize various sources of quantitative information — such as the Palmer Drought Severity Index and the Surface Water Supply Index — and local knowledge. The results can be downloaded as geospatial files, timeseries, and summary statistics. As seen in: “How Severe Is the Western Drought? See For Yourself,” from the New York Times. — : June 16, 2021
ProofWiki is “an online compendium of mathematical proofs,” with 21,000+ of them and counting. (It also provides lists of notable mathematicians and a page of jokes.) Sean Welleck et al.’s NaturalProofs dataset contains a processed version of the website’s XML dump, plus data derived from other proof-containing resources. — : June 9, 2021
Tags: science
“Using AIR-ETH, a new helicopter-borne ground-penetrating radar (GPR) platform,” Melchior Grab et al. “measured the ice thickness of all large and most medium-sized glaciers in the Swiss Alps during the years 2016–20. Most of these had either never or only partially been surveyed before.” The team’s latest inventory combines detailed geospatial data from this and previous surveys. Previously: The Randolph Glacier Inventory (DIP 2015.12.16). — : June 9, 2021
Links:
The International Budget Partnership has published an assessment of accountability in the emergency fiscal policies that 120 governments introduced between March and September 2020. The report’s downloadable data (see bottom of page) includes each policy package’s name, date introduced, and whether it was a legislative act or executive decree; ratings on various aspects of transparency, oversight, and participation; and scores on 26 specific questions. Related: The partnership’s Open Budget Survey and data explorer. [h/t Rajan Zaveri] — : June 9, 2021
Links:
Every Monday, the Energy Information Administration collects data “on retail prices for regular, midgrade, and premium grades of gasoline from a sample of retail gasoline outlets across the United States using Form EIA-878, Motor Gasoline Price Survey Schedule A.” The survey informs the agency’s average gas price estimates, which are available at a national and regional level, as well as for a few selected states and cities, going back to the 1990s (though the sampling methodology changed in 2018). The EIA also collects data on diesel prices, using a separate survey. — : June 9, 2021
Links:
Tags: transportation
The CDC has begun publishing daily-historical data on vaccination progress in the US, going back to mid-December 2020. For the country overall, each state and territory, and a handful of other jurisdictions (e.g., the Bureau of Prisons), the dataset indicates the number of Pfizer-BioNTech, Moderna, and J&J/Janssen doses delivered, total doses administered by age group, percentages of populations fully vaccinated, and more. Less-detailed information is also available for each county over time and for national demographic trends. Related: My colleague Peter Aldhous has incorporated the CDC’s data into BuzzFeed News’ vaccination tracker. — : June 9, 2021
Links:
Tags: COVID-19, healthcare
The World Institute of Kimchi’s Omics Database of Fermentative Microbes provides “genome, metagenome, metataxonome, and (meta)transcriptome sequences” of bacteria and other microorganisms associated with a variety of fermented foods. You can search the sequenced microbes by taxonomy, research study, and food sampled. — : June 2, 2021
Links:
Tags: food
Open Access UK is a database of 70,000+ meetings between lobbyists and government ministers going back to 2012. The records have been compiled by Transparency International UK from scattered official publications; they list the date of the meeting, the lobbying organization, the minister lobbied, and a brief description of the meeting’s purpose. [h/t Gavin Freeguard] — : June 2, 2021
Links:
Tags: government
The New York Civil Liberties Union has updated its NYPD misconduct complaint database to include “the race or ethnicity of the impacted person and officer, incident location, current employment status of the officer, and other data,” and to remove duplicates in the original records obtained from the city. Also: Philadelphia publishes monthly-updated data on police complaints from the past five years. Previously: ProPublica’s subset of the NYPD data (DIP 2020.07.29); and CPDP’s database of complaints against Chicago officers (DIP 2015.11.25). Read more: The Financial Times has compared and analyzed these cities’ datasets, finding that the 10% most-cited officers account for roughly a third of all complaints. [h/t Christine Zhang + George Ho] — : June 2, 2021
Links:
“The Bureau of Alcohol, Tobacco, Firearms and Explosives inspects thousands of licensed gun dealers and manufacturers each year, but what happens in those investigations is rarely revealed to the public.” So reporters at The Trace and USA Today compiled a database of nearly 2,000 ATF inspections between July 2015 and June 2017 with violations, based on PDFs obtained through a FOIA lawsuit by the Brady gun control group. It includes dealer information, lists of violations, final dispositions, and more. Read more: “The ATF Catches Thousands of Lawbreaking Gun Dealers Every Year. It Shuts Down Very Few.” Previously: The ATF’s licensing database (DIP 2015.11.04). — : June 2, 2021
Links:
In the 1880s–1910s, the Chicago Tribune, Tuskegee University, and the NAACP each began collecting data on lynchings in the US. In the 1990s, sociologists Stewart Tolnay and E.M. Beck reverified, standardized, and extended those three collections, compiling a now-seminal dataset that described 2,800+ victims of these mob-led extrajudicial killings — the majority of whom were Black — in 10 Southern states from 1882–1930. The researchers shared a copy of that data with Project HAL, where you can download it. Meanwhile, they and collaborators have continued updating and expanding the research; you can request access to their latest datasets. In 2019, sociologists Charles Seguin and David Rigby published a dataset that aims to complement Tolnay and Beck’s by presenting information (and an interactive map) on 1,328 lynching victims in 38 additional states from 1883–1941. [h/t Geoff Hing + Lisa D. Cook] — : June 2, 2021
Links:
Between 2002 and 2011, the Euro Spatial Diffusion Observatory convinced 22,500 people in France to open their wallets and count the 300,000+ Euro coins in them. The researchers also conducted smaller surveys in Germany and Belgium. The project’s public datasets detail the coins’ values and countries of origin, plus socioeconomic details about their owners. — : May 26, 2021
Links:
OpenCelliD bills itself as “the world’s largest open database of cell towers,” drawing from a combination of crowdsourcing, information provided by telecom firms, and a collaboration with the Mozilla Location Service. Its public datasets (registration required) indicate the latitude, longitude, radio type, and identifiers of more than 40 million GSM, CDMA, UMTS, and LTE “logical cells.” Related: An interactive map of the data (and methodology) by software developer Alper Cinar. [h/t u/cavedave] — : May 26, 2021
Links:
Tags: infrastructure, technology
The Inter-American Development Bank’s Database of Political Institutions 2020 provides structured information on 180+ countries’ national governments and elections going back to 1975. The topics include electoral rules, term limits, party and leader tenure, party fragmentation, the role of the military in government, competitiveness, and more. [h/t Cesi Cruz + Brian C. Keegan] — : May 26, 2021
Links:
The US Census Bureau’s Longitudinal Employer-Household Dynamics initiative partners with state governments to examine employment and economic mobility at “detailed levels of geography and industry and for different demographic groups.” Its downloadable and interactive datasets include quarterly job-flow metrics, which track workers’ movement between sectors, states, and metro areas; an experimental study of Army veterans’ employment outcomes; and more. [h/t Jared Shepard] — : May 26, 2021
Links:
For centuries, a network of volunteers recorded monthly measurements from thousands of rain gauges across the UK and Ireland. Those observations were trapped on paper forms known as “10-year sheets” until very recently, when the UK’s Meteorological Office scanned them. But the records still weren’t in an analyzable form, so a team of climate scientists organized the Rainfall Rescue project, sending a call for help just as Britain entered its first COVID-19 lockdown. Within months, more than 16,000 volunteers had transcribed all 65,000+ pages of scans, entering millions of rain measurements from 1677–1960. Read more: A Twitter thread from organizer Ed Hawkins. [h/t Charlotte Slaymark] — : May 26, 2021
Links:
The e.e. cummings free poetry archive, launched last week by journalist Ben Welsh, “aims to republish all of the author’s work as it gradually enters the public domain.” So far, it includes more than 100 poems, available to read online and as data files that include each poem’s collection, title, first line, and full text. Read more: Welsh’s introductory Twitter thread, which highlights technical details and volunteer contributions. — : May 12, 2021
Links:
Drawing from company websites and communications with government railway agencies, OBC Transeuropa’s Gianluca De Feo and Lorenzo Ferrari have identified 271 passenger train routes that cross Europe’s national borders. For each route, their dataset lists its two endpoints, the countries it passes through, route type (high-speed, regional, etc.), operating company, and more. Related: “Four ways of looking at European cross-border rail links,” a follow-up article. [h/t Giuseppe Sollazzo] — : May 12, 2021
Links:
Tags: transportation
In its FOIA reading room, the US Transportation Security Administration publishes weekly PDF files that indicate the number of people passing through its checkpoints, broken down by hour and location. IT specialist Mike Lorengo has been converting these PDFs into structured data files. Related: The TSA also publishes a table of total “traveler throughput” for each day in 2021, compared to the same weekdays in 2020 and 2019. — : May 12, 2021
Links:
Tags: transportation
To receive accreditation from the American Bar Association, law schools must submit a range of data-points about tuition, financial aid, student demographics, class sizes, employment outcomes, and more. The ABA’s disclosure site provides school-level PDF reports, as well as annual spreadsheets comparing all accredited schools since 2011. Related: “There Are Only Two Black Male Prosecutors For All Of Long Island,” a recent Gothamist article that uses the data. [h/t Charles Lane] — : May 12, 2021
Links:
For decades, the Southern Poverty Law Center has conducted annual censuses of US-based hate groups, which it defines as those with “beliefs or practices that attack or malign an entire class of people, typically for their immutable characteristics.” Its 2020 review found 838 such groups — a decline from prior years, which the researchers attribute to several factors, including the COVID-19 pandemic, the rise of difficult-to-track online networks, and “the continuing collapse of the Ku Klux Klan.” The center’s map of 2000–2020 findings links to annual spreadsheets that detail each group’s title, location, and ideology. — : May 12, 2021
Links:
Statistician Dan Oehm has constructed a dataset describing all 40 US seasons of Survivor, providing details on every contestant, challenge winner, individual vote, episode viewership, and more. You can access the data as an R package or an Excel file. [h/t u/antirabbit] — : May 5, 2021
Links:
Tags: media, television
To examine diversity in the US Federal Reserve System, a team led by central bank–watchers Peter Conti-Brown and Kaleb Nygaard has compiled a biographical dataset of the nearly 2,000 people who have served as directors of the system’s twelve Reserve Banks between 1914 and 2019. The dataset expands the information available in official reports to include “race, gender, profession, education, age, time spent in position, and whether or not the directors later held a position on the FOMC.” — : May 5, 2021
Links:
A team led by political scientist Sara Hellmüller has categorized the “evolving mandate tasks” of all 121 United Nations peace missions between 1991 and 2020. The dataset identifies 41 kinds of tasks, which it sorts into three categories: “minimalist,” “moderate,” and “maximalist.” Minimalist tasks “reflect an approach that contents itself with absence of armed conflict,” while the maximalist approach “seeks to address root causes,” such as by supporting military reform and women’s rights. Previously: Official UN peacekeeping data and the Geocoded Peacekeeping Operations dataset (DIP 2019.11.27). [h/t Roland Paris] — : May 5, 2021
Links:
The US National Labor Relations Board publishes a searchable and downloadable database that details the results of thousands of union elections the agency has conducted, going back more than a decade. The records list the employer name and location, tally date and type, petitioned-for employee unit, proposed labor union(s), number of votes for and against those union(s), and more. [h/t Cory McCartan] — : May 5, 2021
Links:
The History Database of the Global Environment, a project of the Netherlands Environmental Assessment Agency, provides land use estimates that span 12,000 years — from 10,000 BCE to the near-present. The datasets include gridded, country-level, and regional estimates of cropland, pasture, grazing areas, and other typologies over time. Related: Clio Infra gathers “worldwide data on social, economic, and institutional indicators for the past five centuries,” including estimates of cropland, livestock, metal production, and more. [h/t Cédric Scherer + u/cavedave] — : May 5, 2021
Links:
On its official website, the Tour de France lists riders’ results in its famed bicycle race since 1903. The site doesn’t provide downloads, but applied mathematician Thomas Camminady has scraped it to build a CSV file containing each finisher’s rank, time, team, and more. — : April 28, 2021
Links:
For more than a century, the Universal Postal Union has collected and published statistics about the world’s postal systems. Online, you can query and export country-level data — the number of letter-boxes and permanent post offices, operating revenue, total staff, and much more — going back to 1980. Related: Jon C. Rogowski et al. have used historical UPU reports to count the number of post offices per country between 1875 and 2007. Previously: US post office locations, 1639–2000 (DIP 2021.04.07). — : April 28, 2021
Links:
In 2018, Michael G. Miller and Brian T. Hamel published a study examining how voters and donors responded to scandals embroiling members of the US House between 1980 and 2010, building on the work of Scott Basinger and others. They have since expanded the dataset, which now covers both the House and the Senate and extends through 2018, providing information on 316 legislator-scandal combinations (categorized as financial, sexual, political, or “other”) and their outcomes. — : April 28, 2021
Links:
In order to hire foreign workers through the government’s H-1B, H-2A, and H-2B programs, US employers need permission from the Department of Labor. The agency collects and publishes data on each “certification” request, detailing the employer (name, location, industry, etc.), job position (title, pay, etc.), and approval status. The datasets go back more than a decade for each program and receive quarterly updates, the most recent being posted last week. Related: At BuzzFeed News, we used the data throughout our 2015/16 series investigating the H-2 program, and maintain a dataset that standardizes key fields from the raw files. [h/t George Ho] — : April 28, 2021
Links:
Launched in 2015, the Water Point Data Exchange today describes 577,000+ specific water access points: boreholes, hand-dug wells, protected springs, rainwater harvest tanks, and more. The platform gathers information from governments and their partners in 50+ countries — mostly in Africa, but also with 10,000+ data-points each in Afghanistan, Bangladesh, Haiti, and India. The records indicate the coordinates of each access point, the date checked, water availability when checked, the water source and/or transport system, and other details. [h/t Katy Sill and Adam Kariv] — : April 28, 2021
Links:
Tags: environment, oceans, water
Richard J. Gentry et al. have overseen the collection of data on 1,400+ CEO dismissals and thousands of other CEO departures from S&P 1500 companies between 1992 and 2018. Related: Claudio Fernandez-Araoz et al. have compiled data on CEO and CFO turnover between 2014 and 2018. [h/t Steve Boivie] — : April 21, 2021
Links:
Tags: business
The Endangered Language Alliance’s Languages of New York City map highlights nearly 700 languages and dialects spoken in NYC and nearby counties. For each language, it indicates a number of significant sites where it is or has been spoken. The project’s downloadable dataset lists each site’s status and neighborhood or city, plus the language’s linguistic family, countries of origin, and estimated number of global speakers. [h/t Ross Perlin] — : April 21, 2021
Links:
Tags: language
For more than a year, the New York Times collected data on “coronavirus infections, deaths and testing for state and federal prisons; immigration detention centers; juvenile detention facilities; local, regional and reservation jails; and those in the custody of the U.S. Marshals Service.” On Friday, it published the case and death tolls for 2,000+ of these facilities, as of March 2021. Read more: The Times’ reporting and graphics based on the data. Previously: Weekly COVID-19 numbers for each state prison system (DIP 2020.05.06), collected by the Marshall Project and Associated Press. [h/t Libby Seline] — : April 21, 2021
Links:
The nonprofit Common Crawl is “dedicated to providing a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis.” Over the past decade, it has gathered and shared petabytes of data from its roughly-monthly web crawls. The most recent, completed in early March, contains 2.7 billion pages. Related: The University of Mannheim’s Web Data Commons generates structured data from the crawls, including nearly 300 million rows of information (on products, events, museums, and more) extracted from websites’ schema.org markup. [h/t Andrea Volpini] — : April 21, 2021
Links:
Tags: media, technology
The Center for Strategic and International Studies has analyzed data on 980 domestic terrorism plots and attacks in the US from 1994 through January 2021, categorizing them as “violent far-right,” “violent far-left,” “religious,” “ethnonationalist,” or “other.” (The project’s methodology provides further details.) The researchers haven’t published the full dataset, but have allowed the Washington Post to publish a large slice of it — a dozen columns for each incident, with dates, locations, type of target, and more. Based on its own research, the Post added eight more columns about the far-right attacks, which it used for a data-driven article on the topic. Previously: The Global Terrorism Database (DIP 2018.02.07). — : April 21, 2021
Links:
As part of its “efforts to better understand the ecology of forest elephant movements, and the spatial and temporal pattern of poaching,” the Elephant Listening Project has been collecting audio 24/7 from 50 locations in the Republic of the Congo and publishing the raw recordings — “more than 1 million hours of the sounds of birds, primates, insects, frogs, you name it. If it makes sound, we record it.” — : April 14, 2021
Links:
Tags: animals, audio, environment
Since January 2019, software engineer Nick Jones has been capturing hourly screenshots of five news sites’ homepages: The New York Times, Washington Post, Wall Street Journal, Fox News, and CNN. You can browse the images online and download them from predictable URLs. Note: “This project is unaffiliated with PastPages, a similar effort that took screenshots from 2012 to 2018 from a much wider range of news websites.” — : April 14, 2021
Links:
Tags: journalism, media
The Census Bureau’s Household Pulse Survey “is a 20-minute online survey studying how the coronavirus pandemic is impacting households across the country from a social and economic perspective,” and asks questions about food security, housing, telework, and more. It has collected more than 2 million responses since last April, with results published on a rolling basis via microdata files, statistical tables, and interactive graphics. Related: The Urban Institute has been collating and standardizing the data files. [h/t Michael Allen + Eric Gardner] — : April 14, 2021
Links:
Researchers at the Bank for International Settlements (DIP 2020.08.12) have compiled a dataset of central banks’ policy responses to the coronavirus pandemic. It covers more than 900 announcements by 39 central banks, grouped by monetary tool: interest rates, reserve policies, lending operations, foreign exchange, and asset purchases. The researchers also “provide further details relevant to each type of tool, such as the maturity and whether the instrument was new to the central bank or not.” [h/t Carlos Cantú] — : April 14, 2021
Links:
The Guttmacher Institute’s new Global Abortion Incidence Dataset gathers annual abortion statistics for more than 100 countries — for as many years between 1990 and 2018 as possible. The dataset also indicates the sources for the figures, “whether spontaneous abortions are included, whether or not the data are considered complete and the reason behind it, the marital status of the sample for studies and surveys,” and more. Previously: Pregnancies, births, and abortions by US state and age group (DIP 2020.10.28), also from Guttmacher. [h/t Cynthia Beavin] — : April 14, 2021
Links:
The Great Lakes Fishery Commission, whose founding was spurred by an invasion of sea lampreys, publishes several datasets relevant to the famous North American basin. Among them: “Commercial Fish Production In The Great Lakes 1867-2015,” which tallies the number of pounds caught each year by lake, jurisdiction, and species. [h/t Forest Gregg] — : April 7, 2021
Links:
Tags: animals, environment, water
London is providing an “experimental dataset” categorizing the various coronavirus-related restrictions that have affected the UK capital since March 2020. It lists the dates of 22 policy changes, including three separate lockdowns; whether schools, pubs, restaurants, and/or shops were closed; whether household mixing was banned; and more. [h/t Olivier Lejeune] — : April 7, 2021
Links:
For an article initially published in 2013 and since updated, Our World in Data has examined historical trends in the number of hours that people work. The data sources vary in geographic and temporal scope, with most spanning decades; they include Huberman and Minns (PDF, see Table 3), the Penn World Table, the Total Economy Database (registration required), the OECD, and more. — : April 7, 2021
Links:
The Chinese Name Database, published by social psychology grad student Han-Wu-Shuang (Bruce) Bao (包寒吴霜), “contains nationwide frequency statistics of 1,806 Chinese surnames and 2,614 Chinese characters used in given names, covering about 1.2 billion Han Chinese population (96.8% of the Han Chinese household-registered population born from 1930 to 2008 and still alive in 2008).” The statistics, obtained from China’s National Citizen Identity Information Center, also record the frequencies of given-name characters for six age cohorts, based on decade of birth. Read more: “What can we tell from the evolution of Han Chinese names?” — an explainer and analysis by Isabella Chua in Kontinentalist. [h/t Nathan Yau] — : April 7, 2021
Links:
Historian Cameron Blevins has released a dataset of 166,000+ post offices operating in the US between 1639 and 2000. It includes their years of service and precise/approximate geocoordinates, “making it one of the most fine-grained and expansive datasets currently available for studying the historical geography of the United States.” The project builds on research by the late Richard W. Helbock and provides the data-foundation for Blevins’s new book, Paper Trails: The US Post and the Making of the American West, and companion website. Read more: Blevins’s introductory Twitter thread. [h/t Eric Gardner + Alex Albright] — : April 7, 2021
Links:
Researchers at Turkey’s Selçuk University have built a computer vision program to measure and classify images of dried beans. Their dataset includes 13,611 specimens across seven varieties; for each, it reports the bean’s perimeter, axis lengths, roundness, and more. [h/t Meredith Broussard] — : March 31, 2021
Links:
Tags: agriculture, food
Pauline Affeldt et al. have compiled a quarter century of merger control decisions by the European Commission’s Directorate-General for Competition — decisions relating to 5,000+ cases and 31,000+ product/market combinations between 1990 and 2014. The dataset lists the target company, acquiring company, industry, product, outcome, decision date, and more. [h/t Anna Rita Bennato] — : March 31, 2021
Links:
Tags: business
Many countries apply gender quotas to their parliamentary elections, typically by reserving a certain number of seats for women or by regulating political parties’ candidate lists. In other instances, parties have instituted voluntary quotas. The Gender Quotas Database categorizes these rules for more than 120 nations, and provides additional details through its country profile pages. — : March 31, 2021
Links:
“Using a mixture of machine learning techniques, high-resolution satellite imagery, and population data,” researchers at Facebook and Columbia University have “mapped hundreds of millions of structures distributed across vast areas and then used that to extrapolate the local population density.” The project, which began half a decade ago, now provides population density datasets covering much of the world. (China, Russia, and Canada are among the notable countries missing.) Read more: Benjamin Schmidt’s exploratory Twitter thread. Previously: The Global Human Settlement Layer (DIP 2016.11.02). — : March 31, 2021
Links:
Tags: mapping
Last week, the US CDC began publishing a new, geographically-specific COVID-19 “case surveillance” dataset. Each of the 22 million rows represents a (de-identified) coronavirus case, accounting for roughly 70% of the country’s current official case count. The details include (where available) the person’s county and state of residence; age, sex, race, and ethnicity; “presence of any underlying medical conditions and risk behaviors”; and whether the person was hospitalized and/or died. Read more: Betsy Ladyzhets writes at the COVID-19 Data Dispatch, “After months of no state-by-state demographic data from the federal government, we now have county-by-county demographic data. This is a pretty big deal!” — : March 31, 2021
Links:
AnAge is “a curated database of ageing and life history in animals, including extensive longevity records.” Part of a broader project on the genetics of human ageing, the database contains 4,200+ species, from the aardvark to the ziege. Where available, it lists maximum lifespans, average gestation times, typical litter sizes, and more. [h/t Xan Gregg] — : March 24, 2021
Links:
Tags: animals
New York City’s affordable housing lotteries receive hundreds of applications per available unit. For an investigation into the odds, reporters for The City last year obtained and analyzed data on 426 lotteries from 2014 to 2019 (including building locations, unit sizes, monthly rents, and income thresholds), plus basic information (on income and household size) from more than 20 million lottery applications. [h/t Ann Choi] — : March 24, 2021
Links:
Researchers from the US Bureau of Economic Analysis have constructed a dataset of government investment in basic, social, and digital infrastructure from 1947 to 2017, using data from the agency’s National Economic Accounts. The investment amounts are broken down by subcategory (water, sewer, power, transportation, education, public safety, healthcare, etc.) and ownership (federal, state/local, or private). [h/t Donald Schneider] — : March 24, 2021
Links:
Tags: education, government, healthcare, infrastructure, transportation, water
EDGAR-FOOD, a new dataset from European Commission researchers, provides estimates of the global food system’s greenhouse gas emissions, “consistently covering each stage of the food chain for all countries with yearly frequency for the period 1990–2015.” The dataset distinguishes between four kinds of greenhouse gas and eight stages: “land use/land-use change activities,” production, processing, packaging, transport, retail, consumption, and waste management. Related: Carbon Brief’s coverage of the research. [h/t Duncan Geere] — : March 24, 2021
Links:
The Violence Project’s Mass Shooter Database contains extensive details about public mass shootings in the US and their perpetrators, weapons, and victims, with a goal of “finding pathways to prevention.” The database, funded by a National Institute of Justice grant, covers 170+ shootings between 1966 and early 2020. Examples of the variables include: the perpetrator’s employment status, known prejudices, and experience with mental illness; the victims’ relationship to the perpetrator and estimated years of life they lost; and the firearms’ make, model, and method of acquisition. Previously: Data on mass shootings from the Gun Violence Archive and Mother Jones (DIP 2015.12.09). — : March 24, 2021
Links:
Business historian Peter Scott has reconstructed a dataset of Britain’s “inter-war super-rich.” Scott started with an official (but mostly nameless) list of 438 “millionaires” (which tax authorities defined as people with annual incomes above £50,000) from 1928/29, and then cross-referenced the entries with other sources to identify 291 men and 28 women who fit the criteria. [h/t Jain Family Institute] — : March 17, 2021
Links:
Quotebank is “an open corpus of 178 million quotations attributed to the speakers who uttered them, extracted from 162 million English news articles published between 2008 and 2020.” Its authors used machine learning to identify the quotes and speakers, “correctly [attributing] 86.9% of quotations in our experiments.” The five most-quoted people: Barack Obama, Donald Trump, Mitt Romney, Hillary Clinton, and Narendra Modi. [h/t Lynn Cherny] — : March 17, 2021
Links:
Tags: journalism, language, media
Brazil’s health ministry is publishing granular data from its COVID-19 vaccination registry, with data on more than 12 million doses administered so far. For each dose, it indicates the patient’s date of birth, sex, race, location, and eligibility group; the vaccine name, manufacturer, and lot; the date and location of vaccination; and more. [h/t Olivier Lejeune] — : March 17, 2021
Links:
Tags: COVID-19, healthcare
The US Centers for Medicare & Medicaid Services requires nursing homes to submit detailed payroll information, which the agency converts into public-use files that summarize daily staffing levels at each facility. The files count employee and contract hours for dozens of types of staff, ranging from administrators to respiratory therapists. Related: “Maggots, Rape and Yet Five Stars: How U.S. Ratings of Nursing Homes Mislead the Public,” a recent New York Times investigation that uses the data in several ways. — : March 17, 2021
Links:
Tags: business, healthcare
The Emergency Events Database (EM-DAT) contains “essential core data on the occurrence and effects of more than 21,000 [natural and technological] disasters in the world, from 1900 to present.” It focuses on disasters that have caused 10+ human deaths, affected 100+ people, sparked a state of emergency, and/or prompted a request for international assistance. Where known, the public dataset (registration required) indicates each disaster’s location, type, start/end dates, and estimated damages; the number of people killed, injured, made homeless, or otherwise affected; and more. Related: The Geocoded Disasters (GDIS) Dataset (registration required) provides spatial coordinates for the natural disasters in EM-DAT from 1960 to 2018. — : March 17, 2021
Links:
Using a combination of human observation and automated detection, the city of Amsterdam has been mapping its bat population. Its data features more than 6,000 observations, the observation routes, bat-detection device placement and findings, and the city’s known bat abodes. [h/t Martina Zamboni] — : March 10, 2021
Links:
Tags: animals
The Electoral System Design Database describes the way in which lower-house parliamentarians are chosen, covering 1,300+ such elections in 217 countries and territories since 1991. It specifies the type of system (e.g., “first-past-the-post” or “party block vote”), the number of tiers of representation, the number of legislators directly elected, and the number of voting members. [h/t Andrew Stewart] — : March 10, 2021
Links:
The Black Media Initiative, based at CUNY’s graduate journalism school, has launched an interactive map and database of 300+ newspapers, radio stations, and other “media outlets across the U.S. that primarily serve Black communities across the diaspora.” The information includes locations, formats, publication frequencies, target audiences, ownership categories, and more. [h/t Mike Reilley] — : March 10, 2021
Links:
Tags: journalism, media, race
Johannes Uhl et al.’s Historical Settlement Data Compilation for the United States contains estimates, for every half decade from 1810 to 2015, of the number of building units and structures in every 250-meter-by-250-meter chunk of the (present-day) contiguous US. The researchers constructed their calculations from the Zillow Transaction and Assessment Dataset, which contains more than 400 million property records and requires researchers to apply for access. Previously: The Global Human Settlement Layer (DIP 2016.11.02). [h/t Rebecca Hersher] — : March 10, 2021
Links:
Outbreak.info’s mutation situation reports track the prevalence of new strains of SARS-CoV-2, the virus that sparked the COVID-19 pandemic. For each “lineage,” the reports provide data on the geography, timing, and number of sequenced viral samples that bear the strain’s signature. The project, led by biologists at Scripps Research, is one of many that draw from the genomic data collected by GISAID (DIP 2020.03.11), which also publishes its own variants dashboard. Related: The US CDC is tracking the number of variant cases reported in each state and territory. But they’re only providing cumulative counts, so USA Today has begun scraping (and backfilling) the data to maintain a historical record. [h/t Betsy Ladyzhets] — : March 10, 2021
Links:
SoilGrids “uses state-of-the-art machine learning methods to map the spatial distribution of soil properties across the globe,” including organic carbon density, pH, clay content, and more. The maps use data from the World Soil Information Service, which standardizes millions of soil records. Both projects are run by the International Soil Reference and Information Centre, which also catalogs dozens of public-access soil datasets. [h/t Jonathan Whitaker] — : March 3, 2021
Links:
Tags: environment
A team of conservationists has compiled a decade of data on illegal cheetah sales and ownership, drawing from “over 300 sources, including direct communications with field informants, veterinarians, and cheetah owners,” court records, social media, and more. The dataset covers 1,800+ cases — including both actual seizures and alleged/suspected incidents — involving 4,000+ cheetahs or cheetah parts/derivatives. — : March 3, 2021
Links:
The volunteer-driven VaccinateCA project calls “hundreds of potential [COVID-19] vaccination sites daily, asking them if they have the vaccine and if so to whom they will administer it to and how to get an appointment.” You can examine the results on their website, and also use their API to access the latest data, which includes status reports on every California county, 20+ health care providers, and thousands of potential vaccination sites. Related: “What We’ve Learned (So Far).” [h/t Simon Willison] — : March 3, 2021
Links:
Tags: COVID-19
For more than a decade, political scientist Lindsey Cormack’s DCinbox project has collected “every official e-newsletter sent by sitting members of the U.S. House and Senate.” You can search the corpus online and also download all the emails as a series of CSV files, grouped by month. For each of the 130,000+ mailings, the files provide the date, subject, body, and sender’s Bioguide ID. (April 2020 was the highest-volume month, with more than 2,300 messages, nearly all of them mentioning the coronavirus.) — : March 3, 2021
Links:
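As a rough sketch of how the monthly CSV files might be tallied, the snippet below counts mailings per month. The column names (`date`, `subject`, `body`, `bioguide_id`) and the ISO `YYYY-MM-DD` date format are assumptions based on the description above, not the project's confirmed schema, and the sample rows are made-up placeholders.

```python
import csv
import io
from collections import Counter


def count_mailings_by_month(csv_file) -> Counter:
    """Count rows per 'YYYY-MM', assuming an ISO-formatted 'date' column."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[row["date"][:7]] += 1  # keep only the year and month
    return counts


# Placeholder rows for illustration only; not real DCinbox data.
SAMPLE = """date,subject,body,bioguide_id
2020-04-02,Placeholder subject A,Placeholder body,X000001
2020-04-15,Placeholder subject B,Placeholder body,X000002
2020-05-01,Placeholder subject C,Placeholder body,X000001
"""

monthly = count_mailings_by_month(io.StringIO(SAMPLE))
busiest_month, busiest_count = monthly.most_common(1)[0]
```

With the real files, the same function could be fed an open file handle for each month's CSV, after checking the actual header names.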
Reporters at FiveThirtyEight and The Marshall Project filed freedom-of-information requests to 50 US cities, asking for data on all civil lawsuits against their police departments/officers “that resulted in a monetary legal settlement” in 2010–2019 — including incident and settlement dates, plaintiff and defendant names, allegation descriptions, amounts awarded, and more. They received full or partial data from 31 of those cities, and found that total payouts exceeded $3 billion. Don’t miss: The data repository’s “words of caution,” which discourage comparisons between cities. Related: Coauthor Laura Bronner’s introductory Twitter thread. [h/t Eric Gardner] — : March 3, 2021
Links:
More than 1,000 illustrated animals have graced the covers of O’Reilly Media’s technical books. The publisher hosts an online “menagerie” where you can browse the pairings; it doesn’t provide downloads, but brian d foy, author of several O’Reilly books, has written a Perl tutorial on how to scrape it, and has shared the results. — : February 24, 2021
Links:
Tags: animals, books, technology
To build El Indultómetro/The Pardonometer, the Civio Foundation has “collected, scraped and classified all the information contained in the [Official State Gazette] on pardons granted in Spain since 1996.” You can browse the 10,000+ pardons and commutations online or download the dataset, which describes the initial charges, form of relief, relevant dates, and more. Related: Civio’s (Spanish-language) methodology and reporting on the topic. [h/t Olaya Argüeso Pérez] — : February 24, 2021
Links:
Hanna Bäck et al. have built a dataset of 1,000+ foreign ministers — all officials holding such a post between 1789 and the mid-2010s in “the world’s 13 former and current great powers.” It describes their stints in office and biographical details, such as their marital status, occupational experience, education level, military service, and much more. [h/t Alex Quiroz Flores] — : February 24, 2021
Links:
Papers with Code cross-references machine learning papers with their datasets, code, and results. The project has connected 56,000+ of those papers to specific code repositories, and assembled a meta-dataset of 3,000+ relevant datasets. The records can be downloaded in bulk or fetched via API. Related: A recent study examining 133 facial-recognition datasets created since 1976; further coverage in MIT Technology Review. Also related: Exposing.ai, which lets you “check if your Flickr photos were used to build face recognition.” [h/t Karsten Johansson] — : February 24, 2021
Links:
Tags: language, technology
The Development Data Lab has gathered data on 81.2 million court cases in India’s lower judiciary between 2010 and 2018, drawn from the country’s e-Courts platform. It’s “the largest open-access dataset on judicial proceedings in the world,” says project coauthor Aditi Bhowmick. The public dataset contains each case’s state, district, court, case type, filing and decision dates, defendant and petitioner genders, legal codes, and more. It “has been fully anonymized to prevent the identification of individual judges or litigants,” but researchers can apply for more extensive access. [h/t Shruti Rajagopalan] — : February 24, 2021
Links:
Journalists at FiveThirtyEight watched 233 Super Bowl ads run by 10 frequently-advertising brands between 2000 and 2020, drawing from superbowl-ads.com’s video archive. Then they categorized each ad according to seven specific questions, which included “Was it trying to be funny?” and “Did it include animals?” — : February 10, 2021
Links:
The World Bank’s Worldwide Bureaucracy Indicators “is a unique new cross-national dataset on public sector employment and wages” in 130+ countries between 2000 and 2018, based on censuses and household surveys. The dataset’s measurements include the size of this workforce, its demographics, pay disparities versus the private sector, and more. Related: An introductory Twitter thread from coauthor Faisal Baig. — : February 10, 2021
Links:
Tags: government, labor
Researchers have harmonized two major datasets measuring the light visible on Earth at night. The first, from the Defense Meteorological Satellite Program, covers 1992–2013. The second, from the Visible Infrared Imaging Radiometer Suite, covers 2012 onward. The harmonized dataset covers 1992–2018 and polishes the data by removing, for example, “noises from aurora, fires, boats, and other temporal lights.” Related: The Colorado School of Mines’ Earth Observation Group, which “specializes in nighttime observations of lights and combustion sources worldwide.” [h/t Milos Popovic] — : February 10, 2021
Links:
Tags: environment, mapping
Software developer Lucas Rodés-Guirao is aggregating coronavirus vaccination numbers, over time, by administrative division — each canton in Switzerland, for example, and each province in Argentina. The project, inspired by Our World In Data’s country-level dataset (DIP 2020.12.23), currently includes 20+ countries and provides links to the sources. [h/t Olivier Lejeune] — : February 10, 2021
Links:
Tags: COVID-19
To create their “Extremely Detailed Map of the 2020 Election,” journalists at The Upshot have gathered and standardized the ballot results for more than 100,000 precincts in dozens of states so far — representing nearly two-thirds of all votes cast. The accompanying GeoJSON dataset indicates the number of votes received by Joe Biden, by Donald Trump, and overall (including third-party and write-in candidates), joined to each precinct’s geographic boundaries. [h/t Kevin Quealy + Ryan Matsumoto] — : February 10, 2021
Links:
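A minimal sketch of working with precinct-level GeoJSON of this kind: computing each precinct's margin in percentage points. The property names (`votes_dem`, `votes_rep`, `votes_total`) are assumptions for illustration and should be checked against the published file; the two features below are placeholders, not real results.

```python
import json


def biden_margin_points(feature):
    """Return (Biden - Trump) / total votes, in percentage points.

    Returns None for precincts that report no votes, to avoid dividing by zero.
    """
    props = feature["properties"]
    total = props["votes_total"]  # assumed property name
    if not total:
        return None
    return 100.0 * (props["votes_dem"] - props["votes_rep"]) / total


# Placeholder FeatureCollection; geometry is omitted (null is valid GeoJSON).
SAMPLE = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "geometry": null,
     "properties": {"votes_dem": 60, "votes_rep": 30, "votes_total": 100}},
    {"type": "Feature", "geometry": null,
     "properties": {"votes_dem": 0, "votes_rep": 0, "votes_total": 0}}
  ]
}
""")

margins = [biden_margin_points(f) for f in SAMPLE["features"]]
```

Dividing by the overall total, rather than the two-party total, keeps third-party and write-in votes in the denominator, which matches how the dataset reports its counts.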
The National Hockey League has an undocumented API. Technologist Drew Hynes is making sense of the dozens of endpoints, which provide data on decades of player stats, game schedules, draft picks, and more. [h/t Jemma Issroff] — : February 3, 2021
Links:
Tags: sports
DuckDuckGo’s Tracker Radar watches the web’s watchers. The regularly-updated dataset currently covers 36,000+ of the most common third-party domains, and provides “detailed information about their tracking behavior, including prevalence, ownership, fingerprinting behavior, cookie behavior, privacy policy,” and more. As seen in: Blacklight, The Markup’s “real-time website privacy inspector,” which uses the dataset. [h/t Surya Mattu] — : February 3, 2021
Links:
Tags: business, technology
The World Mortality Dataset contains recent “all-cause mortality” counts for 79 countries, aggregated to weekly, monthly, or quarterly totals (depending on availability). Launched last week by researchers Ariel Karlinsky and Dmitry Kobak, the project draws on the Human Mortality Database, EuroStat, and data collected by the New York Times (DIP 2020.05.27); and expands upon it by gathering data from government websites and through direct inquiries. [h/t Dror Guldin] — : February 3, 2021
Links:
Tags: death
Among its goals, OpenTopography wants to “democratize online access to high-resolution (meter to sub-meter scale), Earth science-oriented, topography data acquired with lidar and other technologies.” The project hosts hundreds of elevation datasets from around the world, available to visualize online and to download. Related: The US Geological Survey’s 3D Elevation Program aims “to provide the first-ever national baseline of consistent high-resolution topographic elevation data” by 2023, and already publishes a range of subnational datasets. [h/t Ricardo Pereira] — : February 3, 2021
Links:
Tags: mapping
Political scientist Anna Gunderson has assembled a longitudinal dataset of private prisons at the federal, state, and local level through 2016. Using decades of financial reports by the two largest operators, plus those by two other companies one later acquired, Gunderson’s dataset identifies each facility’s name, location, primary customer, capacity, security level, contract information, and more. Related: Last week, President Biden signed an executive order to phase out the federal use of such facilities. — : February 3, 2021
Links:
The Codex Atlanticus “is the largest existing collection of original drawings and text by Leonardo da Vinci” — 1,119 pages assembled by the 16th–century sculptor Pompeo Leoni. Milan’s Biblioteca Ambrosiana and The Visual Agency have created an interactive graphic that lets you explore the pages by year and subject; you can also download that metadata through the graphic’s “About the project” section. [h/t Giuseppe Sollazzo] — : January 27, 2021
Links:
The European Opinion Polls as Open Data repository collects the results of party-preference polls in 34 countries. For each poll, it lists the polling firm and commissioners, when the fieldwork began and ended, its scope and sample size, and the topline numbers for each party. Related: Project maintainer Filip Van Laenen on “how simple things can turn out to be rather complicated.” [h/t Erik Gahner Larsen] — : January 27, 2021
Links:
The Firm-Level Risk project uses “textual analysis of quarterly earnings conference calls held by more than 11,000 listed firms in 81 countries” to construct company-by-company “measures of exposure, risk, and sentiment.” The dataset goes back nearly two decades and includes sub-measures for various themes, such as tax policy, Brexit, and COVID-19. [h/t Stephan Hollander] — : January 27, 2021
Links:
The US General Services Administration publishes a list of all registered .gov domains (DIP 2017.01.18). In 2011, 2014, and 2015, open-source advocate Ben Balter used software to scan each federally-managed domain “to sniff out information about [their] technology and capabilities.” For instance: Do the domains support HTTPS? Do they use Google Analytics? Earlier this month, Balter ran the scan again, “to serve as a snapshot of the state of government technology ahead of the incoming Biden administration.” You can browse and download the results. — : January 27, 2021
Links:
Tags: government, technology
The OpenSky Network crowdsources air traffic data, thanks to members who collect the radio signals that aircraft periodically broadcast. The nonprofit organization has published a “COVID-19 flight dataset,” which contains metadata for the tens of millions of flights those members observed in 2019 and 2020, with plans to update the dataset until the pandemic ends. It includes each flight’s call sign, aircraft model type, origin and destination airports, time first and last seen, and more. [h/t Evgeny Pogrebnyak] — : January 27, 2021
Links:
Tags: COVID-19, transportation
Winning the Internet, an experimental newsletter from The Pudding’s Russell Goldenberg, analyzes the hyperlinks found in 100+ link-heavy general-interest newsletters. Last month, Goldenberg published six months of the underlying data — nearly 150,000 newsletter/link/date observations in total. — : January 20, 2021
Links:
Tags: media, social media
Thanks to India’s Right to Information Act, energy researcher Sandeep Pai has compiled a dataset of the country’s 459 operational coal mines. It includes each mine’s name, location (state, district, latitude, longitude), ownership, production tonnage, and more. Related: Pai’s introductory Twitter thread. — : January 20, 2021
Links:
Tags: energy, environment
The Office of the French Ambassador for Digital Affairs has been tracking user-agreement documents (such as terms of service and privacy policies) published by 170+ websites, apps, and other bits of software — from Airbnb and Google Analytics to The New York Times and Zillow. You can download them all and subscribe to notifications when they change. [h/t Vincent Viers] — : January 20, 2021
Links:
Tags: technology
The Program on Extremism at George Washington University is building a “central database of court records related to the events of January 6, 2021.” In addition to linking to the government’s criminal complaints, indictments, and other documents, the researchers are publishing a spreadsheet of all people who’ve been federally charged — with names, age, gender, home state, and date charged. — : January 20, 2021
Links:
US Crisis Monitor “provides the public with real-time data and analysis on political violence and demonstrations around the country, establishing an evidence base from which to identify risks, hotspots, and available resources to empower local communities in times of crisis.” The project, launched last summer as a collaboration between the Armed Conflict Location & Event Data Project (DIP 2017.11.29) and Princeton University’s Bridging Divides Initiative, now contains information on 20,000+ peaceful protests (the vast majority of entries), violent demonstrations, riots, and other events since April 2020. For each entry, the dataset lists the event type, date, location, groups involved, as well as a brief summary. As seen in: The Police’s Tepid Response To The Capitol Breach Wasn’t An Aberration (FiveThirtyEight). — : January 20, 2021
Links:
Lichess is a free, open-source, donation-supported chess server. Its team publishes a database of the rated matches played on its platform (more than 1.7 billion so far, since 2013), which it recently used to revamp its dataset of original chess puzzles. Previously: Chess games from high-level players and tournaments (DIP 2016.02.17). — : January 13, 2021
Links:
Tags: games
Government agencies often commercialize their innovations through “technology transfer” programs, which strike collaboration and licensing agreements with outside parties. NASA provides a technology transfer API for accessing data about its patent portfolio and software catalog; the National Institutes of Health offers a similar API for querying its licensing opportunities. [h/t Tom Folkes] — : January 13, 2021
Links:
Tags: technology
Political scientists Corinna Kroeber and Tobias Remschel have compiled a dataset of “all written communication published by the German Bundestag between 1949 and 2017,” unifying datasets released by Germany’s parliament “in a manner easily accessible to researchers applying text analysis.” For each of the 131,835 reports, requests, bills, and other documents, the dataset provides the full text, date, author information, and more. Previously: Six million parliamentary speeches from nine countries (DIP 2020.04.29). — : January 13, 2021
Links:
Global Fishing Watch — initially organized as a collaboration between conservationists and Google — uses satellite imagery, ship signals, and other sources “to visualise, track and share data about global fishing activity in near real-time and for free.” The project’s public datasets (free registration required) examine various aspects of the industry, including the geography of “fishing effort” (2012–16) and transshipment between vessels. As seen in: “Why the U.K. and EU Are Fighting Over Fish” (Bloomberg). [h/t Nathan Yau] — : January 13, 2021
Links:
The Comparative National Elections Project “is a partnership among scholars who have conducted election surveys on five continents,” with a focus on understanding the factors that shape voters’ decisions. The project’s publicly-available datasets include 48 of the surveys, which use a combination of country-specific questions and a shared questionnaire — asking about political news consumption, attitudes towards democracy, interpersonal communications, and other topics. Related: The Comparative Study of Electoral Systems, a similar collaboration with “a special emphasis on voting and turnout.” [h/t Erik Gahner Larsen] — : January 13, 2021
Links:
Tags: elections
Over the course of 2020, Eli Holder paid workers on Mechanical Turk to turn news headlines into 5/7/5-syllable poems. The result: 2,760 “Doom Haikus,” which you can browse on a timeline or download in bulk. For each poem, the dataset also includes the original article URL, date processed, headline, and SEO snippet. [h/t Karsten Johansson] — : January 6, 2021
Links:
In 2017, a team of researchers downloaded and analyzed 1.25 million publicly-available Jupyter notebooks — documents that weave computational code, output, and text. They also published the notebooks and their related metadata. Inspired by that project, a team at JetBrains recently did a follow-up scan, analyzing and publishing data on nearly 10 million notebooks. — : January 6, 2021
Links:
Tags: technology
The College Athletics Financial Information Database, run by the privately-funded Knight Commission on Intercollegiate Athletics, details the annual sources of revenue (such as ticket sales) and expenses (such as coaches’ compensation) for hundreds of schools, based on information self-reported to the NCAA and federal government. Many of the records were obtained via freedom-of-information requests by USA Today and Syracuse University students. [h/t Craig Garthwaite et al.] — : January 6, 2021
Links:
The UN and the World Bank have launched a new interactive map and dataset that quantify the transportation costs for international trade — country-by-country and broken down by mode of transportation (sea, air, rail, road), trading partner, and commodity. The numbers, based both on directly-reported figures and statistical modelling, include costs overall, per unit, and per unit per kilometer. The project currently covers only 2016, but has plans to expand. [h/t Jan Hoffmann] — : January 6, 2021
Links:
Carnegie Mellon University’s epidemiological forecasting group and Facebook have partnered to field a large-scale coronavirus survey in the US; they’ve collected more than 15 million responses since April 2020. The University of Maryland has formed a similar partnership for an international survey, in which “a representative sample of Facebook users is invited on a daily basis to report on symptoms, social distancing behavior, mental health issues, and financial constraints”; millions have also participated. Geographically-aggregated results of the US survey can be downloaded via an online interface or Delphi’s API; the international results are also available via API. Practical example: An analysis of state-by-state mask usage, with code. [h/t Alex Reinhart] — : January 6, 2021
Links:
Tags: COVID-19
YouTuber (and former public radio reporter) Adam Ragusea recently asked his viewers to answer a detailed survey about whether (and why, and how) they wash meat before cooking it. He received more than 13,000 responses. He then made a video about what he found and published a spreadsheet of the anonymized answers. — : December 23, 2020
Links:
Tags: food
The European Space Agency has released new longitudinal data on the Northern Hemisphere’s permafrost — ground that remains 0°C/32°F or colder for at least two years. Through a combination of satellite detection and on-the-ground measurements, the datasets quantify the permafrost’s thickness, extent, and temperature between 1997 and 2017. [h/t Simon Proud] — : December 23, 2020
Links:
In a paper published this spring, Mahler et al. describe their dataset of social scientists’ appearances in US congressional hearings — more than 15,000 instances in all, at more than 10,000 hearings between 1946 and 2016. For each testimony, the dataset indicates the expert’s name, discipline, title, and professional affiliations, as well as the hearing’s date, title, and committee. Economists predominate, followed by political scientists, psychologists, sociologists, and then anthropologists. [h/t Deblina Mukherjee] — : December 23, 2020
Links:
Tags: government, history, science
The US federal government has finally begun publishing county-level data on COVID-19 test counts, positivity rates, and delays. And that’s just a slice of the information now available through the daily-updated, multi-agency Community Profile Reports, which also assign each county to a “concern category” and aggregate the metrics to the CBSA, state, and regional levels. Related: Ryan Panchadsaram’s enthusiastic Twitter thread. — : December 23, 2020
Links:
Our World in Data is tracking the number of COVID-19 vaccine doses administered per country, compiling their dataset from a range of government sources, including press releases and ministers’ tweets. In addition to listing the total doses administered, the US Department of Health and Human Services is also publishing datasets that tally how many Pfizer-BioNTech and Moderna vaccine doses have been allocated and shipped to each state and territory. — : December 23, 2020
Links:
This is, “to the best of our knowledge, [...] the first dataset providing network traffic traces and corresponding event logs from a complex cyber defense exercise” — a two-day Cyber Czech event in March 2019. — : December 16, 2020
Links:
Tags: technology, war
Economist Victor Gay has built a geographic dataset that traces, year by year, the administrative boundaries of France’s Third Republic, which governed from 1870 to 1940, when the Vichy Regime took power. The dataset provides annual shapefiles delineating the country’s départements, arrondissements, and cantons; as well as for its “most significant special administrative constituencies: military, judicial and penitentiary, electoral, academic, labor inspection, and ecclesiastical.” — : December 16, 2020
Links:
The COVID Border Accountability Project is tracking countries’ pandemic-related travel and immigration restrictions, on a weekly basis. The project’s team categorizes various aspects of the restrictions — whether they hinge on citizenship, halt new visa applications, et cetera — and turns them into a longitudinal dataset. Previously: The UN World Food Program’s travel-restrictions dataset (DIP 2020.12.09). — : December 16, 2020
Links:
Thanks to a FOIA lawsuit by a group of news organizations, the US Small Business Administration has released additional data about the financial assistance distributed through its Paycheck Protection Program. Previously (DIP 2020.07.08), the SBA’s public data withheld the specific amounts for all loans (instead listing only a broad range), as well as names and addresses for loans below $150,000. The new datasets include those amounts, names, and addresses for all loans. — : December 16, 2020
Links:
Recently launched, Enslaved.org allows the public to “explore or reconstruct the lives of individuals who were enslaved, owned slaves, or participated in the historical trade.” Its interactive database contains 600,000+ records, with plans to expand. The collaborative, scholar-led project also includes The Journal of Slavery and Data Preservation, which “publishes original, peer-reviewed datasets about the lives of enslaved Africans and their descendants.” The first issue features three datasets originally published through a precursor to Enslaved.org — Slave Biographies: The Atlantic Database Network. Those datasets focus on Louisiana slaves (1719–1820), New Orleans “Free Blacks” (1840–1860), and enslaved Africans in Maranhão, Brazil (1767–1831). — : December 16, 2020
Links:
Data scientist Jared Wilber has built a dataset of all paintings in Bob Ross’s 31 seasons of “The Joy of Painting,” scraped from the searchable database at TwoInchBrush.com. For each painting, the dataset lists the title, season, episode, YouTube link, and list of colors used. For a 2014 article at FiveThirtyEight, Walt Hickey created a dataset categorizing the types of things Ross depicted in each episode. Related: “Where Are All the Bob Ross Paintings? We Found Them,” a video from the New York Times. [h/t u/palpitations] — : December 9, 2020
Links:
Tags: art, television
During Colombia’s struggle for independence, Royalists executed scores of women by firing squad, the most famous of whom was the seamstress and spy known as “La Pola.” Writing last year for the cultural journal of Colombia’s central bank, historian Pablo Rodríguez Jiménez presented a list of 76 women known to have suffered this fate — their names, locations, and dates of death. Colombia-based Datasketch has converted that list into a spreadsheet. [h/t Juan Pablo Marín Díaz] — : December 9, 2020
Links:
The CIA’s World Factbook “provides information on the history, people and society, government, economy, energy, geography, communications, transportation, military, and transnational issues for 267 world entities.” The details are extensive and fairly standardized. Open data–enthusiast Gerald Bauer has converted the publication into a series of JSON files. Now you know: The physical areas of five countries (Georgia, Ireland, Latvia, Lithuania, and Sri Lanka) are all described as “slightly larger than West Virginia.” — : December 9, 2020
Links:
Tags: miscellaneous
The UN World Food Program has been tracking countries’ and airlines’ travel restrictions during the COVID-19 pandemic, based on official communications, media reports, and other sources. The country-level dataset indicates whether travelers must obtain a recent negative test and what type of quarantine or self-isolation is required. [h/t Cassidy Chansirik] — : December 9, 2020
Links:
On Monday, the Department of Health and Human Services released a dataset on coronavirus-related capacity at thousands of US hospitals — information the agency previously only published as state-level metrics. The self-reported, weekly-updated dataset quantifies various aspects of capacity, such as the number of staffed ICU beds and the number of beds occupied by patients with COVID-19. “This data is tremendously complex and is the result of substantial ongoing efforts,” notes an accompanying blog post. “We opted not to have perfect be the enemy of good, so these datasets will have imperfections.” Related: An FAQ “developed in collaboration with a group of data journalists, data scientists, and healthcare system researchers who have reviewed the data.” [h/t Ryan Panchadsaram] — : December 9, 2020
Links:
The decades-old, frequently-updated, and downloadable On-Line Encyclopedia of Integer Sequences contains more than 338,000 lists of those things. Each has some particular significance, ranging from the famous (the Fibonacci numbers) to the intriguing (“days required to spread gossip to n people”) to the obscure (“numbers n such that 2^n + 35 is prime”) to the super-obscure. Related: This xkcd comic and its impact. [h/t Dan Brady] — : December 2, 2020
Links:
Tags: miscellaneous
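To make one of the sequences quoted above concrete, here is a short sketch that recomputes the first terms of "numbers n such that 2^n + 35 is prime" by brute force. The function names are illustrative, and simple trial division is used only because the numbers involved are small; it is not how the OEIS entry itself was produced.

```python
def is_prime(m):
    """Trial-division primality test; adequate for small inputs."""
    if m < 2:
        return False
    if m % 2 == 0:
        return m == 2
    d = 3
    while d * d <= m:
        if m % d == 0:
            return False
        d += 2
    return True


def n_with_prime_2n_plus_35(limit):
    """Return every n in 1..limit for which 2**n + 35 is prime."""
    return [n for n in range(1, limit + 1) if is_prime(2**n + 35)]


first_terms = n_with_prime_2n_plus_35(12)
```

For n up to 12 this yields 1, 3, 5, 7, 9, 11. Every even n drops out because 2^n + 35 is then divisible by 3, and n = 13 drops out because 8,227 = 19 × 433.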
From July 1776 to June 1826, Thomas Jefferson recorded thousands of nearly-daily weather observations — temperatures, precipitations, humidities, wind speeds — at Monticello, Paris, Milan, and scores of other locations. Now a UVA/Princeton collaboration has turned those handwritten records into an explorable and downloadable database. [h/t Erica Cavanaugh] — : December 2, 2020
Links:
Last month, the University of Illinois’ Cline Center for Advanced Social Research published version 2.0 of its Coup D’état Project, a dataset detailing more than 900 coups, attempted coups, and coup conspiracies from 1945 to 2019. Each entry indicates the country and date, plus the “type of actor who initiated the coup (i.e. military, palace, rebel, etc.) as well as the fate of the deposed executive (killed, injured, exiled, etc.).” Previously: Powell and Thyne’s coup dataset (DIP 2016.07.20). — : December 2, 2020
Links:
As part of its Pradhan Mantri Gram Sadak Yojana road-development program, India’s Ministry of Rural Development has gathered data on 700,000+ rural facilities, which data-science engineer Pratap Vardhan has organized into state-level CSV files. The information includes each facility’s name, category (e.g., education, medical, etc.), subcategory, state, district, block, address, and geocoordinates. Related: An exploratory Twitter thread by Vardhan, who says, “This is probably the largest open indian geo-tagged dataset I’ve seen!? It’s mostly great!?” — : December 2, 2020
Links:
The US Department of Education publishes a range of aggregate datasets on federal student loans, including the amounts outstanding ($1.5+ trillion overall, from 43 million students), volumes of financial aid requested and awarded (by student demographic and by school), default rates, and forgiveness. — : December 2, 2020
Links:
DIP editions 2016.11.16 and 2018.08.08 featured datasets of trees in NYC and other cities. But wait, there's more: millions of trees in Bogotá, Paris, Vienna, Amsterdam, Helsinki, and Dublin. [h/t Juan Pablo Marín Díaz + Ana Lucía González + Cormac O’Keeffe + Topi Tjukanov + Tuija Sonkkila + Martin Bangratz + Sanne Hombroek + u/cavedave] — : November 18, 2020
Links:
Tags: environmentmapping
A new dataset from the Urban Institute “provides a comprehensive accounting of public spending on children from 1997 through 2016.” Drawing on the US Census Bureau’s Annual Survey of State and Local Government Finances and other sources, the dataset summarizes “state-by-state spending on education, income security, health, and other areas.” [h/t Erica Greenberg] — : November 18, 2020
Links:
For decades, the US Department of Education’s Civil Rights Data Collection has compiled “data on key education and civil rights issues in our nation’s public schools,” including “student enrollment and educational programs and services, most of which is disaggregated by race/ethnicity, sex, limited English proficiency, and disability.” Last month, the department released the CRDC for the 2017–18 school year. Related: ProPublica has used CRDC data to investigate racial inequality and the use of restraints and seclusions. [h/t Andrew McCartney] — : November 18, 2020
Links:
Researchers from ETH Zurich’s Institute for Transport Planning and Systems have assembled 170 million observations of traffic intensity on urban roads, registered by 23,000+ detection points in 40 cities, “making it the largest multi-city traffic dataset publically available.” The cities are mostly in Western Europe, but also include Tokyo, Taipei, Melbourne, Vilnius, Los Angeles, and Toronto. [h/t ddechamb] — : November 18, 2020
Links:
Tags: transportation
The World Inequality Database “aims to provide open and convenient access to the most extensive available database on the historical evolution of the world distribution of income and wealth, both within countries and between countries.” The project, co-directed by Thomas Piketty, published a major update last week, expanding its geographic and temporal coverage. The data points vary by country; you can download them interactively or in bulk. Previously: Frederick Solt’s Standardized World Income Inequality Database (DIP 2019.12.04) and the United Nations University’s World Income Inequality Database (DIP 2016.06.01). — : November 18, 2020
Links:
Operabase has gathered information about more than 500,000 opera performances staged since 1996. The website doesn’t provide direct downloads but you can access a dataset on six full seasons of stagings, covering thousands of runs in hundreds of cities, thanks to a “data donation” to support Alexander N. Cuntz’s study of how copyright affects performance frequency. — : November 11, 2020
Links:
The al-Ṯurayyā Project features an interactive map of the early Islamic world, with 2,000 named locations — from Damascus to Baghdad and beyond — and historical routes between them. The underlying dataset includes geocoordinates, Arabic spellings, transliterations, primary sources, and other details. [h/t Jajwalya Karajgikar] — : November 11, 2020
Links:
“Why do transit-infrastructure projects in New York cost 20 times more on a per kilometer basis than in Seoul?” With the aim of answering questions like these, the NYU-based Transit Costs Project is building a dataset that already spans more than 500 urban rail projects around the world. For each project, the dataset specifies the city, start year, end year, rail length, number of stations, total cost, and more. — : November 11, 2020
Links:
Tags: transportation
As part of Jones v. United States Postal Service, a federal lawsuit filed in August, USPS must submit weekly performance reports that indicate, at a national and district level, the percentage of mail that was processed (though not necessarily delivered) on time. The agency files these reports as PDFs; Save the Post Office, a decade-old website run by a retired English professor, has been collecting those PDFs and converting them into spreadsheets. Related: Aaron Gordon’s pre-election analysis of the USPS data, from Gordon’s (limited-run) newsletter about the postal service. — : November 11, 2020
Links:
The Marshall Project has obtained and published official data from US Customs and Border Protection listing 580,000+ times that the agency detained migrant children since early 2017. For each detention, the dataset includes the date and time the child entered and left CBP custody, as well as the child’s age, gender, and citizenship. Related: The Marshall Project’s report on the data. — : November 11, 2020
Links:
Tags: familyimmigrationjustice
Some years ago, economists Peter T. Leeson and Jake Russ compiled a dataset of 10,000+ witch trials in Europe. Over the course of 550 years, the trials accused more than 43,000 people and led to 16,000 deaths. Related: Leeson and Russ’s academic paper analyzing the data (PDF). Previously: The Survey of Scottish Witchcraft (DIP 2016.01.27). [h/t Sophie Warnes] — : October 28, 2020
Links:
A dataset of state regulations pertaining to standard bicycles, electric bicycles, and electric scooters accompanies a recent paper in the Journal of Law and Mobility. It includes various classifications for each state — indicating, for instance, whether riders can use the sidewalk, whether adults must wear helmets, and whether DUI laws apply. — : October 28, 2020
Links:
Tags: lawtransportation
A team of academics has built a dataset and map indicating the geographic extents of 6,000+ mining sites around the world. The project traces out 21,000+ polygons covering 57,000+ square kilometers, with boundaries based on experts’ visual interpretation of satellite imagery. It focuses on “above-ground features,” such as “open cuts, tailings dams, waste rock dumps, water ponds, and processing infrastructure.” — : October 28, 2020
Links:
Tags: mapping
Since late March, US Immigration and Customs Enforcement has been updating a webpage that tallies coronavirus cases and deaths at each of its detention facilities. But the site doesn’t provide any historical numbers; it’s just a snapshot. To fill that gap, the Vera Institute of Justice has been continually downloading the webpage, parsing it, and turning the information into structured, longitudinal data. Related: Over the summer, Vera used the data to estimate “the true scope” of COVID-19 in ICE detention, and called on the agency to provide more detailed statistics. — : October 28, 2020
Links:
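The snapshot-to-longitudinal idea described above can be sketched roughly as follows. This is a minimal illustration, not the Vera Institute's actual pipeline: the facility names, field names, and counts are all made up, and the step of downloading and parsing the webpage is left out.

```python
from datetime import date


def append_snapshot(history, snapshot_date, snapshot):
    """Add one day's snapshot to a longitudinal table.

    history: list of row dicts accumulated so far (hypothetical structure).
    snapshot: dict mapping facility name -> cumulative case count that day.
    """
    for facility, cases in sorted(snapshot.items()):
        history.append(
            {"date": snapshot_date.isoformat(), "facility": facility, "cases": cases}
        )
    return history


def daily_change(history, facility):
    """Return (date, change in cumulative cases) pairs for one facility."""
    rows = sorted(
        (r for r in history if r["facility"] == facility), key=lambda r: r["date"]
    )
    return [
        (curr["date"], curr["cases"] - prev["cases"])
        for prev, curr in zip(rows, rows[1:])
    ]


# Made-up numbers for illustration only.
history = []
append_snapshot(history, date(2020, 10, 1), {"Facility A": 10, "Facility B": 3})
append_snapshot(history, date(2020, 10, 2), {"Facility A": 14, "Facility B": 3})
```

Storing one row per facility per day keeps every snapshot, so changes over time can be computed later even though the source page only ever shows the latest totals.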
Last month the Guttmacher Institute, a “research and policy organization committed to advancing sexual and reproductive health and rights,” released its latest estimates of annual pregnancies, births, and abortions among women in the US — overall, and also by age-group and state. The national statistics cover 1973 to 2016, while the state-level numbers go from 1988 to 2016; both sets of estimates are derived from a combination of government data and the institute’s own Abortion Provider Census. Related: An introductory Twitter thread. [h/t Isaac Maddow-Zimet] — : October 28, 2020
Links:
Tags: healthcarewomen
Scientists recently documented “widespread declines in Pacific salmon size based on 60 years of measurements from 12.5 million fish across Alaska,” collected by state officials from more than 1,000 sampling locations. The underlying dataset lists each salmon’s age, sex, length, and other details where available. [h/t Holly Kindsvater] — : October 21, 2020
Links:
Technology companies often offer “bug bounties” — rewards to people who tell them about vulnerabilities in their websites and apps, often coordinated through online platforms. For the past few years, security engineer Arkadiy Tetelman has been scraping and publishing data about the bug bounty “targets” listed on several platforms, including the eligible domains, maximum payouts, and more. — : October 21, 2020
Links:
Tags: technology
The Equity in Athletics Disclosure Act requires thousands of US colleges to provide annual data on athletic participation, staffing, and finances by team gender and sport. School- and team-level datasets are available through the Department of Education for the academic years ending 2003–19. Related: USAFacts recently used the data to examine college football finances. [h/t Sasha Anderson] — : October 21, 2020
Links:
Last week, Reuters published an investigation into deaths in US jails. Because the federal government doesn’t publish jail-by-jail mortality data, reporters “filed more than 1,500 records requests to obtain information about deaths in 523 U.S. jails – every jail with an average population of 750 or more inmates, and the 10 largest jails or jail systems in nearly every state.” The resulting dataset contains details about more than 7,500 inmate deaths between 2008 and 2019, including the cause of death, custody status, and demographic information. Among the findings: “At least two-thirds of the dead inmates identified by Reuters, 4,998 people, were never convicted of the charges on which they were being held.” [h/t Grant Smith] — : October 21, 2020
Links:
Verified Voting is “a non-partisan organization focused exclusively on the critical role technology plays in election administration,” and its Verifier project provides “the only comprehensive data set of voting equipment down to the precinct level of the United States, going back to 2006.” For each election year and jurisdiction, the database indicates the type of technology in use (ranging from hand-counted paper ballots to touchscreen systems), equipment brands and models, and other details, such as whether there is a “voter-verified paper audit trail.” [h/t Jonathan Cohen] — : October 21, 2020
Links:
The Kiss List explores artist Galen Beebe’s 48 first kisses. Reconstructed from “memories and journals,” and developed with her partner John West, the dataset and visualization present “a set of facts that show who, what, where, when, and why I kissed how I did.” — : October 7, 2020
Links:
Tags: artmiscellaneous
The Bank of England has considerably expanded its longitudinal dataset on the UK’s economy (DIP 2017.01.25), renaming it “a millennium of macroeconomic data.” A few indicators (such as GDP per capita) now stretch back to 1086, thanks to the Domesday Book, while several others (such as consumer price inflation) now extend to the 13th century. [h/t Alex Albright] — : October 7, 2020
Links:
The Library of Congress’ Newspaper Navigator dataset extracts “visual content” from more than 16 million pages of newspapers from 1789 to 1963, drawn from the library’s Chronicling America project (DIP 2017.08.16). To compile the dataset, its creators used machine learning to detect seven types of visuals: photos, illustrations, maps, comics, editorial cartoons, headlines, and ads. They also built an interactive search tool. [h/t Jessamyn West] — : October 7, 2020
Links:
In August, BuzzFeed News published a two-part investigation into China’s “vast new infrastructure” for imprisoning Muslim minorities in its Xinjiang region. The reporters used a novel methodology to identify hundreds of detention facilities: examining the gaps in Baidu Maps’ satellite imagery. Last month, they published a dataset of those facilities’ coordinates and statuses. The Australian Strategic Policy Institute has also launched its Xinjiang Data Project, identifying more than 380 detention facilities as well as the destruction of religious/cultural sites in the region. The project — which builds on previous research by the institute, BuzzFeed News, and others — classifies the detention sites into four tiers, from “low-security re-education facilities” to “suspected maximum-security prisons.” [h/t William Yang] — : October 7, 2020
Links:
Tags: human rights
As part of a new investigative series, the Center for Public Integrity and Stateline have published a dataset of polling places during the 2012–18 US general elections for 30 states (and plan to add more states “in the coming weeks”). To assemble the dataset, reporters filed 1,200 records requests, and then converted the disparate files they received into standardized CSVs listing each polling place’s county, precinct, name, and address. — : October 7, 2020
Links:
Mathematician Chris K. Caldwell maintains a searchable, downloadable database of the largest prime numbers known to humankind — plus who discovered them, when, and how. Related: Smaller primes. — : September 30, 2020
Links:
Tags: miscellaneous
The Social Connectedness Index, a collaboration between Facebook and academic researchers, quantifies “the intensity of connectedness between locations” by measuring the frequency of Facebook-friendships linking their residents. The index represents this measurement on a scale from 1 to 1,000,000,000; the publicly available datasets provide it for every pair of countries, every pair of US counties, every county-country pair, and between subnational regions around the world. Related: An illustrative Twitter thread demonstrating the data. [h/t Johannes Stroebel] — : September 30, 2020
Links:
Tags: mappingsocial media
The Global Wind Atlas aims “to help policymakers, planners, and investors identify high-wind areas for wind power generation virtually anywhere in the world.” The internationally-funded project provides a range of global and country-specific datasets, including wind speeds at various heights, as well as a description of its methodology and an FAQ. [h/t Anton Rühling] — : September 30, 2020
Links:
Tags: energy
The US Treasury’s “Debt to the Penny” dataset reports the total amount of outstanding public debt issued by the federal government — updated daily and going back to April 1, 1993. As of Monday, that number was $26,811,409,726,497.33. Another of the Treasury’s datasets provides annual debt figures going back to 1789. [h/t Sam Hunley] — : September 30, 2020
Links:
For years, Open States has allowed you to “track bills, review upcoming legislation, and see how your local representatives are voting in your state.” The volunteer-driven project provides bulk downloads of nearly all its data, plus an API. More recently, it has started tracking coronavirus-related legislation, with data on more than 3,300 bills across the country. [h/t Amy Cesal] — : September 30, 2020
Links:
Tim Morgan’s bible-api.com “provides a JSON API for grabbing bible verses and passages.” You can choose verses from six translations in five languages, and can download both the underlying code and the data. [h/t Oto Brglez] — : September 2, 2020
Links:
For their “Pacem in Terris: Are Papal Visits Good News for Human Rights?” working paper, economists Marek Endrich and Jerg Gutmann have compiled “the first global dataset on papal travels outside of Italy.” For each country-year combination between 1964 and 2017, the dataset indicates whether the Pope visited, who he was, and the country’s “latent human rights” score for that year. — : September 2, 2020
Links:
Tags: human rightsreligion
UX designer Tait Chamberlain has constructed a dataset of all US presidential cabinet nominations — including nominees appointed, withdrawn, and rejected — from George Washington’s to the current administration. The spreadsheets contain “service dates, notable scandals, education, military service, foreign birth, known minority and gender status, whether the appointee died in office, and the senate confirmation vote tallies.” Previously: Cabinets around the world, 1966–2016 (DIP 2020.08.05). — : September 2, 2020
Links:
Education Week is collecting coronavirus-era reopening plans “from a sample of school districts around the country.” The dataset (free registration required) covers more than 800 US public school districts so far, and is being updated weekly. USAFacts has collected similar information from the 255 largest US public school districts as of August 17. And coronaviral.fyi has pulled together data on a thousand districts’ plans in Connecticut, New Jersey, New York, and Pennsylvania. [h/t Sasha Anderson + Stephen Stirling and Rebekah F. Ward] — : September 2, 2020
Links:
The China Biographical Database is packed with details on “approximately 470,000 individuals” from historical China, “primarily from the 7th through 19th centuries.” The extensively documented records include information about kinships, social statuses, offices and postings, aliases, known addresses, and more. The project, which has a long history of its own, provides bulk downloads as well as an API. [h/t Yifei Hu] — : September 2, 2020
Links:
Tags: history
“With millions of images in our library and billions of user-submitted keywords, we work hard at Shutterstock to make sure that bad words don’t show up in places they shouldn’t.” The company’s dataset of Dirty, Naughty, Obscene, and Otherwise Bad Words contains the block-lists for their autocompletion and recommendation features, covering 2,600+ words and phrases in 28 languages. [h/t Katie McCulloch] — : August 26, 2020
Links:
Tags: language
The US National Park Service publishes a dataset of 28,000+ lines describing “formal and informal trails as well as routes within and across” the park system. The dataset provides the trail names, types, surfaces, allowed uses, and more. [h/t u/torrijasycafe] — : August 26, 2020
Links:
Tags: environmentmapping
The US Government Publishing Office’s govinfo.gov provides online access to a wide range of official federal publications — including bulk downloads of congressional bills, the Federal Register, the Code of Federal Regulations, and more. It also provides sitemaps “to crawl and harvest content” from many of its other collections. [h/t Christine Stefano] — : August 26, 2020
Links:
Tags: government
The Authoritarian Ruling Elites Database, compiled by political scientist Austin S. Matthews, is “a collection of biographical and professional information on the individuals who constitute the top elite of authoritarian regimes.” Each of the project’s 18 datasets focuses on a particular regime, such as the military dictatorship that ruled Chile from 1973 to 1990. The biographical data-points include gender, occupation, dates of birth and death, tenure among the elite, and more. — : August 26, 2020
Links:
Tags: history
The Mass Mobilization Project is “an effort to understand citizen movements against governments, what citizens want when they demonstrate against governments, and how governments respond to citizens.” The project’s dataset covers more than 14,000 protests in more than 160 countries between 1990 and early 2017. For each protest, it indicates the location, dates, estimated number of participants, protesters’ demands, the state’s response, and more. The project, led by political scientists David H. Clark and Patrick M. Regan, is indirectly funded by the CIA through the government-sponsored Political Instability Task Force. [h/t Erik Gahner Larsen] — : August 26, 2020
Links:
The US Geological Survey’s Native Bee Inventory and Monitoring Lab keeps tabs on the country’s bee species, including through a dataset of more than 400,000 observations of “native and non-native bees, wasps and other insects.” (Free registration required.) Related: The lab also publishes “The Very Handy Manual: How to Catch and Identify Bees and Manage a Collection,” plus thousands of high-resolution photos. — : August 19, 2020
Links:
Tags: animalsenvironment
England’s Immigrants 1330-1550 is “a fully-searchable database containing over 64,000 names of people known to have migrated to England during the period of the Hundred Years’ War and the Black Death, the Wars of the Roses and the Reformation,” drawing on “taxation assessments, letters of denization and protection, and a variety of other licences and grants.” In addition to names, the dataset includes nationalities, places of residence, occupations, and more. [h/t W. Mark Ormrod] — : August 19, 2020
Links:
Tags: historyimmigration
With help from academics and former Hill staffers, journalist Derek Willis has assembled an archive of the weekly job and internship bulletins sent by the US House of Representatives. The archive, which goes back to late 2013, includes both the original PDFs and text extracted from them. — : August 19, 2020
Links:
Tags: government
The Reflective Democracy Campaign and the Center for Technology and Civic Life have partnered to produce datasets examining the demographics of 3,000+ elected sheriffs; 2,800+ elected prosecutors; and candidates for federal, state, and local offices in 2012, 2014, 2016, and 2018. [h/t Stacy Montemayor] — : August 19, 2020
Links:
Tags: governmentlaw
The United States Sentencing Commission publishes annual datasets, going back to 2002, on people and organizations criminally sentenced in federal court. The files are anonymized, but contain hundreds of variables detailing the circumstances and outcomes of each decision. The commission also publishes “special collections” with additional information on drug-trafficking and economic crimes. Note: The datasets are published as SAS and SPSS files, but Kevin H. Wilson has shared Python code to convert them to CSVs. [h/t Giuseppe Sollazzo] — : August 19, 2020
Links:
On Reddit, users can choose to publicize their history of upvoting and downvoting other users’ posts. Software engineer Joey Leake recently collected and published data on more than 44 million of these votes. For each, Leake’s dataset lists the post ID, relevant subreddit, the vote’s timestamp (in most cases), and the voter’s username. — : August 12, 2020
Links:
Tags: social media
A team led by public policy professor Omar Asensio used a field experiment to collect data on 3,395 electric vehicle charging sessions. The dataset “contains sessions from 85 EV drivers with repeat usage at 105 stations across 25 sites at a workplace charging program”; it indicates the date and length of each session, total energy used, cost, and more. — : August 12, 2020
Links:
Tags: energytransportation
The Bank for International Settlements, established in 1930 and run by a group of central banks, publishes a range of statistical datasets “designed to inform analysis of financial stability, international monetary spillovers and global liquidity.” The datasets are available to explore online and to download; they cover exchange rates, cross-border liabilities, consumer prices, debt service ratios, and more. [h/t Erik Gahner Larsen] — : August 12, 2020
Links:
On Friday, the Washington Post released several datasets and computer scripts from its Pulitzer Prize–winning series, “2°C: Beyond the Limit,” which examined how “extreme climate change has arrived in America.” The data files — derived from NOAA’s nClimDiv and nClimGrid climate datasets — contain the annual average temperatures and seasonal temperature changes for each state and county in the contiguous US. — : August 12, 2020
Links:
The Centre for Disaster Protection and Development Initiatives have been jointly compiling data on the billions of dollars of humanitarian and development aid that the IMF, World Bank, and other agencies have allocated in response to the COVID-19 pandemic. For each “flow” of funds, the dataset specifies the funding source, amount, approval date, purpose, and more. Related: The UN provides a downloadable and explorable dataset of its coronavirus-related humanitarian funding, and has built an interactive map based on these and related datasets. — : August 12, 2020
Links:
SketchGraphs characterizes 15 million computer-aided design (CAD) sketches, “extracted from real-world CAD models” and obtained from a popular online CAD platform. Each of the dataset’s sketches “is represented as a geometric constraint graph where edges denote designer-imposed geometric relationships between primitives, the nodes of the graph.” — : August 5, 2020
Links:
Tags: arttechnology
The USDA has been publishing basic information about animals that its National Veterinary Services Laboratories have confirmed contracted the novel coronavirus. The simple table — unavailable to download, but easy enough to copy-paste into a spreadsheet — lists the types of animals (mostly “Cat” and “Dog,” but also a lion and a tiger), the states they lived in, the dates confirmed, and the methods of diagnosis. — : August 5, 2020
Links:
Breaking Trust, a new project from the Atlantic Council, provides a dataset of 115 “software supply chain” attacks and disclosures in the past decade. These vulnerabilities occur “when an attacker accesses and edits software somewhere in the complex software development supply chain to compromise a target farther up the chain by inserting their own malicious code.” The dataset’s examples include Stuxnet, malicious browser extensions, and various attacks on software package registries. [h/t Maya Kaczorowski + Lily Liu] — : August 5, 2020
Links:
Tags: technology
The World Values Survey, first fielded in 1981, “is the largest non-commercial, cross-national, time series investigation of human beliefs and values ever executed, currently including interviews with almost 400,000 respondents.” Last month, the project began releasing data from its seventh wave of interviews, conducted in 77 countries and covering hundreds of questions about religion, migration, stereotypes, trust, and more. [h/t Michael Howlett + Seth J. Meyer] — : August 5, 2020
Links:
WhoGov, a new project led by two graduate students at Oxford, provides “bibliographic information, such as gender and party affiliation, on cabinet members in July every year in the period 1966-2016 in all countries with a population of more than 400,000 citizens.” In all, the dataset covers more than 50,000 officials, and “makes it possible to answer questions such as; what is the share of female cabinet members globally, which type of regime has the highest cabinet turnover, and have cabinets increased in size over time?” [h/t Yujin Julia Jung + Max Grömping] — : August 5, 2020
Links:
Tags: government
The Meteoritical Society’s Meteoritical Bulletin Database “is a clearinghouse for basic information about each meteorite, including the classification, place and year of discovery, whether it was observed to fall, references to catalogues in which the meteorite is described, and known synonyms that may be encountered in the literature.” The society doesn’t provide an easy way to download the full database, but NASA’s open data portal hosts a key slice of it — 45,000+ landings recorded through mid-2013. Related: Craig Taylor’s animated, 3-D rendering of the data. [h/t EwanP] — : July 29, 2020
Links:
Christoph Trinn and Felix Schulte’s TERRGO dataset presents “a fresh look at territorial self-governance in more than 2,200 second-level regions in 96 Western and non-Western democracies, semi-democracies, and a selection of autocratic regimes between 2000 and 2018.” For each region (for instance, Guam or Greenland), TERRGO provides several self-governance metrics, such as whether the national constitution protects its status and to what degree it can set taxes. — : July 29, 2020
Links:
Tags: government
FiveThirtyEight’s frequently-updated polling database provides results from thousands of polls (and hundreds of pollsters) on the current presidential, congressional, and gubernatorial elections. Those datasets — plus historical polling averages for presidential elections since 1980 — are available to download. The files list each poll’s sample size, methods, timeframe, FiveThirtyEight pollster rating, and more. — : July 29, 2020
Links:
ProPublica has published a dataset of more than 12,000 civilian complaints against nearly 4,000 NYPD officers. The reporters obtained the data through a freedom-of-information request to New York City’s Civilian Complaint Review Board, after state lawmakers overturned a decades-old statute that had shielded the records. ProPublica’s database “lists the name of each officer, the race of the complainant and the officer, a category describing the alleged misconduct, and whether the CCRB concluded the officers’ conduct violated NYPD rules.” Related: In 2018, my colleagues Kendall Taggart and Mike Hayes obtained thousands of NYPD disciplinary records from a source who requested anonymity — giving the public access to this closely-guarded information for the first time and demonstrating how the department had let hundreds of officers keep their jobs after committing fireable offenses. (Soon after, the NYPD announced an independent panel to review its disciplinary program.) [h/t Jan Willem Tulp + Ed Vine] — : July 29, 2020
Links:
The Milken Institute’s FasterCures project is tracking hundreds of potential COVID-19 treatments and vaccines. For each candidate, the project’s database lists its category (e.g., DNA-based vaccines, cell-based therapies, et cetera), a brief description, its stage of development, “anticipated next steps,” funders, and more. Related: The project’s interactive graphic exploring the vaccines. — : July 29, 2020
Links:
The Economist’s Big Mac Index, which the magazine invented in 1986, compares the cost of the signature McDonald’s hamburger around the world. “Burgernomics was never intended as a precise gauge of currency misalignment, merely a tool to make exchange-rate theory more digestible,” it says. “Yet the Big Mac index has become a global standard, included in several economic textbooks and the subject of dozens of academic studies.” The index is updated twice a year (including last week) and now covers 55 countries; both the data (going back to April 2000) and calculation code are available to download. — : July 22, 2020
Links:
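As a rough illustration of the exchange-rate comparison behind the index, the sketch below computes an implied exchange rate from two burger prices and compares it with the market rate. It follows the commonly described "raw" version of the calculation; the prices and exchange rate are invented, the function names are hypothetical, and the magazine's published code remains the authoritative method.

```python
def implied_rate(local_price, us_price):
    """Exchange rate (local currency per US dollar) that would equalize burger prices."""
    return local_price / us_price


def valuation(local_price, us_price, market_rate):
    """Fraction by which the local currency looks over- (+) or undervalued (-)."""
    return implied_rate(local_price, us_price) / market_rate - 1


# Invented figures: a burger costs 20.0 local units and 5.0 dollars,
# while the market exchange rate is 8.0 local units per dollar.
example = valuation(local_price=20.0, us_price=5.0, market_rate=8.0)
print(f"{example:+.0%}")
```

A negative result suggests the local currency buys more burgers than the market rate implies, which is read as undervaluation against the dollar.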
The General Bathymetric Chart of the Oceans project has published a new version of its Arctic Ocean depth chart, which was first released in 1997 and last updated in 2012. The latest dataset incorporates new sources, has “more than twice the resolution” of the previous version, and is more precise. Previously: Arctic (and Antarctic) ice coverage (DIP 2016.09.14), which is hitting year-over-year lows. — : July 22, 2020
Links:
Tags: mapping
The Mexican Migration Project “was created in 1982 by an interdisciplinary team of researchers to further our understanding of the complex process of Mexican migration to the United States.” Since then, the project — co-directed by professors at the University of Guadalajara and Princeton University — has interviewed more than 176,000 people from 170 communities in Mexico, some who migrated and others who did not. The datasets (registration required) record various facets of their lives and migrations: demographic, health, and economic attributes; migration timings, circumstances, and destinations; community characteristics; and more. [h/t Brian K. Kovak and Rebecca Lessem] — : July 22, 2020
Links:
Tags: immigration
Since early April, Imperial College London and YouGov have been surveying people in 29 countries about their coronavirus-related behaviors and opinions. Topics include mask usage, self-isolation, working from home, vaccinations, and economic activity; the 230,000+ (anonymized) responses are available to download. [h/t Akin Unver] — : July 22, 2020
Links:
The Atlas of Surveillance, a new project from the Electronic Frontier Foundation, documents various types of surveillance technology used by 3,500 law enforcement agencies around the US. The 5,300 data-points, crowdsourced with the help of hundreds of students and volunteers, cover a dozen categories of technology, such as automated license plate readers, facial recognition systems, and partnerships with doorbell-camera companies. [h/t anigbrowl] — : July 22, 2020
Links:
Tags: justicelawtechnology
The Historical Light Aids to Navigation dataset “shows the development of historical lighthouses, lightships, harbour lights and beacons in England and Wales for several benchmark years between 1514-1911,” drawn from navigational charts, government publications, and other sources. For each of the 600+ entries, the dataset provides the light aid’s name, geocoordinates, and (when available) its visibility range, height, and number of lights. — : July 15, 2020
Links:
The Caselaw Access Project (DIP 2018.11.07) has begun publishing a citation graph, a dataset listing the previous cases that each court decision cites. The latest release covers 43 million citations. The project also provides aggregated versions of the data, plus interactive graphics showing the frequency of citations to and from courts in each state. — : July 15, 2020
Links:
Tags: law
The anonymous author of Squirrelling Data has been collating information from the Singapore Ministry of Health’s coronavirus press releases. Among the datasets: daily case counts in dozens of migrant worker dormitories, which have been hit hard. [h/t Joses Ho] — : July 15, 2020
Links:
The CoronaNet Research Project aims “to collect as much information as we can about the various fine-grained actions governments are taking to defeat the coronavirus.” The project, which has drawn contributions from more than 400 researchers around the world, published its initial release a few weeks ago, and now details nearly 16,000 policy events in nearly 200 countries. Related: The nonprofit Hikma Health says it has compiled “the largest county-level COVID-19 policy dataset in the nation,” covering 1,200 US counties and more than 120 Native American communities. The dataset indicates the dates on which each jurisdiction undertook various responses, such as closing schools and restricting large gatherings. [h/t Alex Pashanov] — : July 15, 2020
Links:
The US Centers for Disease Control and Prevention has published a dataset containing demographic and medical information on 1.75 million deidentified COVID-19 patients. For each confirmed or probable case, the dataset reports the patient’s age group and race/ethnicity, the date of their initial symptoms, whether they were hospitalized, whether they had an “underlying morbidity or disease,” and more — although several of the fields contain high percentages of “unknown” values. The dataset is similar to the one The New York Times got from the CDC through a Freedom of Information Act lawsuit (see: “The Fullest Look Yet at the Racial Inequity of Coronavirus”), but it does not specify the patient’s county. [h/t Marc Bevand + Steven Mosher] — : July 15, 2020
Links:
Microsoft researchers Revanth Rameshkumar and Peter Bailey have assembled the Critical Role Dungeons and Dragons Dataset, converting 159 transcripts of a popular, live-streamed role-playing show into structured information about 398,682 bits of dialogue. [h/t Lynn Cherny] — : July 8, 2020
Links:
Last week, the UK expanded access to its datasets defining the geographical boundaries of 23 million “title extents” in England, Wales, and Scotland. Previously: UK property sales (DIP 2016.03.23). — : July 8, 2020
Links:
Tags: mapping
For the past decade, programming Q&A site StackOverflow has run an annual survey, asking developers about the languages they use, their workplaces, learning goals, salaries, and more. The site provides anonymized, respondent-level data for each survey, including the 2020 edition, which received 64,000+ responses. Between 2016 and 2018, the not-for-profit FreeCodeCamp ran an annual “new coder” survey, which attracted more than 31,000 responses in its most recent year; those datasets are also available to download. [h/t Jason Norwood-Young] — : July 8, 2020
Links:
Tags: business, money, technology
Rutgers University’s Center for American Women and Politics has made public its Women Elected Officials Database, which “represents the most complete collection of information anywhere in the world about women elected officials in the United States.” It covers all “women who have held office at the congressional, statewide elected executive, and state legislative levels nationwide,” going back to the 1890s. For each of the 11,540 officeholders, “the database includes their geographic information, party identification, and race identification where available.” You can explore the data online and also (with a free registration) download it. Previously: Women candidates for the US House, 1972–2010 (DIP 2017.07.26). — : July 8, 2020
Links:
On Monday, the US Treasury Department and Small Business Administration released detailed data on the financial assistance given to businesses through the government’s Paycheck Protection Program. For the 660,000+ loans of at least $150,000, the dataset includes each recipient’s name, address, industry classification, and business type, plus the name of lender, the number of jobs the business said would be supported, loan amount (grouped into several ranges), and more. For aid less than $150,000, the dataset contains similar information, but without names or addresses. Related: Efforts to make the data more accessible are already underway. Simon Willison, for instance, generated a searchable database of all loans of $150,000+, and The Washington Post has published an interactive database of the $1,000,000+ loans. Also related: There are some errors in the data. — : July 8, 2020
Links:
BEHACOM is a dataset that details minute-by-minute usage statistics for 12 Spanish and Italian men “interacting for fifty-five consecutive days with their personal computers in their own way and without restrictions.” — : July 1, 2020
Links:
Tags: technology
To better understand accessibility issues in the NYC subway, the Two Sigma Data Clinic has constructed a series of datasets and diagrams describing each station’s elevator connections — from the street to the station, and between various points and platforms within it. Related: The subway system’s official list of elevators and escalators. Previously: Turnstile data from NYC and Chicago (DIP 2017.02.08). [h/t Erin Stein] — : July 1, 2020
Links:
Tags: transportation
A team of researchers in the UK has been gathering data on COVID-19 “transmission events” that have resulted in clusters of cases, based on various official, scholarly, and news reports. For each of the 250+ events listed so far, the dataset specifies the setting (e.g., “work,” “household,” “religious,” “elderly care”), whether it was indoors or outdoors, geographical location, number of cases involved, and more. The New York Times has also been tracking the location and size of COVID-19 clusters in the US — so far, more than 1,700 clusters with at least 50 cases. The NYT doesn’t provide a download button, but its webpage loads the data from a JSON file. [h/t Kai Kupferschmidt] — : July 1, 2020
Links:
At least three states publish monthly, machine-readable statistics on new voter registrations: Florida, Virginia, and North Carolina. Maryland and the District of Columbia publish similar data as PDF tables. A recent report by the Center for Election Innovation & Research analyzed pandemic-era registrations, using that data plus numbers obtained from Arizona, California, Colorado, Delaware, Georgia, Illinois, and Texas. The analysis found a “steep decline in new registrations,” largely attributable to social distancing measures and DMV closures. To accompany an article on the topic, FiveThirtyEight has compiled CEIR’s counts for those 12 states, for the first five months of 2016 and 2020, into a simple CSV. — : July 1, 2020
Links:
Tags: government, politics
Opportunity Insights’ Economic Tracker “combines anonymized data from leading private companies – from credit card processors to payroll firms – to provide a real-time picture of indicators such as employment rates, consumer spending, and job postings across counties, industries, and income groups.” Small business revenue, for instance, appears to be down about 55% in Boston, compared to the beginning of the year. The project’s GitHub page provides aggregated data corresponding to each of the charts and graphs. Previously: Opportunity Insights’ extensive datasets on economic mobility (DIP 2019.06.12). [h/t Matteo Ferroni + Dan Stein] — : July 1, 2020
Links:
The Digital Atlas of Roman and Medieval Civilizations at Harvard University “makes freely available on the internet the best available materials for [...] mapping and spatial analysis of the Roman and medieval worlds.” The project’s datasets include economic indicators, climate records, ports and harbors, shipwrecks, roads, and more. [h/t Pier Rolla] — : June 24, 2020
Links:
Economist and parenting book–author Emily Oster has been collecting and publishing “preliminary, unscientific data on child care centers which were open in the pandemic.” The dataset currently includes more than 900 centers, their locations, age ranges served, whether they were open the whole time or just part of it, the number of students and staff, and the number of COVID-19 cases in students and staff. [h/t Laura Libby] — : June 24, 2020
Links:
Mapping the Gay Guides “aims to understand often ignored queer geographies using the Damron Address Books, an early but longstanding travel guide aimed at gay men since the early 1960s.” The project — launched earlier this year and led by historians Amanda Regan and Eric Gonzaba — includes an interactive map, downloadable dataset, methodology, and ethics statement. It already covers more than 22,000 entries in more than 30 states, with plans for further expansion. [h/t Giuseppe Sollazzo] — : June 24, 2020
Links:
Native-Land.ca maps the historical geographic extents of Indigenous territories in North America, Australia, New Zealand, parts of South America, and elsewhere. The project’s datasets cover 1,400+ territories, 900+ languages, and 800+ related treaties, drawn from a wide range of resources. It was launched in 2015 by programmer Victor Temprano and is now run by a not-for-profit organization with a board of directors. Related: Earlier this year, High Country News published Land-Grab Universities, a “two-year inquiry into the origin of wealth that undergirds the nation’s system of higher education,” accompanied by a detailed methodology and dataset spanning nearly 11 million acres of expropriated Indigenous land. — : June 24, 2020
Links:
The National Conference of State Legislatures has built a database of state-level policing bills and executive orders introduced since May 25, the day George Floyd was killed. The database covers proposed bills on “oversight and data, training, standards and certification, use of force, technology, policing alternatives and collaboration, [...] and other timely issues.” So far, it contains brief descriptions, author information, statuses, and last-activity dates for more than 250 pieces of legislation in 25 states and DC. Although the database does not provide downloads, its HTML output is highly structured. Related: Last week, The Marshall Project’s Weihua Li, Humera Lodhi, and Damini Sharma analyzed the database and put the bills in context. — : June 24, 2020
Links:
Data educator Allison Horst earlier this month released a dataset describing the physical characteristics of 344 Antarctic penguins, derived from data collected by marine biologist Kristen Gorman and the Palmer Station. With palmerpenguins, Horst aims to provide a data-exploration alternative to the ubiquitous iris dataset, which was first published, in 1936, in the Annals of Eugenics. Previously: Thousands of penguins (DIP 2020.03.11). [h/t Alex Cookson] — : June 17, 2020
Links:
Tags: animals
StereoSet aims to measure biases toward stereotypes — as they relate to profession, gender, race, and religion — in statistical language models, using a dataset containing thousands of sentences, each with several variations. Through the project’s online data explorer, you can examine the sentences and see how some popular language models perform. [h/t Michael McLaughlin] — : June 17, 2020
Links:
The New York Public Library has digitized more than 20 volumes of the Green Book — a series of travelers’ guides, published from the 1930s to 1960s by US postal worker Victor Green, that listed hotels, restaurants, gas stations, and other establishments where Black visitors would be safe and welcome. The library has converted its digital copies into semi-structured text and turned the 1947 edition into a fully-structured dataset. The University of South Carolina has built a dataset of the 1,500+ listings in the 1956 edition. In 2015, NYPL Labs combined both years’ datasets into a tool to map the locations and plan routes with them. — : June 17, 2020
Links:
CountLove.org has been using news reports to quantify protest events in the US since 2017. Its downloadable dataset currently contains more than 27,000 events and provides each event’s date, location, and approximate number of attendees. The events are also tagged with one or more topics, such as civil rights, healthcare, for racial justice, and against regulation. The site’s search/mapping tool lets you filter by those tags, and also for “curated protest data for a more compassionate country.” Related: Tommy Leung and Nathan Perkins describe how they built the project. [h/t Audra Burch et al.] — : June 17, 2020
Links:
For several years, the Southern Poverty Law Center’s “Whose Heritage?” project has been gathering and mapping information on “public symbols of the Confederacy,” such as monuments, place names, official holidays, commemorative license plates, and municipal seals. For each of the 1,800+ entries, the project’s dataset indicates the type of monument/symbol, its location, sponsor, year dedicated, and (if applicable) year removed. [h/t Gita Jackson + Dan Brady] — : June 17, 2020
Links:
The Plug, a news site that focuses on the Black innovation economy, has been assembling a dataset of statements made by tech companies on racial justice, Black Lives Matter, and George Floyd. The dataset links to more than 200 statements so far and includes each company’s name, the timing of the statement, and other relevant context, such as the URL of their most recent diversity report and the percentage of employees and/or leaders who identify as Black. [h/t Sherrell Dorsey] — : June 10, 2020
Links:
Tags: justice, race, technology
The US Defense Logistics Agency’s Law Enforcement Support Office sends surplus Department of Defense equipment to local law enforcement agencies, through an arrangement known as the 1033 program. The Pentagon publishes quarterly updates of the equipment transferred — which can range from coffee makers to rifles to entire aircraft — but only began doing so after the program came under intense scrutiny for its role in the militarized police response to the 2014 Ferguson protests and police militarization in general. Related: Despite the criticisms, the 1033 program has sent police departments hundreds of millions of dollars in military equipment since Ferguson, including more than 490 mine-resistant vehicles, my colleague John Templon reports. — : June 10, 2020
Links:
Campaign Zero’s 8cantwait.org advocates for eight specific reforms to curtail police officers’ use of force, including bans on chokeholds, requiring de-escalation tactics, and establishing comprehensive reporting. The organization has compiled the use-of-force policies for 100 large police departments, and determined whether they’ve instituted these reforms. (Direct CSV download here.) And at checkthepolice.org, the organization has gathered police union contracts in major cities and assessed whether they contain language that makes it harder to hold officers accountable for misconduct. [h/t Samuel Sinyangwe] — : June 10, 2020
Links:
Several groups are collecting examples of disproportionate police responses to the protests against police brutality. One such collection, spurred by criminal defense attorney T. Greg Doucette, has compiled hundreds of instances of “unnecessary violence by law enforcement officers against civilians”; for each one, the underlying spreadsheet provides the city and state of the incident, links to visual documentation on Twitter and YouTube, and a short description. Another collection, which emerged from the /r/2020PoliceBrutality subreddit and seeks “to accumulate and contextualize evidence of police brutality during the 2020 George Floyd protests,” also includes incident dates, an interactive map, and APIs. Related: Bellingcat and The Guardian have compiled a spreadsheet of 140+ “reports of arrests, violence, and intimidation against journalists” at the protests. [h/t anjakefala + Aric Toler] — : June 10, 2020
Links:
An unnamed geospatial analyst has been mapping “every town or city I can find where a George Floyd / Black Lives Matter protest, action, or vigil has occurred since May 25” — more than 2,600 so far. The data files powering the map include each city’s name, state/region, country, coordinates, and the date it was added. Related: For a study of the relationship between police-caused deaths and demonstrations, Williamson et al. built a dataset of Black Lives Matter protests in 2014 and 2015. — : June 10, 2020
Links:
A team of researchers has built a dataset that characterizes 29,622 samples of shells from 7,894 water-dwelling species. You can search the shell images online and also download the full dataset. — : May 27, 2020
Links:
Tags: animals, miscellaneous
From 1919 to 1941, Sylvia Beach ran Shakespeare and Company, the legendary Paris bookstore. It featured a lending library, whose members included writers such as Gertrude Stein, James Joyce, Simone de Beauvoir, and Ernest Hemingway. Princeton University’s Shakespeare and Company Project has digitized hundreds of the library’s lending cards and logbooks, and has made the data available to explore and download. [h/t Tom Merritt-Smith] — : May 27, 2020
Links:
Tags: books
The website ranked.vote, built by quantitative analyst Paul Butler, standardizes and visualizes the detailed results of a few dozen elections — in Maine, San Francisco, Santa Fe, and Burlington — that have used ranked-choice voting, where voters can list their preferred candidates in sequential order. — : May 27, 2020
Links:
Tags: elections
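For readers new to ranked-choice voting, here is a minimal sketch of one common way such ballots are tallied, usually called instant-runoff counting. It is an illustration only, not ranked.vote's code or any jurisdiction's official rules: the function name, the sample ballots, and the tie handling (dropping every candidate tied for last place at once) are assumptions made for this example.

```python
from collections import Counter


def instant_runoff(ballots):
    """Return the instant-runoff winner, or None if the count cannot be resolved.

    `ballots` is a list of lists; each inner list names candidates in the
    voter's order of preference. (Hypothetical input format.)
    """
    candidates = {name for ballot in ballots for name in ballot}
    while candidates:
        # Count each ballot toward its highest-ranked candidate still in the race.
        tally = Counter()
        for ballot in ballots:
            for choice in ballot:
                if choice in candidates:
                    tally[choice] += 1
                    break
        total = sum(tally.values())
        if total == 0:
            return None
        leader, votes = tally.most_common(1)[0]
        if votes * 2 > total:  # strict majority of the ballots still counting
            return leader
        # Simplification: eliminate every candidate tied for the fewest votes.
        fewest = min(tally.get(name, 0) for name in candidates)
        losers = {name for name in candidates if tally.get(name, 0) == fewest}
        if losers == candidates:
            return None  # everyone is tied; real rules specify a tie-breaker
        candidates -= losers
    return None


if __name__ == "__main__":
    sample = [["A", "B"], ["A", "C"], ["B", "C"], ["B", "A"], ["C", "B"]]
    print(instant_runoff(sample))
```

In the sample, no candidate has a majority in the first round, so the last-place candidate is eliminated and that ballot moves to the voter's next choice.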
Last week’s newsletter featured the data and code that The Economist is using to estimate excess deaths due to COVID-19. Also last week: The New York Times began publishing the data behind its similar-but-different estimates. And earlier this month, the team at the Human Mortality Database launched its Short-term Mortality Fluctuations data series, which provides “user-friendly access to detailed data on mortality by week, sex, and aggregated age group” for more than a dozen countries. [h/t Esteban Ortiz-Ospina] — : May 27, 2020
Links:
As it has in the past, the US Census Bureau is encouraging residents to respond to the 2020 Census’s mailed questionnaire, which reduces the need for in-person census-taking. The agency calculates these “self-response” rates all the way down to individual Census tracts, and provides that data as a CSV and via its API. Those datasets, however, only represent the latest numbers, so researchers at CUNY’s Center for Urban Research have been creating daily snapshots, which they’re also mapping and analyzing for the public. Related: The Center’s FAQ for the data. [h/t Steven Romalewski] — : May 27, 2020
Links:
Tags: mapping, statistics
At the Circulating Library “offers a biographical and bibliography database of nineteenth-century British fiction.” Launched by literature professor Troy J. Bassett in 2007, the searchable, browseable, downloadable database now contains information on more than 19,000 titles by more than 4,000 authors. — : May 20, 2020
Links:
Tags: books, historical
The Global Dataset of Historical Yields combines data from agricultural censuses and satellite sensors to estimate the annual yields for four major crops — maize, rice, wheat, and soybean — from 1981 to 2016, for each 0.5-degree square on the planet. Related: The dataset’s authors describe their methodology and the latest update. — : May 20, 2020
Links:
Tags: agriculture, historical
The US Department of Agriculture’s National Forest Type Dataset shows the geographic distribution of the country’s “forest types” — defined as “logical ecological groupings of species mixes.” (Examples include “deciduous oak woodland” and “subalpine fir.”) To estimate the extent of each forest type, the dataset’s developers combined satellite imagery with “nearly 100 other geospatial data layers, including elevation, slope, aspect, and ecoregions.” Related: The Washington Post has used the dataset to map fall foliage and forests where Christmas-y trees grow. [h/t Joe Fox] — : May 20, 2020
Links:
Tags: agriculture, environment
The Design of Trade Agreements database collects information about customs unions, free trade agreements, and other similar treaties signed between 1948 and 2018. It currently includes more than 800 agreements, plus additional negotiations, accessions, withdrawals, consolidations, and amendments. For each agreement, the database indicates its name, member countries, year of signature, and a number of policy-specific variables. [h/t Erik Gahner Larsen] — : May 20, 2020
Links:
The Economist has published the data behind its estimates of excess deaths due to COVID-19. The data repository currently covers 20 countries; it provides recent weekly/monthly death totals, officially-counted COVID-19 deaths, and average historical death totals for the same time periods. Related: Data journalist James Tozer’s introductory Twitter thread. [h/t Sharon Machlis] — : May 20, 2020
Links:
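As a rough illustration of the arithmetic behind an excess-deaths figure like the one in The Economist entry above, the sketch below subtracts a historical average from the deaths observed in a period, then compares the result with the officially counted COVID-19 deaths. The class, the field names, and the numbers are invented for this example and are not taken from The Economist's repository, whose own method may differ.

```python
from dataclasses import dataclass


@dataclass
class Period:
    label: str              # hypothetical period label, e.g. a week
    total_deaths: int       # deaths from all causes observed in the period
    expected_deaths: float  # average historical deaths for the same period
    covid_deaths: int       # officially counted COVID-19 deaths


def excess_deaths(period: Period) -> float:
    """Deaths observed above the historical average for the period."""
    return period.total_deaths - period.expected_deaths


def excess_not_attributed(period: Period) -> float:
    """Excess deaths not accounted for by the official COVID-19 count."""
    return excess_deaths(period) - period.covid_deaths


if __name__ == "__main__":
    # Made-up numbers, for illustration only.
    week = Period("example week", total_deaths=1200,
                  expected_deaths=1000.0, covid_deaths=150)
    print(week.label, excess_deaths(week), excess_not_attributed(week))
```

A positive second figure suggests deaths beyond the official COVID-19 tally; a negative one means the period had fewer excess deaths than officially counted COVID-19 deaths.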
Iowa’s Alcoholic Beverages Division publishes itemized data on all liquor sales by grocery/liquor/convenience-type stores since January 2002 — more than 18 million purchases in all. For each purchase, the dataset includes the specific kind, amount, and cost of the liquor, plus the date and location of the sale. Previously: State liquor prices (DIP 2019.08.07). [h/t Martin Burch] — : May 13, 2020
Links:
Computer science graduate student Victor Kristof has built a dataset of 450,000 legislative edits proposed by members of the EU parliament from 2009 to 2019. For each proposed edit, the dataset points to the relevant legislation, names the parliamentarian, and indicates whether the edit was accepted. In an accompanying academic paper, Kristof et al. describe the dataset’s construction and “propose a model for predicting the success of such edits.” — : May 13, 2020
Links:
Tags: government
Since the passage of the Wilderness Act in 1964, the US federal government has established 803 official wilderness areas, which cover more than 111 million acres. They’re all part of the National Wilderness Preservation System, co-managed by the National Park Service, Bureau of Land Management, Fish and Wildlife Service, and Forest Service. The University of Montana’s Wilderness.net brings data about each of the agencies’ wildernesses into one place, where it can be explored via an interactive map and downloaded in bulk. For each wilderness, the dataset provides its name, description, boundaries, acreage, year designated, and more. — : May 13, 2020
Links:
Tags: environment
The nonprofit Good Jobs First has launched COVID Stimulus Watch, which is gathering data on the grants and loans that have been provided to corporations through the $2 trillion CARES Act. So far, the dataset contains more than 5,700 awards, totalling $54 billion. The records are based on information from public financial filings, press releases, and, most recently, data from the government’s healthcare-focused Provider Relief Fund. — : May 13, 2020
Links:
The UN International Labour Organization’s Employment Protection Legislation Database tracks “legal information on the regulation of temporary contracts and employment termination at the initiative of the employer,” on a national-regulatory level. The database quantifies and categorizes regulations under nine “themes” — including grounds for dismissal, procedures for collective dismissals, severance pay, and redress — and for more than 100 countries. — : May 13, 2020
Links:
Developer advocate Amy Boyd has assembled a dataset of 96 contestants who appeared on the British dating reality series Love Island from 2016 to 2019, including each contestant’s name, age, profession, and various metrics of success in the competition. [h/t Nick Latocha] — : May 6, 2020
Links:
Tags: entertainment, television
The Planetary Science Budget Dataset “integrates the spending history, by year, of every NASA planetary science mission and related activities.” It also calculates funding by destination (e.g., Mars, Venus, the moon, etc.), and annual budget breakdowns for individual missions. The dataset is maintained by The Planetary Society, a nonprofit founded 40 years ago by Carl Sagan, Louis Friedman, and Bruce Murray — and now led by Bill Nye. [h/t Ingrid Burrington] — : May 6, 2020
Links:
Tags: science, technology
New York City’s Department of Buildings has been publishing a dataset and interactive map of construction sites deemed “essential,” and thus eligible to continue work during the coronavirus pandemic. The current list covers more than 6,700 sites, including more than 1,200 at schools, 1,000 for public housing, and 250 related to health care. [h/t Josh Laurito] — : May 6, 2020
Links:
The Marshall Project has been tracking COVID-19 cases and deaths in US prisons. To compile that dataset, every week the publication’s reporters ask each state prison system and the federal Bureau of Prisons “for the total number of coronavirus tests administered to its staff members and prisoners, the cumulative number who tested positive among staff and prisoners, and the numbers of deaths for each group.” — : May 6, 2020
Links:
Last week, the COVID Tracking Project (DIP 2020.03.18) launched a beta release of its COVID Racial Data Tracker, built in collaboration with American University’s Antiracist Research & Policy Center. The tracker collects demographic statistics, now published by most states, that disaggregate the number of COVID-19 cases and deaths by race and/or ethnicity. “This is a challenging dataset to compile and code, and our data sources remain in flux, but we offer this beta release for full transparency,” the project’s organizers write; they’re seeking feedback “as we work toward building a complete dataset and getting it into our API and onto the website.” — : May 6, 2020
Links:
PoKi is “a corpus of 61,330 poems written by children from grades 1 to 12,” scraped with permission from a Scholastic website. The dataset includes each poem’s title, text, and character count, plus the author’s first name and grade. Noteworthy: “PoKi is made freely available for research with the condition that the research be used for the benefit of children.” — : April 29, 2020
Links:
Since the 1950s, the US Census Bureau has conducted monthly surveys of retail and food-services industries. The results — which estimate sales and inventory numbers by subsector — are available as machine-readable data going back to 1992. The next release is scheduled for May 15. [h/t Giuseppe Sollazzo] — : April 29, 2020
Links:
Tech companies have repurposed some of the data they collect from you into explorable, downloadable datasets that estimate the degree to which movement patterns have (or haven’t) changed in recent months. Among them: Apple, which is quantifying requests for directions; Google, which is counting visits to places such as grocery stores and transit stations; and Descartes Labs, which is tracking smartphone movements. Related: Sociologist Kieran Healy recently found and explained a curious February 17 spike in Apple’s data. [h/t Hillary Hartley] — : April 29, 2020
Links:
Preprints — academic papers published online before they’ve gone through traditional peer review — have become a common way for scientists to disseminate their coronavirus-related findings. So researchers Nicholas Fraser and Bianca Kramer have begun compiling a dataset of more than 6,000 COVID-19 preprints. For each paper, the dataset includes the title, abstract, DOI, date posted, and the hosting repository (such as medRxiv, the most common so far). — : April 29, 2020
Links:
ParlSpeech V2 contains 6.3 million parliamentary speeches from nine countries: Austria, the Czech Republic, Germany, Denmark, the Netherlands, New Zealand, Spain, Sweden, and the United Kingdom. The dataset, created by political scientists Christian Rauh and Jan Schwalbach, includes the full text of each speech, plus the date, speaker, and the speaker’s party. Related: Roll call votes from the European Parliament’s first six terms (1979–2009). [h/t Robert Stelzle] — : April 29, 2020
Links:
Tags: government, history
Data scientist Elle O’Brien recently described how she built and cleaned a dataset of the moral dilemmas posted to r/AmItheAsshole, “a semi-structured online forum that’s the internet’s closest approximation of a judicial system.” For each of the 97,628 posts collected, the dataset includes the title, body, date, number of Reddit upvotes, and number of comments — plus the community’s verdict. [h/t u/thumbsdrivesmecrazy] — : April 22, 2020
Links:
Urban planner Meli Harvey has taken New York City’s official dataset of sidewalks and dissected its geometries to map the width of each segment of walkable pavement. [h/t Dan Brady] — : April 22, 2020
Links:
“Building on years of painstaking work alongside our Syrian and international partners,” the Global Public Policy Institute “has compiled the most comprehensive dataset of incidents of chemical weapons use in Syria to date.” The institute has published a new interactive data portal to display the data on 345 attacks between 2012 and 2019. The data fields include the date and time of the attack, location, chemical agent, munition type and method of delivery, perpetrator, confidence rating, and more. [h/t Tobias Schneider] — : April 22, 2020
Links:
Programmer Alec Barrett has built a spreadsheet listing every question asked on every decennial US Census since 1790 — more than 900 items overall. In addition to the questions themselves, the dataset describes the subgroups of people questioned and the types of answers expected. Related: The dataset powers “The Evolution of the American Census,” Barrett’s interactive exploration of how the questions have changed over time, and what they say about America. — : April 22, 2020
Links:
Tags: history, statistics
Economics graduate student Lukas Lehner has gathered, with help from the hive-mind, a list of dozens of websites and datasets tracking policy responses to the COVID-19 pandemic – a few that have been mentioned in DIP, plus many that haven’t. Lehner’s tracker-tracker groups the resources into several topic areas, summarizes them, and indicates their data formats. [h/t François Briatte] — : April 22, 2020
Links:
VillagerDB is an online and downloadable database of thousands of the characters, accessories, tools, and other items in Animal Crossing: New Horizons — the Sims-like game that’s breaking sales records and saving my colleague’s marriage. [h/t Justin] — : April 15, 2020
Links:
New York City’s Department of Sanitation publishes a dataset of the monthly tonnage it collects from city residences, going back to 1990. For each community district and month, the dataset tallies the weight of household trash, recyclables, “organics,” and more — including, for each January, the tonnage of Christmas trees collected. Related: Reporters at The City recently analyzed the dataset, and found potential evidence of Manhattanites fleeing the city in March. They also interviewed the Department of Sanitation’s anthropologist-in-residence. — : April 15, 2020
Links:
Tags: environment, history
Researchers at Imperial College London and Oxford have compiled a dataset of more than 4,000 privacy policies published on major US companies’ websites. For each policy, the dataset includes its full text, counts the number of words and paragraphs in it, calculates a readability metric, and reports the number of third-party tracking cookies loaded through the policy’s webpage. [h/t Antione Uettwiller] — : April 15, 2020
Links:
Tags: business, law, technology
A team of academics has begun publishing a series of “exposure indices” based on smartphone “pings” collected by a location-analytics company. One set of indices describes the proportion of smartphones that, on a given day in a given US state or county, were observed at least once in another given state or county during the previous 14 days. Another tries to quantify how often smartphones are observed in the same “commercial venues” as other devices. “We are making these indices publicly available to all researchers in the context of the spread of COVID-19,” the researchers write. — : April 15, 2020
Links:
Nearly every country publishes the number of residents who’ve tested positive for COVID-19. Far fewer post data on the total number of people tested (or tests performed). But as more countries start doing that, the team at Our World In Data has begun compiling a regularly-updated dataset on COVID-19 testing, along with key context, a detailed methodology, and interactive charts. — : April 15, 2020
Links:
Software engineer Bohdan Turkynewych has compiled a dataset of more than 50 million domain names, discovered by crawling the web and processing 86 terabytes of internet traffic. Related: French registry Afnic publishes a full list of all registered .fr domains. [h/t Javier Sáenz] — : April 8, 2020
Links:
Tags: technology
Last month, the Center for the Study of the Drone at Bard College released the third edition of its Public Safety Drones report, in which researchers identified “1,578 state and local police, sheriff, fire, and emergency services agencies in the U.S. that are believed to have acquired drones.” You can explore and download the data through the report’s interactive map. — : April 8, 2020
Links:
Journalism professor A.Jay Wagner has compiled a dataset of 14 federal agencies’ Freedom of Information Act annual reports from 1975 through 2018. For each agency and year, the dataset contains the number of FOIA requests processed, granted, and denied; exemptions invoked; appeal outcomes; staffing figures; fees assessed; and more. [h/t George LeVines] — : April 8, 2020
Links:
Tags: government, journalism
The Long-Term Productivity database was created at the Bank of France in 2013, and has been updated several times since then. For 23 OECD countries from 1890 to 2018, the database tracks total factor productivity, labor productivity, capital intensity, GDP per capita, and more. [h/t Mokhtar Tabari] — : April 8, 2020
Links:
A group of researchers have built a structured dataset quantifying 166 countries’ economic responses to the COVID-19 pandemic. The dataset draws mainly from the International Monetary Fund’s COVID-19 Policy Tracker — which describes the policies in free-form text — and supplements it with additional research. The quantified policies include fiscal stimuli, monetary stimuli, interest-rate cuts, and interventions to control the countries’ balance of payments. [h/t u/smurfyjenkins] — : April 8, 2020
Links:
The National Library of Scotland has digitized the first eight editions of the Encyclopaedia Britannica, issued between 1768 and 1860. The effort, which captured approximately 167 million words across 143 volumes, was named a runner-up in the 2019 Digital Humanities Awards. — : April 1, 2020
Links:
In light of ongoing debates over encryption, tech publication Motherboard “collected and analyzed over 500 iPhone search warrants and related documents filed throughout 2019 to build a database of cases in which law enforcement attempted to get information from an iPhone.” The database, published last month, includes court-docket information, the requesting agency, the suspected crimes, phone models, and more. [h/t Colin Prince] — : April 1, 2020
Links:
The Global Visa Cost Dataset, published by the European University Institute’s Migration Policy Centre, “reports the cost [in 2019] of country-to-country visas for tourism, student, business, work, transit, family reunification and other motives worldwide.” The dataset’s authors describe their methodology and findings in a detailed working paper. There is, they write, “a fundamentally paradoxical situation: The richer a country, the less its citizens pay for visas to go abroad.” — : April 1, 2020
Links:
Tags: immigration
The Oxford COVID-19 Government Response Tracker “aims to track and compare government responses to the coronavirus outbreak worldwide rigorously and consistently.” The project posts daily information concerning 11 kinds of responses, ranging from travel bans to investment in vaccines, and is based on data “collected from public sources by a team of dozens of Oxford University students and staff from every part of the world.” One of those public sources is UNESCO, which has been tracking countries’ school closure policies over time. In the US, the Kaiser Family Foundation is keeping tabs on the current mandates issued by state governors (rather than legislators or local governments). — : April 1, 2020
Links:
The New York Times is conducting “a round-the-clock effort to tally every known coronavirus case in the United States,” and has begun publishing a dataset of county-level cases and deaths. France publishes spreadsheets of COVID-19 hospitalizations and deaths by département and sex; on GitHub, there’s an active effort to collect and standardize this dataset and others from France. The German government has a dashboard showing official case counts in each of its Bundesländer and Landkreise; software engineer Jan-Philip Gehrcke has been pulling that data into standardized CSV files. Spain publishes daily case, hospitalization, death, and recovery counts for each comunidad autónoma, as a dashboard with downloadable data; investigative outlet Datadista has been compiling and standardizing that dataset and similar ones. — : April 1, 2020
Links:
BusinessFinancing.co.uk attempted to identify the oldest, still-operating companies in every country in the world, and put its findings into a spreadsheet. [h/t Giuseppe Sollazzo + Iman Ghosh] — : March 25, 2020
Links:
The Marshall Project and Slate have partnered to conduct a survey of incarcerated people’s political views. They’ve received more than 8,000 responses so far, and have made the (anonymized) data available to download. — : March 25, 2020
Links:
The American Association of State Highway and Transportation Officials’ Census Transportation Planning Products initiative “produces special tabulations of American Community Survey (ACS) data that have enhanced value for transportation planning, analysis, and strategic direction,” such as commute times, carpooling, bicycle usage, available vehicles, and more. The program’s core dataset “consists of almost 200 residence-based tables, 115 workplace-based tables and 39 flow tables (home to work) for over 325,000 geographies.” [h/t Adrienne Heller] — : March 25, 2020
Links:
Tags: transportation
ACAPS, a humanitarian analysis nonprofit, is cataloguing government measures implemented in response to the pandemic. The measures range from quarantine policies (the most common) to electronic surveillance and the lockdown of refugee camps. As of Tuesday evening, the dataset contained 1,741 entries; each entry includes the date implemented, source links, and descriptive comments. Computational social scientist Rex Douglass, meanwhile, is crowdsourcing a dataset of mandatory government restrictions, such as bans on large gatherings, restaurant shutdowns, and shelter-in-place decrees. It has fewer entries so far, but more detail on US cities and states. [h/t Nishanth Arulappan] — : March 25, 2020
Links:
The European Centre for Disease Prevention and Control has been publishing a COVID-19 dashboard and daily-updated data files tracking country-level cases and deaths. Related: Our World In Data explains why the publication switched from using the World Health Organization’s situation reports to the ECDC’s data for its analyses and graphics. Also related: USAFacts is using data from state governments, local public health agencies, and Johns Hopkins University (DIP 2020.03.11) to map US cases at the county level, with the underlying data available to download. [h/t YY Ahn + Big Local News] — : March 25, 2020
Links:
Between 2002 and 2004, professors Ray Fisman and Sheena Iyengar ran a series of speed dating events for Columbia University graduate students, while collecting detailed data on the participants and results. The full dataset is available to download; data scientist Keith McNulty has also created a simplified version of it. — : March 18, 2020
Links:
Tags: miscellaneous
The Global Health Security Index is “the first comprehensive assessment and benchmarking of health security and related capabilities across the 195 countries” that signed on to the WHO’s 2005 International Health Regulations. The index is built on a “framework of 140 questions, organized across 6 categories, 34 indicators, and 85 subindicators to assess a country’s capability to prevent and mitigate epidemics and pandemics.” The first edition of the index was released this past October, and can be downloaded as a macro-enabled Excel spreadsheet. The National Health Security Preparedness Index aims to do something similar, but for US states; the 2019 results are also available as a spreadsheet. [h/t Big Local News] — : March 18, 2020
Links:
Tags: healthcare
Stayinghome.club is compiling hundreds of coronavirus-spurred work-from-home policies, university announcements, and event cancellation statuses. The project is collaboratively edited on GitHub, and includes instructions for how to add your company/university/event. [h/t Jackie Kazil] — : March 18, 2020
Links:
Tags: miscellaneous
The COVID-19 Open Research Dataset is “a free resource of over 29,000 scholarly articles, including over 13,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.” The dataset, produced by a collaboration of several research groups and the National Library of Medicine, “will be updated weekly.” Related: On Monday, the White House issued a “call to action to the tech community” regarding the dataset, asking experts “to develop new text and data mining techniques that can help the science community answer high-priority scientific questions related to COVID-19.” — : March 18, 2020
Links:
The COVID Tracking Project “collects information from 50 US states, the District of Columbia, and 5 other U.S. territories to provide the most comprehensive testing data we can collect for the novel coronavirus, SARS-CoV-2.” The project — a collaboration between The Atlantic, data scientist Jeff Hammerbacher, and a growing team of volunteers — “attempt[s] to include positive and negative results, pending tests, and total people tested for each state or district currently reporting that data.” (Unfortunately, not all states are reporting each of those numbers, and private-lab testing also complicates the picture.) You can access the data online, through an API, on GitHub, and via Twitter. Related: How to Understand Your State’s Coronavirus Numbers (The Atlantic, March 12). — : March 18, 2020
Links:
The MAPPPD project — Mapping Application for Penguin Populations and Projected Dynamics — “aims to deliver open access penguin population data for the Antarctic continent, and occupancy probabilities for flying birds around the Antarctic Peninsula.” The data, last updated about a year ago, can be both browsed online and downloaded. [h/t Michael Polito] — : March 11, 2020
Links:
Tags: animals
Applied mathematics PhD student Thomas Camminady has built a spreadsheet of all Tour de France riders since 1903 (including name, team, time taken, and final ranking), based on the competition’s official results page. — : March 11, 2020
Links:
Tags: sports
San Francisco International Airport’s museum team has been collecting information on all flights in and out of SFO’s terminals. Online, you can browse and search the data, which includes details about the airlines, flight numbers, gates of arrival and departure, and more. Earlier this year, the team published a downloadable database of 769,250 flights from 2019. (Or 1.2 million flights, if you count codeshares.) [h/t Simon Batistoni] — : March 11, 2020
Links:
Tags: transportation
The Global Party Survey is “an international scientific study, directed by Pippa Norris, designed to compare political parties around the world. Drawing on survey data gathered from 1,861 party and election experts, the study uses 21 core items to estimate key ideological values, issue positions, and populist rhetoric for 1,127 parties in 170 countries.” Last month, Norris released the dataset and a paper that uses it to measure “populism as a global phenomenon.” Previously: The Manifesto Project (DIP 2017.06.21) and Party Facts (DIP 2019.01.16). — : March 11, 2020
Links:
Tags: politics
Last month, DIP featured coronavirus case counts mapped by researchers at Johns Hopkins. Since then, efforts to collect and publish COVID-19 data have grown, including: the Johns Hopkins team has moved its data repository to GitHub ... the Open COVID-19 Data Curation Group has expanded its data on individual cases ... the Italian government is publishing local case and test counts on GitHub ... Princeton PhD student Sang Woo Park is building a detailed dataset of cases in South Korea ... the decade-old Global Initiative on Sharing All Influenza Data is sharing COVID-19 genome samples and mutations ... while the collaboratively-edited CoronavirusTechHandbook.com is pointing to additional datasets and data-trackers. Related: My colleague Peter Aldhous is using the Johns Hopkins data to publish clear, concise graphics tracking known case counts globally and in the United States. [h/t John Emerson + Hannah Nam + Bruno Salzano + illo + Lam Thuy Vo + Mago Torres] — : March 11, 2020
Links:
Danish zoologist Anders Pape Møller counted the number of insects killed on the windscreen of a single car after each of 1,375 journeys along the same stretch of road between 1997 and 2017. After accounting for time of day, weather, and other factors, Møller says his data suggests an 80% decline in flying insects during that time. Related: The Guardian on two bug-splat studies, including Møller’s. [h/t Laura Norén and Brad Stenger] — : March 4, 2020
Links:
“We present the Tesco Grocery 1.0 dataset: a record of 420 M food items purchased by 1.6 M fidelity card owners who shopped at the 411 Tesco stores in Greater London over the course of the entire year of 2015, aggregated at the level of census areas to preserve anonymity,” researchers announced in a recent academic paper. For each area, the dataset contains “the number of transactions and nutritional properties [...] including the average caloric intake and the composition of nutrients.” [h/t Luca Maria Aiello] — : March 4, 2020
Links:
Tags: food
“For the first time in its 174-year history, the Smithsonian has released 2.8 million high-resolution two- and three-dimensional images from across its collections onto an open access online platform for patrons to peruse and download free of charge,” the institution announced last week. The new platform includes “data and material from all 19 Smithsonian museums, nine research centers, libraries, archives and the National Zoo”; the records are accessible via an API, and the metadata is also available on GitHub. Related: KaoKore, a dataset of 5,552 face images cropped from Japanese artworks at several (non-Smithsonian) institutions. [h/t Erin Petenko + Corin Faife] — : March 4, 2020
Links:
Tags: art
Reporters at The Markup, a newly launched newsroom that “investigates how powerful institutions are using technology to change our society,” subscribed to receive emails from more than 200 presidential candidates, advocacy organizations, and other political groups. Four months later, they had received more than 5,000 messages, which they used to examine Gmail’s treatment of political communications. On GitHub, they’ve published the emails, relevant code, and a cleaned-up dataset. Related: Last year, FiveThirtyEight signed up for emails from the Democratic presidential campaigns; by August, they had collected 830 messages, which they published and used to see who was talking most about Donald Trump. — : March 4, 2020
Links:
The Climate Policy Database, a project of the NewClimate Institute, has collected data on more than 3,800 regulations, subsidies, and other policies related to climate change mitigation. The downloadable and browseable database, drawn from more than a dozen sources, includes policies from nearly every country in the world and lists the policies’ names, jurisdictions, years of enactment, general objectives, and more. [h/t Erik Gahner Larsen] — : March 4, 2020
Links:
Tags: climate
Historical criminologist Katherine Roscoe has transcribed archival records to create a detailed dataset of more than 2,500 people imprisoned between 1847 and 1869 at Cockatoo Island Prison — “Sydney’s most notorious 19th century prison,” now a UNESCO World Heritage Site. — : February 26, 2020
Links:
The CEPS EurLex dataset contains more than 142,000 European Union regulations, directives, and official decisions — “almost the entire corpus of the EU’s legally binding acts passed between 1952 - 2019.” The dataset contains two dozen variables, including dates, subject matter, authors, various links, and the full text of most laws. The information comes, ultimately, from the EU’s online legal repository, eur-lex.europa.eu. [h/t Moritz Laurer] — : February 26, 2020
Links:
Tags: law
Since 2017, the U.S. Press Freedom Tracker has collected information on more than 400 incidents targeting journalists in the United States, such as arrests, attacks, and denials of access. The initiative, led by the Freedom of the Press Foundation and the Committee to Protect Journalists, provides bulk downloads of the data, plus an API. Related: The project also maintains a spreadsheet of anti-press tweets by Donald Trump. Previously: CPJ’s database of journalists who’ve been killed for reasons related to their work (DIP 2019.01.16). [h/t Sid Rao] — : February 26, 2020
Links:
Tags: journalism
The Collective Actions In Tech project database aims to document every instance of tech-industry workers banding together to raise awareness of a shared cause. The database, which is developed collaboratively and available to download, so far contains more than 200 protests, strikes, union drives, legal actions, and open letters. (The earliest event: In 1979, IBM workers protested their company’s business with apartheid South Africa.) Related: Writing for The Guardian, two of the project’s organizers describe “our eight most important insights.” [h/t anjakefala] — : February 26, 2020
Links:
The Open Contracting Data Standard is a “free, non-proprietary open data standard” that makes it “easier to share, compare and analyze” the contracts that governments award to bidders. The project has been gaining traction, with dozens of local and federal governments using the standard to publish detailed official data, including in the UK, Canada, Colombia, Nepal, Uganda, and Afghanistan. — : February 26, 2020
Links:
Tags: business, government
The crowdsourced website pinballmap.com provides data on more than 25,500 pinball machines at more than 7,400 locations in the US, UK, Canada, Australia, Finland, and Japan. The website’s API lets you access the underlying data, including the specific machines available at each location. — : February 19, 2020
Links:
Tags: entertainment, games
The Rural Household Multiple Indicator Survey (RHoMIS) “collects information on 758 variables covering household demographics, farm area, crops grown and their production, livestock holdings” and more. In an academic article published this month, researchers from the Nairobi-based International Livestock Research Institute presented a dataset of responses collected from 13,310 farm households in 21 countries in sub-Saharan Africa, Central America, and Asia between 2015 and early 2018. — : February 19, 2020
Links:
Tags: agriculture
To help study “the imperial roots of global trade,” a trio of economists have built a dataset of 168 historical empires. For each empire, the dataset lists the modern-day countries under rule (and during which years), plus whether the empire had a centralized administration, centralized religion, and/or monopoly on coin-minting. [h/t Jain Family Institute] — : February 19, 2020
Links:
Reporters at the Wall Street Journal collected data on school-specific vaccination rates — both overall and also for the measles, mumps, and rubella (MMR) vaccine — from 32 US states’ health departments. In total, the WSJ’s dataset covers more than 46,000 schools, of which 42,000 have at least one vaccination rate available. Most states provided data for the 2018–19 school year; the rest did so for 2017–18. — : February 19, 2020
Links:
“Electric utilities report a huge amount of information to the US government,” but “much of this data is not released in well documented, ready-to-use, machine readable formats.” That assessment comes from the Public Utility Data Liberation (PUDL) project, which aims to clean, standardize, and cross-link the electric utility information gathered by various agencies. Earlier this month, PUDL published its first data release; it includes information originally collected through Energy Information Administration Form 860 (details about individual generators) and Form 923 (individual power plants), the Environmental Protection Agency’s Continuous Emissions Monitoring System (hourly emissions), and the Federal Energy Regulatory Commission’s Form 1 (price rates and financial audits). The code PUDL uses to download, extract, and standardize the raw data is also available online. [h/t Zane Selvans] — : February 19, 2020
Links:
Tags: energy
What are the greatest hip-hop songs of all time? Last year, the BBC posed that question to more than 100 artists, producers, critics, and other experts, asking each to rank their top five tracks. (Notorious B.I.G.’s “Juicy” nabbed the highest rating.) Software engineer Simon Jockers has turned the responses into a structured dataset and visualized the results. — : February 12, 2020
Links:
Founded nearly 30 years ago, arXiv is an open-access repository of more than 1,600,000 scholarly articles — typically “preprints” of papers, uploaded by the authors before being peer-reviewed — in physics, math, computer science, statistics, economics, and several other fields. The website participates in the Open Archives Initiative, providing metadata on uploaded articles through the initiative’s protocol; it also has an API. Last summer, computer science student Bora M. Alper collected the metadata for all the site’s papers and published it as a single file. — : February 12, 2020
Links:
Tags: miscellaneous
A team of scientists has used historical hurricane and typhoon data to simulate 10,000 plausible years of cyclone activity. The dataset covers the world’s most active “basins” — the areas where cyclones form — and includes each simulated storm’s path, maximum wind speed, average pressure, and more. [h/t Jose A Cañizares] — : February 12, 2020
Links:
IATE is the European Union’s official terminology database, containing translations for words and phrases such as “orange juice,” “climate change policy,” and “competence of the Member States.” (That’s succo d’arancia in Italian, ilmastonmuutospolitiikka in Finnish, and tagállami hatáskör in Hungarian.) Over the past 20+ years, the project has accumulated more than 970,000 entries, translated into nearly 8 million phrasings in 25 languages. You can search the entries online or download the entire dataset as a single XML file. [h/t Laura Solana Garzón] — : February 12, 2020
Links:
Tags: language
Luc Laeven and Fabián Valencia — economists at the European Central Bank and the International Monetary Fund, respectively — have built and maintain a dataset of systemic banking crises, like those that rippled across the globe in 2008. First published in 2008 and most recently updated in 2018, the dataset covers 151 crises affecting 118 countries from 1970 to 2017. For each episode, the dataset provides the starting and ending dates, policy responses, output loss, fiscal cost, increase in public debt, and more. [h/t Erik Gahner Larsen] — : February 12, 2020
Links:
From the US Census Bureau, you can download a dataset of all surnames that belonged to at least 100 people in 2010, and the same for 2000. Those datasets indicate the total number of people with the name, and distribution of those people by race/ethnicity. A similar list, but based on a sample population and without demographic information, is also available for 1990. Pop quiz: Try to guess the five most popular surnames that are also colors ... in order. [h/t Lynn Cherny] — : February 5, 2020
Links:
In 2018, a trio of researchers at the Instituto Universitário de Lisboa published a dataset detailing nearly 120,000 (anonymized) bookings at two (unnamed) hotels in Portugal between July 2015 and August 2017. The bookings, extracted from the hotels’ property management systems, are described in detail: the number of adults, children, and babies for the reservation; country of origin; customer, room, and deposit types; whether the guests were repeat visitors; the number of special requests made; and more. — : February 5, 2020
Links:
Tags: miscellaneous
“I’m an Astros fan. They cheated during the 2017 regular season — the evidence is clear. In an attempt to understand the scope of the cheating and the players involved, I decided to listen to every pitch from the Astros’ 2017 home games and log any banging noise I could detect.” That’s from Tony Adams, who analyzed audio spectrograms corresponding to more than 8,200 pitches. Last week, he published a website documenting his findings, including a spreadsheet of the data. Related: Adams’s list of stories and analyses that have used the data. [h/t Dan Brady] — : February 5, 2020
Links:
As the Wuhan coronavirus outbreak intensifies, a team at Johns Hopkins has been mapping the number of confirmed cases, deaths, and recoveries. The project aggregates data from several sources, including the WHO, US CDC, European CDC, and DXY, a website that reports case counts from China’s CDC and National Health Commission. The dataset powering the Johns Hopkins map is available as a spreadsheet. [h/t Fionn Delahunty] — : February 5, 2020
Links:
Reporters at ProPublica have assembled the first-ever nationwide database of US Catholic clergy “credibly accused” of sexual abuse, based on nearly 180 official lists released by dioceses and religious orders. (The majority of the lists were published during the past year and a half, following a landmark grand jury report in Pennsylvania.) The database contains more than 6,700 names so far, plus details available from some of the lists, including birth year, ordination year, assignments, and status. — : February 5, 2020
Links:
Through a public records request, Noah Veltman has obtained data on more than 23,000 personalized license plate applications flagged for review by the California DMV. For each flagged application, the dataset contains the applicant’s justification, comments from the state’s reviewers, and the proposal’s outcome. The reviewers rejected about 80% of the proposals, including those believed to be referencing racial slurs, swearing, drugs, sex, violence, and more. Previously: New York vanity plate applications (DIP 2015.10.21). — : January 29, 2020
Links:
The researcher who goes by “Gwern” maintains a dataset of more than 80 darknet marketplaces founded between 2011 and 2015. For each market (such as the infamous Silk Road), the dataset lists when it began, when it closed, why it closed, its URL, what cryptocurrencies it accepted, whether guns were allowed to be sold, and more. — : January 29, 2020
Links:
Tags: crime, technology
Spain’s national statistics institute publishes annual microdata on the relocation of residents within, into, and out of the country. The datasets indicate each relocator’s gender and age, birthplace, previous location, and destination — down to the municipality for locations within Spain. Currently, the records cover 1988 to 2018. Related: El Confidencial’s report on intra-provincial migration patterns, using this data, and related GitHub repository. [h/t Giuseppe Sollazzo] — : January 29, 2020
Links:
Tags: immigration
The Global Georeferenced Database of Dams contains geographic data on more than 38,000 dams and their watersheds. The project, published by geographers at King’s College London, is based on a combination of satellite imagery, national registries, and other sources. At least one co-author has been working on the project since 2008. [h/t Jida Wang] — : January 29, 2020
Links:
Every year since 2003, the American Time Use Survey has been measuring how much time we spend sleeping, eating, and working; with friends, with family, and alone; and much more. Unlike many other time use surveys, the ATUS’s respondent-level datasets are freely available to the general public. Other time use surveys with downloadable data include the Russia Longitudinal Monitoring Survey and a Kosovo time use survey sponsored by the Millennium Challenge Corporation. Related: Through IPUMS, you can build custom data extracts of the ATUS, of historical US time survey data, and of the Multinational Time Use Study (registration required). As seen in: “How the American Work Day Changed in 15 Years” and “A Day in the Life: Women and Men,” two visualizations by Nathan Yau. [h/t Petrit Selimi] — : January 29, 2020
Links:
Tags: miscellaneous, statistics
RinkWatch “asks people who love outdoor skating to help environmental scientists monitor winter weather conditions” by reporting the conditions of their backyard ice rinks. The project’s downloadable data includes more than 30,000 “skateable” / “not skateable” observations of more than 1,200 rinks since 2012. As mentioned in: Last week’s New York Times article about outdoor skating trails in Quebec. — : January 22, 2020
Links:
Despite US adults increasingly gaining state-legal access to cannabis, “no universal standards for laboratory testing protocols currently exist,” write Nick Jikomes and Michael Zoorob, in a March 2018 article for Scientific Reports. “To investigate these concerns, we analyzed a publicly available seed-to-sale traceability dataset from Washington state containing measurements of the cannabinoid content of legal cannabis products from state-certified laboratories.” The dataset, obtained by the authors via public records requests, includes more than 200,000 test results over the course of three years. — : January 22, 2020
Links:
Tags: drugs
Using 46 million kilometers of data from OpenStreetMap, Christopher Barrington-Leigh and Adam Millard-Ball have published “the first systematic and globally commensurable measures of street-network sprawl based on graph-theoretic and geographic concepts.” You can explore their findings via an interactive map and also download aggregate metrics for nearly 200 cities and more than 160 countries. [h/t Roberto Rocha] — : January 22, 2020
Links:
Tags: mapping
To construct her maps “visualizing the geography of FM radio” in the US, Erin Davis combined the Federal Communications Commission’s service contour data — the area where reception for a station “is generally protected from interference caused by other stations” — with the agency’s radio station licensing data and genre information from radio-locator.com. The FCC also provides service contour data for broadcast television. Previously: All FCC-issued licenses (DIP 2016.09.07). [h/t Giuseppe Sollazzo] — : January 22, 2020
Links:
The AidData research lab at the College of William & Mary has published several datasets on China’s international soft-power efforts, including the first-ever dataset of government-financed development projects abroad (covering 3,485 projects between 2000 and 2014) and a structured dataset of diplomacy efforts in 25 Asia/Pacific countries between 2000 and 2016. [h/t Nick Routley + Simon Kuestenmacher] — : January 22, 2020
Links:
CIEMAT — a public institution in Spain that studies energy and the environment — has published nearly a decade of processing receipts from its Euler supercomputer. The records, which span the supercomputer’s entire lifetime, contain metadata for more than 9 million computing jobs, including timestamps, memory usage, and more. — : January 15, 2020
Links:
Tags: technology
When someone wants to officially name or rename a geographic feature of the United States — such as a mountain, creek, or island — they file a proposal with the US Board on Geographic Names. Those proposals end up on the agency’s “Action List,” the most recent year of which can be downloaded as a spreadsheet. Previously: The Board’s database of every US geographic name (DIP 2015.10.21). [h/t Noah Veltman] — : January 15, 2020
Links:
Tags: mapping
The Microbe Directory is an attempt to profile more than 7,000 bacteria, viruses, archaea, and other microorganisms. The directory can be downloaded in bulk and describes the microbes’ optimal temperature, optimal pH, Gram stain, pathogenicity, antimicrobial resistance, and more. — : January 15, 2020
Links:
Tags: science
The Comparative Competition Law project classifies the legal provisions and enforcement of antitrust laws around the world, over time. The project is run by law professors Anu Bradford and Adam Chilton, and features several datasets and detailed codebooks. (They require an email address but no formal registration.) Related: Chilton’s introductory Twitter thread. [h/t Libor Dusek] — : January 15, 2020
Links:
The Missing Migrants Project “tracks deaths of migrants, including refugees and asylum-seekers, who have gone missing along mixed migration routes worldwide.” Research for the project, an initiative of the International Organization for Migration, began after the fatal Lampedusa shipwrecks of October 2013. For each incident, the project’s datasets specify the location, the number of people who died, the number who are missing, the number who survived, the sources of information, the source quality, and more. Previously: European migration deaths, 1993 to May 2018 (DIP 2018.07.18). [h/t u/cavedave + Topi Tjukanov] — : January 15, 2020
Links:
Tags: death, immigration, refugees
To study how people make decisions in risky situations, a team of academics analyzed contestants’ choices in 100+ episodes of Deal or No Deal that aired in the Netherlands, Germany, and the US. Their dataset is available through ICPSR (registration required) and the Wayback Machine. — : January 1, 2020
Links:
The UC ClioMetric History Project is digitizing decades of administrative records from the University of California and other schools in the state (such as USC and Stanford). So far, they’ve uploaded data on more than 750,000 student enrollments, tens of thousands of faculty members, and 800,000 courses. — : January 1, 2020
Links:
Google has received millions of requests from copyright holders to delist billions of URLs from its search results. The company’s transparency reports include a section where you can explore and download data about these requests. One of the datasets describes 8.5 million delisting requests, with their dates, copyright holders, numbers of URLs targeted, and links to more details in the Lumen archive. Another contains every web domain targeted, while another lists the URLs for which Google says it took no action. [h/t Dan Nguyen] — : January 1, 2020
Links:
Tags: law, technology
Political scientists Boris Shor and Nolan McCarty have assigned ideology scores, on a conservative-to-liberal scale, to every US lawmaker in all 50 state legislatures. The most recent update, published May 2018, covers more than 22,000 legislators from 1993 through 2016. Shor and McCarty derived the numeric scores from a combination of legislative voting records and responses to Vote Smart’s “Political Courage Test.” — : January 1, 2020
Links:
Tags: politics
The UN’s GEMStat project provides “scientifically-sound data” on freshwater quality around the world. The data portal lets visitors explore and download water-sample results from thousands of stations in more than 80 countries. The information available for each sample varies, but it can include chemical, biological, and physical properties. Note: Not all locations have been sampled recently, and data downloads are limited to 500 stations at a time and for noncommercial purposes only. Related: “The Invisible Water Crisis,” a World Bank report from last year that used the GEMStat data. [h/t Rylan Dobson] — : January 1, 2020
Links:
Where’s The Jump aims “to provide a comprehensive listing of the jump scares in horror and thriller movies.” The website doesn’t provide data downloads, but its list of 540+ movies — with year of release, “jump count,” “jump scare rating,” availability on Netflix, and more — can be copy-pasted into a spreadsheet. Related: “The Lazy Director’s Guide To Jump Scares.” [h/t Sophie Warnes] — : December 18, 2019
Links:
Tags: entertainment, movies
Andrew Thompson scoured three years of articles from nine publications — Axios, BuzzFeed News, The Economist, The Hill, The New York Times, Politico, Vice News, Vox, and The Wall Street Journal — for mentions of any organization in Wikipedia’s list of US think tanks. The resulting spreadsheet contains more than 15,000 entries, relating to 172 think tanks. Each entry lists the publication, the think tank, the relevant sentence, and a link to the article. — : December 18, 2019
Links:
Tags: media
In the Philippines’ drug war, President Rodrigo Duterte has given police broad permission to kill suspected drug dealers. But an investigation by reporters at the Columbia Journalism School earlier this year found that “large numbers of killings [...] have been excluded from official counts.” The reporters collected data from 23 sources on 2,320 killings by police and unidentified assailants between July 2016 and December 2017 in three municipalities. Their dataset (free account required to download, alternate version here) indicates the month, local police station, whether police recorded the death, and more. Related: Because so many deaths appear to have gone unreported, the journalists worked with the Human Rights Data Analysis Group to estimate the total number of undocumented deaths. [h/t Scilla Alecci] — : December 18, 2019
Links:
The Arab Barometer has “been conducting high quality and reliable public opinion surveys in the Middle East and North Africa since 2006,” and bills itself as “the largest repository of publicly available data on the views of men and women in the MENA region.” You can download raw survey data or explore it through the Barometer’s online analysis tool. [h/t Chris Marsicano] — : December 18, 2019
Links:
Tags: gender, statistics
The Climate Action Tracker is “an independent scientific analysis” that keeps tabs on 32 countries’ progress on tackling climate change. Through its data portal, you can explore the project’s dozens of indicators — such as emissions per capita and renewable energy capacity — and also download the data in bulk. The countries covered include “the biggest emitters and a representative sample of smaller emitters,” and account for “about 80% of global emissions and approximately 70% of global population.” [h/t Erik Gahner Larsen] — : December 18, 2019
Links:
Over at The Pudding, Elle O'Brien and Jan Diehm chart the rise and fall of “big hair”. To ’do it, they combed through a public dataset of 37,921 portraits, culled from the yearbooks of 115 American high schools in 26 states between 1905 and 2013. [h/t Sophie Warnes] — : December 11, 2019
Links:
In 2004, computer scientist Cai-Nicolas Ziegler scraped (with permission) 433,000 numeric ratings of 186,000 books by 78,000 users on the book-tracking website BookCrossing. For most users, the data includes their stated city and age. [h/t Ningshan Zhang, Kyle Schmaus, and Patrick O. Perry] — : December 11, 2019
Links:
Tags: books, statistics
The GitHub Typo Corpus contains structured data on misspellings, bad grammar, and the ways they've been corrected. To build the dataset, Masato Hagiwara and Masato Mita analyzed the “commits” — sets of changes to files, typically accompanied by short summaries — made to tens of thousands of projects on the code-sharing platform GitHub. With “more than 350k edits and 65M characters in more than 15 languages,” the authors say it's “the largest dataset of misspellings to date.” [h/t u/Loves_Portisheads] — : December 11, 2019
Links:
Tags: language, technology
The International Environmental Agreements Database Project describes more than 3,700 “international environmental treaties, conventions, and other agreements with links to text, membership, performance data, secretariat, and summary statistics.” The database, hosted at the University of Oregon, includes agreements from the 1850s to the present and can be queried online. It also includes detailed pages for each treaty, such as this one for the UN’s Paris Agreement. [h/t Erik Gahner Larsen] — : December 11, 2019
Links:
“Forty-three years after the Supreme Court reversed course and reinstated the death penalty, reliable data on the individuals sent to death row is maddeningly difficult to obtain.” So reporters at The Intercept “set out to compile a comprehensive dataset on everyone sentenced to die in active death penalty jurisdictions since 1976.” The resulting database contains more than 7,300 entries; for each person, it contains demographics, sentencing information, whether the person is still on death row, whether they’ve been exonerated, and more. Previously: Death sentences (DIP 2018.08.01), executions (DIP 2019.05.15), and last words (DIP 2019.03.06). — : December 11, 2019
Links:
The Database of British and Irish Hills began 20 years ago and now catalogs more than 20,000 bits of bumpy terrain. The data files include detailed coordinates, map references, and descriptive characteristics. [h/t Declan Valters] — : December 4, 2019
Links:
Tags: mapping
Last month, Nature released findings and response data from its fifth survey of science graduate students. The questionnaire covered demographics, motivations, ambitions, satisfaction, mental health, and other topics. More than 6,000 students participated, including 1,000+ each living in Europe, the Americas, and Asia. Related: Nature’s editorial board calls for “urgent attention” to students’ mental health. [h/t Mattias Björnmalm] — : December 4, 2019
Links:
The Samoan government has been tweeting updates regarding the country’s measles crisis. Epidemiologist Chris von Csefalvay has been converting those tweets into a simple dataset that counts the number of cases and deaths (by age group) at the time of each tweet. — : December 4, 2019
Links:
Tags: disease
The Foreign Service Act of 1980 requires US presidents to provide Congress a “report on the demonstrated competence” for each ambassadorial nominee. In 2014, the government began disclosing these records, which had long been held secret. But the government’s policy applied only to new nominees. Law professor Ryan Scoville, however, used the Freedom of Information Act to pry loose the rest — more than three decades of concise biographies — and, earlier this year, put them online, accompanied by a dataset that describes the nominees’ qualifications and political contributions. Related: “Unqualified Ambassadors,” Scoville’s Duke Law Journal article on the topic, and his summary in Lawfare. [h/t u/smurfyjenkins] — : December 4, 2019
Links:
Political scientist Frederick Solt’s Standardized World Income Inequality Database provides annual Gini coefficients for 196 countries and territories. For many, the calculations go back to the 1960s. The database draws on hundreds of sources, including international organizations, national statistical offices, and academic studies. In an accompanying paper, Solt argues that the SWIID represents an improvement over other similar efforts, such as the United Nations University’s World Income Inequality Database (DIP 2016.06.01). [h/t Y. Julia Jung] — : December 4, 2019
Links:
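For readers unfamiliar with the Gini coefficient, the sketch below shows the textbook calculation on a toy list of incomes. It is only an illustration: the function name and sample figures are invented here, and the SWIID's own estimates come from a much more elaborate standardization procedure that this does not attempt to reproduce.

```python
def gini(incomes):
    """Textbook Gini coefficient for a list of non-negative incomes.

    Returns 0.0 when everyone earns the same amount and (n - 1) / n
    when one person out of n holds all the income.
    """
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one income and a positive total")
    # Weight each income by its rank (1 = poorest) in the sorted list.
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical incomes, chosen only to show the two extremes.
equal_society = [1, 1, 1, 1]
unequal_society = [0, 0, 0, 10]

print(gini(equal_society))
print(gini(unequal_society))
```

A value near 0 indicates an even distribution; values closer to 1 indicate that income is concentrated among fewer people.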
Economists Serra Boranbay-Akan and Carmine Guerriero have built a dataset describing the locations and operating years of more than 3,000 Cistercian and Franciscan monasteries in 90 European regions between the years 1000 and 1600. — : November 27, 2019
Links:
Parth Parikh, an engineering student in Mumbai, has created The Indian Movie Database, which aggregates and cross-references data on Bollywood films from IMDB, Wikipedia, and MovieLens. It currently contains more than 4,700 movies released between 1950 and 2019. — : November 27, 2019
Links:
Tags: entertainment, movies
For an article in AthensLive, Sotiris Sideris collected data on properties up for bidding through eauction.gr, Greece’s official website for auctioning real estate seized from over-indebted borrowers. The dataset includes 45,918 lots listed between mid-November 2017 (when the website launched) and September 1, 2019. For each lot, the dataset specifies the auction date, property characteristics, starting bid, total debt, debtor, “hastener” pursuing the auction, links to additional documentation, and more. — : November 27, 2019
Links:
Tags: real estate
Researchers have constructed a “unified catalog” of more than 5,000 earthquakes in Mexico. To do so, they standardized, deduplicated, and fact-checked quake information from more than 100 local, regional, and international sources. The final dataset goes back to 1787 and details each quake’s epicenter, depth, magnitude, and sources. The researchers consider the catalog to be complete for earthquakes with a moment magnitude of 6+ since 1925, and for those with a magnitude of 4+ since 1990. — : November 27, 2019
Links:
Tags: disaster
The United Nations publishes data describing its active peacekeeping missions and police and military personnel, plus all peacekeeping fatalities since 1948. Also: The International Peace Institute’s Peacekeeping Database tracks member-countries’ historical contributions of personnel going back to 1990. Plus: Last month, researchers at Uppsala University’s Department of Peace and Conflict Research introduced a Geocoded Peacekeeping Operations dataset, which provides detailed information about the location and troops for all UN peacekeeping deployments in Africa from 1994 to 2014, with stated plans to extend coverage through 2018 soon. [h/t u/smurfyjenkins] — : November 27, 2019
Links:
Tags: United Nations, conflict
Last October, the Squirrel Census dispatched 300+ volunteers to record every squirrel they saw in Central Park. In June, the organizers published a multimedia report of their findings (at $75 each). Now they’ve made the data — with each squirrel’s location, fur color, behavior, and more — available through New York City’s open data portal. Related: CityLab has more details on the project and interviews its creator. [h/t Jesus M. Castagnetto + Tidy Tuesday] — : November 20, 2019
Links:
Tags: animals
A group of academics has partnered with a soccer-data company to publish what they believe to be “the largest collection of soccer-logs ever released.” The dataset describes every “event” on the field — each pass, shot, foul, tackle, penalty, and more — for the 2017/18 season of five European leagues, the 2018 World Cup, and the 2016 European championship. — : November 20, 2019
Links:
Tags: sports
Since the mid-1980s, the US Census Bureau has periodically conducted its Survey of Income and Program Participation and provided anonymized, respondent-level data. In addition to the titular topics, the extensive questionnaires also ask about “family dynamics, educational attainment, housing expenditures, asset ownership, health insurance, disability, child care, and food security.” Related: My colleagues Scott Pham and Venessa Wong used SIPP’s data on familial support to “finally end the myth of the lazy millennial.” — : November 20, 2019
Links:
OpenEI’s Utility Rate Database contains nearly 50,000 expert-verified rates — current and historical — for residential, commercial, industrial, and street-lighting electricity from thousands of US utility companies. Related: OpenEI’s other data offerings. [h/t Arik Levinson and Emilson Delfino Silva] — : November 20, 2019
Links:
Tags: energy
Since early 2018, the Cook County State's Attorney's Office has been publishing detailed data on every felony case it has handled since 2010. (The office, whose jurisdiction includes Chicago, is the second-largest local prosecutorial agency in the country.) The datasets cover four main stages: intake, initiation, charging, and sentencing. Related: In a recent Pudding article copublished with The Marshall Project and Chicago Reporter, Matt Daniels used the data to examine how prosecutions have changed under State’s Attorney Kim Foxx, who promised reforms. — : November 20, 2019
Links:
AgeGuess.org asks visitors to guess people’s ages, based on photographs. You can download a database of the results, which currently includes more than 220,000 guesses about more than 4,600 photos. For each photograph, the database also includes some metadata, such as the person’s actual age. Related: The researchers describe their project in Scientific Data. — : November 13, 2019
Links:
Tags: miscellaneous
A team of researchers has developed a statistical model to estimate the flow of food commodities between every pair of US counties in 2012. To calculate the estimates, the researchers used data from the Census’s Commodity Flow Survey, ORNL’s Freight Analysis Framework, the USDA’s Census of Agriculture, and several other sources. Related: One of the paper’s authors summarizes the findings. [h/t Jain Family Institute weekly newsletter] — : November 13, 2019
Links:
Tags: agriculture, economics, food
In a recent working paper titled “Instruments of Debtstruction,” researchers at the International Monetary Fund share a “comprehensive instrument-level database of sovereign debt for 18 advanced and emerging countries over the period 1913–46.” (The dataset currently published alongside the paper seems to be missing one of the 18 countries, Russia.) The “instruments” include bonds, credit lines, and several other forms of debt; for each instrument issued, the dataset contains the debt’s coupon rate, maturity, and currency. — : November 13, 2019
Links:
The GDELT Project’s Web News Ngram dataset keeps track of the frequency of individual words and two-word phrases in online news around the world. The dataset incorporates news sources in 142 languages and provides overall word counts for every 15-minute window since January 1, 2019. An additional dataset tracks phrasings used in 10 character-based languages. Previously: GDELT’s similar dataset for television news (DIP 2019.08.21). [h/t Kalev Leetaru] — : November 13, 2019
Links:
Tags: journalism, language, media
The Trace has posted raw data on 4.3 million murders, nonfatal shootings, assaults, robberies, and rapes, obtained from 56 police and sheriff’s departments in the United States. Related: Sarah Ryley's introductory Twitter thread. Also related: The Trace and BuzzFeed News’ investigative reporting on cities’ failure to arrest shooters, for which Sarah, Sean Campbell, and I used many of these datasets. — : November 13, 2019
Links:
Tags: crime, statistics
The Data Visualization Society has published the results of its annual community survey, which received 1,359 responses from data visualization practitioners. The public data contains answers to 50 questions on topics such as compensation, tools, community, and more. [h/t Amy Cesal] — : October 23, 2019
Links:
Tags: statistics
The Pudding’s Amber Thomas used PetFinder’s API to collect detailed data on all adoptable dogs at shelters and rescue organizations on a single day in September. Related: Thomas's story for The Pudding, which uses the data to examine state-to-state relocations. — : October 23, 2019
Links:
Tags: animals
The National Institutes of Health’s new Open Citation Collection brings together 420 million academic citations in biomedical literature. The data — the most comprehensive available for biomedicine — now underpins the NIH’s iCite platform, where you can explore citation statistics online. The citations are also available as a bulk download and via an API. [h/t Travis Hoppe] — : October 23, 2019
Links:
Tags: healthcare, science
The European Central Bank has begun publishing a spreadsheet of all executive board members’ speeches since the late 1990s. The dataset contains each speech’s date, speaker(s), title, subtitle, and text; the ECB says it will be updated every two months. [h/t Volker Nitsch + Peter Tillmann] — : October 23, 2019
Links:
The Autocratic Ruling Parties Dataset bills itself as “the first comprehensive data set on the founding origins, modes of gaining and losing power, ruling tenures, and other characteristics of autocratic ruling parties.” The dataset, created by political science professor Michael K. Miller, covers nearly 500 parties in more than 150 countries between 1940 and 2015. [h/t u/smurfyjenkins] — : October 23, 2019
Links:
Tags: government, politics
A team of researchers has “derived measurements of essential functional traits” for more than 2,500 species of palm plants — including but not limited to palm trees. Their PalmTraits database, based on published studies and preserved specimens, includes variables such as maximum height, fruit shape, and whether the fruit color is “conspicuous.” — : October 16, 2019
Links:
Tags: agriculture, plants
Stanford’s How Couples Meet and Stay Together study, which receives funding from both the university and the National Science Foundation, has been asking American adults about dating since 2009. A new-and-updated version of the survey includes questions related to dating apps. [h/t u/morningshower] — : October 16, 2019
Links:
Tags: family, technology
Since 2017, the Alliance for Affordable Internet has been collecting country-level prices for mobile data. The most recent data covers 99 low- and middle-income countries for the second quarter of 2019. The rates are based on “the cheapest plan(s) providing at least 1GB of broadband data over a 30-day period from the largest mobile network operator in each country.” [h/t Teddy Woodhouse] — : October 16, 2019
Links:
Tags: technology
Last week, Pacific Gas and Electric began cutting power to hundreds of thousands of Californians — a precaution to keep the company’s aging infrastructure from sparking wildfires. Simon Willison has been scraping PG&E’s outage website every 10 minutes, and pushing the results into a database you can query and download. [h/t Lam Thuy Vo] — : October 16, 2019
Links:
The Executive Approval Project uses international polling data to measure public support for presidents, prime ministers, and other political executives in 50 countries. For most of the countries, the database goes back to the 1990s; for some, it goes even further. Access requires providing a name, affiliation, and email address — plus agreeing to receive updates. [h/t Erik Gahner Larsen] — : October 16, 2019
Links:
Web developer Tim Biles has created an unofficial Breaking Bad API. It provides structured data on every character, episode, and death in the TV series, plus selected quotations. — : October 9, 2019
Links:
Tags: entertainment, television
Programmer Lee Butterman has built a dataset of the SSL encryption connections associated with 350 million web domains. For each connection, the dataset indicates the SSL certificate’s issuer, cryptographic algorithms used, and other details. [h/t Jason Norwood-Young] — : October 9, 2019
Links:
Tags: technology
Last month, Smart Power India and the Initiative for Sustainable Energy Policy published Rural Electricity Demand in India, a new survey dataset that “covers 10,000 households and 2,000 rural enterprises across 200 villages in Bihar, Uttar Pradesh, Odisha, and Rajasthan.” Respondents were asked, among other things, how many hours per day they get electricity, whether they have solar panels, and the price they pay for kerosene. [h/t Hisham Zerriffi + Johannes Urpelainen] — : October 9, 2019
Links:
Tags: energy
The Parliamentary Rules Database traces “the formal rules of procedure for various parliaments over time.” Currently, the database covers two parliaments — the UK House of Commons and the Irish Dáil. The House of Commons info includes more than 137,000 “standing orders,” going all the way back to 1811. [h/t Erik Gahner Larsen] — : October 9, 2019
Links:
Tags: government, history
Last week, the International Monetary Fund released the results of its 10th annual Financial Access Survey. It’s a “supply-side” dataset; its country-level metrics include, for instance, the number of automated teller machines (mainland China has the most, with more than 1 million) and active mobile banking accounts (Pakistan and Bangladesh are tops). Many of the metrics are also disaggregated by gender. — : October 9, 2019
Links:
Tags: economics, money, statistics
Nearly 10,000 people took the American Institute of Graphic Arts’ “Design Census” earlier this year. The (non-scientific but detailed) results are out, and the raw data are (prominently) available to download. [h/t Robin Sloan] — : October 2, 2019
Links:
Tags: art, statistics
Machine learning scientist Agnieszka Mikołajczyk has been gathering useful resources for identifying birds by sound. The resources include about a dozen datasets of audio recordings, many of which are immediately downloadable. — : October 2, 2019
Links:
Official, reliable data on Cuba is hard to come by. So Cuban journalist Barbara Maseda’s Proyecto Inventario has been collecting and publishing datasets relevant to the island nation — including by documenting the country’s legislators, blackouts, and non-agricultural cooperatives. Related: Júlio Lubianco’s profile of Maseda and the project. — : October 2, 2019
Links:
Tags: statistics
Snapchat has released detailed data about every political ad purchased on its platform in 2018 and 2019. For each ad, the information includes its targeting parameters (age, gender, location, interests, internet service provider, device operating system, and more), the dates it ran, the amount spent, number of impressions, and a link to the ad itself. Snap says this year’s data will be updated daily and that new ads will appear within 24 hours of first delivery. [h/t Erik Gahner Larsen] — : October 2, 2019
Links:
Tags: politics, social media
The Federal Reserve’s Enhanced Financial Accounts datasets are supplements to the central bank’s more-aggregated Financial Accounts of the United States statistics. Among them: wealth, asset, and debt distributions by percentile and demographic. Also: College savings plans by state and the banking industry’s balance sheet. [h/t u/loopback2019] — : October 2, 2019
Links:
In 2012, the British government collected noise-level data from across the country and throughout London, including average daytime and nighttime loudness. [h/t Giuseppe Sollazzo] — : September 25, 2019
Links:
Tags: audio, environment
Microsoft has released a dataset describing the geometric footprints of 12 million buildings in Canada, as detected by a neural network analyzing satellite imagery. (Last year, the company published similar data for the United States.) And the government of New Zealand has published building-outline data for most of the country. It’s based on aerial imagery, “using a combination of automated and manual processes,” and comes with detailed documentation. [h/t Michael McLaughlin + Robin Hawkes] — : September 25, 2019
Links:
Tags: architecture, mapping
Political science professor Jack Paine has compiled a dataset of 144 territories colonized by Britain, France, Portugal, Spain, the Netherlands, Belgium, Italy, the United States, Australia, South Africa, and New Zealand during the 16th through 20th centuries. The dataset includes the year of colonization, year of independence, and various metrics related to the colonies’ legislature and suffrage. — : September 25, 2019
Links:
Tags: history
The Socioeconomic High-resolution Rural-Urban Geographic Platform for India is “an open access repository currently comprising dozens of datasets covering India’s 500,000 villages and 8,000 towns.” To make the datasets work well together, the project uses a common set of IDs for each town, village, and constituency. The downloadable files (free registration required) include data from censuses, elections, road construction, and more. Related: Introductory tweets from co-creator Paul Novosad. — : September 25, 2019
Links:
Tags: mapping, statistics
The Carp-Manning U.S. District Court Database provides “data on 110,000+ decisions by federal district court judges handed down from 1927 to 2012.” It includes details of each case (such as the issue area and jurisdiction), each judge (year appointed, gender, race, and political party), and whether the decision was “liberal” or “conservative.” Previously: Federal judges’ data-biographies (DIP 2019.08.07). [h/t Scott Hofer and Jason Casellas] — : September 25, 2019
Links:
Tags:
CIRCL, Luxembourg’s computer security incident response team, has published a dataset of 37,500 .onion website screenshots, a subset of which have been categorized by topic (e.g., “drugs-narcotics”, “extremism”, “finance”) and/or purpose (e.g., “forum”, “file-sharing”, “scam”). [h/t Alexandre Dulaunoy] — : September 18, 2019
Links:
Tags: crime, extremism, technology
Urban planning professor Geoff Boeing’s US street network data represents America’s roads as a network graph, where each intersection (and dead-end) is a node, and each street segment is an edge between two of those nodes. The project’s data repository contains these networks for each city, county, Census tract, and more. You might remember: Boeing’s urban street orientation charts. [h/t Robin Hawkes] — : September 18, 2019
Links:
Tags: mapping
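The node-and-edge representation described above can be sketched with a toy example. The intersections, segment lengths, and function names below are invented for illustration and do not reflect the file formats or field names in Boeing's repository.

```python
from collections import defaultdict

# Hypothetical street segments: (intersection_a, intersection_b, length_in_meters).
segments = [
    ("A", "B", 120.0),
    ("B", "C", 80.0),
    ("B", "D", 95.0),
    ("C", "D", 60.0),
    ("D", "E", 150.0),
]


def build_graph(segments):
    """Build an undirected graph: each node maps to its neighbors and segment lengths."""
    graph = defaultdict(dict)
    for a, b, length in segments:
        graph[a][b] = length
        graph[b][a] = length
    return dict(graph)


def dead_ends(graph):
    """Return the nodes touched by exactly one street segment."""
    return sorted(node for node, neighbors in graph.items() if len(neighbors) == 1)


graph = build_graph(segments)
print(len(graph), "nodes")
print("dead ends:", dead_ends(graph))
```

Storing the network this way makes questions such as "how many dead ends does this neighborhood have?" a matter of counting nodes with a single neighbor.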
The State Networks dataset gathers comparative and relationship metrics for every combination of the 50 US states, plus the District of Columbia. Among the metrics: the number of flights between each state-pair, migration in either direction, and total value of goods imported. The comparisons also include state-to-state differences in demographics, ideology, and GDP. [h/t Matt Grossmann] — : September 18, 2019
Links:
Political science professor Jamie Monogan has compiled a dataset of more than 2,700 immigration laws passed by US state legislatures from 2005 to 2016. The dataset summarizes the laws and also categorizes them by subject, scope, and whether they appear to be welcoming or hostile to immigrants. [h/t Jason Anastasopoulos] — : September 18, 2019
Links:
Tags: immigration
Since 1988, Brazil’s PRODES project has been using satellite imagery to track clear-cutting in the country’s Amazon basin. The government’s TerraBrasilis web portal provides an interactive map and downloads of the data. Global Forest Watch also provides a dataset of PRODES-detected deforestation, from 2001 to 2015. [h/t Giuseppe Sollazzo] — : September 18, 2019
Links:
Tags: environment, mapping, plants
Researchers at Brazil’s Federal University of Ceará have published a new dataset “composed of more than 70,000 bug-fix reports from 10 years of bug-fixing activity of 55 projects from the Apache Software Foundation.” — : September 11, 2019
Links:
Tags: technology
Since 1992, the US Bureau of Labor Statistics has collected data on work-related deaths through its Census of Fatal Occupational Injuries. The results are presented as various cross-tabulations — by industry, demographic, circumstances, and more. Related: The agency also publishes data on non-fatal injuries and illnesses. [h/t Elissa Philip Gentry and W. Kip Viscusi] — : September 11, 2019
Links:
Transport for London has launched its Cycling Infrastructure Database, which “contains the location of more than 240,000 pieces of cycling infrastructure in London, including places to park and the location of cycle lanes.” The new information can be found among the agency’s broader collection of cycling data; look for the “CyclingInfrastructure” folder. [h/t Jolyon Whaymand] — : September 11, 2019
Links:
Uber Movement, from the titular ride-hailing company, “shares anonymized data aggregated from over ten billion trips to help urban planning around the world.” Online, you can explore street speeds and estimated travel times for dozens of cities. To download data from the website, Uber requires you to provide your name, email address, and purpose. But they also provide a command-line tool that lets you download street-speed data without any registration. [h/t Michael A. Rice] — : September 11, 2019
Links:
Tags: mapping, transportation
The UN’s World Database on Protected Areas is, it says, “the most up to date and complete source of information on protected areas, updated monthly with submissions from governments, non-governmental organizations, landowners and communities.” It contains structured, geospatial information on more than 245,000 nature reserves, national parks, wildlife sanctuaries, and other kinds of conservation sites. The project provides bulk downloads, an interactive map, country-level statistics, and an API. Previously: The California Protected Areas Database (DIP 2019.07.10). [h/t Giuseppe Sollazzo] — : September 11, 2019
Links:
FiveThirtyEight has built a dataset of 65 college football fight songs, which contains each song’s name, authors, year written, tempo, duration, and whether it includes various tropes, such as spelling out words or mentioning the school’s colors. Related: FiveThirtyEight’s “Guide To The Exuberant Nonsense Of College Fight Songs,” where you can listen to the songs, read the lyrics, and explore an interactive chart of tempo versus duration. — : September 4, 2019
Links:
The Drama Corpora Project has collected and processed more than 800 plays in German, Greek, Spanish, Russian, Latin, and English. For each play, the project provides a structured-data version of the text, a network diagram, speech distribution metrics, plus several other files and features. [h/t Lynn Cherny] — : September 4, 2019
Links:
Tags: entertainment, language
The 3PFL dataset — Patents and Publications with a Public-Funding Linkage — lists more than 13,000 US patents that have acknowledged federal funding. The dataset, accompanied by a detailed methodology, also links the patents to details about the funding, as well as to scientific publications that stemmed from it. Previously: Patent geography (DIP 2019.07.31). [h/t Gaétan de Rassenfosse] — : September 4, 2019
Links:
Tags: technology
The Central African Republic’s ongoing civil war has pressed more than 600,000 people to flee the country. The violence has also internally displaced another 600,000 people, a phenomenon that the UN's Humanitarian Data Exchange has been tracking. In addition to counts of internally displaced people by locality, the UN’s datasets include a listing of refugee sites and the country's road network. Related: A multimedia presentation of one family's 600-kilometer journey in search of safety. [h/t Becky Band Jain] — : September 4, 2019
Links:
The University of Oxford’s Malaria Atlas Project collects, models, and publishes a range of datasets related to the mosquito-borne disease, including localized incidence rates. You can explore and download the data, layer by layer, through the project’s interactive map. [h/t Clara Burgert-Brucker] — : September 4, 2019
Links:
James E. Cutting, a Cornell University psychology professor, has compiled several datasets on the structure of popular films, including one that indicates the length of each shot in 220 movies from 1915 to 2015. [h/t Igor Schwarzmann + Noah Brier] — : August 28, 2019
Links:
Tags: entertainment, movies
Legal scholar and open-data enthusiast Hanjo Hamann has digitized seventy years of rosters from Germany’s seven federal courts, extracted structured data about the judges, and linked them to their Wikidata IDs. Related: Hamann’s detailed description of the dataset’s historical context and its construction. [h/t Erik Gahner Larsen] — : August 28, 2019
Links:
Tags: law
Government professor C. Lawrence Evans’ dataset of US House "whip counts" describes more than 650 of the informal polls conducted by party leadership — covering 1955–86 for Democrats and 1975–80 for Republicans, on topics as varied as dairy prices, Alaskan statehood, voting rights, and Vietnam. It also indicates how each party member responded. [h/t Neil Malhotra + Janet Box-Steffensmeier] — : August 28, 2019
Links:
Tags: politics
A team led by meta-research pioneer John Ioannidis has developed a dataset of citation metrics for science’s 100,000 most-cited authors. The dataset includes each author’s name, institutional affiliation, number of publications, total citations, “h-index,” and more. For each citation metric, there’s a second version that excludes self-citations. Related: “Hundreds of extreme self-citing scientists revealed in new database” (Nature). — : August 28, 2019
Links:
Tags: science
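For readers unfamiliar with the “h-index” mentioned above: it is the largest number h such that an author has at least h papers with at least h citations each. The sketch below shows how excluding self-citations can lower it. The paper counts are hypothetical, and this is not the researchers' own code.

```python
def h_index(citation_counts):
    """Return the largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical author: (total citations, self-citations) for each paper.
papers = [(25, 5), (8, 6), (5, 0), (4, 3), (3, 0)]

with_self = h_index([total for total, _ in papers])
without_self = h_index([total - own for total, own in papers])
print(with_self, without_self)
```

In this made-up example the index drops by one once self-citations are removed, which is the kind of gap the dataset's second set of metrics is meant to expose.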
The OECD’s ADIMA database tracks multinational corporations — Walmart, Toyota, Nestle, etc. — and their subsidiaries. It currently includes economic statistics about each of the world’s 100 largest multinationals, the names and locations of 26,000 subsidiaries, and information about nearly 20,000 of their websites. The OECD says it plans to expand the number of companies in the future. Now you know: In 2016, the companies in the dataset “generated nearly $10 trillion in revenues (almost 20% of global GDP), earned $730 billion in profits and paid $185 billion in taxes,” according to the OECD. — : August 28, 2019
Links:
Tags: business
The Confidence Database is aggregating data from behavioral studies that have asked participants how confident they were in their own assessments. As of its launch earlier this month, the database contains 145 datasets, 8,700 participants, and 4 million individual observations. [h/t Audrey Mazancieux + Doby Rahnev] — : August 21, 2019
Links:
Tags: statistics
On Monday, the British government published a dataset of voting results, by party and parliamentary constituency, for every UK general election since 1918 — merging modern data with a handful of historical sources. — : August 21, 2019
Links:
The TV-NGRAM project pulls 14 TV stations’ data from the Television News Archive and calculates how often each word (and two-word combination) was said during each 30-minute window. Most of the stations’ counts go back 9 or 10 years, and all are updated daily. — : August 21, 2019
Links:
Tags: journalism, language, media
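The windowed counting that the TV-NGRAM entry describes can be pictured with a short sketch. It is an illustration only, not the project's code: the `(timestamp, text)` caption format, the lowercase whitespace tokenization, and the function names are all assumptions, and two-word combinations are counted within each caption line rather than across lines.

```python
from collections import Counter
from datetime import datetime


def window_start(ts, minutes=30):
    """Round a timestamp down to the start of its window."""
    return ts.replace(minute=ts.minute - ts.minute % minutes, second=0, microsecond=0)


def ngram_counts(captions):
    """Map each window start to a pair of Counters: (words, two-word combinations)."""
    counts = {}
    for ts, text in captions:
        words = text.lower().split()
        unigrams, bigrams = counts.setdefault(window_start(ts), (Counter(), Counter()))
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return counts


# Hypothetical caption lines from one station.
captions = [
    (datetime(2019, 8, 21, 9, 5), "breaking news tonight"),
    (datetime(2019, 8, 21, 9, 20), "more breaking news"),
    (datetime(2019, 8, 21, 9, 40), "weather update"),
]
result = ngram_counts(captions)
```

Here the first two captions fall in the 9:00 window and the third in the 9:30 window, so `result` has two keys, each holding its own word and word-pair tallies.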
Joshua Tschantret, a political science Ph.D. candidate at the University of Iowa, has compiled a dataset of 260+ terrorist groups formed between 1860 and 1969. For the purposes of the dataset, “terrorist groups are operationally defined as politically-motivated non-state actors using bombings or assassinations,” Tschantret writes in an introductory article (PDF). About one-third of the groups in the dataset operated in the US, Russia, or China; the rest are spread across dozens of other countries. Related: Additional documentation (PDF). Good to know: On Twitter, Tschantret explains why the Black Panthers are included. [h/t Carla Martinez Machain] — : August 21, 2019
Links:
The Joint Organisations Data Initiative (JODI) coordinates the collection, standardization, and publication of oil and gas data from around the world; the 100+ countries that participate represent the vast majority of global production. The oil data goes back to 2002; the gas data goes back to 2009. Both datasets are updated monthly and track a range of subproducts (e.g., crude oil, diesel, jet fuel) and flows (e.g., imports, exports, production) for each country. Previously: Global oil and gas infrastructure (DIP 2018.06.06) and state-owned oil companies (DIP 2019.05.01). — : August 21, 2019
Links:
Tags: energy
Katherine M. Kinnaird and John Laudun — professors whose research includes cultural analytics and computational folklore studies — have created a dataset of 2,656 TED talks, with metadata and transcripts, and have published a detailed description of the project. [h/t Lynn Cherny] — : August 14, 2019
Links:
Tags: entertainment, media
The Open Power System Data platform has aggregated energy data from across Europe into a series of standardized datasets, including electricity consumption, power plants, and generation capacity. The project has also published an “IT philosophy,” a guide for new users, and a detailed listing of primary sources. — : August 14, 2019
Links:
Tags: energy
ThePLUG, a news site that reports on the black innovation economy, has been collecting data on conferences for black tech professionals. The dataset currently contains 33 events in more than a dozen cities, and lists their costs, year started, contact information, sponsors, and more. [h/t Sherrell Dorsey] — : August 14, 2019
Links:
Tags: race, technology
OurAirports, a community-assisted project that began in 2007, provides bulk data detailing 55,000+ airports and 41,000+ runways, plus listings of airport radio frequencies and global navigation aids. In addition to standard airports, the records include 23 balloonports, 1,000+ seaplane bases, and 11,000+ heliports. Related: “How we created a map of the global architecture of airport runways, which turned out to be a wind map.” [h/t Robin Hawkes] — : August 14, 2019
Links:
Tags: transportation
The London Stage Database “is the latest in a long line of projects that aim to capture and present the rich array of information available on the theatrical culture of London, from the reopening of the public playhouses following the English civil wars in 1660 to the end of the eighteenth century.” The database contains information on more than 50,000 events, which you can search online and download in bulk, and are often supplemented with detailed notes and cast lists. The site also offers a user guide and a detailed explanation of the data’s provenance. (“We hope that visitors to the site will find this frank acknowledgment and foregrounding of the dataset’s history and limitations refreshing rather than frustrating.”) [h/t Ula Klein] — : August 14, 2019
Links:
Tags: entertainment, history
About a third of US states hold a monopoly on the local sale of hard liquor. Some of them — including Virginia, Alabama, Michigan, Utah, and North Carolina — let you download their price lists as spreadsheets. [h/t Christopher Ingraham] — : August 7, 2019
Links:
Tags: alcohol
Brigham Young University’s Antarctic Iceberg Tracking Database provides surveillance on hundreds of floating hunks of ice, past and present. The records cover 1978 plus 1992 through mid-2019; a subset of the database lists 117 icebergs’ daily position, estimated size, and rotation angle. [h/t Robin Hawkes] — : August 7, 2019
Links:
An international team of researchers has compiled a “comprehensive spatial inventory” of nearly 100,000 public health facilities in sub-Saharan Africa. The dataset includes facilities in 50 countries and lists each facility’s name, country, administrative region, type, ownership, and coordinates. [h/t Karen Grepin] — : August 7, 2019
Links:
Tags: healthcare
The government-run Federal Judicial Center publishes a daily-updated “biographical directory” of all judges who’ve served on federal courts — the Supreme Court, appellate courts, district courts, the bygone circuit courts, plus a few others. The directory is presented as structured data, and includes information on the judges’ demographics, educations, professional careers, nominations and more. Related: The University of South Carolina’s Judicial Research Initiative also maintains historical datasets of district and appellate court judges; they contain many of the same variables plus some extras, such as religion and estimated net worth. [h/t Dan Nguyen + Sergio Galletta, Elliott Ash, and Daniel L. Chen] — : August 7, 2019
Links:
Tags: law
The Washington Post and the Charleston Gazette-Mail recently won a year-long legal battle to obtain a large slice of the Drug Enforcement Administration’s data on opioid shipments. (The data had previously been provided to plaintiffs in a federal lawsuit, but a judge had sealed the records from public access.) The Post has begun publishing its findings, as well as a cleaned-up version of the dataset that focuses on “shipments of oxycodone and hydrocodone pills to chain pharmacies, retail pharmacies and practitioners” between 2006 and 2012. The raw, unsealed dataset is also available. Related: A 500-row subset, so you can see what the data looks like before downloading the large files. — : August 7, 2019
Links:
Tags: drugs, healthcare
Duncan Geere has compiled a database of the 48 dogs who participated in the USSR’s space program in the 1950s and 1960s. The information, which also includes details about the canines’ 42 flights, is based on Olesya Turkina's book, Soviet Space Dogs. — : July 31, 2019
Links:
The UK Institute for Government has been updating a spreadsheet of ministers who’ve resigned since 1979, the post each one held, the reasons for resignation, and the prime minister in charge at the time. The spreadsheet, which so far contains 151 resignations through last week, includes a few methodological notes embedded as comments in the header row. [h/t Gavin Freeguard] — : July 31, 2019
Links:
Tags: politics
Researchers at two Swiss universities have created a dataset of inventors’ and applicants’ locations listed in 18.8 million patents filed between 1980 and 2014. The locations, which span 46 countries, are specified both by their geographic coordinates as well as their administrative areas (e.g. city, state, country). [h/t Gaétan de Rassenfosse] — : July 31, 2019
Links:
Tags: technology
A team of researchers at the MIT Media Lab has built a corpus of machine-generated transcriptions from 284,000 hours of talk radio. The transcripts capture approximately 2.8 billion words from 50 semi-randomly selected stations, and include metadata, such as the program name, the speaker’s (guessed) gender, and whether the speaker seemed to be in the studio or on the phone. [h/t Lynn Cherny] — : July 31, 2019
Links:
For nearly two decades, the US Department of Defense has released detailed tables on the foreign military units it has trained. For each training, the information describes the units trained, number of trainees, course name, start and end dates, location, cost, and more. Unfortunately, the government publishes these records only as PDFs. To make the data more accessible, Security Force Monitor, a project of the Columbia Law School Human Rights Institute, has converted the PDFs into an open, queryable database. An associated GitHub repository contains an extensive methodology, the extraction code, and the raw data. [h/t Jamon Van Den Hoek] — : July 31, 2019
Links:
Tags: military
“The Merchant Shipping Act 1835 required all British registered ships of 80 tons or more employed in the coastal trade or fisheries to carry crew agreements and accounts, often referred to as crew lists.” The lists include crew members’ ages, places of birth, previous vessels, and more. Thanks to the National Library of Wales Volunteering Programme, thousands of crew lists from the Welsh port of Aberystwyth, from 1856 to 1914, have been transcribed. [h/t u/cavedave] — : July 24, 2019
Links:
The Standardised Precipitation-Evapotranspiration Index is a metric, calculated from climatic data, that “can be used for determining the onset, duration and magnitude of drought conditions with respect to normal conditions.” The project, based at the Spanish National Research Council, provides both a “near real-time” global drought monitor and a historical database. — : July 24, 2019
Links:
ICESat-2, launched by NASA in September 2018, “is measuring the height of a changing Earth one laser pulse at a time, 10,000 laser pulses per second”; the satellite “allow[s] scientists to monitor the elevation of ice sheets, glaciers, sea ice, and more—all in unprecedented detail.” Its datasets are available to download. [h/t Michael McLaughlin] — : July 24, 2019
Links:
As part of Oak Ridge National Laboratory’s efforts to evaluate America’s hydropower resources, researchers there have developed a system (and corresponding dataset) for classifying all 2.6 million streams in the Lower 48 by size, hydrology, gradient, temperature, and “valley confinement.” Elsewhere, other researchers have assessed the “connectivity status of 12 million kilometres of rivers globally” and have identified “those that remain free-flowing in their entire length”; you can download that data and also explore it online. — : July 24, 2019
Links:
The Water Observatory “provides reliable and timely information about surface water levels of water bodies across the globe.” The locations are based on NASA’s Global Reservoir and Dam Database and the World Wildlife Fund’s Global Lakes and Wetlands Database. Concerned about the accuracy of the boundaries in those databases, the researchers instead treated them as a “collection of potentially interesting water bodies” and then “extracted their polygons from the OpenStreetMap.” Of the 40,000 bodies of water they extracted, they’ve published water level data for roughly 7,000 through the project’s interactive dashboard and API. [h/t Emma Vitz] — : July 24, 2019
Links:
The Database of Global Administrative Areas aims “to map the administrative areas of all countries, at all levels of sub-division.” With 386,735 divisions and counting, “this is a never ending project, but we are happy to share what we have.” Note: “commercial use is not allowed without prior permission.” — : July 17, 2019
Links:
Tags: mapping
The United States’ Foreign Agents Registration Act requires lobbyists who represent foreign governments to file paperwork with the Department of Justice. The database has long been available to browse online; last month, however, the agency added three new features: full-text search, an API, and bulk downloads. [h/t Lachlan Markay + Jack Corrigan + u/surlyq] — : July 17, 2019
Links:
Tags: politics
The PluriCourts Investment Treaty Arbitration Database (PITAD) provides “a comprehensive, regularly-updated and networked overview of all-known investment arbitration cases.” You can download the 1,400+ cases or explore them online, searching by case, arbitrator, investor, or country. Note: PITAD says its data are “strictly for academic use.” Related: My former colleague Chris Hamby’s “The Court That Rules the World” series — “an exposé of a dispute-settlement process used by multinational corporations to undermine domestic regulations and gut environmental laws at the expense of poorer nations,” as the Pulitzer committee put it. [h/t Joel Dahlquist Cullborg] — : July 17, 2019
Links:
Tags: conflict
A group of researchers have collected, parsed, and added metadata to all UN Security Council debates from 1995 through 2017. The dataset includes more than 65,000 speeches (with information about each speaker), extracted from nearly 5,000 meeting transcripts. Related: The authors describe their methodology. [h/t Ronny Patz] — : July 17, 2019
Links:
Tags: United Nations
The CITES Trade Database, named after the Convention on International Trade in Endangered Species of Wild Fauna and Flora, contains information about more than 20 million shipments of wildlife (e.g., live tapirs, sturgeon eggs, wolf skulls) and wildlife products (e.g., venus flytrap extract) since 1975. The database is maintained by a UN agency and includes the year of the shipment; the scientific name of the plant or animal; the type and quantity of the particular thing being traded; their purpose and source; and the country of origin, export, and import. Related: Citesdb, an R package for analyzing the database. — : July 17, 2019
Links:
James Fee has compiled a dataset of more than 400 baseball stadiums from more than 40 leagues around the world; each stadium’s information includes its name, team(s), league(s), and geographic coordinates. — : July 10, 2019
Links:
Tags: sports
With more than 15,000 “super units,” and an even larger number of subdivisions within them, the California Protected Areas Database is “the authoritative GIS database of parks and open space in California.” It’s one of the two main databases that the California Natural Resources Agency publishes regarding protected lands; the other, the California Conservation Easement Database, tracks restricted-use private land. [h/t @cartonaut] — : July 10, 2019
Links:
Tags: environment, mapping
In order to develop its maps of North American ecoregions, the US Environmental Protection Agency consulted with other federal agencies and state agencies, plus the governments of Canada and Mexico. Each “ecoregion” is an area with “similarity in the mosaic of biotic, abiotic, terrestrial, and aquatic ecosystem components with humans being considered as part of the biota.” The maps are available both as PDFs and as geospatial data files, at four levels of increasing specificity. [h/t Brandyn Friedly] — : July 10, 2019
Links:
Tags: environment, mapping
The Open Observatory of Network Interference, run by the Tor Project, “collects and processes network measurements with the aim of detecting network anomalies, such as censorship, surveillance and traffic manipulation.” You can volunteer to run OONI’s tests from your computer or phone; so far, “millions of network measurements have been collected from more than 200 countries since 2012.” You can explore that data online, download it in bulk, and access it via an API. Related: OONI’s blog, which includes reports on some of its findings. [h/t John Emerson] — : July 10, 2019
Links:
Tags: censorship, technology
Last month, the US Federal Emergency Management Agency released two major datasets from its National Flood Insurance Program: more than 47 million insurance policies and more than 2 million insurance claims. The latter includes details on each claim’s property, flood zone, amount paid, and more. Both datasets have been partially redacted to remove personally-identifiable information. [h/t Anna Weber] — : July 10, 2019
Links:
“In this paper, we aim to teach a machine how to make a pizza,” writes a team of computer scientists from MIT and the Qatar Computing Research Institute. One of the key ingredients: 9,213 photos of pizza, with their lists of toppings annotated by Amazon Mechanical Turk workers. [h/t Kristin Houser + Center for Data Innovation] — : July 3, 2019
Links:
Tags: food
Global Mangrove Watch uses satellite data to track the global extent of those coastal intertidal forests; the project’s seven snapshots span 1996 to 2016. Note: To download the data, you’ll need to provide a few details and agree to certain terms and conditions. [h/t Dan Friess] — : July 3, 2019
Links:
Tags: environment, mapping
The Issue Correlates of War project, which started in 1997 with a focus on territorial disputes, gathers “systematic data on contentious issues in world politics.” In addition to its two centuries of territorial claims, the project has also catalogued disputes over rivers, maritime zones, and ethnic groups, and compiled supplementary datasets on colonial history, historical country names, and more. — : July 3, 2019
Links:
Tags: conflict
Dan Salmon, a grad student who specializes in information security, has published data on more than 7 million Venmo transactions, which he downloaded from the mobile payment platform’s public API. “I am releasing this dataset,” he writes, “in order to bring attention to Venmo users that all of this data is publicly available for anyone to grab without even an API key.” Practical: How to make your Venmo transactions private. Related: Salmon explains more, in Wired. Also: In 2018, Hang Do Thi Duc analyzed 200 million public Venmo transactions to show how revealing they could be. [h/t Álex Barredo] — : July 3, 2019
Links:
Tags: business, money, technology
The Administrative Office of the United States Courts posts its annual “wiretap reports”, which provide details on the wiretaps that state and federal judges have authorized. Last week, the agency published its 2018 report; the supplementary data includes each wiretap’s jurisdiction, authorizing judge, date of authorization, type of intercept, number of communications intercepted, total cost, and more. [h/t Chris Zubak-Skees + Steven Rich] — : July 3, 2019
Links:
FiveThirtyEight has collected the text of all 50 state governors’ 2019 annual addresses, and has analyzed the most common words and phrases used by Republican and Democratic governors. — : June 26, 2019
Links:
The United Kingdom’s Department for Education publishes data on university graduates’ annual earnings 1, 3, 5, and 10 years after graduation, broken down by school attended, subject studied, and demographic characteristics. [h/t Tera Allas] — : June 26, 2019
Links:
Tags: education, money, statistics
This is the spreadsheet that “broke the art world’s culture of silence.” In just a few weeks, Michelle Millar Fisher and anonymous colleagues have collected more than 2,600 self-reported salaries from their fellow curators, managers, interns, and other art-world employees. Related: “It took us three minutes to build this spreadsheet,” the organizers have written in The Art Newspaper. “It is not a perfect survey tool, nor was it ever intended to be. While we’ll work with statistics professionals to review and glean meaningful facts [...] Its primary goal is to catalyse us all into action.” [h/t u/cavedave] — : June 26, 2019
Links:
Tags: art, money, statistics
The Judicial Review of Congress dataset, compiled by Princeton politics professor Keith E. Whittington, “catalogs all the cases in which the U.S. Supreme Court has substantively reviewed the constitutionality of a provision or application of a federal law.” The dataset currently covers 1,308 cases, stretching from the high court’s founding through its 2017 term. For each case, it specifies the statute being reviewed, how long the statute had been in effect, the main constitutional issues at hand, the outcome, and more. [h/t Sheldon Gilbert] — : June 26, 2019
Links:
You can browse NASA’s Image and Video Library online; you can also access it via NASA’s API. Through that interface, you can search by caption, keyword, location, photographer, year created, and other fields; in return, you get structured data on each media file. The library was launched two years ago, bringing together more than 140,000 images, videos, and audio files that had previously been spread across dozens of separate collections. [h/t Seth Donoughe] — : June 26, 2019
Links:
Tags: science, technology
MUStARD is a corpus of 690 text and video clips “for research in automated sarcasm discovery.” The dataset’s 690 examples — half involving sarcasm, half not — come from Friends, The Golden Girls, The Big Bang Theory, and Sarcasmaholics Anonymous. Related: Towards Multimodal Sarcasm Detection (An Obviously Perfect Paper), the researchers’ introduction to the dataset. — : June 19, 2019
Links:
Developer Michael Zemel has built an interactive timeline of 282 European kings, queens, emperors, and other monarchs. For each, the data includes his or her name, religion, period of reign, reason for losing power, wars involved in, relationships, and notable events. Zemel has also published a detailed writeup about his inspiration and process, plus the underlying data and code. [h/t Giuseppe Sollazzo + Sophie Warnes] — : June 19, 2019
Links:
Tags: politics
“Most people can name a mammal or bird that has become extinct in recent centuries, but few can name a recently extinct plant.” That’s from a new academic paper that presents “a comprehensive, global analysis of modern extinction in plants.” The paper itself is paywalled, but the dataset — of 571 extinct seed plants, plus other species that have been rediscovered or reclassified — is available to download. Related: World’s largest plant survey reveals alarming extinction rate, a summary of the findings. [h/t Joseph Stirt] — : June 19, 2019
Links:
Discogs, a user-contributed music database and marketplace, publishes “monthly data dumps” listing the millions of artists, labels, and releases in its system. Additional types of data (e.g., user reviews) are available through Discogs’ API. [h/t Jan Willem Tulp] — : June 19, 2019
Links:
Tags: entertainment, music
The Centers for Medicare and Medicaid Services’ National Average Drug Acquisition Cost dataset indicates how much U.S. pharmacies have to pay, on average, to obtain thousands of prescription and over-the-counter drugs. The dataset contains millions of rows — one for each National Drug Code in the survey, for each week since 2013 — but you can also download smaller, weekly slices. The agency also publishes a dataset of changes in these average costs. Previously: Total and average costs for Medicare Part B and Part D prescriptions (DIP 2016.12.14). [h/t data.world] — : June 19, 2019
Links:
Tags: drugs, healthcare
The U.S. Department of the Interior publishes data describing the boundaries of all 420 units of the National Park System. In addition to the 61 officially-designated national parks, the boundaries include the country’s national preserves, national seashores, and 30 other types of special places. — : June 12, 2019
Links:
Tags: environment, mapping
For 220 countries between the 1750s and 2018, the Tax Introduction Dataset tracks “the year of the first permanent introduction at the national level of government of six major taxes, as well as on the top statutory tax rate for that year.” The six taxes are those on personal income, corporate income, inheritance, and general sales, plus VATs and compulsory social security contributions. [h/t Philipp Heimberger + Laura Seelkopf] — : June 12, 2019
Links:
Opportunity Insights, a research and policy institute that uses data analysis to examine economic mobility in the United States, publishes dozens of datasets stemming from its studies, often accompanied by code to replicate the findings. Related: “The radical plan to change how Harvard teaches economics,” a recent profile of Raj Chetty, who co-leads the institute. Bonus: The lecture materials for Chetty’s popular new class, “Using Big Data to Solve Economic and Social Problems.” [h/t Michael A. Rice] — : June 12, 2019
Links:
Tags: economics, money, statistics
The International Consortium of Investigative Journalists and partners have obtained records that detail 8,000+ instances, between 2012 and 2017, in which U.S. Immigration and Customs Enforcement detention centers placed detainees in solitary confinement. For each confinement, the records indicate the detainee’s citizenship, detention facility, dates of confinement, and the stated reasons for it. Note: “ICE said it does not keep records of every solitary confinement placement. Instead it tracks only those cases where detainees were held in isolation for more than 14 days, and where immigrants with a ‘special vulnerability’ were placed in isolation.” [h/t Jason Norwood-Young] — : June 12, 2019
Links:
The Humanitarian Data Exchange has been tracking cases and deaths in the North Kivu Ebola outbreak. The numbers come from the Democratic Republic of the Congo’s health ministry and distinguish between suspected, probable, and confirmed cases; they are available at both the national level and disaggregated into the ministry’s 25 currently-affected health zones. Related: “Ebola cases pass 2,000 as crisis escalates” (Nature). Also related: The World Health Organization’s weekly situation reports. Previously: Data from the 2014 Ebola outbreak (DIP 2018.05.23). [h/t Sam Phinizy] — : June 12, 2019
Links:
Tags: disease, ebola, healthcare
“The BLOND dataset was collected at a typical office building in Germany, with the main occupants being academic institutes and their researchers.” BLOND’s several dozen terabytes of data provide “long-term continuous measurements of voltage and current waveforms” for 74 appliances in the office over several months, including a bunch of computers, a printer, a paper shredder, a space heater, and an electric toothbrush. — : June 5, 2019
Links:
Tags: energy
In a study published last year (preprint PDF here), three Boston-area professors analyzed data from more than 600,000 people who took an online English grammar quiz. In addition to the participants’ answers, the dataset includes their native languages, the age they began learning English, the countries they’ve lived in, gender, age, and more. Related: Scott Chacon's analysis of the data, and what it might mean for older learners. [h/t George McIntire] — : June 5, 2019
Links:
Tags: language
The Chicago-focused Lawyers’ Committee for Better Housing has built a database of evictions in the city from 2010 to 2017. It aggregates nearly 300,000 evictions to the ward, community area, and Census tract level, and contains metrics on case types, outcomes, legal representation, and more. There’s a user guide, bulk download, and methodology. Previously: The Eviction Lab, an effort to collect eviction data for the entire country (DIP 2018.04.18). [h/t Maya Dukmasova] — : June 5, 2019
Links:
Tags: justice
The International Institute for Democracy and Electoral Assistance’s Voter Turnout Database tracks the number of registered voters, total voter turnout, voting-age population, and associated metrics for elections in more than 200 countries, some going as far back as 1945. Related: The European Parliament’s election results website provides charts and bulk downloads. Also related: “What’s going on with abstention in Europe?,” a recent article by Lorenzo Ferrari and Jacopo Ottaviani. [h/t Gianna Grün + Giuseppe Sollazzo] — : June 5, 2019
Links:
O Say Can You See, a project partially funded by the National Endowment for the Humanities, “documents the challenge to slavery and the quest for freedom in early Washington, D.C., by collecting, digitizing, making accessible, and analyzing freedom suits filed between 1800 and 1862, as well as tracing the multigenerational family networks they reveal.” The project provides several ways to access the data and documents; it covers more than 500 lawsuits, nearly 5,000 people, and tens of thousands of relationships. You can also explore the cases, people, and families online. [h/t Jan Willem Tulp] — : June 5, 2019
Links:
Duncan Geere’s 00s Indie Band Database quantifies 130+ acts from the early-millennium’s indie music scenes. In addition to basic facts, the database also includes several subjective scales: “Guitars to Synths,” “Artsy to Populist,” “Loudness,” and “Coolness.” — : May 29, 2019
Links:
Tags: entertainment, music
The Texas General Land Office’s geospatial data offerings include beach access points, shoreline environmental sensitivity ratings, offshore oil structures, oil and gas leases, and more. Related: “Relinquishing Riches: Auctions vs Informal Negotiations in Texas Oil and Gas Leasing,” an NBER working paper by economists Thomas R. Covert and Richard L. Sweeney; code and data available on GitHub. — : May 29, 2019
Links:
Tags: energy, environment, mapping
Postupci Protiv Funkcionera “is a unique database made by the Center for Investigative Reporting of Serbia, which gives citizens the opportunity to get information in one place about the processes conducted by the Serbian Anti-Corruption Agency against public officials in the period from 2010 to November 2018.” The database contains information on nearly 2,800 proceedings against more than 1,700 officials, and can be downloaded as an RDS file (and opened in R). Kudos: The project has been shortlisted for the 2019 Data Journalism Awards. (Full shortlist here.) — : May 29, 2019
Links:
Tags: corruption
GeoChicas, an initiative to close the gender gap in the OpenStreetMap community, has built an interactive map and dataset that shows which streets in Latin America and Spain are named after women (and the much larger number named after men). So far, they’ve mapped 11 cities in 8 countries, including Barcelona, Havana, Mexico City, and Buenos Aires. — : May 29, 2019
Links:
“Every year, the federal government releases large amounts of data on US schools, districts, and colleges. But this information is scattered across multiple datasets, and changes in data structure make it hard to measure change.” The Urban Institute’s Education Data Explorer aims to fix that by pulling together the Department of Education’s Common Core of Data, Civil Rights Data Collection, Integrated Postsecondary Education Data System, and College Scorecard, plus the Census Bureau’s Small Area Income and Poverty Estimates. You can download custom queries, access the data via an API, or download bulk files for all elementary and secondary schools, school districts, and colleges. [h/t Daniel Wood] — : May 29, 2019
Links:
Tags: educationstatistics
The Pudding’s Jan Diehm has identified and analyzed decades of hyphenated last names in seven North American sports leagues: the MLB, NBA, NFL, NHL, MLS, WNBA, and NWSL. The code and data are available to download. Now you know: Two ambi-hyphenates — Pierre-Luc Letourneau-Leblond and Jean-Luc Grand-Pierre — have played in the NHL (and none in any of the other leagues). — : May 22, 2019
Links:
Tags: language, sports, statistics
A team of researchers at the Universidad Nacional Autónoma de México has aggregated the observations of 1,216 studies into a database describing 504 primate species. The traits in the database include body mass, habitat, type of diet, conservation status, and more. — : May 22, 2019
Links:
Tags: animals
At Canada’s highest court, “interveners” are the rough equivalent of amicus brief filers in U.S. Supreme Court cases. Sancho McCann, a student at the University of British Columbia’s law school, has created a dataset of the past ten years of interveners and has analyzed it. For each of the 665 cases from 2009 to 2018, the dataset includes the case name, the previous court, a couple of case classifications, and the names of the interveners (if any). — : May 22, 2019
Links:
Tags: law
The Fiscally Standardized Cities database “makes it possible to compare local government finances for 150 of the largest U.S. cities across more than 120 categories of revenues, expenditures, debt, and assets.” The database, developed by Adam Langley at the Lincoln Institute of Land Policy, covers the years 1977 to 2016 and takes into account the ways in which finances and responsibilities overlap between cities, counties, school districts, and other local governments. [h/t Cezary Podkul] — : May 22, 2019
Links:
Tags: statistics
The Measurement Lab describes itself as “the largest open source Internet measurement effort in the world.” Volunteers run the lab’s tests on their own devices, measuring their internet connection’s speed, latency, and other characteristics. The lab then publishes the data it collects, both as raw output and as BigQuery tables. It also offers a tool for charting internet speeds by location and ISP, based on 240+ million tests generated from 87,000+ cities; you can access the data underlying any chart, and also download the same aggregations directly. [h/t Georgia Bullen] — : May 22, 2019
Links:
Tags: technology
The World Cube Association “governs competitions for mechanical puzzles that are operated by twisting groups of pieces,” the most famous of which is the Rubik’s Cube. The association also publishes a database of all competitions, competitors, results, rankings, and more. Related: “Children of the Cube,” by the New York Times’ John Branch. [h/t Michael Höhle + u/cavedave] — : May 8, 2019
Links:
Tags: entertainmentgames
A team of biologists has compiled and standardized data on 790+ animal social networks, covering more than 45 species on six continents. The Animal Social Network Repository features networks of wild and captive mammals, reptiles, fish, birds, and insects; the connective data-tissue includes dominance relationships, group memberships, grooming behaviors, and several other types of interactions. — : May 8, 2019
Links:
Tags: animals
Publishers Weekly’s Translation Database tracks books of fiction and poetry that have been translated into English and published in the United States. The database, which contains more than 7,200 entries since 2008, includes the books’ original languages and countries of publication, the authors’ and translators’ names and genders, the publishers’ names, publication years, prices, and ISBNs. Related: “Will Translated Fiction Ever Really Break Through?” a recent Vulture article by Chad Post, who created the database. — : May 8, 2019
Links:
Team Populism is an initiative that “brings together renowned scholars from Europe and the Americas to study the causes and consequences” of the titular political style. The collaboration has published several datasets, including one that scores the populist rhetoric of 40 countries’ leaders between 2000 and 2018 — a project commissioned by The Guardian, which has visualized the findings and described the methodology. [h/t Erik Gahner Larsen] — : May 8, 2019
Links:
Tags: politics
The Death Penalty Information Center maintains a database of all executions in the United States since 1976. (There have been 1,495 so far.) The database tracks the date, method, county, and state of each execution; the name, age, sex, and race of the person executed; and the race and sex of the victims they were convicted of killing. Related: The Marshall Project’s The Next to Die. Previously: Death sentences (DIP 2018.08.01) and executed prisoners' last words (DIP 2019.03.06). — : May 15, 2019
Links:
Tags: justice
Using data scraped from BoxRec.com and UFCStats.com, Thomas Richardson analyzed “over 13,800 professional boxers and mixed martial artists of varying abilities” and has found “robust evidence that left-handed fighters have greater fighting success.” — : May 8, 2019
Links:
Tags: sports
Last month, Chicago officials launched a public mural registry. So far, the database includes more than 140 pieces, credited to more than 100 artists. About half of the entries specify the mural’s medium (e.g., paint, spray, mosaic) and nearly all indicate the mural’s location and installation year. — : May 8, 2019
Links:
Tags: art
A decade ago, researchers built VizWiz, a smartphone app that allowed blind users to take photos and ask questions about them. For instance: “What color is this?” or “When is the expiration date?” Now 20,000 VizWiz images and questions, plus 200,000 answers, are available to download — part of a contest to develop algorithms for visual question-answering. Related: Be My Eyes, an app that lets you volunteer your visual assistance through a video call. — : May 8, 2019
Links:
Tags: language, technology
University of Montreal PhD candidate Semra Sevi has compiled data on all Canadian federal candidates from 1867 to 2017. The dataset lists each candidate’s gender, occupation, incumbency status, party affiliations, birth year, and electoral results. The tens of thousands of candidates have represented roughly 140 parties. Among them: Canada’s Work Less Party, which has fielded one lone federal candidate, who in 2008 received 1% of Vancouver East’s votes. [h/t Éric Grenier + Peter Loewen] — : May 8, 2019
Links:
Tags: politics
The United Nations’ FAOSTAT provides dozens of country-by-country datasets on agriculture. The datasets include crop and livestock production, imports and exports, fertilizer usage, emissions, and more. Many go back to 1961. (In that year, Afghanistan harvested about 32,000 metric tons of apricots.) Related: Researchers have previously used this data to trace the “increasing homogeneity in global food supplies” over time. Also related: National Geographic’s visualization of that research. [h/t David Svab] — : May 8, 2019
Links:
Through an unofficial API, you can access data on the latest items, weapons, challenges, and other aspects of the global video game phenomenon. — : May 1, 2019
Links:
The MAESTRO dataset gathers recordings from nine years of the International Piano-e-Competition, where “virtuoso pianists perform on Yamaha Disklaviers which, in addition to being concert-quality acoustic grand pianos, utilize an integrated high-precision MIDI capture and playback system.” The MIDI data “includes key strike velocities and sustain pedal positions”; additional metadata contains each performance’s year, composer, and title. Related: OpenAI’s music-composing MuseNet neural network, trained in part on the MAESTRO data. — : May 1, 2019
Links:
Tags: music
A team of researchers has compiled the publication histories of 545 Nobel laureates — 92% of the prize-winners in physics, chemistry, and physiology-or-medicine between 1900 and 2016. The researchers say they spent more than 1,000 hours collecting and validating the data, drawing on the Nobel website, laureates’ personal pages, Wikipedia entries, and the Microsoft Academic Graph (featured in DIP earlier this month). — : May 1, 2019
Links:
USA Today has collaborated with more than 100 of its affiliated newsrooms and the Invisible Institute to gather police disciplinary records “from thousands of state agencies, prosecutors and local police departments” around the country, creating “the biggest collection of police misconduct records” ever assembled. They’re starting to make the records public, beginning with a database of 30,000+ officers who’ve had their certifications revoked. The database lists each officer’s name, state, agency, and year decertified. It includes records from 44 states, but you won’t find Massachusetts in it, for instance, because the state doesn’t license police officers. And although there are a handful of records from New York state, none regard NYPD officers; that’s in part because the country’s largest police force keeps its misconduct cases secret. (Last year, colleagues at BuzzFeed News published a database of 1,800 NYPD officers accused of misconduct, based on some of those secret records, obtained from a source who requested anonymity.) — : May 1, 2019
Links:
The browseable and downloadable National Oil Company Database, a project of the Natural Resource Governance Institute, pulls together official data on nearly 100 metrics concerning 71 oil/gas companies owned by 61 countries. For instance: Petróleos de Venezuela, S.A., reported transferring roughly $5.5 billion to its government in 2016, down from nearly $28 billion in 2013; Saudi Aramco produces the equivalent of 13 million barrels of oil daily; and in 2017, Russia’s Rosneft generated approximately $283,000 in revenue per employee. [h/t Rachel Ziemba] — : May 1, 2019
Links:
Tags: energy
“In a Cox proportional hazards model, which covariates are associated with the odds (or hazard ratios) being ever in your favor?” To find out, Brett Keller created a spreadsheet of all 24 tributes in the 74th Hunger Games, including the districts from which they hailed, their ages, and how many days they survived. — : April 24, 2019
Links:
Tags: books, entertainment, movies
Derek M. Jones analyzes software-engineering data. Recently, he convinced a small software company to release a dataset documenting its internal time estimates, spanning 10 years, 20 projects, and 10,000+ tasks. For each task, the dataset indicates the number of hours it was predicted to take, how long it actually took, the (anonymized) developers it was assigned to, and more. [h/t Erik Bern] — : April 24, 2019
Links:
Tags: business, technology
Chicago has become the first city to publish detailed data from ride-hailing services, such as Uber and Lyft. Last week, officials released three datasets — on (anonymized) drivers, vehicles, and trips. The driver and vehicle datasets cover early 2015 through December 2018. The trip dataset covers only November and December 2018; even so, it includes more than 17 million rides. For each ride, the records contain the rough pickup and dropoff location, duration, the approximate fare and tip, and more. [h/t Sharon Machlis + Dan Nguyen + Karl Sluis + Michael A. Rice] — : April 24, 2019
Links:
Tags: transportation
The Archigos dataset provides historical data on the leaders of nearly 200 countries between 1875 and 2015. The dataset — a collaboration between political scientists Hein Goemans, Kristian Skrede Gleditsch, and Giacomo Chiozza — includes basic demographic information, plus categorizations of how each leader came to power, how they lost it, and their post-office fate. Now you know: No UK prime minister has died in office since 1865; José María Velasco Ibarra became president of Ecuador five separate times, and was removed by coup four times; Tunisian president Beji Caid Essebsi is 92 years old. [h/t Jeffrey Sachs] — : April 24, 2019
Links:
Tags: politics
Varieties of Democracy bills itself as “a new approach to conceptualizing and measuring democracy” — one that “reflects the complexity of the concept of democracy as a system of rule that goes beyond the simple presence of elections.” The project scores countries annually on five high-level aspects of democracy, which are further broken down (by thousands of country-experts, based on a detailed codebook) into hundreds of more granular “indicators,” such as how often the government publicly attacks the judiciary, the extent to which authorities respect religious freedom, and the proportion of journalists who are women. Version 9 of the dataset, released earlier this month, covers 1789 to 2018 and includes 202 countries. [h/t John Polga-Hecimovich] — : April 24, 2019
Links:
Tags: government, politics
The question: How many bags of Skittles must you open before finding two identical color-distributions? The answer: “82 days, 13 boxes, 468 packs, and 27,740 individual Skittles later [...]”. The data: available on GitHub. [h/t u/cavedave] — : April 17, 2019
Links:
Tags: statistics
The London Lives initiative “makes available, in a fully digitised and searchable form, a wide range of primary sources about eighteenth-century London, with a particular focus on plebeian Londoners.” As part of the project, digital historian Sharon Howard has compiled a dataset of 2,894 Westminster coroners’ inquests from 1760 to 1799. The fields include the date of death, the name of the deceased, the cause of death, the coroner’s verdict, and more. Bonus: A recent Twitter thread from Howard highlighting more datasets. — : April 17, 2019
Links:
Tags: healthcare, history
The Microsoft Academic Knowledge Graph, published under an Open Data Attributions license, describes 8+ billion relationships between scientific papers, their authors, affiliated institutions, conferences, journals, fields of study, and more. The data can be downloaded and also queried online through a SPARQL interface. [h/t Michael Färber] — : April 17, 2019
Links:
Tags: science
The R Street Institute has converted the last five decades of successful Supreme Court confirmation hearings into a spreadsheet, with one row for each statement, question, and answer. The 15 transcripts begin with William Rehnquist’s 1971 hearing and end with Neil Gorsuch’s in 2017. (Robert Bork’s failed nomination is excluded, and Brett Kavanaugh’s 2018 transcript is not yet available.) [h/t Zachary Agatstein + Alex Spurrier] — : April 17, 2019
Links:
The International Consortium of Investigative Journalists, along with media partners in dozens of countries, has been compiling a cross-border database of medical-device safety alerts. The alerts include recalls as well as less-urgent notifications published by health authorities and manufacturers. You can download the public database, which so far includes 90,000+ notices for devices in 18 countries. The records include the date and type of notice; a device identifier; the reason for the alert; a classification of its severity; and more. Related: The Implant Files, an investigative series by the consortium, based on the data. — : April 17, 2019
Links:
Tags: healthcare, technology
To study the relationship between artificial light and “flight calling” among nocturnally-migrating species, a team of researchers examined 70,000 instances of birds colliding with buildings in Chicago. [h/t Ben Winger] — : April 10, 2019
Links:
Tags: animals, architecture
Boston College’s Center for Retirement Research compiles detailed financial data on state and local public pension plans. The database covers fiscal years 2001–18 and includes 180 public pension plans, which together “account for 95 percent of state/local pension assets and members in the US.” [h/t Cezary Podkul] — : April 10, 2019
Links:
Tags: money
From Nate Silver: “we’ve been publishing forecasts for more than a decade now, and although we’ve sometimes tried to do an after-action report following a big election or sporting event, this is the first time we’ve studied all of our forecast models in a comprehensive way.” You can now explore and download thousands of FiveThirtyEight’s predictions about sports and politics (and their outcomes). [h/t Gavin Freeguard] — : April 10, 2019
Links:
Political science professor Nils B. Weidmann and collaborators have taken tens of thousands of reports — published by the AP, AFP, and BBC Monitoring — of political protests in autocratic countries and have turned them into structured data. The resulting Mass Mobilization in Autocracies Database is available to download (free registration required), and comes with documentation and code examples. The database currently covers 2003–15, with data for 2016–17 in the works. — : April 10, 2019
Links:
“The Rulers, Elections, and Irregular Governance (REIGN) dataset describes political conditions in every country each and every month. These conditions include the tenures and personal characteristics of world leaders, the types of political institutions and political regimes in effect, election outcomes and election announcements, and irregular events like coups, coup attempts and other violent conflicts.” The latest dataset covers 200 countries, from 1950 to the present, and includes dozens of variables for each monthly snapshot. [h/t Erik Gahner] — : April 10, 2019
Links:
“In 1949, an Italian Jesuit priest named Roberto Busa presented a pitch to Thomas J. Watson, of I.B.M.,” according to a New Yorker article principally about the Enron email archive. “Busa was trained in philosophy, and had just published his thesis on St. Thomas Aquinas, the Catholic theologian with a famously unmanageable œuvre.” Watson agreed to help, “and, for the next thirty years, Busa encoded sixty-five thousand pages of Thomist text so that it could be word-searched, cross-referenced, and what we now call hyperlinked.” The Index Thomisticus became “the first corpus to be primed for digital scholarship,” and is available online to search and download. — : April 3, 2019
Links:
The Virginia Institute of Marine Science at The College of William & Mary maintains shoreline inventories for Virginia, Maryland, and parts of Delaware and North Carolina. The datasets include geospatial information about land use, vegetation, different types of structures (e.g., jetties, bulkheads, docks, boathouses), and more. [h/t Susie Cambria] — : April 3, 2019
Links:
To test the “moralizing gods” hypothesis (which posits that “belief in morally concerned supernatural agents culturally evolved to facilitate cooperation among strangers in large-scale societies”), the authors of a recent paper in Nature “coded records from 414 societies that span the past 10,000 years from 30 regions around the world, using 51 measures of social complexity and 4 measures of supernatural enforcement of morality.” The dataset is available to download. Findings: “Our analyses not only confirm the association between moralizing gods and social complexity, but also reveal that moralizing gods follow — rather than precede — large increases in social complexity.” [h/t Juan Moreno-Cruz + Peter Irvine] — : April 3, 2019
Links:
The UNESCO Institute for Statistics collects country-level data on the number of teachers, teacher-to-student ratios, and related figures. You can download the data or explore it in UNESCO’s eAtlas of Teachers or their interactive visualization of teacher supply in Asia. — : April 3, 2019
Links:
Tags: United Nations, education
To mark the four-year anniversary of the Saudi-led bombing campaign in Yemen, the Yemen Data Project last week released civilian casualty estimates for the entire air war. The project’s researchers collect and cross-reference data from a range of sources, including news reports, social media, video footage, local authorities, and NGOs; their published data contains dates, locations, and casualty estimates for more than 19,000 air raids. As seen in: “Saudi Strikes, American Bombs, Yemeni Suffering: How Saudi Arabia’s war tactics have fueled Yemen’s humanitarian crisis” (New York Times, December 2018). [h/t Andrea Carboni] — : April 3, 2019
Links:
Tags: conflict
From Alexis C. Madrigal, writing at The Atlantic: “Now, a decade since Uber blazed the trail, and half that since the craze faded, we built a spreadsheet of 105 Uber-for-X companies founded in the United States, representing $7.4 billion in venture-capital investment. We culled from lists, dug in Crunchbase, and pulled from old news coverage. It’s not a comprehensive list, but it is a large sample of the hopes and dreams of the entrepreneurs of the time.” — : March 27, 2019
Links:
Tags: business, technology
University of Tasmania Ph.D. candidate Shaun T. Brooks has created a geospatial dataset of “all buildings and disturbance detected across Antarctica, manually digitised from Google Earth images.” The dataset includes research stations, lighthouses, weather stations, historic sites, and more. [h/t Jasmine Lee] — : March 27, 2019
Links:
Tags: infrastructure, mapping
“The Foundations of Rebel Group Emergence (FORGE) Dataset examines the roots of rebellion by considering the characteristics and activities of the ‘parent’ organizations from which rebel groups emerged,” plus details such as “the organization's ‘birthdate’ and founding location, initial goals, ideology, and ethnic/religious foundations.” The new dataset, developed by the University of Arizona’s Jessica Maves Braithwaite and the University of Maryland’s Kathleen Gallagher Cunningham, contains 430 rebel groups active between 1946 and 2011. [h/t Jori Breslawski + Michael Poznansky] — : March 27, 2019
Links:
Phenology (literally: “the science of appearance”) is the location-and-species-specific study of recurring plant and animal phenomena, such as the annual arrivals and departures of migratory birds. The USA National Phenology Network collects observational data from thousands of citizen scientists, professional researchers, NGOs, and other groups; assesses the data’s quality; and makes it available to explore and download. Previously: The flowering dates of Kyoto’s Prunus jamasakura cherry trees going back to the 9th century (DIP 2017.04.05). [h/t Greta Kaul] — : March 27, 2019
Links:
FiveThirtyEight has compiled a dataset of all U.S. special counsel, independent counsel, and special prosecutor investigations since 1973 — and the people charged in them. Related: FiveThirtyEight’s visual comparison of the Mueller probe to other investigations. Bonus: FiveThirtyEight’s Amelia Thomson-DeVeaux has also been tracking major lawsuits related to President Trump and his administration; that dataset currently contains 45 civil cases and 6 criminal cases. — : March 27, 2019
Links:
New York City requires the owners of buildings with rooftop water tanks to get the vessels inspected annually for things like sediment, bacteria, and dead bugs. The city publishes a dataset of the owner-reported results, based on 15,000 inspections, mostly from 2015–17. Unfortunately: “A review of city records indicates that most building owners still do not inspect and clean their tanks” ... and the “city can’t even say with certainty how many there are or where they are located” ... and in “almost every case the [bacteriological] tests are conducted only after the tanks have been disinfected.” [h/t Zack Quaintance] — : March 20, 2019
Links:
Tags: mapping, statistics, water
Security firm Rapid7’s Project Sonar “conducts internet-wide surveys across more than 70 different services and protocols to gain insights into global exposure to common vulnerabilities.” Much of the data (on DNS responses, SSL certificates, and more) can be bulk-downloaded through the company’s open data portal without an account, and historical data and the most-current data are available with a free account. Related: Project Sonar: An Underrated Source of Internet-wide Data (Patrik Hudak). Also: Rapid7’s guide to using their open data API with R. [h/t Sharon Machlis] — : March 20, 2019
Links:
Tags: technology
“[W]hy are so many cities and metropolitan areas still split along racial lines? And what is the role of local government in reinforcing those divides? To answer those questions, Governing conducted a six-month investigation of black-white segregation in the small cities of downstate Illinois.” As part of the investigation, the magazine calculated (and published) school and residential segregation metrics for hundreds of U.S. metropolitan areas, based on the latest Department of Education and Census Bureau data. Related: “The Most Diverse Cities Are Often The Most Segregated” (FiveThirtyEight, 2015). [h/t Mike Maciag] — : March 20, 2019
Links:
The Council of State Governments’ annual Book of the States compiles 50-state reference tables on a range of topics, including elections, finances, courts, and more. It has been published since 1935, and the tables for the past decade-plus are available as spreadsheets. Now you know: The chief justice of the California Supreme Court makes $256,059 per year — the highest compensation for any state judge, and nearly double that of New Mexico’s top judge, according to 2018’s Table 5.4. [h/t Cezary Podkul] — : March 20, 2019
Links:
Tags: elections, statistics
“Produced by the OECD Sahel and West Africa Club, Africapolis.org is the only comprehensive and standardised geospatial database on cities and urbanisation dynamics in Africa. Combining demographic sources, satellite and aerial imagery and other cartographic sources, it is designed to enable comparative and long-term analyses of urban dynamics - covering 7,500 agglomerations in 50 countries.” You can download the data — which includes historical populations, urbanization metrics, and geospatial outlines — and also explore it online. [h/t Rafael Prieto Curiel] — : March 20, 2019
Links:
Tags: statistics
Researchers based at Chicago’s Lincoln Park Zoo have published “life expectancy estimates for hundreds of vertebrate species based on carefully vetted studbook data from North American zoos and aquariums.” Their dataset includes “sex-specific median life expectancies as well as sample size and 95% confidence limits for each estimate.” — : March 13, 2019
Links:
The magazine Psychology Today hosts paid listings for therapists, who advertise their services to prospective patients. Andrew Thompson has created a dataset of the 50,000+ U.S. listings (as of October 2018), with each therapist’s name, city, specialties, and subject areas. — : March 13, 2019
Links:
Tags: healthcare
FiveThirtyEight is tracking who’s endorsing whom to be the Democrats’ 2020 presidential nominee. The site has published a methodology describing its approach, plus the underlying data, which includes each endorser’s name, state, relevant position, and other details. (According to the site’s formula, Sens. Cory Booker and Kamala Harris are currently leading, although almost entirely based on home-state endorsements.) — : March 13, 2019
Links:
Stanford University’s Big Local News project has compiled data from 100,000+ daily situation reports (known as “SIT-209”s) filed by federal firefighting authorities, detailing their efforts to suppress large wildfires. The dataset covers 2014 to 2017, and includes 240+ variables from each report, including estimated costs, damaged/destroyed buildings, injuries, fatalities, and more. Related: Eric Sagara’s quick introduction to the dataset. — : March 13, 2019
Links:
“Thousands of people report workplace discrimination to the government each year. Employers are rarely held accountable,” according to an investigation by the Center for Public Integrity. Reporters Maryam Jameel and Joe Yerardi “analyzed eight years of complaint data — through fiscal 2017 — from the [U.S. Equal Employment Opportunity Commission] as well as its state and local counterparts, reviewed hundreds of court cases and interviewed dozens of people who filed complaints.” The data (on more than 3.7 million allegations and their outcomes) and code are available online. Related: A visual exploration of the data. Previously: Two decades of workplace sexual harassment complaints (DIP 2017.12.06). [h/t Reddit user "cavedave" + Giuseppe Sollazzo] — : March 13, 2019
Links:
A few years ago, a team of scientists examined the shapes of 49,000 bird eggs belonging to 1,400 different species. You can download their calculations of each species’ average egg length, asymmetry, and ellipticity, which formed the basis of a graphics-forward article in Science Magazine. [h/t Sophie Warnes] — : March 6, 2019
Links:
For a recent article in The Pudding, Amber Thomas and two data assistants “recorded every rule listed in each dress code” at 481 public high schools in 36 states, plus “the words used in the dress code’s rationale, as well as any listed sanctions for breaking the dress code.” The 15,000+ rules and 1,470 sanctions are available to download. — : March 6, 2019
Links:
Tags: education
The UNESCO Institute for Statistics compiles data on “internationally mobile” university students, including annual numbers of students by country of origin and country of study. Related: UNESCO's interactive map of student flows. [h/t Francisco Marmolejo] — : March 6, 2019
Links:
Tags: United Nations, education
The U.S. Department of Agriculture’s CropScape website provides interactive access to the agency’s Cropland Data Layer — “a raster, geo-referenced, crop-specific land cover data layer created annually for the continental United States using moderate resolution satellite imagery and extensive agricultural ground truth.” You can use CropScape to filter the data’s acreage estimates (for more than a hundred different crops) by state, county, or custom-drawn geographies — or download the complete data in bulk. [h/t Katie McGaughey] — : March 6, 2019
Links:
Tags: agriculture
The Texas Department of Criminal Justice publishes a list of each death row inmate executed since 1982 — the year the state resumed capital punishment. In addition to providing basic demographic information, the listing also links to transcriptions of the inmates’ final statements. And although the state doesn’t provide the statements as structured data, Zi Chong Kao has created a spreadsheet of them (plus additional details extracted from the state’s website) for his interactive tutorial, Select Star SQL. Related: “‘Love’ Is the Most Common Word in Death Row Last Statements” (Will Young, Oct. 2018). [h/t Noah Veltman] — : March 6, 2019
Links:
The Professional Disc Golf Association publishes a spreadsheet of flying objects officially approved for use in competition. [h/t Ryan Maus] [Note, 2019-02-20: Original item included incorrect link, now fixed.] — : February 20, 2019
Links:
Tags: sports
Thanks to a 2015 state bill, when California law enforcement agencies obtain search warrants for digital communications (or are granted access to such information in an emergency), they must notify the people whose information they targeted. The state’s Department of Justice publishes data about these notifications, including the agency name, the grounds for the warrant, the nature of the investigation, the companies searched (e.g., AT&T, Verizon, Google, Facebook), and more. As seen in: “San Bernardino County Sheriff's electronic surveillance use — already highest in state — continues to surge” (Palm Springs Desert Sun, Jan. 2019). — : February 20, 2019
Links:
MyEU.uk’s interactive map lets you search and explore tens of thousands of European Union–funded projects in the United Kingdom, aggregated from a range of official sources. The initiative, which opposes Brexit, has published its data-collection and data-processing code as well as a spreadsheet of all projects it has identified. [h/t Jovi Juan] — : February 20, 2019
Links:
The Academy of Motion Picture Arts and Sciences website hosts two searchable databases related to their annual awards show: one of nominees and winners, and another of acceptance speeches. The Academy doesn’t provide direct downloads, but many folks have created structured datasets from the records. For instance: Statistics professor Adam B. Kashlak has built a dataset that combines speech word-counts, Best Picture winners’ budgets, and total broadcast length. And: Alex Albright’s analysis from a few years ago, “I’d Like to Thank the Academy… for making this data available,” is based on her dataset of all speeches from the 2010–14 broadcasts. [h/t Jay Arthur] — : February 20, 2019
Links:
The Global Power Plant Database, published by the World Resources Institute, “is a comprehensive, open source database of power plants around the world” and contains “information on plant capacity, generation, ownership, and fuel type.” The current edition, released in June 2018, covers 28,600+ power plants in 164 countries — including more than 1,000 each in Brazil, Canada, China, Great Britain, France, and the United States. Previously: U.S. power plants (DIP 2016.02.10). [h/t Kelly Rose + Paul Deane] — : February 20, 2019
Links:
Tags: energy
Drawing upon a fan wiki, Matt Laessig has created a spreadsheet of all 889 obstacles in the first 10 seasons of American Ninja Warrior. (Free registration required to download.) [h/t Ilan Brat] — : February 13, 2019
Links:
Tags: sports, television
The CDC’s State Tobacco Activities Tracking and Evaluation system tracks “current and historical state-level legislative data on tobacco [and now also e-cigarette] use prevention and control policies.” The system’s datasets provide quarterly snapshots — going back to 1995 — of rules concerning taxes, youth access, licensing, fire safety, and more. — : February 13, 2019
Links:
The Open Knowledge Foundation Deutschland and OpenCorporates have partnered to make Germany’s official business register available to download in bulk. The dataset contains basic information about more than 5 million German companies, and more than 4 million associated officers. Note: Although the dataset’s landing page is written in German, its documentation is available in English. Related: Joachim Gassen’s initial analysis of the companies’ locations, using R. [h/t Sharon Machlis] — : February 13, 2019
Links:
Tags: business
The Resources and Conflict Project’s Rebel Contraband Dataset “measures if and how rebel groups earn income from the exploitation of natural resources or criminal activities.” The dataset spans 1990–2015, covers more than 70 countries, and specifies dozens of types of resources — such as oil, cannabis, gold, tea, and timber. [h/t Erik Gahner] — : February 13, 2019
Links:
Tags: conflict, environment
The U.S. Forest Service’s Forest Inventory and Analysis program tracks “trends in forest area and location; in the species, size, and health of trees; in total tree growth, mortality, and removals by harvest; in wood production and utilization rates by various products; and in forest land ownership.” It also “serves as perhaps the largest publicly available” dataset of “downed and dead wood.” The inventory is available to download and comes with user guides. — : February 13, 2019
Links:
To support its data-driven feature, “30 Years of American Anxieties,” The Pudding gathered 20,000 questions posed to legendary advice columnist Dear Abby. — : February 6, 2019
Links:
Tags: language, miscellaneous
The District of Columbia’s taxi trip data covers 2015–17 and includes each trip’s pickup and dropoff location, mileage, total fare, tip amount, and other details. Previously: Chicago and NYC taxi rides (DIP 2016.12.07). [h/t Richard Sigman] — : February 6, 2019
Links:
Tags: transportation
In the course of investigating why Oklahoma’s female incarceration rate is so high, The Frontier and the Center for Investigative Reporting obtained “a decade’s worth of state prison data never before analyzed by the state itself.” The data includes information about each prisoner, their prison sentences, and their entries and exits from Department of Corrections supervision. [h/t Dan Nguyen] — : February 6, 2019
Links:
For a recent analysis of Trump administration turnover, FiveThirtyEight compiled a dataset of the last seven presidents’ cabinets — covering the 24 positions included in Donald Trump’s cabinet. (As author Nathaniel Rakich notes, “Not every president designates the same positions to be in the Cabinet.”) The dataset includes each cabinet member’s position, start date, departure date, and total days in office. — : February 6, 2019
Links:
An international team of researchers has created a dataset of 343 cities’ CO2 emissions. The researchers aggregated and standardized the emissions data — largely self-reported — from three sources: the Carbon Disclosure Project, the Bonn Center for Local Climate Action and Reporting, and a new project at Peking University. The dataset includes cities large and small, from Lagos and Shanghai to Kadıovacık, Turkey (pop. 216) and Brisbane, California (pop. ~4,700). In addition to emissions, the dataset also provides contextual information about the cities, such as average household sizes and gasoline prices. — : February 6, 2019
Links:
Tags: environment
“Shane Nackerud needed to know: Does 89.3 the Current play the Replacements every day?” To figure it out, the University of Minnesota librarian extracted track listings from 1.1 million @currentplaylist tweets from 2009 through 2018. He’s also published the total play counts by artist and the raw data. [h/t Kent Gerber + Amy Riegelman] — : January 30, 2019
Links:
“Funemployed programmer” Colin Morris looked for all the times where commenters on Reddit added “(sp?)”, or a related annotation, to their remarks. E.g., “SF is putting on quite a show, especially Kapernick (sp?).” Morris then compiled a dataset of the words that preceded those annotations, accompanied by examples of their usage. [h/t Rich Posert] — : January 30, 2019
Links:
Tags: language, social media
Christina Isabel Zuber and Edina Szöcsik’s Ethnonationalism in Party Competition dataset compiles ratings for more than 200 political parties in 22 European countries. Experts rated the parties twice — first in 2011, and then again in 2017 — on a range of factors, such as the centrality of ethnonationalism to the parties’ platforms, and their positions on territorial autonomy for minorities. (Dataset access requires providing a name and email address.) [h/t Erik Gahner] — : January 30, 2019
Links:
Since 1997, the Precipitation Estimation from Remotely Sensed Information using Artificial Neural Networks (PERSIANN) algorithm has used satellite imagery to estimate rainfall rates around the world. The system’s hourly, daily, monthly, and annual estimates can now be explored online and downloaded. — : January 30, 2019
Links:
Tags: climate
On Thursday, Sarah Ryley, Sean Campbell, and I published a deeply-reported investigation into U.S. cities’ failure to solve shootings — a year-long collaboration between The Trace and BuzzFeed News. To reach our quantitative findings, we analyzed (and standardized) three major FBI datasets, internal data from 22 police departments, and a database of Baltimore victims and suspects. Data, code, and methodologies for the analyses are available on GitHub. Related: Last year, The Washington Post published Murder with Impunity, a series examining unsolved homicides; their data, on 52,000+ homicides in 50 cities, is also available on GitHub. — : January 30, 2019
Links:
The U.S. Department of Agriculture’s Dairy Data Set contains annual tabulations of production, sales, imports, exports, consumption, and other economic aspects of “the U.S. dairy situation.” As seen in: “Nobody Is Moving Our Cheese: American Surplus Reaches Record High” (NPR). — : January 16, 2019
Links:
Tags: agriculture, food
In previous centuries, maritime officers kept “detailed log books of the ships’ activities and management,” including observations of the wind and weather. The Climatological Database for the World's Oceans 1750-1850 has digitized a quarter-million entries from such logbooks, originally written in Dutch, English, French, and Spanish, and published them as detailed, structured data. Helpful: Steven Ottens has converted the project’s fixed-width files into tab-delimited data. [h/t Robi Sen + Roger Davies + Topi Tjukanov] — : January 16, 2019
Links:
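For readers who would rather do a fixed-width conversion like the one described above themselves, here is a minimal sketch. The column widths and the two sample lines are invented for illustration and are not the logbook project's real layout; the actual widths would have to come from its documentation.

```python
def fixed_width_to_tsv(lines, widths):
    """Slice each line into columns of the given widths and join with tabs."""
    converted = []
    for line in lines:
        fields = []
        position = 0
        for width in widths:
            # Take the next `width` characters and drop the padding spaces.
            fields.append(line[position:position + width].strip())
            position += width
        converted.append("\t".join(fields))
    return converted


# Hypothetical layout: 4-character year, 10-character language, 4-character wind direction.
sample_lines = [
    "1776ENGLISH   NW  ",
    "1802DUTCH     SSE ",
]
rows = fixed_width_to_tsv(sample_lines, [4, 10, 4])
for row in rows:
    print(row)
```

Slicing by position rather than splitting on whitespace matters here, because fixed-width fields can themselves contain spaces or be left blank.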
Party Facts is a “collaborative data collection” that links various political-party datasets together. The project has two main tables. One contains basic information about 4,100+ political parties in more than 200 countries, including each party’s mother-tongue name and English translation, year founded, and Wikipedia page. The second table cross-references each party with its unique identifier in 26 external datasets, such as ParlGov (DIP 2018.09.19), The Manifesto Project (DIP 2017.06.21), and the Constituency-Level Elections Archive (DIP 2016.09.28). [h/t Matt Grossmann + Erik Gahner] — : January 16, 2019
Links:
The Census Bureau’s My Congressional District tool lets you browse (and download) demographic, socioeconomic, and business data corresponding to each of the country’s 435 congressional districts. Political scientist Ella Foster-Molina has compiled a historical dataset containing similar information for 1972 to 2014; it also contains details about each district’s representatives — such as their personal characteristics, the committees they served on, and the number of bills they sponsored. [h/t Josh McCrain + Derek Willis] — : January 16, 2019
Links:
The Committee to Protect Journalists maintains a database of journalists who’ve been killed for reasons related to their work. The database goes back to 1992 and contains more than 1,300 entries, with details about the journalists, the circumstances of their deaths, and whether perpetrators have been convicted. More recently, the organization has also begun publishing data on journalists who’ve been imprisoned or gone missing. [h/t Giuseppe Sollazzo] — : January 16, 2019
Links:
Norse World is an “online, open access searchable index and mapping of the foreign place names found in medieval East Norse texts.” Through the project’s interactive map, you can search and download the data. — : January 9, 2019
Links:
If you decide to acquire a new citizenship, do you get to keep your previous one? Are you allowed to renounce it? The Maastricht Center for Citizenship, Migration and Development’s Global Expatriate Dual Citizenship Dataset tracks how 200 countries have, each year since 1960, treated this situation. The extensive documentation provides links to the relevant laws, and descriptions of how each country’s rules have changed. [h/t Sam Petulla] — : January 9, 2019
Links:
Researchers at the International Monetary Fund have built a historical database of fiscal crises, defined as “periods of extreme fiscal distress, when governments have not been able to contain large fiscal imbalances leading to the adoption of extreme measures (e.g., debt default and monetization of the deficit).” The researchers, building off of previous work, have “expand[ed] the country coverage to 188 countries, over 1970-2015, more than double the size of the sample relative to many other studies,” and identified 436 distinct episodes of fiscal crisis. [h/t David Tercero Lucas] — : January 9, 2019
Links:
Last spring, the BBC published an archive of 16,000+ sound effects, licensed “for personal, educational or research purposes.” Each audio file is accompanied by a description, categorization, and its length. For instance, the first sound effect on the archive’s page is a 194-second clip described as “two-stroke petrol engine driving small elevator, start, run, stop,” and categorized as “Engines: Petrol.” Not documented, but useful: You can download a CSV of the metadata. Highlight: The one-two punch of “several men snoring, hilariously” and “several men snoring, less hilariously.” [h/t Amy King] — : January 9, 2019
Links:
Monitoring the Future surveys approximately 50,000 eighth-, tenth-, and twelfth-grade students in the U.S. each year. The project, which is funded by the National Institute on Drug Abuse, has been running since 1975. Although best known for their detailed drug-use questions, the surveys also cover education, labor, sex, race, politics, happiness, and other topics. Public-use versions of the data are available through the National Addiction & HIV Data Archive Program (free registration required). [h/t Dan Kopf] — : January 9, 2019
Links:
The Narrabeen-Collaroy Beach Survey Program has been measuring a major stretch of the Sydney shore every month since April 1976. You can explore the data online and (free registration required) download it. [h/t Robbi Bishop-Taylor + Mitchell Harley] — : January 2, 2019
Links:
The UK Marine Noise Registry tracks “human activities in UK seas that produce loud, low to medium frequency (10Hz – 10kHz) impulsive noise” — including pile-driving, explosives, military sonar, and “acoustic deterrent devices.” For each of the UK’s oil and gas licensing blocks, the registry’s published data counts the number of days that a given type of impulsive noise was generated. Related: Owen Boswarva has built an interactive map of the data. [h/t Giuseppe Sollazzo] — : January 2, 2019
Links:
Tags: audio, energy, environment
A pair of researchers have used satellite imagery to quantify nighttime lights in five urban areas in Niger and Nigeria — Agadez, Katsina, Maradi, Niamey, and Zinder. Describing their findings in a recent issue of Scientific Data, the researchers write, “Our data showed 1) urban illumination fluctuated seasonally, 2) corresponding population fluctuations were sufficient to drive seasonal measles outbreaks, and 3) overlooking these fluctuations during vaccination activities resulted in below-target coverage levels, incapable of halting transmission of the virus.” — : January 2, 2019
Links:
Tags: disease, environment
Melbourne, Australia, has placed dozens of pedestrian-counting sensors across the city, and publishes a dataset of the hourly observations going back to 2009. Now you know: Among the 2.5 million entries so far, the highest count has been the 12,289 pedestrians at the Bourke Street pedestrian bridge between 6pm and 7pm on Friday, October 26, 2018. Bonus: Melbourne’s interactive map of the data. Related: Pedestrian counts from the Brooklyn Bridge and Somerville, Massachusetts. — : January 2, 2019
Links:
Tags: transportation
The Vera Institute of Justice’s recently-expanded Incarceration Trends project combines data from a range of government reports — such as the Census of Jails and the National Corrections Reporting Program — into a single, longitudinal, well-documented dataset. For each county and year, the dataset tallies the number of people admitted to jails and prisons, the average daily incarcerated jail and prison population, and other related details. Many of the counts are also broken down by race, ethnicity, and sex. Bonus: The institute’s interactive map of the data. [h/t Chris Henrichson + Sam Petulla] — : January 2, 2019
Links:
Tags: crime
A team of evolutionary biologists has compiled a dataset describing the size and shape of eggs laid by more than 6,700 insect species. You can explore and download the underlying data, which is based on measurements from 1,756 published sources. [h/t Cassandra Extavour] — : December 19, 2018
Links:
Tags: animals
Open Units is a dataset detailing the total amount of alcohol in 1,000+ beer and cider offerings, “based on information made public by drinks manufacturers, distributors and retailers.” For instance, a 355-mL bottle of Sierra Nevada Pale Ale contains 20 mL of alcohol, the same as a pint of Bud Light. [h/t Giuseppe Sollazzo] — : December 19, 2018
Links:
Tags: alcohol
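The Open Units comparison above is simple arithmetic: container volume times alcohol by volume. A small sketch reproduces it. The ABV figures (5.6% for Sierra Nevada Pale Ale, 4.2% for Bud Light) and the use of a U.S. pint are assumptions made for this example and are not taken from the dataset itself.

```python
US_PINT_ML = 473.176  # one U.S. liquid pint, in milliliters


def alcohol_ml(volume_ml, abv_percent):
    """Return the milliliters of pure alcohol in a drink."""
    return volume_ml * abv_percent / 100


# Assumed ABV values, for illustration only.
pale_ale = alcohol_ml(355, 5.6)
bud_light = alcohol_ml(US_PINT_ML, 4.2)

print(round(pale_ale), round(bud_light))  # both round to 20
```

Under those assumptions, the smaller but stronger bottle and the larger but weaker pint land within a fraction of a milliliter of each other.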
Philadelphia’s Department of Records has begun publishing a dataset of all real estate transfers recorded since late 1999. The 3.7 million records include deeds, mortgages, condo declarations, and a few other types of documents. The deed data includes each property’s fair market value, address, grantor and grantee names, various taxes, and more. Bonus: An interactive visualization of the data. Previously: UK property sales (DIP 2016.03.23). [h/t Michael McLaughlin] — : December 19, 2018
Links:
Tags: real estate
The International Monetary Fund’s Global Debt Database brings together “total gross debt” numbers for 190 countries, for the years 1950 to 2017. The database features a detailed methodology and includes indicators of government, household, and corporate debt. — : December 19, 2018
Links:
The CDC’s Small-area Life Expectancy Estimates Project calculates how long someone, born in a given Census tract in 2010–15, might expect to live. The estimates are based on a combination of death records, Census population data, and statistical modeling. Related: “Map: What story does your neighborhood’s life expectancy tell?” (Quartz). Previously: Life expectancy by income, gender, and city (DIP 2016.04.13), and by country (DIP 2017.02.08). [h/t Dan Kopf] — : December 19, 2018
Links:
Tags: death, mapping, statistics
The Pudding’s Internet Boy Band Database is “an audio-visual history of every boy band to chart on the Billboard Hot 100 since 1980.” You can download the underlying data, which is stored in two files: boys.csv and bands.csv. — : December 12, 2018
Links:
Tags: entertainment, music
Academic researcher Adrien Barbaresi has compiled a corpus of thousands of speeches from the German Presidency, Presidency of the Bundestag, Chancellery, and Ministry of Foreign Affairs. The corpus, now in its third version, was first released in 2011. [h/t Adrien Barbaresi] — : December 12, 2018
Links:
The National Renewable Energy Laboratory’s solar datasets measure the average annual and monthly “total solar resource” for the United States, broken down by state, county, ZIP code, and roughly-10-square-kilometer chunks of the country. Bonus: More sun-radiation datasets via this Stack Overflow answer. [h/t Joe Hourclé] — : December 12, 2018
Links:
Common Vulnerabilities and Exposures is a downloadable list of more than 110,000 “publicly known cybersecurity vulnerabilities.” Each vulnerability is assigned a unique identifier (e.g., CVE-2014-0160) and given a description. The National Institute of Standards and Technology’s National Vulnerability Database takes the list and adds more information for each entry, “such as fix information, severity scores, and impact ratings.” That database is available in a variety of bulk downloads and data feeds; you can also search it online. [h/t GitHub user "nanoseconds"] — : December 12, 2018
Links:
Tags: technology
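Identifiers such as CVE-2014-0160 follow a regular shape, which makes them easy to pull apart when joining the CVE list to other data. The sketch below assumes the commonly documented syntax: the literal prefix "CVE", a four-digit year, and a sequence number of at least four digits. The function name is made up for this example.

```python
import re

# Assumed identifier syntax: CVE-<4-digit year>-<4 or more digits>.
CVE_PATTERN = re.compile(r"CVE-(\d{4})-(\d{4,})")


def parse_cve_id(text):
    """Return (year, sequence_number) for a well-formed CVE ID, else None."""
    match = CVE_PATTERN.fullmatch(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(parse_cve_id("CVE-2014-0160"))  # (2014, 160)
print(parse_cve_id("CVE-14-160"))     # None
```

Using `fullmatch` rather than `match` rejects strings with trailing characters, which helps when cleaning identifiers copied out of free-text descriptions.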
A growing number of cities publish detailed data on bicyclist and pedestrian injuries involving cars, including New York City, Chicago, Boston, Seattle, St. Paul, Minn., Chapel Hill, N.C., Tempe, Ariz., Toronto, and London — many through the cities’ “Vision Zero” street-safety initiatives. (Some of the datasets also include car-on-car collisions.) Related: “The most dangerous intersections in Seattle for bicyclists and pedestrians.” [h/t Rachel Schallom + Jeff Asher] — : December 12, 2018
Links:
Tags: injury, transportation
Last month, I Quant NY’s Ben Wellington analyzed New York City’s raw snow plow data, “which had only been viewed 41 times before apparently.” The 250 million–row dataset is, as Wellington notes, “stored in an odd format” — snapshots that indicate, every 15 minutes, the last time each of the city’s street segments was plowed. Related: ClearStreets provides historical data from the City of Chicago’s Plow Tracker; Iowa Department of Transportation also publishes a live plow tracker; Syracuse and Pittsburgh have published historical snow plow data. — : December 5, 2018
Links:
Tags: climate, transportation
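The "odd format" of the New York City plow data (repeated snapshots of each segment's last-plowed time) can be collapsed into a list of distinct plow visits. The sketch below shows one way to do that. The tuple layout, segment IDs, and timestamps are hypothetical and do not reflect the city's actual column names.

```python
from collections import defaultdict


def plow_events(snapshots):
    """Collapse (snapshot_time, segment_id, last_plowed) rows into the
    distinct last-plowed times observed for each segment."""
    events = defaultdict(set)
    for _snapshot_time, segment_id, last_plowed in snapshots:
        if last_plowed is not None:  # skip segments not yet plowed
            events[segment_id].add(last_plowed)
    return {segment: sorted(times) for segment, times in events.items()}


# Made-up snapshots taken 15 minutes apart.
snapshots = [
    ("2018-11-15 18:00", "seg-1", "2018-11-15 17:40"),
    ("2018-11-15 18:15", "seg-1", "2018-11-15 17:40"),
    ("2018-11-15 18:30", "seg-1", "2018-11-15 18:20"),
    ("2018-11-15 18:30", "seg-2", None),
]
events = plow_events(snapshots)
print(events)
```

One caveat with any such approach: if a segment is plowed twice between two consecutive snapshots, only the later visit is visible in the data.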
SUBTLEXus is a dataset of word frequencies in American English, derived from the subtitles for 8,388 films. The dataset, which covers more than 74,000 words, includes each word’s total frequency, the number of films in which the word appeared, and several other metrics. Bonus: Similar datasets are also available for Chinese and Dutch. [h/t The Language Goldmine] — : December 5, 2018
Links:
The British nonprofit 360Giving helps grantmakers “to publish their grants data in an open, standardised way and helps people to understand and use the data.” Through its GrantNav platform, you can search across more than 300,000 grants — totalling more than £25 billion — given by scores of funders to nearly 180,000 recipients. You can download the results of each search, as well as the underlying datasets. [h/t Enigma Public] — : December 5, 2018
Links:
The New Orleans Police Department’s “Body Worn Camera Metadata” contains the dates, times, durations, and locations for 2.7 million body camera recordings, going back to 2014. Related: The agency publishes similar data for 1.5 million in-car camera recordings. [h/t Alexandre Léchenet] — : December 5, 2018
Links:
Tags: crime, justice, technology
In 2014, author C. M. Taylor began writing a new novel, this time with a twist: He would write the entire story on a laptop intentionally infected with spyware. With the help of the British Library, a program recorded every keystroke Taylor typed and took screenshots every few seconds. The novel, Staying On, was published in October; soon after, Taylor and the library made the spyware recordings available to download. [h/t Dan Hett] — : December 5, 2018
Links:
Tags: books, technology
New York City’s Department of Health publishes a dataset of 8,000+ reported instances of dogs biting humans, mostly from 2015 through 2017. The agency collects the reports “to determine if the biting dog is healthy ten days after the person was bitten in order to avoid having the person bitten receive unnecessary rabies shots.” [h/t Justin Baker] — : November 28, 2018
Links:
Last month, an international team of researchers published the third major version of their Gridded Livestock of the World dataset, which estimates the global distribution of cattle, buffaloes, horses, sheep, goats, pigs, chickens and ducks. The new dataset is based on 2010 statistics and provides estimates at “a spatial resolution of 0.083333 decimal degrees (approximately 10 km at the equator).” — : November 28, 2018
Links:
Tags: agriculture, animals
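The livestock dataset's "approximately 10 km at the equator" can be checked with a quick calculation, which also shows how the east-west width of a grid cell shrinks away from the equator. This is a rough sketch that treats Earth as a sphere with the WGS84 equatorial radius; the function name is invented for this example.

```python
import math

EARTH_EQUATORIAL_RADIUS_KM = 6378.137  # WGS84 equatorial radius


def cell_width_km(resolution_deg, latitude_deg=0.0):
    """Approximate east-west width of a grid cell, using a spherical Earth."""
    km_per_degree = 2 * math.pi * EARTH_EQUATORIAL_RADIUS_KM / 360
    return resolution_deg * km_per_degree * math.cos(math.radians(latitude_deg))


at_equator = cell_width_km(0.083333)
at_60_north = cell_width_km(0.083333, latitude_deg=60)
print(round(at_equator, 1), round(at_60_north, 1))  # 9.3 4.6
```

So 0.083333 degrees (five arc-minutes) works out to a bit over 9 km at the equator, and roughly half that at 60 degrees latitude.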
Bilateral labor agreements regulate the migration of workers between two countries, and the Bilateral Labor Agreements Dataset aims to catalog as many of these treaties as it can. So far the University of Chicago Law School professors and researchers running the initiative have identified 582 treaties signed between 1945 and 2015. “However, this list is almost certainly underinclusive,” they write. “Many BLAs are not deposited in the major international treaty databases and they often do not receive much, if any, publicity.” [h/t Adam Chilton] — : November 28, 2018
Links:
Tags: economics
The Small World of Words project “is a large-scale scientific study that aims to build a mental dictionary or lexicon in the major languages of the world.” The experiment has asked hundreds of thousands of participants to list their immediate associations with various words (such as “telephone,” “journalist,” and “yoga”). In all, the project has collected more than 15 million responses. You can download the data, examine the project’s analysis pipeline, and explore the responses online. [h/t Lewis Mitchell] — : November 28, 2018
Links:
Tags: language
The German Aerospace Center is publishing global elevation data derived from its TanDEM-X satellite mission. For five years, two satellites orbited Earth together in a formation that allowed their radars to “‘see’ the same land area, but from slightly different perspectives” and to calculate elevations based on those differences. Although the most detailed versions of the data are “subject to restrictions due to the potential for commercial exploitation, and thus requires a scientific proposal,” the least detailed version (which still clocks in at more than 90 gigabytes) can be downloaded for free. [h/t Matt Brealey] — : November 28, 2018
Links:
Tags: mapping
STAPI bills itself as “the first public Star Trek API.” It provides access to structured data not only about the fictional universe (e.g., 6,364 characters, 1,215 spacecraft, and 155 conflicts) but also its intersection with reality (e.g., 5,302 performers, 731 television episodes, 76 soundtracks). [h/t Cezary Kluczyński] — : November 7, 2018
Links:
A recent study revealed the results of “the Moral Machine, an online experimental platform designed to explore the moral dilemmas faced by autonomous vehicles.” The experiment asked participants to decide whether a self-driving car — faced with two deadly options — should stay on course (killing one group of pedestrians) or swerve (killing another). The project “gathered 40 million decisions in ten languages from millions of people in 233 countries and territories,” and a dataset containing every decision is available to download. Read more: “Should a self-driving car kill the baby or the grandma? Depends on where you’re from.” [h/t Walt Hickey] — : November 7, 2018
Links:
Tags: technology, transportation
A team led by University of Kansas professor Ron Francisco has collected and codified data on protests, strikes, and other “coercive acts” in dozens of European countries during the late 20th century. There’s a row for each day of each protest, and each row specifies the issue at stake, the organizers, their target, the type of action, and the location — as well as the number of protesters, arrests, injuries, and deaths. [h/t Alexandre Léchenet] — : November 7, 2018
Links:
The Department of Education requires U.S. universities to report all major gifts from (and contracts with) foreign entities. The agency’s database of these gifts and contracts currently covers 2012 to mid-2018, and includes 18,000+ entries from more than 150 schools. Related: In the wake of Jamal Khashoggi’s murder, the AP’s Collin Binkley and Chad Day used the data to examine colleges’ financial ties to Saudi Arabia. [h/t Meghan Hoyer] — : November 7, 2018
Links:
Tags: education
The Caselaw Access Project aims “to make all published U.S. court decisions freely available to the public online, in a consistent format, digitized from the collection of the Harvard Law Library.” Currently, the project provides an API for fetching data on more than 6 million cases published between 1658 and 2018 — though public access is limited to downloading 500 cases per day. You can also download bulk data for all cases in Illinois and Arkansas, but getting bulk data for other states currently requires a research agreement. [h/t Caitlin Ostroff] — : November 7, 2018
Links:
When the Federal Election Commission receives a registration form that contains “questionable information” from a candidate or committee, the agency asks for additional information. If the FEC doesn’t get a proper response, it adds the registration to its dataset of “unverified filers”. Among the 500+ registrations currently on the list: “VoldemortCantStopTheVote.org”, “Department of Treasury,” “Wookie PAC,” and “Al Pacino.” [h/t Chris Zubak-Skees] — : October 31, 2018
Links:
Tags: elections
What happens when coal mines shut down? Money for their cleanup is supposed to be ensured by a system of bonds. But when Climate Home News’ Mark Olalde investigated these remediation funds, he found “a system incapable of dealing with large-scale bankruptcies, amid a declining industry, which severely threatens the environment and future of coal-mining communities across the country.” You can download the data behind Olalde’s findings — including bond databases covering the “23 states that produce 99% of US coal,” obtained via public records requests. [h/t Megan Darby] — : October 31, 2018
Links:
Tags: energy, environment
The U.S. Energy Information Administration uses Form EIA-861 to collect annual data from thousands of electric utilities about their sales, revenue, peak loads, customer counts, energy efficiency savings, and more. More than 3,400 utilities submitted the form (or its shorter cousin, EIA-861S) for 2017, and the data go back to 1990. [h/t Jordan Wirfs-Brock] — : October 31, 2018
Links:
Tags: energy
Earlier this month, Twitter released data on the public activity of “3,841 accounts affiliated with the [Internet Research Agency], originating in Russia, and 770 other accounts, potentially originating in Iran.” Together, the datasets “include more than 10 million Tweets and more than 2 million images, GIFs, videos, and Periscope broadcasts.” Related: My colleague Peter Aldhous used this data — combined with data on 3 million “Russian troll tweets” released this summer by Clemson University researchers and FiveThirtyEight — to examine the Internet Research Agency’s traction before and after the 2016 election. Bonus: Peter’s code. — : October 31, 2018
Links:
Tags: elections, social media
The Federal Highway Administration’s National Bridge Inventory contains detailed data on more than 600,000 “highway bridges” in the United States. The inventory goes back to 1992 and contains scores of fields, including the bridge’s age, condition, design, and materials. Now you know: Texas has the most highway bridges in the inventory, with more than 53,800. Bonus: You can also search the bridges via the unofficial BridgeReports.com. Related: The code the Baltimore Sun used to answer the question, “How safe are Maryland's bridges?” [h/t Christine Zhang] — : October 31, 2018
Links:
Tags: infrastructure
New York City provides the latitude, longitude, ID number, and current status — active, inactive, retired, planned, and removed — of more than 14,700 parking meters. [h/t Zack Quaintance] — : October 10, 2018
Links:
Tags: transportation
Connecticut’s Department of Consumer Protection has released a dataset listing all branded medical marijuana products registered with the state. For each of the nearly 4,000 products so far, the dataset describes the producer, brand name, form of dosage, and chemical potencies — plus links to images of each product and label. [h/t Kristin Hussey] — : October 10, 2018
Links:
Tags: drugs, healthcare
Nager.Date calculates the timing — past, present, and future — of public holidays for more than 90 countries. The holidays can be browsed online, accessed via an API, or downloaded as CSVs (one per country per year). Now you know: Today is Cuba’s Día de la Independencia and Suriname’s Day of the Maroons. [h/t Tino Hager] — : October 10, 2018
Links:
Tags: miscellaneous
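For readers who want to try the Nager.Date API mentioned above, here is a minimal Python sketch. It assumes the service's current v3 URL pattern (`/api/v3/PublicHolidays/{year}/{countryCode}`) and that each returned record carries a `date` field in ISO format; both are assumptions about today's API rather than details from the original write-up. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumed base URL for the v3 public-holidays endpoint.
BASE_URL = "https://date.nager.at/api/v3/PublicHolidays"


def holidays_url(year, country_code):
    """Build the request URL for one country and one year."""
    return f"{BASE_URL}/{year}/{country_code.upper()}"


def fetch_holidays(year, country_code):
    """Download and decode the holiday list (performs a network request)."""
    with urlopen(holidays_url(year, country_code), timeout=30) as resp:
        return json.load(resp)


def holidays_on(holidays, iso_date):
    """Keep only the records whose 'date' field equals iso_date."""
    return [h for h in holidays if h.get("date") == iso_date]


# Building the URL needs no network access.
print(holidays_url(2018, "cu"))
```

Calling `holidays_on(fetch_holidays(2018, "CU"), "2018-10-10")` would then check a single date, provided the response has the shape assumed here.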
England’s public health department generates quantitative “profiles” of the country’s well-being. The metrics include rates of HPV vaccination, dementia, exercise, diabetes, and much more. The results can be downloaded directly, and also accessed via an API. [h/t Sharon Machlis] — : October 10, 2018
Links:
Tags: healthcare, statistics
LawAtlas.org publishes interactive maps that detail state and federal regulations on dozens of public health–related topics. Among them: e-cigarettes, HIV criminalization, fair housing, syringe distribution, and cell phone use while driving. (You can, for instance, use the e-cigarette map to identify all states where vaping is allowed in hotel rooms but prohibited in public parks.) You can download the underlying data, plus documentation about how the laws were categorized. Bonus: The website, run by Temple University’s Center for Public Health Law Research, will also teach you how to map laws yourself. Previously: The Correlates of State Policy Project (DIP 2016.07.06). — : October 10, 2018
Links:
Tags: HIV and AIDS, healthcare
The Florida Department of Transportation publishes its inventory of active permits for billboards and other “outdoor advertising.” For each permit, the dataset provides details about the permit-holder and the structure itself — such as its location, height, whether it’s in a city, and more. [h/t Caitlin Ostroff] — : October 3, 2018
Links:
Tags: mapping
A slew of cities have installed devices to count bicycles that pass through major routes. At least several publish hourly or daily tallies: London, Ottawa, Edinburgh, Seattle, Cambridge, Mass., and the Washington, DC area. New York City provides daily counter-tallies for its East River bridges, but currently only as PDFs. Related: "[Transport for London]’s cycle counter data: initial thoughts" and “What we can learn from Seattle’s bike-counter data.” [h/t Giuseppe Sollazzo] — : October 3, 2018
Links:
Tags: transportation
The UK Department for Transport’s traffic counts calculate the average daily number of vehicles “for every junction-to-junction link on the 'A' road and motorway network in Great Britain.” Likewise, California publishes the average daily traffic, peak hourly traffic, truck traffic, and ramp traffic for each of its state highways. Previously: U.S. interstate highway traffic (DIP 2016.10.05) and public roads (DIP 2018.04.25) [h/t Dave Fisher-Hickey + u/ron_leflore] — : October 3, 2018
Links:
Tags: transportation
PollOfPolls.eu aggregates political polls from 30 European countries. The Vienna-based initiative has, for instance, collected and standardized more than 1,000 individual polls on British parliament since 2014, and 60 on the Bavarian state elections. You can download each set of standardized data as either JSON or CSV. [h/t Jovi Juan] — : October 3, 2018
Links:
Tags: elections
The U.S. Fish & Wildlife Service publishes a database outlining the critical habitats for more than 700 threatened and endangered species. For each habitat, the dataset provides its geographic boundary lines, the species’ name and type, the size of the habitat, the date it was declared critical, and more. Related: Other geospatial datasets from the USFWS, including those on the Coastal Barrier Resources System and migratory bird populations. — : October 3, 2018
Links:
Tags: animals, environment
The Hass Avocado Board publishes weekly data on the retail volume and average price of Hass avocados sold in the United States, based on information collected “directly from retailers’ cash registers.” The data is available at the national and city level going back to 2015, and distinguishes between conventional and organic avocados of various sizes. Related: Justin Kiggins has aggregated the historical spreadsheets for 2015 through March 2018 into a single file. — : September 26, 2018
Links:
Tags: food
The Social Assistance, Politics and Institutions database, developed at a United Nations University research center, “provides a synthesis of longitudinal and harmonized comparable information on social assistance programmes in developing countries, covering the period 2000-2015.” For each program, such as Brazil’s “Bolsa Familia,” the database describes its basic characteristics, budget and financing, and population coverage. [h/t Erik Gahner Larsen] — : September 26, 2018
Links:
The National Survey of Family Growth, run by the U.S. Centers for Disease Control and Prevention, “gathers information on family life, marriage and divorce, pregnancy, infertility, use of contraception, and men’s and women’s health.” Versions of the survey have been conducted nine times, dating back to 1973. The most recent results come from interviews of more than 10,205 people between September 2013 and September 2015. Related: The Pudding’s Amber Thomas used the data to explore trends in birth control. Bonus: Thomas also published the code and data behind her analysis. [h/t Giuseppe Sollazzo] — : September 26, 2018
Links:
Tags: family, healthcare
Some cities — including San Francisco, Los Angeles, and Austin — provide downloadable databases of lobbyists who’ve officially registered to influence their administrations. Chicago has gone one step further, publishing data on lobbyists’ compensation, expenditures, gifts, and more. Previously: Lobbying data from the U.S. House, U.S. Senate, and European Union (DIP 2017.05.31 + DIP 2017.08.02). [h/t Alisha Green and Laurenellen McCann] — : September 26, 2018
Links:
Tags:
When Amsterdam began excavating parts of the Amstel River in 2003 to construct a new metro line, the city gave archaeologists access to two large sections of the riverbed. Over time, these archaeologists unearthed “a deluge of finds, some 700,000 in all: a vast array of objects, some broken, some whole, all jumbled together.” To showcase the work, the city has published Below the Surface, a website that lets you explore 20,000 of the objects online, download detailed data on more than 130,000 of the artifacts, read the backstory, and watch a documentary about it. Among the discoveries: Thousands of tobacco pipes, hundreds of teapots, dozens of gin bottles, and one “miniature wind mill.” [h/t Adam J Calhoun + Manoj Mallela] — : September 26, 2018
Links:
Tags: history
New York State’s Department of State publishes a structured listing of all real estate brokers, salespeople, and offices currently licensed by the agency. Roughly half of the 160,000 licensees are registered to business addresses in New York City. The ZIP code with the largest raw number of active licenses is 10022, a chunk of Midtown East that includes (among other things) the Waldorf Astoria and Trump Tower. — : September 19, 2018
Links:
Tags: real estate
The U.S. Department of State publishes, “to the maximum extent practicable,” a database of “each United States citizen who dies in a foreign country from a non-natural cause.” The database currently contains 13,045 deaths, starting in October 2002, and is updated every six months. For each incident, the database provides the date, city, and cause of death. [h/t Jacquelyn Elias] — : September 19, 2018
Links:
Tags: death
Unpaywall has collected data on millions of open-access scholarly articles, plus many more paywalled articles. You can download the full dataset, or submit specific Digital Object Identifiers to the website’s API or online form. For each article, you can learn whether it’s openly accessible, whether the journal that published it is open-access, and additional details about the article itself. [h/t @authcontroller] — : September 19, 2018
Links:
Tags: education
For many decades, the Department of Energy’s Residential Energy Consumption Survey has been asking people about their homes’ energy-related characteristics (e.g., number of bedrooms and roofing materials) and energy-consuming appliances (e.g., television size and dishwasher use). Then, the agency cross-references those answers with billing data collected “directly from energy suppliers under a mandatory authority granted by Congress.” The survey has been conducted 14 times since 1978; survey microdata is available for the eight most recent iterations. — : September 19, 2018
Links:
Tags: energy
ParlGov, “a data infrastructure for political science,” has collected detailed information on 1,500+ political parties, the results of 900+ elections, and the formation of 1,400+ parliamentary cabinets. The 37 countries it covers include every member of the European Union plus certain non-EU members of the OECD (such as Israel, Turkey, and Canada — but not the United States). The datasets are available in several formats, can be explored online, and come with extensive documentation. [h/t Jovi Juan] — : September 19, 2018
Links:
Open Brewery DB is a searchable database of more than 8,000 breweries in the United States (although “future plans are to import world-wide data”). The site provides an API, which lets you query by name, location, and type — microbrewery, regional brewery, brewpub, and so on. Previously: Official brewery statistics (DIP 2017.05.24). [h/t Chris Mears] — : September 12, 2018
Links:
Tags: alcohol
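As a rough illustration of querying Open Brewery DB by location and type, the Python sketch below builds a request URL. It assumes the API's current v1 endpoint and its `by_state`, `by_type`, `by_name`, and `per_page` query parameters, none of which are spelled out in the description above; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for listing breweries.
BASE_URL = "https://api.openbrewerydb.org/v1/breweries"


def breweries_url(state=None, brewery_type=None, name=None, per_page=50):
    """Build a list-breweries URL, including only the filters that were given."""
    params = {"per_page": per_page}
    if state:
        params["by_state"] = state
    if brewery_type:
        params["by_type"] = brewery_type
    if name:
        params["by_name"] = name
    return f"{BASE_URL}?{urlencode(params)}"


# Example: microbreweries in Texas (the type value "micro" is an assumption).
print(breweries_url(state="texas", brewery_type="micro"))
```

Fetching the resulting URL should return a JSON list of breweries if these assumptions hold.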
Zillow has created a dataset outlining the boundaries of more than 17,000 neighborhoods in the United States’ largest cities, spanning 49 states (all but Wyoming) plus D.C. and Puerto Rico. Related: OpenStreetMap, which is API-queryable, has a “neighbourhood” tag type. [h/t Volodymyr Kupriyanov] — : September 12, 2018
Links:
Tags: mapping, real estate
“For the first time, the city’s database, which tracks more than 28 million parking and vehicle compliance tickets, is easily available to the public,” according to ProPublica Illinois, which has published the two-gigabyte dataset in collaboration with WBEZ. The dataset, which covers January 2007 to mid-May 2018, “includes information on when, where, and by whom tickets were issued; de-identified license plates; vehicle make; registration zip code; the violation for which the vehicle was cited; the payment status and more.” — : September 12, 2018
Links:
Last month, Puerto Rico’s government began publishing a dataset of all deaths registered in the U.S. territory from January 2017, updated weekly. For each death, the information includes the year and month of the death; the type and causes of death; the deceased’s age, sex, marital status, occupation, place of birth and residence, and more. Related: “More Than 2,000 Puerto Ricans Applied For Funeral Assistance After Hurricane Maria. FEMA Approved Just 75.” [h/t Giancarlo Gonzalez] — : September 12, 2018
Links:
Utility companies are required to report major power outages and other “electric disturbance events” to the Department of Energy within a business day (or, depending on the type of event, sooner) of the incident. The federal agency then aggregates the reports into annual summary datasets. For each event, the data includes the time it began and was resolved, the geographic areas it affected, the type of incident, and the estimated number of customers affected. [h/t Jordan Wirfs-Brock] — : September 12, 2018
Links:
Tags: energy
Jan Diehm and Amber Thomas measured the pockets of 80 pairs of jeans — four pairs each from 20 brands, half marketed to men and the other half to women. Their findings “confirmed what every woman already knows to be true: women’s pockets are ridiculous.” In fact, “on average, the pockets in women’s jeans are 48% shorter and 6.5% narrower than men’s pockets.” For each pair of jeans, the duo’s underlying dataset contains the front and back pocket dimensions, material composition, retail price, and more. — : August 22, 2018
Links:
Tags: fashion, miscellaneous
The University of North Carolina’s Louis Harris Data Center serves as “the national depository for publicly available survey data collected by Louis Harris and Associates, Inc.” The online depository contains more than 1,000 Harris polls, some from as early as 1958. In total, they include “160,000 questions asked of more than 1,200,000 respondents.” [h/t Xan Gregg] — : August 22, 2018
Links:
Tags: statistics
DigitalGlobe’s open data program publishes georeferenced satellite imagery from before and after major natural disasters. The archive currently includes a couple dozen events, including recent flooding in Kerala and California’s Carr Wildfire and Mendocino Complex Fire. Previously: NOAA's emergency response aerial imagery (DIP 2017.09.20). [h/t Laura Noren and Brad Stenger] — : August 22, 2018
Links:
The African Economic Research Consortium, African Development Bank, and the World Bank have partnered to create the Service Delivery Indicators program — “a new Africa-wide initiative” that dispatches teams of surveyors “to gauge the quality of service delivery in basic health services” across the continent. The initiative’s de-identified data contains results for nine countries so far, including assessments of facility infrastructure, worker absenteeism, and patient case simulations. [h/t Matthew Collin] — : August 22, 2018
Links:
Tags: healthcare
Google recently launched a database of political ads “that have appeared on Google and partner properties.” The searchable and downloadable dataset indicates the organization that paid for each advertisement, approximately how much they spent, how long the ad ran, what demographics were used for targeting, and roughly how many people it reached. A few months ago, Facebook launched a similar initiative, but you need to be logged in to view it and you can’t download the data. You can, however, get Facebook political-advertising data from at least two sources: A repository of 267,000 ads scraped from Facebook’s official archive by NYU researchers, and ProPublica’s ongoing, detailed database of ads and targeting parameters gathered through their Political Ad Collector. [h/t Sheera Frenkel] — : August 22, 2018
Links:
The SEC’s Office of Structured Disclosure publishes data extracted from corporations’ public financial statements. That dataset contains the numbers listed in each company’s primary financial statements — balance sheets, cash flows, et cetera. An even more detailed version of the dataset includes plain-text notes from the filings, plus numbers from a broader array of forms. Both datasets are updated quarterly and go back to 2009. — : August 15, 2018
Links:
Tags: business
Best Buy’s API and Walmart’s API both let you search their products and stores. Both also require (free) registration to obtain an API key. In 2016, Best Buy also published bulk data describing its products and stores. [h/t Dan Nguyen + Dave Machado] — : August 15, 2018
Links:
Tags: business, technology
Nike, Inc.’s manufacturing map displays 618 factories and material suppliers that the company uses to manufacture its products (as of May 2018). You can export the entire dataset, or browse and filter the data online. For each of the factories, the information includes the factory’s name, address, product type, number of workers, percentage of workers who are female, and more. [h/t Marc DaCosta] — : August 15, 2018
Links:
SpaceX’s API provides data on the company’s rockets, launchpads, launches, and more. It also will tell you the current orbital position of the car SpaceX launched into space. [h/t Mike Allred] — : August 15, 2018
Links:
Tags: business, technology
The Lending Club, which matches borrowers with investors, publishes a dataset of all loans issued through its platform since 2007. The dataset’s many fields include each loan’s amount, term, interest rate, grade, status, and purpose (as a category, and often also a fuller description), as well as the borrower’s employer, home ownership status, and annual income. You can also download all declined loans, i.e., those “that did not meet Lending Club's credit underwriting policy.” [h/t Charlie Stanton] — : August 15, 2018
Links:
Tags: money
London, Belfast, Vancouver, Washington (D.C.), Philadelphia, Boston, Cambridge (Mass.), Madison, Providence, San Francisco, Oakland, and Berkeley are among the many cities that publish data cataloguing the trees that line their streets. Previously: NYC’s street trees (DIP 2016.11.16). [h/t Jens von Bergmann + Sunlight Open Cities + u/willwardo] — : August 8, 2018
Links:
The nonprofit organization Reclaim The Records recently obtained New Jersey’s death index, and has made it available to search and download. The records include structured data for 1,275,833 deaths in the state between 2001 and 2017, plus digitized images of the death index for 1901-1903, 1920-1929, and 1949-2000. The structured data contains each person’s name, date of birth, date of death, and death certificate number — plus, for the most recent records, the locations of birth and death. Also: NJ Advance Media has published data on 17 years of drug overdose deaths from the state’s Office of the State Medical Examiner, and property tax rolls for “all 2.3 million taxable parcels of land” in 2017. (Free registration required to download the files.) [h/t Benjamin Cooley + Martin Burch] — : August 8, 2018
Links:
Tags: death, statistics
Researchers at Primer, a machine learning and natural language processing startup, have released a dataset describing more than 36,000 notable computer scientists, “only 15%” of whom have Wikipedia biographies. The researchers trained their algorithms on a corpus of existing Wikipedia articles, Wikidata entries, news articles, and the Semantic Scholar Open Research Corpus. (The latter contains data on more than 39 million research papers in computer science, neuroscience, and biomedical science.) The results include each computer scientist’s name, basic metadata, academic papers, and snippets of news articles mentioning them. Related: “Using Artificial Intelligence to Fix Wikipedia's Gender Problem” (Wired). [h/t Sara Blask] — : August 8, 2018
Links:
Tags:
ProPublica is tracking the money that political campaigns and government agencies have reported spending at Donald Trump’s hotels, golf clubs, and restaurants. You can download the data, which includes the spender, property, date, amount, and listed purpose for each payment. From ProPublica’s notes: “Federal government spending is incomplete because many government agencies have actively fought requests to disclose spending at Trump properties. The data we have so far was released, in part, after lawsuits.” — : August 8, 2018
Links:
Tags: Trump
The Militarized Interstate Dispute datasets provide details about more than 2,200 instances between 1816 and 2010 where a government “threatened, displayed, or used force against another” — including each dispute’s timing, participants, death count, result, and more. A supplementary database tracks the disputes’ locations. The datasets are part of the Correlates of War project, which was founded in 1963 and which strives for “the systematic accumulation of scientific knowledge about war.” [h/t Erik Beuck] — : August 8, 2018
Links:
There are many official datasets of public toilets, including those in New York City parks, Vancouver, Seattle parks, many UK cities, Australia, and New Zealand. [h/t Jens von Bergmann] — : August 1, 2018
Links:
Tags: mapping
Earlier this year, Johns Hopkins professor Dan Honig released the Project Performance Database, which tracks the outcome ratings of international development projects (typically conducted by auditors on a four- or six-point scale). “The PPD is, at present, the world's largest” such database and “contains over 14,000 unique projects from eight agencies,” including the World Bank, the Asian Development Bank, and others. [h/t Paddy Carter] — : August 1, 2018
Links:
Tags:
Researchers at the Environmental Protection Agency have created a new dataset of “reported and predicted information on more than 75,000 chemicals and more than 15,000 consumer products.” The Chemicals and Products Database, as they’ve named it, is an “aggregation of publicly available data on chemical-use categorization, consumer product composition [...], and functional use of chemicals”, and uses “a consistent scheme for categorizing products and chemicals.” You can download the data via the EPA’s Chemistry Dashboard. — : August 1, 2018
Links:
The Organisation for Economic Co-operation and Development (OECD) has launched a database “providing detailed and comparable tax revenue information for 80 countries around the world.” The Global Revenue Statistics Database, “which will expand to cover more than 90 countries by the end of 2018,” breaks tax revenues into dozens of categories and subcategories — such as sales taxes, taxes on capital gains, and taxes on exports. Related: The OECD’s interactive charts of the data. — : August 1, 2018
Links:
Tags: taxes
Law professor Brandon L. Garrett has led an effort to compile data on every death sentence in the U.S. since the early 1990s. Garrett’s “End of its Rope” database currently includes more than 4,900 sentencings, and specifies each defendant’s name, race, and gender; the state, county, and year of the sentence; whether it was a resentencing; and whether the defendant has been executed. You can download the data, browse it online, and explore it via an interactive map. — : August 1, 2018
Links:
“When somebody's obituary appears in the New York Times, FOIA The Dead sends an automated request to the FBI for their (newly-available) records.” So far, the project has obtained and published the FBI’s files on 54 people. The site’s data includes each person’s name, a short description, a link to the relevant obituary, a link to the received records, and the number of pages obtained. [h/t Noah Veltman] — : July 18, 2018
Links:
In a recent essay at The Pudding, Jason Li, Amber Thomas, and Divya Manian explored the shades of foundation offered by best-selling makeup brands in the U.S., Nigeria, India, and Japan. They also published the underlying data — color values for more than 600 shades from 36 different brands. — : July 18, 2018
Links:
Microsoft’s Bing Maps team has published an open dataset describing the outlines of nearly 125 million buildings in the United States. To build the dataset, the team trained neural networks to detect buildings’ footprints in satellite images. — : July 18, 2018
Links:
The Amsterdam-based activist group UNITED for Intercultural Action has, since the early 1990s, been collecting information about the deaths of Europe’s refugee-seekers. The organization's volunteers “update the data annually, spending six months at a time verifying reports, categorising deaths and entering them into the database,” according to The Guardian's story about the endeavor and its findings. “When the project began, they received physical clippings from a network of groups around Europe. Now, the data is collected from email submissions and Google Alerts in a number of languages.” The story features a PDF-listing of the deaths, including the date the migrants were found dead, names and countries of origin (where known), and the causes of death. The Italian civic-data organization OnData has converted the PDF to a spreadsheet. [h/t Giuseppe Sollazzo] — : July 18, 2018
Links:
Tags: death, immigration
Since July 2014, ScotusMap.com has been tracking the U.S. Supreme Court justices’ public events — “whether the Supreme Court is in session or on summer recess, the justices keep busy with writers’ conferences, state bar luncheons, award ceremonies, and more.” The map’s database now contains more than 700 entries, and even includes events attended by the retired justices. Bonus: The creators of ScotusMap recently launched ScotusWat.ch, a website (with downloadable data) that “tracks the public statements made by United States senators about how they plan to vote on the Supreme Court nominee, Brett Kavanaugh, and tallies them into a likely vote count.” [h/t Jay Pinho + Victoria Kwan] — : July 18, 2018
Links:
Tags: law
Kansas City publishes a dataset of cars for sale at its monthly auction. As of yesterday, the dataset contained 482 cars. For each car, the variables include the make, model, year, VIN, reason for being auctioned — e.g., “abandoned,” “stolen,” “illegally parked” — and other details. — : July 4, 2018
Links:
Tags: transportation
Christine Zhang has compiled a CSV of 400+ locations featured in Anthony Bourdain’s No Reservations, The Layover, and Parts Unknown shows. The spreadsheet-as-remembrance includes each location’s name, country, latitude/longitude, plus the relevant episode’s show, season, number, and title. — : July 4, 2018
Links:
The U.S. Centers for Medicare & Medicaid Services publishes a series of “geographic variation” spreadsheets, which cover hundreds of metrics — such as kidney dialysis usage, the total cost of medical tests, and hospital readmission rates — related to Medicare beneficiaries’ healthcare in each state, county, and “hospital referral region.” [h/t Drew Ivan] — : July 4, 2018
Links:
Tags: healthcare, statistics
The Cooperative Open Online Landslide Repository (COOLR) is a recently-launched NASA project that “seeks to cultivate an open platform where scientists and citizen scientists around the world can share landslide reports to guide awareness of landslide hazards for improving scientific modeling and emergency response.” The repository has been seeded with the agency’s Global Landslide Catalog, which it says is already “the largest openly available global database of rainfall-triggered mass movements known to date.” You can explore the COOLR data on an interactive map or download the data in several formats. — : July 4, 2018
Links:
Every year, hundreds of U.S. transit systems — from the Pomona Valley Transportation Authority’s Claremont Dial-a-Ride to the Metropolitan Transportation Authority’s New York City Transit — submit detailed metrics to the congressionally-established National Transit Database. The NTD's datasets cover a broad set of topics, including “agency funding sources, inventories of vehicles and maintenance facilities, safety event reports, measures of transit service provided and consumed, and data on transit employees.” The NTD also provides a glossary, data collection manuals, and the underlying forms. [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle] — : July 4, 2018
Links:
Tags: transportation
Ever since 2010, the National Institute for Computer-Assisted Reporting (NICAR) annual conference has featured a session of five-minute “lightning talks,” selected by popular vote. NICARian Christine Zhang has compiled a spreadsheet of all 309 lightning talk proposals, the proposed presenters, their professional affiliations, how many votes each proposal received, and more. Related: “Nine Years of NICAR Lightning Talks (and Cats),” Zhang’s analysis of the data. Also related: The code behind Zhang’s analysis. — : June 6, 2018
Links:
Tags: journalism
The Mexican state of Yucatán publishes a dataset listing the names and locations of cenotes, the region’s famous water-filled sinkholes. Related: Other datasets from the Programa de Ordenamiento Ecológico Territorial del Estado de Yucatán. [h/t Forest Gregg] — : June 6, 2018
Links:
Tags: environment, mapping, water
PubMed, the National Library of Medicine’s search engine for biomedical and life-sciences literature, lets you search for retracted publications; just add "retracted publication"[PTYP] to your query. For instance, here are retracted articles that were originally published in 2016. Using the “Send to” link at the top-right of the query pages, you can download all the results. Data scientist Neil Saunders has gathered this data and condensed it into an interactive, graphical report. (Clicking on the axis labels takes you to the relevant PubMed search.) Related: The code behind Saunders’ report. [h/t u/cavedave] — : June 6, 2018
Links:
Tags: healthcare, science
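The PubMed filter described above can also be applied programmatically. The Python sketch below appends "retracted publication"[PTYP] to an arbitrary query and builds a search URL for NCBI's E-utilities `esearch` endpoint. Using E-utilities rather than the PubMed website, and the `[PDAT]` publication-date tag in the example, are assumptions beyond what the entry describes; the function names are hypothetical.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (an alternative to the website's "Send to" link).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
RETRACTED_FILTER = '"retracted publication"[PTYP]'


def retracted_query(base_query=""):
    """Combine a PubMed query with the retracted-publication filter."""
    base_query = base_query.strip()
    if not base_query:
        return RETRACTED_FILTER
    return f"({base_query}) AND {RETRACTED_FILTER}"


def esearch_url(term, retmax=20):
    """Build an esearch URL that returns matching PubMed IDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmode": "json", "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example: retracted articles with a 2016 publication date.
query = retracted_query("2016[PDAT]")
print(query)
print(esearch_url(query))
```

The first printed line is the query you could paste into PubMed's search box; the second is the equivalent E-utilities request.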
The Smithsonian Institution’s Global Volcanism Program maintains a database of more than 12,000 volcanoes and 11,000 eruptions — dating from 10450 BCE to the present year. You can search the data online, and then download the results as a spreadsheet. Related: “Here's every volcano that has erupted since Krakatoa.” [h/t Duncan Geere + Rachel Schallom + Lazaro Gamio] — : June 6, 2018
Links:
The Department of Energy’s National Energy Technology Laboratory has published what it says is the “first-ever database inventory of oil and natural gas infrastructure information from the top hydrocarbon-producing and consuming countries in the world.” The database contains tons of geospatial information and “identifies more than 4.8 million individual features like wells, pipelines, and ports from more than 380 datasets in 194 countries. It includes information about the type, age, status, and owner/operator of infrastructure features.” Helpful: The authors’ (detailed) methodology paper. [h/t Michael McLaughlin] — : June 6, 2018
Links:
Tags: energy
The Wikimedia Foundation has published a dataset listing each clearly-cited source (e.g., a book with an ISBN, a scholarly article with a DOI, etc.) on each page of each of Wikipedia’s 298 language editions — 15,693,732 source-page combinations in all. Related: “The Most-Cited Authors on Wikipedia Had No Idea,” by Louise Matsakis. [h/t Ted Lawless] — : May 23, 2018
Links:
Tags: language, media, technology
Earlier this month, the Department of Energy’s National Renewable Energy Laboratory made a big new slice of its Wind Integration National Dataset available online. The latest version provides API access to 50 terabytes of wind-related measurements — about 10% of the full database. It includes “barometric pressure, wind speed and direction, relative humidity, temperature, and air density data” between 2007 and 2013, from nearly 5 million locations in/near the continental United States. The NREL has also published an animated map of the data. Note: Free registration is required to access the API. Previously: Wind turbines (DIP 2018.04.25). [h/t Michael McLaughlin] — : May 23, 2018
Links:
Tags: energy
The Nonviolent and Violent Campaigns and Outcomes (NAVCO) Data Project, based at the University of Denver, “catalogues major nonviolent and violent resistance campaigns around the globe from 1900-2013.” The project’s initial dataset explored the general characteristics of hundreds of campaigns; follow-up datasets have examined the annual activity and tactics of smaller subsets. Each dataset comes with a detailed codebook. Note: Free registration is required to download the most recent datasets. [h/t Peace Science Digest] — : May 23, 2018
Links:
Tags: conflict
As a way to “lower the barrier” for analyzing public transportation data, researchers at Finland’s Aalto University have published “a curated collection of [now more than] 25 cities' public transport networks in multiple easy-to-use formats including network edge lists, temporal network event lists, SQLite databases, GeoJSON files, and the GTFS data format.” On the project’s website, you can browse, visualize, and download each city’s data. (The cities are mostly in Europe and Australia, but also include Detroit, Winnipeg, and Antofagasta, Chile.) Previously: TransitLand and TransitFeeds (DIP 2016.07.27). [h/t NYU Data Science Community Newsletter] — : May 23, 2018
Links:
Tags: transportation
Caitlin Rivers, a computational epidemiologist at the Johns Hopkins Center for Health Security, has started compiling data tracking the current Ebola outbreak in the Democratic Republic of Congo. So far, the datasets are based on case counts and other information from the DRC’s Ministry of Health and the World Health Organization. A series of “data interpretation notes” accompanies each dataset. (Rivers administered a similar data repository during the 2014 Ebola outbreak.) Related: “Most Maps of the New Ebola Outbreak Are Wrong,” by Ed Yong. — : May 23, 2018
Links:
You know those metallic grates embedded into city sidewalks? D.C.’s Office of the Chief Technology Officer has identified 10,000+ of them in the District. Also: 89,727 curb segments. [h/t Sunlight Open Cities] — : May 9, 2018
Links:
Tags: infrastructure, mapping
The European Space Agency’s Gaia spacecraft “has produced the richest star catalogue to date, including high-precision measurements of nearly 1.7 billion stars and revealing previously unseen details of our home Galaxy.” Those measurements, released last month, are available to download. They’ve also been used to create a high-resolution image of all observed stars and to expand the ESA’s interactive space map. Related: This Vox article provides some more context. [h/t u/Kopachris] — : May 9, 2018
Links:
Tags: science
The Himalayan Database tracks “all expeditions that have climbed in the Nepalese Himalaya.” The hyper-detailed database “is based on the expedition archives of Elizabeth Hawley, a longtime journalist based in Kathmandu, and it is supplemented by information gathered from books, alpine journals and correspondence with Himalayan climbers.” The database — long accessible only on CD, for a fee — is now available to download for free. (The main download is provided as a Microsoft Visual FoxPro database, but the .DBF files within it can be opened using other software, including LibreOffice.) Related: Yuichiro Miura, the oldest person to reach the summit of Mount Everest. [h/t Jacob Bradburn] — : May 9, 2018
Links:
Cook County, Illinois, publishes data on all deaths reported to its medical examiner — 20,000+ deaths since August 2014, and updated daily. (FYI: “Not all deaths that occur in Cook County are reported to the Medical Examiner or fall under the jurisdiction of the Medical Examiner.”) Connecticut’s Office of the Chief Medical Examiner has published data on all accidental drug deaths reported between 2012 and 2017. The Dallas Morning News’ Dana Amihere obtained autopsy data from the Dallas County medical examiner's office, and NJ Advance Media’s Stephen Stirling obtained data on “all cases referred to the NJ Medical Examiner system from 1996 to 2016.” [Correction, 2018-05-09: The original version of this item misspelled Stephen Stirling's name. Data Is Plural regrets the error.] — : May 9, 2018
Links:
Tags: death
OpenTrials, a collaboration between Open Knowledge International and Oxford University’s Ben Goldacre, “aims to locate, match, and share all publicly accessible data and documents, on all trials conducted, on all medicines and other treatments, globally.” The project’s “public beta” brings together data from several of the world’s largest clinical trial registries — including the United States’ ClinicalTrials.gov, the European Union Clinical Trials Register, and the WHO’s International Clinical Trials Registry Platform — and other related sources. You can explore the data through an online search tool, monthly bulk exports, and an API. — : May 9, 2018
Links:
Tags: healthcare, science
Computer-vision researchers convinced 32 participants (of 10 nationalities, living in 4 cities) to record everything they did in their kitchens for three days using a head-mounted camera. Later, the participants narrated what they had been doing. Taken together, the EPIC-Kitchens dataset includes 55 hours of video, nearly 40,000 narration segments, and more. [h/t Duncan Geere] — : April 25, 2018
Links:
Tags: food
For her 2013 book, Making the News: Politics, the Media, and Agenda Setting, UC Davis professor Amber E. Boydstun oversaw the compilation of a dataset of every front-page article in the New York Times from 1996 to 2006. Each of the 31,034 articles has been categorized by topic, according to a detailed codebook, and given a short summary. Related: The Comparative Agendas Project's list of datasets that use its topic-classification system, including Boydstun’s data. Also related: The NYT’s APIs. [h/t Cornelius Puschmann] — : April 25, 2018
Links:
Tags: journalism
Lawrence Berkeley National Laboratory, the U.S. Geological Survey, and the American Wind Energy Association have partnered to publish the U.S. Wind Turbine Database. The dataset, which the government says will be “continuously updated,” currently contains 57,636 turbines and includes each turbine’s location, development project, manufacturer, model, height, rotor diameter, and other characteristics. You can download the data in several formats, and also explore it on an interactive map. [h/t Ed Vine] — : April 25, 2018
Links:
Tags: energy
The federal Highway Performance Monitoring System “includes inventory information for all of the Nation's public roads as certified by the States’ Governors annually.” And it’s not just highways: “All roads open to public travel are reported in HPMS regardless of ownership, including Federal, State, county, city, and privately owned roads such as toll facilities.” Shapefiles representing the HPMS data are available for 2011–2015. For each segment of road, the dataset indicates the average daily traffic, number of turn lanes, surface type, and dozens of other variables. Related: America’s Quietest Routes, which uses the data. — : April 25, 2018
Links:
Over the past year, reporters at the Washington Post “attempted to identify every act of gunfire at a primary or secondary school during school hours since the Columbine High massacre on April 20, 1999.” Using a range of sources, the reporters “reviewed more than 1,000 alleged incidents, but counted only those that happened on campuses immediately before, during or just after classes.” The resulting database, published last week, currently contains more than 200 incidents and can be downloaded as a CSV. For each shooting, the database includes details about the location, timing, circumstances, shooter, casualties, and the school’s students. [h/t The INN Nerds] — : April 25, 2018
Links:
The University of Florida’s Larry Winner has collected hundreds of “miscellaneous” datasets, many from niche academic studies. A few highlights: “Antiseptic as Treatment for Amputation – Upper Limb” (from an 1870 study), “Sex, Lies, and Religiosity” (1971), and “Reading Times by E-Reader Device and Lighting Conditions” (2013). [h/t Charles Minshew] — : April 18, 2018
Links:
Tags: miscellaneous, statistics
Researchers at the University of Colorado at Boulder and the Santa Fe Institute have compiled a dataset of 200+ universities’ parental leave policies. For each institution, the dataset indicates the amount of paid leave granted to/taken by both women and men, and what type of leave it is (e.g., relief from teaching, from all duties, et cetera). [h/t Sam Way] — : April 18, 2018
Links:
The nonpartisan Campaign Finance Institute has launched a database of current and historical state campaign finance laws. The information goes back to 1996 and describes each state’s contribution limits, various kinds of prohibitions, disclosure rules, and more. You can download the full dataset or explore it online. [h/t Rachel Shorey] — : April 18, 2018
Links:
The U.S. Office of the Federal Register publishes structured data on every presidential executive order since 1994. For each of the 886 entries, the dataset provides the order’s title, the date it was signed, the president who signed it, and where to find it in the Federal Register. [h/t u/cavedave] — : April 18, 2018
Links:
Tags: politics
A team led by Princeton sociologist and Evicted author Matthew Desmond has compiled the United States’ first-ever national-scale, publicly-available database of eviction metrics. Desmond’s Eviction Lab has collected more than 80 million records from cities, counties, and states across the country, and used them to calculate the number of evictions and eviction filings in each place. (Short methodology here; longer methodology here.) You can download the aggregate data in bulk (after supplying your email address) and explore it through an interactive map. Related: “In 83 Million Eviction Records, a Sweeping and Intimate New Look at Housing in America” (The New York Times), which includes additional background and graphics. — : April 18, 2018
Links:
Last year, Data Is Plural pointed readers to dog registration data for NYC, Tacoma, and Edmonton. It turns out that the government of Zurich also publishes local dog registrations, including each canine’s name, gender, and birth year. And the Sunshine Coast Council, in Australia, publishes a spreadsheet of both dogs and cats, their primary breeds and colors, and whether they’ve been spayed/neutered. [h/t Open Data Institute] — : April 4, 2018
Links:
Tags: animals
Software engineer Michael Penkov has scraped the official, polling station–level results for Russia’s recent presidential election, and made the data available as a single JSON file. He’s also published an introductory Python notebook, which explains the data structure and provides English translations for the Russian field names. — : April 4, 2018
Links:
The Center for Strategic and International Studies’ Beyond Parallel project publishes several databases related to North Korean international relations — including 200+ negotiations between the U.S. and DPRK since 1990, and several hundred military provocations since 1958. Related: Los Angeles Times correspondent Matt Stiles’ visual explorations of the provocations data. Previously: The James Martin Center for Nonproliferation Studies’ North Korea Missile Test Database (DIP 2017.05.17). [h/t Matt Stiles] — : April 4, 2018
Links:
Tags: conflict
The UK government has begun requiring all companies with at least 250 employees in Great Britain (i.e., England, Scotland, and Wales) to report the pay differences between their male and female workers. Today is the official deadline to submit the reports; as of last night, more than 8,800 employers had done so. The reports include the percentage gaps in hourly earnings, differences in bonus pay, and the proportions of male and female employees in each pay quartile. You can search the data online and also download it as a CSV. Related: The Guardian’s series of reports on the data. [h/t Peter Yeung] — : April 4, 2018
Links:
Tags: gender, money, statistics
The GOES-16 satellite was launched into orbit in November 2016, and it’s been collecting near-realtime images and data ever since. (GOES stands for “Geostationary Operational Environmental Satellite.”) It collects data on 16 different spectral bands, and it can capture a full image of the Western Hemisphere every 15 minutes, plus “an image of the Continental U.S. every five minutes, and two smaller, more detailed images of areas where storm activity is present, every 60 seconds.” You can browse the images and data online, and also download them as NetCDF files. Related: Washington Post graphics reporter John Muyskens’ list of GOES-16 resources and usage examples. [h/t John Muyskens] — : April 4, 2018
Links:
Tags: mapping
OpenPowerlifting.org “aims to create a permanent, accurate, convenient, accessible, open archive of the world's powerlifting data. In support of this mission, all of the OpenPowerlifting data and code is available for download in useful formats.” So far, that includes 400,000+ performances at 9,000+ competitions in dozens of countries. [h/t u/cavedave] — : March 7, 2018
Links:
Tags: sports
The University of Miami Libraries has digitized 53,000+ pages of La Gaceta de La Habana, “the paper of record during the Spanish colonial occupation of Cuba in the nineteenth century.” The digitized editions span 33 of the years between 1849 and 1897. Previously: Historical U.S. newspapers (DIP 2017.08.16). [h/t Mike Stucka + Heather Froehlich] — : March 7, 2018
Links:
Tags: journalism
Last year, the Stanford Center for Reproducible Neuroscience launched OpenNeuro, “a free and open platform for analyzing and sharing neuroimaging data.” (It’s the successor to the center’s earlier initiative, OpenfMRI.) You can, for instance, download scans of brains that were watching a particular episode of The Twilight Zone. Related: The Brain Imaging Data Structure, “a simple and intuitive way to organize and describe your neuroimaging and behavioral data.” Previously: The Open Access Series of Imaging Studies (DIP 2017.08.16). [h/t Laura Noren and Brad Stenger] — : March 7, 2018
Links:
Tags: healthcare, science
Afrobarometer “is a pan-African, non-partisan research network that conducts public attitude surveys on democracy, governance, economic conditions, and related issues in more than 35 countries in Africa.” You can download data from the first six rounds of surveys, conducted between 1999 and 2015. You can also read the detailed questionnaires and explore the results online. Note: To download the data, you’ll need to create a (free) account on the website. [h/t Jeffrey Arnold] — : March 7, 2018
Links:
Tags: statistics
The U.S. Energy Information Administration publishes near-real-time data on the Lower 48’s electrical grid. The datasets include net electricity generation, flows in and out of the country’s various “balancing authorities,” regional demand, and forecasts of demand. You can explore the data online, access it through the EIA’s API, or download it in bulk. Helpful: The EIA’s guide to the data and “known issues”. — : March 7, 2018
Links:
Tags: energy
The (unofficial) Rick and Morty API provides data on 390+ characters, 60+ locations, and all 31 episodes of the science-fictional animated series. — : February 28, 2018
Links:
Tags: entertainment, television
The American Society of Mammalogists’ Mammal Diversity Database “is home base for tracking the latest taxonomic changes to species and higher groups of mammals.” Currently, it contains more than 1,300 genera and 6,000 total species. Fun facts: The impala is the only member of the genus Aepyceros, and the name “Schmidly's deer mouse” can refer to either of two species in two entirely different genera. [h/t Himanshu Goenka] — : February 28, 2018
Links:
Tags: animals
The Environmental Protection Agency’s RadNet system “monitors the nation's air, precipitation and drinking water for radiation.” The radiation measurements, collected from 130+ stations in all 50 states plus the District of Columbia and Puerto Rico, are available on a “near-real-time” basis. Related: Randall Munroe’s radiation dose chart. Previously: SafeCast (DIP 2016.02.03). [h/t Stanislav Kralin] — : February 28, 2018
Links:
Tags: environment, nuclear
Political scientist Jeffrey Arnold has converted the U.S. Army Concepts Analysis Agency (CAA) Database of Battles from a series of Lotus 1-2-3 worksheets into tidier, easier-to-use CSV files. The dataset includes details of 660 battles — associated with several dozen wars — between 1600 and the mid/late-1900s. The fields indicate each battle’s “name, date, and location; the strengths and losses on each side; identification of the victor; temporal duration of the battle,” and more. — : February 28, 2018
Links:
The PA-X Peace Agreements Database contains structured information about 1,500+ “formal, publicly-available documents” that address “conflict with a view to ending it.” The database covers more than 140 peace processes between 1990 and 2015, and each agreement has been coded for more than 200 variables — for instance, whether the agreement contains provisions about religious groups. [h/t Melissa Terras] — : February 28, 2018
Links:
Tags: conflict
The Open Plaques project is dedicated to “documenting the historical links between people and places as recorded by commemorative plaques.” The latest data dump contains nearly 40,000 plaques — the vast majority in the U.S., U.K., and Germany. OpenBenches, meanwhile, has collected similar data for 4,300+ memorial benches. [h/t Jason Norwood-Young] — : February 21, 2018
Links:
Tags: mapping
The GitHub Archive is an effort to record the popular code-sharing website’s public timeline, “archive it, and make it easily accessible for further analysis.” The dataset, which includes more than 20 types of events and often contains more than 1 million events per day, goes back to February 2011. Related: Structured data representing the “commit histories” of two dozen popular open-source projects, including Rust, Pandas, Redis, and Bitcoin. — : February 21, 2018
Links:
Tags: technology
HappyDB is “a corpus of 100,000 crowd-sourced happy moments.” An example: “My son gave me a big hug in the morning when I woke him up.” The researchers, who recently described their efforts in an academic paper, collected the sentiments from Mechanical Turk workers, who also supplied basic demographic information, such as age, gender, and whether they have children. [h/t Marcel Weiher] — : February 21, 2018
Links:
Tags: entertainment, statistics
Through its Office of Foreign Assets Control, the Treasury publishes several datasets that describe the people and companies subject to U.S. economic sanctions. The two main listings are the Specially Designated Nationals and Blocked Persons (“SDN”) and the Consolidated Sanctions List. Those contain only currently-sanctioned entities, but the Treasury also publishes (semi-structured) documents describing historical additions and removals. Related: Enigma Public’s Sanctions Tracker. [h/t Jennifer Roscoe] — : February 21, 2018
Links:
The Humanitarian Data Exchange has collated dozens of datasets related to the Rohingya refugee crisis. Among them: the geographic boundaries of Rohingya refugee settlements in Bangladesh, the numbers of refugees living in those settlements, and the infrastructure available there. — : February 21, 2018
Links:
Tags: refugees
Via a Freedom of Information Act request to the Fish and Wildlife Service, Newsweek reporter Kristin Hugo obtained a spreadsheet listing all imports of bats — vampire, fruit, yellow-shouldered, leaf-nosed, and more — to the United States between January 2016 and October 2017. — : February 14, 2018
Links:
Tags: animals
The United Kingdom’s Home Office publishes dozens of fire-safety related datasets, including aggregate statistics on response times, smoke alarms, and fire department staffing; incident-level data on appliance fires, vehicle fires, and fatalities; and much more. Of the 100,000+ domestic appliance fires reported over a six-year span, 52% were believed to have been caused by a “cooker incl. oven,” 11% by a “grill/toaster,” 2% by dishwashers, and just over 1% by deep-fat fryers. Semi-related: Jamie Oliver’s Bad Cheese Idea Is Still Starting Toaster Fires. [h/t Owen Boswarva] — : February 14, 2018
Links:
Tags: disaster
Common Voice is a Mozilla-led project that aims “to make voice recognition technology easily accessible to everyone.” To that end, the project asks visitors to record themselves speaking specific sentences, and to validate the recordings of other users. The whole dataset is available to download and currently clocks in at 12 gigabytes, compressed. (Bonus: That download page also links to other freely available voice datasets.) Related: The project’s FAQ. — : February 14, 2018
Links:
Tags: language
Through the Constituency-Level Elections Archive (DIP 2016.09.28) and other sources, you can get historical election results for the U.S. Congress. And through the work of Jeffrey B. Lewis et al., you can get data describing the historical boundaries of each congressional district. In a Scientific Data article published last year, quantitative geographer Levi John Wolf presented a dataset that brings the two types of information together, so that all congressional election results from 1896 to 2014 are “explicitly linked to the geospatial data about the districts themselves.” — : February 14, 2018
Links:
In April 2015, the Gorkha earthquake killed more than 8,000 people in Nepal, and destroyed hundreds of thousands of homes. In early 2016, a team led by the not-for-profit Kathmandu Living Labs, in collaboration with Nepal’s government, undertook “a massive household survey using mobile technology to assess building damage in the earthquake-affected districts.” The responses to that survey are now available at the 2015 Nepal Earthquake Open Data Portal; you can explore the data online or download it in bulk. In all, the datasets include details on millions of individuals, plus information about each surveyed household and building. [h/t Reddit user “phishfart”] — : February 14, 2018
Links:
Tags: disaster
KongTrackr hosts detailed stats about specific games of Donkey Kong, the beloved arcade fixture, with a focus on record-setting scores. The website’s database, which can be downloaded as a single JSON file, currently includes 1,715 games by 450 players. Related: KongTrackr played a role in some recent high-score commotion. Also related: KongTrackr says its site is “heavily influenced” by this database of StarCraft 2 results. — : February 7, 2018
Links:
Tags: entertainment
Before each meeting of the Federal Open Market Committee, the Federal Reserve’s research staff prepares a set of economic projections known as the Greenbook. Those forecasts are kept secret for five years, and then released to the public. The Philadelphia Fed’s archive of public Greenbook data dates back to 1966, and contains both PDFs and structured data files. — : February 7, 2018
Links:
Tags: economics
The Global Register of Introduced and Invasive Species combines data and observations from thousands of sources to create a standardized database of such species in more than 200 countries. The database can be explored by kingdom (plants, animals, fungi, etc.), ecosystem, and country. Each slice of data can be downloaded as a CSV. Related: In a Scientific Data paper published last month, the researchers behind the effort described their methodology in detail. — : February 7, 2018
Links:
Tags: animals, environment, plants
The Global Terrorism Database, run by a University of Maryland–based consortium, is an “open-source database” of more than 170,000 terrorist events. The database, which currently covers 1970 through 2016, is well-documented and includes information about the attackers, locations, weapons, victims, and more. Note: To download the data, you first need to accept an end-user license agreement. Previously: Profiles of Individual Radicalization in the United States, from the same consortium (DIP 2017.05.24). [h/t Brian C. Keegan] — : February 7, 2018
Links:
Tags: terrorism
Lynn Fisher’s Hollywood Age Gap collects data on silver screen love interests — more than 880 so far, from more than 630 movies — and then calculates the difference in those actors’ ages. The largest gap so far is the 52-year age difference in Harold and Maude. The movie with the most pairings is Love Actually, with seven. You can download the data as JSON and CSV files from the project’s GitHub page. [h/t Julia Smith] — : February 7, 2018
Links:
Tags: entertainment, gender
Hans Lienesch calls himself The Ramen Rater, and (as his website’s banner declares) he’s been “Celebrating the Instant Noodle for 15 Years.” Over that time, he’s amassed a spreadsheet of more than 2,600 ratings. [h/t dreyco] — : January 24, 2018
Links:
Tags: food
Using a range of public sources, The Duke Chronicle collected data on all 1,739 students listed in the Class of 2018’s “Freshman Picture Book” — including their hometowns, details about their high schools, whether they won a merit scholarship, and whether they play on a sports team — in order to analyze “trends between those who do and don't join Greek life at Duke.” Related: “Is Greek life at Duke as homogenous as you think?,” the first story in the Chronicle’s multipart series based on the data. [h/t Gautam Hathi] — : January 24, 2018
Links:
Tags: education
A team led by researchers at the University of Oxford’s Malaria Atlas Project has estimated the time it would take (as of 2015) to get from any square kilometer in the world to the nearest city of 50,000+ people. The analysis, which improves upon a similar effort from 15 years earlier, benefits from “the first-ever, global-scale synthesis of two leading roads datasets – Open Street Map (OSM) data and distance-to-roads data derived from the Google roads database.” You can download the data as a GeoTIFF, or explore the map online. [h/t Data & Eggs] — : January 24, 2018
Links:
Tags: mapping, transportation
The Consumer Financial Protection Bureau’s National Financial Well-Being Survey collected more than 6,000 responses to the agency’s 10-question Financial Well-Being Scale, plus additional demographic and financial information. The survey results, which were collected in late 2016, come with a detailed methodology and data dictionary. Plus: You can take the questionnaire yourself, anonymously. [h/t Amy Cesal] — : January 24, 2018
Links:
Tags: economics, money, statistics
The Atlas of Economic Complexity has collected decades of import/export data from the United Nations Comtrade database, and then applied “a unique method to clean the data to account for inconsistent reporting practices.” You can download the raw data, learn more about the cleaning process in the FAQ, explore current and historical trade flows, and browse the Atlas’s rankings of countries by “economic complexity.” Related: The researchers have also created regionally-detailed economic atlases of Mexico and Colombia. [h/t Annie White] — : January 24, 2018
Links:
The unofficial Studio Ghibli API contains structured information about the famed Japanese animation studio’s films (e.g., Princess Mononoke and Spirited Away), plus the characters, locations, and vehicles featured in them. You can also download a single file containing all the data. — : January 17, 2018
Links:
Tags: entertainment, movies
The Open Source Psychometrics Project “provides a collection of interactive personality tests with detailed results that can be taken for personal entertainment or to learn more about personality assessment.” You can download results from more than 30 such tests, including the Big Five Personality Test, the Kentucky Inventory of Mindfulness Skills, and Bob Altemeyer's Right-wing Authoritarianism Scale. Related: “Most Personality Quizzes Are Junk Science. I Found One That Isn’t” (FiveThirtyEight). [h/t Chris Zioutas] — : January 17, 2018
Links:
Tags: healthcare
The London Air Quality Network, run by researchers at King's College London, gathers data on levels of nitrogen dioxide, ozone, fine particulate matter, and other pollutants from more than 100 monitoring sites. You can download the data as CSV files (for up to six metric and site combinations at a time) or fetch JSON and XML data from the site’s API. Related: “London air pollution live data – where will be first to break legal limits in 2018?” (The Guardian). Previously: Air quality data from the EPA (DIP 2017.10.04), OpenAQ (DIP 2017.03.29), Berkeley Earth (DIP 2017.03.22), and the World Health Organization (DIP 2016.06.15). [h/t Gavin Freeguard] — : January 17, 2018
Links:
Tags: environment
The Immigration Policies in Comparison (IMPIC) project has quantified the immigration regulations of 33 OECD countries between 1980 and 2010. The project, led by political sociologist Marc Helbling, dives deeply into the regulations related to four policy areas: labor migration, family reunification, asylum/refugees, and “co-ethnics.” You can find the dataset’s detailed codebook and methodology in this PDF. Related: Helbling's summary of the project’s goals, approach, and initial findings (Migration Data Portal). [h/t David Brady] — : January 17, 2018
Links:
Reclaim The Records launched in 2015 and became a 501(c)(3) non-profit last year. Its mission: To “identify important genealogical records sets that ought to be in the public domain but which are being wrongly restricted by government archives, libraries, and agencies.” The organization files freedom-of-information requests and lawsuits to get the data, and “then we digitize everything we win and put it all online for free, without any paywalls or usage restrictions, so that it can never be locked up again.” Most of the records they’ve received so far have arrived as PDFs or microfilm. But a 2016 court settlement with the NYC City Clerk’s Office netted the group — and the public — a dataset of 3 million NYC marriage licenses from 1950 to 1995. — : January 17, 2018
Links:
Scott Cole is a neuroscience PhD student at UC San Diego who, in his spare time, is leading a project to rate the region’s burritos on a 10-dimensional scale. — : January 10, 2018
Links:
Tags: food
The National Water and Climate Center maintains a series of interactive snow maps. Their snow depth map is based on data from nearly one thousand monitoring stations around the country — mostly in western states, but also a handful in the Southwest, Northeast, and Midwest. To download data from a map, click on “Selected Stations” in the top-left corner, and then click “Export Data as CSV.” [h/t Charlie Loyd's collection of "near-realtime Earth observation resources" + Noah Veltman] — : January 10, 2018
Links:
With the help of research assistants, legal historian Jed Shugerman has compiled a “tentative database” of prosecutor politicians — presidents, Supreme Court justices, circuit court justices, governors, state attorneys general, and senators who served as prosecutors earlier in their careers. Shugerman’s spreadsheet goes back to 1880 and lists the dates served in office, political party, other offices held, and “relevant prosecutorial background” for each politician. [h/t Geoff Hing] — : January 10, 2018
Links:
The CDC’s 500 Cities Project provides “city and census tract-level data, obtained using small area estimation methods, for 27 chronic disease measures for the 500 largest American cities.” The metrics range from cancer prevalence to binge drinking to dental health to undersleeping. The latest data release was published in December and covers more than 28,000 Census tracts. [h/t Kate Rabinowitz] — : January 10, 2018
Links:
Tags: disease, healthcare
The Bureau of Ocean Energy Management and the Bureau of Safety and Environmental Enforcement — two of the agencies that replaced the troubled U.S. Minerals Management Service in the wake of the Deepwater Horizon spill — publish a few dozen bulk datasets related to their oversight of offshore drilling operations. Among them: lease owners, production metrics, company details, pipeline permits and locations, incident investigations, and platform structures. Related: “American Idle: Decommissioning costs sink offshore drillers into latest crisis,” a 2017 Debtwire investigation that used the platform data. [h/t Alex Plough] — : January 10, 2018
Links:
“The Khipu Database Project began in the fall of 2002, with the goal of collecting all known information about khipu” — the knotted string textiles used for recordkeeping in the Inca Empire — “into one centralized repository.” The project’s datasets include detailed structural data about hundreds of khipu, as well as an inventory of all known specimens. Related: The College Student Who Decoded the Data Hidden in Inca Knots. — : January 3, 2018
Links:
Tags: miscellaneous
The Census Bureau’s Building Permits Survey collects data from thousands of municipalities every month. For each municipality, metro area, and state, the datasets provide the number of permits issued for new residential housing, the number of housing units authorized, and the total estimated value of the new construction. Previously: The Census Bureau’s Annual Characteristics of New Housing survey (DIP 2016.06.22). [h/t Susie Cambria + Issi Romem] — : January 3, 2018
Links:
Tags: architecture, real estate
Movebank is “a free, online database of animal tracking data hosted by the Max Planck Institute for Ornithology.” On the site’s data map, you can display the animal tracks from particular studies — for instance, the migrations of more than a dozen turkey vultures. Contributing researchers can decide whether to share the underlying data; not all do. (Here’s the data for those vultures, plus six buffalo in Kruger National Park, and seven Venezuelan oilbirds.) [h/t Hari Karthic] — : January 3, 2018
Links:
The Open University Learning Analytics dataset features demographic information about 28,000+ students who, in 2013 and 2014, enrolled in any of seven particular distance learning courses at the UK’s Open University; their final results (distinction, pass, fail, or withdrawn); 173,000+ graded assignments; and 10+ million rows describing each student’s interactions with the courses’ “virtual learning environments.” Useful: The researchers’ academic article describing the dataset. — : January 3, 2018
Links:
Tags: education, technology
The IRS publishes a ton of tax statistics. One of the most interesting portions: data aggregated from individual income tax returns (i.e., Form 1040s), which the IRS provides at the state, county, and ZIP code level. Those datasets’ 100+ fields include details that range from the basic (e.g., the number of tax filings and total income reported) to the more obscure (e.g., the number of returns that included “educator expenses” and the total amount of overpayments refunded). [h/t Cecilia Reyes] — : January 3, 2018
Links:
Tags: taxes
Through a series of surveys, L'Atlante della Lingua Italiana QUOTidiana has been asking Italian speakers what words they use to describe various everyday things. The results for each question can be browsed as maps, or downloaded as XML files. When shown a picture of a watermelon, most respondents wrote “anguria,” but others responded with “cocomero,” “melone,” “citrone,” or “zipangulu.” [h/t Giuseppe Sollazzo] — : December 27, 2017
Links:
As part of a recent investigation, reporters at Reason Magazine used public records law to obtain geospatial data on each of Tennessee's 8,544 drug-free zones. In addition to the geographic boundaries, the shapefile includes each zone’s name and type (school, childcare, park, or library). [h/t CJ Ciaramella] — : December 27, 2017
Links:
The Digital Database for Screening Mammography was first released two decades ago, in 1997. It contains data and images from 2,620 mammographies — a mix of normal, benign, and malignant cases. In a Scientific Data article published last week, a team of Stanford University researchers describe a series of improvements they’ve made to the original database; their Curated Breast Imaging Subset of DDSM has modernized the database’s image formatting, added detailed “region-of-interest” annotations, and converted the metadata into CSV files. — : December 27, 2017
Links:
Tags: healthcare, women
Ships use the internationally-standardized automatic identification systems (AIS) to broadcast their name, speed, direction, and other details. With a bit of radio hardware and software, anyone can collect the signals emitted by nearby vessels. AISHub aggregates AIS data from hundreds of volunteer signal-collectors around the world, and makes that data available via an API and online maps. The Finnish Transport Agency also provides an API of data collected by its AIS stations on the Baltic Sea and other local waters; Denmark’s government publishes free historical data of maritime traffic on Danish waters; and the Coast Guard publishes historical AIS data for U.S. coastal waters (currently only for 2009–2014). [h/t Topi Tjukanov + Miska Knapek] — : December 27, 2017
Links:
The SEC requires Moody’s, Standard & Poor’s, and other “nationally recognized statistical rating organizations” to report their rating assignments and changes (e.g., upgrades, downgrades, withdrawals) going back to 2010. The agencies publish the reports as XBRL-formatted files, and update them monthly. But “because most researchers are unfamiliar with XBRL and cannot easily locate the history files, this valuable resource has seen limited use,” according to the Center for Municipal Finance’s RatingsHistory.info, which now provides the reports as easier-to-use CSVs. [h/t data.world] — : December 27, 2017
Links:
In far northern Norway, the Svalbard Global Seed Vault safekeeps hundreds of millions of seeds, helping to back up the world’s biodiversity. Data on the vault’s deposits, which often contain hundreds of seeds apiece, are available to search and to download. [h/t Enigma Public] — : December 13, 2017
Links:
Tags: plants
For a recent investigation into state legislators’ financial interests, the Center for Public Integrity “analyzed disclosure reports from 6,933 lawmakers holding office in 2015 from the 47 states that required them.” You can search through the disclosures and download the data. For each of the 11,000+ disclosed interests, the dataset includes the lawmaker’s state, legislative body, and district; the name and industry of the financial interest; and a link to the lawmaker’s personal disclosure form. [h/t The Nerds at INN Labs] — : December 13, 2017
Links:
Tags: politics
For several years now, the folks at FiveThirtyEight have been quantifying professional sports teams’ current and historical strength, mostly using Elo rating systems. Their global club soccer ratings go back to 2016, their basketball ratings go back to 1946, their American football ratings go back to 1920, and their baseball ratings go back to 1871. For each of those, the entire histories of match-by-match ratings are available as CSV files. [h/t Jay Boice] — : December 13, 2017
Links:
Tags: history, sports, statistics
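For readers unfamiliar with the Elo systems mentioned in the FiveThirtyEight entry above, the textbook update rule can be sketched in a few lines. This is a generic illustration, not FiveThirtyEight's implementation: the function names are hypothetical, the 400-point scale and `k=20` are conventional defaults chosen here as assumptions, and published systems typically layer sport-specific adjustments on top.

```python
def expected_score(rating_a, rating_b):
    """Probability-like expected score for side A under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a, rating_b, score_a, k=20.0):
    """Return both sides' new ratings after one match.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k (an assumed default) controls how far ratings move per match.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating pool is unchanged.
    return rating_a + delta, rating_b - delta
```

Applying `update` match by match, in date order, produces the kind of running rating history that the CSV files record.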
For an investigation published Monday, Vice News spent “nine months collecting data on both fatal and nonfatal police shootings from the 50 largest local police departments in the United States.” They’ve published raw and standardized data on every shooting, plus the code they used to analyze it. [h/t Allison McCann] — : December 13, 2017
Links:
Last month, the Council on Foreign Relations launched the Cyber Operations Tracker, a database of “publicly known state-sponsored cyber incidents that have occurred since 2005.” The 191 attacks in the database so far have been sponsored by 16 different countries, with China, Russia, and Iran being the most cited. For each incident, the dataset also includes the type of attack (e.g., espionage, data destruction), its name (e.g., “Stuxnet”), a description, the date it occurred, its victims, and the type of response, if any. — : December 13, 2017
Links:
Tags: conflict, technology
An anonymous married couple has decided “to be completely open about [their] finances so that people can see what an actual family’s budget looks like.” In addition to blogging about their financial habits, they’ve also published a spreadsheet of “(almost) every dollar” they spent between December 2015 and November 2017. For each transaction, the dataset provides the date, dollar amount, category (e.g., “Groceries”), and meta-category (e.g., “Food”). — : December 6, 2017
Links:
Atlas da Notícia is a Brazilian project that aims to collect data on all local and regional news outlets in the country. Last month, the project released its first batch of data, which identified 5,354 newspapers and online publications in a total of 1,125 municipalities. The raw dataset is currently only available in Portuguese, but the aggregate tables have been translated into English. [h/t Sérgio Spagnuolo] — : December 6, 2017
Links:
Tags: journalism, media
Back in 2013, four dozen Dartmouth College students agreed to let a custom smartphone app surveil them for the StudentLife Study. During the 10 weeks of the spring academic term, the app collected data on the students’ physical activity, GPS coordinates, eating schedule, sleep habits, phone usage, and more. The study combined all that information with a slew of other data, including the students’ class deadlines, academic performance, and their responses to surveys about stress, depression, personality, and sleep quality. The study’s public (and anonymized) dataset clocks in at 53 gigabytes. Related: “Towards Deep Learning Models for Psychological State Prediction using Smartphone Data: Challenges and Opportunities,” a recently-released academic paper that uses the StudentLife dataset. [h/t Konrad Kording] — : December 6, 2017
Links:
Tags: education
The Consumer Financial Protection Bureau’s consumer complaint database can be searched online, accessed via an API, and downloaded in bulk. The 915,000+ complaints the Bureau has received have been categorized into 18 financial product groups (e.g., mortgages, debt collection, student loans, cryptocurrency) and more than 160 kinds of issues (e.g., billing disputes, communication tactics, privacy). The agency says they “don’t verify all the facts alleged in these complaints,” but that they “take steps to confirm a commercial relationship between the consumer and the company.” [h/t Dan Brady] — : December 6, 2017
Links:
My colleague Lam Thuy Vo obtained an anonymized dataset listing all 170,000+ sexual harassment claims submitted to the U.S. Equal Employment Opportunity Commission between October 1995 and September 2016. For each claim, the dataset indicates the date the complaint was filed, the complainant’s gender, and the general category of employer. Additional fields — available for most claims, but not all — indicate the complainant’s birthdate, race, and national origin, as well as the employer’s industry and approximate number of workers. Related: Lam’s story and interactive graphics, which place the data in context. — : December 6, 2017
Links:
The Aarne-Thompson-Uther Classification of Folk Tales organizes (mostly Indo-European) folktales into groups and hierarchies. As Atlas Obscura’s Cara Giaimo puts it, the ATU is “like the Dewey Decimal System, but with more ogres.” The ATU doesn’t publish any downloadable versions of its data, but researchers studying the “ancient roots” of such stories have built a data-matrix that denotes the presence/absence of the 275 ATU “tales of magic” across 50 Indo-European-speaking populations. [h/t Andrew McCartney] — : November 29, 2017
Links:
Tags: entertainment, language
Since 2014, the California Civic Data Coalition has been working to improve access to CAL-ACCESS, “the jumbled, dirty and difficult government database that tracks campaign finance and lobbying activity in California politics.” Their cleaned-up datasets are updated often and include formats suitable for beginners, “database junkies,” and masochists. Last month, the organization released data files cataloging every state ballot measure and candidate for public office since 2000. [h/t Zack Quaintance] — : November 29, 2017
Links:
ProPublica has published a searchable and downloadable dataset of visitor logs and meeting calendars from five White House agencies: the Office of Management and Budget, the Office of the U.S. Trade Representative, the Office of National Drug Control Policy, the Office of Science and Technology Policy, and the Council on Environmental Quality. ProPublica received the underlying documents from Property of the People, a transparency group that sued the Trump administration to release the records under the Freedom of Information Act. (The administration has not released the White House’s main visitor logs.) Related: Politico has manually compiled a searchable database it calls “The Unauthorized White House Visitor Logs”, based on thousands of known visits, meetings, phone calls, and other presidential interactions. Also related: The Obama administration’s White House visitor logs. — : November 29, 2017
Links:
Missing Pieces is “a yearlong investigation by The Trace and more than a dozen NBC TV stations [that has] identified more than 23,000 stolen firearms recovered by police between 2010 and 2016 — the vast majority connected with crimes.” To support the investigation, the reporters obtained more than 800,000 records of stolen and recovered guns, which they’ve standardized into a single CSV file and supplemented with a data dictionary. The dataset “contains nearly complete stolen-gun records for the states of California and Florida, both of which have centralized collections of gun-theft data,” as well as records from nearly 300 other agencies across the country. Previously: The ATF’s gun trace statistics (DIP 2017.11.08) and firearm background checks (DIP 2015.12.09). [h/t Sarah Ryley] — : November 29, 2017
Links:
The Armed Conflict Location & Event Data Project (ACLED) records the locations, dates, actors, and outcomes of “all reported political violence and protest events in over 60 developing countries in Africa and Asia.” The Africa datasets currently go back to 1997 and cover more than 50 countries. The Asia datasets currently only go back to 2015, but ACLED’s website says it’s planning to add data soon going back to 2010. Both of the datasets are extensively documented, as is the methodology. [h/t Lari McEdward] — : November 29, 2017
Links:
A few years ago, economist Alex Albright and a friend transcribed the plotline-sharing dynamics of Friends’ six friends, across all 236 episodes. In the very first episode (“The One Where Monica Gets a Roommate”), Monica and Rachel each have their own plotline; Rachel and Ross share a plotline; and Chandler, Joey, and Ross share another plotline. Related: Albright’s analysis of the data. — : November 8, 2017
Links:
Tags: entertainment, television
Over at BuzzFeed India, Harsha Devulapalli and Janak Jain have crowned Hyderabad the best city in India for going to the movies, based on their analysis of nearly 600 theaters in eight major cities. The underlying dataset lists each theater’s location, name, average ticket price (where available), number of screens, and number of seats. — : November 8, 2017
Links:
Tags: entertainment, movies
As part of NerdWallet’s recent investigation into Rent-A-Center, “the nation’s largest rent-to-own company,” reporters compiled pricing data for 39 consumer products on rentacenter.com. For each product, the dataset lists the various Rent-A-Center costs (e.g., installment fees for weekly/monthly payment plans, cash prices, et cetera) in each of 48 states and D.C. — plus prices for the same product at standard online retailers. Related: NerdWallet’s analysis of the data. — : November 8, 2017
Links:
Reporters at the Center for Investigative Reporting asked 200+ of the largest Silicon Valley tech companies for their official diversity data. Specifically, the reporters requested each company’s latest EEO-1, the detailed demographic report that every large U.S. employer must submit to the federal government. Only 23 companies shared their data. For those that did, their numbers are now available as a tidy spreadsheet. [h/t Sophie Chou] — : November 8, 2017
Links:
The Bureau of Alcohol, Tobacco, Firearms and Explosives (ATF) helps trace guns — such as those recovered at crime scenes by law enforcement agencies — back to their original manufacturers, wholesale distributors, dealers, and purchasers. Each year, ATF publishes a range of datasets based on these gun traces. The datasets for 2016 provide state-by-state tallies of gun caliber, state of original purchase, possessors’ age, associated crime, and more. Related: “Gun Laws Stop At State Lines, But Guns Don’t,” from FiveThirtyEight, using the data. Also related: “How a Gun Trace Works,” from The Trace. Previously: Firearm background checks (DIP 2015.12.09), which my colleague Peter Aldhous analyzed last week, finding that gun sales did not spike after the Las Vegas shooting. — : November 8, 2017
Links:
“Jane Goodall drew the attention of a global audience with vivid depictions of the personalities of eastern chimpanzees (Pan troglodytes schweinfurthii) at Gombe National Park, yet only one attempt [in 1973] has been made to quantify these personality traits systematically,” writes a team of researchers in the latest issue of Scientific Data. To remedy the situation, the researchers paid field observers to score 128 Gombe chimpanzees on 24 personality traits — “dominant,” “excitable,” “helpful,” “sensitive,” and more — on a seven-point scale. — : November 1, 2017
Links:
Tags: animals
Researchers at the University of Washington’s Institute for Health Metrics and Evaluation have estimated cardiovascular mortality rates for each U.S. county, for every year between 1980 and 2014. The findings, based on 32 million de-identified death records, population data from the Census, and other sources, are also broken down by particular disease (e.g., aortic aneurysm, ischemic stroke) and gender. Related: The researchers’ JAMA article describing their methodology and findings. Previously: The Global Burden of Disease dataset, published by the same institute (DIP 2016.07.27). [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle] — : November 1, 2017
Links:
For years, the National Oceanic & Atmospheric Administration has been working to assess the damage done to natural resources by the April 2010 Deepwater Horizon explosion and oil spill. As part of that effort, they’ve collected and compiled several dozen related datasets, including toxicity studies, plankton samples, necropsies of stranded turtles, dolphin health assessments, and a “backyard boater” survey. [h/t Sebastian Kraus] — : November 1, 2017
Links:
Tags: disaster, environment, water
Rebecca Zisser and Lazaro Gamio at Axios have compiled a timeline of alleged sexual assaults by Harvey Weinstein, Bill O'Reilly, Roger Ailes, Donald Trump, and Bill Cosby. For each of the 140+ cases recorded as of Oct. 20, the timeline indicates the year of the assault, the year the victim came forward (if they did), and the year of any legal settlement (if there was one). The underlying data is available as a spreadsheet. [h/t Mike Allen] — : November 1, 2017
Links:
The U.S. Federal Judicial Center’s “Integrated Data Base” contains a longitudinal record of all federal criminal, civil, and appellate court cases going back to the 1970s, as well as bankruptcy cases going back to late 2007. Each dataset contains dozens of detailed fields — including each case’s jurisdiction, name, docket number, relevant legal statutes, and more — accompanied by explanatory codebooks. You can download single-year snapshots and cumulative files, or interactively select specific slices of data to export. Related: “How the Bankruptcy System Is Failing Black Americans,” an investigation by ProPublica that used the IDB’s data on bankruptcy cases for its analysis. — : November 1, 2017
Links:
Tags: law
ConceptNet “is a freely-available semantic network, designed to help computers understand the meanings of words that people use.” It defines approximately 28 million “statements,” i.e., relationships between various things. For instance, ConceptNet indicates that a newsletter is a type of “report”, and that a computer can be used to “send email”. You can download the entire dataset, or access it via an API. — : October 18, 2017
Links:
Tags: language, technology
In the wake of the Second Vatican Council in the 1960s, Sister Marie Augusta Neal conducted an enormous opinion survey of Catholic “women religious.” More than 130,000 sisters responded to the 649 multiple-choice-question survey — the results of which the University of Notre Dame recently cleaned up and made available online. [h/t Kevin Schlottmann] — : October 18, 2017
Links:
The U.S. Patent and Trademark Office publishes a huge amount of bulk data, including detailed XML files that contain information about millions of patent/trademark applications, assignments, trials, and appeals. The agency also publishes a collection of “research datasets”, which distill those bulk XML files into easier-to-use tabular data. [h/t Rachael Tatman] — : October 18, 2017
Links:
Tags: business, technology
University of Michigan–based researchers have created “a repository of micro-level, subnational event data on armed conflict and political violence around the world.” The project, dubbed xSub, standardizes information from 21 data sources, and includes conflicts in 139 countries between 1942 and 2016. For each administrative boundary (e.g., country, province, district) and data source, xSub’s data counts the number of violent incidents by year, month, week, or day. The numbers are also broken down by the sides involved, who initiated the conflict, and what types of force were used. [h/t Andy Halterman] — : October 18, 2017
Links:
Tags: conflicts
Since shortly after Hurricane Maria hit Puerto Rico, the territory’s government has been publishing a dashboard of recovery statistics. The website tracks a couple dozen metrics, including the percent of homes with electricity, number of people in shelters, and the number of open hospitals. For several of the main metrics, researcher Michael A. Johansson has been scraping daily figures from the dashboard and publishing them as a CSV file. Related: The Washington Post has been charting the recovery, and published a deep dive into the island’s ongoing power outages. — : October 18, 2017
Links:
Tags: disaster, statistics
Carnegie Mellon’s Motion Capture Database provides data files and videos representing humans performing various activities: shaking hands, drinking soda, exchanging “angry hand gestures,” doing cartwheels, mopping floors, laughing, chicken-dancing, and oh-so-much more. [h/t John Emerson] — : October 11, 2017
Links:
Tags: science
The U.S. Geological Survey has been measuring water quality in the San Francisco Bay for nearly 50 years. The agency recently published 210,826 of these measurements, collected from dozens of monitoring stations between April 1969 and December 2015. (It’s “one of the longest records of water-quality measurements in a North American estuary,” according to a recent academic article describing the data.) Each row specifies the measurement’s date, station, depth, temperature, and salinity; many rows include levels of chlorophyll, oxygen, nitrate, ammonium, and other matter. — : October 11, 2017
Links:
Tags: environment, water
The Federal Motor Carrier Safety Administration helps to regulate the United States’ large trucks and passenger buses. The datasets available through its Safety Measurement System include a census of all regulated carriers, the results of safety inspections, and reported crashes. The crash files list the number of injuries and fatalities; the weather, light, and road conditions; the involved vehicle’s VIN and license plate number; and more. [h/t Dan Brady] — : October 11, 2017
Links:
The Crowd Counting Consortium, launched earlier this year, is a volunteer effort to “[collect] publicly available data on political crowds reported in the United States, including marches, protests, strikes, demonstrations, riots, and other actions.” The team publishes monthly spreadsheets that list each crowd’s date, location, type, and cause (e.g., “Oppose removal of confederate statue”); high and low size estimates; the number of reported arrests and injuries; links to sources; and additional details. Related: The project’s main coordinators have been summarizing their findings on the Washington Post’s Monkey Cage blog. [h/t Amanda L. James] — : October 11, 2017
Links:
“Monitoring Trends in Burn Severity (MTBS) is an interagency program whose goal is to consistently map the burn severity and extent of large fires across all lands of the United States”; the most recent release contains more than 20,000 fires from 1984 to 2015. You can explore the data online, or download it in bulk. For more recent data, see GeoMAC, which aims to map all current wildfires; NOAA’s Hazard Mapping System, which uses satellites to detect fire locations and smoke plumes; and NASA’s MODIS and VIIRS datasets, which provide satellite-based detections for the entire globe. Previously: National Fire Incident Reporting System, which also includes structure fires and vehicle fires (DIP 2016.07.20). [h/t Max Joseph] — : October 11, 2017
Links:
Tags: disaster, environment
For a new interactive essay at The Pudding, Ash Ngu analyzed the gender composition of This American Life episodes. To support the findings, Ngu has published the underlying data, extracted from the show’s transcripts. Among the data extracted: the number of words spoken by each person in each act of each episode. — : October 4, 2017
Links:
Tags: gender, journalism
In certain cities, private developers can earn zoning concessions by converting sections of their properties into plazas, atriums, mini-parks, and other open-to-the-public spaces. You can download datasets of these “privately owned public spaces” in San Francisco, Seattle, New York City, and — thanks to a recent collaboration between Guardian Cities and a local community group — London. Related: A guide to NYC’s POPS. [h/t Reddit user seeriktus + Ed Vine] — : October 4, 2017
Links:
Tags: mapping
Media Cloud, a collaboration between MIT and Harvard–based researchers, describes itself as “an open-source platform for studying media ecosystems.” The project lets you track topics and keywords across thousands of sources — including mainstream news publications in the U.S. and many other countries — at both a story and sentence level. You can access Media Cloud’s data via its dashboard or its API. Both require (free) registration. Related: “The Media Really Has Neglected Puerto Rico,” by Dhrumil Mehta at FiveThirtyEight; the analysis uses data from Media Cloud, the TV News Archive, and Google Trends. Also related: The geometry of hurricane coverage, as told through the front pages of The New York Times and Washington Post. — : October 4, 2017
Links:
Tags: journalism
Last week, the National Institutes of Health released a dataset containing more than 100,000 anonymized chest x-rays, from 30,000 patients, “including many with advanced lung disease.” For each image, the associated metadata includes the patient’s age, gender, and diagnosis labels. (The dataset’s authors used natural language processing to extract those labels from radiological reports; they estimate that fewer than 10% of the labels are incorrect.) Related: Andrew L. Beam’s list of medical datasets for machine learning. [h/t Chris Hamby] — : October 4, 2017
Links:
Tags: healthcare
The Environmental Protection Agency collects air quality samples from thousands of monitoring stations across the country. The resulting datasets, which go back to the 1980s, are available as daily files, annual files, and via an API. The monitored pollutants include ozone, carbon monoxide, sulfur dioxide, nitrogen dioxide, particulate matter, volatile organic compounds, and more. You can also download daily Air Quality Index ratings and information about each monitoring station. Previously: Global air pollution datasets from Berkeley Earth (DIP 2017.03.22) and from the World Health Organization (DIP 2016.06.15). [h/t Swier Heeres] — : October 4, 2017
Links:
Tags: environment
For each of 966 occupations, the Department of Labor’s O*NET database quantifies the types of knowledge, skills, abilities, education, and training required, as well as the tasks involved, tools used, and other job-related parameters. Related: The Upshot uses the data to ask (and answer), “What Is Your Opposite Job?” — : September 27, 2017
Links:
Tags: statistics
New York City’s Department of Transportation publishes a bunch of data, including its own assessments of each street segment’s quality on a 1-to-10 scale. It also publishes spreadsheets of all construction-related street closures, by intersection and by block, updated daily. [h/t Christian Moscardi] — : September 27, 2017
Links:
Tags: mapping, transportation
The UK’s Ordnance Survey makes detailed digital maps of Great Britain. Their free offerings include all of the island’s roads, rivers, green spaces, and place names. The Survey’s “open map” includes buildings, railways, electricity transmission lines, and other features. Related: Want only the buildings? The University of Sheffield’s Alasdair Rae has you covered. [h/t Robyn Inglis] — : September 27, 2017
Links:
Tags: architecture, mapping
The TV News Archive’s new “Third Eye” project is extracting chyrons — those placards of text at the bottom of news broadcasts, also known as “lower thirds” — from four major cable networks: BBC News, CNN, Fox News, and MSNBC. The resulting database contains every chyron that Third Eye’s optical character recognition (OCR) software has extracted since late August. Related: This Washington Post piece analyzing cable news’ chyrons during James Comey’s congressional testimony, and this explanation of how they did it. [h/t Nancy Watzman] — : September 27, 2017
Links:
Tags: journalism, language, media
Earlier this month, the FBI and 18F released the first iteration of their Crime Data Explorer, a website that simplifies access to the FBI’s Uniform Crime Reporting program. You can download bulk data on individual incidents, state and national trends, hate crimes, arrests, assaults on officers, police employees, human trafficking, and cargo theft. You can also access the data via an API. Caution: The FBI’s data collection program is voluntary; not all law enforcement agencies participate. (In fact, more than 3,000 agencies don’t submit hate crime data.) [h/t Nick Wright] — : September 27, 2017
Links:
Tags: crime
xkcd, the popular “webcomic of romance, sarcasm, math, and language,” provides an interface for grabbing data about each comic strip, including the title, image file, date of publication, easter-egg-y “alt” text, and transcript. [h/t Karl L. Hughes] — : September 20, 2017
Links:
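A minimal sketch of reading that comic-strip interface from Python, assuming the webcomic is xkcd and that its JSON endpoints live at `https://xkcd.com/info.0.json` (latest comic) and `https://xkcd.com/<number>/info.0.json` (a specific comic). The field names used below (`title`, `img`, `year`, `month`, `day`, `alt`, `transcript`) are assumptions about the response, and the helper names and sample payload are hypothetical, so check a live response before relying on them.

```python
import json
from datetime import date
from urllib.request import urlopen


def comic_url(num=None):
    """Return the assumed JSON endpoint for comic `num`, or the latest if None."""
    base = "https://xkcd.com"
    return f"{base}/info.0.json" if num is None else f"{base}/{num}/info.0.json"


def parse_comic(payload):
    """Pull the fields the entry mentions out of one comic's JSON text."""
    raw = json.loads(payload)
    return {
        "title": raw["title"],
        "image": raw["img"],
        # Date parts may arrive as strings, so convert defensively.
        "published": date(int(raw["year"]), int(raw["month"]), int(raw["day"])),
        "alt": raw["alt"],
        "transcript": raw.get("transcript", ""),
    }


def fetch_comic(num=None):
    """Download and parse one comic (needs network access)."""
    with urlopen(comic_url(num)) as response:
        return parse_comic(response.read().decode("utf-8"))


# Made-up payload, shaped like the assumed response, for offline experiments.
SAMPLE = (
    '{"num": 1, "title": "Example", "img": "https://example.com/a.png", '
    '"year": "2017", "month": "9", "day": "20", "alt": "hover text", '
    '"transcript": ""}'
)
```

Calling `parse_comic(SAMPLE)` exercises the parsing without touching the network; `fetch_comic()` does the same against the live site.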
Earlier this year, Politico reporters scoured the internet’s WHOIS records for domains registered to the Trump Organization. They found thousands, including TrumpRussia.com, No2Trump.com, Trumpublican.net, and ImBeingSuedByTheDonald.com. (Most, including those, just send readers to a generic “domain parking” landing page.) Politico has open-sourced the article’s components, including a JSON file containing 1,267 of the domains, which includes each domain’s owner, creation date, last-updated date, and expiration date. [h/t Tyler Fisher] — : September 20, 2017
Links:
Tags: Trump, technology
The Democracy Fund Voter Study Group, “a research collaboration comprised of nearly two dozen analysts and scholars from across the political spectrum,” has published the participant-level data from its 2016 VOTER survey. It’s a “unique longitudinal data set” that represents the “political attitudes, values, and affinities” of 8,000 American adults who were interviewed first in December 2011, then again before and after the 2012 election, and again in December 2016. [h/t Jenny Listman] — : September 20, 2017
Links:
Tags: elections
After major natural disasters, NOAA’s National Geodetic Survey routinely collects detailed aerial photos of the affected areas. For each disaster — including Hurricane Harvey, Hurricane Irma, and a couple dozen others — you can download the full set of (georeferenced) images, by date and survey flight. [h/t David Yanofsky] — : September 20, 2017
Links:
Tags: disaster
The U.S. Federal Communications Commission publishes a ton of data on the “wireline” telecommunications industry, including several datasets about broadband internet access. Among them: the places where providers offer service, subscriptions per 1,000 households in each Census tract, and a survey of plans available in urban areas. You can also find a spreadsheet of payphones-by-state at the bottom of that landing page. (As of last March, there were only 113 payphones left in North Dakota, down from 705 in 2008.) Related: “Signs of Digital Distress,” a new Brookings Institution report, with findings and maps based on the broadband subscription data. — : September 20, 2017
Links:
Tags: technology
The “robust and curated” Global Wood Density Database contains more than 16,000 entries, culled from scientific literature, websites, and unpublished scholarship. The densest so far is a Caesalpinia sclerocarpa from Mexico, weighing in at 1.39 grams per cubic centimeter. Related: The TRY database of “curated plant traits” (free registration required). [h/t Amy Zanne] — : September 13, 2017
Links:
Tags: plants
When companies file reports to the U.S. Securities and Exchange Commission, they do so through the SEC’s EDGAR system. The SEC makes those filings available online, and it uses EDGAR’s server logs to analyze web traffic to the site. The SEC’s EDGAR Log File Data Set contains a set of CSVs — one for each day between February 14, 2003 and December 31, 2016 — extracted from those server logs. For each document visited, the data includes the visitor’s unique-but-obfuscated IP address, the date and time of the visit, the IDs of the document and associated company, and some information about the visitor’s browser. [h/t Brian C. Keegan] — : September 13, 2017
Links:
The Internet Archive has pumped footage from CNN, Fox News, MSNBC, and the BBC through software trained to recognize the faces of Donald Trump and majority/minority leaders of the U.S. House and Senate. The result: Face-O-Matic, a dataset released to the public last week. For each face the software found, the dataset includes the network, program, date, time, duration, and a link to the footage on the TV News Archive. Since mid-July, Face-O-Matic has logged more than 50,000 sightings. [h/t Nancy Watzman] — : September 13, 2017
Links:
Tags: Trump, journalism, politics
Two weeks ago, DIP featured Case-Shiller’s home price index data. There are, in fact, several other prominent (and downloadable) house price indices, including the Federal Housing Finance Agency’s House Price Index, the National Association of Realtors’ indices, and Zillow’s Home Value Index. Helpful: This guide to various home price indices and how they’re constructed, by Jed Kolko, formerly Trulia’s chief economist. Related: This critique of Case-Shiller’s approach, also by Kolko. — : September 13, 2017
Links:
Tags: real estate
The Dartmouth Flood Observatory’s Global Archive of Large Flood Events contains data about 4,500+ floods, dating back to 1985. It’s updated often, and is available in Excel, XML, HTML, and geospatial formats. The variables include each flood’s location, timespan, severity, main cause, and estimated impact. The organization also publishes detailed maps of the “maximum observed flooding” for specific disasters, such as for Hurricane Harvey and for Hurricane Irma. Related: A Science Magazine mini-profile of the DFO and its founder. Previously: U.S. tide gauges and flood observations (DIP 2016.03.23), UK coastal flooding (DIP 2017.08.09), and FEMA flood risk maps (DIP 2017.08.30). — : September 13, 2017
Links:
Earlier this month, The New York Times asked readers to rate 50 of Game of Thrones’ most recognizable characters along two dimensions: good ↔ evil, and ugly ↔ beautiful. They’ve received 190,000+ submissions. The results are accessible as two JSON files: one for the averages and another for the distributions. — : August 30, 2017
Links:
Favicons are the little square icons in your browser’s tabs, placed there by the websites you’ve loaded. Two recent projects attempted to collect these markers from the web’s million most-trafficked domains. One, by programmer Colin Morris, collected 360,000 favicons in July 2016. The second, by researchers at ETH Zurich, collected 548,00 favicons in April 2017. Semi-related: Morris’s “Finding bad flamingo drawings with recurrent neural networks”; the analysis uses Google’s 50-million-doodles data, featured in DIP 2017.05.04. — : August 30, 2017
Links:
Tags: art, technology
The Federal Reserve Bank of St. Louis publishes S&P/Case-Shiller Home Price Index data, which measures changes in average home prices over time. The monthly-updated datasets — copyrighted, but free to download — are available at a national and metro-area level, and go back several decades. — : August 30, 2017
Links:
Tags: real estate
The Mapping Inequality project has digitized more than 150 of the “security maps” produced by the Home Owners' Loan Corporation between 1935 and 1940. Together, the maps “offer a view of Depression-era America as developers, realtors, tax assessors, and surveyors saw it — a set of interlocking color-lines, racial groups, and environmental risks.” To download the data for a given map, click on the cloud icon in the top-right corner. Related: A new research paper, by economists at the Federal Reserve Bank of Chicago, uses the data to quantify redlining’s lasting effects. Also related: The New York Times’ summary of the data and research. [h/t Kendall Taggart] — : August 30, 2017
Links:
FEMA’s Flood Map Service Center publishes geospatial files that detail the agency’s flood risk assessments — both current and historical. The maps include flood zones, levee locations, “base flood elevations,” and more. Helpful: FEMA’s technical documentation. Related: “Why Houston Isn’t Ready for Harvey,” published last week by ProPublica and The Texas Tribune; and “Hell and High Water,” the reporting team’s deep dive on Houston last year. Previously: The most comprehensive global dataset of cyclone paths (DIP 2017.04.19). — : August 30, 2017
Links:
In the 1970s, a team of linguistic investigators canvassed the globe, armed with boxes of color chips. They sought out a couple dozen native speakers of 110 unwritten languages, and asked: What do you call these colors? The results are available online. Related: This Vox video provides context. — : August 23, 2017
Links:
The congressionally-established National Endowment for the Humanities publishes a dataset of all of the grants it has awarded since the late 1960s. On the same page, you can download a file describing the organization’s 25,000+ “evaluators” — “knowledgeable persons outside NEH who are asked for their judgments about the quality and significance” of proposed projects. [h/t Brett Bobley + Max Kemman] — : August 23, 2017
Links:
Tags: aid, art, entertainment
The Database of State Incentives for Renewables & Efficiency “is the most comprehensive source of information on incentives and policies that support renewables and energy efficiency in the United States.” The database, which was founded in 1995 and is funded by the Department of Energy, includes tax rebates, solar energy buybacks, building standards, and more. You can download the data in several formats, or browse and search it online. [h/t Carol Brotman White] — : August 23, 2017
Links:
Last year, the British government began requiring companies to identify all the people who exert power over them. The resulting “People with Significant Control” database contains each person’s name, country of residence, nationality, and “nature of control” — e.g., ownership of large numbers of shares, voting rights, or the ability to appoint/remove directors. [h/t Enigma Public] — : August 23, 2017
Links:
Tags: business
A team of researchers has compiled “the largest ever geo-coded database of anophelines in Africa.” (Anophelines are the only kind of mosquito that transmits malaria.) The database covers 1898 to 2016 and includes more than 13,400 observations of mosquitoes in specific locations. For each observation, the dataset lists the country, administrative region(s), and latitude/longitude, as well as the time period, the species identified, the sampling method, and the source of the information. [h/t Michael Chew] — : August 23, 2017
Links:
Robin Sloan, author of Mr. Penumbra’s 24‑Hour Bookstore, has a new book coming out next month — one that he believes “is the first novel in English to feature, as a main supporting character, a possibly-sentient sourdough starter.” To dole out advance copies of the book, Sloan conducted the following contest: Try to choose the smallest prime number that nobody else will pick. Now he’s posted the results — a CSV listing the number of contestants who chose each prime number. (Seventeen was the most popular number among the contest’s 1,354 entries; the smallest unique prime was 409.) — : August 16, 2017
Links:
Tags: miscellaneous
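The contest's scoring rule can be sketched in a few lines of Python: given how many contestants chose each prime, the winner is the smallest prime chosen exactly once. The CSV layout below (columns named `prime` and `entries`) and the sample counts are assumptions made for illustration; the real file's column names and numbers may differ.

```python
import csv
import io
from typing import Dict, Optional


def load_counts(text: str) -> Dict[int, int]:
    """Read 'prime,entries' rows into a {prime: number_of_contestants} map."""
    reader = csv.DictReader(io.StringIO(text))
    return {int(row["prime"]): int(row["entries"]) for row in reader}


def smallest_unique(counts: Dict[int, int]) -> Optional[int]:
    """Return the smallest prime picked by exactly one contestant, if any."""
    unique = [prime for prime, n in counts.items() if n == 1]
    return min(unique) if unique else None


def most_popular(counts: Dict[int, int]) -> int:
    """Return the prime picked by the largest number of contestants."""
    return max(counts, key=counts.get)


# Hypothetical sample data, not the actual contest results.
SAMPLE_CSV = """prime,entries
2,5
3,4
5,1
7,2
11,1
"""

counts = load_counts(SAMPLE_CSV)
print(smallest_unique(counts))  # 5
print(most_popular(counts))     # 2
```

In the sample, both 5 and 11 were chosen once, so 5 wins as the smaller of the two.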
The Energy Information Administration’s Petroleum Supply Monthly contains detailed data about how the United States obtains crude oil and petroleum products, and where that supply goes. In May, for instance, the U.S. refined nearly 314 million barrels of “finished motor gasoline” and exported 18.6 million barrels of it. — : August 16, 2017
Links:
The Open Access Series of Imaging Studies (OASIS) project is “aimed at making MRI data sets of the brain freely available to the scientific community,” with the goal of “[facilitating] future discoveries in basic and clinical neuroscience.” So far, the project has published two collections: a cross-sectional dataset of scans from 416 people, ages 18 to 96; and a longitudinal dataset, based on 150 people aged 60 to 96, each of whom was scanned at least two different times. [h/t Andrew Beam] — : August 16, 2017
Links:
Tags: healthcare
“The U.S. government has prosecuted 808 people for terrorism since the 9/11 attacks. Most of them never even got close to committing an act of violence.” Those are the findings of The Intercept’s Trial and Terror database, first published in April and most recently updated last week. The underlying data — available on GitHub — contains each defendant’s name and demographic details, as well as each case’s description, status, charges, charge date, conviction date (if convicted), jurisdiction, and more. — : August 16, 2017
Links:
UPDATE: VOA News used this data for Terror on Trial: The Imam’s Choice.
Chronicling America — a project run by the Library of Congress and the National Endowment for the Humanities — provides information about more than 150,000 historic newspapers and access to digitized pages from many of them. Its API lets you search the database and doesn’t require registration; its bulk data includes text from more than 12 million pages. For instance, here’s the Omaha Daily Bee’s front page on April 7, 1917, the day after the U.S. entered World War I. [h/t Ed Summers] — : August 16, 2017
Links:
Tags: history, journalism
California’s Department of Industrial Relations publishes a dataset of all licensed talent agencies, with each agency’s name, address, license number, workers’ comp insurer, and bond issuer. Florida publishes something similar. Previously: Texas’s licensed professionals (DIP 2015.12.09). — : August 9, 2017
Links:
Tags: business, entertainment
Earlier this year, the researchers behind SurgeWatch.org published an updated version of their database of UK coastal floods. They combined tidal gauge data with reports from scientific journals, newspapers, and social media to identify 329 “coastal flooding events” that occurred between 1915 and 2016. For each event, the dataset includes the date, region, and severity level, which ranges from 1 (“nuisance”) to 6 (“disaster,” applied to only one event — the North Sea flood of 1953). — : August 9, 2017
Links:
The Atlas of Pidgin and Creole Language Structures contains data on 76 languages, such as Trinidad English Creole, Afrikaans, Guadeloupean Creole, and Singapore Bazaar Malay. For each language, the dataset includes information about 130 “structural features,” example sentences, and more. Previously: The World Atlas of Language Structures (DIP 2016.01.06) and a database of the Trans-New Guinea language family (DIP 2015.11.04). [h/t Rachael Tatman] — : August 9, 2017
Links:
Tags: language
The federally funded Freight Analysis Framework “integrates data from a variety of sources to create a comprehensive picture of freight movement among states and major metropolitan areas by all modes of transportation.” For each year between 2012 and 2015, the database “provides estimates for tonnage (in thousand tons) and value (in million dollars) by regions of origin and destination, commodity type, and mode.” Last week, Axios published an interactive map of the state-to-state flows for each commodity group, as well as some helpful caveats and “head-scratchers.” [h/t Chris Canipe] — : August 9, 2017
Links:
The USDA National Nutrient Database for Standard Reference is the primary source for most of the food nutrition facts you see in America. The database assesses more than 8,000 foods, from abiyuch to zwieback, and provides the average nutrient levels per 100 grams — e.g., protein, carbohydrates, vitamin D, caffeine, lycopene, and water. North of the border, you can find the (bilingual) Canadian Nutrient File. It’s based on the USDA data, but excludes stateside foods “known not to be on the Canadian market”, adds some foods (such as poutine and ptarmigan), and makes adjustments based on “Canadian levels of fortification and regulatory standards.” The United Kingdom has its own nutrient file, as do many other countries. [h/t Reddit user Alacritous] — : August 9, 2017
Links:
Tags: food
The New York Philharmonic has published three spreadsheets listing its subscribers — including where they sat, how much they paid, and where they had their tickets sent — for a slew of orchestral seasons between 1883 and the late 1990s. The earliest data includes names, too. (“Miss A. Brown” of 715 Fifth Avenue seems to have been a big fan, having subscribed to 26 seats for the 1890-91 season.) Previously: The Philharmonic’s performance history (DIP 2016.10.12). [h/t Rachel Shorey] — : August 2, 2017
Links:
Tags: entertainment, music
The Seattle Public Library publishes a dataset of every checkout of every physical item (e.g., paperback books and DVDs, but not e-books) since April 2005. It currently contains more than 90 million rows. Previously: The library’s monthly checkout counts, by title (DIP 2017.03.01). [h/t David Christensen] — : August 2, 2017
Links:
Tags: books
The California Department of Education publishes aggregate scores on the AP, SAT, and ACT high-school tests for each county, district, and school going back to the late 1990s. One hitch: For more than two months, the 2016 AP data “contained 350,000 more tests than had actually been taken,” according to inewsource.org’s Megan Wood, who spotted the discrepancies (and others) and got the department to fix them. Similar datasets are available from other states, including Texas, Florida, and Pennsylvania. Bonus: inewsource.org has also published easy-to-search tables of the California AP, SAT, and ACT scores. — : August 2, 2017
Links:
Tags: education
The EU publishes a searchable database of people and organizations registered to lobby the European Parliament and the European Commission. The website LobbyFacts.eu takes that data and makes it available via an API. LobbyFacts also scrapes the European Commission’s disclosed lobbying meetings, which you can download here (warning: 10-megabyte direct download). Related: You can also explore the lobbyists and meetings via IntegrityWatch.eu, which uses LobbyFacts’ data. Previously: U.S. government lobbyists (DIP 2017.05.31). [h/t Enigma Public + Xavier Dutoit] — : August 2, 2017
Links:
Tags: politics
After Malaysia Airlines flight MH370 disappeared in March 2014, the Australian government undertook an enormous seafloor-mapping operation in search of the lost Boeing 777. Last month, it released data from the first phase of the project, which collected 278,000 square kilometers of bathymetry (i.e., seafloor topography) measurements. “In general, the world's deep oceans have had little investigation,” the government explains in an interactive map. “Only 10 to 15 percent of the ocean has been mapped with the sonar technology similar to that used in the search for MH370.” As a result, the MH370 search area “is now among the most thoroughly mapped regions of the deep ocean on the planet.” [h/t Soh Kam Yung] — : August 2, 2017
Links:
Data Stories is a podcast about data visualization, hosted by Enrico Bertini and Moritz Stefaner. To celebrate their recently-published 100th episode, the hosts released a spreadsheet detailing the date, title, number and genders of guests, length, and timestamped subchapters of each episode so far. Related: Christian Laesser’s visualization of the data. [h/t Benjamin Cooley] — : July 26, 2017
Links:
Tags: audio, media, statistics
During the course of its Enron investigation, the Federal Energy Regulatory Commission obtained the emails of approximately 150 (mostly high-ranking) Enron staff. You can find versions of the dataset — cleaned, deduplicated, and restructured in various ways — hosted by Carnegie Mellon, UC Berkeley, and Duke Law. Related: “What the Enron Emails Say About Us,” published by The New Yorker last week. Nathan Heller writes: The Enron archive “remains one of the country’s largest private e-mail corpora turned public. Its lasting value is less as an account of Enron’s daywork than as a social and linguistic data pool, a record of the way we write online when we’re not preening for the public eye.” — : July 26, 2017
Links:
As the basis for his recent study, “Is Running Enough? Reconsidering the Conventional Wisdom about Women Candidates” (paywalled, but a draft is freely available), PhD candidate Peter Bucchianeri compiled a dataset of female candidates in House primary elections from 1972 to 2010. The spreadsheet covers 1,242 candidacies, and includes each candidate’s party, votes garnered in the primary and general elections, the seat’s incumbency status, the district’s demographics, and more. — : July 26, 2017
Links:
NBC News has been tracking the president’s visits to his own luxury properties. For each day since Trump took office, the data — available to download at the bottom of the page — tells you which properties he visited and whether any were golf courses. Since February, Trump has visited his properties roughly 10 days a month, including 25 trips to Mar-a-Lago and 42 trips to his golf courses. Related: A similar tracker from The New York Times. [h/t Rachel Schallom] — : July 26, 2017
Links:
The Densho Digital Repository is an archive of oral histories, photographs, newspaper clippings, and other primary sources relating to the internment of Japanese Americans during World War II. Among the materials: several datasets listing people sent to the internment camps, based on official government records. The largest dataset contains more than 100,000 entries and includes details such as each internee’s “relocation” site, arrival date, hometown, birth year, time spent in Japan, marital status, religion, educational degrees, occupation, and military service. The National Archives hosts the raw data, as well as its documentation. — : July 26, 2017
Links:
The National Park Service and Geyser Observation and Study Association have been using water-temperature sensors to track the eruption times of dozens of geysers in Yellowstone — Old Faithful, of course, but also Beehive, Little Squirt, and Narcissus. GeyserTimes.org combines this data with historical logbooks and observations from “geyser gazers” to form what it describes as “the most comprehensive database of geyser eruption and observation data on the internet.” — : July 19, 2017
Links:
The International Monetary Fund’s World Economic Outlook Database contains the fund’s projections for future “national accounts, inflation, unemployment rates, balance of payments, fiscal indicators, trade for countries and country groups” and commodity prices. (They predict that farm-bred Norwegian salmon will cost $6.79/kg in 2022.) The database also contains historical observations for many of the economic indicators back to 1980. [h/t David Mihalyi] — : July 19, 2017
Links:
Each September, the United Nations gathers for its annual General Assembly. Among the activities: the General Debate, a series of speeches delivered by the UN’s nearly 200 member states. The statements provide “an invaluable and, largely untapped, source of information on governments’ policy preferences across a wide range of issues over time,” write a trio of researchers who, earlier this year, published the UN General Debate Corpus — a dataset containing the transcripts of 7,701 speeches from 1970 to 2016. The researchers have also published an online tool for exploring and visualizing the dataset. Previously: UN General Assembly votes since 1946 (DIP 2016.07.13). [h/t Ronny Patz] — : July 19, 2017
Links:
Tags: United Nations
Last week, a team at NYU announced “the world’s densest urban aerial laser scanning (LiDAR) dataset” — a 1.4-billion-point description of Dublin’s city center. They write: “At over 300 points per square meter, this is more than 30 times denser than typical LiDAR data and is an order of magnitude denser than any other aerial LiDAR dataset.” The researchers collected the topographical data during a series of criss-crossing flyovers on March 26, 2015. They’ve also published a short, illustrative video. Previously: LiDAR datasets (DIP 2016.05.25) and 3D models (DIP 2017.04.05) of cities and countries around the world. [h/t Darrell Etherington] — : July 19, 2017
Links:
Tags: mapping
You’ve probably seen The Washington Post’s solar eclipse graphics from last Monday. The stellar maps are largely based on an online tool that uses data from NASA's Five Millennium Canon of Solar Eclipses. The tool can (among other things) generate maps and KMZ files describing the paths of the 11,898 solar eclipses Earth will have experienced between 2000 BCE and 3000 CE. Helpful: NASA’s key to understanding the data terminology. — : July 19, 2017
Links:
Tags: science
The recently-launched Tweets Of Congress is collecting and publishing daily archives of tweets by congressional representatives, caucuses, and committees. Meanwhile, the Trump Twitter Archive has collected more than 30,000 of @realDonaldTrump’s tweets, which you can search and download. — : June 28, 2017
Links:
The Internal Revenue Service publishes a file listing all “organizations eligible to receive tax-deductible charitable contributions” — currently more than 1 million charities, private foundations, and other groups. (Not all nonprofits apply for, or receive, tax-exempt status from the IRS; but all tax-exempt organizations are nonprofits.) Previously: Annual IRS 990 filings, in bulk (DIP 2016.06.22). [h/t Norbert Krupa + Derek Willis] — : June 28, 2017
Links:
Tags: taxes
The National Association of Realtors publishes monthly real estate inventory data “at the national level, the 500 largest metropolitan areas, the 1,000 largest counties, and over 15,000 zip codes.” The data, based on the realtors’ multiple listing services, goes back five years and “tracks key market metrics including list prices, days on market, and total active inventory.” As of early June, six counties — Manhattan, plus five in California — had median listing prices above $1 million. Previously: The Census Bureau’s Annual Characteristics of New Housing (DIP 2016.06.22), international house prices (DIP 2017.02.08), millions of mortgages (DIP 2015.12.30), and millions more mortgages (DIP 2017.03.15). [h/t Reddit user bbekks] — : June 28, 2017
Links:
Tags: real estate
OpenSNP is a website that lets people publish the results of their genetic tests (such as those sold by 23andMe, deCODEme, and FamilyTreeDNA), “find others with similar genetic variations, [get] the latest primary literature on their variations, and help scientists find new associations.” Since 2012, users have uploaded more than 3,000 sets of genetic variants, which you can download individually or in bulk or access via OpenSNP’s API. Users can also list various personal traits, such as eye color, height, coffee consumption, and lactose intolerance. Useful primer: SNP stands for “single nucleotide polymorphism,” the NIH explains. They’re “the most common type of genetic variation”; each one “represents a difference in a single DNA building block, called a nucleotide.” — : June 28, 2017
Links:
The European Centre for Disease Prevention and Control’s Surveillance Atlas of Infectious Diseases lets you browse, map, and download data on the historical incidence of several dozen diseases — from anthrax to Zika — in each of the European Economic Area’s countries. Related: Keila Guimarães’s recent investigation into penicillin shortages, which uses the Centre’s data on syphilis cases. — : June 28, 2017
Links:
Tags: disease
The Florida Department of Corrections’ public database contains a table describing current and released inmates’ tattoos. That data includes each tattoo’s location (e.g., “right arm,” “stomach,” “face”) and description (“cross,” “tribal,” and “skull” being the most common). Helpful: Dan Nguyen’s guide to converting the database into SQLite and CSV files. Related: Recent analyses by The Economist and by The Palm Beach Post. — : June 21, 2017
Links:
The Manifesto Project has collected and coded more than 4,000 electoral manifestoes from more than 1,000 political parties in more than 50 countries between 1945 and 2015. For each manifesto, the project’s dataset indicates whether the document expresses support for/against dozens of policies and attitudes, including “market regulation,” a “national way of life”, “environmental protection,” and “anti-imperialism.” You can also browse the manifestoes online. Caveat: The dataset is subject to a somewhat restrictive usage policy. [h/t The Quartz Directory of Essential Data] — : June 21, 2017
Links:
Libraries.io monitors “over 2.4m unique open source projects, 25m repositories and 85m interdependencies between them.” Last week, the site released its first bulk dataset, which describes each project’s metadata, published versions, and dependencies on other software libraries. [h/t Nadia Eghbal] — : June 21, 2017
Links:
Tags: technology
“Created by USAID in 1985 to help decision-makers plan for humanitarian crises,” the Famine Early Warning Systems Network (FEWS NET) “provides evidence-based analysis on some 34 countries.” As part of its work, FEWS NET publishes geospatial shapefiles that score each country’s “most likely food security outcome” on a standardized scale: Minimal, Stressed, Crisis, Emergency, and Famine. Previously: Global food prices (DIP 2017.05.17). [h/t Melissa Segura] — : June 21, 2017
Links:
Tags: disaster
“Police pull over more than 50,000 drivers on a typical day, more than 20 million motorists every year. Yet the most common police interaction — the traffic stop — has not been tracked, at least not in any systematic way,” according to the Stanford Open Policing Project. To that end, the group has been collecting and standardizing traffic-stop data from state police agencies across America. Its first data release, published Monday, contains 130 million records from 31 states. The records vary by agency, but the most-complete states include the date, time, location, reason, and outcome of each stop; the driver’s race, gender, and age; whether a search was conducted; and whether the search found contraband. Related: The project’s findings so far. Previously: Raw traffic stop data from a smaller number of states (DIP 2015.10.28). — : June 21, 2017
Links:
The Los Angeles City Controller has released a map of the city’s openly-operating medical marijuana businesses. You can access a spreadsheet of the 191 dispensaries that comply with Proposition D, which the city passed in 2013. Additionally, you can find hundreds of (active and inactive) dispensaries by filtering the city’s business registrations to those whose primary NAICS category is listed as “medical marijuana collective.” [h/t Zack Quaintance] — : June 14, 2017
Links:
Tags: drugs, healthcare
ResistoMap is an interactive visualization of antibiotic drug resistance, based on more than 1,500 bacteria genome samples from people’s intestinal tracts. The data behind the visualization is available to download. It’s partly based on two prior datasets: McMaster University’s Comprehensive Antibiotic Resistance Database (“a bioinformatic database of resistance genes, their products and associated phenotypes”) and the University of Gothenburg’s BacMet (“an easy-to-use bioinformatics resource of antibacterial biocide- and metal-resistance genes”). [h/t Carlos Somohano] — : June 14, 2017
Links:
Tags: disease, healthcare
The Census Bureau’s Survey of Business Owners and Self-Employed Persons “provides the only comprehensive, regularly collected source of information on selected economic and demographic characteristics for businesses and business owners by gender, ethnicity, race, and veteran status.” The most recent data comes from 2012. The survey has been conducted every five years since 1972, but data from before 1992 is “available only in printed form.” Related: “30% Of The Black-Owned Businesses In New York Disappeared In 5 Years,” by my colleague Cora Lewis. — : June 14, 2017
Links:
Tags: business
Last week, the University of Virginia School of Law launched an expanded version of its Corporate Prosecution Registry. The revamped database includes “detailed information about every federal organizational prosecution since 2001, as well as deferred and non-prosecution agreements with organizations since 1990” — more than 3,000 cases so far. Previously: Good Jobs First’s Violation Tracker (DIP 2015.11.11). [h/t Tom Jackman] — : June 14, 2017
Links:
Oyez.org bills itself as, among other things, “a complete and authoritative source for all of the [Supreme] Court’s audio since the installation of a recording system in October 1955.” The site has an API and releases all its material — including timestamped transcripts of oral arguments — under a Creative Commons license. At least two GitHub repositories have aggregated the transcripts and make them easy to bulk-download. For each segment of audio, the transcripts list the start/end time, the speaker, and the text. Related: PuppyJusticeAutomated, a YouTube channel that (a) must be seen to be understood and (b) uses the Oyez API. Previously: CourtListener (DIP 2016.04.13) and The Supreme Court Database (DIP 2016.02.23). [h/t Walker Boyle + Reddit user 21cannons] — : June 14, 2017
Links:
The San Francisco Public Utilities Commission’s Beach Water Quality Monitoring Program measures bacteria levels at fifteen locations on the city’s shoreline. You can download the measurements by clicking the “raw data” link below this map. The data powers the (unsurprisingly) unofficial @BeachPooBot account on Twitter. [h/t Reddit user cavedave] — : June 7, 2017
Links:
Researchers at Google took a semi-random sample of 9,473 Reddit threads, containing 116,347 comments in total. Then, they paid people to categorize each comment by its “discourse act” — e.g., whether it was a question, answer, announcement, agreement, humor, et cetera. The result is Coarse Discourse, “a dataset for understanding online discussions.” [h/t Roberto Bayardo] — : June 7, 2017
Links:
In January 2015, the Occupational Safety and Health Administration began requiring U.S. employers to report “all severe work-related injuries, defined as an amputation, in-patient hospitalization, or loss of an eye.” You can download a spreadsheet of these injuries — some 20,000 in 2015 and 2016 combined. It contains the injury dates, descriptions, and outcomes, as well as the employers’ names and locations. Previously: OSHA’s more detailed (but slightly more cumbersome) inspection data and API (DIP 2016.07.13). [Clarification, 2017-06-07/2017-06-14: The dataset reflects “federal OSHA states only.” It excludes “injuries in state plans,” which cover private sector employees in 21 states.] — : June 7, 2017
Links:
Tags: injury
Before Donald Trump began flying on Air Force One, he rode a fleet of private aircraft. Reporters at Bloomberg used the Freedom of Information Act to obtain flight records for three major components of that fleet — a “Boeing 757 with gold-plated seatbelt buckles, known as Trump Force One during the campaign; a Cessna 750 Citation X jet; and a Sikorsky helicopter”. For each of the more than 1,500 flights taken between August 2010 and November 2016, the dataset contains the date, time, and airport of both the departure and arrival. Trump wasn’t necessarily aboard each of those flights; the dataset does not contain passenger information. Related: Bloomberg’s analysis/maps of the data. Also related: The Washington Post used the data to estimate the flights’ CO2 emissions. — : June 7, 2017
Links:
Tags: Trump, transportation
ORCID is a nonprofit organization that provides unique identifiers for researchers — mostly scientists so far — to make it easier to distinguish between them. It has issued more than 3 million IDs so far, and provides annual bulk downloads of all researchers’ public profiles. In many cases, the researchers have supplied their education and employment histories. That enabled Science magazine to analyze the migrations of more than 110,000 researchers who’ve listed multiple countries in these public CVs. (The data and code underlying the analysis are also available to download.) [h/t Shaun Coffey] — : June 7, 2017
Links:
Tags: science
You might have seen New York City’s bubble map of dog names. It turns out that the underlying dataset — which includes the name, gender, age as of 2015, breed, and borough of more than 110,000 dogs — is available on GitHub. You can also download slightly older, but more detailed data from WNYC’s Dogs of NYC project. That data includes each dog’s coat colors, whether it had been spayed/neutered, and its ZIP code. Related: Similar pet license data from Tacoma, Wash., and Edmonton, Canada. [h/t Alex P. Miller + Dan Nguyen] — : May 31, 2017
Links:
Tags: animals
Aswath Damodaran — a professor of finance at NYU’s business school — maintains a trove of data on per-sector financials, including effective tax rates, return on equity, and working capital ratios by industry. For most datasets, Damodaran publishes both current and historical versions. [h/t Tim McGovern] — : May 31, 2017
Links:
A team of researchers at the Boston University School of Public Health has collected data on the presence/absence of 133 different types of firearm laws in each U.S. state, for each year between 1991 and 2016. The legal provisions are grouped into 14 categories, such as background checks, “Stand Your Ground” laws, and child access prevention. You can download a spreadsheet of the data, and also browse state-by-state summaries. Previously: The Correlates of State Policy Project (DIP 2016.07.06). — : May 31, 2017
Links:
Tags: guns
U.S. lobbyists must notify Congress within 45 days of being retained by new clients. Every quarter after that, they’re required to file activity reports that detail the agencies they lobbied, the topics they covered, and the income they earned. Bulk downloads of both types of reports are available as XML files from the House (going back to 2004) and from the Senate (since 1999). Although they receive the same filings, each chamber “follows different data-cleaning, processing, and editing procedures before storing the data,” according to this recent GAO report. — : May 31, 2017
Links:
Tags: government
Last week at BuzzFeed News, we shared a vast trove of federal payroll data. Those records — provided by the Office of Personnel Management through the Freedom of Information Act — cover more than 40 years and millions of employees. The dataset includes salaries, titles, job types, and demographic variables. In many-but-not-all cases (per OPM’s data release policies), it also includes names. Previously, federal payroll data had been searchable online, but very little was available in downloadable, analysis-friendly formats. Also: Many states – including New York, California, Florida, New Jersey, Minnesota, Arkansas, South Carolina, and Washington – proactively make payroll data available for download. (Some cities, such as Chicago, do, too.) — : May 31, 2017
Links:
Tags: government, money
The website CraftCans.com publishes a database of 2,000+ canned beers. For each beer, the database lists its name, style, brewery, size, alcohol level, and bitterness. The website doesn’t provide a direct download, but — as Jean-Nicholas Hould points out — you can basically just copy-paste the website’s data into your favorite spreadsheet program. Or, if you want something slightly cleaner, you can use this script. Related: This data-profiling tutorial by Hould, which uses the data. Also related: RateBeer.com’s API, but you’ll need to request a developer key to use it. Plus: This interactive graphic, which uses the RateBeer data to explore America’s microbrew epicenters. And also: Official brewery production stats from the U.S. Alcohol and Tobacco Tax and Trade Bureau. [h/t Daniel Brady] — : May 24, 2017
Links:
Tags: alcohol
Google is clever: It created a drawing game, got 15 million people to play it, and then turned those doodles into a public dataset of people drawing. You can download the raw data, or just browse the doodles online. — : May 24, 2017
Links:
Tags: art, technology
When the malware program known as “WannaCry” hit hundreds of thousands of computers earlier this month, it demanded that the computers’ owners pay $300 in Bitcoin — or lose all of their data. Keith Collins at Quartz has been using Blockchain’s API to track Bitcoin payments to the three digital wallets that the hackers designated to receive the ransoms. He’s published the data and is also using it to power a Twitter bot. Related: “Victims of the WannaCry ransomware attacks have stopped paying up” and “Inside the digital heist that terrorized the world—and only made $100k,” both by Collins. Previously: Historical Bitcoin prices (DIP 2017.03.08). — : May 24, 2017
Links:
Tags: crime, technology
The Profiles of Individual Radicalization in the United States (PIRUS) database “contains deidentified individual-level information on the backgrounds, attributes, and radicalization processes of nearly 1,500 violent and non-violent extremists who adhere to far right, far left, Islamist, or single issue ideologies in the United States” — including the Ku Klux Klan, the Taliban, and the Animal Liberation Front, among others. The dataset covers 1948 through 2013 and was released earlier this year by a team at the University of Maryland. [h/t Lorand Bodo] — : May 24, 2017
Links:
Last week, the Library of Congress released its largest dataset ever: nearly 25 million records for books, maps, manuscripts and other items in its online catalog. For each item, the data includes standardized bibliographic information, such as the title, author, publication date, and genre. (The dataset represents the online catalog as it was in 2013; more recent data will cost you.) Related: A bit of background about the library’s MARC (Machine Readable Cataloging Records) data format. — : May 24, 2017
Links:
Tags: books, technology
“The WikiPlots corpus is a collection of 112,936 story plots extracted from English language Wikipedia.” The plots describe movies, books, plays, TV series, TV episodes, video games, and other stories — essentially, anything that has a Wikipedia article with the word “plot” in one of its subheadings. Related: “Examining the arc of 100,000 stories: a tidy analysis” and “Gender and verbs across 100,000 stories: a tidy analysis,” two blog posts by David Robinson that use the data. — : May 17, 2017
Links:
The Chicago Sun-Times has obtained and published an August 2016 copy of the Chicago Police Department’s “Strategic Subject List,” a database that scores nearly 400,000 (unnamed) people on a scale from 10 to 500, based on an algorithm that attempts to estimate their risk of being involved in gun violence (either as a shooter or a victim). The database includes demographic, geographic, criminal history, and other information about the people it ranks. “But the database doesn’t indicate — and the police won’t say — how much weight is given to each factor in computing the scores, which are produced using an algorithm developed at the Illinois Institute of Technology,” according to the Sun-Times. — : May 17, 2017
Links:
How might rising sea levels affect coastal flooding? A new-ish NOAA Technical Report, published in January, combines historical data on global sea levels with “regional factors contributing to sea level change for the entire U.S. coastline.” The result: Localized projections under six sea-level rise scenarios, ranging from “low” to “extreme.” You can download the data (at the bottom of this page) or explore it on a map. Related: Climate Central describes what NOAA’s “extreme” scenario could mean for America (including more maps and calculations). Previously: Tide gauge data (DIP 2016.03.23) and sea ice measurements (DIP 2016.09.14). [h/t Susie Cambria] — : May 17, 2017
Links:
The UN World Food Programme’s vulnerability analysis group collects and publishes food price data for more than 1,000 towns and cities in more than 70 countries. The dataset, which goes back more than a decade, covers basic staples, such as wheat, rice, milk, oil, and more. It’s updated monthly and feeds into (among other things) the UNWFP’s price-spike indicators. Related: The Humanitarian Data Exchange, which hosts the dataset for the UN. Also: The Economist’s Big Mac Index. [h/t Andrew McCartney] — : May 17, 2017
Links:
The James Martin Center for Nonproliferation Studies publishes what it calls “the first database to record flight tests of all missiles launched by North Korea capable of delivering a payload of at least 500 kilograms a distance of at least 300 kilometers.” The database currently contains 107 missile tests — starting with North Korea’s first, launched in April 1984, to its latest, launched Sunday morning. For each test, the data includes the missile’s launch site, highest altitude, distance travelled, landing location, success/failure, and other details. [h/t Ian Greenleigh] — : May 17, 2017
Links:
Grad students in Princeton’s computer science department have published a dataset they call Self-Annotated Reddit Corpus, or “SARC” for short. “The corpus has 1.3 million sarcastic statements — 10 times more than any previous dataset,” the authors write, and takes advantage of Reddit users’ habit of tagging sarcastic comments with an “/s”. Related: A dataset of sarcastic Amazon reviews. [h/t Carlos Somohano + Reddit user cavedave] — : May 10, 2017
Links:
Tags: language, social media
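The “/s” convention makes self-tagged sarcasm easy to spot in text. The sketch below shows one simple way to flag such comments in Python. It is an illustration of the idea only, not the SARC authors' actual filtering method, and the sample comments are invented.

```python
def is_self_tagged_sarcasm(comment: str) -> bool:
    """Return True if the comment's final whitespace-separated token is "/s"."""
    words = comment.strip().split()
    return bool(words) and words[-1].lower() == "/s"

# Invented examples for illustration.
comments = [
    "Oh great, another Monday. /s",
    "The meeting moved to 3pm.",
    "Use s/old/new/ in sed to replace text.",
]

sarcastic = [c for c in comments if is_self_tagged_sarcasm(c)]
print(sarcastic)
```

Checking only the final token avoids false positives from strings such as sed expressions, at the cost of missing comments where the marker appears mid-text.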
The National Science Foundation’s Survey of Doctorate Recipients “is a longitudinal biennial survey conducted since 1973 that provides demographic and career history information about individuals with a research doctoral degree in a science, engineering, or health (SEH) field from a U.S. academic institution.” You can download aggregated data and detailed survey responses going back to 1993. The next release is scheduled for this month. Related: The NSF has published an interactive graphic of the data. [h/t Peter Aldhous] — : May 10, 2017
Links:
Groceries-on-demand startup Instacart has released a dataset containing 3 million orders from 200,000 (anonymized) users. “For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order,” the company’s head of data science writes. “We also provide the week and hour of day the order was placed, and a relative measure of time between orders.” Here’s the data dictionary. — : May 10, 2017
Links:
Tags: food
Last month, ProPublica and Consumer Reports published an analysis of car insurance costs in four states, finding that “some major insurers charge minority neighborhoods as much as 30 percent more than other areas with similar accident costs.” The reporters also published a detailed methodology and dataset supporting their findings. The dataset contains company-by-company insurance premiums for a (hypothetical) college-educated, excellent-credit, accident-free 30-year-old woman in each of 6,261 ZIP codes in the four states — California, Texas, Missouri, and Illinois. The dataset also includes several years of average (per-car) insurance payouts for each ZIP code, which the reporters obtained from state insurance commissioners. Related: The insurance industry's rebuttal and ProPublica's counter-rebuttal. — : May 10, 2017
Links:
There are about 700 miles of official fencing between the U.S. and Mexico, covering about one-third of the full border. The Department of Homeland Security doesn’t provide structured spatial data about the fence’s path. But, thanks to a Texas law professor’s FOIA and some serious elbow grease, reporters at Reveal have created “the most detailed border fence map publicly available.” For each segment of fence, Reveal’s dataset includes the fence type (i.e., pedestrian, vehicle, or unknown), the government’s name for the segment, and the project through which the segment was built. — : May 10, 2017
Links:
Tags: immigration, mapping
For April Fools, Reddit launched a million-pixel canvas called “r/place.” Users could place a single-pixel tile, in one of 16 colors, anywhere on the canvas — but only every five minutes. By the end of r/place’s 72-hour lifetime, Redditors had placed 16.5 million tiles on the canvas, likely making it “the largest collaborative art project in history.” Last week, Reddit published the entire history of the canvas as structured data. [h/t Felipe Hoffa] — : April 26, 2017
Links:
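A placement history like this can be replayed to reconstruct the canvas at any moment: apply the tiles in time order and let later tiles overwrite earlier ones. Here is a minimal Python sketch. The record layout (timestamp, x, y, color index) is an assumption for illustration and may not match the columns Reddit published.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (timestamp, x, y, color index from 0 to 15).
Placement = Tuple[int, int, int, int]

def replay(placements: Iterable[Placement]) -> Dict[Tuple[int, int], int]:
    """Apply placements in timestamp order; the latest tile at each pixel wins."""
    canvas: Dict[Tuple[int, int], int] = {}
    for _, x, y, color in sorted(placements):
        canvas[(x, y)] = color
    return canvas

# Invented sample: pixel (0, 0) is painted twice, out of order in the input.
sample = [(30, 0, 0, 5), (10, 0, 0, 2), (20, 1, 0, 7)]
final_canvas = replay(sample)
print(final_canvas)
```

Stopping the loop at a chosen timestamp would give a snapshot of the canvas at that point rather than the final state.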
The CDC has been running its National Survey of Family Growth since 1973. For the first three decades, it surveyed only women ages 15-44. Starting in 2002, it began also surveying men. The latest survey was conducted in 2013-15, when it collected data from 10,205 residents about sexual activity and contraception, pregnancy and infertility, marriage and divorce, adoption, parenting, and more. [h/t Allen B. Downey] — : April 26, 2017
Links:
Tags: family, healthcare, women
For each of India’s 36 states and Union Territories, the country’s latest National Family Health Survey includes 114 metrics, such as the percentages of “households using iodized salt” and “men who have comprehensive knowledge of HIV/AIDS.” Unfortunately, the government publishes the reports only as PDFs. But the Hindustan Times has extracted the data for the survey’s eight “women’s empowerment and gender based violence” metrics, including the percentages of “ever-married women who have ever experienced spousal violence” and “women having a bank or savings account that they themselves use.” They’ve published that data as a spreadsheet and used it to construct an interactive Women Empowerment Index. [h/t Gurman Bhatia] — : April 26, 2017
Links:
You’re probably familiar with the Google Books Ngram Viewer, which lets you chart word and phrase frequencies over time. Google publishes the underlying data but those files can (depending on your tools and goals) be cumbersomely large. Here’s an alternative: DIP reader (and former colleague) Chris Wilson has condensed the overall frequencies for 87,000 words — those found in the CMU Pronouncing Dictionary — into a svelte, four-megabyte file. Related: BYU’s advanced interface to the Google Books data. Also related: “The Pitfalls of Using Google Ngram to Study Language” (Wired, 2015). And also: “Somewhere at Google there is a database containing 25 million books and nobody is allowed to read them” (The Atlantic, 2017). — : April 26, 2017
Links:
Tags: language
The U.S. National Park Service publishes a ton of data about visitors to its parks, historic sites, memorials, preserves, and more. Among them: Visitors per park (annually since 1904, and monthly since 1979), overnight stays by type of lodging (tents, RVs, backcountry, etc.), and traffic. Related: “The National Parks Have Never Been More Popular” (FiveThirtyEight, 2016). [h/t Jack King] — : April 26, 2017
Links:
Tags: environment, statistics
For a 2012 academic paper, researchers captured the keystrokes of paid volunteers as they typed descriptions of images. Whenever a participant used the backspace key to correct a word, the researchers added it to a dataset of self-corrections. Each of the 44,000 lines in the English-language version of the dataset contains the original mistake and the correction. The most common change was in → on. Other common fixes included waling → walking and pople → people. [h/t Seth Stephens-Davidowitz] — : April 19, 2017
Links:
Tags: language
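Finding the most common fixes in a file of mistake/correction pairs is a counting exercise. The Python sketch below assumes one tab-separated pair per line, which is a guess at the layout rather than the dataset's documented format; the sample lines reuse the corrections mentioned above.

```python
from collections import Counter

# Assumed layout: "mistake<TAB>correction" on each line.
sample_lines = [
    "in\ton",
    "waling\twalking",
    "in\ton",
    "pople\tpeople",
]

# Count each (mistake, correction) pair.
pairs = Counter(tuple(line.split("\t")) for line in sample_lines)
top = pairs.most_common(1)

for (mistake, correction), count in pairs.most_common():
    print(f"{mistake} -> {correction}: {count}")
```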
The USDA’s Plant Hardiness Zone Map “is the standard by which gardeners and growers can determine which plants are most likely to thrive at a location.” The USDA and Oregon State, which have jointly developed the map, previously sold access to the underlying data through a vendor. But after the vendor shut down earlier this year, OSU began publishing the data free of charge (though with some licensing restrictions). The dataset is available as detailed shapefiles and as ZIP code–based spreadsheets. [h/t Waldo Jaquith + Lynn Cherny] — : April 19, 2017
Links:
Tags: agriculture, plants
Through its International Best Track Archive for Climate Stewardship project, the National Oceanic and Atmospheric Administration publishes what it calls “the most complete global set of historical tropical cyclones available.” For each tropical cyclone — a category that includes typhoons, hurricanes, tropical depressions, and more — the dataset includes its position, wind speed, central pressure, and classification at six-hour intervals. The dataset is updated annually and includes some historical cyclones from as early as 1842. [h/t Daniel Miller] — : April 19, 2017
Links:
The CDC’s National Center for Immunization and Respiratory Diseases collects and publishes state-by-state vaccination rates for infants, kindergartners, teens, and adults — plus, flu vaccination rates for several age groups. Each dataset includes several years’ worth of data, with many going back to 2008 or 2009. Related: “California Shows The Rest Of The Country How To Boost Kindergarten Vaccination Rates,” by my colleague Peter Aldhous, with additional county-level data from the Golden State. Previously: International vaccination rates and policies (DIP 2016.08.03). — : April 19, 2017
Links:
Tags: healthcare
The UK’s National Health Service publishes monthly data on drugs prescribed in England through the country’s single-payer health care system. (Drugs prescribed in Scotland, Wales, or Northern Ireland aren’t included.) For each prescriber-and-drug combination, the dataset includes the quantity and cost of prescriptions for each month since August 2010. The US publishes similar data about prescriptions issued through Medicare, but only on an annual basis and currently only covering 2013 and 2014. Related: ProPublica’s Prescriber Checkup, which uses the Medicare data to examine doctors’ prescribing patterns. Previously: A decade-plus of Australian prescription data (DIP 2016.08.24). [h/t Adam Crahen] — : April 19, 2017
Links:
Tags: drugs, healthcare
Comic books make use of white space — or gutters — to propel the story forward, relying on readers’ intuitive ability to fill in the gaps between panels. To see whether computers could learn to make the same inferences, a group of computer scientists built a giant corpus of public-domain comics and tried training a series of neural networks on it. (Spoiler: Humans are much better at this.) The underlying dataset contains 1.2 million panels from nearly 200,000 scanned pages of nearly 4,000 books in the Digital Comic Museum, all published during the 1938–1954 “Golden Age” of American comics. It also contains 2.5 million chunks of text extracted from the comics’ speech balloons, thought bubbles, and narration boxes. [h/t Robin Sloan] — : April 12, 2017
Links:
Researchers at the World Health Organization have assembled a dataset of international aid — both from official government assistance and private grants — devoted to reproductive, maternal, newborn, and child health from 2003 to 2013. The dataset, which the researchers described in a recent academic article, draws on 2.1 million records, and is based largely on the OECD’s Creditor Reporting System. Related: Earlier this month, the U.S. State Department cut all its funding for the UN's family planning agency; it was the agency’s third-largest donor. — : April 12, 2017
Links:
Sci-Hub, which describes itself as “the first pirate website in the world to provide mass and public access to tens of millions of research papers,” recently released a list of the 62,835,101 academic papers it has collected. That dataset identifies each paper only by its DOI — a short, unique ID. Helpfully, graduate student Bastian Greshake has extracted the journal name, publisher, and publication year from those DOIs. Greshake has also combined that data with six months of Sci-Hub download data (previously featured in DIP 2016.05.04), and analyzed the datasets together. Among his findings: Both are “largely made up of recently published articles, with users disproportionately favoring newer articles and 35% of downloaded articles being published after 2013.” — : April 12, 2017
Links:
Tags: education
The Environmental Protection Agency publishes fuel efficiency data on all the car models it has tested, going back to the 1980s… minus all the Volkswagen, Audi, and Porsche diesels caught cheating. The data typically includes three estimates: for city driving, highway driving, and a city-highway combination. — : April 12, 2017
Links:
Every four years, Congress publishes United States Government Policy and Supporting Positions, better known as the Plum Book. The 2016 version, which is available as both PDF and Excel files, identifies more than 8,000 executive and legislative branch jobs subject to “noncompetitive appointment.” Those positions include 1,710 presidential appointments, which are as wide-ranging as the ambassadorship to Afghanistan and the directorship of the Occupational Safety and Health Administration’s Whistleblower Protection Program. Related: For positions requiring its confirmation, the Senate publishes XML files of pending, confirmed, and withdrawn nominees. — : April 12, 2017
Links:
Tags: government, politics
In a peer-reviewed paper published last week, a trio of University College London researchers describe their Global Avian Invasions Atlas. The dataset includes information on “971 species, introduced to 230 countries and administrative areas across all eight biogeographical realms, spanning the period 6000 BCE – AD 2014.” — : April 5, 2017
Links:
Tags: animals
The National Science Foundation publishes data on all of the grants the agency has awarded since the 1970s (and some earlier ones, too). Each grant is represented as an XML file, which contains information about the project, the awardee, and the NSF division that awarded the grant. [h/t France A. Córdova] — : April 5, 2017
Links:
Tags: science
Bruegel, “a European think tank that specialises in economics,” publishes a quarterly-updated dataset quantifying sovereign bond holdings for 12 countries: Belgium, Finland, France, Germany, Greece, Ireland, Italy, Netherlands, Portugal, Spain, the U.K., and the United States. For each country, the dataset tells you what proportion of the federal government’s bonds are held by each of five types of owners: the country’s central bank, other public institutions, domestic banks, other domestic investors, and foreign investors. [h/t @CoolDatasets] — : April 5, 2017
Links:
Tags: economics, government, money
Yasuyuki Aono, an associate professor at Osaka Prefecture University, has collected the historical flowering dates of Kyoto’s Prunus jamasakura cherry trees going all the way back to the 9th century. The dataset is based on “many diaries and chronicles written by Emperors, aristocrats, [governors] and monks,” Aono writes. The dates are those “on which cherry blossom viewing parties had been held or full flowerings had been observed.” Over the past century, Kyoto’s cherry trees have been blooming earlier and earlier. Related: @bbgblossoms, a Twitter bot that tracks the status of the Brooklyn Botanic Garden’s 152 cherry trees. [h/t Eric Steig] — : April 5, 2017
Links:
Tags: plants
In 2014, the NYC Department of Information Technology & Telecommunications conducted a massive aerial survey of the city. Then, they converted the images and data they collected into a three-dimensional model of every building in all five boroughs. Related: In December, The New York Times used the data to map the city’s shadows. Also related: Berlin, the Hague, and Lyon offer digital 3D models of their cities, too. Previously: LiDAR-powered elevation data from around the world (May 25, 2016). [h/t Dan Nguyen] — : April 5, 2017
Links:
Tags: mapping, technology
The anonymously-published DNS Census 2013 “is an attempt to provide a public dataset of registered domains and DNS records” — essentially the Internet’s phone book. The dataset, which has also been uploaded to the Internet Archive, includes 2.7 billion Domain Name System records and 106,928,034 distinct domains, organized by extension (e.g., .com, .info, .edu). RIP, certificationcommissionforhealthcareinformationtechnology.biz. [h/t Andrew Ferlitsch] — : March 29, 2017
Links:
Tags: technology
After last week’s item on Berkeley Earth’s real-time air quality data, reader Olaf Veerman pointed me to OpenAQ. The open-source project currently gathers pollution data from nearly 5,500 locations in 47 countries, aggregated “from real-time government and research grade sources.” You can download the data via OpenAQ’s API. [h/t Olaf Veerman] — : March 29, 2017
Links:
Tags: environment, statistics
The Federal Deposit Insurance Corporation publishes a spreadsheet of failed banks for which the agency has been appointed as a receiver — some 550 banks since October 2000. It also provides short descriptions of each bank failure. The most recent: Proficio Bank of Cottonwood Heights, Utah, which closed on March 3. More on the FDIC’s receivership program here. — : March 29, 2017
Links:
Late last year, the FDA began publishing a dataset of “adverse events” that have been reported to its Center for Food Safety and Applied Nutrition. The database currently covers January 2004 through December 2016, and includes reports of (suspected) bad reactions to foods, dietary supplements, and cosmetics. For instance, the first row names a particular brand of chocolate chips as the potential culprit in the hospitalization of a two-year-old girl, whose symptoms included a rash, swelling face, cough, and difficulty breathing. Previously: FDA adverse event data for pharmaceutical drugs (May 18, 2016). [h/t Sheila Hagar + Drew Ivan] — : March 29, 2017
Links:
Tags: food
The Stockholm International Peace Research Institute’s Military Expenditure Database is based on official reports, International Monetary Fund yearbooks, newspaper articles, and other sources. It covers most major countries since the 1950s and more than 100 countries since 1988. The dataset also quantifies military spending on a per-capita basis, as a share of the country’s GDP, and as a proportion of total government spending. Also: The Defense Manpower Data Center publishes spreadsheets detailing the number of active and reserve U.S. personnel stationed in each state, territory, and foreign country. Previously: SIPRI’s database of international arms transfers (Nov. 18, 2015). [h/t K.K. Rebecca Lai, Troy Griggs, Max Fisher and Audrey Carlsen] — : March 29, 2017
Links:
NOAA Fisheries’ Greater Atlantic Region publishes spreadsheets of the federal permits it awards to fishing vessels, operators, and dealers. For each vessel, the data includes the boat’s name, owner, principal port city, length, horsepower, and categories of fish permitted. The agency’s Southeast Regional Office also publishes lists of its permits — for shark dealers, domestic swordfish dealers, spiny lobster tailing, and more — but as HTML tables with no CSV-export option. [h/t J. Albert Bowden II] — : March 22, 2017
Links:
The Census’ Value of Construction Put in Place Survey “provides monthly estimates of the total dollar value of construction work done in the U.S.” For instance, construction spending in 2016 totaled approximately $1.1 trillion, $89 billion of which went to education-related construction. The survey has been collected monthly since 1964; historical data files are available going back to 1993. [h/t Kevin Gilmore] — : March 22, 2017
Links:
Tags: architecture
Five states in India, representing nearly 250 million residents — Punjab, Uttar Pradesh, Uttarakhand, Goa, and Manipur — have already held legislative assembly elections this year. India’s Election Commission publishes these results, but only as webpages. A couple of Hyderabad-based developers have scraped the website, and published CSVs of the data on GitHub. Previously: Data Is Plural’s election edition (Sept. 28, 2016). — : March 22, 2017
Links:
To accompany its 2016 and 2017 budget proposals, the Obama administration published machine-readable copies on GitHub. Each proposal’s data are divided into three CSV files: for budget authority, outlays, and receipts. The accompanying user guide explains the data sources and structure. Sample tidbit: The White House expected the Department of Homeland Security to pull in $712 million in excise taxes from the Oil Spill Liability Trust Fund in 2017. [h/t Dan Nguyen] — : March 22, 2017
Links:
Tags: government
The team at Berkeley Earth has released the data files behind their real-time global air quality map. The map and data track measurements of pollution particles smaller than 2.5 microns in diameter. “Under typical conditions,” the Berkeley Earth team writes, this particulate matter “is the most damaging form of air pollution likely to be present, contributing to heart disease, stroke, lung cancer, respiratory infections, and other diseases.” Previously: The World Health Organization’s Global Urban Ambient Air Pollution Database (June 15, 2016). — : March 22, 2017
Links:
To prepare for an exhibition last year, the National Archives and Records Administration created a dataset of more than 11,000 constitutional amendment proposals introduced in Congress between 1787 and 2014. [h/t Justin Lewis] — : March 15, 2017
Links:
The Windy City publishes two datasets on traffic violations. One tallies the daily number of speeding violations in each Children’s Safety Zone; the other, red-light violations at each camera-surveilled intersection. Both go back to July 2014. The city also publishes a spreadsheet of city-towed vehicles. Related: The Chicago Tribune’s long-running investigation into the city’s traffic camera troubles. [h/t Jacob Sheff] — : March 15, 2017
Links:
Tags: crime, transportation
Freddie Mac — the government-sponsored, publicly traded company also known as the Federal Home Loan Mortgage Corporation — publishes data on 23 million single-family home mortgages it has originated or guaranteed since 1999. The dataset includes the loan amount and interest rate, the borrower’s credit score, the property type (e.g., condo, co-op, manufactured housing), metro area, first payment month, whether the borrower is a first-time homebuyer, and lots more. Freddie Mac requests that you register before downloading the data, but you can also access the files directly. Don’t miss the terms and conditions, which prohibit republishing the files. Previously: Data on millions more loans from the Home Mortgage Disclosure Act (Dec. 30, 2015). — : March 15, 2017
Links:
Tags: real estate
Last week, a research team at Google published AudioSet, a dataset of “2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.” The clips have been classified into hundreds of categories, including “plucked string instrument,” “computer keyboard,” “chuckle, chortle,” “snoring,” and “fowl.” [h/t Suman Deb Roy] — : March 15, 2017
Links:
Donald Trump’s new travel ban is scheduled to take effect at 12:01am Eastern tonight. The State Department doesn’t publish realtime visa data, but it does publish historical data, including the number of non-immigrant visas issued each fiscal year between 1997 and 2016, by nationality and visa type. (For example, the government issued 226 “fiancé(e)” K-1 visas to Syrian nationals in fiscal year 2016.) The agency also reports how many visas of each type it refused each year, as well as refusal rates by nationality. [h/t Thomas Kasang] [Update, 2017-12-12: The State Department link appears no longer to be working; here's a copy from the Wayback Machine: http://web.archive.org/web/20171201161048/https://travel.state.gov/content/visas/en/law-and-policy/statistics/non-immigrant-visas.html ] — : March 15, 2017
Links:
Tags: Trump, immigration
A trio of European researchers has published a dataset containing 101,000 photos of food — 1,000 images each from 101 food categories, all downloaded from foodspotting.com. The categories include apple pie, escargots, onion rings, paella, bibimbap, prime rib, and more. [h/t Reddit user cavedave] — : March 8, 2017
Links:
Tags: food
Researcher Amber Thomas has parsed the transcripts of last year’s 10 highest grossing films. The resulting data files indicate each character’s number of turns speaking, number of words spoken, and gender. Previously: Dialogue from 2,000 movies, by gender (April 13, 2016). — : March 8, 2017
Links:
Tags: entertainment, movies
The FDA’s “Orange Book” lists approved drugs, their associated patents, and government-granted exclusivity rights. The Orange Book is available as a 1,400-page PDF, but you can also download the key data as structured text files. The files are updated monthly. Related: “Drugs For Rare Diseases Have Become Uncommonly Rich Monopolies,” published by Kaiser Health News and NPR in January. Question for readers: The Orange Book data comes as tilde-delimited files, the first I’ve ever seen. Do you have ~any other examples~? [h/t Sydney Lupkin] — : March 8, 2017
Links:
Tags: business, drugs, healthcare
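For anyone meeting a tilde-delimited file for the first time, here is a minimal sketch of how one might read it with Python's standard `csv` module. The column names and the sample row are hypothetical placeholders, not the Orange Book's actual layout; check the real header row before relying on any field name.

```python
import csv
import io


def read_tilde_delimited(text):
    """Parse tilde-delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="~")
    return list(reader)


# Hypothetical sample data; the real files' columns may differ.
sample = (
    "Ingredient~Trade_Name~Appl_No\n"
    "EXAMPLE INGREDIENT~EXAMPLE BRAND~000001\n"
)

rows = read_tilde_delimited(sample)
print(rows[0]["Trade_Name"])
```

To read a downloaded file instead of a string, pass an open file handle to `csv.DictReader` with the same `delimiter="~"` argument.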
The Bitcoin exchange rate hit an all-time high last week, at more than $1,290 per bitcoin. That’s according to CoinDesk’s Bitcoin Price Index, an average rate derived from several major exchanges. You can download daily and hourly data for the index and its components. [h/t Jan Doggen] — : March 8, 2017
Links:
Tags: money, technology
From Treasury.io: “Every day at 4pm, the United States Treasury publishes data tables summarizing the cash spending, deposits, and borrowing of the federal government.” Those data tables “catalog all the money taken in that day from taxes, the programs, and how much debt the government took out.” On Monday, for instance, the government spent $481 million on the Postal Service. One hitch: The Treasury’s data tables are (subjectively) ugly and (objectively) spreadsheet-unfriendly. So Treasury.io — an open-source civic project — continuously converts the files into good ol’ tabular data. You can download individual tables as CSVs, get the whole dataset as a big SQLite database, or query the API. There’s also a data dictionary and a Twitter bot. — : March 8, 2017
Links:
Tags: government, taxes
Florida’s Fish and Wildlife Conservation Commission publishes data from its statewide recreational alligator hunt. For each alligator harvested between 2000 and 2015, the dataset includes the date, the hunting area, and the length of the carcass. (Legal hunting tools include crossbows, harpoons, spearguns, fishing poles, snatch hooks, and bang sticks — but not rifles, pistols, or other guns.) [h/t Christopher Groskopf + Neil Bedi + Eric Sagara] — : March 1, 2017
Links:
Last month, the Seattle Public Library released a dataset tracking the total number of checkouts for each title by year and month from April 2005 to December 2016 (so far). The dataset isn’t limited to physical books; it also includes e-books, magazines, CDs, DVDs, and more. Last year, the three most popular physical books were Paula Hawkins’s The Girl on the Train (2,355 checkouts), Lauren Groff’s Fates and Furies (2,151 checkouts), and Ta-Nehisi Coates’s Between the World and Me (2,134 checkouts). — : March 1, 2017
Links:
Tags: books
The National Highway Traffic Safety Administration provides an impressively rich API detailing every manufacturer, make, and model in its database. The API can translate cars’ Vehicle Identification Numbers into the nitty-gritty details that those VINs encode, including the plant where the vehicle was manufactured, number of doors, engine measurements, fuel type, and more. [h/t Justin Myers] — : March 1, 2017
Links:
Tags: transportation
In an early executive order, Donald Trump instructed the Department of Homeland Security to expand its use of Section 287(g) of the Immigration and Nationality Act, which allows the federal government to deputize local law enforcement agencies in its search for undocumented immigrants. In response to FOIA requests, DHS has previously released data on the local agencies that participate in the 287(g) program. The Marshall Project has collated the DHS data, which includes the number of immigrants deported, for 2006 to 2013 (the most recent year available). During that timespan, “more than 175,000 people nationwide were deported under the program,” Anna Flagg writes. “More than 30,000 of them came from Maricopa County, Ariz., the most from any single jurisdiction.” [h/t Tom Meagher] — : March 1, 2017
Links:
Tags: Trump, immigration, law
Wordbank is an “open database of children's vocabulary development.” So far, the Stanford-hosted project has gathered data from more than 71,000 standardized and anonymized vocabulary questionnaires across 23 languages. You could spend hours exploring the data online, charting how quickly children learn individual words, how quickly the same word (e.g., “grandma,” “abuela,” “ба́бушка”) is learned in different languages, and connections between words. You can download the data for each word or for each child’s vocabulary. Bonus: Wordbank has an R package and a GitHub repository. [h/t Hacker News user "Jasamba"] — : March 1, 2017
Links:
When researchers asked 1,354 people to name or visualize a playing card, 1 in 6 of them first chose the Ace of Spades. Here’s the data, which includes each participant’s three card choices, age, and gender. — : February 22, 2017
Links:
Since March 2015, the National Basketball Association has issued post-game reports reviewing referees’ calls during the final two minutes of neck-and-neck games. The NBA publishes those reports as PDFs; journalist Russell Goldenberg has been converting them to spreadsheet-friendly CSVs. Goldenberg is also analyzing and visualizing the data — updated daily — to show, for example, which players are benefitting most from incorrect and missed calls. (Answer so far: the Wizards’ Marcin Gortat and the Nets’ Brook Lopez.) — : February 22, 2017
Links:
Tags: sports
The National Cancer Institute has estimated ultraviolet radiation exposure for every county in the continental United States. The estimates, based on a peer-reviewed methodology and 30 years of data from the National Solar Radiation Data Base, can also be explored using the institute’s mapping tool. Luna County, New Mexico, had the highest estimated UV exposure at 5,723 Watt-hours per square meter; Clallam County, Washington, was exposed to the least estimated UV radiation, at 3,012 Wh/m². [h/t J. Albert Bowden II] — : February 22, 2017
Links:
Tags: healthcare
Last week, a team of researchers released a dataset containing “60,949 Doppler velocity measurements covering 1,624 stars taken over 20 years” from the Keck Observatory in Hawaii. The authors have already used the dataset to identify more than 100 exoplanets — i.e., planets outside our solar system. Now, they’re hoping that the public and other researchers will use their data to help discover even more. Previously: The NASA Exoplanet Archive (May 11, 2016). [h/t Arthur Bashlykov] — : February 22, 2017
Links:
Tags: science
Earlier this month, the Department of Housing and Urban Development released its “Picture of Subsidized Households” report for 2016. The dataset describes the living conditions, demographics, and finances of families receiving subsidies via the agency’s various programs — including public housing, Section 8 vouchers, and several others. The figures are provided for the entire U.S., by state, metro area, housing agency, city, county, Census tract, and even by housing development. HUD provides a data dictionary explaining each field, as well as a tool to query the data without downloading the entire dataset. [h/t Pat Smith] — : February 22, 2017
Links:
Tags: aid, real estate
The NCAA publishes data on its student athletes’ academic progress and graduation rates. The numbers are aggregated by school and sport — from baseball, to women’s bowling, to mixed rifle. [h/t Albert Bowden] — : February 15, 2017
Links:
From the Journal of Open Psychology Data: “We present a dataset of a single (N=1) participant diagnosed with major depressive disorder, who completed 1,478 measurements over the course of 239 consecutive days in 2012 and 2013.” The “participant” happens to be one of the study’s authors — Peter C. Groot, a researcher at Maastricht University Medical Centre. Each day, he recorded the degree to which “I feel relaxed,” “I feel lonely,” “I worry,” and responses to dozens of other prompts. [h/t Sacha Epskamp] — : February 15, 2017
Links:
Tags: healthcare
The Clinical Trials Transformation Initiative — a public-private partnership of more than 80 organizations — upgraded its clinical trials database late last month. The relational database, called the Aggregate Analysis of ClinicalTrials.gov (AACT), contains “all information (protocol and result data elements) about every study registered” through that titular government website. The AACT data is well-documented and accessible both via download and remote database connection. ClinicalTrials.gov also publishes the underlying data itself, but as one big XML file. — : February 15, 2017
Links:
Tags: healthcare
Last week, the Metropolitan Museum of Art made 375,000 images free to use, remix, and share under a Creative Commons Zero license. The museum also publishes bulk metadata on more than 420,000 pieces of art; that file indicates whether a given artwork is in the public domain, and hence whether the images fall under the new license. You can also search the images here. Other museums providing open-access imagery include the National Gallery of Art, the Getty, and Amsterdam’s Rijksmuseum. Previously: Mo’ museum metadata (Nov. 4, 2015). [h/t Joshua Barone + Sarah Bond] — : February 15, 2017
Links:
Tags: art
The National Weather Service’s Cooperative Observer Program (COOP) is a 127-year-old network of volunteer weather observers. “More than 8,700 volunteers take observations on farms, in urban and suburban areas, National Parks, seashores, and mountaintops,” according to the NWS. Want to become a volunteer? Because the program is so old, “many areas already have the necessary stations operating,” but “about 200 observers resign each year, about 4 per state.” While you’re waiting, you can download the COOP data from Iowa State University. [h/t Bill Frischling] — : February 15, 2017
Links:
Tags: climate
The World Health Organization publishes life expectancy estimates for 194 countries, for each year between 2000 and 2015. Related: “One Dataset, Visualized 25 Ways.” Previously: American life expectancies by city (April 13, 2016). — : February 8, 2017
Links:
For their 2011 paper, “Flavor network and the principles of food pairing,” four scientists analyzed 56,498 recipes downloaded from three websites — allrecipes.com, epicurious.com, and menupan.com. To support their findings, the authors published two datasets. One names the cuisine and ingredients for each recipe. The other dataset counts how often any two ingredients appeared in the same recipe. (Parmesan cheese and beef appeared together 93 times; starfruit and Algerian geranium oil just once.) Related: “food2vec – Augmented cooking with machine intelligence,” published last month. [h/t Rob Barry] — : February 8, 2017
Links:
Tags: food
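The second dataset described above, the ingredient-pair counts, can in principle be derived from the first. Below is a minimal sketch of that counting step, assuming each recipe is simply a list of ingredient names; the three toy recipes are made up for illustration and are not taken from the paper's data.

```python
from collections import Counter
from itertools import combinations


def count_pairs(recipes):
    """Count how many recipes each unordered ingredient pair appears in."""
    counts = Counter()
    for ingredients in recipes:
        # De-duplicate and sort so each pair is counted once per recipe
        # and always appears in the same (alphabetical) order.
        for pair in combinations(sorted(set(ingredients)), 2):
            counts[pair] += 1
    return counts


# Toy data, invented for this example.
recipes = [
    ["beef", "parmesan cheese", "tomato"],
    ["beef", "parmesan cheese"],
    ["tomato", "basil"],
]

pair_counts = count_pairs(recipes)
print(pair_counts[("beef", "parmesan cheese")])
```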
The prestigious Scandinavian awards have an API. The official documentation explains it succinctly: “The data is free to use and contains information about who has been awarded the Nobel Prize, when, in what prize category and the motivation, as well as basic information about the Nobel Laureates such as birth data and the affiliation at the time of the award. The data is regularly updated as the information on Nobelprize.org is updated, including at the time of announcements of new Laureates.” Related: “These Nobel Prize Winners Show Why Immigration Is So Important For American Science,” by my colleague Peter Aldhous. Plus: The R code supporting Peter's analysis. — : February 8, 2017
Links:
Tags: history, miscellaneous
The International House Price Database combines and standardizes house price indices from 23 countries — mostly in Europe and North America, but also including South Africa, Australia, New Zealand, Japan, South Korea, and Israel. The dataset, published by the Federal Reserve Bank of Dallas, is deeply documented and updated quarterly. Previously: Historical San Francisco rents (May 25, 2016) and the U.S. Census Bureau’s Annual Characteristics of New Housing (June 22, 2016). — : February 8, 2017
Links:
Tags: real estate
Two weeks ago, Bloomberg News reporters requested entrance and exit data from Washington, DC’s Metrorail system for three days: Jan. 20, 2009 (Obama's first inauguration), Jan. 20, 2017 (Trump's inauguration), and Jan. 21, 2017 (the Women's March). A week later, they received the data — but as PDFs, which they turned into structured data and published this week. Related: NYC’s MTA publishes detailed turnstile-by-turnstile data, and Chicago publishes daily “L” ridership data for each station going back to 2001. Plus: “Second Avenue Subway Relieves Crowding on Neighboring Lines,” which uses the NYC data. — : February 8, 2017
Links:
The National Institute of Standards and Technology publishes Special Database 18 “for use in development and testing of automated mugshot identification systems.” The dataset contains 3,248 mugshot photos portraying 1,573 different people (mostly men), and includes each arrestee’s age and gender. [h/t Noah Veltman] — : January 25, 2017
Links:
Tags: crime, technology
EU-Forest is a new dataset that, according to its authors, “extends by almost one order of magnitude the publicly available information on European tree species distribution.” The new project merges and harmonizes data from 21 national forest surveys and two related databases. In all, EU-Forest includes more than 580,000 observations of more than 200 species in 1km-by-1km square plots of land, and is available in both tabular and geospatial file formats. Previously: American tree maps (Dec. 23, 2015) and NYC street trees (Nov. 16, 2016). — : January 25, 2017
Links:
Tags: mapping, plants, statistics
The GDELT Project and the Internet Archive have collaborated to make the latter's Television News Archive more powerfully searchable. Their new tool, announced in December, lets you search across “more than 5.7 billion words from over 150 distinct stations spanning July 2009 to present” at a sentence-by-sentence level. The results are downloadable as CSV or JSON files. Previously: The Political TV Ad Archive (Feb. 2, 2016). — : January 25, 2017
Links:
The Bank of England publishes a spreadsheet of historical economic data going back, in some cases, to the late 1600s. The country’s GDP in 1700 was £11.7 billion in 2013 prices. That’s about 1/157th the size of the UK’s GDP in 2015. And in November 1694, monthly short-term interest rates were roughly 6%. [h/t Ian Greenleigh] — : January 25, 2017
Links:
A team of economists studying “the equality of opportunity” has published new research identifying which colleges “help the most children climb the income ladder.” For their analysis, the researchers combined federal tax records and data from the Department of Education. California State University–Los Angeles was one of the greatest engines of mobility; nearly 1 in 10 students enrolled there began in the bottom 20% of income but reached the top 20% by their early thirties. You can download the findings, which include similar statistics for more than 2,000 schools, as a series of spreadsheets. Related: “Some Colleges Have More Students From the Top 1 Percent Than the Bottom 60. Find Yours,” from the New York Times. — : January 25, 2017
Links:
The General Services Administration recently updated its list of known .gov domains. It currently includes more than 1,300 federal domains — from aapi.gov to youthrules.gov — and more than 4,300 domains registered by state, local, and native sovereign agencies. — : January 18, 2017
Links:
Tags: government, technology
State-owned Deutsche Bahn AG is Europe’s largest railway company by revenue, serving 12 million train and bus passengers each day. It also happens to publish a bunch of open data, including datasets on its routes, stations, platforms, and cargo facilities. [h/t Martin Bergmann] — : January 18, 2017
Links:
Tags: transportation
Between December 2014 and March 2016, Alberto Cavallo — co-founder of MIT’s Billion Prices Project — sent 323 crowdsourced workers to collect product prices from 56 large retailers in 10 countries. Then, he found the prices for the same products on the retailers’ websites. The results, which contain tens of thousands of observations, are available as several Excel spreadsheets. (Caveat: The dataset’s “Terms of Use” rules stipulate that the information is “EXCLUSIVELY FOR USE IN ACADEMIC RESEARCH AND PUBLICATIONS”.) Related: Cavallo summarized his findings in a paper published recently by the American Economic Review. — : January 18, 2017
Links:
Tags: business, technology
Late last year, the USDA published a study that used “point-of-sale transaction data from a leading grocery retailer to examine the food choices” of households receiving Supplemental Nutrition Assistance Program (SNAP) benefits. In an appendix, the report ranks the total spending on major commodities by SNAP households and non-SNAP households. Soft drinks, “fluid milk products,” and ground beef were the top three commodities purchased by SNAP households. Milk, soft drinks, and cheese were the top three for non-SNAP households. That information is presented as a PDF table, but I’ve converted it to a spreadsheet-friendly text file for you. [h/t Reddit user "junglejuicy"] — : January 18, 2017
Links:
At BuzzFeed News, a few colleagues and I spent the past two months compiling a big database of organizations and people connected to President-elect Trump, his family, advisers, and Cabinet picks. On Sunday, we published what we’ve found so far — connections between more than 1,500 organizations and people altogether. Still, there are certainly things we’ve missed. So you can download and search the data, but you can also help us expand it. See something we’ve overlooked? Let us know! — : January 18, 2017
Links:
In 2015, computer scientist Randy Olson tried computing “the optimal search strategy for finding Waldo” in the seven original Where’s Waldo? books. In doing so, he transcribed a 2013 Slate chart of Waldo’s locations (itself transcribed from those seven original books). The resulting dataset contains 68 rows — one for each Waldo — and four columns: book, page, x coordinate, and y coordinate. — : January 11, 2017
Links:
CelesTrak’s T.S. Kelso has been obsessively transcribing NORAD’s “resident space object” data for decades. Among his offerings: the SATCAT satellite catalog, which provides data on all known satellites launched since 1957 — more than 41,900 of ‘em. Kelso also provides a SATCAT Boxscore, which is like a baseball box score ... but for satellites. The U.S., it turns out, is responsible for almost exactly one-third of the 1,590 satellites classified as “active.” Previously: The Union of Concerned Scientists’ satellite database, featured Dec. 30, 2015. [h/t Noah Veltman] — : January 11, 2017
Links:
Tags: history, technology
Years ago, Lt. Col. Jenns Robertson began entering information into “a simple Excel spreadsheet that eventually matured into the largest compilation of releasable U.S. air operations data in existence.” Last month, the Department of Defense published a “beta” version of this data, known as Theater History of Operations Reports (THOR). Currently, THOR’s data covers bombing operations from World War I, World War II, the Korean War, and the Vietnam War. For each bombing, the reports include data about the aircraft, munitions, targets, results, and more. — : January 11, 2017
Links:
Scientists expect that, when the final numbers come in, 2016 will have been Earth’s hottest year on record. The National Oceanic and Atmospheric Administration publishes monthly data on “temperature anomalies” — how much hotter or cooler a month was than the 20th century average. (November 2016, the most recent month available, was 0.73° Celsius warmer than the average November.) You can grab the data for the entire globe, by hemisphere, or by continent; for the land and ocean combined, or separately; and going all the way back to 1880. Related: My colleague Peter Aldhous demonstrates how he charted this data using R. Also: NOAA released its 2016 U.S. “State of the Climate” report on Monday. — : January 11, 2017
Links:
Tags: climate
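A temperature anomaly is just a difference from a baseline average. The sketch below shows that arithmetic for a single calendar month, assuming the baseline period is 1901 through 2000 (one reading of "20th century average") and using invented temperatures; it is not NOAA's actual processing, which works from many stations and gridded data.

```python
def monthly_anomalies(temps_by_year, baseline_start=1901, baseline_end=2000):
    """Return each year's difference from the baseline-period mean.

    temps_by_year maps a year to that year's temperature for one
    calendar month (for example, every November), in degrees Celsius.
    """
    baseline = [
        temp
        for year, temp in temps_by_year.items()
        if baseline_start <= year <= baseline_end
    ]
    baseline_mean = sum(baseline) / len(baseline)
    return {year: temp - baseline_mean for year, temp in temps_by_year.items()}


# Invented values, for illustration only.
november_temps = {1901: 12.0, 1950: 12.2, 2000: 12.4, 2016: 12.9}

anomalies = monthly_anomalies(november_temps)
print(round(anomalies[2016], 2))
```

A positive result means the month was warmer than the baseline average; a negative one means it was cooler.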
In September, ProPublica published a Chrome extension that showed readers what Facebook said it knew about them — and then asked readers to share that data. In the following months, readers unearthed more than 52,000 of the “unique interest categories” that Facebook uses for advertising, such as “yoga,” “beer,” and “Scent of a Woman (1992 film).” But ProPublica’s reporters also found that Facebook doesn’t tell users about the “far more sensitive” data it buys about their offline lives, which can include “their income, the types of restaurants they frequent and even how many credit cards are in their wallets.” To support these findings, ProPublica published two key datasets: the crowdsourced “interest categories” and the list of categories that Facebook allows advertisers to target. — : January 11, 2017
Links:
Tags: social media, technology
Last week, Quartz published an addictive tool that lets you map word usage on Twitter, by U.S. county. It’s based on an academic analysis of 890 million geocoded tweets uttered between October 2013 and November 2014. Data and details available here. — : December 21, 2016
Links:
The FAA’s Near Midair Collision System keeps track of incidents where two planes flew uncomfortably close to each other. The system, which is based on reports from pilots and flight crew members, contains more than 7,500 incidents dating back to 1987. The FAA received 305 of these reports for the first 10 months of 2016, including 35 classified as “critical.” — : December 21, 2016
Links:
Tags: transportation
The U.S. Geological Survey’s BISON service brings together “species occurrence” data from hundreds of sources. The service, whose name stands for “Biodiversity Information Serving our Nation,” currently contains 262 million records, each of which refers to the observation of “an organism at a particular time in a particular place.” Most of the observations are based on direct sightings; others use fossils, written records, or other sources. The data aren’t available for bulk download, but can be accessed via BISON’s free API. [h/t Clare Malone] — : December 21, 2016
Links:
Tags: animals, plants, statistics
Since the 1940s, oilfield services corporation Baker Hughes and its predecessor companies have been publishing “rig counts” — the number of rigs actively drilling for oil and/or gas in various parts of the world. These days, the company updates its North America numbers every week and its international counts every month. As of December 16, they counted 637 rigs in — and offshore of — the United States, nearly half of them in Texas. [h/t Jordan Wirfs-Brock] — : December 21, 2016
Links:
Last week, the U.S. Department of Health and Human Services released a dataset of state-level Obamacare metrics. The dataset is divided into five main categories: coverage gains, employer coverage, individual market coverage, Medicaid, and Medicare. Between 2010 and 2015, the proportion of Nevadans without health insurance dropped from 22.6% to 12.3% — the largest percentage-point decrease of any state. (In 2015, an estimated 17.1% of Texans still didn’t have health insurance, the highest rate of any state that year.) The metrics come from various sources, including the Census, academic studies, and the department’s own estimates. [h/t Nadja Popovich] — : December 21, 2016
Links:
Tags: government, healthcare
UK-based CarbonCulture helps organizations measure and publish their buildings’ energy and water use in near-realtime. Among the first users: 10 Downing Street, the Tate Modern, and University College London. For each building, you can download yearly datasets, which are broken down into 30-minute intervals. [h/t Max Roser] — : December 14, 2016
Links:
Tags: energy
A Lithuania-based web-scraping company has been collecting data on Kickstarter projects and Indiegogo campaigns every month. The datasets include (among other things) each project’s number of backers, amount pledged, and category. You can also explore the data online. [h/t Vincent Granville] — : December 14, 2016
Links:
Tags: money, technology
The European Commission and Google engineers have mapped surface water – including lakes, rivers, reservoirs, oceans, and more – on every 30-meter-by-30-meter square on Earth between 1984 and 2015. During that time, “permanent surface water has disappeared from an area of almost 90,000 square kilometres, roughly equivalent to that of Lake Superior, though new permanent bodies of surface water covering 184,000 square kilometres have formed elsewhere.” The data, based on the U.S. government’s Landsat satellite images, are available to download and explore online. Related: “Mapping Three Decades of Global Water Change,” published by The New York Times, based on this dataset. — : December 14, 2016
Links:
Troy Hunt runs HaveIBeenPwned.com, a service that lets you see whether your email address has been included in any major data breaches. Last week, Hunt published an anonymized dataset based on the breaches he’s collected. (That post provides a torrent file for the dataset; you can also download the data here.) Unlike the HaveIBeenPwned website, the dataset doesn’t include information about specific accounts; instead it counts the number of email addresses that have been compromised on particular combinations of websites. For example, 14.6 million email addresses appeared in both the LinkedIn and Dropbox breaches. (You can read more about each breach here.) — : December 14, 2016
Links:
Tags: crime, technology
The federal government has released data on Medicare’s prescription drug spending from 2011 to 2015. Previously, Medicare had only published data on the most expensive drugs; the new release includes data on all drugs used by at least 11 Medicare patients in a given year. Caveat: Medicare “is prohibited from publicly disclosing drug-specific information on manufacturer rebates,” so the “spending metrics do not reflect any manufacturers’ rebates or other price concessions.” [h/t Charles Ornstein] — : December 14, 2016
Links:
Tags: drugs, healthcare
“MusicNet is a collection of 330 freely-licensed classical music recordings, together with over 1 million annotated labels indicating the precise time of each note in every recording, the instrument that plays each note, and the note's position in the metrical structure of the composition.” [h/t Lon Riesberg] — : December 7, 2016
Links:
Tags: audio, entertainment, music
The IPUMS Higher Ed portal provides data from three “leading surveys for studying the science and engineering (STEM) workforce in the United States.” The surveys currently cover 1993 through 2013 and include questions about educational choices, demographics, employment outcomes, and more. Requires a free account. [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle] — : December 7, 2016
Links:
Last month, Chicago’s city government published data on more than 100 million local taxi rides taken in the city since 2013. (The city gathers the data through “periodic reporting by two major payment processors believed to cover most taxis in Chicago.”) The dataset contains each ride’s start/end times, pickup/dropoff location (based on Chicago’s “community areas”), distance, cost, payment type, and taxi company. Related: “Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance,” which contains pointers to similar data for New York City. [h/t Dan Nguyen] — : December 7, 2016
Links:
Tags: transportation
The Open PV Project is a “community driven, comprehensive database” of solar panel installations in the U.S., ranging from home installations to utility-scale projects. The database, run by the Department of Energy, contains more than 1 million installations — with a total capacity of 16,000+ megawatts — and tracks their locations, sizes, costs, installers, and other variables. [h/t Dad] — : December 7, 2016
Links:
Tags: energy
The U.S. Energy Information Administration publishes a bunch of geographic data, including shapefiles mapping the country’s crude oil, petroleum product, hydrocarbon gas liquid, and natural gas pipelines. (They were last updated five months ago.) Additionally, the Pipeline and Hazardous Materials Safety Administration keeps track of “significant incidents” — for example, those that caused a serious injury or $50,000 in damage. Related: “Six maps that show the anatomy of America’s vast infrastructure.” Also related: ProPublica’s Pipeline Safety Tracker, covering 1986–2012. — : December 7, 2016
Links:
A few years ago, Reddit user trexmatt uploaded 216,930 Jeopardy! trivia-tidbits, scraped from j-archive.com, “the nearly comprehensive online Jeopardy! archive maintained by obsessive fans.” Each entry lists the question, answer, category, value, round, show number, and show air-date. — : November 30, 2016
Links:
Data analyst Patrick Martinchek has published a dataset of all Facebook posts from “15 of the top mainstream media sources” — a group that includes The New York Times, The Wall Street Journal, NPR, Fox News, and other familiar sources — from January 2012 through Nov. 8, 2016. Related: “What I Discovered About Trump and Clinton From Analyzing 4 Million Facebook Posts.” — : November 30, 2016
Links:
This year, I decided to grade a bunch of prominent election forecasts for BuzzFeed News. Now that Michigan has finally been called, I’ve published the results. I’ve also published the underlying data and code on GitHub, including state-level predictions from all nine forecasters in the analysis. — : November 30, 2016
Links:
Earlier this month, Forbes published an examination of ShotSpotter, a company that uses networks of outdoor microphones to detect and locate gunshot-like sounds. Forbes found that ShotSpotter has produced “few tangible results.” “In some cities, ShotSpotter hasn’t had the effect city officials and residents had hoped for. While officers are responding to more illegal gunfire, they rarely catch the shooter.” To support its findings, Forbes has published the ShotSpotter data they received from police departments in seven cities: Brockton, Mass.; East Palo Alto, Calif.; Kansas City, Mo.; Milwaukee, Wis.; Omaha, Neb.; San Francisco, Calif.; and Wilmington, N.C. The data varies somewhat for each city, but typically includes the date, time, location, and outcome of each gunshot alert. [h/t Matt Drange] — : November 30, 2016
Links:
The CDC’s Underlying Cause of Death database provides county-level mortality statistics based on death certificates of U.S. residents for each year from 1999 to 2014. The tool lets you group the data by geography, demographics, place of death (e.g., inpatient hospital, hospice, home, etc.), and other variables. In 2014, for example, about 40,000 residents died of pancreatic cancer — with the highest rates coming in America’s most-rural counties (~15.6 deaths per 100,000 residents) and the lowest rates in the country’s most-urban counties (~11.3 per 100,000). The CDC’s “compressed mortality” datasets contain slightly less detail, but go all the way back to 1968. [h/t Drew Ivan] — : November 30, 2016
Links:
Earlier this month, New York City published the results of its decennial tree count. You can explore a map of every street tree in NYC — nearly 700,000 of ’em — or download the corresponding dataset, which contains info on each tree’s species, circumference, health status, and other observations. (Note: That dataset appears to contain about one-third fewer trees than the map’s count, for reasons I can’t quite figure out.) Results of the 1995 and 2005 tree censuses are also available. — : November 16, 2016
Links:
Germany-based researcher Andreas Thalhammer has applied PageRank — the algorithm at the heart of Google’s origin story — to the world of Wikipedia. The result: the DBpedia PageRank dataset, which estimates the importance of each page based on the other pages that link to it. You can download the data directly, or query it online. (According to the metric, Aristotle, Plato, and Karl Marx are history’s three most Wiki-central philosophers.) — : November 16, 2016
Links:
Tags: technology
Jason Baumgartner — a.k.a. Stuck_In_the_Matrix — has collected and published every submission and comment posted to Reddit from November 8th through November 10th. For each of the nearly 8 million comments, the dataset includes the message, the author, the subreddit it was posted to, the comment thread’s ID, and more. Previously: 1.7 billion Reddit comments, featured Nov. 25, 2015. — : November 16, 2016
Links:
Last month, colleagues at BuzzFeed News and I analyzed and fact-checked 1,000+ posts from hyperpartisan Facebook pages, and found a disturbingly high rate of fake news. Here’s the data. Facebook CEO Mark Zuckerberg has dismissed the possibility that fake news influenced the election, calling it a “pretty crazy idea”. Meanwhile, renegade Facebook employees have now formed an unofficial task force to battle fake news on the platform. — : November 16, 2016
Links:
Since the 1990s, the FBI has collected data on hate crimes from local law enforcement agencies. On Monday, the bureau released data for 2015, reporting “5,850 criminal incidents and 6,885 related offenses, as being motivated by bias toward race, ethnicity, ancestry, religion, sexual orientation, disability, gender, and gender identity.” Those numbers are based on reports from 14,997 participating agencies. On the FBI’s website, you can view and download summary tables of the most recent data. You can also download incident-specific data for 1992 through 2014 from the National Archive of Criminal Justice Data. Unfortunately, as ProPublica noted yesterday, the FBI dataset is “deeply flawed”; more than 3,000 law enforcement agencies don’t participate in the program. [h/t John Templon] — : November 16, 2016
Links:
The city publishes a spreadsheet — last updated in May — of local dogs who’ve officially been “declared dangerous.” (“They have attacked in the past. The owner is required to provide $100,000 in financial responsibility. If they attack again the court could order them put to sleep.”) The file currently contains 63 entries, from a Labrador named Charlie to a Blue Lacy named Flint. [h/t Sharon Machlis] — : November 2, 2016
Links:
Julian McAuley, an assistant professor at UC San Diego, has collected a massive amount of user-generated data from Amazon.com, including 142.8 million reviews and 1.4 million answered Q&As. (As of mid-2014, Sophie la Girafe was the most-reviewed item in the baby category. Backstory here.) Much of the data can be downloaded directly, but the largest files require contacting McAuley for access. [h/t Reddit user samofny] — : November 2, 2016
Links:
Earlier this autumn, New York City began publishing a dataset of official citizen complaints against the city’s police, for every case closed since 2006. For each of the 200,000+ allegations, the main dataset includes various details about the incident — e.g., where it took place, and whether there’s video evidence — but no information about the officer involved. Related: Similar data from Indianapolis, which includes demographic information about the complained-against officers but not their names. Also related: “The local projects that are making police complaint data open and accessible.” Previously: Complaints against Chicago police, featured Nov. 11, 2015. [h/t Eve Ahearn] — : November 2, 2016
Links:
The European Commission’s Global Human Settlement Layer combines satellite imagery and census data to measure three things: population, building density, and urban/rural classification. The resulting datasets are fairly detailed — they provide population estimates for every 250-meter square in the world, for example — and are available for 1975, 1990, 2000, and 2015. [h/t Alaistair Rae] — : November 2, 2016
Links:
The U.S. government’s Medicare Health Outcomes Survey tracks the “physical and mental health and well-being” of Americans covered by Medicare. Each survey, currently available for 1998–2000 to 2012–2014, follows a sample of Medicare beneficiaries for two years, and asks them questions along the lines of, “In the past 12 months, have you had a problem with balance or walking?” The 2012–2014 data includes (at least partial) responses from 296,320 people. [h/t Ricardo Pietrobon] [Update, 2016-11-02: The original link in this item points to an ICPSR page, which provides access only to people at "member institutions." Here's a better link to the data: http://www.hosonline.org/en/data-dissemination/research-data-files/] — : November 2, 2016
Links:
Tags: healthcare
Between October 2014 and September 2015, the U.S. Transportation Security Administration confiscated 22,196 “dangerous” items at airports, including 156 times at New York’s JFK. (Twice there, someone had placed fireworks in checked baggage.) That’s according to data obtained from the government by FOIA enthusiast Max Galka, who has also built an interactive map of the confiscations. — : October 26, 2016
Links:
Tags: transportation
OpenFlights.org has collected data on more than 60,000 flight routes, including 915 itineraries departing Atlanta’s Hartsfield–Jackson International Airport. (That airport was recently named the world’s busiest, for the 18th year in a row.) For each route, the dataset indicates the airline, the departing airport, the arriving airport, the number of stops, and what type of plane is typically used. The website also provides datasets on thousands of airports and airlines. Important caveat: “This data is not suitable for navigation.” — : October 26, 2016
Links:
Tags: transportation
The World Cities Culture Forum, a convening of 32 major cities on six continents, has assembled a series of mini-datasets on 70+ “cultural indicators”. Those indicators range from the number of art galleries in each city (Paris had 1,151 in 2012) to the number of international tourists each city sees per year (Istanbul had 11.8 million in 2014) to the value of cinema ticket sales (Shanghai sold $563 million in 2014). Note: The data points draw on various sources — at least one just says “Google” — and aren’t necessarily directly comparable. [h/t Camilo Moreno] — : October 26, 2016
Links:
The Department of Education’s EDFacts data tracks public grade schools’ participation and proficiency rates on standardized math and reading/language exams. The files provide data on all students who took the tests, broken down by race/ethnicity, sex, disability status, homelessness, and more. A related set of data files, available on the same page, tracks high-school graduation rates. — : October 26, 2016
Links:
Tags: education
The website EveryCRSReport.com provides unprecedented public access to reports from the Congressional Research Service — essentially the national legislature’s think-tank. The website, which was launched last week by Demand Progress and the Congressional Data Coalition, also lets you download metadata and text for each report. [h/t Daniel Schuman] — : October 26, 2016
Links:
Tags: government
Today’s newsletter marks the 50th edition of Data Is Plural, as well as its one-year anniversary. To celebrate, I’ve started publishing a spreadsheet that details each edition’s basic stats — total subscribers, the “open rate,” the number of people who chose to unsubscribe, and more. — : October 19, 2016
Links:
The Department of Agriculture publishes a spreadsheet of farmers markets in the United States. For each market, the dataset notes its location, hours, and the types of goods available (e.g., vegetables, seafood, flowers, et cetera). [h/t Susie Lu] — : October 19, 2016
Links:
Tags: agriculture
The Jordà-Schularick-Taylor Macrohistory Database claims to be “the most extensive long-run macro-financial dataset to date.” It contains dozens of variables — GDP per capita, long-term interest rates, and the timing of systemic financial crises, for example — for 17 “advanced economies”. The dataset uses a Creative Commons license and has been extensively documented. — : October 19, 2016
Links:
Tags: economics
Each year, the Department of Health and Human Services updates its Area Health Resources Files, a vast suite of local health care data collated from more than 50 sources. Among the topics covered: the number of health care professionals by specialty, various rates of hospital usage, air quality, and demographic profiles. You can download the data, or explore and map it online. [h/t Ricardo Pietrobon] — : October 19, 2016
Links:
Tags: healthcare
The Census Bureau’s Annual Survey of Manufactures provides state-by-state and industry-by-industry statistics for America’s manufacturing sector. Metrics include the number of employees, annual payroll, “value added,” beginning-of-year inventory, and many more. In 2014, dog and cat food manufacturers employed about 18,000 people nationwide. Related: “Why Are Politicians So Obsessed With Manufacturing?” [h/t Scott Stern + RJ Andrews] — : October 19, 2016
Links:
You don’t have to like Flight of the Conchords to enjoy New Zealand’s national statistics website, though it couldn’t hurt. The country publishes data on a broad range of topics, including abortion, work stoppages, the Māori census, and, of course, exports. In '08 and '09, the country exported NZD $3.5 billion and NZD $2.4 billion, respectively, of “mineral fuels, mineral oils and products of their distillation; bituminous substances; mineral waxes.” [h/t Drew Ivan] — : October 12, 2016
Links:
Tags: statistics
John C. McCallum has collected the advertised prices of computer memory over time. In 1957, one byte of memory cost $392, or the equivalent of $411 million per megabyte; today, one megabyte costs about a third of a cent. [h/t Jorge Luis] — : October 12, 2016
Links:
Tags: economics, technology
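The per-megabyte figure in the memory-price item above can be reproduced with a quick calculation. This is only a sketch: it assumes a megabyte of 2**20 bytes (the binary convention), and the function name is made up for illustration rather than taken from McCallum's data.

```python
# Convert a per-byte memory price into a per-megabyte price.
# Assumption: one megabyte is 2**20 bytes (binary convention).
BYTES_PER_MEGABYTE = 2 ** 20


def price_per_megabyte(price_per_byte: float) -> float:
    """Scale a per-byte price up to a full megabyte (hypothetical helper)."""
    return price_per_byte * BYTES_PER_MEGABYTE


# The 1957 figure quoted above: $392 for one byte.
cost_1957 = price_per_megabyte(392)
print(f"${cost_1957:,.0f} per megabyte")
```

Under that assumption, $392 per byte works out to roughly $411 million per megabyte, matching the figure in the item.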
The New York Philharmonic’s performance history dataset contains “all known concerts” — more than 20,000 of ’em — played by the Philharmonic and the groups with which it has merged (e.g., the New York Symphony). Last month, the Museum of Modern Art published a dataset containing “all of the known exhibitions held at the museum from 1929 through 1989” — 1,788 in total. The first featured Cézanne, Gauguin, Seurat, and van Gogh. [h/t Stacy-Marie Ishmael + Miriam Posner + Chad Weinard] — : October 12, 2016
Links:
The World Bank keeps statistics on total forest coverage per country and worldwide. (Between 1990 and 2015, that worldwide total declined from 41.3 million to 40.0 million square kilometers.) More than 98% of all land area in Suriname was forest in 2015, according to a related dataset — the highest proportion of any country. [h/t Tariq Khokhar + Max Galka] — : October 12, 2016
Links:
Tags: environment, plants
Noah Veltman has collected all presidential endorsements (and non-endorsements) of 100+ major newspapers from 1980 (Reagan vs. Carter) to 2016. You can view the data as a spreadsheet, or as a formatted table. — : October 12, 2016
Links:
Tags: journalism, media, politics
FOIA enthusiast Max Galka received a month of highway traffic data from the U.S. Department of Transportation. The dataset “includes hourly traffic counts for each hour of each day of [November 2015] at approximately 4,000 continuous traffic counting locations nationwide.” In all, the dataset “amounts to a total of 14 million traffic count readings and a total of 6 billion vehicles counted.” — : October 5, 2016
Links:
Tags: transportation
The UNESCO Institute for Statistics’ data on national research and development budgets contains estimates of personnel and total spending by field, funding source, and more. You can also explore the data online through a series of interactive graphics. [h/t Rebecca Galloway] — : October 5, 2016
Links:
Tags:
The federal government publishes default rates for federal student loans, aggregated by school, state, and school type. Last week, it published data covering students whose loans were due for repayment beginning in FY2013. The national default rate for those students as of this August was 11.3%. At certain schools, however, more than a third of students defaulted. More: Some background on the 10 colleges with highest default rates, by my colleague Molly Hensley-Clancy. — : October 5, 2016
Links:
Researchers at the Vienna-based Wittgenstein Centre for Demography and Global Human Capital have developed a dataset of historical and projected education levels for 171 countries. For five-year age groups in each country, the project estimates the percentage of people in each of several categories of educational attainment — no education, primary education, secondary education, post-secondary education, and a few gradations in between. The dataset is available to browse and download via the Wittgenstein Centre Data Explorer – look for “Educational Attainment Distribution” in the “indicators” dropdown. — : October 5, 2016
Links:
Tags: education
AidData, an organization based at the College of William & Mary, has compiled a dataset of more than 1.5 million foreign aid projects between 1947 and 2013. Together, the dataset accounts for more than $7 trillion in commitments from 96 donors such as the U.S. government, UNICEF, the Nordic Development Fund, and the World Bank. AidData also publishes geospatial datasets and a data user guide. Previously: ForeignAssistance.gov, featured Jan. 13. [h/t Kedar Pavgi] — : October 5, 2016
Links:
Tags: United Nations, aid, history
After 2000’s contentious election, the National Opinion Research Center — funded by a consortium of news organizations — rigorously reviewed 175,010 Florida ballots that weren’t recognized as “valid” votes for president. In November 2001 the researchers concluded that, even with a full recount of disputed ballots, George W. Bush still would have won the state by 493 votes. The underlying data is available in several formats. — : September 28, 2016
Links:
The Constituency-Level Elections Archive, based at the University of Michigan, collects and standardizes results from lower-house legislative elections around the world. (In the U.S., the lower house is the House of Representatives; in the U.K., it’s the House of Commons; in Albania, it’s the Kuvendi i Shqipërisë.) The latest release covers 1,591 elections from 136 countries. [h/t Jeremy Darrington] — : September 28, 2016
Links:
The U.S. Election Assistance Commission’s Election Administration and Voting Survey “includes data on the ability of civilian, military and overseas citizens to register to vote and successfully cast a ballot,” as well as an overview of each state’s voting laws and procedures. [h/t Derek Willis] — : September 28, 2016
Links:
OpenElections, a Knight Foundation–funded project, aims “to create the first free, comprehensive, standardized, linked set of election data for the United States.” They’ve made progress, but are looking for additional volunteers. In the meantime, you can download county-level presidential results from the National Atlas of the United States for 2004, 2008, and 2012 — or all combined. And you can download precinct-level results from 2002 to 2012 from the Harvard Election Data Archive (codebook here). — : September 28, 2016
Links:
Perhaps better known for its campaign-finance data, the Federal Election Commission also publishes official state-level results for presidential, House, and Senate elections going back to 1982. The results include all official candidates, and sometimes even write-ins (depending on the state). In the 2008 presidential election, eight Rhode Island voters wrote in “Stephen Colbert,” five scribbled “Joe the Plumber,” and seven chose “Jesus.” — : September 28, 2016
Links:
A group of computer scientists and the New Yorker’s cartoon editor walk into a room… and write an academic article titled, “Humor in Collective Discourse: Unsupervised Funniness Detection in the New Yorker Cartoon Caption Contest.” The corresponding dataset — available via the “cartoons” link on this page — includes 50 cartoons and nearly 300,000 reader-submitted captions. — : September 14, 2016
Links:
Reporters at the New York Times have assembled a dataset counting the number of inmates each U.S. county sent to state prison in 2006, 2013, and 2014. The reporters derived the numbers from the Bureau of Justice Statistics’ National Corrections Reporting Program, which only certain researchers can access. Related: “This small Indiana county sends more people to prison than San Francisco and Durham, N.C., combined. Why?” — : September 14, 2016
Links:
Tags: crime
The National Snow and Ice Data Center, based at the University of Colorado, publishes the Sea Ice Index. The data files, which track ice coverage in the Arctic and Antarctic oceans, include daily and monthly measurements from November 1978 to the present. Lately, the extent of sea ice on the Arctic Ocean has been two or more standard deviations below its long-term average, according to the center, while Antarctic sea ice remained at average levels. [h/t Dan Vergano] — : September 14, 2016
Links:
The CDC calls its Behavioral Risk Factor Surveillance System “the largest continuously conducted health survey system in the world.” Every year, the survey asks more than 400,000 American adults about a range of health-related topics, from tobacco to seatbelt use, from alcohol consumption to arthritis, from HIV testing to immunizations. Annual datasets from 1984–2015 are currently available. [h/t Ricardo Pietrobon] — : September 14, 2016
Links:
Researchers at the Washington Center for Equitable Growth have compiled a dataset of current and historical minimum wages in America. The federal and state minimum-wage data stretches back to May 1974 — when the federal minimum was $2.00 per hour, roughly equivalent to $9.76 per hour in today’s dollars — while the data for cities and counties starts in January 2004. [h/t Ben Casselman] — : September 14, 2016
Links:
The U.S. General Services Administration publishes an annual dataset about vehicles owned and leased by the federal government. The spreadsheets — which contain details on total inventories, cost, usage, and fuel consumption — go back to fiscal year 2011. In FY 2015, federal vehicles drove 4.8 billion miles, down about 9% from FY 2011. [h/t John Templon] — : September 7, 2016
Links:
Tags: government, transportation
Last week, a team of researchers published HistPat, a database containing county-of-residence data for 2.8 million U.S. patents granted between 1836 and 1975. The database covers approximately 83% of all patents granted to U.S. residents during that time, according to the authors. The most frequent home counties for innovation were New York County (422,234 patents); Cook County, Ill. (215,021); and Los Angeles County (90,171). Related: The National Bureau of Economic Research’s dataset of patent citations, 1975–1999. And: “Cancer moonshot” patents, 1976–2016. [h/t Drew Ivan] — : September 7, 2016
Links:
Tags: technology
Earlier this summer, a group of researchers published a “new world atlas of artificial night sky brightness,” also known as light pollution. You can download a KMZ version of their atlas and view it in Google Earth. The researchers haven’t made their most detailed, “floating point” dataset available for public download; instead, they ask that you first submit a data-request form. [h/t Matthew Petroff] — : September 7, 2016
Links:
Tags: environment, mapping
The Federal Communications Commission decides who can use the nation’s airwaves and how. To date, they’ve issued millions of licenses, including nearly 200,000 last year for broadcast, personal use, law enforcement, and more. Almost exactly six years ago, the FCC launched a consolidated portal that pulls data from its various licensing systems into a single dataset. You can download all 17 million licenses in bulk, search for specific licenses online, or query the dataset’s API. [h/t Marc DaCosta] — : September 7, 2016
Links:
Tags: media
Since 1996, the Medical Expenditure Panel Survey has collected data on “the specific health services that Americans use,” and the “health insurance held by and available to U.S. workers.” In a typical year, the survey collects data from more than 30,000 people from more than 10,000 families. In addition to the raw data files, the Agency for Healthcare Research and Quality, which runs the survey, also provides summary data tables. They show that, for example, in 2013 an estimated 61% of Americans faced expenses for prescription drugs, which cost the median patient about $278 before insurance. [h/t Ricardo Pietrobon] — : September 7, 2016
Links:
Tags: healthcare
The 2016 U.S. Open began on Monday. It’s as good an occasion as any to highlight the work of TennisAbstract.com’s Jeff Sackmann, who has published decades of match results and historical rankings from the men’s ATP and women’s WTA tours. Related: How FiveThirtyEight is using the data to forecast this year’s U.S. Open. Also: Prize money for the four Grand Slam tournaments, by gender and over time. And: The Tennis Racket. [h/t Nadja Popovich + John Templon] — : August 31, 2016
Links:
Tags: sports
California Senate Bill 272, enacted last year, required every local government agency to publish a “catalog of enterprise systems” — essentially a guide to all the big databases they keep — by July 1 of this year. To find out who complied, a group of data-transparency organizations hosted the California Database Hunt last weekend. Volunteers searched 680 agencies, and published two spreadsheets of their findings: 430 (63%) of local agencies had posted their database catalogs, while 250 had not. [h/t Stephanie M. Lee] — : August 31, 2016
Links:
Tags: technology
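The compliance figures in the California Database Hunt item above are easy to cross-check. A minimal sketch using only the counts quoted in the item (the variable names are illustrative, not taken from the volunteers' spreadsheets):

```python
# Cross-check the California Database Hunt counts quoted above.
agencies_searched = 680
catalogs_posted = 430

# Agencies that had not posted a catalog, and the share that had.
catalogs_missing = agencies_searched - catalogs_posted
share_posted = catalogs_posted / agencies_searched

print(f"{catalogs_missing} agencies without a catalog")
print(f"{share_posted:.0%} of agencies had posted a catalog")
```

The two counts sum to the 680 agencies searched, and the posted share rounds to the 63% given in the item.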
EarthStat provides geographic data on harvest regions, yields, and fertilizer use for more than 100 crops. The website also publishes data on pasture land, water depletion, and climatological effects on crop yields. — : August 31, 2016
Links:
Tags: agriculture, climate, water
On Monday, the Department of Transportation released 2015 data from its Fatality Analysis Reporting System. The dataset contains detailed information about every fatal motor-vehicle crash in the U.S., aggregated from a variety of state databases, including police reports, death certificates, and licensing files. In 2015, such crashes led to 35,092 deaths, 7.2% more than in 2014. [h/t Tanya Snyder] — : August 31, 2016
Links:
The Panel Study of Income Dynamics is “the longest running longitudinal household survey in the world,” according to its University of Michigan overseers. The study, which began in 1968, has interviewed more than 70,000 people, including four generations of some families. You can access the data for free, but you first need to register for an account and agree to a set of guidelines. An example insight: In 2013 — the most recent year for which data is available — approximately 11% of families said they owned a business in the previous year. [h/t Don Fullerton + Nirupama S. Rao] — : August 31, 2016
Links:
The German Traffic Sign Recognition Benchmark dataset contains 50,000+ images of 43 kinds of German traffic signs — from the classic “STOP,” to various speed limits, to roundabout indicators. The dataset, published by researchers at Ruhr-Universität Bochum’s Institut für Neuroinformatik, formed the basis of a 2011 machine-learning competition. [h/t Viktor Schepik] — : August 24, 2016
Links:
Tags: language, technology
New York State tracks every time a horse has been injured or has died at a state race track since March 2009. The dataset, which is updated often, also includes a few other types of incidents, such as when a rider falls or a horse loses badly. Related: “Horses’ Deaths at Aqueduct Prompt New Rules.” [h/t Mark Secada] — : August 24, 2016
Links:
The OpenOil project aims to collect and standardize data on oil and gas development contracts around the world. So far, they’ve gathered at least some data from more than 60 countries. They’ve also published a map of oil concessions in the Middle East and Africa. [h/t Michael Gardiner] — : August 24, 2016
Links:
Tags: energy
Australia’s Department of Health has recently released an enormous dataset of Medicare and subsidized-prescription claims. It includes all claims from a random 10% sample of patients, and “contains approximately 1 billion lines of data relating to approximately 3 million Australians.” The Medicare claims go back to 1984, and the prescription claims go back to 2003. [h/t Drew Ivan] — : August 24, 2016
Links:
Tags: healthcare
The Marshall Project has collected and analyzed four decades of FBI data “on the most serious violent crimes in 68 police jurisdictions.” The FBI data covers 1975 through 2014; the reporters “also obtained data directly from 61 local agencies for 2015 — a period for which the FBI has not yet released its numbers.” Between 2010 and 2015, violent crime increased most in Milwaukee (+11%) and declined most in Prince George’s County, Md. (-22%). — : August 24, 2016
Links:
Tags: crime
In the 1990s, ethnoarchaeologist Lewis Binford digitized more than 200 variables describing 339 groups of hunter-gatherers, a project his collaborator and widow Amber Johnson continues to maintain. The data come from historical ethnographies of societies, ranging from the Chichimec of the 1570s (in what is now Mexico), to the Dorobo of the 1920s (in what is now Kenya), to the Shompen of the 1980s (in the Nicobar Islands). — : August 17, 2016
Links:
Tags:
The Penn World Table contains GDP estimates, normalized for purchasing power, for 182 countries. These “real GDP” estimates — based on a combination of price surveys and national accounts data — stretch back at least to 1960, and many to 1950. In the most recent year available, 2014, Qatar’s real GDP per capita ranked highest: roughly $144,340 in 2011 U.S. dollars. The Central African Republic’s ranked lowest (~$594), and the United States’ ranked 11th (~$52,292). [h/t Willem Kerstholt] — : August 17, 2016
Links:
Tags: economics
The American Society of Composers, Authors and Publishers (ASCAP) boasts a membership of “more than 585,000 US composers, songwriters, lyricists and music publishers of every kind of music.” The organization also maintains a downloadable catalog of the writers and publishers behind nearly 9 million songs. (But the downloaded files lack key details, such as the date the song was published.) — : August 17, 2016
Links:
Tags: entertainment, music
Through its Form 477 program, the Federal Communications Commission collects detailed data on broadband internet access in the United States. One of the easiest ways to access county-level data is through the agency's Mapping Broadband Health in America project, which overlays internet access data and physical health indicators. The latest tabulations come from 2014. In more than a quarter of counties with at least 1,000 residents that year, broadband reached less than 50% of the population. — : August 17, 2016
Links:
Tags: technology
The U.S. Environmental Protection Agency’s BEACON system contains data on more than 5,000 public beaches. For each state’s most “significant” beaches, BEACON’s downloadable reports include data on water quality, pollution advisories, closures, and more. Of these highly-visited beaches, the longest — at nearly 24 miles — is the Oregon Dunes National Recreation Area’s South Jetty, also home to “the largest expanse of coastal sand dunes in North America.” — : August 17, 2016
Links:
Tags: environment, oceans, water
PhysioNet has published sound and data files for more than 3,000 heart recordings (a.k.a. phonocardiograms). The files support PhysioNet’s 2016 contest, which seeks algorithms that can detect abnormal heart sounds. [h/t Joe Isaacson] — : August 10, 2016
Links:
Macrostrat.org provides data and maps on thousands of geologic formations around the world. The database currently includes 1,474 “regional columns,” 33,903 “rock units,” and 1,750,044 “geologic map polygons.” You can also explore the data through the University of Minnesota’s “Flyover Country” iOS and Android apps. [h/t Grant J. Smith] — : August 10, 2016
Links:
Tags: science
The Centers for Medicare & Medicaid Services evaluates hospitals on dozens of measures — relating to safety, timeliness of care, patient satisfaction, and more — and publishes the results online as the “Hospital Compare” dataset. The dataset also includes an overall score, which distills each hospital’s results into a single five-star rating. If you don’t want to download the data, you can explore the results online. [h/t Drew Ivan] — : August 10, 2016
Links:
Tags: healthcare
For more than a century, the U.S. Census collected slave population figures. An assistant professor at George Mason University has aggregated that data, and mapped it. He cautions: “Treat the Census numbers skeptically: even in the best of circumstances the Census undercounts the population.” Previously: New Orleans slave sales in the December 30 edition; slave ship voyages in the January 20 edition. — : August 10, 2016
Links:
Connecticut has begun publishing a daily census of every inmate held in jail while awaiting trial. Starting July 1, the database contains one row per inmate per day; each row includes basic demographic data (age, gender, race), as well as the inmate’s bond amount, main offense, and jail location. Read more at: The New Haven Independent and TrendCT. Question: This release seems unprecedented; does any other state or country publish such detailed data on pretrial inmates? [h/t Camille Seaberry] — : August 10, 2016
Links:
The 20 Newsgroups dataset contains 20,000 messages (including some duplicates) sent to 20 Usenet bulletin boards in 1993. Among the groups: alt.atheism, misc.forsale, sci.electronics, talk.politics.guns, and talk.politics.mideast. — : August 3, 2016
Links:
A group of public health researchers have estimated the average height of adults in 200 countries over the course of a century. Their calculations are based on a re-analysis of 1,472 previous studies, which collectively measured nearly 19 million participants. The resulting dataset contains annual height estimates for both men and women born each year between 1896 and 1996. During that time, South Korean women’s average height increased by approximately 8 inches, the largest gain of any group. These days, the Netherlands boasts the tallest men, and Latvia the tallest women. — : August 3, 2016
Links:
Tags: statistics
In May 2016, U.S. residential consumers paid an average of roughly 12.8 cents per kilowatt hour of electricity. The price was lowest in Louisiana (9.28 cents) and Washington state (9.54 cents), and highest in Hawaii (26.87 cents) and Connecticut (21.63 cents). These data-points, and more, are available through the Energy Information Administration’s electric power reports, which are updated monthly. [h/t Jordan Wirfs-Brock] — : August 3, 2016
Links:
Tags: energy
At least 6,913 people died while in the custody of Texas police, jails, and prisons between 2005 and 2015, according to the newly-launched Texas Justice Initiative. The data, gathered through freedom-of-information requests, contains the age, sex, and race/ethnicity of each person who died, as well as the general cause of death and a more detailed summary. Read more at: The Atlantic. Related: California’s Department of Justice publishes similar statistics and raw data. [h/t Melissa Segura + Reade Levinson] — : August 3, 2016
Links:
The World Health Organization publishes a slew of datasets on national vaccination rates and policies. Some facts gleaned from the data: Asked whether they provided routine vaccinations to children at school, just 55% of 191 countries that responded said they did. And: In 2015, Equatorial Guinea reported that only 26% of infants had received a first dose of measles vaccine, a lower rate than any other country’s. [h/t Philip Shemella] — : August 3, 2016
Links:
Tags: diseasehealthcare
The Bigfoot Field Researchers Organization dubs itself “the only scientific research organization exploring the bigfoot/sasquatch mystery.” The BFRO collects and vets sighting reports, and publishes them online. (Direct link to KMZ file.) Related: “'Squatch Watch: 92 Years of Bigfoot Sightings in the US and Canada.” [h/t Joshua Stevens + Lynn Cherny] — : July 27, 2016
Links:
The Pacific walrus (Odobenus rosmarus divergens) accounts for the vast majority of walruses on the planet. When they’re not swimming, Pacific walruses like to rest at places called “haulouts.” A new dataset and study include details on 150 current and historic haulouts, the largest of which has been reported to attract more than 100,000 walruses. Miscellany: Three of the study’s authors work for the U.S. Department of the Interior; the fourth works for Russia’s Institute of Biological Problems of the North. [h/t Keith Collins] — : July 27, 2016
Links:
Tags: animalsenvironment
The U.S. Institute of Museum and Library Services annually collects responses from 9,000 public library systems. The results, currently available through 2013, include information about the libraries’ collection size, physical footprint, population served, hours, and more. Previously: Every known museum in the United States, featured Nov. 11, 2015. — : July 27, 2016
Links:
Tags: books
Transitland and TransitFeeds both aggregate data on routes, stops, and timetables from hundreds of public transit systems — from the Bay Area’s BART, to New York’s MTA, to Milan’s ATM, to Budapest’s BKK. — : July 27, 2016
Links:
Tags: transportation
The Global Burden of Disease dataset represents “the largest and most comprehensive effort to date to measure epidemiological levels and trends worldwide,” according to the Institute for Health Metrics and Evaluation, which runs the project. For each disease and each country, the dataset contains estimates of the total deaths, years of life lost, and years lived with disability. The estimates are currently available for 1990, 1995, 2000, 2005, 2010, and 2013. Related: “Where We Live and How We Die: What a year of death looks like around the world.” [h/t Mimi Onuoha + Data & Society] — : July 27, 2016
Links:
Thanks to the Paperwork Reduction Act, federal agencies must get approval from the Office of Information and Regulatory Affairs for any “information collection” (e.g., a form) that seeks 10 or more responses. You can search all information collections — under review, approved, or rejected — online, or download an XML file of all active collections. — : July 20, 2016
Links:
Tags: government
Today, UNESCO’s World Heritage Committee will wrap up its 40th session, during which it has “inscribed” more than 20 new awe-inspiring places around the world. Online, the organization publishes spreadsheets and map files of 1,031 heritage sites it has previously inducted. For each site, the spreadsheet tracks its location, size, date inducted, category (“cultural,” “natural,” or “mixed”), which selection criteria it met, and more. Through 2015, the countries with the largest number of heritage sites were Italy (51), China (48), and Spain (44). — : July 20, 2016
Links:
Tags: United Nationsmapping
The National Fire Incident Reporting System (NFIRS) is “the world’s largest, national, annual database of fire incident information,” containing about 1 million fires per year, including wildfires, structure fires, vehicle fires, and more. NFIRS data from 2013 (and prior years) are available online from FEMA. Looking for 2014’s data? The government asks you to request it via postal mail; or you could trust the copy a public safety analyst uploaded in March. (See the links at the bottom of that page.) The U.S. Fire Administration, which maintains NFIRS, publishes additional datasets, including a spreadsheet of 27,000+ fire departments and a database of on-duty firefighter fatalities. Also, the U.S. Geological Survey publishes data on current and historical wildfire perimeters. [h/t Nick Penzenstadler + Nadja Popovich] — : July 20, 2016
Links:
Tags: disaster
StackOverflow is a Q&A site for programmers, and part of the larger StackExchange network of Q&A communities. StackExchange publishes periodic data dumps of the network’s users, questions, answers, votes, and comments. On Monday, the company released “StackLite,” a smaller, easier-to-use slice of the data. (Even so, it contains metadata on more than 15 million questions.) If you don’t want to download anything, you can also explore and analyze the data online. [h/t David Robinson] — : July 20, 2016
Links:
Tags: technology
Two political science professors at the University of Kentucky are compiling a dataset of coup attempts. So far, the dataset covers both successful and unsuccessful attempts from 1950 to late 2015. During those 65+ years, coup plotters have been foiled about half the time, with 236 victories and 238 failures. According to the dataset, Bolivia’s top leaders have faced 23 coup attempts, including 11 successful overthrows — more than any other country by either metric. [h/t Arthur Charpentier] — : July 20, 2016
Links:
Pokéapi is an API “detailing everything about the Pokémon main game series,” including every character, evolution, battle skill, and more. The data is also available as a series of CSVs. Currently, however, the dataset doesn’t include details from the so-hot-right-now Pokémon Go game. — : July 13, 2016
Links:
Tags: entertainmentgame
The U.S. Occupational Safety and Health Administration (OSHA) conducted 86,000 workplace inspections last year. The agency makes its inspection results — including investigations of fatal accidents and severe injuries — available in bulk and via an API. — : July 13, 2016
Links:
Tags: deathhealthcareinjury
OpenAddresses.io is an effort to collect the official geocoordinates of all the world’s physical addresses. (These data come from “authoritative” sources, such as city governments. When Google Maps tells you the location of an address, it’s often just a very-educated guess, extrapolated from coarser data.) As of Monday evening, the project had processed 265,078,567 addresses, mostly in North America, Europe, Japan, and Australia. Related: “Open-source geo is really something right now.” — : July 13, 2016
Links:
Tags: mapping
Late last month, the Centers for Medicare and Medicaid Services added data from 2015 to its Open Payments database, which tracks medical companies’ payments to doctors and teaching hospitals. The payments — which include consulting fees, gifts, honoraria, meals, drinks, grants, and more — totaled more than $7.5 billion last year. Related: ProPublica’s Dollars for Docs project, which began tracking medical industry payments in 2010, long before CMS released the Open Payments database. [h/t Cat Ferguson + Chris Hamby] — : July 13, 2016
Links:
Tags: businesshealthcare
This repository contains voting data from each of the UN General Assembly’s first 69 sessions. One spreadsheet summarizes the topic and results of each voted-upon resolution. (The dataset also indicates whether the U.S. State Department identified the vote as “important” — such as those condemning human rights violations in Syria and North Korea — in its annual Voting Practices in the United Nations report.) Another file contains each country’s individual voting decisions. [h/t David Robinson] — : July 13, 2016
Links:
And yet, people do... by the thousands. In 2005, the Federal Aviation Administration created a system for pilots to report “laser events,” which it says can temporarily blind crewmembers. The administration has published five years of data from the reporting system. In 2014, the most recent year available, pilots reported 3,894 laser beamings. The vast majority involved a green beam, and none were reported to have caused an injury. — : July 6, 2016
Links:
Property tax data in New York City is technically available to the public, but the city makes it difficult to access. So a pair of civic hackers liberated the data. Now you can download 1.1 million rows of bulk data, which details each property’s type, assessed value, taxes due, owner’s name, and more. You can also download 750,000 rows of tax exemptions and abatements. Related: “A Look at NYC’s $650 Million Property Tax Breaks Related to Religion” — : July 6, 2016
Links:
Tags: real estatetaxes
In a recently-updated paper, three academics say they’ve found “convincing evidence of election fraud” in federal Russian elections since 2004. To support their analyses, the researchers have published the underlying data, which includes polling station data from seven Russian elections (as well as one Polish and one Spanish election, which showed no such signs of fraud). Related: WSJ analysis of Russian parliamentary election “points to widespread fraud” (2012). [h/t Arthur Bashlykov] — : July 6, 2016
Links:
Last month, German investigative nonprofit Correctiv published a searchable database of 13,000 nursing homes in the country. The data are based on government inspections, and the reporters have published the raw and processed data on GitHub. Related: ProPublica’s searchable database of nursing homes in the United States and Medicare’s nursing home data. [h/t Sandhya Kambhampati] — : July 6, 2016
Links:
Tags: healthcare
The Correlates of State Policy Project aims to become a “one-stop shop” for data related to public policy in America’s 50 states. So far, the project is tracking 700+ aspects of each state’s laws, budgets, demographics, and more. Among the policy variables: Can pharmacies dispense emergency contraception without a prescription? Does the state ban corporal punishment in schools? and Does the state have an endangered species act? Don’t miss the codebook, which describes the data and sources in greater detail. Related: State and Local Public Policies in the United States, a similar project, for which an update to include 2014 data is “underway.” [h/t Rob Gillezeau] — : July 6, 2016
Links:
Tags: government
“Most people find this website because they are searching for the source of an unusual low frequency sound.” The World Hum Database currently includes more than 10,000 reader-submitted reports, including recent submissions that describe the noise as sounding “like a fridge,” “like a train in the distance,” and “like a cicada that never shuts up.” [h/t Susie Cambria] — : June 22, 2016
Links:
Tags: audio
The U.S. Census Bureau’s Annual Characteristics of New Housing culls data on features such as square footage, wall material, number of bedrooms, and number of fireplaces. (Air conditioning was present in 93% of new single-family homes built in 2015, up from 49% in 1973.) Related: “Houses Keep Getting Bigger, Even as Families Get Smaller.” [h/t Lindsey Cook] — : June 22, 2016
Links:
Tags: real estate
Next month, thousands of adrenaline junkies will gather in Pamplona for the city’s annual Running of the Bulls. The San Fermin festival, which organizes the spectacle, publishes injury data on its website. (Here’s a shortcut to display every year of data, instead of one year at a time.) Last year, the bulls gored 10 runners and injured another 27. Related: “Your Chances Of Being Gored By A Bull In Pamplona Are Getting Higher.” — : June 22, 2016
Links:
Earlier this month, researchers published “the first spatially explicit dataset of urban settlements from 3700 BC to AD 2000,” along with a detailed methodology. The dataset digitizes and geocodes population numbers originally tabulated by historian Tertius Chandler (Four Thousand Years of Urban Growth) and political scientist George Modelski (World Cities: -3,000 to 2,000). Though “far from comprehensive,” the authors say that the dataset is a “first step towards understanding the geographic distribution of urban populations throughout history.” Related: “Watch 6,000 years of urbanization taking over the world.” — : June 22, 2016
Links:
Last week, the Internal Revenue Service released a huge dataset of nonprofits’ annual Form 990 filings, which provide details on program expenses, salaries, and more. More than 60% of Form 990s are filed digitally, according to the IRS. Previously, those forms were only available as images; now the IRS is publishing them as analysis-friendly XML files. (You can also download the data in bulk from the Internet Archive, thanks to Carl Malamud, the public domain advocate who led the fight for 990s-as-XML.) One early observer noted that some of the data was misformatted, and has provided instructions for fixing it. [h/t Andrew Sullivan + Kendall Taggart] — : June 22, 2016
Links:
Tags: taxes
The National Oceanic and Atmospheric Administration’s Fisheries Statistics Division provides data on seafood caught by U.S. commercial fisheries, sliceable by month, species, and fishing gear. You can learn, for example, that these fisheries caught 88,893,305 pounds of Dungeness crab in 2006 — the highest recorded total since at least 1950. [h/t Gwynn Guilford] — : June 15, 2016
Links:
PhishTank is a clearinghouse that tracks thieves’ attempts to steal personal information and online credentials. The website also publishes bulk data on all verified phishing attempts — 44,000 and counting. With more than 1,000 phishing attempts recorded against it, PayPal is the single most-targeted website in the database. [h/t Herman Slatman] — : June 15, 2016
Links:
Tags: crimetechnology
Last month, the World Health Organization released its latest update to the Global Urban Ambient Air Pollution Database, which now covers nearly 3,000 cities in 103 countries. For each city, the dataset includes annual average density of two key categories of particulates (PM2.5 and PM10), as well as details regarding the data collection. According to the organization’s own analysis, “98% of cities in low and middle income countries with more than 100,000 inhabitants do not meet WHO air quality guidelines.” Related: “A New Air Pollution Database Is Good, but Imperfect.” — : June 15, 2016
Links:
Tags: climateenvironment
The Chronicle of Higher Education has been tracking federal investigations into sexual assault on college campuses. Recently, The Chronicle added an API, so that developers and data analysts can access the data more easily. Currently, the dataset includes 292 investigations conducted since April 2011 — 49 of which have been resolved. [h/t Jon Davenport] — : June 15, 2016
Links:
On everypolitician.org, you can search and download data on 70,000+ legislators (past and present) from 233 countries. (Among those missing: Cuba, Ethiopia, and Qatar.) The dataset includes each lawmaker’s party affiliation, years served, gender, social media profiles, and more. Related: Every member of the United States Congress since 1789. — : June 15, 2016
Links:
Statistics grad student Kaylin Walker scraped 50 years of Billboard’s “Year-End Hot 100” rankings and those songs’ lyrics. Related: Walker’s analysis and methodology. [h/t Melissa Bierly] — : June 8, 2016
Links:
In 2006, Netflix launched a $1 million challenge to beat the company’s movie-recommendation algorithm. In 2009, Netflix awarded the prize to a group of AT&T scientists (though ultimately didn’t use the winning algorithm). The challenge, which was open to the public, was based on a dataset of 100 million ratings from 480,000 (anonymized) users, corresponding to more than 17,000 movies between Oct. 1998 and Dec. 2005. The dataset, once hosted at UC Irvine, is currently available through the Internet Archive. Previously: MovieLens, featured Jan. 27. [h/t Brandon Loudermilk] — : June 8, 2016
Links:
The Sunlight Foundation’s Hall of Justice brings together “nearly 10,000” criminal justice datasets and research documents from across the United States. You can search for topics and filter by geography, publisher, and accessibility (open, open-but-not-machine-readable, restricted access, et cetera). Related: Sunlight’s “lessons learned from a year of opening police data.” [h/t Susie Cambria + Noah Veltman] — : June 8, 2016
Links:
The U.S. government maintains a “judgment fund,” which it uses to pay plaintiffs when federal agencies lose in court (or settle “actual or imminent lawsuits”). The Department of the Treasury, which administers the fund, publishes data on these payouts for each fiscal year going back to FY2006. [h/t CJ Ciaramella] — : June 8, 2016
Links:
Researchers in Europe have published a database of 216 nuclear energy accidents — a compendium they say is “twice the size of the previous best data set.” For each accident, the database contains the date, location, description, and four measurements of severity: its ratings on the International Nuclear Event Scale and on the Nuclear Accident Magnitude Scale, the number of fatalities, and total monetary cost. (The three most expensive: Chernobyl, Fukushima, and a 1995 accident at Japan’s Monju Nuclear Power Plant, estimated to have caused $15.5 billion in damages.) [h/t Dad] — : June 8, 2016
Links:
BrickLink is a website for buying and selling LEGOs. It also happens to publish a (nearly?) complete inventory of every LEGO set and piece produced since 1949. Related: LEGO sets have become increasingly violent, according to a recent study. [h/t Lindsey Cook] — : June 1, 2016
Links:
Tags: entertainment
The Scripps National Spelling Bee publishes the competition’s results online, but not in any analysis-friendly format. Thankfully, statistician Christopher Long has scraped and spreadsheet-ified the Scripps results going back to 1996 – including last week’s finals. Related: FiveThirtyEight uses the data to ask, “Where Do Spelling Bee Words Come From?” — : June 1, 2016
Links:
Tags: languagestatistics
The United Nations University’s World Income Inequality Database contains historical Gini coefficients for more than 170 countries — in some instances stretching back to the 1930s or ’40s. The latest version of the database was released in October 2015 and includes key details about each estimate, such as the name of the primary source and the quality of data collection. — : June 1, 2016
Links:
Tags: United Nationsstatistics
Between 2002 and 2004, researchers surveyed more than 9,500 farming households in 11 African countries to better understand how climate change might affect agricultural practices. Last month, they published the detailed results and documentation in Scientific Data. The dataset includes responses to questions about plantings, harvests, yields, water sources, animal purchases, taxes paid, and much more. — : June 1, 2016
Links:
In 2014, approximately 22 million U.S. military veterans were still alive, including 1 million who served in World War II, 7.2 million who served during the Vietnam War era, and 3.9 million who have served in post-9/11 wars. Those numbers come from the VA’s National Center for Veterans Analysis and Statistics, which publishes estimates and future-projections of the country’s veteran population. You can explore the data by age, race, ethnicity, gender, military branch, state, county, era of service, and more. (To see the files, click on the “Population Tables” header.) [h/t Charles Worthington] — : June 1, 2016
Links:
The Photographers’ Identities Catalog aggregates data on more than 110,000 photographers and photo studios throughout history. The information “has been culled from trusted biographical dictionaries, catalogs and databases, and from extensive original research” by the New York Public Library’s photography experts. The catalog — which includes data on gender, geography, range of years active, and more — is available as raw CSVs on GitHub. — : May 25, 2016
Links:
The Major League Soccer Players Union publishes salary data going back to 2007, and released 2016’s figures last week. (At $7.17 million in total compensation, Orlando City’s Kaká ranks as the league’s highest-paid player.) The MLSPU publishes the data as PDFs; I’ve converted those PDFs into CSVs for you. [h/t Rose Eveleth + John Templon] — : May 25, 2016
Links:
To help understand San Francisco’s soaring real estate prices, Eric Fischer transcribed decades of apartment and house listings in the San Francisco Chronicle. For each year from 1948 through 1979, Fischer jotted down every monthly rent advertised in the paper on the first Sunday in April. (Similar data for 1979 through 2001 is available from San Francisco’s Housing Study DataBook.) The transcriptions are available on GitHub. [h/t Kendall Taggart + Michael Andersen] — : May 25, 2016
Links:
“There’s software used across the country to predict future criminals. And it’s biased against blacks,” a ProPublica analysis has found. The investigation focused on risk assessments and recidivism in Broward County, Florida, and found that black defendants were more likely than white defendants to be mislabeled as “high risk.” The reporters have published their methodology, code, and the underlying data — including two years of Broward County risk assessments — on GitHub. — : May 25, 2016
Links:
Tags: crimejusticetechnology
Governments around the world have used “LiDAR” — a laser-powered surveying technology — to build impressively precise elevation maps. In many cases, they’ve also released these topographic datasets to the public. The U.S., for instance, publishes gobs of LiDAR data through the Interagency Elevation Inventory. And you can also find LiDAR datasets for the United Kingdom, Spain, Finland, Slovenia, Denmark, Switzerland, the Netherlands, and New York City. Related: Using LiDAR data to print a 3D map of London. — : May 25, 2016
Links:
Tags: mapping
The MusicBrainz database contains metadata on more than one million artists, 16 million recordings, and 900,000 pieces of cover art. You can download the data in bulk or query it via an API. Previously: The smaller-but-more-detailed Million Song Dataset, featured Feb. 10. [h/t Geoff Boeing] — : May 18, 2016
Links:
Tags: audioentertainmentmusic
I Quant NY author Ben Wellington recently discovered that New York City had been “ticketing legally parked cars for millions of dollars a year.” To reach that finding, Wellington analyzed three years of parking tickets, amounting to more than 30 million summonses. NYC isn’t alone in providing parking ticket data; Philadelphia, Toronto, Baltimore, Seattle, and others publish similar datasets. — : May 18, 2016
Links:
Tags: statistics
The United Nations publishes estimates of the number of foreign-born residents living in every country. The figures cover 1990 to 2015, at five-year intervals. The Vatican (100% foreign-born) and the United Arab Emirates (88%) had the highest proportion of immigrant residents in 2015; the U.S. (46.6 million) boasted the largest total immigrant population. The dataset also includes estimates by age, sex, and country of origin. Previously: Refugees in America, featured Nov. 25, 2015. [h/t Manu Balachandran] — : May 18, 2016
Links:
The Paleobiology Database, run by a non-profit group of researchers, has aggregated data on more than a million fossils from all around the world. You can access the dataset — organized by species, era, and location — via an interactive map, download form, or API. — : May 18, 2016
Links:
To help monitor drug safety, the FDA collects “adverse event” reports submitted by patients, doctors, and manufacturers. You can download the (anonymized) reports from the FDA directly, but that dataset includes duplicate cases, and sometimes calls the same drug by different names. A group of researchers recently announced that they’ve cleaned up the data — removing duplicates and standardizing nomenclature — so that you don’t have to. The resulting dataset covers 4,245 drugs, more than 17,000 types of reactions, and nearly 5 million case reports. Previously: The SIDER database of pharmaceutical side effects, featured Nov. 11, 2015. — : May 18, 2016
Links:
Tags: drugshealthcare
In response to a freedom-of-information request, the NYC Department of Buildings provided WNYC with a spreadsheet of 76,088 “registered elevator devices” in the city. Elevators and escalators dominate the list, but you’ll also find dumbwaiters, handicap lifts, and a few other vertical transporters. The spreadsheet includes data on location, speed, maximum capacity, floors served, and more. Related: FiveThirtyEight analyzed the data last week. [h/t Michael A. Rice, a teacher at Ingraham High School in Seattle + John Templon] — : May 11, 2016
Links:
Tags: mappingstatistics
An international network of researchers who study noncommunicable diseases estimates the annual prevalence of obesity and diabetes for approximately 200 countries and territories around the world. The data currently covers 1975–2014 and is based on 2,000+ surveys, according to the group. Related: Bloomberg’s chart and maps of the data. — : May 11, 2016
Links:
Tags: healthcare
The Institute for Cannabis (established in 1985 as The Institute for Hemp) has obtained, via FOIA, the U.S. Drug Enforcement Administration’s list of organizations licensed to handle marijuana (or, as the license application form calls it, “marihuana”). Many of the nearly 3,000 licensees are law enforcement organizations, but universities, pharmacies, and hospitals also pepper the list. [h/t Michael Ravnitzky] — : May 11, 2016
Links:
Tags: drugs
Since 2009, NASA’s Kepler spacecraft has been looking for Earth-like exoplanets — i.e., planets outside our solar system. Through the NASA Exoplanet Archive, you can explore, filter, and download databases of “candidate” and “confirmed” exoplanets, including Kepler’s discoveries. [h/t David Kipping] — : May 11, 2016
Links:
Tags: science
On Monday, the International Consortium of Investigative Journalists released data on 210,000 companies, trusts, and funds named in the massive Panama Papers leak. The database is searchable online and downloadable as several CSV files. The dataset includes companies’ officers, registered addresses, and middlemen. It supplements a pre-existing cache of 105,000 companies named in ICIJ’s 2013 "Offshore Leaks" investigation. — : May 11, 2016
Links:
Climate scientists have compiled a dataset of grape-harvest-dates from 380 European vineyards, across 27 regions, and stretching back 650 years. The earliest data-point refers to a Burgundy harvest in 1354. Related: The original academic paper. [h/t Martín González] — : May 4, 2016
Links:
OpenFootball collects and publishes results and rosters from national and international soccer/football matches, including the Premier League and the World Cup. Related: English soccer/football results, 1871–2014. [h/t Wendy Mak] — : May 4, 2016
Links:
Tags: sports
The National Practitioner Data Bank tracks medical malpractice payments, license suspensions, Medicare expulsions, and other lists of penalized physicians. The public use data file includes dozens of details per entry but excludes the part that is almost certainly most important to patients: the doctors' names. Related: “Doctors perform thousands of unnecessary surgeries,” according to a 2013 USA Today investigation that relied partly on the NPDB. — : May 4, 2016
Links:
Tags: healthcare
Scholars at Virginia Commonwealth University have identified and mapped the locations of 2,000 KKK branches active in the early 20th century. The dataset contains the city, state, earliest-known-date, and sources for each “klavern.” Related: “Active Hate Groups in the United States in 2015,” a report by the Southern Poverty Law Center. [h/t K Reed] — : May 4, 2016
Links:
UPDATE: Check out the Southern Poverty Law Center's Hate Map, which includes a link for downloading data from 2000 to 2018.
Sci-Hub bills itself as “the first pirate website in the world to provide mass and public access to tens of millions of research papers.” Who’s downloading papers from the site? “Everyone,” Science magazine concluded after analyzing data culled from six months of Sci-Hub server logs. For every download, the dataset identifies the paper downloaded, the date and time, an anonymized version of the downloader’s IP address, and a rough location. [h/t Melissa Bierly + Tom Grahame] — : May 4, 2016
Links:
Tags: science
The Star Wars API provides programmatic access to data about every character, species, spaceship, planet, and film in George Lucas’ cinematic universe. You can also download JSON files containing all the data. [h/t Robin Sloan] — : April 27, 2016
Links:
Tags: entertainmentmovies
The U.S. House of Representatives requires all staff to reveal all “gift travel” — i.e., “free” trips that the government didn’t pay for. The Office of the Clerk compiles those filings into a database containing each trip’s dates and sponsors. (The Consumer Electronics Show paid for 49 staffers and one congressman to visit the Las Vegas convention in January.) The Senate publishes similar data, except it doesn’t include the sponsor name ... which kind of undermines the entire point. [h/t John Stanton] — : April 27, 2016
Links:
Tags: politics
The Bureau of Transportation Statistics requires the nation’s largest airlines to report scheduled and actual timing data for every domestic flight. The corresponding database includes information about delays, cancellations, and diversions, among other fields — and goes back to 1987. In January 2016, departing flights taxied for an average of 16 minutes, a minimum of 1 minute, and a maximum of 2 hours, 38 minutes. Related: “Which Flight Will Get You There Fastest?” [h/t Tom Augspurger] — : April 27, 2016
Links:
Tags: transportation
Last week, the researchers at CERN’s Compact Muon Solenoid Experiment released more than 300 terabytes of data. The datasets include raw particle-detection data from the Large Hadron Collider, as well as pre-processed datasets the researchers say “can be readily analysed by university or high-school students.” [h/t Dad] — : April 27, 2016
Links:
Tags: science
On Saturday, BuzzFeed hosted a FOIA data hackathon. Participants used datasets — from MuckRock, FOIA Machine, FOIA Mapper, and FOIA.gov — to analyze federal, state, and local responsiveness to public records requests. The first three datasets contain details about individual FOIA requests and responses; FOIA.gov provides aggregate internal data from federal agencies. — : April 27, 2016
Links:
Tags: journalism
The U.S. Alcohol and Tobacco Tax and Trade Bureau publishes a few permit datasets, including this table of 1,900+ businesses licensed to produce and/or bottle liquor. [h/t Maggie Lee] — : April 20, 2016
Links:
Tags: alcohol
Researchers have analyzed 15 years of satellite imagery to create a nearly-global dataset of seasonal cloud coverage. The data — available at a kilometer-square resolution — could help scientists monitor and predict changes in ecosystems. [h/t Grant Smith + Joanna Klein] — : April 20, 2016
Links:
Tags: climate
Baseball season is in full-swing, basketball and hockey playoffs have begun, and the NFL draft is nigh. No better time to highlight some cricket data! Cricsheet.org has gathered ball-by-ball data on more than 2,700 matches played since the mid-2000s. Looking for historical data? A new GitHub repository contains stats for more than 40,000 matches going back to 1773 (but mostly since the 1970s), scraped from ESPN Cricinfo. Related: How, statistically, the coin toss affects who’ll win. [h/t Derek Willis] — : April 20, 2016
Links:
Tags: sports, statistics
Last week, the Bureau of Labor Statistics published its midyear update to the Consumer Expenditure Survey. The survey collects data on spending, income, and a handful of characteristics about U.S. consumers. One tidbit: On average, Americans are spending approximately 33% of their income on housing, and a tad less than 1% on alcohol. [h/t Nathan Yau] — : April 20, 2016
Links:
An under-scrutinized quirk in a little-known, widely-used database “turned a random Kansas farm into a digital hell.” How? The database contains best-guess geographic coordinates for every IP address on the internet. But for millions of IP addresses, the best guess is just somewhere in the United States. And, until recently, the database translated that vague location into the latitude and longitude of a farm in Potwin, Kansas. (Now it points to a lake.) — : April 20, 2016
Links:
Tags: mapping
The Federal Aviation Administration maintains a database of all non-military aircraft registrations, which includes extensive details about each plane/helicopter/glider/blimp and their owners. Related: “Spies In The Skies.” [h/t Peter Aldhous] — : April 13, 2016
Links:
Tags: transportation
Over the weekend, Hannah Anderson and Matt Daniels published an interactive analysis of male and female speaking roles in 2,000 movie scripts. Among their findings: 308 scripts gave 90%+ of the film’s dialogue to men, while just 8 scripts did so for women. The duo has also released “as much data as we can share (without getting sued)” on GitHub. — : April 13, 2016
Links:
The Health Inequality Project calculates American life expectancies by income, gender, and geography. You can download the data at the national, state, county, and “commuting zone” levels. Where do poor Americans live the longest? New York City, Santa Barbara, and San Jose. [h/t Margot Sanger-Katz] — : April 13, 2016
Links:
CourtListener gathers and publishes bulk data from the Supreme Court, all federal appeals courts, and hundreds of other jurisdictions. The files include opinions, audio from oral arguments, dockets, and citations. It also has an API. (If you register, you can also create and explore networks of citation-linked cases.) [h/t Jeff Grove] — : April 13, 2016
Links:
Tags: law
To create the most detailed measurements of global rainfall ever, researchers at UC Santa Barbara’s Climate Hazards Group harmonize data from satellites and on-the-ground weather stations. The dataset, known as CHIRPS, stretches back more than 30 years and is freely available. Related: Eric Holthaus provides more details and explains why the dataset is so important. [h/t Dave Riordan] — : April 13, 2016
Links:
Tags: climate
An API Of Ice And Fire lets you fetch data about every book, character, and house in Game of Thrones — including allegiances, family trees, and dates of death. You can also download the data in bulk. Related: Macalester researchers recently published a network analysis (and underlying data) of all characters in A Storm of Swords, the third book in the series. Jon Snow, according to the analysis, was the second-most important character. [h/t Melissa Bierly] — : April 6, 2016
Links:
When physician John Snow constructed his now-famous dot-map of London’s Broad Street cholera outbreak in the 1850s, the leading geospatial technologies were ink and paper. Academic Robin Wilson has adapted the data for the computer age, converting Snow’s map into several modern GIS formats. Related: Infographics in the Time of Cholera. — : April 6, 2016
Links:
Hacker News’ official API provides data describing every submission, comment, and user on the community-driven website. You can also analyze the full dataset via Google’s recently-relaunched BigQuery Public Datasets program. [h/t Michael Gardiner] — : April 6, 2016
Links:
Tags: technology
Every January, at the behest of the U.S. Department of Housing and Urban Development, volunteers across the country attempt to count the homeless in their communities. The result: HUD’s “point in time” estimates, which are currently available for 2007–2015. The most recent estimates found 564,708 homeless people nationwide, with 75,323 of that count (more than 13%) living in New York City. Related: “Why counting America’s homeless is both imperative and imperfect.” Also related: “How Many Street Homeless? NYC’s Tallies Leave the Question Open.” [h/t Tim Henderson + Jonathan Stray] — : April 6, 2016
Links:
Tags: aid
The citybik.es API provides access to live data on every bike-sharing station in more than 400 cities around the world. It’s free, and the underlying software is open-source. What data you get per station depends on the city, but typically includes the number of empty slots, number of available bikes, and location information. Looking for bulk data on bike-sharing rides? Many cities — including New York, Chicago, and D.C. — make it available. Related: “A Tale of Twenty-Two Million Citi Bike Rides.” Also related: Three maps illustrating the gender gap in bike-share usage. — : April 6, 2016
Links:
Tags: transportation
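For readers who want to work with the per-station fields described above, here is a minimal sketch of summarizing one network's stations. It does not call the API. The payload shape and the field names (`network`, `stations`, `free_bikes`, `empty_slots`) are assumptions for illustration, and the sample data is invented, so check them against a real response before relying on them.

```python
import json

# Invented sample payload; the structure and field names are assumptions.
SAMPLE_PAYLOAD = json.loads("""
{"network": {"name": "Example Bikes", "stations": [
  {"name": "Main St", "free_bikes": 4, "empty_slots": 11,
   "latitude": 40.0, "longitude": -75.0},
  {"name": "Park Ave", "free_bikes": 0, "empty_slots": 15,
   "latitude": 40.1, "longitude": -75.1}
]}}
""")


def summarize_stations(payload):
    """Total up available bikes and empty slots across one network's stations."""
    stations = payload["network"]["stations"]
    return {
        "stations": len(stations),
        # Treat missing or null counts as zero, since coverage varies by city.
        "free_bikes": sum(s.get("free_bikes") or 0 for s in stations),
        "empty_slots": sum(s.get("empty_slots") or 0 for s in stations),
        "stations_without_bikes": [
            s["name"] for s in stations if not s.get("free_bikes")
        ],
    }


summary = summarize_stations(SAMPLE_PAYLOAD)
print(summary)
```

Because the available fields depend on the city, the sketch reads counts with `.get()` instead of assuming every station reports them.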
In 1999, the USDA Economic Research Service published a “natural amenities scale,” which rated every county in the contiguous United States based on factors such as landscape variation and January sunniness. Last year, based on the dataset, a Washington Post reporter called Minnesota’s Red Lake County “the absolute worst place to live in America.” Now, he’s moving there. [h/t Jody Avirgan] — : March 30, 2016
Links:
Tags: environment
Open Food Facts is a crowdsourced database of food products’ nutrition data and ingredient lists. (E.g., this kilogram jar of Nutella contains 316 grams of fat.) The entire database can be downloaded in several formats. — : March 30, 2016
Links:
Tags: food
Based in large part on Encyclopedia Titanica, researchers have compiled a structured dataset of 1,309 passengers on the RMS Titanic’s maiden voyage. (To get the data, download titanic3.csv on this page.) The dataset includes passengers’ names, ages, ticket fare, cabin number, and whether they survived. — : March 30, 2016
Links:
Tags: history
Researcher Gwern Branwen has assembled an archive of listings posted to “dark net markets.” Silk Road is the best-known among the group, but the collection covers scores of other markets, including Amazon Dark and FreeBay. The materials gathered from each site are slightly different; many include product advertisements and seller profiles. Warning: Some of the archives contain pictures, which may include offensive or disturbing imagery. And it’s probably wise to heed Gwern’s caveats: The scrapes “are large, complicated, redundant, and highly error-prone. They cannot be taken at face-value.” [h/t Mike Sconzo] — : March 30, 2016
Links:
Tags: crime, economics, technology
Want to fly a drone in the United States for non-recreational purposes? You’ll need a “Section 333” exemption from the Federal Aviation Administration, which governs drone activity. The FAA publishes a list of approved exemptions, which Bard College’s Center for the Study of the Drone has converted into a PDF-formatted database. The Verge, in turn, has converted that PDF into an easy-to-use CSV. Related: Last week, the FAA updated its dataset of unmanned aircraft sightings. [h/t Dan Vergano] — : March 30, 2016
Links:
Tags: technology
NYC’s 311 dataset contains a special category for rat sightings. This slice of data, which is updated daily and stretches back to 2010, contains more than 73,000 rows. One-third of sightings have occurred in Brooklyn. Related: An academic study of NYC rat sightings. Also related: Reply All #56 — “Zardulu”. — : March 23, 2016
Links:
Tags: animals
The USDA Economic Research Service’s County Typology Codes categorize each U.S. county based on (a) its dependence on certain industries and on (b) various socio-economic factors. For example, the data classifies 219 counties as “mining-dependent.” [h/t Steven Romalewski] — : March 23, 2016
Links:
Tags: economics
The U.S. National Water Level Observation Network tracks water levels at hundreds of tide gauges around the country. The data is available via an API. Related: Water’s Edge, a 2014 Reuters investigation based on the gauge data. Also related: The Advanced Hydrologic Prediction Service’s flood observations and warnings, as structured data. [h/t Ryan McNeill] — : March 23, 2016
Links:
The UK’s Price Paid Data contains virtually all of the country’s residential property sales, with only a few exceptions. (Sales forced under court order are excluded, for example.) Each row includes the sale price, address, property type, and more. The full, multi-gigabyte dataset covers all sales since 1995, but you can also download files for individual years or the most recent month, or just search the dataset online. Related: Where can you afford to buy a house? [h/t Helena Bengtsson] — : March 23, 2016
Links:
Tags: real estate
The Oklahoma Geological Survey Observatory’s “Catalog of Nuclear Explosions” contains a “nearly complete” list of such detonations — more than 2,000 of them between 1945 and 2006. The dataset roughly (but not precisely) overlaps with the explosions listed in the Stockholm International Peace Research Institute’s “Nuclear Explosions, 1945–1998” (PDF) report. Both datasets list the date and location of each explosion, the country responsible, the detonation site, and (where known) its explosive yield, among other variables. And both reports use unconventional formatting, so I’ve extracted a couple of CSVs for you. — : March 23, 2016
Links:
If you’re looking for historical data on baseball teams, players, salaries, or managers, Sean Lahman’s Baseball Archive likely has it. The archive was updated with data from the 2015 season last week. Related: Retrosheet’s game logs — a record of every major league game since 1871. [h/t Joe Murphy] — : March 9, 2016
Links:
Tags: sports
The cruciverb industry is facing its first major plagiarism scandal, unearthed thanks to a newly-published database of crosswords that are at least 25% similar to previously published puzzles. — : March 9, 2016
Links:
With the help of volunteers, the New York Public Library is transcribing 6,000+ mortgage and bond ledgers from Emigrant Savings Bank, founded in 1850 and the oldest such bank in the city. You can search the transcribed records, or download the (very) raw data. — : March 9, 2016
Links:
Tags: history, money, real estate
The Sunlight Foundation’s Capitol Words project lets you explore the frequency of words and phrases in the Congressional Record since 1996. For example: “weapons of mass destruction”, “war” vs. “peace”, or “Obamacare”. The underlying data is available via an API. — : March 9, 2016
Links:
Researchers have compiled a multi-decade database of the super-rich. Building off the Forbes World’s Billionaires lists from 1996–2014, scholars at Peterson Institute for International Economics have added a couple dozen more variables about each billionaire — including whether they were self-made or inherited their wealth. (Roughly half of European billionaires and one-third of U.S. billionaires got a significant financial boost from family, the authors estimate.) — : March 9, 2016
Links:
Tags: money
Through a freedom of information request, WNYC obtained four years of New York City film and television permits. The 40,000+ records date from October 2011 to September 2015 and cover several types of permits, including those for scouting, shooting, and red carpet premieres. More: Popular TV shows’ shooting locations, mapped. [h/t John Templon] — : March 2, 2016
Links:
National population data is easy to find. But it’s much harder to find reliable, standardized population figures for finer-grained geographies. To that end, the World Bank has launched a pilot of its Subnational Population Database, which calculates estimates for 75 countries’ major provinces/states/regions. — : March 2, 2016
Links:
Tags: statistics
Congress has finally begun publishing official bulk data on the status of its bills — something open-government advocates had been requesting for more than a decade. The bulk downloads include an XML file for each piece of legislation, with indicators tracking (among other things) committee referrals and actions. Nostalgia: I’m Just A Bill. [h/t Derek Willis] — : March 2, 2016
Links:
Tags: money
The UK government has published data on 27 years of food consumption. The National Food Survey datasets are based on “food diaries” recorded by a sample of British families from 1974 to 2000. In addition to tracking food consumption, the data contains details about each household, including whether they kept vegetarian, had a pregnancy, and/or owned a microwave. [h/t Hannah Brooks + Sebastian Gutierrez] — : March 2, 2016
Links:
Tags: food
Last week, the Department of Homeland Security published more than 250 infrastructure-related datasets, which had previously been marked as "For Official Use Only." The release covers a wide range of topics, including datasets on educational facilities, hurricane evacuation routes, poultry slaughterhouses, and sports venues. (According to that dataset, the Indianapolis Motor Speedway holds more people than any other major sports venue, with a listed capacity of 257,325.) [h/t Michael Keller] — : March 2, 2016
Links:
Tags: infrastructure
Since 1999, Jester has been telling jokes. The website, built by UC Berkeley’s Laboratory for Automation Science and Engineering, asks you to rate its sometimes-humorous offerings, and then uses those answers to guess which of the remaining 100+ jokes you’ll like best. The UC Berkeley team behind the project has released millions of joke ratings from more than 100,000 anonymous users. [h/t Alex Gude] — : February 24, 2016
Links:
Tags: entertainment, language
The CDC publishes a searchable database of its cruise ship sanitation inspections — but doesn’t provide an option to download the data. Last week, an open-data enthusiast scraped the database and posted CSVs of specific deficiencies and overall inspection scores since 1990. The lowest score: The Nippon Maru’s 38 points (out of 100) in 1998. Related: ProPublica’s “Cruise Control,” a searchable database of health and safety reports. [h/t Mike Stucka + Lena Groeger] — : February 24, 2016
Links:
Tags: transportation
The Nuclear Latency Dataset contains “all known uranium enrichment and plutonium reprocessing facilities” built between 1939 and 2012. That amounts to 253 plants around the world, each with information on its construction timeframe, civilian-vs-military purpose, international oversight, and more. [h/t Abraham Epton] — : February 24, 2016
Links:
The Uppsala Conflict Data Program maintains several large, interconnected datasets describing decades of war, genocide, and other armed hostilities. Looking for a slightly less depressing experience? Try the UCDP’s dataset of 216 peace agreements signed between 1975 and 2011. [h/t Tony Gray] — : February 24, 2016
Links:
Tags: conflict
The Supreme Court Database is exactly what it sounds like — and definitively so. The most recent release covers all SCOTUS cases from 1946 through 2014. For each case, the database contains 247 “pieces of information,” including the source of the case, why the court agreed to hear the case, the legal provisions at play, and how each justice voted. — : February 24, 2016
Links:
Tags: law
For 18 years, a trap on the roof of the University of Copenhagen’s Zoological Museum lured moths, butterflies, and beetles to their early deaths. Researchers at the university counted and identified more than 250,000 specimens from 1,500+ species. The most common: Yponomeuta evonymella, a moth species also known as the bird-cherry ermine, which got trapped nearly 40,000 times. — : February 17, 2016
Links:
Tags: animals, statistics
Portable Game Notation, a file format used to describe chess matches, was invented in 1993. Since then, enthusiasts have created PGN files for virtually all top players’ games and every high-level tournament at sites such as PGN Mentor and Chess DB. [h/t Seth Kadish] — : February 17, 2016
Links:
Tags: entertainment, games
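PGN files open with a header of bracketed tag pairs, such as `[Event "..."]`, followed by the move text. As a minimal sketch of reading that header with only the standard library, the function below collects the tag pairs into a dictionary. The sample game and player names are invented, and the sketch does not handle escaped quotes inside values or parse the moves themselves.

```python
import re

# One tag pair per line: [Name "value"]
TAG_PAIR = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

# Invented sample game for illustration.
SAMPLE_PGN = """[Event "Example Open"]
[White "Player, A."]
[Black "Player, B."]
[Result "1-0"]

1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 1-0
"""


def parse_pgn_tags(pgn_text):
    """Return the header tag pairs of a single PGN game as a dict."""
    tags = {}
    for line in pgn_text.splitlines():
        match = TAG_PAIR.match(line.strip())
        if match:
            tags[match.group(1)] = match.group(2)
    return tags


tags = parse_pgn_tags(SAMPLE_PGN)
print(tags)
```

For full move parsing and validation, a dedicated chess library is a better fit than a regular expression.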
In 2011, agriculture occupied about 22% of all land in the contiguous U.S., according to the National Land Cover Database. The NLCD classifies every 30-meter-by-30-meter chunk of land into one of 16 categories, including “woody wetlands,” “cultivated crops,” and “developed” land, at different intensities. (Alaska’s unique landscape has earned it a few additional categories, such as “dwarf scrub.”) The database is presented as raster files, so you’ll need some geospatial software to dig in. [h/t Ryan McNeill] — : February 17, 2016
Links:
Computational linguists at Canada’s National Research Council used Mechanical Turk to crowdsource the emotional associations of 14,182 words. For each word, participants were asked whether it was “positive” and/or “negative”, and whether it was associated with any of eight emotions: anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. The resulting Word-Emotion Association Lexicon was first published in 2010. Of the full lexicon, only two words — “treat” and “feeling” — were associated with all eight emotions. [h/t Bipul Mohanto] — : February 17, 2016
Links:
Tags: language
Every two years since 1991, the CDC has conducted the Youth Risk Behavior Survey, which asks high school students questions about drug use, sex, eating habits, and more. The results are available at the national, state, and district level. Results from the 2015 survey will be published in June, the CDC says. Related: Today’s teens _______ less than you did. — : February 17, 2016
Links:
The Million Song Database contains metadata and “feature analysis” (e.g., loudness, tempo, and “danceability”) for, you guessed it, one thousand-thousand songs. The full dataset occupies hundreds of gigabytes, but you can also download a 1% sample. [h/t Neal Lathia] — : February 10, 2016
Links:
Tags: audio, music, statistics
The Internet Archive’s Political TV Ad Archive uses audio fingerprinting to identify the campaign ads playing in key primary states. You can search the database, watch the ads, and download the data. The data file contains information about each ad’s sponsor, pro/con-ness, TV network, and time of airing. Previously: Political Ad Sleuth, featured Jan. 20. — : February 10, 2016
Links:
Tags: politics
The Organ Procurement and Transplantation Network, a public-private partnership, keeps records of organ donations, transplants, and waiting lists in the United States. The website’s “advanced” data tool lets you generate fairly detailed custom reports. One hitch: The site doesn’t provide an option to download the data. Data Is Plural wrote a small bit of software to fix that. — : February 10, 2016
Links:
Tags: healthcare
iNaturalist is a sort of social network for nature enthusiasts. Users can post photos and descriptions of birds, fish, bugs, and even mold, which experts can then help to identify. In November, the site recorded its two-millionth observation. You can explore the data via API or, with a free account, use the site’s export tool. [h/t Dan Brady] — : February 10, 2016
Links:
Every year, the U.S. Energy Information Administration requires thousands of power plants to report detailed data on fuel consumption and electricity generation. The datasets stretch back more than three decades, to 1989. In 2014, the most recent year available, Arizona’s Palo Verde Nuclear Generating Station generated more electricity — 32 million megawatt hours — than any other power plant in the country. [h/t Marc DaCosta] — : February 10, 2016
Links:
Tags: energy
The Cornell Movie-Dialogs Corpus contains 220,579 “conversational exchanges” between 9,035 characters in 617 movies. Included: “Hello. My name is Inigo Montoya. You killed my father. Prepare to die.” — : February 3, 2016
Links:
Next month marks the five-year anniversary of the Fukushima Daiichi disaster, the worst nuclear accident since Chernobyl. Since shortly after the meltdown, volunteers for Safecast have been collecting radiation measurements in Japan and beyond. The results are available to download or to access via API. — : February 3, 2016
Links:
Fears about the Zika virus — and a possible, but not proven, connection to microcephaly — are growing. Little data on the latest outbreak has been published, but here’s an open guide to what’s available so far, including reported cases of microcephaly in Brazil and the number of suspected Zika samples sent to Colombia’s national institute of health. — : February 3, 2016
Links:
Tags: disease, healthcare
Last month, a group of researchers introduced Pantheon 1.0, “a manually verified dataset of globally famous biographies.” It starts with 11,341 Wikipedia biography pages in 25 languages, and adds birthplace, birthdate, gender, occupations, and page views. You can download the data or explore it online. Baffling factoid: As of May 2013, High School Musical star Corbin Bleu had biographies in more language editions than anyone other than Jesus Christ and Barack Obama. Related: A broader-but-shallower dataset of more than 400,000 influential people on the English-language Wikipedia. [h/t Ben Dilday] — : February 3, 2016
Links:
Tags: social media, technology
The Transportation Security Administration publishes spreadsheets of legal claims against the agency, including the location, circumstances, and outcome of each claim. The most expensive settlement on record appears to involve a vehicle-related personal injury in July 2004, for which the TSA paid $125,000. On the other end of the spectrum: In 2014, a traveler recouped $1.25 for lost food or drink at Hilton Head Island Airport. [h/t Seth Kadish + Lindsey Cook] — : February 3, 2016
Links:
Tags: justice, transportation
The University of Edinburgh hosts an incredibly detailed and deeply documented database of more than 3,000 accused witches in Scotland. The mania reached its quantitative peak in 1662, when, according to the database, 402 people were accused of witchcraft. [h/t Felix Haass] — : January 27, 2016
Links:
Tags: history, statistics
Last year, more than 400,000 federal employees took the Office of Personnel Management’s annual survey, which includes questions about satisfaction, leadership, and work schedules. You can download aggregate and raw results. Important note: The survey is voluntary and non-random. — : January 27, 2016
Links:
Tags: statistics
MovieLens.org is a free, noncommercial movie recommender — sort of like Netflix, minus the ability to watch movies. The service is run by a research lab at the University of Minnesota. The lab publishes several datasets of user ratings and movie info. The largest contains 22 million ratings. Among movies with at least 1,000 ratings, The Shawshank Redemption has received the highest average score (4.44 of 5), while 2007’s Epic Movie has netted the lowest (1.48 of 5). — : January 27, 2016
Links:
Tags: entertainment, movies
Earlier this month, the American Cancer Society launched a new data dashboard. Metrics include estimated new cases, historical survival rates, and more. To download the corresponding spreadsheets, use the “tools” button on each page. [h/t Virginia Hughes] — : January 27, 2016
Links:
NASA collects aviation safety reports from pilots, technicians, flight attendants, and other personnel. The (anonymized) published data contains text narratives, as well as details about flight conditions and other safety factors. (“Ok, I did it; the dumbest thing I have ever done in my entire life,” one confessional begins.) You can search the database but can only download so many records at a time. And you can request the full database from NASA, but you’ll have to wait. An alternative option: There’s a copy from November on the Internet Archive. [h/t Dave Riordan + Julian Simioni] — : January 27, 2016
Links:
Tags: transportation
Last month, Nature Communications published a study of the “long-term neural and physiological phenotyping of a single human.” That human? Study co-author Russell A. Poldrack, “a right-handed Caucasian male, aged 45 years at the onset of the study.” The 18 months of results — tracking brain connections, food consumption, stress levels, and much more — are available to download and explore. [h/t Sune Lehmann] — : January 20, 2016
Links:
In 2013, Stanford University researchers published a paper examining how people’s tastes “change and evolve over time.” They drew, in part, on a dataset containing 13 years of Amazon reviews of gourmet foods. (Note: Not all foods were intended for humans.) The dataset comes in a slightly unconventional format; here’s a Python script to convert it to a TSV file. [h/t Kaggle] — : January 20, 2016
Links:
Tags: food
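The conversion script mentioned above is linked, not reproduced here. As a rough sketch of the idea, the code below assumes each review is stored as a block of `key: value` lines with blank lines between blocks. That layout, the field names, and the sample reviews are all assumptions for illustration, so adjust them to the file you actually download.

```python
import csv
import io

# Invented sample in the assumed "key: value" block layout.
SAMPLE = """product/productId: B000TEST01
review/score: 5.0
review/summary: Great treats

product/productId: B000TEST02
review/score: 2.0
review/summary: Dog hated it
"""


def records_from_text(text):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            if record:
                yield record
                record = {}
            continue
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that do not look like 'key: value'
            record[key] = value
    if record:
        yield record


def to_tsv(text):
    """Convert the block-formatted text into a tab-separated string."""
    records = list(records_from_text(text))
    columns = []
    for rec in records:
        for key in rec:
            if key not in columns:  # keep first-seen column order
                columns.append(key)
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


tsv_lines = to_tsv(SAMPLE).splitlines()
print("\n".join(tsv_lines))
```

For a multi-gigabyte file, the same record generator could read from an open file handle line by line instead of loading the whole text into memory.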
The FCC requires broadcasters to keep records of “all requests for broadcast time made by or on behalf of a candidate for public office.” With the help of volunteers, Political Ad Sleuth gathers those records and enters them into a searchable, downloadable database. Note: Due, in part, to the difficulty of transcribing the (non-standardized) records, the information in the database is incomplete. — : January 20, 2016
Links:
Tags: politics
Slate Magazine’s “The Atlantic Slave Trade in Two Minutes” — recently named a multimedia finalist for the American Society of Magazine Editors’ annual awards — tracks 20,528 transatlantic voyages over 315 years. The information comes via SlaveVoyages.org, which provides searchable, downloadable records of ships’ and captains’ names, regions where slaves were purchased and sent, and more. — : January 20, 2016
Links:
Researchers from Virginia Tech have joined forces with Flint, Mich., residents to sample the city’s lead-tainted water supply. In December, the researchers posted the results of 271 samples, which indicated high levels of lead contamination. The most extreme sample found a lead concentration of 158 parts per billion — 10 times higher than the EPA’s “action level.” Related: The New York Times + The Washington Post have used the data. — : January 20, 2016
Links:
Tags: environment, water
When State Department employees travel on official business abroad, they can get reimbursed — to a point — for lodging, meals, and things such as laundry. The department publishes monthly spreadsheets of the maximum per diems, which vary by location. The highest right now? The Cayman Islands ($735 per day). The lowest? Antarctica ($0/day) and Iraq ($11/day). — : January 13, 2016
Links:
Tags: statistics
Last year, more than 2 million people applied for new Social Security retirement and survivor benefits. When they did, they indicated their preferred language. More than 93% said English, and about 5% of applicants said Spanish — the second most popular choice. Among the 88 other options: 1,616 applicants chose American Sign Language, 32 chose Japanese, nine chose Yiddish, and one chose Swedish. — : January 13, 2016
Links:
Tags: language, statistics
USAID, the Peace Corps, the U.S. African Development Foundation, and other agencies report data on foreign assistance spending to ForeignAssistance.gov. The full dataset includes detailed information for each grant and contract — and comes with a data dictionary. The website also provides a chart of participating agencies, and an interactive map of the data. — : January 13, 2016
Links:
Tags: aid, money, statistics
UPDATE: The site is no longer in beta and can be found at https://foreignassistance.gov/.
What did the world’s political boundaries look like in 1945? The lines between Swedish counties in 1968? The U.S. states in 1865? Thenmap, an open-source API and mapping tool, answers these questions and more. [h/t Carlos Matallín] — : January 13, 2016
Links:
The 2010 Religious Congregations and Membership Study counts, for more than 200 religious groups, the number of congregations and adherents in each U.S. state and county. In total, the study reported more than 344,000 congregations and more than 150 million adherents — nearly half of the 2010 U.S. population. New counts are published every 10 years. [h/t Julia Silge] — : January 13, 2016
Links:
Tags: religion, statistics
Crowdsourced from his 1983 “Motown 25” performance. [h/t Nadja Popovich] — : January 6, 2016
Links:
Tags: entertainment, music
The UN’s refugee agency is keeping track of daily refugee movements through Greece, Macedonia, Serbia, and farther along into Europe. The downloadable data and interactive map cover migrations since October 2015. — : January 6, 2016
Links:
Tags: refugees
The historically opaque New York Police Department has finally started publishing incident-level felony data — something that cities such as Chicago and Boston have done for years. The dataset includes the date, time, and approximate location of each offense. It currently covers the first nine months of 2015 and will (apparently) be updated quarterly. Don’t miss the footnotes in this PDF. Related: Some initial insights. Also related: “Which Cities Share The Most Crime Data?” [h/t Dan Nguyen + Mark Silverberg] — : January 6, 2016
Links:
Tags: crime
This database compares the phonological, grammatical, and lexical properties of hundreds of languages. One dataset looks at languages’ counting systems. (Many use the decimal system, but Yoruba uses the vigesimal system and Danish uses a hybrid.) Others examine the use of tone, how you say “tea”, and whether there are different words for “finger” and “hand”. [h/t Jacqui Maher] — : January 6, 2016
Links:
Tags: language
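To make the decimal-versus-vigesimal distinction concrete, here is a small sketch that breaks a number into base-20 "digits." It only illustrates positional base-20 arithmetic. It is not a model of how Yoruba, Danish, or any other language actually forms its number words.

```python
def to_base(n, base=20):
    """Return the digits of a non-negative integer in the given base,
    most significant first."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return [0]
    digits = []
    while n:
        n, remainder = divmod(n, base)
        digits.append(remainder)
    return digits[::-1]


# 77 is three twenties plus seventeen: 3 * 20 + 17
print(to_base(77))            # [3, 17]
# The same number in decimal is seven tens plus seven
print(to_base(77, base=10))   # [7, 7]
```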
After it became clear that the federal government was doing an awful job of keeping track of how often police kill civilians, two newspapers started counting last year. According to The Guardian’s tally, U.S. police killed 1,136 people in 2015. The Washington Post’s count — which focused on shootings only and didn’t include off-duty officers — recorded 984 deaths. Both organizations provide methodologies and downloadable datasets (including demographic and geographic details): Guardian / WaPo. — : January 6, 2016
Links:
Among them: 37,622 cellphones; 3,604 hats; 1,903 scarves; 1,017 birth certificates; 483 diaries; 115 VHS tapes; 82 violins; 41 GPS navigation systems; and 9 answering machines. At least one of the 2,756 umbrellas is mine. [h/t Mona Chalabi + Allison McCann + Noah Veltman] — : December 30, 2015
Links:
Tags: statistics
The Union of Concerned Scientists’s Satellite Database currently contains 1,305 entries and is updated “roughly quarterly.” The longest-orbiting: AMSAT-OSCAR 7, an amateur radio satellite launched in November 1974. Related: The satellites, visualized. [h/t David Yanofsky] — : December 30, 2015
Links:
Tags: technology
Over the weekend, the Seattle Times and BuzzFeed News published an investigation into Clayton Homes, a company that is owned by Warren Buffett's Berkshire Hathaway and that “has grown to dominate virtually every aspect of America’s mobile-home industry.” The investigation draws on data released through the Home Mortgage Disclosure Act. The law requires large lenders to publish details about each of their loans. You can download the raw data from the FFIEC, or slightly user-friendlier versions from the CFPB. [h/t Mike Baker + Dan Wagner] — : December 30, 2015
Links:
Tags: economics, real estate
Last week, the Centers for Medicare & Medicaid Services published a new drug-spending dataset. It focuses on medications that (a) cost the most, overall; (b) cost the most per patient; or (c) saw the largest price-hike between 2013 and 2014. Vimovo, an arthritis pain reliever, tops the price-hike rankings: Between 2013 and 2014, the average cost per unit increased more than sixfold, from $1.94 to $12.46. [h/t Virginia Hughes] — : December 30, 2015
Links:
Tags: business, drugs, healthcare
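The "more than sixfold" figure for Vimovo can be checked directly from the two per-unit prices quoted above. This short sketch uses only those two numbers and shows both the ratio and the equivalent percentage increase, which are easy to confuse when reporting price changes.

```python
old_price = 1.94   # average cost per unit, 2013 (from the text above)
new_price = 12.46  # average cost per unit, 2014 (from the text above)

# How many times the old price the new price is
ratio = new_price / old_price

# The same change expressed as a percentage increase over the old price
pct_increase = (new_price - old_price) / old_price * 100

print(f"{ratio:.2f}x the 2013 price, a {pct_increase:.0f}% increase")
```

A price that is about 6.4 times its earlier level has risen by roughly 540%, not 640%, because the percentage increase is measured over and above the original price.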
A new study in the American Economic Review suggests that slaveholders in the South underestimated the odds of “emancipation without compensation.” To reach its conclusions, researchers compiled a dataset of 15,377 slave sales, culled from remarkably detailed official records. Data for each sale includes demographic information about the slaves, seller, and buyer; the price paid; payment method; and researcher notes. — : December 30, 2015
Links:
The Unicode Consortium publishes a big ol’ HTML table of every emoji, how they look in various contexts, and when they entered the canon. The “Christmas tree” emoji occupies code point U+1F384, and was introduced in 2010. (“Menorah with nine branches” arrived in 2015.) [h/t Ben Collins] — : December 23, 2015
Links:
Tags: technology
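If you would rather look up a code point yourself than scroll the table, here is a minimal Python sketch using the standard library's unicodedata module. The helper name describe_code_point is hypothetical, and the names it returns come from the Unicode database bundled with your Python version, not from the Consortium's HTML table, so very recent emoji may be missing.

```python
import unicodedata


def describe_code_point(code_point: int) -> str:
    """Return 'U+XXXX NAME' for a code point (hypothetical helper)."""
    char = chr(code_point)
    # Fall back to a placeholder if the bundled database does not know the character.
    name = unicodedata.name(char, "UNKNOWN")
    return f"U+{code_point:04X} {name}"


# The "Christmas tree" code point mentioned in the blurb above.
print(describe_code_point(0x1F384))
```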
The Forest Service has digitized many of the tree species distribution maps from Elbert Little's “Atlas of United States Trees,” first published in the 1970s. Shapefiles and PDFs are available for more than 600 species — including Ilex opaca (American holly) and Pseudotsuga menziesii (Douglas fir). — : December 23, 2015
Links:
Tags: environment, mapping, plants
The Wikimedia Foundation publishes hourly pageview counts for each of its articles. It’s a tremendous amount of data — about 90 megabytes, compressed, per hour. Luckily, there’s also a tool for browsing individual pages’ daily traffic stats. Last Wednesday, the English-language page for "Christmas tree" received 7,822 visits, its highest mark so far this year. — : December 23, 2015
Links:
The USDA’s 2012 Census of Agriculture — the most recent vintage available — tallies agricultural activity at the national, state, and county levels. You can download detailed data from the agency’s Quick Stats tool. In 2012, Oregon harvested more Christmas trees than any other state: 6.8 million of them, or 39% of the census total. [Correction, 2015-12-23: The Oregon numbers incorrectly referenced 2007 data. In 2012, Oregon harvested 6.4 million trees, or 37% of the census total. Thanks to @JoeMurph for flagging this mistake.] — : December 23, 2015
Links:
Tags: agriculture
Every year, the U.S. Consumer Product Safety Commission tracks emergency room visits to approximately 100 hospitals. The commission uses the resulting National Electronic Injury Surveillance System data to estimate national injury statistics, but it also publishes anonymized information for each consumer product–related visit, including the associated product code (e.g., 1701: “Artificial Christmas trees”) and a short narrative (“71 YO WM FRACTURED HIP WHEN GOT DIZZY AND FELL TAKING DOWN CHRISTMAS TREE AT HOME”). — : December 23, 2015
Links:
Tags: healthcare, injury
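The short narratives can be mined for structure. The sketch below pulls an age and a sex code from the start of a narrative. It rests on loud assumptions: the pattern is inferred from the single example quoted above, not from the commission's documentation, and both the regular expression and the function name parse_narrative are hypothetical. Real narratives will likely vary, so treat any non-match as "unknown" rather than as an error.

```python
import re
from typing import Optional

# Assumed layout, inferred from one example: "<age> YO <optional letters><M or F> ..."
NARRATIVE_PREFIX = re.compile(r"^(\d+)\s*YO\s*([A-Z]*?)([MF])\b")


def parse_narrative(narrative: str) -> Optional[dict]:
    """Extract age and sex code from a narrative, or return None if the pattern fails."""
    match = NARRATIVE_PREFIX.match(narrative.strip())
    if match is None:
        return None
    age, _other_letters, sex = match.groups()
    return {"age": int(age), "sex": sex}


example = (
    "71 YO WM FRACTURED HIP WHEN GOT DIZZY AND FELL "
    "TAKING DOWN CHRISTMAS TREE AT HOME"
)
print(parse_narrative(example))
```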
This dataset is fucking amazing. — : December 16, 2015
Links:
You’ve probably heard of PolitiFact, the Tampa Bay Times project that fact-checks what politicians say. What you might not know: PolitiFact has an API. You can use it to fetch detailed data from the project’s national and state-level editions. Related: “All Politicians Lie. Some Lie More Than Others,” PolitiFact’s top editor writes in the New York Times. — : December 16, 2015
Links:
Tags: politics
Last week, USA Today released its annual accounting of assistant — yes, assistant — college football coaches’ salaries. At $1.6 million per annum, Auburn’s Will Muschamp leads the pack. More than 371 assistants have salaries of $250,000+. The release complements the publication’s database of head-coaching salaries. Related: Each state’s highest paid public employee, as of 2013-ish. [h/t Steve Berkowitz] — : December 16, 2015
Links:
The recently-updated Randolph Glacier Inventory contains spreadsheets and outlines of every known glacier in the world. Of the 212,000+ glaciers inventoried, more than 27,000 are in Alaska. Someone please adopt Deserted Glacier. [h/t Robin Wilson’s stunningly extensive directory of free GIS data] — : December 16, 2015
Links:
The Department of Justice is authorized to investigate police departments that display a “pattern or practice” of civil rights violations. In April, the Marshall Project began publishing a spreadsheet of the DOJ investigations into local law enforcement. The dataset, which is updated regularly, indicates when each case began, when it ended, and what type of agreement (if any) was reached. The latest entry: An investigation into the Chicago Police Department, announced last week. Related: PBS Frontline's interactive map of DOJ investigations. [h/t Tom Meagher] — : December 16, 2015
Links:
The Texas Department of Licensing and Regulation maintains a webpage of well-formatted data on state-licensed workers, including tow truck operators, boxing judges, journeyman electricians, elevator inspectors, manicurists, and, yes, barbers. [h/t Ryan Murphy] — : December 9, 2015
Links:
Tags: statistics
The CDC’s Foodborne Outbreak Online Database (FOOD) contains 18,000+ outbreaks, which resulted in 358,000+ illnesses and 13,000+ hospitalizations, from 1998 through last year. In 2008, a multi-state Salmonella Saintpaul outbreak hospitalized 308 people — the highest count in the database. — : December 9, 2015
Links:
Tags: disease, food, healthcare
Gun dealers use the FBI’s National Instant Criminal Background Check System to determine whether someone is allowed to buy a firearm. There isn’t a one-to-one correlation between these background checks and gun sales, but they’re said to be the best available proxy. The FBI publishes a PDF tallying the monthly number of firearm checks for each state and type. At BuzzFeed News, we’ve parsed that PDF into a CSV/spreadsheet for easier use. — : December 9, 2015
Links:
Tags: guns, statistics
Last week, Data Is Plural highlighted ShootingTracker.com, a source for data on shootings that wounded at least four people. Other resources include the Gun Violence Archive and Mother Jones’ detailed database of mass shootings since 1982. The Mother Jones database takes a narrower approach, focusing on shootings that killed at least four people in a public setting. In a New York Times op-ed, published shortly after last week’s San Bernardino shooting, the editor behind that database argues that broader methodologies don’t distinguish between “a 1 a.m. gang fight” and “the madness that just played out in Southern California.” A Washington Post article weighs the pros and cons of broader and narrower approaches. [h/t Robin Shields + Mark Follman + Christopher Ingraham] — : December 9, 2015
Links:
Open Knowledge International has just published its latest survey of openly available government data. This year’s audit includes 112 countries and territories, up from 97 last year. The survey scores each based on the availability of datasets in 13 key categories (e.g., “election results,” “government spending,” and “pollutant emissions”) and links out to the available datasets. In this year’s survey, Taiwan ranks first, the U.K. second, and Denmark third. The U.S. ranks eighth. — : December 9, 2015
Links:
Tags: elections, statistics
The CelebA dataset, published in September, contains 200,000+ images of 10,000+ celebrities, each annotated with 40 yes/no variables. Some favorites: “5_o_Clock_Shadow,” “Bags_Under_Eyes,” and “Goatee.” — : December 2, 2015
Links:
Tags: entertainment
The Huffington Post and Chronicle of Higher Education teamed up to investigate how colleges bankroll their athletics. (Georgia State, for example, spent more than $100 million subsidizing sports between 2010 and 2014, mostly via student fees.) The report, published last week, draws on five years of revenue/expense reports from 234 Division I public universities. You can download the raw data or explore it online. Related: The Washington Post also tackled this topic — from a slightly different angle — last week, examining the profitability (or lack thereof) of athletic programs at 48 schools. [h/t Shane Shifflett] — : December 2, 2015
Links:
Socrata’s software powers open-data portals around the world. But downloading large datasets — e.g., this 2.8-gigabyte dataset of NYC parking tickets — from Socrata-powered portals can feel, well, sluggish. One solution: OpenDataCache.com, a free website that provides faster-to-download versions of virtually every dataset from 50+ Socrata portals. Related: Thomas Levine’s detailed analyses of Socrata-powered portals, published in 2013 and 2014. [h/t John Krauss and Steven Romalewski] — : December 2, 2015
Links:
Tags: statistics
ShootingTracker.com provides datasets listing all U.S. mass shootings — defined as “when four or more people are shot in an event, or related series of events” — since 2013. So far in 2015, mass shootings have killed 447 people and wounded an additional 1,292. — : December 2, 2015
Links:
The National Centers for Environmental Information maintains more than 20 petabytes of data, it says. Among the most useful slices is the Global Historical Climatology Network’s data, which aggregates reports on temperature, precipitation, wind, and more from tens of thousands of climate-monitoring stations around the world. One tidbit: January 1995 was Death Valley’s wettest month since at least the 1960s, with a whopping 2.59 inches of precipitation. — : December 2, 2015
Links:
In 2011, the New York Public Library launched a crowdsourcing project to transcribe its massive collection of restaurant menus, dating back to the 1850s. So far, volunteers have transcribed more than 1.3 million dishes, their prices, and where on the menu each dish appeared. The library publishes a spreadsheet of all the data, and updates it twice a month. Happy Thanksgiving! — : November 25, 2015
Links:
Tags: food
The U.S. government has one very large Google Analytics account, and has begun sharing traffic data with the public. Not every federal website is accounted for, but more than 4,000 are. Over the past 90 days, they’ve racked up approximately 1.5 billion visits. The most popular page at the time of this writing? Weather.gov. Bonus: How they built it. [h/t Rebecca Williams] — : November 25, 2015
Links:
Tags: statistics, technology
You can download every comment posted to Reddit since October 2007 … but you’ll need some patience and a terabyte of storage. If you’re more of the instant-gratification, don’t-have-an-external-hard-drive-lying-around type, you might enjoy FiveThirtyEight’s “How The Internet* Talks,” a sort of Google Ngrams for the Reddit data. [h/t Randall Olson and Ritchie King] — : November 25, 2015
Links:
The Department of State publishes demographic reports on refugee arrivals since 2002. The data includes country of origin, resettlement city and state, religion, age, gender, and more. Related: At BuzzFeed, I used the data to chart the past decade of refugee arrivals. Also related: The UN’s refugee data portal. — : November 25, 2015
Links:
Tags: refugees, statistics
The newly-launched Citizens Police Data Project has collected more than 56,000 allegations of police misconduct. The data, covering 2002-2008 and 2011-2015, includes demographic information about the complainant and the officer, as well as the type and location of the incident. Click here to download the raw data. Related: The City of Chicago’s wide-ranging data portal includes a spreadsheet of every reported crime in the city since 2001; you can explore neighborhood trends via the Chicago Tribune. [h/t Melissa Segura and Abraham Epton] — : November 25, 2015
Links:
What contains 34,052 bottles and is worth an estimated £3 million? The United Kingdom’s official wine cellar, which provides libations for the government’s guests and hosts — and a dram of data for the public. Between April 2014 and March 2015, the cellar’s clients consumed more than 5,500 bottles of wine and liquor. Among them: 205 bottles of Champagne, 51-and-a-half bottles of gin, and one bottle of Château Pichon-Longueville Comtesse de Lalande 1986. [h/t Nadja Popovich] — : November 18, 2015
Links:
Tags: alcohol
Under the HITECH Act of 2009, companies must notify the government of any data breach involving the HIPAA-protected health data of 500 or more people. Summaries of those reports are available at the Department of Health and Human Services’s Breach Portal, which currently contains more than 1,300 incidents. Related: In April, JAMA published an analysis of the breaches. Also related: Forty years of legislative acronyms. [h/t Virginia Hughes] — : November 18, 2015
Links:
Tags: healthcare
The National Registry of Exonerations contains “every known exoneration in the United States since 1989—cases in which a person was wrongly convicted of a crime and later cleared of all the charges based on new evidence of innocence.” For each of the 1,702 cases, the registry includes details about the exoneree, the crime, and the factors — such as new DNA evidence — that contributed to the exoneration. [h/t agate] — : November 18, 2015
Links:
The 2016 presidential hopefuls have been tweeting, ‘gramming, and ‘booking like a pack of millennials. Fusion collected nearly 70,000 images from the candidates’ social media accounts, then pumped the pictures through an automated tagging system. Now you can search for guns, money, beer and more — or download the raw data for your own analysis. — : November 18, 2015
Links:
Tags: politics, social media
The Arms Transfer Database tracks the international flow of major weapons — artillery, missiles, military aircraft, tanks, and the like. Maintained by the Stockholm International Peace Research Institute (SIPRI), the database contains documented sales since 1950 and is updated annually. SIPRI provides a download tool, which outputs rich-text files, but it’s also possible to download the data as CSV. [h/t Martín González] — : November 18, 2015
Links:
For his 1898 book, The Law of Small Numbers, statistician Ladislaus Bortkiewicz tabulated the number of Prussian cavalrymen killed by horse kicks each year between 1875 and 1894. (In total, 196 suffered that tragic fate.) The dataset is tiny, but boasts an outsized legacy: Bortkiewicz’s lethal horse kicks allegedly helped to popularize the then-obscure Poisson distribution. [h/t Noah Veltman] — : November 11, 2015
Links:
Tags: history, statistics
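To see why the horse-kick counts became a textbook Poisson example, here is a rough sketch that estimates a rate from the blurb's total and prints the expected number of corps-years with 0, 1, 2, ... deaths. One loud assumption: the blurb does not say how many army corps were tabulated. The figure of 14 below is taken from the commonly cited version of the dataset and should be checked against the data you download.

```python
import math

TOTAL_DEATHS = 196  # from the blurb above
YEARS = 20          # 1875 through 1894, inclusive
CORPS = 14          # ASSUMPTION: not stated in the blurb; verify against the data


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when the average count is `rate`."""
    return rate**k * math.exp(-rate) / math.factorial(k)


observations = YEARS * CORPS        # one observation per corps per year
rate = TOTAL_DEATHS / observations  # average deaths per corps-year

for k in range(5):
    expected = observations * poisson_pmf(k, rate)
    print(f"{k} deaths: about {expected:.1f} corps-years expected")
```

Comparing these expected counts with the observed tallies in the dataset is the usual next step. The sketch stops short of that because the observed tallies are not given in the blurb.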
Earlier this year, the HathiTrust Research Center released a massive dataset extracted from 4.8 million digitized volumes. For each of its 1.8 billion pages, the dataset includes word frequencies, languages used, and sentence counts, among other features. — : November 11, 2015
Links:
Albuquerque, New Mexico, publishes dozens of regularly-updated, well-documented datasets. Among them: government employee earnings, the number of daily visitors to the city’s swimming pools, real-time bus locations, the geography of police beats, and the city’s complete vendor checkbook. [h/t Tom Johnson, who emailed Data Is Plural to praise how Albuquerque is sharing its data: “I have not found any other city in the world doing so in such detail.”] — : November 11, 2015
Links:
Tags: statistics
The Side Effect Resource, a.k.a. SIDER, takes all the fine print from drug labels, and aggregates the information about side effects into a searchable, downloadable database. SIDER got a major upgrade last month, and now contains 40% more drug-effect pairs than before. The website incorporates both generic and brand names, so that searches for “Prozac” and “fluoxetine” bring you to the same page. — : November 11, 2015
Links:
Tags: drugs, healthcare
Good Jobs First’s Violation Tracker calls itself “the first national search engine on corporate misconduct.” The new database currently contains nearly 100,000 penalties for environmental, health, and safety violations — sourced from 13 U.S. regulatory agencies — since 2010. Search results can be downloaded as CSV files, which contain a few additional fields. (Tip: Search for “*” to get all cases.) The largest single fine? The Department of Justice’s $20.8 billion penalty this year against BP. [h/t Samuel Rubenfeld] — : November 11, 2015
Links:
Tags: business, crime, environment
Last May, a Gulfstream G150 taking off from Houston’s Ellington Airport struck an armadillo. The animal’s remains were collected, but were not sent to the Smithsonian Institution for identification. This anecdote comes from a single row in the Federal Aviation Administration’s Wildlife Strike Database, and draws on just seven of the 94 available fields. The database contains more than 168,000 strikes reported since 1990, almost all involving birds. Roughly 10% of the time, the animal's remains are sent to the Smithsonian's Feather Identification Lab. [h/t Dan Vergano] — : November 4, 2015
Links:
Tags: transportation
Trans-New Guinea is the world’s third-largest language family. But it’s also among the poorest-studied. TransNewGuinea.org, an online database launched in 2013, is trying to change that. It now contains more than 1,000 New Guinea languages and lists 145,000 word translations — including 1,065 entries for “dog.” It even has an API. A recent PLOS ONE journal article provides additional background and statistics. [h/t Simon J. Greenhill] — : November 4, 2015
Links:
Tags: language
The Bureau of Alcohol, Tobacco, Firearms, and Explosives publishes a searchable and downloadable licensing database. License-holders fall into eleven categories. Among them: run-of-the-mill dealers, ammunition manufacturers, collectors of “curios and relics,” pawnbrokers, and importers of “destructive devices.” The ATF’s website contains monthly and state-by-state archives. [h/t Marc DaCosta] [Correction, 2015-11-04: There are only nine categories of license-holders. The published ATF data includes only eight of them; it does not include "Collector of Curios and Relics." Thanks to @MikeStucka for flagging this mistake.] — : November 4, 2015
Links:
Tags: guns
This July, the Museum of Modern Art published a dataset containing 120,000 artworks from its catalog, joining the UK’s Tate, the Smithsonian’s Cooper Hewitt, and other forward-thinking museums. The MoMA data contains the names of the artwork and artist, the dates created and acquired, and the medium — but no images. Related: Artist Jer Thorp encourages you to “perform” the data. Also related: Every museum in the United States. [h/t Nadja Popovich] — : November 4, 2015
Links:
Tags: art
The 600+ entries in this searchable, sortable database range from 3M to Amazon to Zynga, and list both paid and unpaid leave. The database, run by the women-in-the-workplace website FairyGodBoss.com, culls from published policies and employee tips. An introductory blog post provides more information. — : November 4, 2015
Links:
Sexualitics.org is on a mission: “to contribute to human sexuality understanding through a Big Data approach.” Last year, the site posted detailed metadata on 800,000 adult videos, including titles, descriptions, view counts, and tags. It powers Porngram, an only-kinda-safe-for-work charting tool. — : October 28, 2015
Links:
Tags: entertainment
Prior to October 15th, the Census Bureau’s USA Trade Online tool cost $300/year. No longer. The newly-free dataset covers more than 17,000 commodities, including a category for “magic tricks, practical joke articles; parts and accessories.” [h/t Noah Veltman] — : October 28, 2015
Links:
Most population numbers tell you where people live. But legions of Americans commute for work across city, county, and state lines. The Census Bureau’s Commuter-Adjusted Daytime Population Data accounts for these daily migrations. Manhattan’s (non-tourist) population doubles from 1.5 million to 3 million, by far the largest influx by raw numbers. But Lake Buena Vista, Fla., takes the percentage-growth prize. The city’s entire resident population could fit in two sedans, but its “daytime population” includes 33,000 workers — including a not-insubstantial number dressed as Mickey Mouse. [h/t Steven Romalewski] — : October 28, 2015
Links:
Tags: transportation
This weekend, the New York Times published a front-page article on “the disproportionate risk of driving while black.” Among other findings: “officers were more likely to conduct [searches] when the driver was black, even though they consistently found drugs, guns or other contraband more often if the driver was white.” The investigation drew on several statewide traffic-stop datasets that track the race and gender of stopped drivers. The “seven states with the most sweeping reporting requirements,” in order of how easy it seems (to me) to get detailed data: Connecticut, North Carolina, Missouri, Nebraska, Maryland, Illinois, and Rhode Island. — : October 28, 2015
Links:
If you can’t beat ‘em, post spreadsheets about ‘em. Earlier this month, the Federal Communications Commission started publishing a dataset of complaints against telemarketers and robocalls. The FCC says the file will be updated weekly. It’s already being put to use: A clever programmer has crammed all the offending numbers into a single phone “contact” so that you can block them all at once. [h/t Shale Craig] — : October 28, 2015
Links:
Tags: statistics
WNYC, through a freedom-of-information request to the New York DMV, obtained a list of vanity plate approvals and denials from late 2010 to late 2014. Among the denials: “RUBMYDUB,” “S5SS5S5S,” “RFLMAO,” and “CBSNEWS.” (Strangely, “NBC4” was approved. Go figure.) The files and related story were published in August, but the data are timeless. [h/t @veltman] — : October 21, 2015
Links:
Tags: statistics
The Wikimedia Foundation has published a dataset enumerating monthly revision counts for every editor, across all of its wikis. The foundation is asking for help investigating a few perplexing trends. For example: Why has the number of “very active editors” — those with 100+ edits per month — increased while the number of merely “active” editors has plateaued? — : October 21, 2015
Links:
The Police Open Data Census, created by Code for America fellows in Indianapolis, is tracking “currently available open datasets about police interactions with citizens in the US,” including officer-involved shootings, use of force, and citizen complaints. The census currently covers 36 police departments. Related: The NYPD says it will start tracking all officer use-of-force incidents — not just gunfire — next year, the New York Times reports. — : October 21, 2015
Links:
The Hechinger Report casts doubt on the Pell grant graduation numbers contained in the Department of Education’s recently-released College Scorecard. Why the discrepancy? “[W]hile schools are required by law to provide the graduation rates of Pell recipients to any applicants who ask, a loophole protects them from having to report the same figures to the government.” Oof. — : October 21, 2015
Links:
Sometimes, bureaucracy creates poetry. Since 1890, the U.S. Board on Geographic Names has been cataloguing, standardizing, and promulgating official names for the places we hike, swim, work, and call home. Along the way, it began publishing Geographic Names Information System (GNIS), a searchable and downloadable database containing all of its domestic nomenclature. In Alaska alone, the database lists names for 167 dams, 303 post offices, 666 glaciers, 2,704 capes, and 9,575 streams. My favorite: Confusion Creek. [h/t @emilymbadger] — : October 21, 2015
Links:
Tags: mapping