Through multiple sampling methods and multiple ways of estimating race and gender, we show that tech journalism is almost certainly less racially diverse than tech itself.

We’ve open sourced our analysis and code below, with instructions for rebuilding the main dataset (estimated diversity data on 982 tech journalists) and a secondary curated dataset that we use to check our results. All data used in this analysis was public on Twitter.

We’ve worked hard to avoid errors, but if you find any please report them using this Google Form. With that said, there’s only so much we can do from the outside. The ideal would be for every media corporation reporting on tech to follow the leads of Amazon, Apple, Facebook, Google, and Microsoft and publish its own annual diversity report.

TLDR

As of 2020, the percentage of white employees at Facebook is 41%. At Google, 51.7%. At Microsoft, 53.2%. At Amazon, 34.7%. At Apple, 50%. The workforce of the big five tech companies is thus less than 50% white, if we weight each company equally.

Note that a population-weighted average across companies would bring the percent-white number down further, given the size of Amazon. Moreover, while it’s true that Amazon’s workforce includes many blue-collar employees, even if we restrict to white-collar employees only (or just look at the other four big tech companies) the numbers cluster in the same range.
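For reference, the equal-weight average of the five percent-white figures above can be computed directly:

# Equal-weight average across the big five tech companies
mean(c(41, 51.7, 53.2, 34.7, 50))
# [1] 46.12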

For most tech media corporations, we lack diversity reports, so we did two different kinds of analyses (subjective and algorithmic) on two different kinds of datasets (search-based and curated).

The tightness of these different estimates for tech journalism should inspire some confidence in their accuracy. The overall estimate is that tech journalism is 77-84% white.

Thus, tech journalism is considerably less racially diverse than tech. And when we repeat these analyses for gender, we find that tech and tech journalism are very similar.

Background and overview of the methodology

Oo Nwoye (@OoTheNigerian) wrote an initial blog post on this topic in early July 2020. His dataset focused on the diversity of editorial staff and attendees at tech-journalism conferences, but this approach had two limitations. First, conferences are subject to a variety of selection biases. Second, there are only so many people on editorial staff or in the tech-journalism conference circuit.

Here we pursue a different approach, using multiple methods to check the robustness of our results.

Two different datasets: search-based and curated

We use two different sampling methods to make sure our results are not sensitive to how the set of “tech journalists” is defined. This gave us two different datasets.

In the search-based method, we collected lists of individuals on Twitter whose bios indicate current or past work for any one of 13 major tech-journalism outlets (namely Verge, Gigaom, Cnet, Wired, Engadget, The Information, Recode, The Next Web, Venture Beat, TechRadar, TechCrunch, Gizmodo, and Motherboard). We added tech journalists from the New York Times with a slightly different method (detailed below), since the NYT as a whole is not devoted full time to tech coverage. If you want to extend our analysis, this method can be generalized to the Wall Street Journal, the Washington Post, the Economist, and other general-interest publications that include a significant amount of tech coverage.

In the curated method, we gathered a dataset of tech journalists by combining two publicly available, third-party-curated lists (1, 2) called “Tech Journalists.” We had no role in the curation of these lists, so if the results of our diversity analysis are similar when conducted on this sample as well, we should be more confident that our analyses accurately reflect the true state of tech journalism on Twitter.

Two ways of estimating race and gender

Estimating race and gender can be tricky. An algorithmic method may fail to capture important nuances in how race and gender are socially constructed. A method based on subjective human judgment might err by being inconsistent about how race and gender are assigned. To deal with the limitations of each method, we estimate race and gender through both methods. If the results are similar, we should be relatively confident in their accuracy.

Subjective method. Estimating race and gender through subjective human judgment was our first approach. We looked at each Twitter account and inferred the individual’s race and gender holistically, using their profile photo, name, and bio. In cases where either the gender or race was unclear using a subjective approach, we assigned an “NA” for the given variable. Given some controversy over the race of Arab and Persian people (the US census says “white” but many people have contested this), we sidestep the matter by assigning people of Arab and Persian descent “NA” for race. This affected only about 15 people, so neither classification would have a large effect on our key findings.

Algorithmic method. As a second method, we leveraged data from the Census and the Social Security Administration to infer journalists’ race and gender from surnames and first names, respectively.
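As a quick illustration of what this looks like, here is a minimal sketch using the predictrace and gender R packages, which we rely on below; the example names are illustrative, not drawn from our dataset:

library(predictrace)  # infers race from surnames via Census surname tables
library(gender)       # infers gender from first names via SSA baby-name data

predict_race(c("Washington", "Nguyen"))  # most likely race plus per-race probabilities
gender(c("Taylor", "Maria"))             # proportion male/female and a best guess per name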

Note on using Twitter data

A few limitations of using Twitter-based samples to analyze tech journalism are worth noting. First, some tech journalists may not be on Twitter at all. Second, a tech journalist may be on Twitter but not surfaced by our searches. Third, we restricted our search to 13 tech publications and one general-interest publication (the New York Times) with significant tech coverage.

The main impact of these issues is that our sample will not capture every tech journalist on Twitter. Nonetheless, it should capture most of the writers at tech-focused publications, so our results should generalize reasonably well.

Note on inferring race and gender from names

In estimating race from surnames, the most serious problem stems from the history of anti-black racism and slavery in the United States. Because of slavery, black people in America today frequently have European-origin last names (such as Williams or Washington), so inferring race from surnames will generally under-count black people. As you’ll see below, we take some extra steps to mitigate this problem (see comments inside the code snippets), but the results of any automated measurement method will always have some amount of error. A similar issue is present for gender estimation. We believe this problem is offset by using subjective assignment as an independent check.
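To preview the mitigation: the rule in the code below re-labels a surname as black whenever the Census-based probability of being black exceeds 0.4, even when white is nominally the most likely race. A toy example with made-up probabilities:

# Toy example of the reclassification rule (probabilities are made up)
probs <- data.frame(likely_race       = c("white", "white"),
                    probability_black = c(0.10, 0.45))
ifelse(probs$probability_black > 0.4 & probs$likely_race == "white",
       "black", probs$likely_race)
# [1] "white" "black"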

Note on code

The code in this study is research-grade. It works and we’ve double-checked the numbers, but it has a fair bit of copy/paste when it comes to the graphs. To be clear, you should be able to reproduce everything from the code below, but if there’s interest in a cleaned-up version we may provide it. Fill out this form if you’re interested.

Dataset 1: Twitter search for journalists with major tech outlets in their bio

We used the R package rtweet to search for anyone on Twitter who mentions one of 12 journalism-related keywords (reporter, journalist, producer, designer, editor, writer, analyst, current, former, bylines, columnist, freelance) and one of 13 dedicated tech-journalism outlets in their bio (Verge, Gigaom, Cnet, Wired, Engadget, The Information, Recode, The Next Web, Venture Beat, TechRadar, TechCrunch, Gizmodo, and Motherboard). The Twitter API allows for up to 1,000 results per search.

Because the New York Times is not a dedicated tech publication, we used a different method: we first searched Twitter for people matching “tech” and “nytimes” (the NYT’s Twitter username), and then manually added the Twitter usernames of every author listed in the Technology section in the month of July.

## Packages used throughout this analysis
library(rtweet)       # search_users, lookup_users, lists_members
library(dplyr)        # data wrangling
library(tidyr)        # drop_na
library(stringr)      # word(), for splitting first/last names
library(forcats)      # fct_reorder
library(ggplot2)      # graphs
library(predictrace)  # infer race from surnames (Census data)
library(gender)       # infer gender from first names (SSA data)

tech.journalists <- function() {
    ## Search for accounts mentioning tech-journalism outlets 
    ##
    ## Note: ReadWrite, ForbesTech, and TechDirt were queried but excluded
    ## because each one had fewer than ten results.

    NN <- 1000
    professions <- c("reporter","journalist","producer", "designer", "editor",
                     "writer", "analyst","current","former","bylines","columnist",
                     "freelance")

    outlets <- c("Verge", "Gigaom", "cnet", "Wired", "Engadget", "The Information",
                 "Recode", "The Next Web", "Venture Beat", "TechRadar", "TechCrunch",
                 "Gizmodo", "Motherboard")

    all.results <- c()
    for(oo in outlets) {
        for(pp in professions) {
            query = paste(pp, oo)
            print(paste("Searching", query))
            results <- search_users(query, n=NN)
            results$outlet = oo
            all.results <- rbind(all.results, results)
        }
    }
    return(all.results)
}


nyt.tech.journalists <- function() {
    print("Searching tech nytimes")
    nyt1 <- search_users("tech nytimes", n = 1000)
    
    ## All the twitter usernames of authors who appeared in the Technology section
    ## of the NYT in the month of July
    nyt2 <- lookup_users(c("sheeraf", "daveyalba", "ceciliakang", "jacknicas", "dmccabe",
                           "ShiraOvide", "brooksbarnesNYT", "nicsperling", "noamscheiber",
                           "TaylorLorenz", "hudidi1", "ellenrosen", "antontroian", "jtes",
                           "VVFriedman", "LizziePaton", "kevinroose", "JordanSalama19",
                           "maureendowd", "daiwaka", "smbahr14", "JoannPlockova", "m_delamerced",
                           "eringriffith", "edmundlee", "wendyluwrites", "LoosLips", "SteveLohr",
                           "jdbiersdorfer", "katiehafner", "nealboudette", "choesanghun", "portereduardo",
                           "Aaron_Krolik", "zhonggg", "AnaSwanson", "nathanielpopper", "ericmargolis",
                           "heathertal", "MikeIsaac", "charlie_savage", "katie_thomas", "mega2e",
                           "pnstenquist", "byJenAMiller", "SangerNYT", "jakesNYT", "veronica_penney",
                           "Lollardfish", "satariano", "_StephenCastle", "kchangnyt", "jwherrman",
                           "NYTnickc", "NellieBowles", "teddytinson", "tiffkhsu", "ezra_marc",
                           "Jonesieman", "jmorrisseynyc", "Lattif", "pranshuverma_", "ewong",
                           "bencareynyt", "jonah_kessel", "nicoleperlroth"))
    nyt <- rbind(nyt1, nyt2)
    nyt$outlet <- "New York Times"

    ## Remove duplicates within the NYT data
    nyt <- distinct(nyt, screen_name, outlet, .keep_all=T)
    return(nyt)
}

aggregated.tech.journalists <- function() {

    ## Combine all journalists into one dataframe
    tj <- tech.journalists()
    nytj <- nyt.tech.journalists()
    df <- rbind(tj, nytj)

    ## Restrict to English language accounts
    df <- df %>%
        filter(lang=="en") %>%
        select(screen_name, name, location, description, followers_count, friends_count,
               listed_count, statuses_count, favourites_count, verified,
               profile_expanded_url, outlet) %>%
        distinct()
    return(df)
}

df <- aggregated.tech.journalists()

## Serialize and export, then add race/gender columns based on subjective method.
## We will then complement this with algorithmic methods based on the `predictrace`
## and `gender` R packages.
## write.csv(df, "intermediate-search-based-list-of-tech-journalists-df.csv")

The total number of accounts identified in this fashion was 1865, including some false positives such as generic brand accounts. We therefore manually reviewed the results and narrowed the dataset down, then conducted the subjective assignment of race and gender. We also standardized names and removed emojis in preparation for the next stage of analysis. This process left 982 individuals. There are a few duplicates (people who have written for multiple tech-journalism outlets), which we retain for the outlet-level diversity analysis and remove otherwise.
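Most of that cleaning was done by hand, but the name standardization can be sketched as below; clean_name is a hypothetical helper for illustration, not part of the published pipeline:

## Strip emojis/non-ASCII characters and collapse whitespace so that
## word() can reliably split first and last names later on
clean_name <- function(x) {
  x <- iconv(x, "UTF-8", "ASCII", sub = "")
  trimws(gsub("\\s+", " ", x))
}
## clean_name("Jane   Doe \U0001F680")  # "Jane Doe"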

# Load the manually cleaned spreadsheet
# df<-read.csv("intermediate-search-based-list-of-tech-journalists-df.csv")

# Calculate percent white within each tech journalism outlet
totals.subjective <- df %>%
  drop_na(subjective_race) %>%
  group_by(outlet, subjective_race) %>%
  summarise(n = n(), .groups="drop_last") %>%
  mutate(Percent = n / sum(n)) %>%
  filter(subjective_race=="White")

# Prepare a dataframe to make the main graph
journo.df <- totals.subjective %>%
  select(outlet, Percent) %>%
  mutate(Percent = Percent*100)

# Add tech company workforce stats
outlet <- c("Facebook", "Google", "Microsoft", "Amazon", "Apple")
Percent <- c(41, 51.7, 53.2, 34.7, 50)
tech.df <- data.frame(outlet, Percent, stringsAsFactors = FALSE)

# Wrangle the two together
main.graph.df <- bind_rows(journo.df, tech.df)
main.graph.df$Type <- c(rep("Tech Journalism", nrow(journo.df)),
                        rep("Tech Companies", nrow(tech.df)))

# Create main graph
main.graph.df %>%
  ggplot(aes(x=fct_reorder(outlet, Percent), y=Percent, colour=Type)) +
  geom_point(size=4) +
  coord_flip() +
  theme_bw() +
  labs(x="", y="Percent White",
       title="Tech Journalism Is Less Diverse Than Tech",
       subtitle = "Percent white, tech journalists vs. tech companies",
       caption= "TechJournalismIsLessDiverseThanTech.com") +
       theme(legend.position="bottom") +
       theme(legend.title=element_blank()) +
  scale_y_continuous(breaks = seq(0, 100, by = 10), limits=(c(0,100))) +
  scale_colour_manual(values = c("#00AFBB","#FC4E07"))

Next, let’s consider gender diversity in tech journalism relative to gender diversity in the big five tech companies. Facebook reports a workforce that’s [63% male](https://diversity.fb.com). Google is 68% male. Microsoft is 72% male. Amazon is 57% male. Apple is 67% male.

gender.graph.df<-df %>%
  drop_na(subjective_gender) %>%
  group_by(outlet, subjective_gender) %>%
  summarise(n = n(), .groups="drop_last") %>%
  mutate(Percent = n / sum(n)) %>%
  filter(subjective_gender=="Male") %>%
  select(outlet, Percent) %>%
  mutate(Percent = Percent*100)

# Add tech company workforce stats
outlet <- c("Facebook", "Google", "Microsoft", "Amazon", "Apple")
Percent <- c(63, 68, 72, 57, 67)
tech.df <- data.frame(outlet, Percent, stringsAsFactors = FALSE)

# Wrangle the two together
gender.graph.df<-bind_rows(gender.graph.df, tech.df)
gender.graph.df$Type <- c(rep("Tech Journalism", nrow(gender.graph.df) - nrow(tech.df)),
                          rep("Tech Companies", nrow(tech.df)))

# Create gender comparison graph
gender.graph.df %>%
  ggplot(aes(x=fct_reorder(outlet, Percent), y=Percent, colour=Type)) +
  geom_point(size=4) +
  coord_flip() +
  labs(x="", y="Percent Male",
       title="Tech Journalism and Tech Companies Are Similar on Gender",
       subtitle = "Percent male, tech journalists vs. tech companies",
        caption= "TechJournalismIsLessDiverseThanTech.com") +
       theme_bw() +
       theme(legend.position="bottom") +
       theme(legend.title=element_blank()) +
  scale_y_continuous(breaks = seq(0, 100, by = 10), limits=(c(0,100))) +
  scale_colour_manual(values = c("#00AFBB","#FC4E07"))

According to our subjective judgment, 654 are likely White, 127 Asian, 23 Black, and 11 Hispanic; the rest were assigned NA for race.

Also according to our subjective judgment, 548 are Male and 372 are Female; the rest were assigned NA for gender.
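For reference, a sketch of how these tallies can be reproduced from the cleaned spreadsheet, assuming one row per person after dropping the outlet-level duplicates:

## Tally the subjective race and gender assignments
df %>% distinct(screen_name, .keep_all = TRUE) %>% count(subjective_race)
df %>% distinct(screen_name, .keep_all = TRUE) %>% count(subjective_gender)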

Next we leverage the predictrace package to infer race from surnames, based on the co-occurrence of surnames and self-reported races in the Census datasets.

# Use names and publicly available datasets to infer race and gender algorithmically

# Extract surname from the full name
df$lastnames <- word(df$name,-1)
# Predict race from surname
race.profiles.df <- predict_race(df$lastnames)

# To partially mitigate the problem of under-identifying black people,
# any surname with a probability of being black greater than .4 will
# be classified as black. This will increase the error rate of
# classifying white people as black, but decrease the error rate of
# classifying black people as white.
race.profiles.df$likely_race<-ifelse(race.profiles.df$probability_black>.4 &
                                     race.profiles.df$likely_race=="white",
                                     "black",
                                     race.profiles.df$likely_race)

# We think there's a bug in the predictrace package which fails
# to infer Hispanic race even when it identifies a very high probability
# of being Hispanic. We correct this manually by inferring
# Hispanic race for any probability greater than .5.
race.profiles.df$likely_race<-ifelse(race.profiles.df$probability_hispanic>.5,
                                     "hispanic",
                                     race.profiles.df$likely_race)

# Attach the inferred race variable to the Twitter profiles
df$race <- race.profiles.df$likely_race

# Extract first name from the full name
df$firstnames <- word(df$name,1)
# Predict gender from first name
gender.profiles.df <- gender(df$firstnames)
# Rename the inferred gender variable
gender.profiles.df$firstnames <- gender.profiles.df$name
# Merge gender variables with Twitter profiles
df <- merge(df, select(gender.profiles.df, gender, firstnames), by="firstnames", all.x=T)
# Remove duplicates created in merge
df <- distinct(df)
# Remove helper columns
df <- select(df, -c(firstnames, lastnames))

df$race<-recode(df$race, asian = "Asian",
                        black = "Black",
                        white = "White",
                        hispanic = "Hispanic",
                        american_indian = "American Indian")
totals.algorithmic<-df %>%
  drop_na(race) %>%
  group_by(outlet, race) %>%
  summarise(n = n(), .groups="drop_last") %>%
  mutate(Percent = n / sum(n)) %>%
  filter(race=="White")

df %>%
  drop_na(race) %>%
  group_by(outlet, race) %>%
  summarise(n = n(), .groups="drop_last") %>%
  mutate(Percent = n / sum(n)) %>%
  filter(race=="White") %>%
  ggplot(aes(x=fct_reorder(outlet, Percent), y=Percent)) +
  geom_point(color = "#9124b5") +
  coord_flip() +
  theme_bw() +
  labs(x="", y="Percent White",
       title="Racial Diversity Across 13 Major Tech-Journalism Outlets",
       subtitle = "Race inferred from surnames using Census Data.",
       caption= "TechJournalismIsLessDiverseThanTech.com") +
  scale_y_continuous(breaks = seq(0, 1, by = 0.10), limits=(c(0,1)))

According to this algorithmic method, 471 are likely White, 67 Asian, 25 Black, 30 Hispanic, and 1 American Indian. (We did not code for American Indian when we did the subjective assessment.) The rest could not be algorithmically classified. At about 77%, the White percentage is still very close to our estimates based on subjective assessment.

df %>%
drop_na(race) %>%
ggplot(aes(x = race)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "#9124b5") +
  theme_bw() +
  labs(title="Racial Diversity Among Tech Journalists on Twitter",
       subtitle="Race inferred from surname using Census data.",
      x="",
       y="Proportion",
      caption="TechJournalismIsLessDiverseThanTech.com") +
  coord_flip() +
  scale_y_continuous(breaks = seq(0, 1, by = 0.10), limits=(c(0,1)))

Below is the distribution of race according to our subjective assessments, aggregated over outlets.

df %>%
drop_na(subjective_race) %>%
ggplot(aes(x = subjective_race)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "#9124b5") +
  theme_bw() +
  labs(title="Racial Diversity Among Tech Journalists on Twitter",
       subtitle="Race inferred by subjective method.",
      x="",
       y="Proportion",
      caption="TechJournalismIsLessDiverseThanTech.com") +
  coord_flip() +
  scale_y_continuous(breaks = seq(0, 1, by = 0.10), limits=(c(0,1)))

And then we leverage the gender package to infer gender from first names, based on the co-occurrence of baby names and gender in the Social Security Administration datasets.

# Graph of gender diversity in Sample 1
df %>%
drop_na(gender) %>%
ggplot(aes(x = gender)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "#9124b5") +
  theme_bw() +
  labs(title="Gender Diversity Among Tech Journalists on Twitter",
       subtitle="Gender inferred from first name using SSA data.",
      x="",
       y="Proportion",
      caption="TechJournalismIsLessDiverseThanTech.com") +
  coord_flip() +
  scale_y_continuous(breaks = seq(0, 1, by = 0.10), limits=(c(0,1)))

df %>%
drop_na(gender, race) %>%
ggplot(aes(x = race)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "#9124b5") +
  theme_bw() +
  labs(title="Racial and Gender Diversity Among Tech Journalists on Twitter",
       subtitle="Race and gender inferred from names using government data.",
      x="",
       y="Proportion",
      caption="TechJournalismIsLessDiverseThanTech.com") +
  coord_flip() + facet_wrap(.~gender) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.10), limits=(c(0,1)))

df %>%
drop_na(subjective_gender) %>%
ggplot(aes(x = subjective_gender)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "#9124b5") +
  theme_bw() +
  labs(title="Gender Diversity Among Tech Journalists on Twitter",
       subtitle="Gender inferred by subjective method.",
      x="",
       y="Proportion",
      caption="TechJournalismIsLessDiverseThanTech.com") +
  coord_flip() + 
  scale_y_continuous(breaks = seq(0, 1, by = 0.10), limits=(c(0,1)))

df %>%
drop_na(subjective_gender, subjective_race) %>%
ggplot(aes(x = subjective_race)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "#9124b5") +
  theme_bw() +
  labs(title="Racial and Gender Diversity Among Tech Journalists on Twitter",
       subtitle="Race and gender inferred by subjective method..",
      x="",
       y="Proportion",
      caption="TechJournalismIsLessDiverseThanTech.com") +
  coord_flip() + facet_wrap(.~subjective_gender) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.10), limits=(c(0,1)))

df %>%
  drop_na(gender) %>%
  group_by(outlet, gender) %>%
  summarise(n = n(), .groups="drop_last") %>%
  mutate(Percent = n / sum(n)) %>%
  filter(gender=="Male") %>%
  ggplot(aes(x=fct_reorder(outlet, Percent), y=Percent)) +
  geom_point(color = "#9124b5") +
  coord_flip() +
  theme_bw() +
  labs(x="", y="Percent Male",
       title="Gender Diversity Across 13 Major Tech-Journalism Outlets",
       subtitle = "Gender inferred by subjective method.",
       caption= "TechJournalismIsLessDiverseThanTech.com") + 
  scale_y_continuous(breaks = seq(0, 1, by = 0.10), limits=(c(0,1)))

Dataset 2: Third-party curated lists of tech journalists

We found two Twitter lists entitled “Tech Journalists” (1, 2). We take these at face value and combine all the listed Twitter accounts into a new sample of tech journalists. Then we re-run the algorithmic method above on this new sample. The initial number of journalists in Sample 2 is 424; after removing duplicates the number is 360.

The basic counts are: 179 likely to be White, 17 likely to be Asian, 9 likely to be Black, and 8 likely to be Hispanic. The rest could not be classified.

The result, 84% white, is again very close to our previous estimates.
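As an arithmetic check on that figure, take the White share among the classified accounts:

# 179 White out of 213 classified accounts in Sample 2
179 / (179 + 17 + 9 + 8)
# [1] 0.8403756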

# Import the accounts identified by each list
journo.list1<-lists_members(96162061)
journo.list2<-lists_members(53301954)

# Bind them into one dataframe
df2<-rbind(journo.list1, journo.list2)

# Extract surname and predict race
df2$lastnames<-word(df2$name,-1)
race.lists.df<-predict_race(df2$lastnames)

# To partially mitigate the problem of under-identifying black people,
# any surname with a probability of being black greater than .4 will
# be classified as black. This will increase the error rate of
# classifying white people as black, but decrease the error rate of
# classifying black people as white.
race.lists.df$likely_race<-ifelse(race.lists.df$probability_black>.4 &
         race.lists.df$likely_race=="white",
       "black",
       race.lists.df$likely_race)

# We think there's a bug in the predictrace package which fails
# to infer Hispanic race even when it identifies a very high probability
# of being Hispanic. We mitigate this manually by inferring
# Hispanic race for any probability greater than .5.
race.lists.df$likely_race<-ifelse(race.lists.df$probability_hispanic>.5,
       "hispanic",
       race.lists.df$likely_race)

df2$race<-race.lists.df$likely_race

# Extract first name and predict gender
df2$firstnames<-word(df2$name,1)
gender.lists.df<-gender(df2$firstnames)
gender.lists.df$firstnames<-gender.lists.df$name

# Merge gender predictions and remove duplicates (note: merge() defaults to
# an inner join, so names absent from the SSA data drop out here)
df2<-merge(df2, gender.lists.df, by="firstnames")
df2<-distinct(df2)

# Fix capitalization
df2$race<-recode(df2$race, asian = "Asian",
                        black = "Black",
                        white = "White",
                        hispanic = "Hispanic")

# Total percentages from lists
totals.from.lists<-df2 %>%
  drop_na(race) %>%
  group_by(race) %>%
  summarise(n = n(), .groups="drop_last") %>%
  mutate(Percent = n / sum(n)) %>%
  filter(race=="White")

df2 %>%
drop_na(race) %>%
ggplot(aes(x = race)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "#9124b5") +
  theme_bw() +
  labs(title="Racial Diversity Among Tech Journalists on Twitter",
       subtitle="Sample 2: Lists of tech journalists on Twitter, curated by third parties",
      x="Most likely race",
       y="Proportion",
      caption= "TechJournalismIsLessDiverseThanTech.com") +
  coord_flip()

Using the third-party Twitter lists of tech journalists and inferring race from surnames suggests that tech journalists are 84% white.

df2 %>%
drop_na(gender) %>%
ggplot(aes(x = gender)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "#9124b5") +
  theme_bw() +
  labs(title="Gender Diversity Among Tech Journalists on Twitter",
       subtitle="Sample 2: Lists of tech journalists on Twitter, curated by third parties",
      x="Most likely gender",
       y="Proportion",
      caption="Gender is inferred from Social Security Administration data. \n TechJournalismIsLessDiverseThanTech.com") +
  coord_flip() + 
  scale_y_continuous(breaks = seq(0, 1, by = 0.10), limits=(c(0,1)))

df2 %>%
drop_na(gender, race) %>%
ggplot(aes(x = race)) +  
  geom_bar(aes(y = (..count..)/sum(..count..)), fill = "#9124b5") +
  theme_bw() +
  labs(title="Racial and Gender Diversity Among Tech Journalists on Twitter",
       subtitle="Sample 2: Lists of tech journalists on Twitter, curated by third parties",
      x="Most likely race",
       y="Proportion",
      caption="Race is inferred from surname using Census data; gender from \n first name using Social Security Administration baby names. \n TechJournalismIsLessDiverseThanTech.com") +
  coord_flip() + facet_wrap(.~gender) + 
  scale_y_continuous(breaks = seq(0, 1, by = 0.10), limits=(c(0,1)))

Conclusion

The similar distributions across two different datasets (search-based and curated) and two different methods for estimating race and gender (algorithmic and subjective) suggest the results are likely an accurate portrait of race and gender diversity for tech journalists on Twitter.

If you spot any errors, please let us know here.

As noted at the outset, however, there’s only so much we can do from the outside. The ideal would be for every media corporation reporting on tech to follow the leads of Amazon, Apple, Facebook, Google, and Microsoft and publish its own annual diversity report.