Data Sources and Methods for Regional Profiles

In this appendix, we include a series of tables that offer a portrait of the demographic and economic characteristics of each region that served as a case study for this book. Unless otherwise noted, the various references to regional data in the text can be found in these tables. We may sometimes make reference in the text to values of the same variables found below summarized by census region, across the largest 192 metropolitan areas (based on 2000 population), or across the nation as a whole. However, to conserve space, those summary tables are not included; they will be made available upon request. In what follows, we describe the data and methodology used to generate these informative tables, and then present the tables themselves.

We drew on a wide variety of data sources to create the tables; broad and detailed sources are summarized in table B.1. As noted in the text, each region corresponds to a metropolitan area under the Office of Management and Budget’s (OMB’s) December 2003 Core Based Statistical Area (CBSA) definitions. The one exception is the Raleigh-Durham region, which encompasses both the Raleigh and Durham metropolitan areas (two separate CBSAs).

One key aspect of the data presented here is that it is geographically consistent over time. Given that metropolitan boundaries can shift from one decade to the next (and even in between), data collected for the same metropolitan area in various censuses does not necessarily represent the same geographic coverage. To make the data consistent, much of it (particularly for years prior to 2003) had to be collected at more detailed levels of geography and summarized according to the December 2003 CBSA boundaries.

Fortunately for us, CBSAs are always equivalent to a single county or a grouping of counties, and given that county boundaries are far more stable, much of our data assembly boiled down to the aggregation of county data (if it was not already available at the December 2003 CBSA level, or equivalent). However, not all the data we were interested in was available at the county level, so we drew several measures from two unique datasets that deserve further explanation.


The first is the Building Resilient Regions (BRR) database, a project of the Building Resilient Regions Network, funded by the John D. and Catherine T. MacArthur Foundation (Pastor et al. 2012). We were key contributors to the development of this dataset, which was assembled to support the research of the BRR network and others. It consists of hundreds of variables covering 361 metropolitan and 567 micropolitan CBSAs in the United States. In addition to using a uniform December 2003 CBSA geography (as does all the data presented in our tables here), most variables are available separately for combined principal cities and suburbs of each CBSA, based on the aggregation of census tract–level data from various years. The principal city definitions are also based on the OMB’s standards, and include the largest city in each CBSA, plus additional cities that meet specific population-size and employment requirements, while the suburbs include the remainder of the CBSA. The data inputs into the BRR dataset are multiple and have been explained elsewhere (see e.g. the description in Pastor, Lester, and Scoggins 2009).

The second dataset deserving of further explanation—which we will refer to simply as the IPUMS-based dataset—was created using microdata samples (i.e. “individual-level” data) from the Integrated Public Use Microdata Series (IPUMS) for four points in time: 1980, 1990, 2000, and a 2012 five-year file, which includes data from 2008 through 2012 pooled together (Ruggles et al. 2010). The 1980 through 2000 files are based on the decennial censuses and cover about 5 percent of the US population each. More recent microdata files are based on the American Community Survey (ACS), however, and only cover about 1 percent of the US population each. Thus, we chose to use the 2012 five-year ACS file to improve statistical reliability (achieving a sample size that is comparable to that available in previous years) and because the central year of the sample is 2010, which is consistent with the last year reported for most other measures in our tables.

Compared with the more commonly used census “summary files,” which include a limited set of summary tabulations of population and housing characteristics, the microdata samples provide the flexibility to create more detailed tabulations. To avoid reporting highly unreliable estimates (which, in this data exercise, really only applies to the unemployment rates by race/ethnicity), we do not report any estimates from the IPUMS microdata that are based on a universe of fewer than 100 individual survey respondents.

A key limitation of the IPUMS microdata is geographic detail. Each year of the data has a particular “lowest level” of geography associated with the individuals included, known as the Public Use Microdata Area (PUMA), or the County Group in the 1980 data. The major challenge for our purposes was that PUMAs do not neatly align with the boundaries of metropolitan areas. While several PUMAs are often entirely contained within the core of a metropolitan area, a few more peripheral PUMAs can straddle the metropolitan area boundary. Moreover, while the same PUMAs were used for both the 2000 and 2008–2011 microdata, the 1980, 1990, and 2012 microdata each have their own distinct PUMA geographies.

To summarize measures at the regional level, we had to first create a set of geographic crosswalks between the PUMAs and each region for each year of microdata, down-weighting appropriately when PUMAs extended beyond the regional boundary. To do this we estimated the share of each PUMA’s population that fell inside each region using population data for each year from GeoLytics, at the 2010 census-block-group level of geography (2010 population information was used for the 2008–2012 geographic crosswalks). If the share was at least 50 percent, then the PUMAs were assigned to the region and included when generating our regional summary measures. For most PUMAs assigned to the region, the share was 100 percent. For some PUMAs, however, the share was somewhere between 50 and 100 percent, and this share was used to adjust the survey weights downward for individuals included in such PUMAs when estimating regional summary measures.
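The crosswalk logic described here can be illustrated with a minimal sketch. The input layout (one tuple per block group) and the function names are hypothetical, chosen only for illustration; the 50 percent assignment threshold and the downward weight adjustment follow the description in the text.

```python
def puma_region_shares(block_groups):
    """Estimate the share of each PUMA's population that lies inside the region.

    block_groups: iterable of (puma_id, inside_region, population) tuples,
    one per block group (a hypothetical input layout).
    """
    inside, total = {}, {}
    for puma_id, inside_region, population in block_groups:
        total[puma_id] = total.get(puma_id, 0) + population
        if inside_region:
            inside[puma_id] = inside.get(puma_id, 0) + population
    # Skip PUMAs with no population to avoid dividing by zero.
    return {p: inside.get(p, 0) / total[p] for p in total if total[p] > 0}


def regional_weight(survey_weight, share, threshold=0.5):
    """Scale a respondent's survey weight by the PUMA's in-region share.

    PUMAs whose share is below the threshold are not assigned to the region,
    so their respondents receive a regional weight of zero.
    """
    if share < threshold:
        return 0.0
    return survey_weight * share


if __name__ == "__main__":
    block_groups = [
        ("A", True, 1000), ("A", True, 500),   # entirely inside the region
        ("B", True, 600), ("B", False, 400),   # straddles the boundary
        ("C", True, 100), ("C", False, 900),   # mostly outside the region
    ]
    shares = puma_region_shares(block_groups)
    for puma_id in sorted(shares):
        print(puma_id, shares[puma_id], regional_weight(20, shares[puma_id]))
```

In this toy example, PUMA A keeps its full weights, PUMA B's weights are scaled by its in-region share, and PUMA C is excluded from the regional summary.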

For the remainder of the data sources used, geographic aggregation was more straightforward, involving simply the aggregation of data across the counties included in each region. However, many of the variables themselves require further explanation of the specific data sources and/or methods used to generate them. Below, we walk through the data sources by variable category (e.g. demography and immigration, regional economy) and provide documentation as necessary on a variable-by-variable basis.


Beginning with demography and immigration, data on regional population and net population growth (and by principal cities/suburbs) is from the BRR database and the US Census Bureau. The BRR database was used to get the population of principal cities and suburbs (adjusted to be consistent with official figures from the decennial census of each year). However, because the latest data point in the BRR database is a 2005–2009 average, data from the 2010 Census (SF1) was used to fill in information for 2010.

The percentage of the population by race/ethnicity, and net population growth attributable to people of color, are based on the decennial census for each year. Racial/ethnic categories are based on individual responses to two questions: one on race and the other on Hispanic or Latino origin. All racial groups (whites, Blacks, Asian and Pacific Islanders, Native Americans, and Others) are non-Hispanic; that is, they include individuals who identify with the respective racial group alone and who do not identify as Hispanic or Latino (Other includes those identifying with a single other race not listed, or as multiracial). All persons identifying as Hispanic or Latino are treated as a separate racial group. The term people of color refers to everyone who does not identify as white, and net population growth attributable to people of color is figured as the net change in people of color divided by the net change in the total population over the past decade. Unless otherwise noted, these racial classifications apply to all other data reported in the tables. The percentages foreign born (and by citizenship) are from the IPUMS-based dataset.
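As a small worked illustration of the growth-share calculation, the sketch below computes the net change in people of color divided by the net change in total population. The function name and the choice to return None when total population is unchanged are assumptions made for the example.

```python
def poc_share_of_growth(total_start, total_end, white_start, white_end):
    """Net change in people of color divided by net change in total population.

    People of color are counted as total population minus (non-Hispanic) whites.
    Returns None when total population did not change, since the ratio is
    undefined in that case (an assumption for this sketch).
    """
    poc_change = (total_end - white_end) - (total_start - white_start)
    total_change = total_end - total_start
    if total_change == 0:
        return None
    return poc_change / total_change


if __name__ == "__main__":
    # Total population grows by 200; people of color grow from 300 to 450.
    print(poc_share_of_growth(1000, 1200, 700, 750))
```

Note that the ratio can exceed 1 when the white population declines while the total grows.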


Total jobs, average annual earnings per job, GDP per job, and the ratio of GDP per job to earnings per job are from the US Bureau of Economic Analysis (BEA). However, because the BEA does not provide county-level GDP information, and its metro area–level estimates only go back to 2001 (not to mention being based on February 2013 CBSAs), we generated our own county-level GDP estimates, which were then aggregated to December 2003 CBSAs. To do this, we relied on the BEA’s state-level GDP estimates, which are available all the way back to 1963. We first made slight adjustments to make the series consistent before and after 1997, when the BEA shifted from the Standard Industrial Classification (SIC) to the North American Industry Classification System (NAICS), and then allocated GDP to the counties in each state in proportion to the total earnings of employees in each county. Finally, we adjusted the resulting county-level estimates to be consistent with the BEA’s reported metro-area estimates for 2001 and later, and adjusted all estimates prior to 2001 to ensure a smooth transition at the metro level between 2000 and 2001. Unemployment rates (and by race/ethnicity) are from the IPUMS-based dataset.
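The proportional allocation step can be sketched as follows. This covers only the allocation of state GDP to counties by earnings share and the aggregation to a region; the SIC/NAICS adjustment and the benchmarking to BEA metro estimates are not shown. All names and numbers are hypothetical.

```python
def allocate_state_gdp(state_gdp, county_earnings):
    """Split state GDP across counties in proportion to total employee earnings.

    county_earnings: dict mapping county id -> total earnings in that county.
    """
    total_earnings = sum(county_earnings.values())
    if total_earnings <= 0:
        raise ValueError("total earnings must be positive")
    return {
        county: state_gdp * earnings / total_earnings
        for county, earnings in county_earnings.items()
    }


def aggregate_to_region(county_gdp, region_counties):
    """Sum county-level GDP estimates over the counties that make up a region."""
    return sum(county_gdp[county] for county in region_counties)


if __name__ == "__main__":
    county_gdp = allocate_state_gdp(1000.0, {"a": 30.0, "b": 50.0, "c": 20.0})
    print(county_gdp)
    print(aggregate_to_region(county_gdp, ["a", "b"]))
```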


Information on the poverty rate (both metro-wide and by principal cities/suburbs), the 80/20 household income ratio, and income differentials by race is from the BRR database, with data from the 2010 ACS one-year summary file used to fill in values for 2010. Data from the 2011 ACS three-year summary file was also used, but only to fill in poverty data for a handful of smaller principal cities for which no data was reported in the 2010 one-year summary file.

The Gini coefficient and the percentage of households by income level are from the IPUMS-based dataset. For 2010, however, we use the Gini coefficient reported in the ACS one-year summary file rather than our IPUMS-based estimate (which, as noted above, represents a 2008–2012 average), because the measure was important to our regional selection process and a single-year estimate is available. Our estimates of the Gini coefficient for household income prior to 2010 are based on all households in the microdata samples, following the standard formula with use of trapezoidal integration to estimate the area under the Lorenz curve.
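A minimal sketch of a Gini calculation using trapezoidal integration under the Lorenz curve is given here for illustration. It assumes non-negative household incomes and optional household survey weights; the treatment of negative or top-coded incomes in the actual estimates is not specified in the text and is not handled.

```python
def gini(incomes, weights=None):
    """Gini coefficient via trapezoidal integration under the Lorenz curve.

    incomes: household incomes (assumed non-negative in this sketch).
    weights: optional household weights; defaults to equal weights.
    """
    if weights is None:
        weights = [1.0] * len(incomes)
    # Order households from lowest to highest income.
    pairs = sorted(zip(incomes, weights))
    total_weight = sum(w for _, w in pairs)
    total_income = sum(x * w for x, w in pairs)
    if total_weight <= 0 or total_income <= 0:
        raise ValueError("need positive total weight and total income")

    area = 0.0
    cum_pop = cum_inc = 0.0
    for income, weight in pairs:
        prev_pop, prev_inc = cum_pop, cum_inc
        cum_pop += weight / total_weight            # cumulative household share
        cum_inc += income * weight / total_income   # cumulative income share
        # Trapezoid between consecutive points on the Lorenz curve.
        area += (cum_pop - prev_pop) * (cum_inc + prev_inc) / 2.0
    # Gini is one minus twice the area under the Lorenz curve.
    return 1.0 - 2.0 * area


if __name__ == "__main__":
    print(gini([10, 10, 10, 10]))   # perfectly equal incomes
    print(gini([0, 0, 0, 100]))     # one household holds all income
```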

The percentage of households by income level is based on an analysis that seeks to illustrate broad shifts in the household income distribution. For each region, “middle-income” households were defined in 1980 (based on 1979 income) as those in the middle 40 percent of the distribution (with upper- and lower-income households defined implicitly), and the upper and lower income values capturing them were identified. These middle-income boundary values were adjusted for each subsequent year to rise (or fall) by the same percentage as real average household income, and the share falling between the adjusted “middle-income” boundaries was calculated (along with the corresponding upper- and lower-income household shares). Thus, the percentage of middle-income households each year, for example, reflects the share of households enjoying the same relative standard of living as the middle 40 percent of households did in 1979.
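The middle-income analysis can be sketched as below. This is an unweighted illustration that assumes incomes are already expressed in real (inflation-adjusted) dollars, takes the middle 40 percent as the households between the 30th and 70th percentiles, and uses a nearest-rank percentile with a particular boundary convention; those details are assumptions, as are all names.

```python
def nearest_rank(sorted_values, pct):
    """Nearest-rank percentile using integer arithmetic (pct in 0..100)."""
    k = max(1, -(-pct * len(sorted_values) // 100))  # ceiling division
    return sorted_values[k - 1]


def middle_bounds(base_incomes, lower_pct=30, upper_pct=70):
    """Income values bounding the middle 40 percent in the base year."""
    ordered = sorted(base_incomes)
    return nearest_rank(ordered, lower_pct), nearest_rank(ordered, upper_pct)


def adjust_bounds(lower, upper, base_mean, later_mean):
    """Shift both bounds by the growth rate of real average household income."""
    growth = later_mean / base_mean
    return lower * growth, upper * growth


def income_shares(incomes, lower, upper):
    """Shares of households at or below, between, and above the bounds."""
    n = len(incomes)
    low = sum(1 for x in incomes if x <= lower)
    high = sum(1 for x in incomes if x > upper)
    return low / n, (n - low - high) / n, high / n


if __name__ == "__main__":
    base = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    later = [11, 22, 30, 35, 50, 60, 80, 90, 120, 150]
    lo, hi = middle_bounds(base)
    lo2, hi2 = adjust_bounds(lo, hi, sum(base) / len(base), sum(later) / len(later))
    print(income_shares(base, lo, hi))
    print(income_shares(later, lo2, hi2))
```

In the toy data the middle-income share falls from 40 to 30 percent, because the later distribution has spread out relative to the adjusted bounds.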


The principal cities–suburbs job distribution data is from the State of the Cities Data Systems (SOCDS), Decennial Census, and American Community Survey Data for 1980 through 2000 (see note 1). The SOCDS data includes employment counts on a place-of-work basis for CBSAs, principal cities, and suburbs (based on the OMB’s December 2003 definitions). Because the SOCDS data had not yet been updated with 2010 information at the time of writing, we drew 2010 data from the Longitudinal Employer-Household Dynamics (LEHD) data, which uses a variety of data sources to generate estimates of total workers/jobs by census block (among other variables). We matched the block-level data to CBSAs and principal cities using GIS software to summarize the data as needed.

All spatial-segregation measures (spatial segregation by race and income, spatial poverty, and poverty concentration) were generated using data from GeoLytics, with the exception of 2010, for which we used the 2010 Census (SF1) for racial segregation, and the 2012 ACS five-year summary file for all other measures. While the GeoLytics data originates from the decennial censuses of each year, the advantage of the GeoLytics data is that it has been “reshaped” to be expressed on 2010 census-tract boundaries (the same geography used by the 2010 Census and the 2012 ACS), and so the underlying geography for our calculations is consistent over time. The census-tract boundaries of the original decennial census data change with each release, which could potentially cause a change in the value of our spatial segregation measures even if no actual change in residential segregation occurred.

Segregation by race and the poverty dissimilarity index are both measured using the “dissimilarity index,” which is calculated for two racial (or any other sort of) groups and can be construed as indicating the share of one group that would have to move to a new census tract to make the distribution of the two groups across all tracts in the region the same (see note 2). Our method for measuring income segregation was derived from the report Residential Segregation by Income, 1970–2009 (Bischoff and Reardon 2013). The only difference is that we focused on household income rather than family income so that we could generate measures of income segregation that cover the total population, including unrelated individuals, who are not included in family income measures. We organized census tracts within each region into six income categories, based on the ratio of tract-level median household income to the regional average (with the latter figured as a weighted average of the tract medians). The six income categories are defined as follows. Poor includes tracts with median household income of less than 67 percent of the regional average; low income means 67–79 percent of the regional average; low-mid income, 80–99 percent; high-mid income, 100–124 percent; high income, 125–149 percent; and affluent, income of 150 percent of the regional average or more. Once each tract’s income category was determined, population was summed by income category across all tracts in each region to get the distribution shown in the tables.
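The two calculations in this paragraph can be sketched as follows. The dissimilarity index uses the standard two-group formula, half the sum over tracts of the absolute difference between each group's tract share. For the income categories, the text does not say what weights enter the regional average; weighting tract medians by tract population is an assumption of this sketch, as are the function names and category boundaries at fractional ratios.

```python
def dissimilarity_index(group_a, group_b):
    """Two-group dissimilarity index from per-tract counts (same tract order)."""
    total_a, total_b = sum(group_a), sum(group_b)
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(group_a, group_b)
    )


def income_category(tract_median, regional_average):
    """Assign a tract to one of the six income categories by its income ratio."""
    ratio = tract_median / regional_average
    if ratio < 0.67:
        return "poor"
    if ratio < 0.80:
        return "low income"
    if ratio < 1.00:
        return "low-mid income"
    if ratio < 1.25:
        return "high-mid income"
    if ratio < 1.50:
        return "high income"
    return "affluent"


def population_by_category(tracts):
    """Sum population by income category.

    tracts: list of (median_household_income, population) tuples. The regional
    average is a population-weighted mean of tract medians (an assumption).
    """
    total_pop = sum(pop for _, pop in tracts)
    regional_average = sum(med * pop for med, pop in tracts) / total_pop
    totals = {}
    for median, pop in tracts:
        category = income_category(median, regional_average)
        totals[category] = totals.get(category, 0) + pop
    return totals


if __name__ == "__main__":
    print(dissimilarity_index([75, 25], [25, 75]))
    print(population_by_category([(50, 100), (100, 100), (150, 100)]))
```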

The calculations for spatial poverty and poverty concentration were somewhat simpler. For spatial poverty, tracts were tagged as high poverty or very high poverty if the official federal poverty rate was above 20 or 40 percent, respectively. Within each region, the total population and the total number of persons falling below the federal poverty line residing in such tracts were determined, and divided by the respective regional total to get the figures shown in the tables.
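A brief sketch of the spatial poverty shares is given below. It assumes each tract is described by its total population and its number of persons below the poverty line, and it computes the tract poverty rate as the ratio of the two (in practice the denominator would be the population for whom poverty status is determined). Names and data are hypothetical.

```python
def spatial_poverty_shares(tracts, threshold):
    """Shares of regional population and of poor persons in high-poverty tracts.

    tracts: list of (population, persons_below_poverty) tuples.
    threshold: poverty-rate cutoff, e.g. 0.20 (high) or 0.40 (very high).
    Returns (share_of_total_population, share_of_poor_population).
    """
    region_pop = sum(pop for pop, _ in tracts)
    region_poor = sum(poor for _, poor in tracts)
    tagged = [(pop, poor) for pop, poor in tracts if pop > 0 and poor / pop > threshold]
    tagged_pop = sum(pop for pop, _ in tagged)
    tagged_poor = sum(poor for _, poor in tagged)
    return tagged_pop / region_pop, tagged_poor / region_poor


if __name__ == "__main__":
    tracts = [(1000, 100), (1000, 300), (1000, 500)]
    print(spatial_poverty_shares(tracts, 0.20))
    print(spatial_poverty_shares(tracts, 0.40))
```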


The data on educational attainment is from the BRR database, with data from the 2010 ACS one-year summary file used to fill in values for 2010. Also, due to lack of detail on educational attainment in the BRR database in 1980, educational-attainment data for that year is from the IPUMS-based dataset and relies on years of schooling completed rather than degrees earned. For 1980, completing 12, 16, or 18 or more years of schooling was taken to be the equivalent of a high school diploma, bachelor’s degree, or graduate or professional degree, respectively, while those with schooling of 13–15 years were assigned to the some college category.
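The 1980 recoding from years of schooling to attainment categories might look like the sketch below. The text names the 12, 13–15, 16, and 18-or-more cases; treating 17 years as a bachelor's degree and fewer than 12 years as less than high school are assumptions added here to make the mapping complete.

```python
def attainment_from_years(years_completed):
    """Map years of schooling completed (1980 data) to an attainment category."""
    if years_completed >= 18:
        return "graduate or professional degree"
    if years_completed >= 16:
        # 16 years per the text; 17 years is assigned here by assumption.
        return "bachelor's degree"
    if years_completed >= 13:
        return "some college"
    if years_completed == 12:
        return "high school diploma"
    # Category label for fewer than 12 years is an assumption.
    return "less than high school"


if __name__ == "__main__":
    for years in (8, 12, 14, 16, 18):
        print(years, attainment_from_years(years))
```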

Data on workers by industry is from Woods & Poole Economics. We used this data because, unlike the publicly available Quarterly Census of Employment and Wages, it provides data back to 1980, estimated for NAICS-coded industries. The industry groups shown mostly correspond to single two-digit NAICS codes, with some groupings to simplify the industry categories. The following NAICS codes were grouped together to form a single industry in the tables: agriculture (NAICS 11) and mining (NAICS 21); transportation and warehousing (NAICS 48–49) and utilities (NAICS 22); finance and insurance (NAICS 52) and real estate (NAICS 53); professional services (NAICS 54), management (NAICS 55), and administrative support (NAICS 56); and other services (NAICS 81), arts, entertainment, and recreation (NAICS 71), and accommodation and food services (NAICS 72).


All data appearing in the Industry and Wage Structure portion of each table is based on an analysis of data from the Quarterly Census of Employment and Wages (QCEW), which only reports industry data on a NAICS basis from 1990 forward, with some minor supplementation from Woods & Poole Economics, where there were undisclosed industry data for any particular region/industry. Given differences in methodology between the two data sources, it would not be appropriate to simply plug in corresponding Woods & Poole data directly to fill in the QCEW data for undisclosed industries. Our approach was to first calculate the number of jobs and total wages from undisclosed industries in each region, and then distribute those quantities across the undisclosed industries in proportion to their reported numbers in the Woods & Poole data. This was done after first making some simple adjustments to the Woods & Poole data to better align it with the QCEW, which includes only wage and salary workers rather than all workers.
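The residual-distribution step for undisclosed industries can be sketched as follows. The sketch assumes a regional total is available, marks undisclosed industries with None, and spreads the residual in proportion to reference values standing in for the (already adjusted) Woods & Poole figures. The same function would be applied separately to jobs and to total wages; all names and numbers are hypothetical.

```python
def fill_undisclosed(region_total, disclosed, reference):
    """Distribute the undisclosed residual across suppressed industries.

    region_total: total jobs (or wages) for the region across all industries.
    disclosed: dict industry -> reported value, or None if undisclosed.
    reference: dict industry -> reference value used for proportional shares.
    """
    missing = [ind for ind, value in disclosed.items() if value is None]
    if not missing:
        return dict(disclosed)
    # Residual is whatever the disclosed industries do not account for.
    residual = region_total - sum(v for v in disclosed.values() if v is not None)
    reference_total = sum(reference[ind] for ind in missing)
    if reference_total <= 0:
        raise ValueError("reference values for undisclosed industries must be positive")
    filled = dict(disclosed)
    for ind in missing:
        filled[ind] = residual * reference[ind] / reference_total
    return filled


if __name__ == "__main__":
    reported = {"manufacturing": 400, "retail": 300, "mining": None, "utilities": None}
    print(fill_undisclosed(1000, reported, {"mining": 60, "utilities": 140}))
```

Because only the reference proportions are used, the filled values always sum back to the reported regional total.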

Despite its only being available on a consistent NAICS basis from 1990 forward, the QCEW was chosen for the analysis of industry wage structure because it is the most comprehensive data source for employment and wages by industry, covering 98 percent of all jobs in the United States. The analysis seeks to track shifts in regional industrial job composition and wage growth over time by industry wage level for the private sector. Using 1990 as the base year, we classified broad industries (at the two-digit NAICS level) into low-, middle-, and high-wage categories. An industry’s wage category was based on its average annual wage, and each of the three categories contained approximately one-third of the nineteen two-digit private NAICS industries for each region.

We applied the 1990 industry wage category classification across all the years in the dataset, so that the industries within each category remained the same over time. This way, we could track the broad trajectory of jobs and wages in low-, middle-, and high-wage industries. This approach was adapted from a method used in a Brookings Institution report, Building from Strength: Creating Opportunity in Greater Baltimore’s Next Economy (Vey 2012).
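A minimal sketch of the fixed base-year classification is shown here. Industries are ranked by base-year average annual wage and split into three groups of roughly equal size; exactly how ties and uneven group sizes were handled is not stated in the text, so the cut rule below is an assumption, as are the names and data.

```python
def classify_by_base_year_wage(base_wages):
    """Assign each industry to a low, middle, or high wage category.

    base_wages: dict industry -> average annual wage in the base year.
    Industries are ranked by wage and split into approximate thirds.
    """
    ranked = sorted(base_wages, key=base_wages.get)
    n = len(ranked)
    labels = {}
    for rank, industry in enumerate(ranked):
        if 3 * rank < n:
            labels[industry] = "low"
        elif 3 * rank < 2 * n:
            labels[industry] = "middle"
        else:
            labels[industry] = "high"
    return labels


def jobs_by_category(jobs, labels):
    """Total jobs in each wage category for one year, using fixed labels."""
    totals = {"low": 0, "middle": 0, "high": 0}
    for industry, count in jobs.items():
        totals[labels[industry]] += count
    return totals


if __name__ == "__main__":
    wages_1990 = {"a": 20, "b": 25, "c": 40, "d": 45, "e": 80, "f": 90}
    labels = classify_by_base_year_wage(wages_1990)
    jobs_2010 = {"a": 10, "b": 20, "c": 30, "d": 40, "e": 50, "f": 60}
    print(labels)
    print(jobs_by_category(jobs_2010, labels))
```

Holding the labels fixed at their base-year values is what allows later years to be compared on the same set of industries.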

Appendix B: Data Sources and Methods for Regional Profiles

1. Available at

2. The formula used to calculate it is well established, and made available by the US Census Bureau at
