Research project

Integrated Census Microdata (I-CeM)

Unlocking our past

Funded by a grant from the Economic and Social Research Council (ESRC), the Integrated Census Microdata (I-CeM) project was a three-year programme which produced a standardised, integrated dataset of most of the censuses of Great Britain for the period 1851 to 1911. The dataset is available at the UK Data Archive.

The Integrated Census Microdata (I-CeM) is a collection of individual-level census data for Great Britain covering the period 1851 to 1911 (England and Wales, 1851-1861 & 1881-1911; Scotland, 1851-1901) – some 185 million person records in total.  It is integrated because the underlying raw census data have been enhanced through the creation of multiple standardised derived variables which have been specially designed to facilitate comparative analyses over time.

By making available to academic researchers detailed information about everyone resident in this country, collected at decennial censuses from 1851 to 1911, the I-CeM data collection, one of the largest of its kind in the world, has transformed the landscape for research in the economic, social, and demographic history of this country during a period of profound change in the wake of the industrial revolution.

Created as a result of a major award from the UK Economic and Social Research Council (ESRC), this project is a collaboration between the University of Essex, the University of Leicester, and a commercial partner, FindMyPast.

The I-CeM data collection is available to academic researchers and teachers via the UK Data Service (UKDS) in two forms – in an anonymised version available online, and in a full version via secure data access arrangements. 

What has the project involved?

The decennial census data for Great Britain are among the most heavily used of all national historic records. Either in aggregated form or as individual or household records, they form the bedrock of much historical research for the nineteenth and early twentieth centuries. The specific details of each census changed from decade to decade, but the core information collected for the period 1851 to 1911 includes, for every individual living in the country: place of residence, gender, age, household membership, marital status, occupation, place of birth and disability. These and other raw census data have been transcribed from the original manuscript records by our commercial partner, the genealogical service provider FindMyPast, a massive undertaking in its own right.

However, whilst an invaluable resource for those wishing to search for the records relating to specific individuals (largely yet not exclusively genealogists and family historians), in their raw transcribed form, the census data are of limited value for large scale historical research. This is due to the simple reason that whilst generated as responses to formulaic questions, the census data are essentially textual data.

The point is illustrated by a few simple statistics. Stating where one was born results in over 6 million uniquely different answers in the I-CeM data collection; providing one’s occupation, over 5 million different responses; and even the relatively simple question about one’s relationship within the household generates hundreds of thousands of replies. The data include over three hundred different ways of expressing the occupation ‘Blacksmith’ and over two hundred ways of recording ‘Hammersmith’ as a birthplace. Standardising these data to allow, say, all blacksmiths living or born in Hammersmith to be identified and analysed along with the households in which they lived, was a central task of the original I-CeM project.
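The kind of standardisation described above can be sketched as a look-up from cleaned text strings to a single standard value. This is a minimal illustration only: the variant spellings, the standard label and the function names below are assumptions made for the example, not the project's actual dictionaries or method.

```python
import re
from typing import Optional

# Hypothetical variant spellings; the real dictionaries are far larger.
OCCUPATION_LOOKUP = {
    "blacksmith": "BLACKSMITH",
    "black smith": "BLACKSMITH",
    "blk smith": "BLACKSMITH",
    "blacksmyth": "BLACKSMITH",
}


def normalise(text: str) -> str:
    """Lower-case the string, turn punctuation into spaces, collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())


def standardise_occupation(raw: str) -> Optional[str]:
    """Return the standard label for a raw response, or None if it is unknown."""
    return OCCUPATION_LOOKUP.get(normalise(raw))


print(standardise_occupation("  Black-Smith. "))
print(standardise_occupation("Baker"))
```

Responses that do not match any known variant return None here, so that they can be reviewed rather than silently misclassified.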

Data standardisation and harmonisation over time was the starting point for the I-CeM project, yet the project went a lot further than this. From the standardised variables, a wide range of data enhancements and derived variables have been constructed to augment the transcribed census data. Thus, to give but two examples, all households containing servants are directly identifiable, as are those that contain a married child living with their parents. In consequence, the I-CeM data collection makes census-based research dramatically quicker and easier and opens up this key resource to many.
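As a rough illustration of how a household-level derived variable of this kind might be built from standardised individual records, the sketch below flags households that contain at least one servant. The record layout, household identifiers and relationship labels are hypothetical and are not taken from the I-CeM dataset.

```python
from collections import defaultdict

# Hypothetical records: (household identifier, standardised relationship to head)
records = [
    (1, "HEAD"), (1, "WIFE"), (1, "SERVANT"),
    (2, "HEAD"), (2, "SON"),
    (3, "HEAD"), (3, "SERVANT"), (3, "SERVANT"),
]


def households_with(records, relationship):
    """Return the sorted identifiers of households with at least one such member."""
    members = defaultdict(set)
    for household_id, rel in records:
        members[household_id].add(rel)
    return sorted(h for h, rels in members.items() if relationship in rels)


print(households_with(records, "SERVANT"))
```

Once the relationship strings have been standardised, a derived variable like this reduces to a single pass over the records.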

In conjunction with the UKDS, the I-CeM project also created an online interface and bespoke dissemination tool for authenticated users to access versions of the I-CeM data to generate their own datasets. The use of these data is additionally supported by a comprehensive User Guide and other documentation and tools.

Research and learning possibilities

The British nineteenth-century and early twentieth-century decennial census returns are an invaluable historical source of information for social and economic analysis of the Victorian and Edwardian periods. Much of the history of the period could not be written without this source.

From 1851 onwards, the decennial British census returns contain vast amounts of comparable information on house and household structures, and, for each individual, on name, marital condition, relationship to head of household, age, sex, occupation, birthplace, and some medical disabilities. Later censuses include information on home working, industry of employment, and the fertility of married couples.

However, large-scale academic analysis of manuscript sources such as the censuses has traditionally required time-consuming manual inputting of data from the census returns into computer systems, which limits the scope, geographical scale and time periods of the research that can be undertaken.

I-CeM's comprehensive temporal and spatial coverage offers support for research across a number of key historical and social science disciplines. It also allows comparative analysis alongside similar international research resources, where they exist. The extensive I-CeM data collection relieves researchers of the need to key in their own data, which increases the time available for analysis and enables researchers to widen the range and complexity of the issues that can be addressed from the census returns.

In addition, the ability to analyse complete censuses, rather than samples or local subsets of the data, opens up genuinely new fields of research that were not previously possible or practical. The accumulated bibliography of publications using I-CeM data gives an indication of the range of research possibilities.

Given the relative ease with which the I-CeM data can be utilised, the collection also offers huge potential for undergraduate and postgraduate learning engagements in a number of contexts, both methodological and substantive. This is true not only of the raw data but also of products derived from I-CeM, such as Populations Past, which allows aggregated census data and derived variables to be mapped and compared for England and Wales.

Access to Data

The I-CeM dataset is held at the UK Data Service (UKDS) at the University of Essex in two forms - a ‘full’ version and an ‘anonymised’ version without names and addresses.

Anonymised dataset

Accredited researchers in higher education institutions can download data from the anonymised version to their own computers via a bespoke download facility at the UKDS. This allows researchers to identify the data on particular individuals through selection criteria, and then to download it once an end user licence agreement has been signed. The UKDA download facility also enables users to create tabulations of data online using the NESSTAR analysis software.

Full dataset

Name and address information is excluded from the general I-CeM database mentioned above due to restrictions placed on the data by the data owners, BrightSolid. Accredited researchers who wish to obtain these variables are required to apply for a special licence via the UKDS.

 

Documentation and supporting materials

What are they for?

In the course of the I-CeM project, various documents, spreadsheets, and datasets have been created, partly as spin-offs from the work done and partly in order to explain various facets of the data. These are made available here to help users of the I-CeM dataset, but also as stand-alone academic products in their own right.

Browse the following sections for supporting documentation.

The I-CeM guide

 

This is the general user manual for the I-CeM data collection, covering:

  • the history of British census-taking, 1851-1911;
  • the history of the individual censuses, with lists of the official publications created, and images of the schedules and enumerators' books used in the census;
  • the provenance of the I-CeM data;
  • the enrichment programme for the I-CeM data, explaining the processes of reconciliation, reformatting, standardisation, and the creation of inference and enriched variables from the original data;
  • descriptions of all the variables in the I-CeM dataset, including the variable labels, form and length, which census years they cover, their access status, and their possible values;
  • and means of access to the dataset.

The Guide has a navigation pane for ease of use. To view the .pdf navigation pane, click the Adobe Reader Toolbar icon at the end of the Tools pop-up, and then click the 'Bookmarks' icon (second down).

The latest version of the Guide is dated August 2020. It is recommended that users of I-CeM have access to the latest version of the Guide.

Missing data and differences in population counts

It is important for users to realise that not all census records have survived. This is especially true for 1851 and 1861, where in some cases data for whole parishes or even Registration Sub-Districts are ‘wanting’. Further details on missing data are available from both The National Archives and FindMyPast websites.

In addition, users are referred to a paper by Bennett, van Lieshout and Schürer which estimates the levels of missing data in the I-CeM data collection, together with ‘weights’ that have been calculated to ‘correct’ for missing data when undertaking analyses at a national level.
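To indicate how weights of this kind are typically applied, the sketch below computes a weighted count in which each record contributes its weight rather than 1. It assumes, purely for illustration, that a weight is attached to each area and that records from areas with missing data therefore count for more than one person; the structure and values of the weights published in the paper may differ, and all names and figures here are hypothetical.

```python
# Hypothetical individual records and hypothetical area-level weights.
records = [
    {"area": "A", "occupation": "BLACKSMITH"},
    {"area": "A", "occupation": "BAKER"},
    {"area": "B", "occupation": "BLACKSMITH"},
    {"area": "B", "occupation": "BLACKSMITH"},
]
weights = {"A": 1.0, "B": 1.5}  # area B is assumed to have missing records


def weighted_count(records, weights, occupation, default=1.0):
    """Sum the weights of all records with the given occupation."""
    return sum(
        weights.get(rec["area"], default)
        for rec in records
        if rec["occupation"] == occupation
    )


print(weighted_count(records, weights, "BLACKSMITH"))
```

An unweighted count would give three blacksmiths; the weighted count is higher because the two records from area B each stand in for more than one person.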

A further problem exists in the identification of missing census records at the level of the parish. As part of the I-CeM project, an attempt was made to reallocate each individual census record to the census parish in which it was enumerated.

This process is known as reconciliation. Whilst this sounds like a straightforward exercise, it proved to be laborious and time-consuming, and it can never be exact because of the limitations of the census information that underpins it.

In short, in order to allocate each page of the census data to an administrative parish, the I-CeM project team had to rely on the transcribed information at the head of each census page, which is non-standard, imprecise and sometimes incorrect. Despite this, each census record was allocated a parish identifier (PARID, see Data Dictionaries section below). The table below shows the difference between the population of parishes found in the raw I-CeM data following reconciliation, and what one would expect from the published Census Reports.

Subsequent to the release of the I-CeM data via the UK Data Service, a number of corrections to the PARID allocation have been made for parishes in England and Wales. These corrections can be implemented by downloading the following look-up tables.
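Applying a look-up table of corrections generally amounts to replacing a value wherever the table holds an entry for the record and leaving it unchanged otherwise. The sketch below assumes, hypothetically, that corrections are keyed on a record identifier called RecID; the actual layout of the look-up tables is not described here and should be checked before use.

```python
def apply_parid_corrections(records, corrections):
    """Return new records with PARID replaced wherever a correction exists."""
    return [
        {**rec, "PARID": corrections.get(rec["RecID"], rec["PARID"])}
        for rec in records
    ]


# Hypothetical data: RecID is an assumed key, not a documented variable.
records = [
    {"RecID": 1, "PARID": 100},
    {"RecID": 2, "PARID": 200},
]
corrections = {2: 250}

corrected = apply_parid_corrections(records, corrections)
print([rec["PARID"] for rec in corrected])
```

The function returns new records rather than altering the originals, so the uncorrected values remain available for comparison.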

Duplicate records

Both the original census data and the FindMyPast transcriptions contain a relatively small number of duplicate records as well as empty (null) records. Whilst every effort was made to identify these during the processing stages of the I-CeM project, some additional duplicate and null/empty records have subsequently been identified. These can be identified via the following look-up table.

Consistent Parish Geographies

Given that many users will wish to analyse census data over time for comparable geographical units, the I-CeM team endeavoured to identify the parishes of enumeration consistently across all censuses, thus creating the CONPARID variable in the I-CeM dataset - see the I-CeM Guide for a definition. This is based on the work of Professor Sir E. A. Wrigley for England and Wales, and of Professor Michael Anderson for Scotland.

In creating consistent geographies the basic logic is to amalgamate parishes where necessary so that the geographical territory under consideration remains constant over time. So, for example, assume that part of parish A was transferred to parish B between census years. In order to create a consistent geographical unit over time one would need to treat them not as separate parishes, but as a single entity. The reasoning for producing a consistent geography variable is that it facilitates comparisons over time, where, as far as possible, like is being compared to like.
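The amalgamation logic can be sketched as a grouping problem: any two parishes that exchanged territory are merged, and merges are chained so that each group of connected parishes forms one consistent unit. The example below uses a simple union-find structure to do this. It illustrates the general idea only and is not the method used to construct CONPARID; the parish labels and transfers are hypothetical.

```python
def consistent_units(parishes, transfers):
    """Map each parish to the label of the amalgamated unit it belongs to."""
    parent = {p: p for p in parishes}

    def find(p):
        # Follow parent links to the root, shortening the path on the way.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in transfers:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller label as the root so results are deterministic.
            keep, merge = sorted((root_a, root_b))
            parent[merge] = keep

    return {p: find(p) for p in parishes}


# Hypothetical example: territory moved between A and B, and between B and C.
units = consistent_units(["A", "B", "C", "D"], [("A", "B"), ("B", "C")])
print(units)
```

Parishes A, B and C end up in one unit because they are linked by transfers, while D, which was unaffected, remains a unit on its own.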

Due to the multiple changes in census enumeration geography over time, and the fact that parish boundaries themselves change, the project produced two sets of consistent geography: one for the period 1851-1891, the other for 1901-1911. Mapping of the I-CeM data for England and Wales can be achieved by using these consistent parish variables in combination with the GIS of historical parishes created by Satchell, A.E.M., Kitson, P.K., Newton, G.H., Shaw-Taylor, L. and Wrigley, E.A. (2018). 1851 England and Wales census parishes, townships and places. [Data Collection]. Colchester, Essex: UK Data Archive. 10.5255/UKDA-SN-852232, available from the UKDS.

The ID variable in the dbf_file for this GIS can be linked to the I-CeM variable CONPARID via the look-up table below:

 

Data Dictionaries

What are data dictionaries?

As part of the I-CeM project a number of data dictionaries have been created. These essentially provide users with detailed information on how key variables have been coded and standardised. These are made available here to help users of the I-CeM data collection but also as stand-alone academic products in their own right which can be used in conjunction with related research projects.

The Marital Condition Dictionary

This dictionary gives the text strings relating to marital condition in the COND variable of the I-CeM dataset, and their relationship to the MAR coded variable – see the I-CeM Guide for definitions.

The Relationship to Head Dictionary

This dictionary gives the text strings relating to relationship to head in the RELAT variable in the I-CeM dataset, and how these relate to the codes in the RELA variable – see the I-CeM Guide for definitions.

The Occupational Matrix

The I-CeM Occupational Matrix indicates the position of occupational headings in the occupational tables in the published Census Reports for England and Wales and Scotland for the years 1851 to 1911. Each row of this Matrix gives the position of one occupational heading in the occupational classifications used in the Census Reports for this period. The HISCO coding of each of the occupational headings is also given. Each row is given a number which then becomes the value found in the OCCODE variable in the I-CeM dataset. A more detailed introduction to the Matrix, its construction and uses is available.

The Employment Status Dictionary

The I-CeM Employment Status Dictionary gives the text strings relating to employment status found in the EMPLOY variable in the I-CeM dataset, and the equivalent employment status code EMPLOYCODE – see the I-CeM Guide for definitions.

The Parish (PARID) Dictionary

This identifies the ‘parish’ of enumeration listed in the various tables published year by year in the GRO and GRO(S) Census Reports. It is, therefore, not consistent over time. Equally, the same named parish in different years may not cover the same geographical territory, due to boundary changes over time. This is the basis of the PARID variable in the I-CeM dataset – see the I-CeM Guide for a definition. A full list of these parish units is available for each year.

The Placelist (STD_PAR) Dictionary

This gives a list of all the places, hamlets, townships, etc., given as birthplaces in the I-CeM dataset, with the STD_PAR parish, CNTI county, and ALT_CNTI county in which they fall. See the I-CeM Guide for definitions of these variables.

The Disability Dictionary

The I-CeM Disability Dictionary gives a listing of the text strings found in the disabilities column (DISAB), and the DISCODE1 and DISCODE2 values they are given in the I-CeM dataset – see the I-CeM Guide for definitions. The DISCODE1 variable is a 7-digit code, but the Dictionary only gives the code from the first positive value. Thus, ‘Deaf Dumb Born Like It’ has a value of 0100010 in the dataset, which indicates that information on dumbness and hearing impairment is present, as is information about duration of disability, but there is no information regarding visual impairment, idiocy and imbecility, lunacy, other disabilities, or information relating to severity of disability. In the Dictionary, however, the string has a value of 100010, the first ‘0’, which indicates the absence of information on visual impairment, being omitted.
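When matching Dictionary values against the dataset, the omitted leading zeros need to be restored. A minimal sketch, assuming the codes are handled as text strings (the function names are illustrative only):

```python
def restore_discode1(dictionary_value: str) -> str:
    """Left-pad a Dictionary value with zeros to the 7-digit dataset form."""
    if not dictionary_value.isdigit() or len(dictionary_value) > 7:
        raise ValueError(f"not a valid DISCODE1 value: {dictionary_value!r}")
    return dictionary_value.zfill(7)


def nonzero_positions(code: str) -> list:
    """Return the 1-based positions of the digits that are not zero."""
    return [i for i, digit in enumerate(code, start=1) if digit != "0"]


code = restore_discode1("100010")
print(code)
print(nonzero_positions(code))
```

For the example string above, the restored code has non-zero digits in the second and sixth positions, which correspond to the two kinds of information described as present.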

The Language Dictionary

This dictionary gives the text strings relating to language spoken in the LANG variable of the I-CeM dataset, and their relationship to the LANGCODE variable – see the I-CeM Guide for definitions.

The Number of Rooms Dictionary

This dictionary gives the text strings relating to the number of rooms in the NOOFROOMS variable in the I-CeM dataset, and their relationship to the NOOFROOMSCODE variable – see the I-CeM Guide for definitions.

The Building Type Dictionary

This gives the text strings relating to building types in the BUILDTYPE variable in the I-CeM dataset, and their relationship to the BTCODE variable – see the I-CeM Guide for definitions.

People and partners

The I-CeM project was conceived and developed by Professors Kevin Schürer and Edward Higgs. Their joint bid to the Economic and Social Research Council for funding (The Integrated Census Microdata (I-CeM) Project, ESRC Award Ref: RES-062-23-1629) was successful, the pair having previously brokered access to the raw digital data with the project’s commercial partner FindMyPast, which, as part of the BrightSolid group, is the leading UK provider of online genealogy and family history services.

The original ESRC-funded I-CeM project hired a number of research assistants at the University of Essex whose contributions were invaluable. Throughout its duration, the project also gained help and support from a wide body of others, including representatives at the Universities of Cambridge and Edinburgh, FindMyPast and the UK Data Archive.

Research Assistants

Jamie Collins

Former Research Assistant, University of Essex

Nicola Farnworth

Former Research Assistant, University of Essex

Lisa Gardner

Former Research Assistant, University of Essex

Mitch Goodrum

Former Research Assistant, University of Essex

Christine Jones

Former Research Assistant, University of Essex

Amanda Wilkinson

Former Research Assistant, University of Essex
