What is Methodology?
Understanding methodological research
Methodology is one of the research programmes that characterise the ADRC-England research portfolio. Although the term 'methodology' covers a broad area in data science, ADRC-E researchers focus on data linkage and linked data quality of administrative data, paradata, survey data and synthetic data.
Administrative datasets (such as records of service users or benefit claimants) may contain many millions of records, each usually relating to an individual person. In some cases the same identifier (for example a national insurance number) may be used by more than one agency, but often there is no common identification code.
Administrative data researchers are concerned to understand how individuals are connected in different datasets – for example, what type of people are using a service and which of them are most likely to go on to be claimants of a benefit, perhaps in relation to the ages of individuals and the areas in which they live. Being able to investigate these types of questions can be very important to understanding how effectively services are meeting current needs, or how they might need to be adapted to meet future demand.
The standard ADRN approach to administrative data research is for a separate organisation (known as a ‘trusted third party’) to link together the records from each dataset which relate to the same individuals (for example by matching their reference numbers or names) and then to delete all the identifying information. The researchers are then provided with this anonymised linked dataset in which service users and benefit claimants have been matched. The linked data can be used to investigate how service use varies by age and area, and which users go on to claim the benefit of interest. Researchers may, for example, build statistical models to predict the likelihood that service users with different characteristics will need a particular benefit in the future. Much of the research supported by the ADRN follows this broad pattern.
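The link-then-delete pattern described above can be illustrated with a minimal sketch. It assumes both datasets happen to share an identifier; the field names (`ni_number`, `age_band`, `area`), the function `link_and_deidentify` and the records themselves are hypothetical and do not represent any actual ADRN procedure or data.

```python
def link_and_deidentify(service_records, benefit_records, key="ni_number"):
    """Join two lists of records on a shared identifier, then drop identifiers."""
    benefit_keys = {record[key] for record in benefit_records}
    linked = []
    for study_id, record in enumerate(service_records):
        linked.append({
            "study_id": study_id,  # meaningless sequence number, not an identifier
            "age_band": record["age_band"],
            "area": record["area"],
            "claimed_benefit": record[key] in benefit_keys,
        })
    return linked


# Invented example records (hypothetical values only).
service = [
    {"ni_number": "AA000001A", "name": "A. Example", "age_band": "25-34", "area": "North"},
    {"ni_number": "AA000002B", "name": "B. Example", "age_band": "35-44", "area": "South"},
    {"ni_number": "AA000003C", "name": "C. Example", "age_band": "25-34", "area": "South"},
]
benefits = [{"ni_number": "AA000002B", "name": "B. Example"}]

linked = link_and_deidentify(service, benefits)
for row in linked:
    print(row)
```

The researcher would only ever see rows like those printed here: coded characteristics and the linked outcome, with names and reference numbers removed.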
Due to the very large number of people covered by most administrative datasets, this is a very powerful way of understanding important patterns in society. Many more people are included than could ever be covered, for example, by a survey of service users or the general population.
However, there are still important methodological aspects of administrative data research which remain challenging and are a focus for ADRC-E researchers at the University of Southampton and UCL.
Areas of methodological research
Examples of methodological research include understanding the types of error which may be present in different administrative datasets – for example people who are not correctly included or whose information is not up-to-date. These situations might occur if certain types of people have trouble registering to use services, or move residential address frequently and tend to transfer between agencies in different areas. These, and other similar issues, might be thought of as ‘data quality challenges’. As we begin to use administrative data more extensively, it is important that researchers understand these types of errors and document what they find, both to aid other researchers and to help organisations to improve their own administrative processes.
There are also methodological challenges associated with the processes used by trusted third parties to link records which do not have common identifiers. These methods will generally involve matching common characteristics such as names and dates of birth, but these can be prone to errors (e.g. when individuals use different forms of their name, change name or have not always provided a date of birth). Generally, using more, and better structured, information and different ways of recognising a possible match can help to improve the accuracy of the linked data and the reliability of research. A related consideration is how ages and areas should be coded in the matched data so that researchers obtain useful results but nobody can use the dataset to re-identify individuals.
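As a rough illustration of matching on common characteristics, the sketch below scores a pair of records using name similarity and date-of-birth agreement. It is a toy example rather than a description of any method used by a trusted third party: the weights (0.6 and 0.4), the treatment of a missing date of birth and the record fields are all assumptions chosen for illustration. It uses `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity between two names, from 0 (nothing shared) to 1 (identical)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec_a, rec_b):
    """Combine name similarity with date-of-birth agreement (arbitrary weights)."""
    score = 0.6 * name_similarity(rec_a["name"], rec_b["name"])
    if rec_a.get("dob") and rec_b.get("dob"):
        # Both dates present: full credit only for exact agreement.
        score += 0.4 if rec_a["dob"] == rec_b["dob"] else 0.0
    else:
        # A missing date is weak evidence either way, so give half credit.
        score += 0.2
    return score


a = {"name": "Katherine Smith", "dob": "1980-02-14"}
b = {"name": "Kathy Smith", "dob": "1980-02-14"}   # different form of the same name
c = {"name": "John Brown", "dob": "1975-11-02"}    # a different person

print(round(match_score(a, b), 2), round(match_score(a, c), 2))
```

Pairs scoring above a chosen threshold would be treated as links. Where that threshold is set determines the balance between missed links and false links.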
A third area of methodological research concerns the statistical models to be used by researchers. It is important to develop and use methods which are resilient to the types of errors present in linked administrative data and can still provide reliable estimates of the relationships of interest. These three broad themes are all important work areas for ADRC-England.
Examples of methodological research
Dr Jamie Moore - Understanding and monitoring the impact of non-response in social surveys
People failing to take part in or dropping out of surveys has always been a problem for those involved in running and using surveys. This is because non-response biases in estimates from the survey can arise if these people differ from survey respondents. For a long time, maximising survey response rates has been seen as the solution to minimising such biases. However, in the last thirty years response rates have declined, and a growing body of research suggests that they are at best weak predictors of bias anyway. Hence, methodologists have sought other ways to assess the risks posed to survey dataset quality by non-response.
Our research involves the development and validation of new methods of assessing survey dataset quality. Specifically, we focus on monitoring response within and across subgroups of the sample to check whether their representation in the survey dataset reflects that in the sample population. To obtain information on survey non-respondents (which is not otherwise available), we link the survey sample to their census records. We couple this with survey paradata detailing the attempts made by interviewers to contact survey non-respondents. The idea behind this approach is that it enables response to be monitored during the data collection period, informing adaptive strategies in which collection methods may be modified to maximise data quality and/or minimise collection costs. In particular, it means that decisions can be made about whether it is really essential to the overall quality of the survey dataset to pursue a participant beyond a certain number of attempts, or whether it would be better to focus on other participants in under-represented groups. Our findings are currently being utilised by the Office for National Statistics to optimise data collection in a number of national level government surveys.
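The basic idea of checking subgroup representation can be sketched as follows. This is a deliberately simple illustration, not the indicators developed in the project: the function `representation_gaps`, the `age_group` field and the ten-person sample are all hypothetical.

```python
from collections import Counter


def representation_gaps(sample, respondents, key):
    """Each subgroup's share among respondents minus its share in the full sample.

    Negative values flag groups that are under-represented among respondents.
    """
    sample_counts = Counter(person[key] for person in sample)
    respondent_counts = Counter(person[key] for person in respondents)
    return {
        group: respondent_counts[group] / len(respondents) - count / len(sample)
        for group, count in sample_counts.items()
    }


# Invented sample: five younger and five older people, with linked age groups.
sample = [{"id": i, "age_group": "16-34" if i < 5 else "35+"} for i in range(10)]
# Suppose one younger person and four older people have responded so far.
respondents = [sample[0]] + sample[5:9]

gaps = representation_gaps(sample, respondents, "age_group")
for group, gap in gaps.items():
    print(group, round(gap, 2))
```

Recomputing such gaps during the data collection period would indicate which groups further contact attempts should prioritise.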
Dr James Doidge - Classifying studies of linked data to understand linkage error
Linkage error — missed links between records that pertain to the same individual or false links between records that do not — can reduce the quality of linked data and influence the validity and precision of analysis results. The influence of linkage error is complex, depending on: the research question, the study design and the characteristics of the population or sample. Furthermore, in many applications, all possible linkage algorithms and probabilistic thresholds can involve biases in the same direction; in which case the common approach of varying linkage methods is inadequate for measuring sensitivity or bias.
We are developing a framework for classifying studies of linked data to facilitate sensitivity and bias analyses that account for the influence of linkage error. The classification system helps to identify combinations of design elements that differ across data linkage studies and affect the way that linkage error influences parameter estimates. In the context of each classification, missed links and false links can be translated into effects of selection bias and information bias, which in turn facilitates implementation of standard epidemiologic techniques for sensitivity and bias analysis.
Prof David Martin - Automated zone design for disclosure risk in spatially referenced synthetic data
We are exploring how automated zone design methods — computational tools for designing boundaries on a map to meet predefined criteria, for example all areas must contain more than a specified number of people — could make it easier for data owners to make information available to researchers without revealing locations in a way that might compromise personal privacy. Data owners tend, quite rightly, to have clear-cut policies designed to prevent individuals and where they live from being identified. While this works reasonably well to prevent identification of a specific location or individual, it can make analysis of detailed spatial patterns very difficult. An example of this might be a researcher who wants to look at where rare diseases occur.
Spatial aggregation is a standard approach to the protection of population data such as those collected from a census of population. When dealing with the analysis of administrative data, decisions must be made by the data provider regarding appropriate levels of spatial aggregation attached to individual records, both at the point of researcher access within a secure data laboratory and again on release of analysis results. Conventional rules of thumb regarding minimum threshold population sizes may not be a good indicator of risk, particularly where analysis concerns spatial relationships with environmental or social factors, which themselves are strongly geographically patterned, presenting additional geoprivacy challenges. This research uses our automated zone design software to investigate the trade-offs between disclosure risk and geographical detail in different research situations. We have constructed a large synthetic dataset of individuals and households so that we can undertake these experiments without risking disclosure of any real data.
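To make the minimum population threshold criterion concrete, the toy sketch below merges neighbouring areas until every zone meets a threshold. It is not the automated zone design software referred to above: it assumes the areas lie in a simple one-dimensional sequence, so that each area's neighbours are just the adjacent list entries, and the area names, populations and threshold are invented.

```python
def merge_to_threshold(areas, threshold):
    """Greedily merge neighbouring areas (in list order) into zones.

    `areas` is a list of (name, population) pairs. A zone is closed as soon
    as its total population reaches the threshold.
    """
    zones, current = [], []
    for name, population in areas:
        current.append((name, population))
        if sum(p for _, p in current) >= threshold:
            zones.append(current)
            current = []
    if current:
        # Leftover areas fall short of the threshold, so they join the last
        # zone. If no zone was ever closed, the total population is below
        # the threshold and a single undersized zone is returned.
        if zones:
            zones[-1].extend(current)
        else:
            zones.append(current)
    return zones


areas = [("A", 40), ("B", 70), ("C", 120), ("D", 30), ("E", 20)]
zones = merge_to_threshold(areas, threshold=100)
for zone in zones:
    print([name for name, _ in zone], sum(p for _, p in zone))
```

Raising the threshold produces fewer, larger zones. This lowers disclosure risk but loses geographical detail, which is the trade-off the research investigates.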
Prof Li-Chun Zhang - Adjusting for linkage error to improve inference
Unless a unique identifier is available, record linkage of separate datasets will generate linkage errors. If the linked data are then treated as if they were truly observed, these errors can cause bias and loss of efficiency in the subsequent analysis. Adjusting for linkage errors is particularly challenging for anyone performing secondary analysis who has no access to the full set of linkage key variables, the separate datasets, or the details and tools of the actual linkage procedure.
Three different approaches have been investigated for secondary analysis. Firstly, research shows that typical analysis methods using maximum likelihood estimation with linked data may produce biased and inconsistent results. Secondly, using linear regression as an example, we developed an approach to reduce bias. This approach does not require actually linking the separate datasets, but it does require access to non-disclosive data about the precision of the linkage process. Currently, releasing such linkage comparison data is not standard practice for linkage datasets. Thirdly, we investigated the conditions under which valid analysis can be obtained from a subset of all the links that could otherwise have been made. This can be useful for analysis of large population datasets such as those arising from the Census data linkage project at the Office for National Statistics, where the required computation is infeasible if one attempts to use all the linkable and missed records.
We are developing practical, scalable adjustment approaches, which include case weighting methods similar to survey re-weighting for non-response and do not require detailed comparison data for all possible links.
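A small simulation can show why treating linked data as truly observed biases a regression. The sketch below is not the adjustment method developed in this project. It makes two strong simplifying assumptions: false links occur completely at random, and the false link rate is known to the analyst. All names and parameter values are invented for illustration.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope for a single explanatory variable."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


random.seed(1)
n, beta, false_link_rate = 20000, 2.0, 0.2

# x lives in one dataset and y in another; the true relationship is y = beta * x + noise.
x = [random.gauss(0, 1) for _ in range(n)]
y_true = [beta * xi + random.gauss(0, 1) for xi in x]

# Simulate false links: a random 20% of records receive another record's outcome.
y_linked = list(y_true)
wrong = random.sample(range(n), int(false_link_rate * n))
donors = wrong[:]
random.shuffle(donors)
for i, j in zip(wrong, donors):
    y_linked[i] = y_true[j]

naive = ols_slope(x, y_linked)             # treats every link as correct
adjusted = naive / (1 - false_link_rate)   # rescale using the known error rate

print(round(naive, 2), round(adjusted, 2))
```

Under these assumptions the naive slope is pulled towards zero by roughly the proportion of false links, and dividing by the correct link rate approximately recovers the true value. Real linkage errors are rarely this well behaved, which is why information about the precision of the linkage process matters for secondary analysis.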
- Automated zone design: Computational tools for designing boundaries on a map to meet predefined criteria, for example, all areas must contain more than a specified number of people.
- Bias: Effect that deprives a statistical result of representativeness by systematically distorting it, as distinct from a random error which may distort on any one occasion but balances out on the average.
- Dataset: A Quantitative Dataset is a collection of structured information on data subjects that can be measured numerically and analysed statistically, obtained using quantitative research methods such as surveys or questionnaires. A Qualitative Dataset is a collection of unstructured information on data subjects that typically cannot be measured numerically, obtained using qualitative research methods such as interview transcripts, audio/video/digital recordings and photographic material.
- Geoprivacy: The keeping private of someone's geographic location, especially the restriction of geographical data maintained by personal electronic equipment.
- Identifier: Variables (or sets of variables) in Datasets, such as name, address, full date of birth, postcode information, telephone number and tax reference number, which can directly identify subjects.
- Linear regression: An approach for modeling the relationship between a dependent variable and one or more explanatory variables.
- Linkage error: The missed links between records that pertain to the same individual or false links between records that do not.
- Linked data: A Dataset that is created through Data Linkage.
- Maximum likelihood estimation: A method of estimating the parameters of a statistical model by choosing the values that make the observed data most probable (e.g. estimating how much height contributes to weight).
- Microdata: Data on the characteristics of units of a population, such as individuals, households, or establishments, collected by a census, survey or experiment.
- Inference: A conclusion reached on the basis of evidence and reasoning.
- Paradata: Additional data that can be captured during the process of producing a survey statistic.
- Regression: A measure of the relation between the mean value of one variable (e.g. output) and corresponding values of other variables (e.g. time and cost).
- Secondary analysis: Re-analysis of data already collected in a previous study, by a different researcher normally wishing to address a new research question.
- Spatial aggregation: The process of grouping spatial data at a level of detail or resolution that is larger than the level at which the data were collected.
- Statistical Disclosure Control: Methodology used in the design of statistical products in order to protect the identity of Data Subjects.
- Synthetic Data: Microdata records created to improve data utility while preventing disclosure of confidential respondent information. Synthetic data is created by statistically modeling original data and then using those models to generate new data values that reproduce the original data's statistical properties.
- Trusted third party (TTP): A Trusted Third Party (TTP) performs the matching of Direct Identifiers from different data sources, or the matching of Direct Identifiers of a single data source against an existing population spine.
- Weighting methods: Methods for adjusting data so that they are more representative of the population.
For a full list of terminology used throughout the network, please visit the ADRN Glossary webpage.
For more examples of research performed at ADRC-England, please visit our Projects webpage.