
Secondary Research: Statistics and Data

Data vs. Statistics


 

Many people use the terms data and statistics interchangeably.  Strictly speaking, however, "data is the raw information from which statistics are created.  Put in the reverse, statistics provide an interpretation and summary of data" (MSU Libraries. What is the difference between Data and Statistics?).
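
To make the distinction concrete, here is a minimal Python sketch. The records in `raw_data` are hypothetical values made up for illustration, and `summarize` is an illustrative helper, not part of any particular library. The list of individual records plays the role of the data; the handful of numbers returned by `summarize` are the statistics.

```python
import statistics

# Hypothetical raw data: one record per survey respondent (illustrative values only).
raw_data = [
    {"respondent": 1, "age": 19, "hours_studied": 12},
    {"respondent": 2, "age": 22, "hours_studied": 8},
    {"respondent": 3, "age": 20, "hours_studied": 15},
    {"respondent": 4, "age": 35, "hours_studied": 5},
    {"respondent": 5, "age": 21, "hours_studied": 10},
]


def summarize(records, field):
    """Reduce raw records to a few summary statistics for one field."""
    values = [record[field] for record in records]
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


# The summary is an interpretation of the data; the individual records are the data.
summary = summarize(raw_data, "hours_studied")
print(summary)
```

Note that the summary hides detail that the raw records still contain, such as which respondent studied the least.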

Statistics: A brief overview

Statistics

 

Statistics are an excellent source of "facts" for researchers.  They are convenient to use as someone else has done the work of taking raw research data and then cleaning, interpreting and presenting them in a digestible format, such as a chart, table, or graph.

 

   

  

 

 

 

 

 

 

 

 

 Image credits: OpenClipArt-Vectors from Pixabay; Statistics Canada, Monthly retail commodity sales.

 

 

Key Considerations & Quality Indicators:
 

 

  • Double-check that the statistics are relevant, i.e., that they actually cover the population, geography, timeframe(s) and/or topic(s) of your research project.

     
  • Assess the credibility of the individual or organization that generated the statistics: do they have expertise and/or a proven track record for producing high-quality statistics, e.g., a national statistical agency, an expert researcher in the field, or a citizen-science group / community organization?
     
    • When assessing the credibility of community-based or citizen-science projects, apply the same quality criteria you would use for more traditional statistics-producing organizations, including an explanation of the research methods employed, the approach to sampling (if relevant), and open acknowledgement of any limitations or gaps in the data collection.
       
    • Quick assessment: it's a good sign if the organization is listed in the Government of Canada's Citizen Science Portal or routinely partners with government agencies, reputable non-profit/charitable organizations, etc.
       
    • For example, in the City of Vancouver an annual count of unhoused people has been conducted since 2002 by volunteers under the direction and with the support of a variety of community partners and organizations, including foundations, housing societies, non-profits and the City itself. The counts include a description of methods and a discussion of limits/gaps in the data collection process.

       
  • Whenever it is provided, check the methodology used to generate the statistics, paying particular attention to:

     
    • the duration of the research - e.g., were the data collected over a "long enough" timeline for you to be able to rely on them?

       
    • wording of any questionnaires/surveys - e.g., were any of the questions ambiguous, confusing or leading respondents into expressing a particular viewpoint?

       
    • the sample size and makeup - e.g., did the research team use an appropriate sampling method, or could anyone participate (as in an anonymous online survey)?  If no sampling was employed, the results could be subject to a host of problems, including:
       
      • survey fraud, e.g., were measures put in place to prevent multiple submissions from the same person or bot? 
         
      • demographic over- or under-representation, e.g., an online-only survey will not capture responses from people with no or poor internet access

      • regional over- or under-representation, e.g., do the respondents come from diverse and/or relevant geographic regions?  Is there any way to know?

         
    • If sampling was employed - was the sample size sufficiently large to extrapolate to a larger population?  Were the demographic groups you are interested in included - e.g., age groups, genders, ethnicities, languages spoken and/or anything else relevant to your research needs?
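
As a rough guide to what "sufficiently large" can mean, the sketch below computes the approximate margin of error for a percentage reported from a survey. It assumes a simple random sample drawn from a large population and a confidence level of about 95% (`z=1.96`); the function name `margin_of_error` and its defaults are illustrative choices, not taken from any particular statistical agency's method. It does not apply to surveys where anyone could participate, because no formula can correct for an unknown selection process.

```python
import math


def margin_of_error(sample_size, proportion=0.5, z=1.96):
    """Approximate margin of error for a proportion from a simple random sample.

    Assumes a simple random sample from a large population. z=1.96 corresponds
    to roughly 95% confidence, and proportion=0.5 gives the widest
    (most conservative) margin.
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


# Larger samples give narrower margins, but with diminishing returns.
for n in (100, 400, 1000):
    print(f"n={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

Under these assumptions the margins are roughly 9.8, 4.9 and 3.1 percentage points. The same calculation applies to subgroups: if only 100 of 1,000 respondents belong to the demographic group you are interested in, results for that group carry the wider margin for n=100.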
       

 

Ask Yourself:

 

  • Are any demographic groups under-represented - or not represented at all in the data? 

     
    • If yes - does the data source openly discuss the demographic gaps? 
       
    • If not, are the data actually useful for your purposes?

 

 

Consider this example from Canada:  Starting in 2021, respondents to Statistics Canada surveys - including the Census of Canada - could for the first time identify their sex at birth AND their current gender identity.

 

Prior to this, respondents could only report their sex as male or female.  As a result, important historical information about people who are non-binary or transgender will never be available - and the data that were collected about sex before 2021 paint an imprecise picture of the population.

 

To learn about the gender-identity options now available to respondents, see Statistics Canada: Age, Sex at Birth and Gender Reference Guide: Census of Population 2021 [PDF].


 

Also consider:

 

  • Why might certain groups have been omitted? 
     
  • How might their omission skew the data?
     
  • Could an incomplete sample result in a biased or misleading "picture" of that group?
     
  • Who might benefit from presenting an incomplete set of "facts" to the world?
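
One way to make the questions about under-representation concrete is to compare each group's share of the respondents with its share of the wider population. The sketch below is a minimal illustration: the counts and population shares are hypothetical, and `representation_gaps` and its 5-percentage-point `tolerance` are illustrative choices rather than an established standard.

```python
def representation_gaps(sample_counts, population_shares, tolerance=0.05):
    """Flag groups whose share of the sample differs from a benchmark share.

    Returns {group: (sample_share, population_share)} for every group that is
    missing from the sample or whose share is off by more than `tolerance`
    (absolute difference).
    """
    total = sum(sample_counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    gaps = {}
    for group, pop_share in population_shares.items():
        sample_share = sample_counts.get(group, 0) / total
        if abs(sample_share - pop_share) > tolerance:
            gaps[group] = (round(sample_share, 3), pop_share)
    return gaps


# Hypothetical numbers, for illustration only.
sample_counts = {"18-34": 120, "35-54": 64, "55+": 16}
population_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

gaps = representation_gaps(sample_counts, population_shares)
for group, (in_sample, in_population) in gaps.items():
    print(f"{group}: {in_sample:.0%} of sample vs {in_population:.0%} of population")
```

In this made-up example the youngest group is over-represented and the oldest group is under-represented, so an unweighted average would lean toward the views of younger respondents.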



 

Final Analysis: If there are gaps in demographic representation and/or you cannot assess the validity of the methodologies used, how can you trust or rely on the provided statistics?

 

Data: At a glance

Data

 

Sometimes the statistics you need don't exist.  In these cases you may need to either generate your own research data or find someone else's raw datasets and run your own analyses.

 

Image of a raw dataset (photo: Mika Baumeister on Unsplash)

 

 

Some generalities:

 

 

  • "Raw" research data are the original recorded results of a research study - before the data have been cleaned, interpreted and/or made into a presentation format, such as a chart, table or graph.

     
  • "Research data" are frequently (but not always) made available in a machine-readable format - i.e., a format that can be read and processed by a computer.  Example formats include CSV, XML and JSON.

     
    • Depending on the discipline and/or age of the data, you may find that the research data you need are not available in a machine-readable format, e.g., they will need to be extracted from handwritten documents, field or lab books, non-digitized print books, photographs etc.

       
  • In order to analyze and interpret machine-readable data, they generally need to be opened in a spreadsheet application such as Excel or Google Sheets, or in specialized statistical software such as SPSS, R, Stata or SAS.
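
General-purpose programming languages can also read these formats. The sketch below uses only Python's standard `csv` and `json` modules; the two tiny datasets are hypothetical and are embedded as strings so the example runs without any files.

```python
import csv
import io
import json

# Hypothetical data, embedded as strings so the sketch runs without any files.
csv_text = """region,year,population
North,2021,1200
South,2021,3400
"""
json_text = (
    '[{"region": "North", "year": 2021, "population": 1200},'
    ' {"region": "South", "year": 2021, "population": 3400}]'
)

# csv.DictReader yields one dict per row; every value arrives as a string,
# so numbers must be converted before doing arithmetic.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
csv_total = sum(int(row["population"]) for row in csv_rows)

# json.loads keeps the number types that the JSON text declares.
json_rows = json.loads(json_text)
json_total = sum(row["population"] for row in json_rows)

print(csv_total, json_total)
```

The difference in how the two formats handle numbers is one reason to check a dataset's documentation before analysing it.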

 

 

Some keys to success when using datasets:
 

 

  • As with statistics, it's important to ensure that the datasets you use are relevant to your research question for the population(s), geography, timeframe(s) and/or topic(s) etc. in question, and that appropriate research methodologies/sampling were employed. 

     
  • You'll also want to assess the credibility of the data source:

     
    • Is it an organization with a proven track record for providing high-quality data, such as a governmental statistical agency or intergovernmental organization (IGO), e.g., StatsCan, BC Stats, the World Bank or the WHO?

    • Were the data generated by a researcher with expertise in the discipline working at a post-secondary institution / research institute?

       
    • Ask yourself:  If it's unclear or unknown who produced the data or how they were produced, how can you rely on them?

       
  • Given that datasets are unprocessed, the accompanying metadata (information about the data) will also be crucially important in determining the quality, relevance and potential usefulness of the data: e.g.,

     
    • Is there a project abstract or overview you can consult?

       
    • Do the data come with a copy of the research transcripts/questionnaires/surveys that you can review? 

       
    • Is there a codebook or data dictionary explaining variable labels/codes used? 

       
    • Is there a ReadMe file with any other information needed to make complete sense of the data?

       
    • Have the data been made available in an open format such as txt, csv and/or an R data file?  If not, do you have the relevant statistical software and the knowledge to use it?
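
To illustrate why a codebook or data dictionary matters, the sketch below decodes one row of coded survey responses. The variable names, codes and labels are all hypothetical, and `decode_row` is an illustrative helper; a real codebook will have its own layout and may need different handling.

```python
# Hypothetical codebook: maps each variable to a label and its coded values.
codebook = {
    "q1": {
        "label": "Employment status",
        "values": {"1": "Employed", "2": "Unemployed", "9": "No response"},
    },
    "q2": {
        "label": "Region",
        "values": {"1": "Urban", "2": "Rural"},
    },
}


def decode_row(row, codebook):
    """Translate coded values to readable labels.

    Variables and codes that the codebook does not describe are left unchanged.
    """
    decoded = {}
    for variable, code in row.items():
        entry = codebook.get(variable)
        if entry is None:
            decoded[variable] = code
        else:
            decoded[entry["label"]] = entry["values"].get(code, code)
    return decoded


# Without the codebook, "q1": "2" is meaningless to a new reader.
raw_row = {"id": "1001", "q1": "2", "q2": "1"}
decoded_row = decode_row(raw_row, codebook)
print(decoded_row)
```

If a dataset arrives without this kind of documentation, values such as "2" or "9" cannot be interpreted reliably, which is a reason to question whether the dataset is usable at all.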