A gene-based risk score for lung cancer susceptibility in smokers and ex-smokers
R P Young1, R J Hopkins1, B A Hay1, M J Epton2, G D Mills3, P N Black1, H D Gardner1, R Sullivan4, G D Gamble1

1 Department of Medicine, Auckland Hospital, Auckland, New Zealand
2 Department of Medicine, University of Otago, Christchurch, New Zealand
3 Department of Medicine, Waikato Hospital, Hamilton, New Zealand
4 Department of Oncology, Auckland Hospital, Auckland, New Zealand

Correspondence to Dr R Young, Department of Medicine, Auckland Hospital, Private Bag 92019, Auckland, New Zealand; roberty{at}


Background: Epidemiological and family studies suggest that lung cancer results from the combined effects of age, smoking and genetic factors. Chronic obstructive pulmonary disease (COPD) is also an independent risk factor for lung cancer and coexists in 40–60% of lung cancer cases.

Methods: In a two-stage case–control association study, genetic markers associated with either susceptibility or protection against lung cancer were identified. In a test cohort of 439 Caucasian smokers or ex-smokers, consisting of healthy smokers and lung cancer cases, 157 candidate single nucleotide polymorphisms (SNPs) were screened. From this, 30 SNPs were identified, the genotypes (codominant or recessive model) of which were associated with either the healthy smokers (protective) or lung cancer (susceptibility) phenotype. After genotyping of this 30-SNP panel in a second validation cohort of 491 subjects and using the same protective and susceptibility genotypes from our test cohort, a 20-SNP panel was selected on the basis of independent univariate analyses.

Results: Using multivariate logistic regression, including the 20 SNPs, it was also found that age, history of COPD, family history of lung cancer and gender were significantly and independently associated with lung cancer.

Conclusions: When numeric scores were assigned to both the SNP and demographic data, and sequentially combined by a simple algorithm in a risk model, the composite score was found to be linearly related to lung cancer risk with a bimodal distribution. Genetic data may therefore be combined with other risk variables from smokers or ex-smokers to identify individuals who are most susceptible to developing lung cancer.


Approximately 90% of people with lung cancer have a smoking history, yet only 10–15% of chronic smokers get lung cancer, suggesting that factors in addition to smoking exposure must be relevant.1 Epidemiological studies have identified age, smoking exposure, impaired lung function and family history as key risk factors for lung cancer.2 Genetic factors have been shown to play a modest role in determining susceptibility to lung cancer,3 4 most likely by conferring an inherent susceptibility (exaggerated or maladaptive response) to chronic inflammation from aero-pollutant exposure (most commonly smoking).5 6 As in many cancers, this inflammation provides the initial stimulus to tissue remodelling (eg, small airway disease and/or emphysema), DNA damage and impaired cell cycle control.6 7

Prospective studies have shown that ∼20% of smokers develop chronic obstructive pulmonary disease (COPD), defined by spirometry (forced expiratory volume in 1 s (FEV1)/forced vital capacity <70% and/or FEV1<80% of predicted), whereas the majority have preserved lung function at, or close to, predicted values.8 Both prospective and retrospective studies show that spirometric evidence of COPD is found in 40–60% of smokers diagnosed with lung cancer.9 10 In contrast with smokers who maintain normal lung function, those with COPD have a 2–6-fold greater risk of lung cancer, independent of age and pack-years.9 10 11 These studies suggest that smokers with COPD are at an inherently increased risk of lung cancer (susceptible phenotype), whereas most smokers with normal lung function (estimated to be 80%) are at least risk (resistant phenotype).9 10 11

Genetic predisposition to lung cancer is likely to be both polygenic and heterogeneous, conferred by a variable combination of relatively common polymorphisms with low penetrance and modest effect sizes.12 13 Moreover, it is likely that important smoking-gene interactions underlie lung cancer,14 as seen in other smoking-related cancers (eg, bladder and stomach). As genetic variants associated with both COPD and lung cancer have been identified, and include the recently reported chromosome 15q25 gene locus,15 16 we suggest it is important to measure lung function in all participants of case–control studies of lung cancer. For both clinical and biostatistical reasons, screening the exposed controls will increase the power of the study to identify relevant genetic variants compared with studies in which the control group is unscreened.17

It is well known that non-genetic risk factors such as age, history of COPD and smoking history are very important and can be combined to develop risk-based tools for determining lung cancer susceptibility.18 19 Recently, genotype data from previously implicated prostate cancer susceptibility single-nucleotide polymorphisms (SNPs) were combined with family history to derive risk estimates for prostate cancer.20 The objective of this study was to use a similar approach to analysing data from a case–control study and show how genetic variants, previously showing small effects on lung cancer risk, can be combined in an algorithm with other known risk factors to derive a gene-based susceptibility score for lung cancer.


Methods

Study population

This study was a two-stage case–control design conducted in three centres following the same recruitment protocol. Only people of Caucasian ancestry were recruited (all four grandparents of Caucasian descent). Lung cancer cases were identified through hospital clinics between 2004 and 2007 using the following criteria: >40 years of age, minimum 15 pack-years of smoking, diagnosis confirmed on histological or cytological grounds, and limited to the following four histological subtypes: adenocarcinoma, squamous cell cancer, small cell cancer and non-small cell cancer (generally large cell or bronchoalveolar subtypes). The median time interval between diagnosis and recruitment was 3 months. Patients with lung cancer underwent blood sampling for DNA extraction, an investigator-administered questionnaire and spirometry with a portable spirometer (Easy-One; ndd Medizintechnik AG, Zurich, Switzerland) following American Thoracic Society (ATS) criteria. For patients with lung cancer who had already undergone surgery, results of preoperative lung function tests performed by the hospital laboratory (using ATS criteria) were sourced from the medical records.

Control subjects were recruited from the same communities as the cases using the following criteria: Caucasian ancestry (as defined above), age 45–80 years and past or current smoking history of a minimum of 15 pack-years. Controls were volunteers who met the above criteria and were identified through either a community mail-out or while attending a community-based club for older people located in the suburbs that were the referral base for the hospital clinics from which the lung cancer cases were recruited. All smoking controls underwent blood sampling, spirometry and the same investigator-administered questionnaire given to the cases. Controls with spirometry consistent with COPD were excluded (30% of those who volunteered). Informed written consent was obtained from both lung cancer patients and community-acquired healthy “smoker” controls. The study was approved by the local ethics committee. The questionnaire (modified from the ATS respiratory questionnaire) included data on demographic variables such as age, gender, medical history, family history of lung disease, active and passive tobacco exposure, and occupational aero-pollutant exposures.

Selection and genotyping of SNPs

After extensive review of both the lung cancer and COPD literature, polymorphisms with the following attributes were selected for initial screening in the test cohort: (a) SNPs in genes encoding proteins in pathways of cell-cycle control, oxidant response, apoptosis and airways inflammation; (b) SNPs known either to have functional effects on in vitro assays or to be either non-synonymous or in regulatory regions. In a test cohort of 439 smokers (run 1 recruited during 2003–2005: 239 lung cancer cases and 200 control smokers), 157 candidate SNPs were screened (available on request), and those where the difference in genotype frequencies exceeded a 20% magnitude difference and p value <0.20 were identified as part of our model-forming approach.21 Where the call rate (percentage of samples successfully genotyped) fell below 95% for any cohort, the reading and/or genotyping of failed samples was repeated for that SNP; after retesting, SNPs with call rates <95% were not included in further analysis. SNPs were assigned as “protective” when the homozygote and/or heterozygote genotype for either allele was found more often in control smokers than lung cancer cases (in a recessive or codominant model). SNPs were assigned as “susceptible” when the homozygote and/or heterozygote genotype was found significantly more often in lung cancer cases than control smokers.
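The call-rate rule described above can be sketched as follows. This is a minimal illustration only: the function names and the use of `None` to mark a failed genotype call are assumptions made for the example, not details taken from the study.

```python
from typing import List, Optional


def call_rate(calls: List[Optional[str]]) -> float:
    """Fraction of samples with a successful genotype call.

    A failed call is represented by None (an assumption of this sketch).
    """
    if not calls:
        raise ValueError("no samples supplied")
    return sum(call is not None for call in calls) / len(calls)


def passes_call_rate(calls: List[Optional[str]], minimum: float = 0.95) -> bool:
    """SNPs with a call rate below the minimum are dropped from analysis."""
    return call_rate(calls) >= minimum
```

For example, a SNP with 19 successful calls out of 20 samples sits exactly at the 95% threshold and is retained, whereas 18 out of 20 is excluded.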


Genomic DNA was extracted from whole blood samples using standard salt-based methods. Purified genomic DNA was aliquoted (10 ng/μl concentration) into 96-well plates and genotyped on a Sequenom system (Sequenom Autoflex Mass Spectrometer and Samsung 24 pin nanodispenser) by the Australian Genome Research Facility ( using sequences designed in-house (available on request) and recommended amplification and separation methods (iPLEX;

Of the 157 candidate SNPs screened in our discovery cohort, 30 met the above criteria. These SNPs were genotyped in a second cohort of 491 smokers using identical recruitment methods (run 2 recruited during 2006–2007: 207 lung cancer cases and 284 control smokers). For all SNP assays, again a minimum call rate of 95% was required. This second validation cohort of lung cancer cases and control smokers was identical with the first groups with respect to demographic factors and lung cancer characteristics. On the basis of independent univariate analyses in run 1 and run 2 (consistency, direction and significance of association), a final panel of the 20 most discriminatory SNPs was selected (12 susceptibility SNPs and eight protective SNPs from the test panel of 30).


The assignment of a protective or susceptible SNP genotype was made from the test cohort data (run 1) and was strictly applied to the data from run 2. On the basis of an algorithm derived from our work on the genetics of COPD (unpublished data), a scoring system was applied to the genotypes for each of the susceptibility and protective SNPs. For each subject, a numerical value of −1 was assigned for each of the protective genotypes present among the protective SNPs and +1 for each of the susceptible genotypes present. Where an individual did not have either the protective or susceptibility genotype for that SNP, the score was 0 (ie, did not contribute to the genetic score). This approach is consistent with a recently published study in prostate cancer.20 Weighting the presence of specific susceptible or protective genotypes according to their individual odds ratios (ORs; from univariate regression) did not significantly improve the discriminatory performance of the raw SNP score (unpublished data).
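The unweighted scoring rule can be illustrated with a short sketch. The SNP identifiers and scoring genotypes below are hypothetical placeholders, not the study's 20-SNP panel, and the dictionary layout is an assumption made for the example.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical panel: SNP id -> (direction, genotypes that score).
# These entries are placeholders and do not come from the study.
PANEL: Dict[str, Tuple[str, Set[str]]] = {
    "snp_A": ("susceptible", {"GG"}),
    "snp_B": ("protective", {"CC", "CT"}),
    "snp_C": ("susceptible", {"AA"}),
}


def snp_score(genotypes: Dict[str, Optional[str]],
              panel: Dict[str, Tuple[str, Set[str]]] = PANEL) -> int:
    """Sum +1 per susceptible genotype present and -1 per protective one.

    A SNP whose genotype is neither (or is missing) contributes 0.
    """
    score = 0
    for snp, (direction, scoring_genotypes) in panel.items():
        called = genotypes.get(snp)
        if called in scoring_genotypes:
            score += 1 if direction == "susceptible" else -1
    return score
```

A subject carrying one susceptible and one protective genotype therefore scores 0, the same as a subject carrying neither.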

Lung cancer susceptibility score

The approach of deriving an overall “susceptibility score” by combining independent risk factors is comparable to existing risk-scoring systems such as the Prostate Cancer Test, Framingham Score for coronary artery disease risk and the Gail Score for breast cancer.18 19 20 22 23 By using multivariate logistic and stepwise regression analysis, the 20-SNP panel was examined in combination with relevant non-genetic factors. This analysis of run 1 data identified age, family history of lung cancer and previous diagnosis of COPD as significant contributors to lung cancer susceptibility. In addition, and consistent with other case–control studies, female gender in our study was also associated with a small increased risk of lung cancer (p<0.01). However, we did not include gender in the final risk model, as its importance in prospective studies has been lacking.24 On the basis of a multivariate logistic regression analysis in run 1 (see results for combined analysis below), a score was assigned according to age, history of COPD and family history. These variables have been identified in other risk assessment tools for lung cancer susceptibility18 19 and improved the discriminatory power of the SNP score data alone. As smoking exposure (pack-years) was a recruitment criterion for this study and comparable between cases and controls, it was not surprising to find that it made little contribution to this scoring system derived from our cohorts. The lung cancer susceptibility score was plotted with (a) the frequency of lung cancer and (b) the floating absolute risk (equivalent to OR) across the combined smoker/ex-smoker cohort.25 26

Statistical analysis

Patient characteristics in the cases and controls were compared by unpaired t tests for continuous variables and χ2 test for discrete variables. Genotype and allele frequencies were checked for each SNP by Hardy–Weinberg equilibrium (tests that genotype frequencies were as expected from the allele frequencies). Population admixture was excluded by the population structure analysis on genotyping data from 40 unrelated SNPs.27 Distortions in the genotype frequencies were identified between cases and controls using 2 × 3 contingency tables. Genotype data (20-SNP panel) and the most relevant non-genetic variables were combined in a stepwise fashion to assess their combined effects on discriminating low and high risk (by OR and receiver operating characteristic (ROC) curve) by score quintile. The frequency distribution of the optimised lung cancer susceptibility score was compared across the cases and controls. Its clinical utility was assessed using ROC analysis, which assesses how well the model predicts risk across the score (ie, clinical performance of the score with respect to sensitivity, specificity and false positive rate). To assess the stability of the optimised risk model, a sensitivity analysis was performed in which age, gender and smoking dose were more stringently matched between cases and controls. The effect on sensitivity and performance of the lung cancer susceptibility score to the addition of non-genetic variables was also assessed by comparing ORs and ROC analyses.
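As an illustration of the Hardy–Weinberg check, the sketch below compares observed genotype counts with those expected from the allele frequencies. The text does not state which test was used; a one-degree-of-freedom chi-square goodness-of-fit test is assumed here, and the function name is hypothetical.

```python
import math
from typing import Tuple


def hwe_chi_square(n_hom_major: int, n_het: int, n_hom_minor: int) -> Tuple[float, float]:
    """Chi-square test (1 df) of observed genotype counts against HWE.

    Returns (chi-square statistic, p value).
    """
    n = n_hom_major + n_het + n_hom_minor
    p = (2 * n_hom_major + n_het) / (2 * n)  # major allele frequency
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        # Monomorphic SNP: nothing to test.
        return 0.0, 1.0
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_hom_major, n_het, n_hom_minor)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

Counts of 25, 50 and 25 match the expectation for an allele frequency of 0.5 exactly, so the statistic is 0; a sample with no heterozygotes at the same allele frequency departs strongly from equilibrium.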


Results

Demographic variables and genotyping

Table 1 summarises the characteristics of the lung cancer cases and healthy control smokers. The 446 lung cancer cases from run 1 (n = 239) and run 2 (n = 207) were comparable (with respect to demographic characteristics, histology and staging) and similar to a large published series.28 Given the small difference in age, the 484 healthy control smokers (200 in run 1, 284 in run 2) were comparably exposed with respect to smoking and other aero-pollutants. The lower frequency of current smokers in the lung cancer group probably reflects coexisting COPD (higher quit rates), and longer duration of smoking in lung cancer cases reflects an older age. In a gene-by-smoking interaction model such as this, differences in smoking exposure are more likely to obscure effects (bias to the null) than generate effects. Consistent with the findings of others, the lung cancer cohort had higher rates of a family history of lung cancer (19% vs 9%) and history of COPD (29% vs 5%). The latter (5%) probably reflects a clinical diagnosis of COPD, based on symptoms but not spirometry, in smokers with asthma and/or chronic bronchitis. As expected, lung function was worse in the lung cancer cohort than the healthy smoker controls. Testing lung function in the lung cancer cases (performed within 3 months of diagnosis, in the absence of pleural effusions and before surgery) allows us to test for confounding by COPD (see below).

Table 1

Summary of characteristics for the lung cancer and resistant smokers

The observed genotypes for the 20 SNPs in this study were in Hardy–Weinberg equilibrium (table 2), thereby excluding significant genotyping error. The genotype frequencies for the controls were comparable to those from the International HapMap Project ( The development of the lung cancer susceptibility score is described in the Methods section, and a summary of the 20-SNP panel univariate analysis is presented in table 3. Although six of the top 20 SNPs do not reach traditional levels of significance, they have been included in the panel because (a) in previous studies they have been shown to have functional effects, (b) they have been associated with COPD and/or lung cancer (see Discussion), (c) in combination they make a contribution to the performance of the susceptibility score, and (d) their inclusion recognises the likely genetic heterogeneity that exists in lung cancer case–control studies. A SAS macro was used to estimate the false discovery rate (FDR) (Osborne JA, North Carolina State University; and produce a q statistic as the smallest p value that would be said to be statistically significant while preserving an overall 5% significance level.
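To make the idea of an FDR-adjusted q statistic concrete, the sketch below applies the Benjamini–Hochberg step-up procedure to a list of p values. This is a stand-in chosen for illustration: the text does not say which FDR method the SAS macro implements, and the function name is hypothetical.

```python
from typing import List


def bh_qvalues(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p values (q values), in input order."""
    m = len(pvalues)
    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues
```

Under this procedure, a SNP would be called significant at a 5% FDR when its q value is at most 0.05.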

Table 2

Expected genotype frequencies and Hardy–Weinberg equilibrium (HWE)

Table 3

Genotypes and results of univariate analysis

Risk model development

In a multivariate logistic regression analysis that included the SNPs (individually), age (>60 years), family history of lung cancer (first-degree relative), gender and history of COPD, the OR for the susceptibility and protective SNPs was 1.1–3.2 and 0.20–0.80, respectively (the combined SNP score is independently related to lung cancer, p<0.001). The OR for age >60 years, family history of lung cancer and history of COPD was 3.5 (95% CI 2.5 to 4.9, p<0.001), 2.5 (95% CI 1.6 to 4.0, p<0.001) and 7.5 (95% CI 4.5 to 12.4, p<0.001), respectively (total area under the curve (AUC) = 0.80 where SNPs were included individually with adjustment for all variables). History of COPD in this model confers a high risk, in part due to differences in lung function derived from the study design. On the basis of these findings, and those from previously published studies,3 4 9 10 11 we derived an optimised score by assigning scores to non-genetic variables as follows; +4 for those aged >60 years old, +3 for those with a family history of lung cancer and +4 for those with a diagnosis of COPD (ie, age and diagnosis of COPD equally weighted).

Lung cancer score  =  (number of susceptible genotypes) – (number of protective genotypes) + 3 (for positive family history for lung cancer) + 4 (for past diagnosis of COPD) + 4 (for age>60 years old).
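The formula above translates directly into a small function. The weights (+3, +4, +4) and the age threshold (>60 years) are taken from the text; the function and parameter names are assumptions made for this sketch.

```python
def lung_cancer_score(n_susceptible: int,
                      n_protective: int,
                      family_history: bool,
                      copd_history: bool,
                      age_years: float) -> int:
    """Composite susceptibility score: SNP score plus non-genetic weights."""
    score = n_susceptible - n_protective
    if family_history:      # family history of lung cancer
        score += 3
    if copd_history:        # past diagnosis of COPD
        score += 4
    if age_years > 60:      # weighting applies only to those older than 60
        score += 4
    return score
```

For instance, a 65-year-old with five susceptible genotypes, two protective genotypes, a positive family history and no COPD scores (5 − 2) + 3 + 4 = 10.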

Such an approach is consistent with existing risk scores18 19 and places the SNP data in an appropriate clinical context.22 23 Gender was not included in the finalised risk model for the reasons described above (and its inclusion did not alter the AUC).

Model performance

In the optimised model, the lung cancer susceptibility score was compared with frequency of lung cancer, and a linear relationship was found across the lung cancer susceptibility scores ⩽1 to 8+, with lung cancer frequency spanning 17–86% (fig 1). The magnitude of this effect was also sequentially examined using the floating absolute risk25 26 plotted on a log scale (equivalent to an OR), which references the lowest frequency group as OR = 1 (referent group) and compares the lung cancer score with the referent group. The OR for SNPs alone (fig 2a), SNPs, family history and age (fig 2b) and SNPs, family history, age and COPD (fig 2c) spanned from 1 to 10 (p<0.001), 1 to 19 (p<0.001) and 1 to 28 (p<0.001), respectively, across the lung cancer scores when subjects were grouped approximately as heptiles or quintiles. Subgrouping by age band or histology did not alter this linear relationship between score and OR (data not shown). The lung cancer susceptibility score for lung cancer cases and controls shows a bimodal distribution on frequency distribution, indicating potential utility as a screening test.29
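The comparison of each score group with the referent group can be sketched as a plain odds ratio calculation. This is a simplification: it does not reproduce the floating absolute risk method (which also assigns a variance to the referent group), and the group labels and counts used in the test are invented for illustration.

```python
from typing import Dict, Tuple


def odds_ratios_vs_referent(groups: Dict[str, Tuple[int, int]],
                            referent: str) -> Dict[str, float]:
    """Odds ratio of each score group relative to the referent group.

    groups maps a label to (number of cases, number of controls).
    The referent group has OR = 1 by construction.
    """
    ref_cases, ref_controls = groups[referent]
    ref_odds = ref_cases / ref_controls
    return {
        label: (cases / controls) / ref_odds
        for label, (cases, controls) in groups.items()
    }
```

With hypothetical counts of 10 cases and 40 controls in the lowest score group and 40 cases and 10 controls in the highest, the highest group has an OR of 16 relative to the lowest.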

Figure 1

Frequency of lung cancer according to the lung cancer susceptibility (risk) score modelled with single-nucleotide polymorphisms, age, family history and chronic obstructive pulmonary disease.

Figure 2

Odds ratio of lung cancer according to the lung cancer susceptibility (risk) score using (a) single-nucleotide polymorphisms (SNPs) only, (b) SNPs, family history (FHx) and age, and (c) SNPs, family history, age and history of chronic obstructive pulmonary disease (COPD).

Analysis of model sensitivity

To correct for the small differences in age, smoking status, COPD and gender mix between cases and controls, a subgroup (sensitivity) analysis was performed (a) limited to those >60 years of age (age weighting equally applied to all), (b) removing COPD from the model, and (c) where mean age, pack-years and gender were closely matched between cases and controls (n = 450: 72 vs 69 years, 45 vs 43 pack-years and 70% vs 70% male respectively). A linear increase in OR across quintiles of the lung cancer susceptibility score (range 1–58, p<0.01) was still evident, with confidence intervals consistent (ie, overlapping) with those derived with the full dataset (fig 2).

ROC analysis

In a ROC analysis (n = 930) of the optimised model, we found that the AUC or c statistic for run 1, run 2 and run 1+2 was 0.82, 0.75 and 0.79, respectively. The AUC in the total cohort for the 20-SNP panel, age, family history of lung cancer and history of COPD on their own were 0.68, 0.70, 0.55 and 0.62, respectively. When just “genetic factors” are used in the risk model (SNPs + family history of lung cancer), as seen in the Prostate Cancer Study,20 the ORs span 1–10 across septiles and the AUC = 0.70 (with no contribution from age and COPD). On stepwise analysis, age and the SNP panel make the greatest contribution to the AUC (SNPs = 0.68, age + SNPs = 0.76, age + SNPs + family history = 0.77), with history of COPD making a small additional contribution (total AUC = 0.79) (fig 3). Using an FDR analysis, 12 SNPs were identified as being significantly associated with lung cancer and, when combined with age and family history, derived an AUC = 0.75. When gender was included in the model, the AUC was not improved.
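The AUC (c statistic) reported here has a simple interpretation: it is the probability that a randomly chosen case has a higher score than a randomly chosen control, with ties counted as one half. A minimal sketch of that calculation follows; the function name is hypothetical and the scores in the test are invented, not study data.

```python
from typing import Sequence


def auc(case_scores: Sequence[float], control_scores: Sequence[float]) -> float:
    """Area under the ROC curve by pairwise comparison of cases and controls."""
    if not case_scores or not control_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5  # ties count as half a concordant pair
    return wins / (len(case_scores) * len(control_scores))
```

An AUC of 0.5 corresponds to no discrimination between cases and controls, and 1.0 to perfect separation.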

Figure 3

Distribution of the lung cancer susceptibility (risk) score between cases and controls. (a) Single-nucleotide polymorphisms (SNPs) only, (b) SNPs, family history and age, and (c) SNPs, family history, age and history of chronic obstructive pulmonary disease. Ca, cancer.

Inclusion of COPD

As smokers with normal lung function were selected as controls (resistant or lowest risk phenotype), and history of COPD was included in the model, it is necessary to examine the effect of including COPD in the model. In the ROC analysis, history of COPD alone was a modest discriminator (AUC = 0.62) and added little to the other variables in the combined model (increased AUC from 0.77 to 0.79). When history of COPD was removed from the model, the ORs span 1–19 across quintiles (p<0.001) and performance characteristics are minimally affected (AUC = 0.77). The model was also tested in young smokers (⩽60 years old), in whom COPD prevalence was only 3% and the ORs spanned 1–16 (p<0.001). Most importantly, the model was assessed by comparing the smoking controls and lung cancer cases subgrouped according to lung function (fig 4a,b). This shows that (a) the distribution of the susceptibility score was comparable between lung cancer cases divided by those with high or low lung function (fig 4a) and (b) when people with COPD (based on spirometry) are removed from the analysis (leaving smoking controls compared with lung cancer cases with normal lung function), the bimodal distribution is not affected (fig 4b). We conclude that the risk model is not significantly affected after adjustment for differences in COPD prevalence between cases and controls.

Figure 4

(a) Frequency distribution of the lung cancer score among controls and lung cancer cases divided according to low and normal lung function. (b) Frequency distribution of lung cancer score among controls and lung cancer cases with normal lung function. Ca, cancer; FEV1, forced expiratory volume in 1s.


Discussion

This study has used a two-stage case–control candidate gene approach and identified a panel of protective and susceptibility SNPs that individually confer only small effects (OR ranging from 0.3 to 2.6). This is very much in keeping with the experience from case–control association studies to date.30–62 Consistent with existing risk models, relevant factors were combined using an algorithm (in this study including SNP data) to derive a susceptibility score on a simple linear scale. This study design, and the algorithmic approach that underlies our lung cancer susceptibility score, takes into account important epidemiological observations relevant to genetic predisposition to lung cancer. Firstly, although smoking exposure is, for the majority, a prerequisite to developing lung cancer, increasing age, smoking dose and poor lung function have important independent effects on lung cancer susceptibility. Secondly, the genetic factors underlying lung cancer risk are likely to be both polygenic and heterogeneous, conferred by a variable combination of genetic variants (ie, SNPs with low penetrance and small effect sizes). Thirdly, genetic factors may confer either a protective30 31 or susceptibility16 phenotype to lung cancer. Here we report a 20-SNP panel which, combined with family history,20 defines risk (OR) across quintiles ranging from 1 to 10 with an AUC of 0.70. A risk tool with greater clinical utility can be derived by including age and presence of COPD to identify those at greatest susceptibility to lung cancer (OR range 1–28 and AUC = 0.79).

Several other important factors relevant to the genetic epidemiology of lung cancer have been considered in the design of this study. We sought to minimise false-positive results in a number of ways. The most important of these was to internally validate our findings using a two-stage design with an initial test cohort (run 1) to identify SNPs of potential interest. We then tested only those SNPs in a second cohort of cases and controls (run 2) using univariate and multivariate analysis to rank the SNPs under both conditions. Secondly, population stratification was excluded, and, thirdly, the presence of genotyping error was minimised through Hardy–Weinberg equilibrium analysis (see Methods) and by the exclusion of SNPs with <95% call rate (fails on genotyping are invariably genotype specific, thus generating false-positive associations). With respect to important confounding factors, our lung cancer cases and healthy smoking controls were matched for smoking exposure (pack-years). They were also similar with respect to gender and age mix. When the combined cohort was subgrouped by age band, the lung cancer susceptibility score maintained its discriminating utility across all groups. It was concluded that lung function was not confounding the results of this study, as the distribution of the lung cancer susceptibility score across the lung cancer cases, subdivided by normal and low lung function, showed no significant difference. The same could not be said of previously published case–control studies in which lung function was not measured.

However, weaknesses in this study include the modest size of the cohorts, borderline significance of some SNPs in the absence of correction, cross-sectional design, and recruitment limited to Caucasians. Moreover, it is accepted that, by selecting a control population with normal lung function (but comparable exposure) and including COPD (history) in the score, we will increase the score in those with lung cancer compared with controls (29% vs 5%). However, although this increases the magnitude of the difference between cases and controls (reflected in the ORs), it contributes little to the performance of the score (adds 0.02 to the AUC; see the Results section). Moreover, when subjects with COPD are excluded from the analysis (fig 4b, smoking controls versus lung cancer cases with normal lung function), the discriminating utility of the score is unaffected. In addition, in the youngest age band (confined to cases and controls ⩽60 years old), the prevalence of COPD (history) was only 3% (little effect from COPD weighting), there was no age weighting, and the susceptibility score was still a good discriminator (OR spans 1–16, p<0.001). We argue that screening individuals with COPD (based on spirometry) out of the controls has the following advantages: (1) best reflects the majority of smokers with no COPD estimated at 80%8; (2) best reflects the majority of smokers who will not develop lung cancer (resistant phenotype) estimated to be 80–90% (thereby minimising the dilutional effects of including patients with COPD, ie, misclassification)1 17; (3) best suited to identifying “protective” SNPs by comparing exposed individuals at either end of the risk spectrum.30 31 That said, replication using an unselected control group (in which COPD prevalence would be 10% or more) might better reflect an unselected at-risk population and, as expected, reduce differences in the susceptibility score between cases and controls (dilutional effect).

Although population stratification was formally tested, and our population is confined to Caucasians (where population admixture is less of an issue), it is possible that stratification remains a problem. A further limitation of the study is that, although the cases and controls were arguably representative, not all variables were precisely matched (eg, age, gender and smoking patterns). We reanalysed our data in a closely matched cohort (n = 450: 72 vs 69 years, 45 vs 43 pack-years and 70% vs 70% male for cases and controls, respectively) and found the performance of the susceptibility score across quintiles was unchanged (OR range 1–58, p<0.01). Further studies will need to be carried out to address these issues.

It is likely that genetic susceptibility to lung cancer results from a variable combination of several genetic variants in genes encoding proteins involved in several pathways activated by chronic smoke exposure and the inflammatory response that follows. A candidate gene (ie, hypothesis-driven) approach was used to identify potentially functional SNPs associated with the development of lung cancer. Although the SNPs identified in this study may only reflect linkage disequilibrium with functional variants nearby, these SNPs are likely to have functional effects and involvement directly with susceptibility to lung cancer. Two SNPs are from genes involved in the metabolism of smoking-derived carcinogens (N-acetyltransferase 2 and cytochrome P450 2E1) and previously linked to smoking-related cancers of the aerodigestive system.32 33 Five SNPs are from genes encoding inflammatory cytokines implicated in carcinogenesis or lung matrix remodelling (COPD), the latter strongly implicated in lung cancer development (interleukins 1, 8 and 18, tissue necrosis factor receptor, Toll-like receptor 9).34 35 36 37 38 39 40 41 42 Two SNPs are from genes that have been implicated in smoking addiction and lung cancer (dopamine D2 receptor and dopamine transporter 1).43 44 Two SNPs are functional and found in genes involved in the antioxidant response to aero-pollutants such as smoking (α1-antichymotrypsin and extracellular superoxide dismutase).30 31 45 46 Both of these have been associated with COPD, and one is upregulated in lung cancer. 
Six of the SNPs are found in genes involved in processes such as cell-cycle control, DNA repair and apoptosis, and associated with lung cancer in previously published studies (xeroderma pigmentosum complementation group D, p73, Bcl-2, FasL, Cerb1 and REV1).47 48 49 50 51 52 53 54 55 Two of the SNPs are from genes encoding integrins also implicated in apoptosis, cancer susceptibility and, for one, upregulation in lung cancer cells.56 57 58 One of the SNPs (α5 nAChR) has recently been associated with both lung cancer and COPD in genome-wide association studies.16 59 60 61 62 This receptor appears to be directly related to nicotine effects on airway inflammation.63 As can be seen, the SNP panel (table 3) is made up of a variety of SNPs from genes implicated in metabolism of smoke-derived carcinogens, oxidant response, cell-cycle control and inflammation. Twelve of these SNPs have been associated with lung cancer in other cohorts. It is likely that other SNPs from as yet unidentified genes will be identified in the future. To assess further the utility of the lung cancer susceptibility score, a prospective study is in progress. To date, the lung cancer cases (n = 43) have the same mean score and distribution as the lung cancer cases reported in this study (unpublished data). Further case–control and functional studies will be needed to explore the role of these SNPs in lung cancer susceptibility.

We propose that clinical utility of genotype data requires that many SNPs are analysed and their effects combined with other epidemiological factors of relevance.20 The algorithm approach used in this study is comparable to that recently published for prostate cancer20 and involves minimal assumptions (not hierarchical or path analysis based). The patient’s score can be compared with the scores in smokers with least susceptibility to lung cancer (lowest quintiles) in a simple linear fashion. Such an approach is comparable to the risk tools developed by others18 19 22 23 and similar in approach to recently published studies on risk in diabetes, where SNP data were combined with non-genetic risk variables to refine existing risk models.64 65 The clinical utility of the lung cancer susceptibility score was assessed by ROC analysis. This showed the c statistic to be 0.79 and, at a cut-off of ⩾3, an estimated sensitivity of 89% and corresponding specificity of 45%. After FDR analysis, 12 significantly associated SNPs were included in the model, with little decrease in the AUC (0.75 vs 0.77). These findings are comparable to the ROC performance of the Framingham Score (c statistic = 0.74),22 although other methods of assessing model performance have been advocated (eg, reclassification table approach66). The c statistic for the 20-SNP panel on its own was 0.68 (and 0.70 when combined with family history), indicating its utility in the current cohort. In contrast with the models for diabetes and prostate cancer, in our risk model for lung cancer it has been possible to account for the important environmental risk factor of smoking. 
There is evidence, although limited, that genetic testing may positively alter the behaviour of smokers in the context of smoking cessation (increase intent and possibly improve quit rate67 68) or by lowering smoking prevalence.69 The lung cancer susceptibility score may also have utility in early diagnosis of lung cancer where delays in diagnosis may affect survival.70 Although further validation studies are required, this study suggests that genetic data may be combined with other risk variables from smokers or ex-smokers to identify individuals most susceptible to developing lung cancer.
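
The ROC measures quoted above (the c statistic, and sensitivity and specificity at a score cut-off) can be illustrated with a minimal sketch. It assumes each subject has a numeric susceptibility score and a known case or control status; the function names and the twelve example subjects are hypothetical and are not study data. The c statistic is computed here as the proportion of case-control pairs in which the case has the higher score, with ties counted as one half, which is equivalent to the area under the ROC curve.

```python
def sensitivity_specificity(scores, has_cancer, cutoff):
    """Sensitivity and specificity when a score >= cutoff is called 'high risk'."""
    pairs = list(zip(scores, has_cancer))
    true_pos = sum(1 for s, y in pairs if y and s >= cutoff)
    false_neg = sum(1 for s, y in pairs if y and s < cutoff)
    true_neg = sum(1 for s, y in pairs if not y and s < cutoff)
    false_pos = sum(1 for s, y in pairs if not y and s >= cutoff)
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


def c_statistic(scores, has_cancer):
    """Probability that a random case outscores a random control (ties = 0.5)."""
    case_scores = [s for s, y in zip(scores, has_cancer) if y]
    control_scores = [s for s, y in zip(scores, has_cancer) if not y]
    concordant = 0.0
    for case_score in case_scores:
        for control_score in control_scores:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(case_scores) * len(control_scores))


# Hypothetical scores for six cases followed by six controls; NOT study data.
scores = [3, 4, 5, 6, 2, 7, 0, 1, 2, 3, 4, 2]
has_cancer = [True] * 6 + [False] * 6

sens, spec = sensitivity_specificity(scores, has_cancer, cutoff=3)
print(f"cut-off >= 3: sensitivity {sens:.2f}, specificity {spec:.2f}")
print(f"c statistic: {c_statistic(scores, has_cancer):.2f}")
```

Lowering the cut-off raises sensitivity at the expense of specificity, which is the trade-off behind a high-sensitivity, moderate-specificity operating point such as the one reported at a cut-off of ⩾3.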

Main messages

  • Lung cancer results from the combined effects of smoking and genetic susceptibility.

  • Chronic obstructive pulmonary disease is a common pre-existing and independent risk factor for lung cancer.

  • Genetic susceptibility for lung cancer includes genetic variants (single nucleotide polymorphisms (SNPs)) conferring reduced risk (“protective”), best identified using a healthy smoking cohort.

  • Genetic susceptibility for lung cancer results from the combined effects of genetic variants (SNPs) conferring either susceptibility or protective predisposition.

  • Genetic and non-genetic variables can be combined to give a global risk score for susceptibility to lung cancer.

Current research questions

  • Can the lung cancer susceptibility score be validated in smokers and ex-smokers in prospective studies and other populations?

  • Will use of the lung cancer susceptibility score improve patient outcomes to reduce risk of lung cancer and/or detect lung cancer at a treatable stage?


We gratefully acknowledge the participation of subjects in this study, in particular the patients with lung cancer.




  • See Editorial, p 505

  • Funding This study was in part funded by the Health Research Council of New Zealand (Grant 9101-3602829), the Auckland Medical Research Foundation of New Zealand and the University of Auckland (Staff Research Fund), New Zealand.

  • Competing interests This study was part funded by Synergenz BioScience Ltd. RY is an advisor to this company.

  • Provenance and peer review Not commissioned; externally peer reviewed.
