Genetic variant effects on gene expression in human pancreatic islets and their implications for T2D

Genetic variant effects on gene expression in human pancreatic islets and their implications for T2D

Cohort characteristicsThe samples from 420 donors included 189 males and 231 females, with an age range of 16–81 years (median = 54 years, 11 not available). Of the 420 individuals, 37 were identified as diabetic. BMI information was available for 334 individuals (median, 26.3 kg/m2), while HbA1c measurements were available for 198 (median, 5.8%). Due to the historical nature of some of the samples used in this study, QC information about the pancreatic islet isolation was limited: 254 samples listed their purity (median, 75%) and 26 samples listed their viability (median, 93.5%). No other biological information was available in the historical records. This information is available in the covariate files included in the EGA submission.Pancreatic islet sample collection and processingSamples collection and processing is summarized in Supplementary Fig. 17.The processing of the Geneva samples from Nica et al.18 were originally done using RNA libraries with 49-bp paired-end reads; however, in order to allow joint analysis with the other available datasets for this study, mRNA samples were reprocessed using a 100-bp paired-end sequencing protocol. The library preparation and sequencing followed customary Illumina TruSeq protocols for next generation sequencing as described in the original publication18. All procedures followed ethical guidelines at the University Hospital in Geneva.The 89 Lund samples from Fadista et al.16 were jointly processed with 102 new islet samples that were processed uniformly following the same protocol. These islet samples were obtained from 191 cadaver donors of European ancestry from the Nordic Islet Transplantation Programme (http://www.nordicislets.org). Purity of islets was assessed by dithizone staining, while measurement of DNA content and an estimate of the contribution of exocrine and endocrine tissue were assessed as previously described61. Total RNA was isolated with the AllPrep DNA/RNA Mini Kit following the manufacturer’s instructions (Qiagen), sample preparation was performed using Illumina’s TruSeq RNA Sample Preparation Kit according to manufacturer’s recommendations. The target insert size of 300 bp was sequenced using a paired-end 101 bp protocol on the HiSeq2000 platform (Illumina). Illumina Casava v.1.8.2 software was used for base calling. All procedures were approved by the ethics committee at Lund University.The Oxford dataset included samples collected in Oxford and Edmonton that were jointly sequenced in Oxford are included in this set of samples. Islet sample procurement, mRNA processing, and sequencing procedure has been described in van de Bunt et al.17. To the 117 samples previously published (78 from Edmonton and 39 from Oxford), 57 samples were added and processed following similar protocols as before (27 from Edmonton and 30 from Oxford). Briefly, freshly isolated human islets were collected at the Oxford Centre for Islet Transplantation in Oxford, or the Alberta Diabetes Institute IsletCore (www.isletcore.ca) in Edmonton, Canada. Additional islets were obtained from the Alberta Diabetes Institute IsletCore’s long-term cryopreserved biobank. Freshly isolated islets were processed for RNA and DNA extraction after 1–3 days in culture in CMRL media. Cryopreserved samples were thawed as described in Manning et al.62 and Lyon et al.63. RNA was extracted from human islets using Trizol (Ambion, UK or Sigma-Aldrich, Canada). To clean remaining media from the islets, samples were washed three times with phosphate-buffered saline (Sigma-Aldrich, UK). After the final cleaning step 1 mL Trizol was added to the cells. The cells were lysed by pipetting immediately to ensure rapid inhibition of RNase activity and incubated at room temperature for 10 min. Lysates were then transferred to clean 1.5 mL RNase-free centrifuge tubes (Applied Biosystems, UK). RNA quality (RIN score) was determined using an Agilent 2100 Bioanalyser (Agilent, UK), with a RIN score > 6 deemed acceptable for inclusion in the study. Samples were stored at −80 °C prior to sequencing. PolyA selected libraries were prepared from total RNA at the Oxford Genomics Centre using NEBNext ultra directional RNA library prep kit for Illumina with custom 8 bp indexes64. Libraries were multiplexed (three samples per lane), clustered using TruSeq PE Cluster Kit v3, and paired-end sequenced (100 nt) using Illumina TruSeq v3 chemistry on the Illumina HiSeq2000 platform. All procedures were approved by the Human Research Ethics Board at the University of Alberta (Pro00013094), the University of Oxford’s Oxford Tropical Research Ethics Committee (OxTREC Ref. 2,3,4,5,6,7,8,9,10,11,12,13,14,15), or the Oxfordshire Regional Ethics Committee B (REC reference: 09/H0605/2). All organ donors provided informed consent for use of pancreatic tissue in research.The USA samples from Varshney et al.15 were originally processed as follows: 39 islet samples from organ donors were received from the Integrated islet Distribution Program, the National Disease Research Interchange, and Prodo- Labs. Total RNA from 2000–3000 islet equivalents was extracted and purified using Trizol (Life Technologies). RNA quality was confirmed with Bioanalyzer 2100 (Agilent); samples with RNA integrity number (RIN) > 6.5 were prepared for mRNA sequencing. We added External RNA Control Consortium spike-in controls (Life Technologies) to one microgram of total RNA. PolyA+, stranded mRNA RNA-Seq libraries were generated for each islet using the TruSeq-stranded mRNA kit according to manufacturer’s protocol (Illumina). Each islet RNA-Seq library was barcoded, pooled into 12-sample batches, and sequenced over multiple lanes of HiSeq2000 to obtain an average depth of 100 million 2 × 101 bp sequences. All procedures followed ethical guidelines at the National Institutes of Health (NIH).Β-cell sample collection and processingTo the 11 FAC-sorted beta-cells population samples previously published18, we added 15 more samples that were processed following the same protocols. Briefly, islets were dispersed into single cells, stained with Newport Green, and sorted into “beta” and “non-beta” populations. The proportion of beta (insulin), alpha (glucagon), and delta (somatostatin) cells in each population (as percentage of total cells) was determined by immunofluorescence. mRNA extractions as well as sequencing followed the same details described for islets samples processing for the Geneva samples.Read mapping and exon quantificationThe 100-bp sequenced paired-end reads were mapped to the GRCh37 reference genome65 with GEM66. Exon quantifications were calculated for all elements annotated in GENCODE67 v19, removing genes with >20% zero read count. All overlapping exons of a gene were merged into meta-exons with identifier of type ENSG000001.1_exon.start.pos_exon.end.pos, as described in Lappalainen21. Read counts over these elements were calculated without using read pair information, except for excluding reads where the pairs mapped to two different genes. We counted a read in an exon if either its start or end coordinates overlapped an exon. For split reads, we counted the exon overlap of each split fragment, and added counts per read as 1/(number of overlapping exons per gene). Gene-level quantifications used the sum of all reads mapped to exons from the gene. Genes with >20% zero read counts were removed.Genotype imputationGenotypes for all islet samples, including 19 β-cell samples, were available from OmniExpress and Omni2.5 genotype arrays. Quality of genotyping from the shared SNPs in both arrays was assessed before imputation separately and, samples were excluded if they had an overlap genotype success rate <90%; and MAF differences >20% compared to the 1000 G reported European MAF. The two panels were separately pre-phased with SHAPEIT68 v2 using the IMPUTE2-supplied genetic maps. After pre-phasing, the panels were imputed with IMPUTE2 (ref. 69) v2.3.1 using the 1000 Genomes Phase I integrated variant set (March 2012) as the reference panel70. SNPs with INFO score > 0.4 and HWE p > 1e−6 (for chrX this was calculated from female individuals only) from each panel were kept. A combined vcf for each chromosome was generated from the intersection of the checked variants in each panel. Directly genotyped SNPs with a MAF < 1% (including the exome components of the chips not shared between all centers) were merged into the combined vcfs: (i) If SNPs were not imputed they were added and (ii) If SNPs had been imputed, the imputed calls for the individual were replaced by the typed genotype. Dosages were calculated from the imputation probabilities (genotyped samples) or genotype calls (WGS samples). For the 22 autosomes, the dosage calculation was: 2 × ((0.5*heterozygous call) + homozygous alt call). For chromosome X (where every individual should be functionally hemizygous), the calculation was: (0.5*heterozygous call) + homozygous alt call). Genotype calls for males can only be “0/0” and “1/1”. The total number of variants available for analysis after quality assessment was 8,056,952.For the 26 β-cell samples, 19 had genotypes available from OmniExpress arrays, whereas 7 had DNA sequence available. Variant calling from DNA sequence has been previously described in Nica et al.18. Briefly, the Genome Analysis Toolkit71 v1.5.31 was used following the best practice variant detection v3 to call variants. Reads were aligned to the hg19 reference genome with BWA72. A confidence score threshold of 30 for variant detection was used and a minimum base quality of 17 for base calling. Good confidence (1% FDR) SNP calls were then imputed on the 1000 Genomes reference panel and phased with BEAGLE73 v3.3.2. Imputation of variants from samples with arrays genotyping were imputed together with genotypes from individuals with islets samples as described before and then merged with genotypes from DNA sequences. SNPs with INFO score > 0.4, HWE p > 1e−6 and MAF > 5%, were kept for further analysis. The total number of variants available for analysis after quality assessment was 6,847,993.RNA-Seq quality assessment and data normalizationHeterozygous sites per sample were matched with genotype information to confirm the ID of the samples74. Eleven samples did not match with their genotypes, six of which would be resolved by identifying concordant matches. For the remaining samples, no matches were found on the genotypes and they were removed from the dataset, giving a total of 420 samples with genotypes. Raw read counts from exons and genes were scaled to ten million to allow comparison between samples with different libraries. Scaled raw counts were then quantile normalized.We used PC analysis to evaluate and control the effects of unwanted technical variation, and the expected batch effects due to the islet sample processing and mRNA sequencing being performed across four labs. The main differences in samples processing, and sequencing differences were grouped in a variable identifying the lab of origin of the samples. Since each institution handled samples in a different way with different processing and sequencing protocols, we expect differences across samples from different labs to be greater than the differences between samples from the same lab. Supplementary Fig. 1 shows these differences, while comparing the samples distribution in PC1 vs PC2, with the colors identifyinand g the lab of origin (GEN, OXF, LUN, USA). This analysis also identified internal batch effects for OXF and LUN samples, as a second set of samples were sequenced for this study in both institutions. Therefore, to control for these and other potential sources of unwanted global variation, we included 25 PCs of expression, as covariates in the linear model used to identify eQTLs (see below).To identify the optimal number of PCs require to control for differences in the origin of the samples for the discovery of eQTLs, we performed a permutation test: expression sample labels and expression covariates were permuted within each of the four laboratories before performing a standard eQTL analysis against non-permuted genotypes (and matched PCs for genotypes), using different numbers of PCs for expression. Significant eQTLs beyond a 5% FDR in any of these analyses are considered false positives due to technical differences across laboratories. Our results indicate that controlling for ten PCs was sufficient to minimize the number of false positives due to batch effects originating from differences in processing of the islet samples. In addition, we performed an eQTL discovery controlling for 1, 5, 10, 20, 30, 40, and 50 PCs for expression, as well as gender, 4 PCs derived from genotype data, and a variable defining the laboratory of origin (coded as: OXF, LUN, GEN, and USA). After evaluation of the results, we conclude that controlling for 25 PCs for expression was optimal as this controlled for differences across and within laboratories, while maximizing the discovery of eQTLs.eQTL analysiseQTL analysis for islets and β-cells were performed using fastQTL19 on 420 islet samples and 26 FAC-sorted β-cell samples with available genotypes. Cis-eQTL analysis was restricted to SNPs in a 1 MB window upstream and downstream the TSS for each gene, and SNPs with MAF > 1%. For the analysis of β-cell samples, we used a filter of MAF > 5%. Exon-level eQTLs identified best exon–SNP association per gene (using the –group flag), while gene-level eQTLs used gene quantifications and identified the best gene–SNP association. Variables included in the linear models were the first 4 PCs for genotypes, the first 25 PCs for expression, gender, and a variable identifying the laboratory of origin of the samples. Significance for the SNP–gene association was assessed using 1000 permutations per gene, correcting p-values with a beta approximation distribution19. Genome-wide multiple testing correction was performed using the q-value correction22 implemented in largeQvalue75.Results of this joint analysis were highly correlated with those obtained from a fixed-effects meta-analysis of the four component studies, indicating appropriate control for the technical differences between the studies (Supplementary Fig. 16).To discover multiple independent eQTLs, we applied a stepwise regression procedure has also been described in Brown et al.76. We tested significant eGenes (FDR < 1%) for secondary associations in any exon. The maximum beta-adjusted p-value (correcting for multiple testing across the SNPs and exons) over these genes was taken as the gene-level threshold. The next stage proceeded iteratively for each gene and threshold. A cis-scan of the window was performed in each iteration, using 1000 permutations and correcting for all previously discovered SNPs. If the beta-adjusted p-value for the most significant exon–SNP or gene–SNP (best association) was not significant at the gene-level threshold, the forward stage was complete and the procedure moved on to the backward step. If this p-value was significant, the best association was added to the list of discovered eQTLs as an independent signal and the forward step proceeded to the next iteration. The exon-level cis-eQTL scan identified eQTLs from different exons, but reported only the best exon–SNP in each iteration. Once the forward stage was complete for a given gene, a list of associated SNPs was produced that we refer to as forward signals. The backward stage consisted of testing each forward signal separately, controlling for all other discovered signals. To do this, for each forward signal we ran a cis-scan over all variants in the window using fastQTL, fitting all other discovered signals as covariates. If no SNP was significant at the gene-level threshold the signal being tested was dropped, otherwise the best association from the scan was chosen as the variant that represented the signal best in the full model.GTEx eQTLsWe identified exon-level eQTLs for 44 GTEx tissues using fastQTL19 following the same procedure as for the islet eQTLs. Covariates included followed the previously published number of PCs for expression8 and included 15 PCs for expression for tissues with <154 samples; 30 PCs for samples between 155 and 254 samples; and 35 PCs for samples with >254 samples. Independent eQTLs from exons were identified as described for islet eQTLs. The proportion of shared eQTLs between islet and β-cell eQTLs and the eQTLs from GTEx tissues were identified using Π122.Tissue deconvolutionTo identify the contribution of β-cells, non-β-cells and exocrine (non-islet) cells to overall gene expression measured in islets, we performed an expression deconvolution analysis. Expression profiles from GTEx whole pancreas was used as a model for the exocrine component of expression8, while FAC-sorted expression profiles from β-cell and non-β-cells from Nica et al.18 were used to identify the fraction of expression derived from islet cells. First, we performed differential expression analysis of (a) exocrine vs whole islet samples; (b) β-cell vs whole islet samples; and (c) non-β-cell vs whole islet samples. The top 500 genes from each analysis were combined, and a deconvolution matrix of log2-transformed median expression values was prepared for each cell type. Next, deconvolution was performed using the Bioconductor package DeconRNASeq77. Deconvolution values per sample are included in the covariates file, together with the expression values in the EGA submission.Genotype-by-cell type associationsGenotype-by-cell-type-specific regulatory effects were identified by testing for interactions between SNPs and cellular fraction estimates. We performed the analysis using a linear model and residuals from expression in gene quantifications. Residuals were extracted using a linear mixed model controlling for fixed effect variables (batch effects, islets purity, GC mean, and merge), and random variables (tag/primers and date of sequencing). The merge variable identified samples that were sequenced multiple times, with a final set of reads merged from multiple files. Significance for the genotype-by-cell interactions were evaluated using FDR and 100 permutations in each analysis: genotype-by-β-cell proportions, genotype-by-non-β-cell proportions, and genotype-by-exocrine cell proportions.Enrichment of eQTLs in T2D and glycemic GWASWithin each tissue, we asked if the magnitude of the eQTL effect for a given set of GWAS SNPs were larger than expected for a randomly selected matched set of SNPs (as described below). We performed enrichment analysis separately for each trait and tissue type using the following procedure. For each trait, we used lead GWAS variants from the following sources: T2D (all; n = 403)5; subsets of these T2D-associated variants that likely act via β-cell action (n = 43); glycemic traits (fasting glucose and β-cell function (HOMA-B) in nondiabetic individuals; n = 56); and for comparison T1D (n = 55)38 (Supplementary Data 13)3,36,62. As InsPIRE pancreatic islet and GTEx tissue-based estimates of the variant exon effect, we used eQTL beta coefficients for exons tested for each gene within 1 Mb of the GWAS lead variant. For each GWAS lead variant for a given trait, we identified the eQTL with the largest absolute effect size estimate among all the tested exons (max individual SNP beta). Across the lead GWAS variants, we took the median of the max individual SNP betas (observed median of the maxes). To generate a null distribution of the medians of the maxes, we repeated the analysis 15,000 times. For each replicate, we matched each GWAS SNP to a SNP present in the eQTL data based on the number of SNPs in LD, distance to TSS, number of nearby genes and MAF, and calculated the median of the maxes (null median of maxes). To form a distribution of the effect size enrichment we divided the observed median of the maxes by the null medians of the maxes. For each tissue and GWAS trait, we defined the median of the effect size enrichment distribution as the enrichment. We estimated the one-sided 95% confidence intervals as the fifth percentiles of the effect size enrichment distribution. We calculated the one-sided p-value for enrichment as the proportion of replicates with enrichment values <1.Colocalization of islet eQTL with T2D GWASColocalization of GWAS variants and eQTLs was performed using both COLOC40 and RTC19. For the analysis using COLOC, all variants within 250 kb flanking regions around index variants were tested for colocalization, using default parameters from the software. The analysis used summary statistics from T2D GWAS6 and fasting glucose35. GWAS variants and eSNPs pairs were considered to colocalize if the COLOC score for shared signal was >0.9. RTC analysis was also performed using defaults parameters from the software with a list of 459 lead GWAS variants for T2D and fasting glucose (Supplementary Data 13). Associations between GWAS and gene expression were considered as colocalizing if RTC score was >0.9. The plots showing colocalization of GWAS and eQTLs were generated using LocusCompare54.Chromatin states, islet ATAC-seq, and TF footprintsWe used a previously published 13 chromatin state model that included pancreatic islets along with 30 other diverse tissues15. Briefly, these chromatin states were generated from cell/tissue ChIP-seq data for H3K27ac, H3K27me3, H3K36me3, H3K4me1, and H3K4me3, and input control from a diverse set of publically available data31,78,79,80 using the ChromHMM program81. Chromatin states were learned jointly from 33 cell/tissues that passed QC by applying the ChromHMM (version 1.10) hidden Markov model algorithm at 200-bp resolution to five chromatin marks and input15. We ran ChromHMM with a range of possible states and selected a 13-state model, because it most accurately captured information from higher-state models and provided sufficient resolution to identify biologically meaningful patterns in a reproducible way. As reported previously15, stretch enhancers were defined as contiguous enhancer chromatin state (active enhancers 1 and 2, genic enhancer, and weak enhancer) segments longer than 3 kb, whereas typical enhancers were enhancer state segments smaller than the median length of 800 bp (ref. 31).We downloaded raw ATAC-seq data (fastq files) for a total of 33 islet samples from nondiabetic donors: 14 from ref. 82, 17 from ref. 11, and two from ref. 15. We processed these data uniformly by trimming all reads to 36 bp, mapping to hg19 using bwa-mem (version 0.7.15-r1140)83, removing duplicates using Picard (http://broadinstitute.github.io/picard), and pruning reads to retain properly paired and mapped reads (samtools view -F 256 -F 1024 -F 2048 -q 30). Since these 33 samples were obtained from different studies, we sought to obtain the set of peaks that were reproducible. We subsampled each ATAC-seq sample bam file to the minimum read depth across samples of 27,994,993 reads, merged these subsampled bam files across all samples and called peaks, using (1) the merged bam file that uniformly represents each sample and (2) each individual sample bam file. We used MACS2 (https://github.com/taoliu/MACS), version 2.1.0, with flags -g hs–nomodel–shift -100–extsize 200 -B–broad–keep-dup all, to call peaks and filtered out regions blacklisted by the ENCODE consortium due to poor mappability (wgEncodeDacMapabilityConsensusExcludable.bed and wgEncodeDukeMapabilityRegionsExcludable.bed). We then selected peaks from the merged bam file that were reproducibly called across the majority (at least 17) of the 33 individual samples, resulting in 64,129 peaks that we used for downstream analyses.TF footprint motifs are occurrences of TF motifs (obtained from databases of DNA-binding motifs for several TFs) in accessible chromatin regions (identified from assays such as ATAC-seq). We downloaded previously published islet TF footprint motifs15, which were generated in a haplotype-aware manner using ATAC-seq and genotyping data from the phased, imputed genotypes for two islet samples using vcf2diploid81 v0.2.6a and DNA-binding motif information for 1995 publicly available TF motifs84,85,86. We subset the footprint motifs selecting occurrences within the new set of reproducible ATAC-seq peaks described above.Filtering eQTL SNPsSince low MAF SNPs, due to low power, can only be identified as significant eQTL SNP (eSNPs) with high eQTL effect sizes (slope or the beta from the linear regression), we observed that absolute effect size varies inversely with MAF (Supplementary Fig. 15). To conduct eQTL effect size based analyses in an unbiased manner, we selected significant (FDR 1%) eSNPs with MAF > = 0.2. We then pruned this list to retain the most significant SNPs with pairwise LD (r2) < 0.8 for the EUR population using PLINK87 and 1000 genomes variant call format (vcf) files (downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) for reference (European population). This filtering process resulted in n = 3832 islet eSNPs.Enrichment of genetic variants in genomic featuresTo calculate the enrichment of islet eSNPs to overlap with genomic features, such as chromatin states, ATAC-seq peaks, and TF footprint motifs, we used the GREGOR tool88. For each input SNP, GREGOR selects ~500 control SNPs matched for MAF, distance to the gene, and number of SNPs in LD (r2) ≥ 0.99. A unique overlap is reported if the feature overlaps any input lead SNP or its LD (r2) > 0.99 SNPs. Fold enrichment is calculated as the number unique overlaps over the mean number of loci at which the matched control SNPs (or their LD (r2) > 0.99 SNPs) overlap the same feature. This process accounts for the length of the features, as longer features will have more overlap by chance with control SNP sets. We used the following parameters in GREGOR for eQTL enrichment: r2 threshold (for inclusion of SNPs in LD with the lead eSNP) = 0.99, LD window size = 1 Mb, and minimum neighbor number = 500. Since eQTL loci can only occur within ~1 Mb from TSSs of genes actually tested for eQTLs, we checked what fraction of the GREGOR control loci occurred within 1 Mb of tested genes. Out of total 6,031,279 control loci, 98.6% (5,949,654) variants occur within 1 Mb of the TSS for genes for which exon-level eQTL were tested.For enrichment of T2D GWAS SNPs in islet chromatin states, we downloaded the list of T2D GWAS SNPs from Mahajan, et al.5. We pruned this list to retain the most significant SNPs with pairwise LD (r2) < 0.2 for the EUR population using PLINK87 and 1000 genomes variant call format (vcf) files (downloaded from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) for reference (European population). This filtering process resulted in N = 378 T2D GWAS SNPs. We used GREGOR to calculate enrichment using the following specific parameters: r2 threshold (for inclusion of SNPs in LD with the lead eSNP) = 0.8, LD window size = 1 Mb, and minimum neighbor number = 500.We investigated if footprint motifs were more enriched to overlap eQTLs of high vs low effect sizes. We sorted the filtered (as described above) eQTL list by absolute effect size values and partitioned these into two equally sized bins (N eSNPs = 1916). Since TF footprints were available for a large number of motifs (N motifs = 1995), the enrichment analysis had a large multiple testing burden and limited power with 1916 eSNPs in each bin. Therefore, we only considered footprint motifs that were significantly enriched (FDR < 1%, Benjamini and Yekutieli method from R p. adjust function, N motifs = 283) and that overlap the bulk set of eSNPs (LD r2 < 0.8 pruned but not MAF filtered, N eSNPs = 6,468, Supplementary Data 12) for enrichment with the binned set of eSNPs. This helped reduce the multiple testing burden. We then calculated enrichment for the selected footprints to overlap SNPs in each bin using GREGOR with same parameters as described above (Supplementary Data 12).eSNP effect size distribution in chromatin statesWe identified the islet eQTL eSNPs (after LD pruning and MAF filtering as described above) occurring in chromatin states or ATAC-seq peaks within chromatin states using BEDtools intersect89. Similar to the enrichment calculation procedure, we considered a unique eQTL overlap if the lead eSNP or a proxy SNP with LD (r2) > 0.99 occurred in these regions. We considered the effect size as the slope or the beta from the linear regression for the eQTL overlapping each region. P-values were calculated using the Wilcoxon rank-sum test in R (ref. 90).TF motif directionality analysisWe selected TF footprint motifs that were significantly enriched (after Bonferroni correction accounting for 1995 total motifs) to overlap islet eQTL (considering LD r2 < 0.8 pruned lead eSNPs or their r2 > 0.99 proxy eSNPs as a locus) with at least ten eQTL locus overlaps, resulting in N = 329 TF motifs. We determined the overlap position of the eSNP with each TF footprint motif. We considered instances where the eSNP overlapped the TF footprint motif at a position with information content > =0.7 and either the eSNP effect or the non-effect allele was the most preferred base in the motif. For each TF footprint motif and eSNP overlap, we rekeyed the direction of effect on the target gene being positive or negative, with respect to the most preferred base in the motif. For each TF motif, we compiled the fraction of instances where the SNP allele that was most preferred in the TF footprint motif (i.e., base with highest probability in the motif) associated with increased expression of the associated gene. We refer to this metric as the motif directionality fraction where fractions near 1 suggest activating and fractions near 0 suggest repressive preferences toward the target gene expression. Motif directionality fraction near 0.5 suggests no activity preference or context dependence.We compared our results to a previously published study that quantified transcription activating or repressive activities based on massively parallel reported assays in HepG2 and K562 cells34 (Supplementary Fig. 8). We found that the motif directionality measures metric were largely concordant (Spearman’s r = 0.69, p = 7.7 × 10−17) with orthogonal motif activity measures derived from MPRAs performed in HepG2 and K562 cell line34 (Supplementary Fig. 8). We then considered 109 motifs from our analyses that were reported to have significant (p < 0.01) activating or repressive scores from MPRAs in both HepG2 and K562. With the null expectation of the motif directionality fraction being equal to 0.5, i.e., TF binding equally likely to increase or decrease target gene expression, we used a binomial test to identify TFs that show significant deviation from the null (N = 18 at FDR < 10%, Supplementary Data 12).Cell culture and transcriptional reporter assaysMIN6 mouse insulinoma beta-cells52 were grown in Dulbecco’s modified Eagle’s Medium (DMEM; Sigma-Aldrich, St. Louis, Missouri/USA) with 10% fetal bovine serum, 1 mM sodium pyruvate, and 0.1 mM beta-mercaptoethanol. INS-1-derived 832/13 rat insulinoma beta-cells (a gift from C. Newgard, Duke University, Durham, North Carolina/USA) were grown in RPMI-1640 medium (Corning, New York/USA) supplemented with 10% fetal bovine serum, 10 mM HEPES, 2 mM L-glutamine, 1 mM sodium pyruvate, and 0.05 mM beta-mercaptoethanol. EndoC-βH1 cells (Endocell) were grown according to Ravassard et al.49 in DMEM (Sigma-Aldrich), 5.6 mmol/L glucose with 2% BSA fraction V fatty acid free (Roche Diagnostics), 50 μmol/L 2-mercaptoethanol, 10 mmol/L nicotinamide (Calbiochem), 5.5 μg/mL transferrin (Sigma-Aldrich), 6.7 ng/mL selenite (Sigma-Aldrich), 100 U/mL penicillin, and 100 μg/mL streptomycin. Cells were grown on coating consisting of 1% matrigel and 2 µg/mL fibronectin (Sigma). We maintained cell lines at 37 °C and 5% CO2.To test haplotypes for allele-specific effects on transcriptional activity, we PCR amplified a 765-bp genomic region (element 1) containing variants: rs7798124, rs7798360, and rs7781710, and a second 592-bp genomic region (element 2) containing variants: rs10228796, rs10258074, rs2191348, and rs2191349 from DNA of individuals homozygous for each haplotype. The oligonucleotide primer sequences are listed in Supplementary Table 6. We cloned the PCR amplicons into the multiple cloning site of the Firefly luciferase reporter vector pGL4.23 (Promega, Fitchburg, Wisconsin/USA) in both orientations, as described previously91. Vectors are designated as “forward” or “reverse” based on the PCR-amplicon orientation with respect to DGKB gene. We isolated and verified the sequence of five independent clones for each haplotype in each orientation. For the 5′ eQTL a 250 bp construct containing the rs17168486 SNP (Origene) was subcloned into the Firefly luciferase reporter vector pGL4.23 (Promega) in both orientations.We plated the MIN6 (200,000 cells) or 832/13 (300,000 cells) in 24-well plates 24 h before transfections and plated the EndoC-βH1 cells (140,000 cells) 48 h prior to transfection. We co-transfected the pGL4.23 constructs with phRL-TK Renilla luciferase reporter vector (Promega) in duplicate into MIN6 or 832/13 cells, and in triplicate for EndoC-βH1 cells. For the transfections, we used Lipofectamine LTX (ThermoFisher Scientific, Waltham, Massachusetts/USA) with 250 ng of plasmid DNA and 80 ng Renilla for MIN6 cells, Fugene6 (Promega) with 720 ng of plasmid, and 80 ng Renilla for 832/13 cells per each welll and Fugene6 with 700 ng plasmid and 10 ng renilla for EndoC-βH1 cells. We incubated the transfected cells at 37 °C with 5% CO2 for 48 h. We measured the luciferase activity with cell lysates using the Dual-Luciferase® Reporter Assay System (Promega). We normalized Firefly luciferase activity to the Renilla luciferase activity. We compared differences between the haplotypes using unpaired two-sided t-tests. All experiments were independently repeated on a second day and yielded comparable results.Electrophoretic mobility shift assaysEMSAs were performed as previously described. We annealed 17-nucleotide biotinylated complementary oligonucleotides (Integrated DNA Technologies) centered on variants: rs10228796, rs10258074, rs2191348, and rs2191349 (Supplementary Table 7). MIN6 nuclear protein extract was prepared using the NE-PER kit (Thermo Scientific). To conduct the EMSA binding reactions, we used the LightShift Chemiluminescent EMSA kit (Thermo Scientific) following the manufacturer’s protocol. Each reaction consisted of 1 μg poly(dI-dC), 1× binding buffer, 10 μg MIN6 nuclear extract, and 400 fmol biotinylated oligonucleotide. We resolved DNA–protein complexes on nondenaturing DNA retardation gels (Invitrogen) in 0.5× TBE. We transferred the complexes to Biodyne B Nylon membranes (Pall Corporation), and UV cross-linked (Stratagene) to the membrane. We used chemiluminescence to detect the DNA–protein complexes. EMSAs were repeated on a second day with comparable results.Reporting summaryFurther information on research design is available in the Nature Research Reporting Summary linked to this article.

Via Source link