Molecular Classification of Thyroid Nodules with Indeterminate Cytology: Development and Validation of a Highly Sensitive and Specific New miRNA-Based Classifier Test Using Fine-Needle Aspiration Smear Slides

Background: Thyroid nodules can be identified in up to 68% of the population. Fine-needle aspiration (FNA) cytopathology classifies 20%–30% of nodules as indeterminate, and these are often referred for surgery due to the risk of malignancy. However, histological postsurgical reports indicate that up to 84% of cases are benign, highlighting a high rate of unnecessary surgeries. We sought to develop and validate a microRNA (miRNA)-based thyroid molecular classifier for precision endocrinology (mir-THYpe) with both high sensitivity and high specificity, to be performed on the FNA cytology smear slide with no additional FNA. Methods: The expression of 96 miRNA candidates from 39 benign/39 malignant thyroid samples (indeterminate on FNA) was analyzed to develop and train the mir-THYpe algorithm. For validation, an independent set of 58 benign/37 malignant FNA smear slides (also classified as indeterminate) was used. Results: In the training set, with a 10-fold cross-validation using only 11 miRNAs, the mir-THYpe test reached 89.7% sensitivity, 92.3% specificity, 90.0% negative predictive value, and 92.1% positive predictive value. In the FNA smear slide validation set, the mir-THYpe test reached 94.6% sensitivity, 81.0% specificity, 95.9% negative predictive value, and 76.1% positive predictive value. Bayes' theorem shows that the mir-THYpe test performs satisfactorily in a wide range of cancer prevalences. Conclusions: The presented data and comparison with other commercially available tests suggest that the mir-THYpe test can be considered for use in clinical practice to support a more informed clinical decision for patients with indeterminate thyroid nodules and potentially reduce the rates of unnecessary thyroid surgeries.

indeterminate thyroid nodules is not low, even when considering the reclassification of NIFTP (noninvasive follicular thyroid neoplasm with papillary-like nuclear features) as benign (RoM 6%-60%) (9)(10)(11)(12), surgery is still the current practice performed and recommended for the majority of these cases (8).
Considering that around 69% of indeterminate thyroid nodules are classified as benign by postsurgical histology, ranging from 24.8% for Bethesda class V up to 84.1% for Bethesda class III (13), many unnecessary surgeries are performed yearly. This problem is worsening: overdiagnosis and overtreatment of thyroid nodules have been documented over the last decade as a result of a large increase in thyroid cancer screening (14), highlighting an important need to improve the diagnostic evaluation of patients with indeterminate cytological classifications on FNA in the preoperative setting.
To provide more objective information and support a more informed and personalized clinical decision, molecular tests have emerged as powerful tools to overcome this problem (15). Mutational analyses of genes such as BRAF, TERT, RAS, and TP53 and of several gene fusions, alone or in panels, have good specificity and positive predictive values (PPVs) and are usually used as "rule-in" tests to predict malignancy (15,16), but they are not intended to avoid unnecessary surgeries.
On the other hand, "rule-out" molecular classifier tests are good tools to help identify benign thyroid nodules, and these tests aim to reduce unnecessary surgeries due to their high sensitivity and negative predictive values (NPVs) (17). To our knowledge, molecular classifier tests are currently only performed commercially by five centralized laboratories worldwide (18)(19)(20)(21)(22), which geographically and financially hinders their use by, and benefit to, patients from other countries. Another important limitation is that three of the five tests require an additional biological sample, which usually means performing at least one additional FNA (18)(19)(20). Although a needle washout from the initially performed passes can be collected and stored, potentially avoiding an additional FNA, this is not performed routinely by all centers. Currently, only two of the five molecular classifier tests (21,23) can be performed using the cytological smear slides that were used to classify the thyroid nodule as indeterminate. These two tests are good "rule-out" options due to their high sensitivity; however, neither test achieves the proposed minimum of 80% specificity required to be considered a "rule-in" test option and thus to perform adequately across a broad range of disease prevalence (17).
In the present study, our aim was to identify a panel of microRNAs (miRNAs) that have a distinct expression profile in benign nodules compared to malignant nodules, and to develop and validate a miRNA-based thyroid molecular classifier for a precision endocrinology (mir-THYpe) test with both high sensitivity and high specificity, which could be performed directly from the readily available cytological smear slides without the need for an additional FNA.

Study design and participants
A retrospective analysis was performed to identify samples from patients with thyroid nodules who underwent FNA procedures between January 2013 and July 2017, for which the cytopathology analysis classified the samples as "atypia of undetermined significance/follicular lesion of undetermined significance" (AUS/FLUS, Bethesda class III), "follicular or oncocytic (Hürthle cell) neoplasm/suspicious for a follicular or oncocytic (Hürthle cell) neoplasm" (FN/SFN, Bethesda class IV), or "suspicious for malignancy" (SUSP, Bethesda class V), and who had undergone total or partial thyroidectomy. Samples were obtained from the Barretos Cancer Hospital (for mir-THYpe test development and validation) and from Midwestern State University UNICENTRO (for mir-THYpe test validation).
Samples were tested at Barretos Cancer Hospital, in a laboratory certified according to the provisions of the College of American Pathologists, United Kingdom National External Quality Assessment Service, and the European Molecular Genetics Quality Network. The study was approved by the board of the investigational ethics committee and listed under CAAE 52739416.5.0000.5437. All patients provided written informed consent to participate in the study.

Sample selection and evaluation
To confirm whether the nodule described in the surgical pathology report of the removed tissue corresponded to the same nodule biopsied in the FNA, all eligible cases were reviewed. The nodule description obtained by ultrasonography was also compared with the FNA cytology analysis report and the final postsurgical report. The sample was included in the study only if the FNA cytology report and other characteristics, such as size and localization, corresponded to the nodule described in the final surgical pathology report. Samples in which the final diagnosis did not correspond to the FNA biopsy, as well as those samples in which it was impossible to define with absolute certainty that the two samples were matching, due to more than one nodule in the same patient or to a lack of details in the description of the reports, were excluded.
All selected samples had their respective FNA cytology slides reviewed by a second independent pathologist. The reviews were blinded: the reviewing pathologist was not aware of the original classification. Samples whose review reports were concordant with the original were immediately included in the study. Samples with a review discordant with the original report were subjected to blinded review by an independent third pathologist in order to reach a decision. If the opinion of the third pathologist was concordant with either of the two previous opinions, the sample was included in the study with that classification. If it was discordant with both previous opinions, the sample was excluded from the study. To confirm the postsurgical pathology reports, the histological slides of the nodules corresponding to the FNA smears were also reviewed according to the same approach.
For the development and training phase of the mir-THYpe test (training set), only cryopreserved and FFPE (formalin-fixed paraffin-embedded) tissues were used. For the validation phase of the mir-THYpe test (validation set), only FNA cytology smear slides were used, and they were used only when at least two representative slides were available: one was to be used in the study, and the other was to be kept in the patient records. None of the samples used in the validation set were used for algorithm development or training.
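The review rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the software used in the study: the function name `consensus_classification`, the string labels for Bethesda classes, and the `INDETERMINATE` set are hypothetical. The final filter on indeterminate classes reflects the exclusion, reported in the Results, of samples whose consensus was Bethesda class II or VI.

```python
# Bethesda classes eligible for the study (indeterminate cytology).
INDETERMINATE = {"III", "IV", "V"}


def consensus_classification(original, second, third=None):
    """Return the consensus Bethesda class, or None if the sample is excluded.

    original: class from the original cytology report
    second:   class from the blinded second review
    third:    class from the blinded third review (needed only on discordance)
    """
    if second == original:
        # Concordant reviews: keep the original classification.
        consensus = original
    elif third is None:
        raise ValueError("discordant reviews require a third opinion")
    elif third in (original, second):
        # The third opinion agrees with one of the two previous opinions.
        consensus = third
    else:
        # Three different opinions: the sample is excluded.
        return None
    # Samples whose consensus is not indeterminate are also excluded.
    return consensus if consensus in INDETERMINATE else None
```

For example, a sample originally reported as class III and reviewed as class IV by both the second and third pathologists would enter the study as class IV.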
Demographic patient characteristics and clinical data are shown in Table 1.

RNA extraction
Prior to extraction, all cases were evaluated by a pathologist to determine the percentage of tumor tissue present in the sample. All samples used had more than 70% tumor tissue. Total RNA from the FFPE samples was isolated from two to six 10-µm tissue sections using the RecoverAll Total Nucleic Acid Isolation Kit for FFPE (Ambion, Carlsbad, CA). Total RNA extraction from cryopreserved samples was performed using the QIAsymphony automated system (QIAGEN, Hilden, Germany), using the microRNA enrichment extraction protocol. Total RNA extraction from the FNA cytology smear slides (stained by Papanicolaou, Giemsa, or Diff-Quik methods) was performed with the TRIzol reagent (Thermo Scientific, Waltham, MA) protocol. The cells were scraped only from marked zones delimited by a cytopathologist, with a minimum of six groups of 10 thyroid cells being acceptable to perform the protocol. Quantification of the total RNA extracted was performed using the Qubit 2.0 fluorometer (Thermo Scientific, Waltham, MA). RNA was resuspended to a final volume of 30 µL of ultrapure DNase/RNase-free water. The entire process was carried out according to the techniques established in the RNA manipulation protocols, in a dedicated laboratory environment, and with dedicated tubes, pipettes, tips, and supports that had been previously cleaned with RNaseZap (Ambion, Carlsbad, CA). All steps were performed according to the manufacturer's instructions.

Reverse transcription cDNA synthesis and preamplification
Twenty nanograms of total RNA from each sample were used for cDNA synthesis. The reaction used a pool of specific reverse transcription (RT) predesigned inventoried primers (Thermo Scientific, Waltham, MA) to select only the miRNAs of interest (see the list of targets in Supplementary Table S1; Supplementary Data are available online at www.liebertpub.com/thy), and the reactions were performed using the TaqMan microRNA Reverse Transcription Kit (Thermo Scientific, Waltham, MA). Subsequently, the RT-PCR products were subjected to the preamplification stage, with a pool of specific predesigned TaqMan inventoried assays (Thermo Scientific, Waltham, MA) and the Pre-Amp Master Mix 2× (Applied Biosystems, Carlsbad, CA). All reactions were performed according to the manufacturer's instructions.

Real-time PCR
For the training set, customized TaqMan Low-Density Array (TLDA) 384-well microfluidic cards with inventoried
For the training set, we selected an exploratory panel of 96 miRNA candidates to be analyzed on the TLDA cards. For this first panel, we combined two different approaches. First, we used the FirePlex Discovery Engine (www.fireflybio.com) to filter miRNAs that were frequently cited in the literature using "thyroid nodules," "thyroid cancer," "thyroid biomarker," and "thyroid miRNA" as keywords. Second, we performed a comprehensive search of PubMed (www.ncbi.nlm.nih.gov/pubmed), looking not only for the frequencies of the miRNAs cited in relevant studies but also for the targets that were described as differentially expressed between benign and malignant nodules or as good candidates for normalizers (housekeeping targets).
All targets that were not amplified in at least 95% of the cryopreserved and FFPE samples used for the training set were excluded and were not considered as candidates for the mir-THYpe algorithm. The 10 targets with the lowest Ct standard deviations across all benign and malignant samples were assigned as normalizer candidates. All remaining miRNAs were assigned as discriminator candidates.
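The candidate filtering and assignment step can be sketched as follows. This is a minimal illustration under assumptions, not the authors' pipeline: the function name `split_candidates`, the input layout (a dict of Ct lists with `None` marking a non-amplified sample), and the choice to compute the standard deviation over amplified samples only are all hypothetical.

```python
from statistics import stdev


def split_candidates(ct_by_mirna, min_detect_rate=0.95, n_normalizers=10):
    """Split miRNA targets into normalizer and discriminator candidates.

    ct_by_mirna: dict mapping a miRNA name to its list of Ct values across
                 training samples; None marks a sample with no amplification.
    Returns (normalizers, discriminators) as lists of miRNA names.
    """
    kept = {}
    for name, cts in ct_by_mirna.items():
        detected = [ct for ct in cts if ct is not None]
        # Exclude targets not amplified in at least 95% of the samples.
        if len(detected) / len(cts) >= min_detect_rate:
            kept[name] = detected

    # Rank the surviving targets by Ct standard deviation, most stable first.
    ranked = sorted(kept, key=lambda name: stdev(kept[name]))
    return ranked[:n_normalizers], ranked[n_normalizers:]
```

With the study's numbers, 65 surviving targets and `n_normalizers=10` would give 10 normalizer candidates and 55 discriminator candidates.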

miRNA-based thyroid molecular classifier algorithm development and training
For each sample from the training set (FFPE and cryopreserved), we produced a series of features by normalizing the Ct value of each discriminator candidate to all possible combinations of the average of Ct values from the 10 normalizer candidates, expressed as an exponential delta cycle threshold, $2^{\Delta Ct} = 2^{(\overline{Ct}_{\text{normalizers}} - Ct_{\text{discriminator}})}$. The best features were selected using a mutual information filter-based method (24), and classifiers were generated using tree-based algorithms (25). The algorithm tree with the best clinical performance, based on the 10-fold cross-validation analysis, was locked and then challenged with the normalized data obtained from the samples of the validation set cohort (FNA cytology smear slides). The performance of the classifier algorithm tree was evaluated using the NPV, PPV, sensitivity, specificity, negative and positive likelihood ratios, accuracy, and area under the curve. Supplementary Figure S1 shows an unsupervised hierarchical clustering analysis of normalized expression data from the discriminator miRNAs used in the mir-THYpe algorithm.
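The feature construction described above can be sketched as follows. This is an illustration only: the function name `exp_delta_ct_features` and the data layout are hypothetical, and "all possible combinations" is interpreted here as every non-empty subset of the normalizer candidates, which is an assumption. The later feature selection and tree-building steps are not shown.

```python
from itertools import combinations


def exp_delta_ct_features(sample_ct, normalizers, discriminators):
    """Build exponential delta-Ct features for one sample.

    sample_ct: dict mapping a miRNA name to its Ct value in this sample.
    Returns a dict keyed by (discriminator, normalizer_subset) holding
    2 ** (mean Ct of the normalizer subset - Ct of the discriminator).
    """
    features = {}
    for size in range(1, len(normalizers) + 1):
        for subset in combinations(normalizers, size):
            mean_norm_ct = sum(sample_ct[name] for name in subset) / size
            for disc in discriminators:
                delta_ct = mean_norm_ct - sample_ct[disc]
                features[(disc, subset)] = 2 ** delta_ct
    return features
```

Under this interpretation, 10 normalizer candidates give 2^10 - 1 = 1023 subsets, so 55 discriminator candidates would yield 56,265 candidate features per sample, which is why a filter-based selection step is needed before training.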

Statistical analyses
Statistical analyses were performed using the R software, an open-source statistical programming environment. Confidence intervals for sensitivity, specificity, and accuracy were "exact" Clopper-Pearson confidence intervals. Confidence intervals for the likelihood ratios were calculated using the "Log method" (26). Confidence intervals for the predictive values were the standard logit confidence intervals (27). Bayes' theorem analysis was based on Hall (28).
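For readers unfamiliar with the exact Clopper-Pearson interval, the sketch below shows one way to compute it by inverting the binomial distribution numerically. It is written in Python with the standard library only, whereas the study used R; the function names and the bisection approach are illustrative assumptions, not the authors' code.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_decreasing(func, target, iterations=100):
    """Find p in [0, 1] with func(p) == target, for func decreasing in p."""
    low, high = 0.0, 1.0
    for _ in range(iterations):
        mid = (low + high) / 2
        if func(mid) > target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for x successes in n."""
    # Lower bound: the p at which P(X >= x) equals alpha / 2.
    lower = 0.0 if x == 0 else _solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper bound: the p at which P(X <= x) equals alpha / 2.
    upper = 1.0 if x == n else _solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper
```

As a usage example, a sensitivity of 94.6% in 37 malignant samples corresponds to 35 correctly classified samples, and `clopper_pearson(35, 37)` gives an interval of roughly 0.82 to 0.99.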

Results
Of the 1205 patients identified with both thyroid FNA cytology and postsurgical histology results available during the period of interest, 272 samples (22.5%) were classified as AUS/FLUS (Bethesda class III), FN/SFN (Bethesda class IV), or SUSP (Bethesda class V). Of these samples, 60 were excluded due to a lack of at least two representative FNA smear slides or the presence of multinodular thyroid glands for which the description of the punctured FNA nodule was not sufficient to correlate it with the available FFPE or cryopreserved tissue. Another 20 samples were excluded during the pathology review process because the final consensus was either "benign" (Bethesda class II) or "malignant" (Bethesda class VI), and a further 19 samples were excluded due to the low quality of the RNA extraction and/or real-time PCR amplification issues.
The remaining 173 samples were split into two cohorts. The first cohort, used to develop and train the mir-THYpe algorithm (training set), comprised 78 samples (from 78 patients), with 39 benign and 39 malignant samples (50% cancer prevalence); 50% of the samples were from FFPE, and 50% were from cryopreservation. The second cohort, used to validate the mir-THYpe algorithm ( Table S1). From the remaining 65 miRNA candidates, 10 were selected as normalizer candidates (numbers 56-65) based on the standard deviations, and the other 55 miRNAs (numbers 1-55) were used as discriminator candidates.

mir-THYpe algorithm performance
Training set. Table 2 summarizes all of the statistical parameters considered in the analysis of the clinical performance of the mir-THYpe algorithm. Table 3
Comparison with other molecular classifiers. We compiled the most recently published data and compared the statistical parameters reached by the FNA validation set cohorts of mir-THYpe and five other molecular classifiers (Table 4). We observed that the cancer prevalence in each study varied from 23.7% (Afirma GSC) to 52.6% (ThyroSeq v3), which made it difficult to evaluate and statistically compare the clinical performance of the tests. Using Bayes' theorem, we predicted the NPV (Table 5 and Fig. 1A) and the PPV (Fig. 1B) of each test at different cancer prevalence rates to enable a fair statistical comparison. At the cancer prevalence of the Afirma GSC study (23.7%), all tests were predicted to perform with an NPV greater than 90%. However, at the cancer prevalence of the ThyroSeq v3 study (52.6%), only the mir-THYpe, ThyroSeq v3, and ThyroidPrint tests were predicted to perform with an NPV greater than 90%.
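The Bayes' theorem prediction used for this comparison can be illustrated with a short sketch. The function name `predictive_values` is hypothetical; the formulas are the standard expressions for PPV and NPV given sensitivity, specificity, and prevalence, and the inputs in the usage note are the mir-THYpe validation values reported in this article.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Predict (PPV, NPV) at a given cancer prevalence using Bayes' theorem.

    All arguments are proportions between 0 and 1.
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv
```

Calling `predictive_values(0.946, 0.810, 0.526)` gives an NPV of about 0.931, and `predictive_values(0.946, 0.810, 0.237)` gives an NPV of about 0.980, which shows how strongly the predicted NPV depends on the prevalence of the cohort.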

Discussion
Here, we present the results of the development and validation of a new molecular classifier test in precision endocrinology for indeterminate thyroid nodules (mir-THYpe) that analyzes the expression profiles of 11 miRNAs obtained from the same FNA cytology smear slides used to classify the thyroid nodule as indeterminate. This approach has the advantage that there is no need for a repeat FNA, which may be required with other methods if samples are not collected prospectively. In order to focus on available cytology slides, our mir-THYpe test was developed to scrape only thyroid cells in marked zones delimited by a cytopathologist.
Since this is a retrospective study, we decided to apply very rigid sample inclusion criteria to guarantee that (1) the available FNA smear slide was truly correlated with the postsurgical nodule that was biobanked, (2) the assigned Bethesda description/class and histological subtyping
The distribution of malignant versus benign samples (cancer prevalence) in the training set differed from that in the validation set. The first cohort (50.0% cancer prevalence) was designed to provide the same amount of information from both classes and thus allow balanced learning during algorithm training; this was confirmed when the 10-fold cross-validation achieved high and similar values for sensitivity (89.7%) and specificity (92.3%). The second cohort (38.9% cancer prevalence) was designed to have a distribution similar to that observed at the Barretos Cancer Hospital (approximately 44% cancer prevalence) and to fall within the 20%-40% cancer prevalence range reported for indeterminate nodules in most clinical centers (20). The observed clinical performance in the validation set cohort, in which both the sensitivity (94.6%) and specificity (81.0%) were high and close to the values observed in the training set, is good evidence that the data from both cohorts are consistent. Although the training set data were generated only from postsurgical tissues and the validation set data only from FNA smear slides, the key to reaching this level of concordance between both cohorts was the use of good-quality normalization data, which was only possible by using, within our 11-miRNA panel, more normalizer biomarkers (6) than discriminator biomarkers (5).
Considering that the specificity is the ratio between the true negatives (TN) and the sum of the TN and the false positives (FP) (specificity = TN/[TN + FP]), and that virtually all FP patients will undergo thyroidectomy, the specificity is also a parameter for estimating the potential number of unnecessary surgeries that can be avoided if the test is performed for 100% of the patients with indeterminate thyroid nodules. In this situation, the mir-THYpe test has the potential to avoid up to 81.0% of unnecessary surgeries.
According to a review and meta-analysis published by Vargas-Salas and colleagues in 2018 (17), a sensitivity of 92% and a specificity of 80% appear to be ideal for an indeterminate thyroid nodule molecular classifier test to have an appropriate clinical performance within a wide cancer prevalence range (20%-40%), reaching at least an NPV of 94% and a PPV of 60% (17). According to these thresholds, only the mir-THYpe, ThyroSeq v3, and ThyroidPrint tests currently meet or exceed, within the same test, the proposed thresholds (see Table 4) and can be considered both "rule-out" and "rule-in" tests. Although the NPV is the preferred statistical parameter that clinicians consider when evaluating a thyroid molecular classifier test to reduce unnecessary surgeries, it is important to emphasize that this parameter, like the PPV, is strongly influenced by the cancer prevalence of the study (Fig. 1). The NPV prediction of the mir-THYpe test across the cancer prevalence rates of the other compared studies (Table 5) shows that the mir-THYpe is predicted to have an NPV lower than 94% (93.1%) only at a 52.6% cancer prevalence ("ThyroSeq v3" study).
It is important to caution that all the data used here to compare the latest version of each molecular classifier test were published very recently (Rosetta GX Reveal and ThyroidPrint in 2017; Afirma GSC and ThyroSeq v3 in 2018), except for ThyraMIR/ThyGenX, published in 2015. Therefore, limited information has been published about the prospective clinical performance of the latest version of each test, including the mir-THYpe test described here.
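The relation between specificity and avoidable surgeries can be made concrete with a small worked example. The counts used in the usage note (35 true positives, 2 false negatives, 47 true negatives, 11 false positives) are not quoted from the article's tables; they are a reconstruction, assumed from the validation cohort size (37 malignant, 58 benign) and the reported percentages. The function name `diagnostic_metrics` is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Compute standard diagnostic accuracy metrics from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_positive": sensitivity / (1 - specificity),
        "lr_negative": (1 - sensitivity) / specificity,
    }
```

With `diagnostic_metrics(tp=35, fn=2, tn=47, fp=11)`, the specificity is 47/58, about 81.0%: under these assumed counts, 47 of the 58 patients with benign nodules would receive a negative result and could potentially be spared surgery.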
In summary, the reported data suggest that the mir-THYpe test could be considered for use in clinical practice as a complementary tool to support a more informed clinical decision for patients with indeterminate thyroid nodules and potentially contribute to a reduction in the rates of unnecessary thyroid surgeries and unnecessary expenses for the healthcare system. Moreover, our data show that the mir-THYpe test can be considered both a malignancy "rule-in" and "rule-out" test, and can be readily performed using already available cytology slides at a significantly lower cost. Nevertheless, prospective studies assessing the mir-THYpe performance in the clinical setting should be carried out to validate its routine use in clinical practice and its cost-effectiveness.