Chapter 9 Pubmon

9.1 2020-2021 PhD year 1

-202009

-202010

-202011

-202012

-202101

-202102

-202103

-202104

-202105

-202106

-202107

-202108

9.2 2021-2022 PhD year 2

The cell-level resolution data provide much richer information for answering a number of important questions that cannot otherwise be answered by bulk data, for example, the composition of cell types in complex tissues, the cell-to-cell heterogeneity in transcription, and the transcriptional dynamics in many biological processes such as development, differentiation, and disease progression.

There are several scientific goals in scRNA-seq studies. The first is to decipher the cellular composition of complex tissues: one wants to know the identities of the cell types and subtypes, as well as their proportions in the tissue sample. The cellular composition itself can be of great interest in biological and clinical practice; for example, tumor-infiltrating immune cell composition has been reported to play a vital role in understanding antitumor immune responses [4]. Once the cell types are identified, cell type–specific gene expression is also of great interest, since it improves the understanding of cell signatures [5]. Other goals include the discovery of new and rare cell types [6] and pseudo-time construction to represent the temporal dynamics of transcription during a biological process [7].

  • 202109

Evaluation of some aspects in supervised cell type identification for single-cell RNA-seq: classifier, feature selection, and reference construction (Ma, Su, and Wu 2021)

Background Cell type identification is one of the most important questions in single-cell RNA sequencing (scRNA-seq) data analysis. With the accumulation of public scRNA-seq data, supervised cell type identification methods have gained increasing popularity due to better accuracy, robustness, and computational performance. Despite all the advantages, the performance of the supervised methods relies heavily on several key factors: feature selection, prediction method, and, most importantly, choice of the reference dataset.

Results In this work, we perform extensive real data analyses to systematically evaluate these strategies in supervised cell identification. We first benchmark nine classifiers along with six feature selection strategies and investigate the impact of reference data size and number of cell types in cell type prediction. Next, we focus on how discrepancies between reference and target datasets and how data preprocessing such as imputation and batch effect correction affect prediction performance. We also investigate the strategies of pooling and purifying reference data.

Conclusions Based on our analysis results, we provide guidelines for using supervised cell typing methods. We suggest combining all individuals from available datasets to construct the reference dataset and using a multi-layer perceptron (MLP) as the classifier, along with the F-test as the feature selection method. All the code used for our analysis is available on GitHub (https://github.com/marvinquiet/RefConstruction_supervisedCelltyping).
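As a rough illustration of the F-test feature selection recommended above, the following minimal sketch ranks genes by their one-way ANOVA F-statistic across the annotated cell types of a reference and keeps the top-scoring ones. The function names, the toy expression values, and `top_k` are hypothetical and are not taken from the paper's code; the authors' actual pipeline may differ, for example in normalization and in the number of genes retained.

```python
from statistics import mean


def f_statistic(values, labels):
    """One-way ANOVA F-statistic for one gene across cell-type groups."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    n_cells, n_groups = len(values), len(groups)
    if n_groups < 2 or n_cells <= n_groups:
        raise ValueError("need at least two groups and more cells than groups")
    grand_mean = mean(values)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of cells around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    ms_between = ss_between / (n_groups - 1)
    ms_within = ss_within / (n_cells - n_groups)
    if ms_within == 0:
        return float("inf") if ms_between > 0 else 0.0
    return ms_between / ms_within


def select_top_genes(expr, labels, top_k):
    """Rank genes by F-statistic; expr maps gene name -> values, one per cell."""
    scores = {gene: f_statistic(values, labels) for gene, values in expr.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy reference: six cells with two annotated cell types (made-up values).
labels = ["T", "T", "T", "B", "B", "B"]
expr = {
    "geneA": [5.0, 6.0, 5.5, 0.5, 1.0, 0.0],  # high in T, low in B
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # roughly flat across types
    "geneC": [0.0, 1.0, 0.5, 3.0, 4.0, 3.5],  # low in T, high in B
}
selected = select_top_genes(expr, labels, top_k=2)
print(selected)
```

In this toy example the flat gene is dropped and the two genes that separate the cell types are kept. The selected genes would then be the input features for whichever classifier is used, such as the MLP suggested in the paper.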

Study design

There are several other supervised cell typing methods available for scRNA-seq. For example, scSorter [27] borrows information from lowly expressed marker genes to assign cells; scPred [12] adopts a principal component analysis (PCA)-based feature selection; SingleCellNet [28] uses top-pair transformation on gene space and selects informative paired genes as features; and CellAssign [29] builds a probabilistic model with some prior knowledge of cell markers. However, according to a recent comparison [20], SVM with rejection, scmap, and CHETAH are among the best performers, so we decide not to include more such methods. GEDFN is a method designed for predicting phenotype from bulk expression, but it can be directly applied to scRNA-seq cell typing. We include it because we want to understand whether incorporating gene network information can improve the results. ItClust is a semi-supervised method that uses the reference data only to obtain initial values for unsupervised clustering in the target data. MARS uses a meta-learning concept to construct cell type landmarks by jointly embedding both annotated and unannotated data without removing the batch effects, and then assigns cell types based on the learned embedding space. We want to evaluate the performance of these semi-supervised methods under different scenarios.
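The "with rejection" part of SVM with rejection can be sketched as a simple thresholding rule: a cell keeps its predicted label only when the classifier is confident enough, and is otherwise left unassigned. The snippet below is a minimal sketch under that assumption. The function name, the `unassigned` label, the probabilities, and the 0.7 threshold are illustrative choices, not values taken from the notes above.

```python
def predict_with_rejection(probabilities, threshold=0.7, unassigned="unassigned"):
    """Return the most probable cell type, or `unassigned` if confidence is low.

    probabilities: dict mapping cell type -> predicted probability for one cell.
    threshold: illustrative cut-off (an assumption, tune per dataset).
    """
    best_type = max(probabilities, key=probabilities.get)
    if probabilities[best_type] < threshold:
        return unassigned
    return best_type


# Made-up class probabilities for two cells from some upstream classifier.
cells = [
    {"T cell": 0.92, "B cell": 0.05, "NK cell": 0.03},  # confident call
    {"T cell": 0.45, "B cell": 0.15, "NK cell": 0.40},  # ambiguous call
]
calls = [predict_with_rejection(p) for p in cells]
print(calls)
```

A rejection option of this kind matters most when the target data contain cell types that are absent from the reference, since those cells would otherwise be forced into the nearest known type.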

SUPERGNOVA: local genetic correlation analysis reveals heterogeneous etiologic sharing of complex traits (Zhang et al. 2021)

Local genetic correlation quantifies the genetic similarity of complex traits in specific genomic regions. However, accurate estimation of local genetic correlation remains challenging, due to linkage disequilibrium in local genomic regions and sample overlap across studies. We introduce SUPERGNOVA, a statistical framework to estimate local genetic correlations using summary statistics from genome-wide association studies. We demonstrate that SUPERGNOVA outperforms existing methods through simulations and analyses of 30 complex traits. In particular, we show that the positive yet paradoxical genetic correlation between autism spectrum disorder and cognitive performance could be explained by two etiologically distinct genetic signatures with bidirectional local genetic correlations.

Figure: SUPERGNOVA workflow

Recovering genotypes and phenotypes using allele-specific genes (Gürsoy et al. 2021)

With the recent increase in RNA sequencing efforts using large cohorts of individuals, surveying allele-specific gene expression is becoming increasingly frequent. Here, we report that, despite not containing explicit variant information, a list of genes known to be allele-specific in an individual is enough to recover key variants and link the individuals back to their genotypes and phenotypes. This creates a privacy conundrum.

NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks (Ahsan et al. 2021)

Long-read sequencing enables variant detection in genomic regions that are considered difficult-to-map by short-read sequencing. To fully exploit the benefits of longer reads, here we present a deep learning method NanoCaller, which detects SNPs using long-range haplotype information, then phases long reads with called SNPs and calls indels with local realignment. Evaluation on 8 human genomes demonstrates that NanoCaller generally achieves better performance than competing approaches. We experimentally validate 41 novel variants in a widely used benchmarking genome, which could not be reliably detected previously. In summary, NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing.

  • 202110

A census of cell types in the brain’s motor cortex

A multimodal cell census and atlas of the mammalian primary motor cortex

  • important

NIH’s Brain Research through Advancing Innovative Neurotechnologies (BRAIN) Initiative - Cell Census Network (BICCN)

9.2.2 PPI topic

identifying cancer drivers

9.3 2023-2024 PhD year 4

References

Ahsan, Mian Umair, Qian Liu, Li Fang, and Kai Wang. 2021. “NanoCaller for Accurate Detection of SNPs and Indels in Difficult-to-Map Regions from Long-Read Sequencing by Haplotype-Aware Deep Neural Networks.” Genome Biology 22 (1): 1–33. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02472-2.
Gao, Teng, Ruslan Soldatov, Hirak Sarkar, Adam Kurkiewicz, Evan Biederstedt, Po-Ru Loh, and Peter Kharchenko. 2022. “Haplotype-Enhanced Inference of Somatic Copy Number Profiles from Single-Cell Transcriptomes.” bioRxiv.
Gürsoy, Gamze, Nancy Lu, Sarah Wagner, and Mark Gerstein. 2021. “Recovering Genotypes and Phenotypes Using Allele-Specific Genes.” Genome Biology 22 (1): 1–9. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02477-x.
Karlsson, Kasper, Moritz Przybilla, Hang Xu, Eran Kotler, Kremena Karagyozova, Alexandra Sockell, Katherine Liu, et al. 2022. “Experimental Evolution in Tp53 Deficient Gastric Organoids Recapitulates Tumorigenesis.” bioRxiv.
Ma, Wenjing, Kenong Su, and Hao Wu. 2021. “Evaluation of Some Aspects in Supervised Cell Type Identification for Single-Cell RNA-Seq: Classifier, Feature Selection, and Reference Construction.” Genome Biology 22 (1): 1–23. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02480-2.
Wei, Runmin, Siyuan He, Shanshan Bai, Emi Sei, Min Hu, Alastair Thompson, Ken Chen, Savitri Krishnamurthy, and Nicholas E Navin. 2022. “Spatial Charting of Single-Cell Transcriptomes in Tissues.” Nature Biotechnology, 1–10.
Williams, Marc J, Tyler Funnell, Ciara H O’Flanagan, Andrew McPherson, Sohrab Salehi, Ignacio Vázquez-García, Farhia Kabeer, et al. 2021. “Evolutionary Tracking of Cancer Haplotypes at Single-Cell Resolution.” bioRxiv.
Zhang, Yiliang, Qiongshi Lu, Yixuan Ye, Kunling Huang, Wei Liu, Yuchang Wu, Xiaoyuan Zhong, et al. 2021. “SUPERGNOVA: Local Genetic Correlation Analysis Reveals Heterogeneous Etiologic Sharing of Complex Traits.” Genome Biology 22 (1): 1–30. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02478-w.