import pegasusio as io
import pandas as pd
h5ad
file¶We use pbmc3k h5ad file from https://cellxgene-example-data.czi.technology/pbmc3k.h5ad as demo. First, read it using PegasusIO:
data1 = io.read_input("pegasusio_test_cases/case1/pbmc3k.h5ad", genome = 'hg19')
data1
The gene-count matrix has 2638 cell barcodes and 1838 genes, and PegasusIO stores it as the default UnimodalData
element within a MultimodalData
object.
We can generate SCP compatible outputs from it. These files are needed when one imports data to Single-Cell Portal (SCP):
io.write_output(data1, "pegasusio_test_cases/case1/pbmc3k", file_type = 'scp')
We can also write the data in mtx
format:
io.write_output(data1, "pegasusio_test_cases/case1/pbmc3k_mtx")
Below is to generate loom
format output:
io.write_output(data1, "pegasusio_test_cases/case1/pbmc3k.loom")
Below is to generate zarr.zip
format output:
io.write_output(data1, "pegasusio_test_cases/case1/pbmc3k.zarr.zip")
We use 10X data from http://cf.10xgenomics.com/samples/cell-exp/3.0.2/1k_hgmm_v3/1k_hgmm_v3_filtered_feature_bc_matrix.h5 for the demo.
data2 = io.read_input("pegasusio_test_cases/case2/1k_hgmm_v3_filtered_feature_bc_matrix.h5")
data2
You can see that in the MultimodalData
object data2
, there are two UnimodalData
elements: one with key hg19-rna
, which is human data; the other with key mm10-rna
, which is mouse data. And currently the default UnimodalData
it refers to is the human data.
To reset the default UnimodalData
to mouse data, use the following method:
data2.select_data('mm10-rna')
data2
Now write_output
will generate output for the mouse data matrix:
io.write_output(data2, "pegasusio_test_cases/case2/mouse.h5ad")
PegasusIO can read data matrix in different formats. In this case, we demonstrate csv
, loom
, and mtx
formats. The data we use is from https://data.humancellatlas.org/explore/projects/cddab57b-6868-4be4-806f-395ed9dd635a/m/expression-matrices.
data3_csv = io.read_input("pegasusio_test_cases/case3/19dc248b-2e9d-4c52-8065-e681a61d1514.csv/expression.csv", genome = 'hg19')
data3_csv
data3_mtx = io.read_input("pegasusio_test_cases/case3/42468c97-1c5a-4c9f-86ea-9eaa1239445a.mtx", genome = 'hg19')
data3_mtx
data3_loom = io.read_input("pegasusio_test_cases/case3/pancreas.loom", genome = 'hg19')
data3_loom
As mentioned above, all data3_csv
, data3_mtx
, and data3_loom
are PegasusIO's UnimodalData
objects.
We can then write the object into AnnData
h5ad format:
io.write_output(data3_csv, "pegasusio_test_cases/case3/pancreas.h5ad")
zarr
files with Scrublet scores¶In this case, we use two channels from human bone marrow data at https://data.humancellatlas.org/explore/projects/cc95ff89-2e68-4a08-a234-480eca21ce79: donor 1 channel 1, and donor 8 channel 8. Both channels have been processed with Scrublet to estimate doublet scores, and are stored in zarr
format.
First, load two files into memory:
data4_1 = io.read_input("pegasusio_test_cases/case4/MantonBM1_1_dbls.zarr")
data4_1
data4_2 = io.read_input("pegasusio_test_cases/case4/MantonBM8_8_dbls.zarr")
data4_2
Both channels have over 4000 cell barcodes. Below are Scrublet scores of the first channel:
data4_1.obs['scrublet_scores']
And its scrublet stats information can be retrieved as below:
data4_1.uns['scrublet_stats']
We can also aggregate two channels into one data matrix using PegasusIO's aggregate_matrices
function. To do that, we need to first prepare a sample sheet in csv
format:
sheet4 = pd.read_csv("pegasusio_test_cases/case4/count_matrix.csv")
sheet4
The sample sheet should have at least two columns: Sample
, specifying sample name; Location
, specifying location of the sample's gene-count matrix file.
Then use this sample sheet for data aggregation:
data4 = io.aggregate_matrices("pegasusio_test_cases/case4/count_matrix.csv")
data4
You can see that data4
contains 8436 cells from both channels altogether.
In this case, we demonstrate aggregating data matrices with quality-control filtering settings. We use mouse lung cells from Mouse Cell Atlas paper. In particular, we use samples "Lung 1", "Lung 2", and "Lung 3" from DGE format file here.
Similarly as in Case 4, first prepare a sample sheet:
sheet5 = pd.read_csv("pegasusio_test_cases/case5/count_matrix.csv")
sheet5
In details, Location
column lists 3 files as the following:
for _, row in sheet5.iterrows():
print(row['Location'])
Now we can aggregate the three samples with quality-control filtering settings:
data5 = io.aggregate_matrices("pegasusio_test_cases/case5/count_matrix.csv",
default_ref = 'mm10',
append_sample_name = False,
min_genes = 500,
max_genes = 6000,
mito_prefix = 'mt-',
percent_mito = 20)
data5
We keep cells with:
Besides, we need to specify the name prefix of mitochondrial genes in order to calculate the second criterion.
We also don't append sample name as prefix to cell barcodes after aggregation, as in this case, all the barcodes are already distinct beforehand.
For details on these parameters, please see PegasusIO documentation.
In this case, we show how to manipulate data across different protocols/omics. We use the following data:
First, aggregate them using the following sample sheet:
sheet6 = pd.read_csv("pegasusio_test_cases/case6/count_matrix.csv")
sheet6
Notice that the sample names are the same this time.
data6 = io.aggregate_matrices("pegasusio_test_cases/case6/count_matrix.csv")
data6
data6
has 4 UnimodalData
elements: GRCh38-citeseq
for CITE-Seq data, GRCh38-rna
for RNA data, GRCh38-tcr
for TCR data, and GRCh38-bcr
for BCR data.
We first check CITE-Seq data. Its antibody control list is constructed based on information from biolegend website, as shown below:
antibody_control_sheet = pd.read_csv("pegasusio_test_cases/case6/antibody_control.csv")
antibody_control_sheet
Now load it to CITE-Seq data:
data6.select_data('GRCh38-citeseq')
data6.load_control_list("pegasusio_test_cases/case6/antibody_control.csv")
data6.arcsinh_transform()
data6
data6.select_data('GRCh38-tcr')
data6
data6.get_chain('TRA')
data6.get_chain('TRB')
data6.select_data('GRCh38-bcr')
data6
data6.get_chain('IGK')
data6.get_chain('IGL')
data6.get_chain('IGH')
sheet7 = pd.read_csv("pegasusio_test_cases/case7/count_matrix.csv")
sheet7
for _, row in sheet7.iterrows():
print(row['Location'])
data7 = io.aggregate_matrices("pegasusio_test_cases/case7/count_matrix.csv")
data7.arcsinh_transform()
data7