PegasusIO Tutorial

import pegasusio as io
import pandas as pd

Case 1: Read `h5ad` file

We use the pbmc3k h5ad file from https://cellxgene-example-data.czi.technology/pbmc3k.h5ad as a demo. First, read it with PegasusIO:

data1 = io.read_input("pegasusio_test_cases/case1/pbmc3k.h5ad", genome = 'hg19')
data1

2020-06-05 09:35:14,667 - pegasusio.readwrite - INFO - h5ad file 'pegasusio_test_cases/case1/pbmc3k.h5ad' is loaded.
2020-06-05 09:35:14,668 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.28s.

MultimodalData object with 1 UnimodalData: 'hg19-rna'
    It currently binds to UnimodalData object hg19-rna

UnimodalData object with n_obs x n_vars = 2638 x 1838
    Genome: hg19; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes_by_counts', 'log1p_n_genes_by_counts', 'total_counts', 'log1p_total_counts', 'pct_counts_in_top_50_genes', 'pct_counts_in_top_100_genes', 'pct_counts_in_top_200_genes', 'pct_counts_in_top_500_genes', 'n_genes', 'percent_mito', 'n_counts', 'louvain', 'leiden'
    var: 'featureid', 'n_cells_by_counts', 'mean_counts', 'log1p_mean_counts', 'pct_dropout_by_counts', 'total_counts', 'log1p_total_counts', 'n_cells', 'highly_variable', 'means', 'dispersions', 'dispersions_norm'
    obsm: 'X_pca', 'X_umap', 'X_tsne', 'X_draw_graph_fr'
    varm: 'PCs'
    uns: 'draw_graph', 'leiden', 'louvain', 'neighbors', 'pca', 'genome', 'modality'

The gene-count matrix has 2638 cell barcodes and 1838 genes, and PegasusIO stores it as the default UnimodalData element within a MultimodalData object.

We can generate SCP-compatible outputs from it. These files are needed to import data into the Single Cell Portal (SCP):

io.write_output(data1, "pegasusio_test_cases/case1/pbmc3k", file_type = 'scp')

2020-06-05 09:35:14,721 - pegasusio.text_utils - INFO - Metadata file /Users/yy939/GitHub/pegasusio/notebooks/pegasusio_test_cases/case1/pbmc3k.scp.metadata.txt is written.
2020-06-05 09:35:14,736 - pegasusio.text_utils - INFO - Coordinate file /Users/yy939/GitHub/pegasusio/notebooks/pegasusio_test_cases/case1/pbmc3k.scp.X_pca.coords.txt is written.
2020-06-05 09:35:14,746 - pegasusio.text_utils - INFO - Coordinate file /Users/yy939/GitHub/pegasusio/notebooks/pegasusio_test_cases/case1/pbmc3k.scp.X_umap.coords.txt is written.
2020-06-05 09:35:14,758 - pegasusio.text_utils - INFO - Coordinate file /Users/yy939/GitHub/pegasusio/notebooks/pegasusio_test_cases/case1/pbmc3k.scp.X_tsne.coords.txt is written.
2020-06-05 09:35:14,774 - pegasusio.text_utils - INFO - Coordinate file /Users/yy939/GitHub/pegasusio/notebooks/pegasusio_test_cases/case1/pbmc3k.scp.X_draw_graph_fr.coords.txt is written.
2020-06-05 09:35:14,778 - pegasusio.text_utils - INFO - Barcode file /Users/yy939/GitHub/pegasusio/notebooks/pegasusio_test_cases/case1/pbmc3k.scp.barcodes.tsv is written.
2020-06-05 09:35:14,786 - pegasusio.text_utils - INFO - Feature file /Users/yy939/GitHub/pegasusio/notebooks/pegasusio_test_cases/case1/pbmc3k.scp.features.tsv is written.
2020-06-05 09:35:16,179 - pegasusio.text_utils - INFO - Matrix file /Users/yy939/GitHub/pegasusio/notebooks/pegasusio_test_cases/case1/pbmc3k.scp.matrix.mtx is written.
2020-06-05 09:35:16,180 - pegasusio.text_utils - INFO - write_scp_file is done.
2020-06-05 09:35:16,180 - pegasusio.readwrite - INFO - scp file 'pegasusio_test_cases/case1/pbmc3k' is written.
2020-06-05 09:35:16,181 - pegasusio.readwrite - INFO - Function 'write_output' finished in 1.50s.

We can also write the data in mtx format:

io.write_output(data1, "pegasusio_test_cases/case1/pbmc3k_mtx")

2020-06-05 09:35:19,647 - pegasusio.text_utils - INFO - /Users/yy939/GitHub/pegasusio/notebooks/pegasusio_test_cases/case1/pbmc3k_mtx/hg19-rna/matrix.mtx.gz is written.
2020-06-05 09:35:19,709 - pegasusio.text_utils - INFO - barcodes.tsv.gz is written.
2020-06-05 09:35:19,743 - pegasusio.text_utils - INFO - features.tsv.gz is written.
2020-06-05 09:35:19,744 - pegasusio.text_utils - INFO - Mtx for hg19-rna is written.
2020-06-05 09:35:19,744 - pegasusio.text_utils - INFO - Mtx files are written.
2020-06-05 09:35:19,745 - pegasusio.readwrite - INFO - mtx file 'pegasusio_test_cases/case1/pbmc3k_mtx' is written.
2020-06-05 09:35:19,745 - pegasusio.readwrite - INFO - Function 'write_output' finished in 3.56s.

To generate output in loom format:

io.write_output(data1, "pegasusio_test_cases/case1/pbmc3k.loom")

2020-06-05 09:35:21,730 - pegasusio.hdf5_utils - INFO - pegasusio_test_cases/case1/pbmc3k.loom is written.
2020-06-05 09:35:21,730 - pegasusio.readwrite - INFO - loom file 'pegasusio_test_cases/case1/pbmc3k.loom' is written.
2020-06-05 09:35:21,731 - pegasusio.readwrite - INFO - Function 'write_output' finished in 1.98s.

To generate output in zarr.zip format:

io.write_output(data1, "pegasusio_test_cases/case1/pbmc3k.zarr.zip")

2020-06-05 09:35:21,736 - pegasusio.zarr_utils - WARNING - Detected and removed pre-existing file pegasusio_test_cases/case1/pbmc3k.zarr.zip.
2020-06-05 09:35:21,834 - pegasusio.readwrite - INFO - zarr.zip file 'pegasusio_test_cases/case1/pbmc3k.zarr.zip' is written.
2020-06-05 09:35:21,835 - pegasusio.readwrite - INFO - Function 'write_output' finished in 0.10s.

Case 2: Process human and mouse mixture data with V3 chemistry

We use 10X data from http://cf.10xgenomics.com/samples/cell-exp/3.0.2/1k_hgmm_v3/1k_hgmm_v3_filtered_feature_bc_matrix.h5 for the demo.

data2 = io.read_input("pegasusio_test_cases/case2/1k_hgmm_v3_filtered_feature_bc_matrix.h5")
data2

2020-06-05 09:35:22,736 - pegasusio.readwrite - INFO - 10x file 'pegasusio_test_cases/case2/1k_hgmm_v3_filtered_feature_bc_matrix.h5' is loaded.
2020-06-05 09:35:22,737 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.90s.

MultimodalData object with 2 UnimodalData: 'hg19-rna', 'mm10-rna'
    It currently binds to UnimodalData object hg19-rna

UnimodalData object with n_obs x n_vars = 1063 x 57905
    Genome: hg19; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 
    var: 'featureid'
    obsm: 
    varm: 
    uns: 'genome', 'modality'

The MultimodalData object data2 holds two UnimodalData elements: one with key hg19-rna, which is the human data, and one with key mm10-rna, which is the mouse data. The human data is currently the default UnimodalData.

To make the mouse data the default instead, call select_data:

data2.select_data('mm10-rna')
data2

MultimodalData object with 2 UnimodalData: 'hg19-rna', 'mm10-rna'
    It currently binds to UnimodalData object mm10-rna

UnimodalData object with n_obs x n_vars = 1063 x 54232
    Genome: mm10; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 
    var: 'featureid'
    obsm: 
    varm: 
    uns: 'genome', 'modality'

Now write_output will generate output for the mouse data matrix:

io.write_output(data2, "pegasusio_test_cases/case2/mouse.h5ad")

2020-06-05 09:35:23,727 - pegasusio.readwrite - INFO - h5ad file 'pegasusio_test_cases/case2/mouse.h5ad' is written.
2020-06-05 09:35:23,727 - pegasusio.readwrite - INFO - Function 'write_output' finished in 0.98s.

Case 3: Read different file formats

PegasusIO can read data matrices in different formats. In this case, we demonstrate the csv, mtx, and loom formats. The data we use is from https://data.humancellatlas.org/explore/projects/cddab57b-6868-4be4-806f-395ed9dd635a/m/expression-matrices.

data3_csv = io.read_input("pegasusio_test_cases/case3/19dc248b-2e9d-4c52-8065-e681a61d1514.csv/expression.csv", genome = 'hg19')
data3_csv

2020-06-05 09:35:31,824 - pegasusio.readwrite - INFO - csv file 'pegasusio_test_cases/case3/19dc248b-2e9d-4c52-8065-e681a61d1514.csv/expression.csv' is loaded.
2020-06-05 09:35:31,825 - pegasusio.readwrite - INFO - Function 'read_input' finished in 8.09s.

MultimodalData object with 1 UnimodalData: 'hg19-rna'
    It currently binds to UnimodalData object hg19-rna

UnimodalData object with n_obs x n_vars = 2544 x 58347
    Genome: hg19; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 'genes_detected', 'file_uuid', 'file_version', 'total_umis', 'emptydrops_is_cell', 'barcode', 'cell_suspension.provenance.document_id', 'specimen_from_organism.provenance.document_id', 'derived_organ_ontology', 'derived_organ_label', 'derived_organ_parts_ontology', 'derived_organ_parts_label', 'cell_suspension.genus_species.ontology', 'cell_suspension.genus_species.ontology_label', 'donor_organism.provenance.document_id', 'donor_organism.human_specific.ethnicity.ontology', 'donor_organism.human_specific.ethnicity.ontology_label', 'donor_organism.diseases.ontology', 'donor_organism.diseases.ontology_label', 'donor_organism.development_stage.ontology', 'donor_organism.development_stage.ontology_label', 'donor_organism.sex', 'donor_organism.is_living', 'specimen_from_organism.organ.ontology', 'specimen_from_organism.organ.ontology_label', 'specimen_from_organism.organ_parts.ontology', 'specimen_from_organism.organ_parts.ontology_label', 'library_preparation_protocol.provenance.document_id', 'library_preparation_protocol.input_nucleic_acid_molecule.ontology', 'library_preparation_protocol.input_nucleic_acid_molecule.ontology_label', 'library_preparation_protocol.library_construction_method.ontology', 'library_preparation_protocol.library_construction_method.ontology_label', 'library_preparation_protocol.end_bias', 'library_preparation_protocol.strand', 'project.provenance.document_id', 'project.project_core.project_short_name', 'project.project_core.project_title', 'analysis_protocol.provenance.document_id', 'dss_bundle_fqid', 'bundle_uuid', 'bundle_version', 'analysis_protocol.protocol_core.protocol_id', 'analysis_working_group_approval_status'
    var: 'featureid', 'featuretype', 'chromosome', 'featurestart', 'featureend', 'isgene', 'genus_species'
    obsm: 
    varm: 
    uns: 'genome', 'modality'

data3_mtx = io.read_input("pegasusio_test_cases/case3/42468c97-1c5a-4c9f-86ea-9eaa1239445a.mtx", genome = 'hg19')
data3_mtx

2020-06-05 09:35:32,031 - pegasusio.text_utils - INFO - Detected mtx file in HCA DCP format.
2020-06-05 09:35:36,287 - pegasusio.readwrite - INFO - mtx file 'pegasusio_test_cases/case3/42468c97-1c5a-4c9f-86ea-9eaa1239445a.mtx' is loaded.
2020-06-05 09:35:36,288 - pegasusio.readwrite - INFO - Function 'read_input' finished in 4.46s.

MultimodalData object with 1 UnimodalData: 'hg19-rna'
    It currently binds to UnimodalData object hg19-rna

UnimodalData object with n_obs x n_vars = 2544 x 58347
    Genome: hg19; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 'genes_detected', 'file_uuid', 'file_version', 'total_umis', 'emptydrops_is_cell', 'barcode', 'cell_suspension.provenance.document_id', 'specimen_from_organism.provenance.document_id', 'derived_organ_ontology', 'derived_organ_label', 'derived_organ_parts_ontology', 'derived_organ_parts_label', 'cell_suspension.genus_species.ontology', 'cell_suspension.genus_species.ontology_label', 'donor_organism.provenance.document_id', 'donor_organism.human_specific.ethnicity.ontology', 'donor_organism.human_specific.ethnicity.ontology_label', 'donor_organism.diseases.ontology', 'donor_organism.diseases.ontology_label', 'donor_organism.development_stage.ontology', 'donor_organism.development_stage.ontology_label', 'donor_organism.sex', 'donor_organism.is_living', 'specimen_from_organism.organ.ontology', 'specimen_from_organism.organ.ontology_label', 'specimen_from_organism.organ_parts.ontology', 'specimen_from_organism.organ_parts.ontology_label', 'library_preparation_protocol.provenance.document_id', 'library_preparation_protocol.input_nucleic_acid_molecule.ontology', 'library_preparation_protocol.input_nucleic_acid_molecule.ontology_label', 'library_preparation_protocol.library_construction_method.ontology', 'library_preparation_protocol.library_construction_method.ontology_label', 'library_preparation_protocol.end_bias', 'library_preparation_protocol.strand', 'project.provenance.document_id', 'project.project_core.project_short_name', 'project.project_core.project_title', 'analysis_protocol.provenance.document_id', 'dss_bundle_fqid', 'bundle_uuid', 'bundle_version', 'analysis_protocol.protocol_core.protocol_id', 'analysis_working_group_approval_status'
    var: 'featureid', 'featuretype', 'chromosome', 'featurestart', 'featureend', 'isgene', 'genus_species'
    obsm: 
    varm: 
    uns: 'genome', 'modality'

data3_loom = io.read_input("pegasusio_test_cases/case3/pancreas.loom", genome = 'hg19')
data3_loom

2020-06-05 09:35:41,315 - pegasusio.readwrite - INFO - loom file 'pegasusio_test_cases/case3/pancreas.loom' is loaded.
2020-06-05 09:35:41,315 - pegasusio.readwrite - INFO - Function 'read_input' finished in 5.02s.

MultimodalData object with 1 UnimodalData: 'hg19-rna'
    It currently binds to UnimodalData object hg19-rna

UnimodalData object with n_obs x n_vars = 2544 x 58347
    Genome: hg19; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 'analysis_protocol.protocol_core.protocol_id', 'analysis_protocol.provenance.document_id', 'analysis_working_group_approval_status', 'barcode', 'bundle_uuid', 'bundle_version', 'cell_suspension.genus_species.ontology', 'cell_suspension.genus_species.ontology_label', 'cell_suspension.provenance.document_id', 'derived_organ_label', 'derived_organ_ontology', 'derived_organ_parts_label', 'derived_organ_parts_ontology', 'donor_organism.development_stage.ontology', 'donor_organism.development_stage.ontology_label', 'donor_organism.diseases.ontology', 'donor_organism.diseases.ontology_label', 'donor_organism.human_specific.ethnicity.ontology', 'donor_organism.human_specific.ethnicity.ontology_label', 'donor_organism.is_living', 'donor_organism.provenance.document_id', 'donor_organism.sex', 'dss_bundle_fqid', 'emptydrops_is_cell', 'file_uuid', 'file_version', 'genes_detected', 'library_preparation_protocol.end_bias', 'library_preparation_protocol.input_nucleic_acid_molecule.ontology', 'library_preparation_protocol.input_nucleic_acid_molecule.ontology_label', 'library_preparation_protocol.library_construction_method.ontology', 'library_preparation_protocol.library_construction_method.ontology_label', 'library_preparation_protocol.provenance.document_id', 'library_preparation_protocol.strand', 'project.project_core.project_short_name', 'project.project_core.project_title', 'project.provenance.document_id', 'specimen_from_organism.organ.ontology', 'specimen_from_organism.organ.ontology_label', 'specimen_from_organism.organ_parts.ontology', 'specimen_from_organism.organ_parts.ontology_label', 'specimen_from_organism.provenance.document_id', 'total_umis'
    var: 'featureid', 'chromosome', 'featureend', 'featurestart', 'featuretype', 'genus_species', 'isgene'
    obsm: 
    varm: 
    uns: 'CreationDate', 'LOOM_SPEC_VERSION', 'last_modified', 'genome', 'modality'

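As the outputs above show, data3_csv, data3_mtx, and data3_loom are all MultimodalData objects, each holding a single UnimodalData element (hg19-rna) with the same 2544 x 58347 shape.

We can then write one of them, here data3_csv, in AnnData h5ad format: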

io.write_output(data3_csv, "pegasusio_test_cases/case3/pancreas.h5ad")

... storing 'genes_detected' as categorical
... storing 'total_umis' as categorical
... storing 'emptydrops_is_cell' as categorical
... storing 'barcode' as categorical
... storing 'specimen_from_organism.provenance.document_id' as categorical
... storing 'derived_organ_ontology' as categorical
... storing 'derived_organ_label' as categorical
... storing 'derived_organ_parts_ontology' as categorical
... storing 'derived_organ_parts_label' as categorical
... storing 'cell_suspension.genus_species.ontology' as categorical
... storing 'cell_suspension.genus_species.ontology_label' as categorical
... storing 'donor_organism.provenance.document_id' as categorical
... storing 'donor_organism.human_specific.ethnicity.ontology' as categorical
... storing 'donor_organism.human_specific.ethnicity.ontology_label' as categorical
... storing 'donor_organism.diseases.ontology' as categorical
... storing 'donor_organism.diseases.ontology_label' as categorical
... storing 'donor_organism.development_stage.ontology' as categorical
... storing 'donor_organism.development_stage.ontology_label' as categorical
... storing 'donor_organism.sex' as categorical
... storing 'donor_organism.is_living' as categorical
... storing 'specimen_from_organism.organ.ontology' as categorical
... storing 'specimen_from_organism.organ.ontology_label' as categorical
... storing 'specimen_from_organism.organ_parts.ontology' as categorical
... storing 'specimen_from_organism.organ_parts.ontology_label' as categorical
... storing 'library_preparation_protocol.provenance.document_id' as categorical
... storing 'library_preparation_protocol.input_nucleic_acid_molecule.ontology' as categorical
... storing 'library_preparation_protocol.input_nucleic_acid_molecule.ontology_label' as categorical
... storing 'library_preparation_protocol.library_construction_method.ontology' as categorical
... storing 'library_preparation_protocol.library_construction_method.ontology_label' as categorical
... storing 'library_preparation_protocol.end_bias' as categorical
... storing 'library_preparation_protocol.strand' as categorical
... storing 'project.provenance.document_id' as categorical
... storing 'project.project_core.project_short_name' as categorical
... storing 'project.project_core.project_title' as categorical
... storing 'bundle_version' as categorical
... storing 'analysis_protocol.protocol_core.protocol_id' as categorical
... storing 'analysis_working_group_approval_status' as categorical
... storing 'featuretype' as categorical
... storing 'chromosome' as categorical
... storing 'featurestart' as categorical
... storing 'featureend' as categorical
... storing 'isgene' as categorical
... storing 'genus_species' as categorical

2020-06-05 09:35:45,457 - pegasusio.readwrite - INFO - h5ad file 'pegasusio_test_cases/case3/pancreas.h5ad' is written.
2020-06-05 09:35:45,457 - pegasusio.readwrite - INFO - Function 'write_output' finished in 4.14s.

Case 4: Process multiple `zarr` files with Scrublet scores

In this case, we use two channels from human bone marrow data at https://data.humancellatlas.org/explore/projects/cc95ff89-2e68-4a08-a234-480eca21ce79: donor 1 channel 1, and donor 8 channel 8. Both channels have been processed with Scrublet to estimate doublet scores, and are stored in zarr format.

First, load the two files into memory:

data4_1 = io.read_input("pegasusio_test_cases/case4/MantonBM1_1_dbls.zarr")
data4_1

2020-06-05 09:35:45,507 - pegasusio.readwrite - INFO - zarr file 'pegasusio_test_cases/case4/MantonBM1_1_dbls.zarr' is loaded.
2020-06-05 09:35:45,508 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.05s.

MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 4274 x 19360
    Genome: GRCh38; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 'passed_qc', 'n_genes', 'n_counts', 'percent_mito', 'scrublet_scores'
    var: 'featureid', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features'
    obsm: 
    varm: 
    uns: 'genome', 'modality', 'scrublet_stats'

data4_2 = io.read_input("pegasusio_test_cases/case4/MantonBM8_8_dbls.zarr")
data4_2

2020-06-05 09:35:45,562 - pegasusio.readwrite - INFO - zarr file 'pegasusio_test_cases/case4/MantonBM8_8_dbls.zarr' is loaded.
2020-06-05 09:35:45,563 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.05s.

MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 4162 x 18178
    Genome: GRCh38; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 'passed_qc', 'n_genes', 'n_counts', 'percent_mito', 'scrublet_scores'
    var: 'featureid', 'n_cells', 'percent_cells', 'robust', 'highly_variable_features'
    obsm: 
    varm: 
    uns: 'genome', 'modality', 'scrublet_stats'

Both channels have over 4000 cell barcodes. Below are Scrublet scores of the first channel:

data4_1.obs['scrublet_scores']

barcodekey
AAACCTGAGCAGGTCA    0.022873
AAACCTGCACACTGCG    0.007703
AAACCTGCACCGGAAA    0.023813
AAACCTGCATAGACTC    0.054320
AAACCTGCATCGATGT    0.044107
                      ...   
TTTGTCAGTCCGCTGA    0.060842
TTTGTCATCAGTCAGT    0.010380
TTTGTCATCATGTAGC    0.005234
TTTGTCATCCGCTGTT    0.010812
TTTGTCATCCTCTAGC    0.012691
Name: scrublet_scores, Length: 4274, dtype: float64

Its Scrublet statistics can be retrieved as follows:

data4_1.uns['scrublet_stats']

{'detectable_doublet_fraction': 0.35423490875058494,
 'detected_doublet_rate': 0.013102480112306972,
 'overall_doublet_rate': 0.03698811096433289,
 'threshold': 0.24325254050006934}
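As a minimal sketch of how these two outputs relate, the snippet below compares the scores printed above against the reported threshold. It assumes the usual Scrublet convention that a score above the threshold marks a predicted doublet; the tutorial itself does not state this, and the helper name flag_doublets is hypothetical. Only the ten barcodes shown in the truncated printout are included.

```python
# Scores copied from the truncated Series printed above (ten barcodes only).
scores = {
    "AAACCTGAGCAGGTCA": 0.022873,
    "AAACCTGCACACTGCG": 0.007703,
    "AAACCTGCACCGGAAA": 0.023813,
    "AAACCTGCATAGACTC": 0.054320,
    "AAACCTGCATCGATGT": 0.044107,
    "TTTGTCAGTCCGCTGA": 0.060842,
    "TTTGTCATCAGTCAGT": 0.010380,
    "TTTGTCATCATGTAGC": 0.005234,
    "TTTGTCATCCGCTGTT": 0.010812,
    "TTTGTCATCCTCTAGC": 0.012691,
}

# 'threshold' entry of the scrublet_stats dictionary shown above.
threshold = 0.24325254050006934


def flag_doublets(scores, threshold):
    """Return barcodes whose score is strictly above the threshold.

    Assumption: scores above the threshold are treated as predicted doublets.
    """
    return [barcode for barcode, score in scores.items() if score > threshold]


flagged = flag_doublets(scores, threshold)
print(f"{len(flagged)} of {len(scores)} shown barcodes exceed the threshold")
```

None of the ten barcodes shown exceeds the threshold, which fits the low detected_doublet_rate reported for this channel.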

We can also aggregate the two channels into one data matrix using PegasusIO's aggregate_matrices function. To do that, we first need to prepare a sample sheet in csv format:

sheet4 = pd.read_csv("pegasusio_test_cases/case4/count_matrix.csv")
sheet4

The sample sheet should have at least two columns: Sample, which gives the sample name, and Location, which gives the path to the sample's gene-count matrix file.
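As an illustration of the sample sheet layout described above, the following sketch writes a two-row sheet with Python's csv module and reads it back. The Location values are the two zarr paths used in this case; the Sample names are hypothetical, since the contents of the actual count_matrix.csv are not shown here. The file is written to a temporary directory so that it does not overwrite the real sheet.

```python
import csv
import os
import tempfile

# Sample names are made up for illustration; Locations come from this case.
rows = [
    {"Sample": "MantonBM1_1",
     "Location": "pegasusio_test_cases/case4/MantonBM1_1_dbls.zarr"},
    {"Sample": "MantonBM8_8",
     "Location": "pegasusio_test_cases/case4/MantonBM8_8_dbls.zarr"},
]

# Write the sheet with the two required columns as the header row.
sheet_path = os.path.join(tempfile.mkdtemp(), "count_matrix.csv")
with open(sheet_path, "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["Sample", "Location"])
    writer.writeheader()
    writer.writerows(rows)

# Read it back to confirm the layout: one row per input file.
with open(sheet_path, newline="") as handle:
    sheet = list(csv.DictReader(handle))

for row in sheet:
    print(row["Sample"], row["Location"])
```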

Then use this sample sheet for data aggregation:

data4 = io.aggregate_matrices("pegasusio_test_cases/case4/count_matrix.csv")
data4

2020-06-05 09:35:45,673 - pegasusio.readwrite - INFO - zarr file 'pegasusio_test_cases/case4/MantonBM1_1_dbls.zarr' is loaded.
2020-06-05 09:35:45,674 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.05s.
2020-06-05 09:35:45,725 - pegasusio.readwrite - INFO - zarr file 'pegasusio_test_cases/case4/MantonBM8_8_dbls.zarr' is loaded.
2020-06-05 09:35:45,725 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.05s.
2020-06-05 09:35:46,296 - pegasusio.aggr_data - INFO - Function 'aggregate' finished in 0.56s.
2020-06-05 09:35:46,297 - pegasusio.data_aggregation - INFO - Aggregated 2 files.
2020-06-05 09:35:46,298 - pegasusio.data_aggregation - INFO - Function 'aggregate_matrices' finished in 0.68s.

MultimodalData object with 1 UnimodalData: 'GRCh38-rna'
    It currently binds to UnimodalData object GRCh38-rna

UnimodalData object with n_obs x n_vars = 8436 x 20381
    Genome: GRCh38; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 'passed_qc', 'n_genes', 'n_counts', 'percent_mito', 'scrublet_scores', 'Channel'
    var: 'featureid'
    obsm: 
    varm: 
    uns: 'genome', 'modality', 'var_dict', 'uns_dict'

You can see that data4 contains 8436 cells, the sum of the two channels (4274 + 4162).
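The aggregated matrix has 20381 genes, more than either channel alone (19360 and 18178). That is consistent with aggregation keeping the union of the channels' genes, although this tutorial does not say so explicitly. The toy sketch below illustrates that idea on tiny dictionaries. It is not PegasusIO's implementation: the function aggregate, the data layout, and the channel-name prefix on barcodes are all assumptions made for illustration.

```python
def aggregate(channels):
    """Stack cells from all channels over the union of their genes.

    `channels` maps a channel name to {barcode: {gene: count}}.
    Genes absent from a cell are filled with 0. Barcodes are prefixed
    with the channel name so that they stay unique after stacking.
    """
    genes = sorted(
        {gene
         for cells in channels.values()
         for counts in cells.values()
         for gene in counts}
    )
    rows = {}
    for name, cells in channels.items():
        for barcode, counts in cells.items():
            rows[f"{name}-{barcode}"] = [counts.get(gene, 0) for gene in genes]
    return genes, rows


# Two toy channels that share one gene and one barcode.
toy = {
    "ch1": {"AAAC": {"CD3E": 2, "MS4A1": 0}, "AAAG": {"CD3E": 1, "MS4A1": 4}},
    "ch2": {"AAAC": {"CD3E": 5, "NKG7": 3}},
}
genes, rows = aggregate(toy)
print(genes)
for barcode, values in rows.items():
    print(barcode, values)
```

In this toy case the result has three cells (2 + 1) and three genes, even though each channel has only two genes.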

Case 5: Data aggregation with filtering

In this case, we demonstrate aggregating data matrices with quality-control filtering settings. We use mouse lung cells from the Mouse Cell Atlas paper, specifically samples "Lung 1", "Lung 2", and "Lung 3" in DGE format.

As in Case 4, first prepare a sample sheet:

sheet5 = pd.read_csv("pegasusio_test_cases/case5/count_matrix.csv")
sheet5

In detail, the Location column lists the following three files:

for _, row in sheet5.iterrows():
    print(row['Location'])

pegasusio_test_cases/case5/Lung1_rm.batch_dge.txt.gz
pegasusio_test_cases/case5/Lung2_rm.batch_dge.txt.gz
pegasusio_test_cases/case5/Lung3_rm.batch_dge.txt.gz

Now we can aggregate the three samples with quality-control filtering settings:

data5 = io.aggregate_matrices("pegasusio_test_cases/case5/count_matrix.csv", 
                              default_ref = 'mm10', 
                              append_sample_name = False,
                              min_genes = 500,
                              max_genes = 6000,
                              mito_prefix = 'mt-',
                              percent_mito = 20)
data5

2020-06-05 09:35:47,452 - pegasusio.readwrite - INFO - tsv file 'pegasusio_test_cases/case5/Lung1_rm.batch_dge.txt.gz' is loaded.
2020-06-05 09:35:47,453 - pegasusio.readwrite - INFO - Function 'read_input' finished in 1.12s.
2020-06-05 09:35:48,050 - pegasusio.readwrite - INFO - tsv file 'pegasusio_test_cases/case5/Lung2_rm.batch_dge.txt.gz' is loaded.
2020-06-05 09:35:48,051 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.60s.
2020-06-05 09:35:48,146 - pegasusio.qc_utils - INFO - After filtration, 1589 out of 2835 cell barcodes are kept in UnimodalData object mm10-rna.
2020-06-05 09:35:49,487 - pegasusio.readwrite - INFO - tsv file 'pegasusio_test_cases/case5/Lung3_rm.batch_dge.txt.gz' is loaded.
2020-06-05 09:35:49,488 - pegasusio.readwrite - INFO - Function 'read_input' finished in 1.34s.
2020-06-05 09:35:49,575 - pegasusio.qc_utils - INFO - After filtration, 860 out of 1796 cell barcodes are kept in UnimodalData object mm10-rna.
2020-06-05 09:35:49,662 - pegasusio.qc_utils - INFO - After filtration, 1308 out of 4485 cell barcodes are kept in UnimodalData object mm10-rna.
2020-06-05 09:35:49,954 - pegasusio.aggr_data - INFO - Function 'aggregate' finished in 0.29s.
2020-06-05 09:35:49,955 - pegasusio.data_aggregation - INFO - Aggregated 3 files.
2020-06-05 09:35:49,960 - pegasusio.data_aggregation - INFO - Function 'aggregate_matrices' finished in 3.64s.

MultimodalData object with 1 UnimodalData: 'mm10-rna'
    It currently binds to UnimodalData object mm10-rna

UnimodalData object with n_obs x n_vars = 3757 x 23450
    Genome: mm10; Modality: rna
    It contains 1 matrices: 'X'
    It currently binds to matrix 'X' as X

    obs: 'n_genes', 'n_counts', 'percent_mito', 'Channel'
    var: 
    obsm: 
    varm: 
    uns: 'genome', 'modality'

We keep cells with:

500 $\leq$ number of expressed genes $<$ 6000, and
percentage of mitochondrial genes $\leq$ 20%.

To evaluate the second criterion, we also need to specify the name prefix of mitochondrial genes (mito_prefix = 'mt-').
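The two criteria can be sketched in plain Python as follows. This is an illustration under stated assumptions, not PegasusIO's code: the helper names percent_mito and passes_qc are hypothetical, and the mitochondrial percentage is assumed to be computed over counts of genes whose names start with the prefix. The toy cells use small thresholds so that the example stays readable.

```python
def percent_mito(counts, mito_prefix="mt-"):
    """Percentage of a cell's total counts that come from mitochondrial genes."""
    total = sum(counts.values())
    mito = sum(n for gene, n in counts.items() if gene.startswith(mito_prefix))
    return 100.0 * mito / total if total > 0 else 0.0


def passes_qc(counts, min_genes=500, max_genes=6000,
              mito_prefix="mt-", max_percent_mito=20.0):
    """Apply min_genes <= n_genes < max_genes and percent_mito <= cutoff."""
    n_genes = sum(1 for n in counts.values() if n > 0)
    if not (min_genes <= n_genes < max_genes):
        return False
    return percent_mito(counts, mito_prefix) <= max_percent_mito


# Toy cells as {gene: count}; thresholds are scaled down for illustration.
cell_ok = {"Actb": 5, "Gapdh": 3, "mt-Co1": 1}         # 3 genes, about 11% mito
cell_high_mito = {"Actb": 1, "Gapdh": 1, "mt-Co1": 3}  # 3 genes, 60% mito
cell_few_genes = {"Actb": 5}                           # only 1 gene

for cell in (cell_ok, cell_high_mito, cell_few_genes):
    print(passes_qc(cell, min_genes=2, max_genes=4))
```

Note that the lower bound on genes is inclusive and the upper bound is exclusive, matching the inequalities above.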

We also do not add the sample name as a prefix to cell barcodes after aggregation (append_sample_name = False), because in this case all the barcodes are already distinct.

For details on these parameters, please see the PegasusIO documentation.

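Case 6: Process RNA + CITE-Seq + TCR + BCR data

In this case, we show how to manipulate data across different protocols/omics. We use three files, whose names appear in the loading log below: a 10x filtered feature-barcode matrix (vdj_v1_hs_pbmc2_5gex_protein_filtered_feature_bc_matrix.h5), T cell contig annotations (vdj_v1_hs_pbmc2_t_filtered_contig_annotations.csv), and B cell contig annotations (vdj_v1_hs_pbmc2_b_filtered_contig_annotations.csv).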

First, aggregate them using the following sample sheet:

sheet6 = pd.read_csv("pegasusio_test_cases/case6/count_matrix.csv")
sheet6

Notice that the sample names are the same this time.

data6 = io.aggregate_matrices("pegasusio_test_cases/case6/count_matrix.csv")
data6

2020-06-05 09:35:50,970 - pegasusio.readwrite - INFO - 10x file 'pegasusio_test_cases/case6/vdj_v1_hs_pbmc2_5gex_protein_filtered_feature_bc_matrix.h5' is loaded.
2020-06-05 09:35:50,970 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.98s.
2020-06-05 09:35:51,043 - pegasusio.readwrite - INFO - csv file 'pegasusio_test_cases/case6/vdj_v1_hs_pbmc2_t_filtered_contig_annotations.csv' is loaded.
2020-06-05 09:35:51,044 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.07s.
2020-06-05 09:35:51,083 - pegasusio.readwrite - INFO - csv file 'pegasusio_test_cases/case6/vdj_v1_hs_pbmc2_b_filtered_contig_annotations.csv' is loaded.
2020-06-05 09:35:51,084 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.04s.
2020-06-05 09:35:51,137 - pegasusio.multimodal_data - INFO - After filtration, 8258 out of 8258 cell barcodes are kept in UnimodalData object GRCh38-citeseq.
2020-06-05 09:35:51,193 - pegasusio.multimodal_data - INFO - After filtration, 2987 out of 3009 cell barcodes are kept in UnimodalData object GRCh38-tcr.
2020-06-05 09:35:51,250 - pegasusio.multimodal_data - INFO - After filtration, 1185 out of 1202 cell barcodes are kept in UnimodalData object GRCh38-bcr.
2020-06-05 09:35:51,459 - pegasusio.aggr_data - INFO - Function 'aggregate' finished in 0.20s.
2020-06-05 09:35:51,460 - pegasusio.data_aggregation - INFO - Aggregated 3 files.
2020-06-05 09:35:51,460 - pegasusio.data_aggregation - INFO - Function 'aggregate_matrices' finished in 1.47s.

MultimodalData object with 4 UnimodalData: 'GRCh38-citeseq', 'GRCh38-rna', 'GRCh38-tcr', 'GRCh38-bcr'
    It currently binds to CITESeqData object GRCh38-citeseq

CITESeqData object with n_obs x n_vars = 8258 x 17
    Genome: GRCh38; Modality: citeseq
    It contains 1 matrices: 'raw.count'
    It currently binds to matrix 'raw.count' as X

    obs: 'Channel'
    var: 'featureid', '_control_id'
    obsm: 
    varm: 
    uns: 'genome', 'modality', '_control_names', '_control_counts', '_obs_keys'

data6 has 4 UnimodalData elements: GRCh38-citeseq for CITE-Seq data, GRCh38-rna for RNA data, GRCh38-tcr for TCR data, and GRCh38-bcr for BCR data.

6.1. CITE-Seq

We first check the CITE-Seq data. Its antibody control list is constructed from information on the BioLegend website, as shown below:

antibody_control_sheet = pd.read_csv("pegasusio_test_cases/case6/antibody_control.csv")
antibody_control_sheet

Now load it into the CITE-Seq data and apply the arcsinh transform:

data6.select_data('GRCh38-citeseq')
data6.load_control_list("pegasusio_test_cases/case6/antibody_control.csv")
data6.arcsinh_transform()
data6

MultimodalData object with 4 UnimodalData: 'GRCh38-citeseq', 'GRCh38-rna', 'GRCh38-tcr', 'GRCh38-bcr'
    It currently binds to CITESeqData object GRCh38-citeseq

CITESeqData object with n_obs x n_vars = 8258 x 14
    Genome: GRCh38; Modality: citeseq
    It contains 2 matrices: 'raw.count', 'arcsinh.transformed'
    It currently binds to matrix 'arcsinh.transformed' as X

    obs: 'Channel'
    var: 'featureid', '_control_id'
    obsm: 
    varm: 
    uns: 'genome', 'modality', '_control_names', '_control_counts', '_obs_keys'
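After these calls the object reports 14 features instead of 17 and gains a second matrix, 'arcsinh.transformed'. For readers unfamiliar with the transform, here is a generic sketch of an arcsinh transform with a cofactor. The cofactor value of 5 and the function below are assumptions for illustration; this tutorial does not show which cofactor PegasusIO uses or how control counts enter the calculation.

```python
import math


def arcsinh_transform(counts, cofactor=5.0):
    """Apply asinh(x / cofactor) to each raw count.

    The cofactor of 5.0 is a hypothetical value chosen for this sketch.
    """
    return [math.asinh(count / cofactor) for count in counts]


raw = [0, 5, 50, 500]
transformed = arcsinh_transform(raw)

for count, value in zip(raw, transformed):
    print(f"{count:>4} -> {value:.3f}")
```

The transform maps zero to zero, is roughly linear for small counts, and grows like a logarithm for large counts, which compresses the wide dynamic range of antibody counts.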

6.2. TCR

data6.select_data('GRCh38-tcr')
data6

MultimodalData object with 4 UnimodalData: 'GRCh38-citeseq', 'GRCh38-rna', 'GRCh38-tcr', 'GRCh38-bcr'
    It currently binds to VDJData object GRCh38-tcr

VDJData object with n_obs x n_vars = 2987 x 50
    Genome: GRCh38; Modality: tcr
    It contains 10 matrices: 'high_confidence', 'length', 'reads', 'umis', 'v_gene', 'd_gene', 'j_gene', 'c_gene', 'cdr3', 'cdr3_nt'
    It currently binds to matrix 'umis' as X

    obs: 'is_cell', 'nTRA', 'nTRB', 'nTRD', 'nTRG', 'nMulti', 'Channel'
    var: 
    obsm: 
    varm: 
    uns: 'genome', 'modality', '_v_gene', '_d_gene', '_j_gene', '_c_gene', '_cdr3', '_cdr3_nt'

data6.get_chain('TRA')

data6.get_chain('TRB')

6.3. BCR

data6.select_data('GRCh38-bcr')
data6

MultimodalData object with 4 UnimodalData: 'GRCh38-citeseq', 'GRCh38-rna', 'GRCh38-tcr', 'GRCh38-bcr'
    It currently binds to VDJData object GRCh38-bcr

VDJData object with n_obs x n_vars = 1185 x 40
    Genome: GRCh38; Modality: bcr
    It contains 10 matrices: 'high_confidence', 'length', 'reads', 'umis', 'v_gene', 'd_gene', 'j_gene', 'c_gene', 'cdr3', 'cdr3_nt'
    It currently binds to matrix 'umis' as X

    obs: 'is_cell', 'nIGK', 'nIGL', 'nIGH', 'nMulti', 'Channel'
    var: 
    obsm: 
    varm: 
    uns: 'genome', 'modality', '_v_gene', '_d_gene', '_j_gene', '_c_gene', '_cdr3', '_cdr3_nt'

data6.get_chain('IGK')

data6.get_chain('IGL')

data6.get_chain('IGH')

Case 7: Process Flow Cytometry data

sheet7 = pd.read_csv("pegasusio_test_cases/case7/count_matrix.csv")
sheet7

for _, row in sheet7.iterrows():
    print(row['Location'])

pegasusio_test_cases/case7/PBMC8_30min_patient1_Reference.fcs
pegasusio_test_cases/case7/PBMC8_30min_patient2_Reference.fcs
pegasusio_test_cases/case7/PBMC8_30min_patient3_Reference.fcs

data7 = io.aggregate_matrices("pegasusio_test_cases/case7/count_matrix.csv")
data7.arcsinh_transform()
data7

2020-06-05 09:35:51,708 - pegasusio.readwrite - INFO - fcs file 'pegasusio_test_cases/case7/PBMC8_30min_patient1_Reference.fcs' is loaded.
2020-06-05 09:35:51,709 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.01s.
2020-06-05 09:35:51,727 - pegasusio.readwrite - INFO - fcs file 'pegasusio_test_cases/case7/PBMC8_30min_patient2_Reference.fcs' is loaded.
2020-06-05 09:35:51,727 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.02s.
2020-06-05 09:35:51,740 - pegasusio.readwrite - INFO - fcs file 'pegasusio_test_cases/case7/PBMC8_30min_patient3_Reference.fcs' is loaded.
2020-06-05 09:35:51,741 - pegasusio.readwrite - INFO - Function 'read_input' finished in 0.01s.
2020-06-05 09:35:51,866 - pegasusio.aggr_data - INFO - Function 'aggregate' finished in 0.11s.
2020-06-05 09:35:51,867 - pegasusio.data_aggregation - INFO - Aggregated 3 files.
2020-06-05 09:35:51,868 - pegasusio.data_aggregation - INFO - Function 'aggregate_matrices' finished in 0.17s.

MultimodalData object with 1 UnimodalData: 'unknown-cyto'
    It currently binds to CytoData object unknown-cyto

CytoData object with n_obs x n_vars = 28898 x 35
    Genome: unknown; Modality: cyto
    It contains 2 matrices: 'raw.data', 'arcsinh.transformed'
    It currently binds to matrix 'arcsinh.transformed' as X

    obs: 'Channel'
    var: 'featureid', '_control_id'
    obsm: '_controls'
    varm: 
    uns: 'genome', 'modality', 'uns_dict', '_control_names'
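For intuition about the `arcsinh.transformed` matrix: cytometry intensities are commonly rescaled with an inverse hyperbolic sine, arcsinh(x / cofactor), which is roughly linear near zero and logarithmic for large values. The NumPy sketch below illustrates that general transform only. The helper name `arcsinh_scale` and the cofactor of 5 are assumptions for illustration and are not taken from PegasusIO's implementation.

```python
import numpy as np


def arcsinh_scale(raw, cofactor=5.0):
    """Apply arcsinh(x / cofactor) element-wise (hypothetical helper)."""
    raw = np.asarray(raw, dtype=float)
    return np.arcsinh(raw / cofactor)


# Toy intensities, including a negative value as compensated data may contain.
raw = np.array([[0.0, 5.0, 50.0], [-5.0, 500.0, 5000.0]])
transformed = arcsinh_scale(raw)
print(transformed.round(3))
```

Unlike a plain logarithm, this transform is defined at zero and for negative values, which is one reason it is popular for cytometry data.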

Sample sheet for Case 5:
	Sample	Location
0	lung1	pegasusio_test_cases/case5/Lung1_rm.batch_dge....
1	lung2	pegasusio_test_cases/case5/Lung2_rm.batch_dge....
2	lung3	pegasusio_test_cases/case5/Lung3_rm.batch_dge....

Sample sheet for Case 6:
	Sample	Location
0	health	pegasusio_test_cases/case6/vdj_v1_hs_pbmc2_5ge...
1	health	pegasusio_test_cases/case6/vdj_v1_hs_pbmc2_t_f...
2	health	pegasusio_test_cases/case6/vdj_v1_hs_pbmc2_b_f...

	Antibody	Control
0	CD3_TotalSeqC	IgG1_control_TotalSeqC
1	CD19_TotalSeqC	IgG1_control_TotalSeqC
2	CD45RA_TotalSeqC	IgG2b_control_TotalSeqC
3	CD4_TotalSeqC	IgG1_control_TotalSeqC
4	CD8a_TotalSeqC	IgG1_control_TotalSeqC
5	CD14_TotalSeqC	IgG2a_control_TotalSeqC
6	CD16_TotalSeqC	IgG1_control_TotalSeqC
7	CD56_TotalSeqC	IgG1_control_TotalSeqC
8	CD25_TotalSeqC	IgG1_control_TotalSeqC
9	CD45RO_TotalSeqC	IgG2a_control_TotalSeqC
10	PD-1_TotalSeqC	IgG1_control_TotalSeqC
11	TIGIT_TotalSeqC	IgG2a_control_TotalSeqC
12	CD127_TotalSeqC	IgG1_control_TotalSeqC
13	CD15_TotalSeqC	IgG1_control_TotalSeqC

Output of data6.get_chain('TRA'):
	high_confidence	length	reads	umis	v_gene	d_gene	j_gene	c_gene	cdr3	cdr3_nt
barcodekey
health-AAACCTGAGACCACGA	True	521	1569	2	TRAV1-2	None	TRAJ12	TRAC	CAVMDSSYKLIF	TGTGCTGTGATGGATAGCAGCTATAAATTGATCTTC
health-AAACCTGAGGCTCTTA	True	518	2019	2	TRAV1-2	None	TRAJ33	TRAC	CAVKDSNYQLIW	TGTGCTGTGAAGGATAGCAACTATCAGTTAATCTGG
health-AAACCTGAGTGAACGC	True	504	2665	2	TRAV1-2	None	TRAJ35	TRAC	CAVCTI	TGTGCTGTCTGTACGATA
health-AAACCTGAGTTGTCGT	True	557	7528	7	TRAV12-2	None	TRAJ54	TRAC	CAVNLEIQGAQKLVF	TGTGCCGTGAACCTCGAAATTCAGGGAGCCCAGAAGCTGGTATTT
health-AAACCTGCAAACGTGG	False	0	0	0	None	None	None	None	None	None
...	...	...	...	...	...	...	...	...	...	...
health-TTTGTCAGTTGCCTCT	True	585	5663	5	TRAV14DV4	None	TRAJ49	TRAC	CAMREAGTGNQFYF	TGTGCAATGAGAGAGGCCGGGACCGGTAACCAGTTCTATTTT
health-TTTGTCAGTTTAGCTG	True	486	2727	3	TRAV35	None	TRAJ7	TRAC	CAGQLCYGNNRLAF	TGTGCTGGGCAGCTCTGCTATGGGAACAACAGACTCGCTTTT
health-TTTGTCATCAAGGCTT	True	741	1327	3	TRAV1-2	None	TRAJ28	TRAC	CAVRSTGTGAGSYQLTF	TGTGCTGTGAGATCGACGGGGACTGGGGCTGGGAGTTACCAACTCA...
health-TTTGTCATCATGGTCA	True	942	5027	6	TRAV1-2	None	TRAJ33	TRAC	CAALDSNYQLIW	TGTGCTGCCCTGGATAGCAACTATCAGTTAATCTGG
health-TTTGTCATCTCGTTTA	True	527	4486	4	TRAV1-2	None	TRAJ33	TRAC	CAVMDSNYQLIW	TGTGCTGTGATGGATAGCAACTATCAGTTAATCTGG

Output of data6.get_chain('TRB'):
	high_confidence	length	reads	umis	v_gene	d_gene	j_gene	c_gene	cdr3	cdr3_nt
barcodekey
health-AAACCTGAGACCACGA	True	584	5238	7	TRBV6-1	TRBD2	TRBJ2-1	TRBC2	CASSGLAGGYNEQFF	TGTGCCAGCAGTGGACTAGCGGGGGGCTACAATGAGCAGTTCTTC
health-AAACCTGAGGCTCTTA	True	551	3846	4	TRBV6-4	TRBD2	TRBJ2-3	TRBC2	CASSGVAGGTDTQYF	TGTGCCAGCAGTGGGGTAGCGGGAGGCACAGATACGCAGTATTTT
health-AAACCTGAGTGAACGC	True	674	3002	6	TRBV2	TRBD1	TRBJ1-2	TRBC1	CASNQGLNYGYTF	TGTGCCAGCAATCAGGGCCTTAACTATGGCTACACCTTC
health-AAACCTGAGTTGTCGT	True	676	8576	10	TRBV9	TRBD1	TRBJ1-6	TRBC1	CASSATGSGSPLHF	TGTGCCAGCAGCGCTACAGGGTCGGGTTCACCCCTCCACTTT
health-AAACCTGCAAACGTGG	True	695	17409	24	TRBV20-1	TRBD1	TRBJ2-3	TRBC2	CSGKGGTDTQYF	TGCAGTGGAAAGGGTGGCACAGATACGCAGTATTTT
...	...	...	...	...	...	...	...	...	...	...
health-TTTGTCAGTTGCCTCT	False	0	0	0	None	None	None	None	None	None
health-TTTGTCAGTTTAGCTG	True	764	21059	26	TRBV4-3	TRBD2	TRBJ2-5	TRBC2	CASSQAPISGAGETQYF	TGCGCCAGCAGCCAAGCCCCAATTAGCGGGGCCGGAGAGACCCAGT...
health-TTTGTCATCAAGGCTT	True	521	715	2	TRBV24-1	TRBD2	TRBJ2-5	TRBC2	CATSDPTSGGSQTQYF	TGTGCCACCAGTGACCCCACTAGCGGGGGGTCGCAGACCCAGTACTTC
health-TTTGTCATCATGGTCA	True	527	6829	6	TRBV20-1	TRBD2	TRBJ1-1	TRBC1	CSARGDGHTEAFF	TGCAGTGCTAGAGGGGACGGACACACTGAAGCTTTCTTT
health-TTTGTCATCTCGTTTA	True	542	6173	8	TRBV20-1	TRBD2	TRBJ2-5	TRBC2	CSATRLGREQETQYF	TGCAGTGCTACGCGACTAGGCCGAGAACAAGAGACCCAGTACTTC

Sample sheet for Case 4:
	Sample	Location
0	sample1	pegasusio_test_cases/case4/MantonBM1_1_dbls.zarr
1	sample2	pegasusio_test_cases/case4/MantonBM8_8_dbls.zarr

Output of data6.get_chain('IGK'):
	high_confidence	length	reads	umis	v_gene	d_gene	j_gene	c_gene	cdr3	cdr3_nt
barcodekey
health-AAACCTGAGAGCAATT	True	626	931	7	IGKV4-1	None	IGKJ4	IGKC	CQQYYSTPLTF	TGTCAGCAGTATTATAGTACTCCTCTCACTTTC
health-AAAGCAACATCACAAC	True	587	1823	16	IGKV1-12	None	IGKJ2	IGKC	CQQADSPPLF	TGTCAACAGGCTGACAGTCCCCCTCTTTTT
health-AAAGTAGAGTGACATA	True	570	3604	35	IGKV4-1	None	IGKJ3	IGKC	CQQYYSTPFTF	TGTCAGCAATATTATAGTACTCCATTCACTTTC
health-AAATGCCAGTGTTGAA	True	695	12547	113	IGKV4-1	None	IGKJ1	IGKC	CQQYYSTHRTF	TGTCAGCAATATTATAGCACTCATCGGACGTTC
health-AAATGCCCACCGCTAG	True	671	2229	24	IGKV1D-17	None	IGKJ2	IGKC	CLQHNSYPYTF	TGTCTACAGCATAATAGTTACCCGTACACTTTT
...	...	...	...	...	...	...	...	...	...	...
health-TTTGTCAAGTCATGCT	True	671	8218	77	IGKV3-15	None	IGKJ4	IGKC	CQQYNNWPPLTF	TGTCAGCAGTATAATAACTGGCCTCCCCTCACTTTC
health-TTTGTCACAAACCCAT	False	0	0	0	None	None	None	None	None	None
health-TTTGTCACAGCTCGAC	True	677	15418	142	IGKV1-12	None	IGKJ2	IGKC	CQQARSLPYTF	TGTCAACAGGCTCGCAGCCTCCCGTACACTTTT
health-TTTGTCACAGTAAGAT	True	679	5729	59	IGKV2D-40	None	IGKJ2	IGKC	CMQRIEFPYTF	TGCATGCAACGTATAGAGTTCCCGTACACTTTT
health-TTTGTCAGTCCAGTAT	False	0	0	0	None	None	None	None	None	None

Contents of sheet7:
	Sample	Location
0	PBMC1	pegasusio_test_cases/case7/PBMC8_30min_patient...
1	PBMC2	pegasusio_test_cases/case7/PBMC8_30min_patient...
2	PBMC3	pegasusio_test_cases/case7/PBMC8_30min_patient...

Output of data6.get_chain('IGH'):
	high_confidence	length	reads	umis	v_gene	d_gene	j_gene	c_gene	cdr3	cdr3_nt
barcodekey
health-AAACCTGAGAGCAATT	True	594	327	2	IGHV1-2	IGHD1-26	IGHJ4	IGHD	CARGNSGSYNRNWFFDYW	TGTGCGAGAGGCAATAGTGGGAGCTACAATCGAAATTGGTTCTTTG...
health-AAAGCAACATCACAAC	False	0	0	0	None	None	None	None	None	None
health-AAAGTAGAGTGACATA	True	652	3640	38	IGHV1-2	IGHD3-16	IGHJ5	IGHA1	CARVPGWGHNYFDPW	TGTGCGAGAGTCCCCGGTTGGGGACACAACTACTTCGACCCCTGG
health-AAATGCCAGTGTTGAA	True	691	2631	39	IGHV1-69-2	IGHD6-25	IGHJ4	IGHG2	CARDVPEGKAAILGYFDWW	TGTGCGAGAGATGTCCCAGAGGGAAAAGCGGCCATTTTAGGGTACT...
health-AAATGCCCACCGCTAG	True	521	3041	30	IGHV2-5	IGHD4-17	IGHJ4	IGHM	CAHRRYGDYDGDFDYW	TGTGCACACAGACGTTACGGTGACTACGACGGAGACTTTGACTACTGG
...	...	...	...	...	...	...	...	...	...	...
health-TTTGTCAAGTCATGCT	True	574	1161	16	IGHV3-7	IGHD4-4	IGHJ1	IGHM	CARAYFTVTTEGCFQHW	TGTGCGAGAGCTTACTTTACAGTAACTACCGAAGGATGCTTCCAGC...
health-TTTGTCACAAACCCAT	True	534	2474	31	IGHV4-39	IGHD6-19	IGHJ3	IGHM	CARDSSGWYADAFDIW	TGTGCGAGAGATAGCAGTGGCTGGTACGCGGATGCTTTTGATATCTGG
health-TTTGTCACAGCTCGAC	True	653	1886	35	IGHV1-24	IGHD4-17	IGHJ4	IGHG1	CVGQNGDYFDYW	TGTGTGGGGCAGAACGGTGACTACTTTGACTACTGG
health-TTTGTCACAGTAAGAT	True	587	2273	28	IGHV3-23	IGHD6-13	IGHJ4	IGHM	CAKRPDHSSSWYGRGFDYW	TGTGCGAAAAGGCCCGATCATAGCAGCAGCTGGTACGGTAGGGGTT...
health-TTTGTCAGTCCAGTAT	True	567	1351	14	IGHV4-31	IGHD6-6	IGHJ4	IGHM	CARDLGQLGHFDYW	TGTGCCAGAGATCTAGGGCAGCTCGGCCATTTTGACTACTGG