NIRD: Network Inference by Reduced Dimensions

Overview

NIRD is a matrix factorization-based framework that decomposes high-dimensional expression data into interpretable modules in order to infer gene regulatory networks (GRNs). It is designed to scale to large single-cell RNA-sequencing datasets and to support robust identification of key regulatory signals and temporal dynamics.

Introduction

This vignette is a tutorial on using NIRD to interpret biological datasets with a network-based approach. NIRD helps researchers assess the contribution of regulatory components by transforming datasets into networks through dimensionality reduction and then assessing the overlapping structure of those networks with AUC metrics. The approach applies to a range of biological settings, such as disease vs. control comparisons and temporal dynamics.

Installation

Installing NIRD within a Conda environment is recommended.

Step 1: Create a Conda environment

conda create -n nird python=3.8 pip -y

OR

conda env create -f nird.yml

Step 2: Activate the Conda environment

conda activate nird

Step 3: Install required dependencies

Install all required libraries using the provided requirements.txt file:

pip install -r requirements.txt

Input Data Format

The NIRD tool requires a gene expression matrix as input. This matrix should be in CSV or TSV format. The matrix rows should represent samples (e.g. GSM IDs from GEO datasets) and columns should represent genes. Each cell should contain the expression value of a gene in a given sample (e.g., TPM, FPKM, or raw counts, depending on your preprocessing pipeline).

|            | CDC42BPA  | ARHGAP1   | CADM1     | CSNK1A1   | SLC25A3   |
| ---------- | --------- | --------- | --------- | --------- | --------- |
| GSM2172403 | 59.00775  | 14.55039  | 50.11628  | 31.77519  | 156.81395 |
| GSM2172457 | 0.74419   | 2.31783   | 573.31008 | 395.65891 | 119.65891 |
| GSM2172483 | 87.41085  | 44.32558  | 160.48837 | 63.75194  | 664.98450 |
| GSM2172489 | 68.16279  | 50.72868  | 84.61240  | 521.28682 | 0         |
| GSM2172507 | 208.06202 | 266.34884 | 351.10078 | 82.31008  | 163.42636 |
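
Before running NIRD it can help to confirm that a file really follows this samples-by-genes layout. The following is a minimal sketch using only the Python standard library; `read_expression_matrix` is a hypothetical helper written for this tutorial, not part of NIRD, and the toy input reuses values from the table above.

```python
import csv
import io

def read_expression_matrix(handle, delimiter=","):
    """Parse a samples-by-genes matrix; return (sample_ids, gene_names, values)."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    genes = header[1:]  # first header cell is the (possibly empty) sample-ID column
    if len(set(genes)) != len(genes):
        raise ValueError("duplicate gene names in header")
    samples, values = [], []
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(
                f"line {line_no}: expected {len(header)} fields, got {len(row)}"
            )
        samples.append(row[0])
        # float() raises ValueError on non-numeric cells
        values.append([float(x) for x in row[1:]])
    return samples, genes, values

# Toy input: first two samples and three genes from the example table.
toy = io.StringIO(
    ",CDC42BPA,ARHGAP1,CADM1\n"
    "GSM2172403,59.00775,14.55039,50.11628\n"
    "GSM2172457,0.74419,2.31783,573.31008\n"
)
samples, genes, values = read_expression_matrix(toy)
print(len(samples), "samples x", len(genes), "genes")
```

For a TSV file, pass `delimiter="\t"`.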

This tool also supports datasets that come with gold standard regulatory interactions, commonly used in benchmarks like DREAM5. You can use the provided script (Gold_Data) to load both expression data and gold standard regulatory networks for evaluation or training.

The gold standard file should contain three columns:

Regulator    Target    Label
Aff1         Wfdc18     1  
Arnt2        Abca13     1  
Atf3         Igf2       1  

Note: Duplicates are allowed in the file; they are handled internally by the tool.
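
If you want to inspect a gold standard file yourself, the sketch below reads the three columns and drops repeated rows. It is only an illustration of one way duplicates could be collapsed; how NIRD handles them internally is not shown here, and `read_gold_standard` is a hypothetical name.

```python
import io

def read_gold_standard(handle):
    """Return unique (regulator, target, label) triples in first-seen order."""
    seen = set()
    edges = []
    for line_no, line in enumerate(handle, start=1):
        fields = line.split()  # splits on tabs or runs of spaces
        if not fields:
            continue  # skip blank lines
        if [f.lower() for f in fields] == ["regulator", "target", "label"]:
            continue  # skip the header row if present
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 columns, got {len(fields)}")
        key = (fields[0], fields[1], int(fields[2]))
        if key not in seen:
            seen.add(key)
            edges.append(key)
    return edges

# Toy input based on the rows above, with one row repeated on purpose.
toy = io.StringIO(
    "Regulator\tTarget\tLabel\n"
    "Aff1\tWfdc18\t1\n"
    "Arnt2\tAbca13\t1\n"
    "Aff1\tWfdc18\t1\n"
    "Atf3\tIgf2\t1\n"
)
gold_edges = read_gold_standard(toy)
print(len(gold_edges), "unique interactions")
```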


🔄 Note: Support for Transcription Velocity Data (Expr-Velo Inference)

In addition to standard expression-expression (expr-expr) network inference, the Double_Expr.py script supports gene regulatory network inference from transcription velocity data. This lets the tool infer Expr-Velo networks that link steady-state expression with future transcriptional dynamics.

This feature enables NIRD to capture causal, time-lagged, or dynamic regulatory relationships that are not detectable from static expression data alone.

🧬 Input Format for Velocity Data

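The transcription velocity matrix should be a CSV file in which rows represent samples and columns represent genes, as in the expression matrix. Each cell holds the velocity value of a gene in a given sample, which may be negative: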

|            | LINC02593 | NOC2L   | C1orf159 | SDF4    | UBE2J2  |
| ---------- | --------- | ------- | -------- | ------- | ------- |
| SRR2978582 | -0.4771   | -0.4657 | -1.2073  | -0.0402 | 0.2096  |
| SRR2978568 | -0.3028   | -0.5002 | -0.6872  | -0.0317 | -0.1525 |
| SRR2978599 | -0.2604   | -0.5261 | -0.4434  | -0.0435 | 0.1729  |

Running NIRD

Once you’ve completed the primary setup, you’re ready to run the NIRD tool for network inference and evaluation.

The NIRD tool supports four different modes depending on the type of data available:



1. Single Expression Mode

Use this mode when only one expression dataset is available.

python NIRD.py \
--datasets single_expr \
--file1 MF_Datasets/mESC/smartSeq.csv \
--outdir inferred_networks

2. Double Expression Mode

Use this mode to infer and compare GRNs from two expression datasets.

python NIRD.py \
--datasets double_expr \
--file1 MF_Datasets/mESC/dropSeq.csv \
--file2 MF_Datasets/mESC/smartSeq.csv \
--outdir inferred_networks

3. Gold Data Mode

Use this mode when expression data, transcription factor data, and a gold standard network are available.

python NIRD.py \
--datasets gold_data \
--expr_file MF_Datasets/dream5/net2/dream5_net2_expression_data.tsv \
--tf_file MF_Datasets/dream5/net2/dream5_net2_transcription_factors.tsv \
--gold_file MF_Datasets/dream5/net2/dream5_net2_gold.tsv \
--outdir inferred_networks

4. NIRD_Velo Mode

Use this mode when time-course expression and RNA velocity data are available.

python NIRD_Velo.py \
--file1 MF_Datasets/transcription_velocity/00h_time_course_expr.csv \
--file2 MF_Datasets/transcription_velocity/0th_hr_endo_RNA_Velo.csv \
--outdir inferred_networks
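
When launching many runs from a script, it may be convenient to assemble the NIRD.py command line programmatically. The sketch below covers the three `--datasets` modes of NIRD.py shown above and uses only the flags that appear in those examples. `build_nird_command` and `REQUIRED_FLAGS` are hypothetical names introduced for illustration, not part of NIRD.

```python
import shlex

# Input flags used by each --datasets mode, taken from the examples above.
REQUIRED_FLAGS = {
    "single_expr": ["file1"],
    "double_expr": ["file1", "file2"],
    "gold_data": ["expr_file", "tf_file", "gold_file"],
}

def build_nird_command(mode, outdir, **paths):
    """Return the argv list for one NIRD.py run, checking that inputs are present."""
    if mode not in REQUIRED_FLAGS:
        raise ValueError(f"unknown mode: {mode!r}")
    missing = [flag for flag in REQUIRED_FLAGS[mode] if flag not in paths]
    if missing:
        raise ValueError(f"mode {mode!r} is missing: {', '.join(missing)}")
    cmd = ["python", "NIRD.py", "--datasets", mode]
    for flag in REQUIRED_FLAGS[mode]:
        cmd += [f"--{flag}", paths[flag]]
    cmd += ["--outdir", outdir]
    return cmd

cmd = build_nird_command(
    "double_expr",
    "inferred_networks",
    file1="MF_Datasets/mESC/dropSeq.csv",
    file2="MF_Datasets/mESC/smartSeq.csv",
)
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Passing the command as a list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.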



Help & Arguments

If you are unsure about the available command-line options or how to format the input arguments, you can view the detailed usage information with:

python NIRD.py --help
python NIRD_Velo.py --help

NIRD.py Arguments

usage: NIRD.py [-h] [--datasets {single_expr,double_expr,gold_data}] [--methods METHODS] [--evaluations EVALUATIONS] [--do_eval] [--file1 FILE1] [--file2 FILE2]
               [--expr_file EXPR_FILE] [--tf_file TF_FILE] [--gold_file GOLD_FILE] --outdir OUTDIR

Run matrix factorization methods on biological datasets.

optional arguments:
  -h, --help            Show this help message and exit.
  --datasets {single_expr,double_expr,gold_data}
                        Dataset name: single_expr, double_expr or gold_data.
  --methods METHODS     Comma-separated list of method names.
  --evaluations EVALUATIONS
                        Comma-separated list of evaluation function names.
  --do_eval             If set, perform evaluation and generate plots.
  --file1 FILE1         For single_expr: expression data file | For double_expr: first expression data file.
  --file2 FILE2         For double_expr: second expression data file.
  --expr_file EXPR_FILE For gold_data: expression data file (.tsv).
  --tf_file TF_FILE     For gold_data: transcription factors file.
  --gold_file GOLD_FILE For gold_data: gold standard file.
  --outdir OUTDIR       Directory where inferred networks and results will be saved.

NIRD_Velo.py Arguments

usage: NIRD_Velo.py [-h] [--datasets {double_expr}] [--methods METHODS] [--evaluations {Eval_EdgeOverlapping}] [--do_eval] --file1 FILE1 --file2 FILE2 --outdir OUTDIR

Run matrix factorization methods on biological datasets.

optional arguments:
  -h, --help            Show this help message and exit.
  --datasets {double_expr}
                        Dataset name (only double_expr is supported in NIRD_Velo).
  --methods METHODS     Comma-separated list of method names.
  --evaluations {Eval_EdgeOverlapping}
                        Evaluation function to use (only Eval_EdgeOverlapping is supported).
  --do_eval             If set, perform evaluation and generate plots.
  --file1 FILE1         First expression data file.
  --file2 FILE2         Second expression data file.
  --outdir OUTDIR       Directory where inferred networks and results will be saved.

Included MF and Benchmarking Methods

NIRD includes the following 13 core matrix factorization-based methods for gene regulatory network (GRN) inference: SVD, NMF, ICM, BD, BMF, LSNMF, KLD_NMF, ENMF, PMF, SNMF, PMFCC, SepNMF, and Kernel_PCA.

These methods represent novel or hybrid GRN inference techniques tailored for expression and transcription velocity data.

Additionally, several traditional GRN inference methods are included for benchmarking purposes only: ARACNE, RELNET, MRNET, C3NET, GENIE3, and GrnBoost2.

This allows you to compare the performance of NIRD’s methods against widely used classical algorithms.
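
One common way to make such a comparison concrete is the area under the ROC curve (AUROC) of each method's edge scores against gold standard labels. The sketch below is a generic rank-based AUROC, not NIRD's own evaluation code. It assumes each candidate edge has a score and a 0/1 label, with edges absent from the gold standard treated as 0; the function name is hypothetical and the scores are illustrative values, not real results.

```python
def auroc(scores, labels):
    """AUROC via the rank-sum (Mann-Whitney U) formulation, averaging tied ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative edge")
    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1  # [i, j) is a block of tied scores
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)

# Illustrative edge scores and gold labels.
edge_scores = [0.9, 0.8, 0.3, 0.1]
gold_labels = [1, 0, 1, 0]
print(auroc(edge_scores, gold_labels))  # 3 of 4 positive/negative pairs ranked correctly -> 0.75
```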

The Final Output

The final inferred network is a square gene-by-gene matrix (rows and columns list the same genes) in which each cell holds an interaction strength or feature importance score for a pair of genes, derived from reduced-dimensional representations of the expression matrix. Diagonal entries are zero.

|          | Saal1    | Xrcc1    | Ldb1     | Nr6a1    | Slc7a6os | Chchd7   | Emc10    | Ptms     | Meaf6    | Tor1b    |
| -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- | -------- |
| Saal1    | 0        | 3.29E-05 | 2.38E-05 | 2.92E-05 | 3.01E-05 | 1.75E-04 | 7.83E-06 | 1.59E-05 | 2.83E-05 | 2.18E-05 |
| Xrcc1    | 3.29E-06 | 0        | 3.16E-05 | 1.66E-05 | 5.17E-05 | 1.23E-04 | 9.67E-06 | 2.32E-05 | 2.71E-05 | 3.00E-05 |
| Ldb1     | 4.35E-06 | 3.37E-05 | 0        | 3.78E-05 | 3.03E-05 | 1.19E-04 | 7.09E-06 | 2.62E-05 | 3.25E-05 | 2.50E-05 |
| Nr6a1    | 3.34E-06 | 2.54E-05 | 4.15E-05 | 0        | 2.78E-05 | 1.49E-04 | 6.92E-06 | 3.36E-05 | 3.30E-05 | 3.04E-05 |
| Slc7a6os | 3.55E-06 | 3.35E-05 | 2.14E-05 | 1.88E-05 | 0        | 1.20E-04 | 7.72E-06 | 1.82E-05 | 2.34E-05 | 2.22E-05 |
| Chchd7   | 3.56E-06 | 3.67E-05 | 3.05E-05 | 2.37E-05 | 2.67E-05 | 0        | 7.56E-06 | 2.81E-05 | 2.10E-05 | 4.22E-05 |
| Emc10    | 3.74E-06 | 3.90E-05 | 2.50E-05 | 1.96E-05 | 3.58E-05 | 1.35E-04 | 0        | 1.50E-05 | 2.24E-05 | 2.12E-05 |
| Ptms     | 3.62E-06 | 3.30E-05 | 2.96E-05 | 2.60E-05 | 3.49E-05 | 1.92E-04 | 8.40E-06 | 0        | 2.64E-05 | 2.40E-05 |
| Meaf6    | 3.39E-06 | 3.39E-05 | 2.35E-05 | 1.84E-05 | 3.83E-05 | 1.45E-04 | 8.57E-06 | 1.45E-05 | 0        | 2.47E-05 |
| Tor1b    | 2.86E-06 | 3.26E-05 | 2.60E-05 | 1.95E-05 | 3.20E-05 | 2.15E-04 | 9.34E-06 | 1.70E-05 | 2.52E-05 | 0        |
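
For downstream work it is often easier to handle the network as a ranked edge list than as a dense matrix. The following sketch flattens a score matrix into (row gene, column gene, score) triples and keeps the strongest ones. `top_edges` is a hypothetical helper, and the toy values are the upper-left 3 x 3 corner of the table above.

```python
def top_edges(genes, matrix, k=3):
    """Return the k highest-scoring off-diagonal entries as (row_gene, col_gene, score)."""
    edges = []
    for i, row_gene in enumerate(genes):
        for j, col_gene in enumerate(genes):
            if i != j:  # skip self-interactions on the diagonal
                edges.append((row_gene, col_gene, matrix[i][j]))
    edges.sort(key=lambda edge: edge[2], reverse=True)
    return edges[:k]

genes = ["Saal1", "Xrcc1", "Ldb1"]
matrix = [
    [0.0, 3.29e-05, 2.38e-05],
    [3.29e-06, 0.0, 3.16e-05],
    [4.35e-06, 3.37e-05, 0.0],
]
top = top_edges(genes, matrix, k=3)
for row_gene, col_gene, score in top:
    print(f"{row_gene}\t{col_gene}\t{score:.2e}")
```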

Post-Network Inference Analysis

1. Network Centrality Analysis

After inferring gene-to-gene interaction matrices using NIRD, centrality measures like PageRank and degree are calculated for each gene. These scores help identify influential or hub genes in the network. High-centrality genes are often key regulators or signaling components and may play essential roles in cellular processes or disease mechanisms.
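
As an illustration of the two measures, here is a minimal pure-Python sketch of weighted degree and PageRank on a score matrix. It is not NIRD's implementation. It assumes non-negative scores, treats `matrix[i][j]` as the weight of an edge from gene i to gene j, and uses the conventional damping factor of 0.85; all function names are hypothetical.

```python
def weighted_degree(matrix):
    """Sum of outgoing edge weights for each node (row sums)."""
    return [sum(row) for row in matrix]

def pagerank(matrix, damping=0.85, tol=1e-10, max_iter=500):
    """Power-iteration PageRank on a non-negative weighted adjacency matrix."""
    n = len(matrix)
    out_weight = weighted_degree(matrix)
    rank = [1.0 / n] * n
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[i] for i in range(n) if out_weight[i] == 0)
        new_rank = []
        for j in range(n):
            incoming = sum(
                rank[i] * matrix[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0
            )
            new_rank.append((1 - damping) / n + damping * (incoming + dangling / n))
        delta = sum(abs(a - b) for a, b in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank

# Toy star network: node 0 is connected to nodes 1 and 2.
star = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
]
ranks = pagerank(star)
degrees = weighted_degree(star)
print(degrees, [round(r, 4) for r in ranks])
```

In the toy star network the hub (node 0) receives the highest PageRank, which is the pattern used to flag candidate hub genes.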

2. Differential Network Analysis

NIRD enables comparative network analysis between conditions (e.g., normal vs disease). By computing differences in PageRank and degree for each gene, it identifies genes that gain or lose influence across conditions. These differential scores highlight candidate genes that may drive disease progression or represent therapeutic targets.
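
The comparison can be sketched as a per-gene difference of centrality scores between two conditions, ranked by absolute change. The helper name, the gene names, and the scores below are placeholders for illustration only, not NIRD output.

```python
def differential_scores(control, disease):
    """Return (gene, disease - control) pairs for shared genes, largest |change| first."""
    shared = control.keys() & disease.keys()
    deltas = {gene: disease[gene] - control[gene] for gene in shared}
    # Sort by absolute change, breaking ties alphabetically for stable output.
    return sorted(deltas.items(), key=lambda item: (-abs(item[1]), item[0]))

# Placeholder PageRank scores per condition.
control_pagerank = {"geneA": 0.10, "geneB": 0.30, "geneC": 0.60}
disease_pagerank = {"geneA": 0.40, "geneB": 0.35, "geneC": 0.25}

ranked = differential_scores(control_pagerank, disease_pagerank)
for gene, delta in ranked:
    direction = "gains" if delta > 0 else "loses"
    print(f"{gene} {direction} influence ({delta:+.2f})")
```

The same function works for degree or any other per-gene score, as long as both conditions are scored on a comparable scale.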

3. Functional Enrichment Analysis

Genes with the highest differential network scores are subjected to pathway enrichment analysis. This determines which biological pathways are significantly overrepresented, helping link network-level changes to known cellular processes such as inflammation, ECM remodeling, or signaling dysregulation.
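
Overrepresentation is commonly tested with a one-sided hypergeometric test. The text above does not say which test NIRD's workflow uses, so treat the following as an assumption-based sketch with a hypothetical function name and made-up counts.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_pathway, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without replacement
    from n_universe genes, of which n_pathway belong to the pathway."""
    upper = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Made-up counts: 20 genes in total, 5 in the pathway,
# 4 top-ranked genes selected, 3 of which are pathway members.
p_value = hypergeom_enrichment_p(20, 5, 4, 3)
print(f"p = {p_value:.4f}")
```

When many pathways are tested, the resulting p-values should be corrected for multiple testing (for example with the Benjamini-Hochberg procedure).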

4. Module and Cluster Analysis

Community detection or clustering techniques can be applied to the inferred network to identify gene modules. These modules often correspond to co-regulated genes or functionally coherent groups, offering insight into coordinated biological responses or cell-type-specific activities.
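
As a deliberately simple stand-in for a community detection algorithm, the sketch below keeps only edges whose score reaches a threshold and reports the connected components as modules. Dedicated methods such as Louvain or hierarchical clustering usually give finer modules. The helper name, threshold, and toy scores are illustrative assumptions.

```python
def modules_by_threshold(genes, matrix, threshold):
    """Group genes into connected components of the thresholded network (union-find)."""
    parent = list(range(len(genes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    n = len(genes)
    for i in range(n):
        for j in range(i + 1, n):
            # Link two genes if the score in either direction reaches the threshold.
            if max(matrix[i][j], matrix[j][i]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i, gene in enumerate(genes):
        groups.setdefault(find(i), []).append(gene)
    # Largest modules first, alphabetical order as tie-breaker.
    return sorted((sorted(m) for m in groups.values()), key=lambda m: (-len(m), m))

genes = ["g1", "g2", "g3", "g4", "g5"]
scores = [
    [0.0, 0.9, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.8, 0.1, 0.0],
    [0.1, 0.8, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.0, 0.0, 0.7],
    [0.0, 0.0, 0.1, 0.7, 0.0],
]
modules = modules_by_threshold(genes, scores, threshold=0.5)
print(modules)
```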