Statistics & additional information

Database statistics

Number of transcripts: 146,742
Number of genes: 65,694
Genes are defined by grouping transcripts that lie in the same orientation and share at least one partially overlapping exon.

Transcript clustering and naming

Different lncRNA transcripts are considered to belong to the same gene if they share at least one (partially) overlapping exon and reside on the same DNA strand. In this way, transcripts are clustered into genes.
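The clustering rule above can be sketched as a connected-components problem: two transcripts are linked when they lie on the same chromosome and strand and any of their exons overlap, and a gene is one connected group. The sketch below is a minimal illustration, not the LNCipedia pipeline; the input layout (a dict of `(chrom, strand, exons)` tuples), the closed-interval coordinates, and the function names are all assumptions made for the example.

```python
from collections import defaultdict


def exons_overlap(a, b):
    """True if two closed (start, end) intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def cluster_transcripts(transcripts):
    """Group transcripts into genes.

    `transcripts` maps a transcript id to (chrom, strand, exons), where
    exons is a list of (start, end) tuples (hypothetical layout).
    Returns a sorted list of sorted id lists, one list per gene.
    """
    parent = {tid: tid for tid in transcripts}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    ids = list(transcripts)
    for i, a in enumerate(ids):
        chrom_a, strand_a, exons_a = transcripts[a]
        for b in ids[i + 1:]:
            chrom_b, strand_b, exons_b = transcripts[b]
            if (chrom_a, strand_a) != (chrom_b, strand_b):
                continue  # different chromosome or strand: never the same gene
            if any(exons_overlap(ea, eb) for ea in exons_a for eb in exons_b):
                parent[find(a)] = find(b)  # merge the two clusters

    genes = defaultdict(list)
    for tid in ids:
        genes[find(tid)].append(tid)
    return sorted(sorted(members) for members in genes.values())
```

Because linking is transitive, two transcripts that share no exon directly still end up in the same gene when a third transcript overlaps both. The pairwise comparison is quadratic and only suitable for small inputs; a real implementation would likely sort by position or use an interval index.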

If a lncRNA gene has an official gene symbol according to HGNC, that symbol is used as the primary ID (e.g. HOTAIR). Transcripts in the gene are numbered, starting with the most upstream transcript (e.g. HOTAIR:1).

If the lncRNA does not (yet) have an official gene symbol, we employ a universal lncRNA nomenclature based on the gene symbol of the nearest protein-coding gene, to ease communication among researchers. These lncRNA genes are named after the HUGO symbol of the nearest protein-coding gene on the same strand using the following scheme: ‘lnc-HUGO-#’. The lncRNA genes are numbered, starting with the lncRNA gene closest to the protein-coding gene. A second number is added to denote the different transcript variants, starting with the most upstream transcript; for example, lnc-MYCN-1:1 denotes transcript 1 from gene lnc-MYCN-1 (more info).
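As a small illustration of the naming scheme, the sketch below builds a provisional gene name and numbers the transcripts of one gene. It assumes that "most upstream" means the transcript whose 5' end comes first relative to the strand (lowest start on the plus strand, highest end on the minus strand); that interpretation, the function names, and the input layout are assumptions for the example rather than the documented LNCipedia procedure.

```python
def provisional_gene_name(hugo_symbol, rank):
    """Name for a lncRNA gene without an official symbol, e.g. lnc-MYCN-1.

    `rank` is 1 for the lncRNA gene closest to the protein-coding gene.
    """
    return f"lnc-{hugo_symbol}-{rank}"


def number_transcripts(gene_name, strand, transcripts):
    """Assign 'GENE:n' names to the transcripts of one gene.

    `transcripts` maps a temporary id to (start, end) genomic coordinates.
    Assumption: upstream is judged by the 5' end relative to the strand.
    """
    if strand == "+":
        ordered = sorted(transcripts, key=lambda t: transcripts[t][0])
    else:
        ordered = sorted(transcripts, key=lambda t: -transcripts[t][1])
    return {tid: f"{gene_name}:{i}" for i, tid in enumerate(ordered, start=1)}
```

For example, `number_transcripts(provisional_gene_name("MYCN", 1), "+", {"a": (100, 500)})` yields the single name `lnc-MYCN-1:1`.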

GRCh38/hg38 reference genome

LNCipedia now supports both the hg19 and hg38 reference genomes. Switch to your preferred reference genome by selecting it from the “Genome” link in the menu; all genomic coordinates and links to other websites/tools will be updated to the corresponding reference genome. Exports are available for both reference genomes. Of note: positions are automatically converted using LiftOver. Transcripts that do not have a unique position in both reference genomes, or that have a different size, are only available in one reference genome.

UCSC trackhub

A UCSC trackhub is available at http://lncipedia.org/trackhub/hub.txt

LncRNA sources used

LncRNAdb (September 2011): 105 transcripts

LncRNAdb contains lncRNAs identified from the literature in around 60 different species.

Broad Institute (Human Body Map lincRNAs): 14,279 transcripts

The Human lincRNA Catalog collected its data from RNA-seq across 24 tissues and cell types.

Ensembl release 64: 9,069 transcripts
Ensembl release 68: 19,794 transcripts
Ensembl release 75: 23,498 transcripts
Ensembl release 83: 26,376 transcripts
Ensembl release 87: 25,960 transcripts

Ensembl gene annotation, cDNA alignments and chromatin-state map data from the Ensembl regulatory build are used to predict lincRNAs for human and mouse. The human lncRNA data are imported into LNCipedia.

Gencode 13: 19,812 transcripts

The main data set combines the HAVANA manual annotation using evidence from various sources and research groups with the Ensembl automatic annotation pipelines to achieve an accurate and complete annotation of the human genome.

RefSeq - Dec 2014: 4,774 transcripts

Each RefSeq (Reference Sequence) is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration. Only entries with the property “biomol_ncrna_lncrna” were considered.

Nielsen et al: 7,656 transcripts

Expression levels are evaluated across 12 human tissues (bladder, brain, breast, colon, heart, kidney, liver, lung, muscle, ovary, prostate and skin) using a custom-designed microarray, supplemented with RNA-seq.
Various filters were applied to the probes:
1. All probes were aligned to all protein-coding mRNAs using BLAST, and probes with E-scores below 1 × 10^−10 failed.
2. Probes overlapping a genomic region with more than 10 human chained self-alignments (Kent et al. 2003) failed.
3. Probes overlapping regions with mitochondrial homology failed.
4. Probes overlapping RepeatMasker regions failed.
The following three filter rules were subsequently applied to all non-coding (nc) transcripts:
1. Nc transcripts with any probe failing filter 1 were discarded.
2. Nc transcripts with no probes passing filters 2, 3, and 4 were discarded.
3. Nc transcripts overlapping pseudogenes defined by GENCODE (V12) were discarded.
Collectively, this reduced the number of analyzed transcripts from 26,910 to 12,115.
After filtering the data for lncRNAs, we added 5,339 transcripts to the database.
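The three transcript-level rules above combine the per-probe filter outcomes into a keep/discard decision. The sketch below is a hedged illustration of that logic only: the `Probe` record, its field names, and the reading of "no probes passing filters 2, 3, and 4" as "no single probe passes all three" are assumptions, and the upstream BLAST, self-chain, mitochondrial, and repeat checks are taken as already computed.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Per-probe filter outcomes (hypothetical record; True means the probe failed)."""
    fails_blast: bool       # filter 1: significant hit to a protein-coding mRNA
    fails_self_chain: bool  # filter 2: region with more than 10 chained self-alignments
    fails_mito: bool        # filter 3: mitochondrial homology
    fails_repeat: bool      # filter 4: overlaps a repeat-masked region


def keep_transcript(probes, overlaps_pseudogene):
    """Apply the three transcript-level rules; True means the transcript is kept."""
    # Rule 1: a single probe failing filter 1 discards the transcript.
    if any(p.fails_blast for p in probes):
        return False
    # Rule 2: at least one probe must pass filters 2, 3 and 4 together (assumed reading).
    if not any(
        not (p.fails_self_chain or p.fails_mito or p.fails_repeat) for p in probes
    ):
        return False
    # Rule 3: transcripts overlapping annotated pseudogenes are discarded.
    return not overlaps_pseudogene
```

Note that a transcript with no probes at all is discarded by rule 2 under this reading.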

Hangauer et al: 5,339 transcripts

The data from this publication were collected from RNA-seq. De novo transcriptome assembly was performed on each of the RNA-seq datasets, generating 6,833,809 de novo assembled transcripts. The transcripts were then filtered, and only long non-coding RNAs were added to the database.
Filter: fragments per kilobase of transcript per million mapped reads (FPKM) > 1
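For reference, FPKM normalises a fragment count by transcript length (in kilobases) and by sequencing depth (in millions of mapped fragments). The sketch below shows the standard formula and the FPKM > 1 threshold; the function and parameter names are illustrative, and the example numbers are made up.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)


def passes_expression_filter(fragments, transcript_length_bp,
                             total_mapped_fragments, threshold=1.0):
    """True if the transcript is expressed strictly above the FPKM threshold."""
    return fpkm(fragments, transcript_length_bp, total_mapped_fragments) > threshold
```

With 25 million mapped fragments, a 2,000 bp transcript needs more than 50 fragments to exceed an FPKM of 1.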

NONCODE: 93,164 transcripts

NONCODE data is collected from three sources:
1. Literature mining,
2. Specialized databases,
3. GenBank

Sun and Gadad et al., 2015: 2,305 transcripts

Abstract: We describe a computational approach that integrates GRO-seq and RNA-seq data to annotate long noncoding RNAs (lncRNAs), with increased sensitivity for low-abundance lncRNAs. We used this approach to characterize the lncRNA transcriptome in MCF-7 human breast cancer cells, including >700 previously unannotated lncRNAs. We then used information about the (1) transcription of lncRNA genes from GRO-seq, (2) steady-state levels of lncRNA transcripts in cell lines and patient samples from RNA-seq, and (3) histone modifications and factor binding at lncRNA gene promoters from ChIP-seq to explore lncRNA gene structure and regulation, as well as lncRNA transcript stability, regulation, and function. Functional analysis of selected lncRNAs with altered expression in breast cancers revealed roles in cell proliferation, regulation of an E2F-dependent cell-cycle gene expression program, and estrogen-dependent mitogenic growth. Collectively, our studies demonstrate the use of an integrated genomic and molecular approach to identify and characterize growth-regulating lncRNAs in cancers.

FANTOM CAT: 27,719 transcripts

Abstract: Here, using FANTOM5 cap analysis of gene expression (CAGE) data, we integrate multiple transcript collections to generate a comprehensive atlas of 27,919 human lncRNA genes with high-confidence 5′ ends and expression profiles across 1,829 samples from the major human primary cell types and tissues. Genomic and epigenomic classifications of these lncRNAs reveals that most intergenic lncRNAs originate from enhancers rather than from promoters. Incorporating genetic and expression data, we show that lncRNAs overlapping trait-associated single nucleotide polymorphisms are specifically expressed in cell types relevant to the traits, implicating these lncRNAs in multiple diseases. We further demonstrate that lncRNAs overlapping expression quantitative trait loci (eQTL)-associated single nucleotide polymorphisms of messenger RNAs are co-expressed with the corresponding messenger RNAs, suggesting their potential roles in transcriptional regulation. Combining these findings with conservation data, we identify 19,175 potentially functional lncRNAs in the human genome.

http://fantom.gsc.riken.jp/cat/

The stringent set of FANTOM CAT lncRNAs is included in LNCipedia with the exclusion of 34 transcripts that were in conflict with the HUGO gene boundaries.

Protein coding potential

Protein coding potential is assessed by means of two different prediction algorithms and a novel PRIDE database search algorithm.

CPC: Coding Potential Calculator

From the CPC website:
We developed a Support Vector Machine-based classifier, named Coding Potential Calculator (CPC), to assess the protein-coding potential of a transcript based on six biologically meaningful sequence features. 10-fold cross-validation on the training dataset and independent testing on three large standalone datasets showed that CPC can discriminate coding from noncoding transcripts with high accuracy.

HMMER: Biosequence analysis using profile hidden Markov models

We used the hmmscan algorithm against the pfam2 database to search for known protein domains in all 6 reading frames of the transcript. The number of Pfam domains found is reported for both the 5' to 3' and the 3' to 5' direction.
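Searching "all 6 reading frames" implies translating the transcript in three frames on each strand before the protein-domain search. The sketch below shows one way to produce those six translations with the standard genetic code; it is an illustration under that assumption and does not reproduce how LNCipedia prepared its hmmscan input. Frame labels such as `+1` and `-1` are naming choices made for the example.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq):
    """Return translations for frames +1..+3 (5' to 3') and -1..-3 (3' to 5')."""
    seq = seq.upper().replace("U", "T")
    frames = {}
    for label, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{label}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

Each translated frame could then be written to a FASTA file and passed to a protein-domain search; counting hits separately for the `+` and `-` frames would give the two per-direction numbers described above.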

PRIDE: database search

We have re-analysed more than 100 Homo sapiens proteomics projects from the PRIDE database by searching MS/MS spectra against the standard UniProtKB/Swiss-Prot human database together with the translated version of LNCipedia.

PhyloCSF: Coding Potential of a multi-species nucleotide sequence alignment

We benchmarked the PhyloCSF algorithm on (non)coding Ensembl data, achieving a specificity and sensitivity of 93% at a cutoff of 60.7876. A score below this cutoff indicates that the transcript is non-coding; a score above it indicates that the transcript is likely to be coding.

CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model

We use the CPAT algorithm to calculate the coding probability from the sequence of the lncRNA. The suggested coding probability cutoff of 0.364 is used; this cutoff corresponds to a sensitivity and specificity of 0.966.
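The PhyloCSF and CPAT cutoffs quoted above can be applied as simple threshold tests. The sketch below is illustrative: the function names are hypothetical, and the handling of a score exactly equal to a cutoff is an assumption (treated as non-coding for PhyloCSF, and as coding for CPAT, following CPAT's usual "probability >= cutoff" convention).

```python
PHYLOCSF_CUTOFF = 60.7876  # cutoff stated for the Ensembl benchmark
CPAT_CUTOFF = 0.364        # suggested coding-probability cutoff


def phylocsf_call(score):
    """Label a transcript from its PhyloCSF score."""
    # Assumption: a score exactly at the cutoff is treated as non-coding.
    return "coding" if score > PHYLOCSF_CUTOFF else "non-coding"


def cpat_call(coding_probability):
    """Label a transcript from its CPAT coding probability."""
    # Assumption: a probability exactly at the cutoff is treated as coding.
    return "coding" if coding_probability >= CPAT_CUTOFF else "non-coding"
```

The two tools can disagree on the same transcript, so it is usually more informative to report both calls side by side than to merge them into a single label.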

Ribosome-profiling: Lee et al., 2012 and Bazzini et al., 2014

253 lncRNAs containing small open reading frames (smORFs) are provided by Bazzini et al., 2014. Bazzini and colleagues developed an approach to detect smORFs using ribosome profiling whereby the periodicity of ribosome movement on actively translated ORFs is used to distinguish coding from non-coding sequences.

A second approach to apply ribosome profiling in the quest for novel coding RNAs has been described by Lee et al., 2012. Using lactimidomycin, a ribosome inhibitor specific to initiating ribosomes, translation initiation sites (TIS) were mapped in HEK-293 cells.

Conservation

Locus conservation

Locus conservation is assessed by evaluating the positional conservation and order of the flanking protein-coding genes. A human lncRNA locus is considered conserved when the flanking protein-coding genes have flanking orthologues in another species, as assessed by the Ensembl Compara API. Currently, locus conservation in mouse and zebrafish compared to human is provided. Our analyses suggest human locus conservation of 60% compared to mouse and 25% compared to zebrafish.
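One possible reading of the locus-conservation criterion is sketched below: the two protein-coding genes flanking the human lncRNA must both have an orthologue in the other species, and those orthologues must be direct neighbours on the same chromosome there. This is an assumption about what "flanking orthologues" means; the function, the plain-dict inputs, and the gene names in the usage note are hypothetical, and no Ensembl Compara calls are shown.

```python
def locus_conserved(left_gene, right_gene, orthologs, gene_order):
    """Check whether the flanking genes stay adjacent in another species.

    orthologs:  dict mapping a human gene to its orthologue in the other species.
    gene_order: dict mapping a chromosome of the other species to its
                protein-coding genes in genomic order.
    """
    left = orthologs.get(left_gene)
    right = orthologs.get(right_gene)
    if left is None or right is None:
        return False  # at least one flanking gene has no orthologue
    for genes in gene_order.values():
        if left in genes and right in genes:
            # Adjacent in the protein-coding gene order, in either orientation.
            return abs(genes.index(left) - genes.index(right)) == 1
    return False  # orthologues lie on different chromosomes
```

For instance, with `orthologs = {"A": "a", "B": "b"}` and `gene_order = {"chr4": ["a", "b", "c"]}`, the locus between human genes A and B would count as conserved under this reading.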

Transcript classification

Transcripts are classified based on their position relative to protein-coding genes (Ensembl 84). The position is queried in the following order:

  1. Overlap with protein coding gene on the same strand:
    1. No overlap with protein-coding exons: intronic
    2. Otherwise: sense overlapping
  2. Overlap with protein coding gene on the opposite strand: antisense
  3. No overlap with any protein coding gene
    1. Transcription start site of protein coding gene is within 100 bp of lncRNA transcription start site: bidirectional
    2. Otherwise: intergenic
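
The decision order above can be sketched as a short function. This is a minimal illustration under several assumptions: features are plain dicts with `chrom`, `strand`, `start`, `end` and `exons` keys (a hypothetical layout), gene overlap is tested on the full genomic span, exon overlap compares lncRNA exons with protein-coding exons, and the transcription start site is taken as `start` on the plus strand and `end` on the minus strand.

```python
def spans_overlap(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def tss(feature):
    """Transcription start site, assuming start < end in genomic coordinates."""
    return feature["start"] if feature["strand"] == "+" else feature["end"]


def classify_transcript(lnc, coding_genes, tss_window=100):
    """Classify a lncRNA transcript relative to protein-coding genes."""
    on_chrom = [g for g in coding_genes if g["chrom"] == lnc["chrom"]]
    overlapping = [
        g for g in on_chrom
        if spans_overlap(lnc["start"], lnc["end"], g["start"], g["end"])
    ]

    # 1. Overlap with a protein-coding gene on the same strand.
    same_strand = [g for g in overlapping if g["strand"] == lnc["strand"]]
    if same_strand:
        hits_exon = any(
            spans_overlap(ls, le, gs, ge)
            for g in same_strand
            for gs, ge in g["exons"]
            for ls, le in lnc["exons"]
        )
        return "sense overlapping" if hits_exon else "intronic"

    # 2. Overlap with a protein-coding gene on the opposite strand.
    if overlapping:
        return "antisense"

    # 3. No overlap: bidirectional if a coding TSS lies within the window.
    if any(abs(tss(g) - tss(lnc)) <= tss_window for g in on_chrom):
        return "bidirectional"
    return "intergenic"
```

Because the checks run in a fixed order, a transcript that overlaps genes on both strands is reported by its same-strand relationship, matching the query order listed above.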