Statistics & additional information
Number of transcripts: 118,777
Number of genes: 63,953
Genes are defined by grouping transcripts in the same orientation with at least one partially overlapping exon.
Transcript clustering and naming
Different lncRNA transcripts are considered to belong to the same gene if they share at least one (partially) overlapping exon and reside on the same DNA strand. In this way, transcripts are clustered into genes.
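The clustering rule can be sketched as follows. This is a minimal illustration, not LNCipedia's actual implementation: the input layout (a dict of chromosome, strand and half-open exon intervals) and the function names are hypothetical, and it assumes the grouping is transitive (if A overlaps B and B overlaps C, all three end up in one gene).

```python
from itertools import combinations


def exons_overlap(a, b):
    """True if two half-open exon intervals (start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def cluster_transcripts(transcripts):
    """Group transcripts into genes.

    transcripts: dict of transcript id -> (chrom, strand, [(start, end), ...])
    Returns a list of sets of transcript ids, one set per gene.
    """
    parent = {tid: tid for tid in transcripts}

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(transcripts, 2):
        chrom_a, strand_a, exons_a = transcripts[a]
        chrom_b, strand_b, exons_b = transcripts[b]
        # Transcripts must be on the same chromosome and the same strand.
        if (chrom_a, strand_a) != (chrom_b, strand_b):
            continue
        # One (partially) overlapping exon pair is enough to merge.
        if any(exons_overlap(ea, eb) for ea in exons_a for eb in exons_b):
            parent[find(a)] = find(b)

    genes = {}
    for tid in transcripts:
        genes.setdefault(find(tid), set()).add(tid)
    return list(genes.values())
```

The pairwise comparison is quadratic in the number of transcripts; for a real annotation one would sort by coordinate first, but the quadratic form keeps the rule itself easy to read.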
If a lncRNA gene has an official gene symbol according to HGNC, that symbol is used as the primary ID (e.g. HOTAIR). Transcripts in the gene are numbered, starting with the most upstream transcript (e.g. HOTAIR:1).
If the lncRNA does not (yet) have an official gene symbol, we employ a universal lncRNA nomenclature based on the gene symbol of the nearest protein-coding gene to ease communication among researchers. These lncRNA genes are named after the HUGO symbol of the nearest protein-coding gene on the same strand using the following scheme: ‘lnc-HUGO-#’. The lncRNA genes are numbered, starting with the lncRNA gene closest to the protein-coding gene. A second number is added to denote the different transcript variants, starting with the most upstream transcript. For example, lnc-MYCN-1:1 denotes transcript 1 from gene lnc-MYCN-1.
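As an illustration of the naming scheme, the sketch below splits a transcript ID into its parts. The regular expressions and the function name are assumptions made for this example, not an official LNCipedia parser; in particular, it assumes the transcript number always follows a colon.

```python
import re

# 'lnc-HUGO-#:#' form; the symbol itself may contain hyphens, so it is
# matched greedily and the last '-<digits>' before the colon is the gene number.
LNC_NAME = re.compile(r"^lnc-(?P<symbol>.+)-(?P<gene_no>\d+):(?P<transcript_no>\d+)$")
# 'SYMBOL:#' form for genes with an official HGNC symbol.
HGNC_NAME = re.compile(r"^(?P<symbol>[^:]+):(?P<transcript_no>\d+)$")


def parse_transcript_id(name):
    """Split a transcript ID such as 'lnc-MYCN-1:1' or 'HOTAIR:1' into parts."""
    m = LNC_NAME.match(name)
    if m:
        return {
            "gene": f"lnc-{m['symbol']}-{m['gene_no']}",
            "nearest_protein_coding_gene": m["symbol"],
            "gene_number": int(m["gene_no"]),
            "transcript_number": int(m["transcript_no"]),
        }
    m = HGNC_NAME.match(name)
    if m:
        return {
            "gene": m["symbol"],
            "nearest_protein_coding_gene": None,
            "gene_number": None,
            "transcript_number": int(m["transcript_no"]),
        }
    raise ValueError(f"not a recognised transcript ID: {name!r}")
```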
GRCh38/hg38 reference genome
LNCipedia now supports both the hg19 and hg38 reference genomes. Switch to your preferred reference genome by selecting it from the “Genome” link in the menu; all genomic coordinates and links to other websites and tools will be updated to the corresponding reference genome. Exports are available for both reference genomes. Of note: positions are automatically converted using LiftOver, so transcripts that do not have a unique position in both reference genomes, or that differ in size between them, will only be available in one reference genome.
A UCSC trackhub is available at http://lncipedia.org/trackhub/hub.txt
LncRNA sources used
LncRNAdb (September 2011): 105 transcripts
The LncRNAdb contains lncRNAs identified from the literature in around 60 different species.
Broad Institute (Human Body Map lincRNAs): 14,279 transcripts
The Human lincRNA Catalog collected its data from RNA-seq across 24 tissues and cell types.
Ensembl release 64: 9,069 transcripts
Ensembl release 68: 19,794 transcripts
Ensembl release 75: 23,498 transcripts
Ensembl release 83: 26,376 transcripts
Ensembl gene annotation, cDNA alignments and chromatin-state map data from the Ensembl regulatory build are used to predict lincRNAs for human and mouse. The human lncRNA data are imported into LNCipedia.
Gencode 13: 19,812 transcripts
The main data set combines the HAVANA manual annotation, using evidence from various sources and research groups, with the Ensembl automatic annotation pipelines to achieve an accurate and complete annotation of the human genome.
Refseq - Dec 2014: 4,774 transcripts
Each RefSeq (Reference Sequence) is constructed wholly from sequence data submitted to the International Nucleotide Sequence Database Collaboration. Only entries with property “biomol_ncrna_lncrna” were considered.
Nielsen et al: 7,656 transcripts
Expression levels are evaluated across 12 human tissues (bladder, brain, breast, colon, heart, kidney, liver, lung, muscle, ovary, prostate and skin) using a custom-designed microarray, supplemented with RNA-seq.
Various filters were applied:
1. All probes were aligned to all protein-coding mRNAs using BLAST; probes with E-values below 1e-10 failed.
2. Probes overlapping a genomic region with more than 10 human chained self-alignments (Kent et al. 2003) failed.
3. Probes overlapping regions with mitochondrial homology failed.
4. Probes overlapping RepeatMasker regions failed.
The following three filter rules were subsequently applied to all nc transcripts:
Nc transcripts with any probe failing filter 1 were discarded.
Nc transcripts with no probes passing filters 2, 3, and 4 were discarded.
Nc transcripts overlapping pseudogenes defined by GENCODE (V12) were discarded.
Collectively, this reduced the number of analyzed transcripts from 26,910 to 12,115.
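The three transcript-level rules can be written as a single predicate. This is a minimal sketch: the `Probe` and `Transcript` classes and their field names are hypothetical, and it reads "no probes passing filters 2, 3, and 4" as "no single probe passes all three of those filters", which is an interpretation of the text above.

```python
from dataclasses import dataclass, field


@dataclass
class Probe:
    passes_blast: bool           # filter 1: no strong hit to a protein-coding mRNA
    passes_self_alignment: bool  # filter 2: not in a heavily self-aligned region
    passes_mito: bool            # filter 3: no mitochondrial homology
    passes_repeat: bool          # filter 4: not in a repeat-masked region


@dataclass
class Transcript:
    name: str
    probes: list = field(default_factory=list)
    overlaps_pseudogene: bool = False


def keep_transcript(t):
    """Apply the three transcript-level rules; True means the transcript is kept."""
    # Rule 1: any probe failing filter 1 discards the transcript.
    if any(not p.passes_blast for p in t.probes):
        return False
    # Rule 2: at least one probe must pass filters 2, 3 and 4 together.
    if not any(p.passes_self_alignment and p.passes_mito and p.passes_repeat
               for p in t.probes):
        return False
    # Rule 3: transcripts overlapping a pseudogene are discarded.
    return not t.overlaps_pseudogene
```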
After filtering the data for lncRNAs, we added 5,339 transcripts to the database.
The data from this publication were collected from RNA-seq. De novo transcriptome assembly was performed on each of the RNA-seq datasets, generating 6,833,809 de novo assembled transcripts.
Transcripts were filtered; only long non-coding RNAs were added to the database.
Filter: fragments per kilobase of transcript per million mapped reads (FPKM) > 1
NONCODE data is collected from three sources:
1. Literature mining,
2. Specialized databases,
Abstract: We describe a computational approach that integrates GRO-seq and RNA-seq data to annotate long noncoding RNAs (lncRNAs), with increased sensitivity for low-abundance lncRNAs. We used this approach to characterize the lncRNA transcriptome in MCF-7 human breast cancer cells, including >700 previously unannotated lncRNAs. We then used information about the (1) transcription of lncRNA genes from GRO-seq, (2) steady-state levels of lncRNA transcripts in cell lines and patient samples from RNA-seq, and (3) histone modifications and factor binding at lncRNA gene promoters from ChIP-seq to explore lncRNA gene structure and regulation, as well as lncRNA transcript stability, regulation, and function. Functional analysis of selected lncRNAs with altered expression in breast cancers revealed roles in cell proliferation, regulation of an E2F-dependent cell-cycle gene expression program, and estrogen-dependent mitogenic growth. Collectively, our studies demonstrate the use of an integrated genomic and molecular approach to identify and characterize growth-regulating lncRNAs in cancers.
Protein coding potential
Protein coding potential is assessed by means of two different prediction algorithms and a novel PRIDE database search algorithm.
CPC: Coding Potential Calculator
From the CPC website:
We developed a Support Vector Machine-based classifier, named Coding Potential Calculator (CPC), to assess the protein-coding potential of a transcript based on six biologically meaningful sequence features. 10-fold cross-validation on the training dataset and independent testing on three large standalone datasets showed that CPC can discriminate coding from noncoding transcripts with high accuracy.
HMMER: Biosequence analysis using profile hidden Markov models
We used the hmmscan algorithm against the pfam2 database to search for known protein domains in all 6 reading frames of the transcript. The number of Pfam domains found is reported for both the 5' to 3' and 3' to 5' directions.
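Because hmmscan searches protein sequences, each transcript first has to be translated in all six reading frames. The sketch below shows only that translation step, using the standard genetic code. It is an illustration under stated assumptions rather than the pipeline LNCipedia runs: the function names and the frame labels ("+1" to "-3") are hypothetical, and stop codons are simply written as "*".

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict of frame label -> protein string for all six frames."""
    seq = seq.upper().replace("U", "T")
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames
```

The three "+" frames correspond to the 5' to 3' direction and the three "-" frames to the 3' to 5' direction, so domain hits can be counted per direction as described above.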
PRIDE: database search
We have re-analysed more than 100 Homo sapiens proteomics projects from the PRIDE database by searching MS/MS spectra against the standard UniProtKB/Swiss-Prot human database together with the translated version of LNCipedia.
PhyloCSF: Coding Potential of a multi-species nucleotide sequence alignment
We use the PhyloCSF algorithm to benchmark the (non)coding Ensembl data. At a cutoff of 60.7876, we achieved a specificity and sensitivity of 93%.
A score below this cutoff means that the transcript is considered non-coding; a score above it means the transcript is likely to be coding.
CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model
We use the CPAT algorithm to calculate the coding probability based on the sequence of the lncRNA. The suggested coding probability cutoff of 0.364 is used; this cutoff corresponds to a sensitivity and specificity of 0.966.
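The PhyloCSF and CPAT cutoffs quoted above can be applied as in the following sketch. The function name and the returned labels are hypothetical, and treating a score exactly equal to a cutoff as non-coding is an assumption, since the text does not say how ties are handled.

```python
PHYLOCSF_CUTOFF = 60.7876  # PhyloCSF score cutoff stated above
CPAT_CUTOFF = 0.364        # CPAT coding-probability cutoff stated above


def coding_calls(phylocsf_score=None, cpat_probability=None):
    """Return a per-tool call ('coding' or 'non-coding') for the scores supplied."""
    calls = {}
    if phylocsf_score is not None:
        calls["PhyloCSF"] = (
            "coding" if phylocsf_score > PHYLOCSF_CUTOFF else "non-coding"
        )
    if cpat_probability is not None:
        calls["CPAT"] = (
            "coding" if cpat_probability > CPAT_CUTOFF else "non-coding"
        )
    return calls
```

Keeping the calls separate per tool, rather than merging them into one verdict, mirrors how the scores are reported here: each algorithm is listed on its own.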
Ribosome-profiling: Lee et al., 2012 and Bazzini et al., 2014
253 lncRNAs containing small open reading frames (smORFs) are provided by Bazzini et al., 2014. Bazzini and colleagues developed an approach to detect smORFs using ribosome profiling whereby the periodicity of ribosome movement on actively translated ORFs is used to distinguish coding from non-coding sequences.
A second approach to apply ribosome profiling in the quest for novel coding RNAs has been described by Lee et al., 2012. Using lactimidomycin, a ribosome inhibitor specific to initiating ribosomes, translation initiation sites (TIS) were mapped in HEK-293 cells.
Locus conservation is assessed by evaluating the positional conservation and order of the flanking protein-coding genes. A human lncRNA locus is considered conserved when the flanking protein-coding genes have flanking orthologues in another species, as assessed by the Ensembl Compara API. Currently, locus conservation in mouse and zebrafish compared to human is provided. Our analyses suggest human locus conservation of 60% compared to mouse and 25% compared to zebrafish.
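One possible reading of this criterion is sketched below. It is an assumption-laden illustration, not the actual analysis: orthologue lookups are represented by a plain dict instead of the Ensembl Compara API, gene order is a per-chromosome list of protein-coding genes, and "flanking orthologues" is interpreted as "the two orthologues are direct neighbours in the other species". All names, including the gene symbols in the usage comment, are placeholders.

```python
def locus_conserved(upstream_gene, downstream_gene, orthologs, gene_order):
    """Check whether the orthologues of two flanking genes are neighbours.

    orthologs:  dict of human gene -> orthologous gene in the other species
    gene_order: dict of chromosome -> list of protein-coding genes in
                genomic order for the other species
    """
    up = orthologs.get(upstream_gene)
    down = orthologs.get(downstream_gene)
    # Both flanking genes need an orthologue in the other species.
    if up is None or down is None:
        return False
    for genes in gene_order.values():
        if up in genes and down in genes:
            # Neighbours in either orientation count as conserved.
            return abs(genes.index(up) - genes.index(down)) == 1
    # Orthologues lie on different chromosomes (or are not annotated).
    return False


# Usage with placeholder gene names:
# locus_conserved("GENEA", "GENEB",
#                 {"GENEA": "geneA", "GENEB": "geneB"},
#                 {"chr5": ["geneX", "geneA", "geneB"]})
```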
Transcripts are classified based on their position relative to protein-coding genes (Ensembl 84). The position is queried in the following order:
- Overlap with protein-coding gene on the same strand:
  - No overlap with protein-coding exons: intronic
  - Otherwise: sense overlapping
- Overlap with protein-coding gene on the opposite strand: antisense
- No overlap with any protein-coding gene:
  - Transcription start site of protein-coding gene is within 100 bp of lncRNA transcription start site: bidirectional
  - Otherwise: intergenic
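The query order above maps directly onto a short decision function. The sketch below is a minimal illustration with a hypothetical `Feature` type and half-open coordinates; it assumes that "overlap with a gene" means overlap of the full genomic spans, that "overlap with protein-coding exons" compares lncRNA exons with gene exons, and that the bidirectional test places no constraint on strand, since none is stated.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    chrom: str
    strand: str   # "+" or "-"
    exons: tuple  # ((start, end), ...) half-open intervals

    @property
    def start(self):
        return min(s for s, _ in self.exons)

    @property
    def end(self):
        return max(e for _, e in self.exons)

    @property
    def tss(self):
        # Transcription start site: first base on "+", last base on "-".
        return self.start if self.strand == "+" else self.end - 1


def spans_overlap(a, b):
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def exons_overlap(a, b):
    return any(s1 < e2 and s2 < e1 for s1, e1 in a.exons for s2, e2 in b.exons)


def classify(lnc, genes, tss_window=100):
    """Classify a lncRNA transcript relative to protein-coding genes."""
    same_strand = [g for g in genes
                   if g.strand == lnc.strand and spans_overlap(lnc, g)]
    if same_strand:
        if any(exons_overlap(lnc, g) for g in same_strand):
            return "sense overlapping"
        return "intronic"
    if any(g.strand != lnc.strand and spans_overlap(lnc, g) for g in genes):
        return "antisense"
    if any(g.chrom == lnc.chrom and abs(g.tss - lnc.tss) <= tss_window
           for g in genes):
        return "bidirectional"
    return "intergenic"
```

Because the checks run in the listed order, a transcript that overlaps one gene on the same strand and another on the opposite strand is reported by its same-strand class.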