Two well-known steps of protein synthesis in eukaryotic cells are the transcription of DNA to pre-RNA in the nucleus and the translation of messenger RNA (mRNA) to proteins in the cytoplasm. A critical intermediate step is splicing: the removal of segments (introns), which contain 97% of the nucleic-acid sites in pre-RNA, and the sequential alignment of the retained segments (exons) to form mRNA. Alternative forms of splicing enrich the proteome, while abnormal splicing can increase the likelihood that a cell develops cancer or other diseases. The mechanisms of splicing and the origins of splicing errors are only partially understood. Our goal is to determine whether rules for splicing can be inferred from data analytics on nucleic-acid sequences. Toward that end, we represent a nucleic-acid site as a point in a plane defined in terms of the anterior and posterior sub-sequences of the site. This “point-set” representation expands the analytical approaches, including statistical tools, available to characterize genome sequences. We find that point-sets for exons and introns are visually different and that the differences can be quantified using a family of generalized moments. We design a machine-learning algorithm that recognizes individual exons or introns with 91% accuracy. Point-set distributions and generalized moments are also found to differ between organisms.
