Front Genet. 2023; 14: 1154120.
Published online 2023 Apr 20. doi: 10.3389/fgene.2023.1154120
PMCID: PMC10156977
PMID: 37152988

Identification of discriminant features from stationary pattern of nucleotide bases and their application to essential gene classification

Abstract

Introduction: Essential genes are indispensable for the survival of various species. They form a family of genes linked to the critical cellular activities required for species survival. These genes code for proteins that regulate central metabolism, gene translation, deoxyribonucleic acid replication, and fundamental cellular structure, and that facilitate intracellular and extracellular transport. Essential genes preserve crucial genomic information that may hold the key to a detailed understanding of life and evolution. Because of their relevance, essential gene studies have long been regarded as a vital topic in computational biology. An essential gene is composed of adenine, guanine, cytosine, and thymine and their various combinations.

Methods: This paper presents a novel method for extracting information on the stationary patterns of the nucleotides adenine, guanine, cytosine, and thymine in each gene. For this purpose, several co-occurrence matrices are derived that provide the statistical distribution of stationary nucleotide patterns in the genes, which helps establish the relationships between the nucleotides. To extract discriminant features, energy, entropy, homogeneity, contrast, and dissimilarity are computed from each co-occurrence matrix and then concatenated to form a feature vector representing each essential gene. Finally, supervised machine learning algorithms are applied for essential gene classification based on the extracted fixed-dimensional feature vectors.

Results: For comparison, some existing state-of-the-art feature representation techniques such as Shannon entropy (SE), Hurst exponent (HE), fractal dimension (FD), and their combinations have been utilized.

Discussion: Extensive experiments have been performed on classifying the essential genes of five species, demonstrating the robustness and effectiveness of the proposed methodology.

Keywords: essential genes, DNA, co-occurrence matrix, feature analysis, classification

1 Introduction

Essential genes are necessary for the survival of a living being and are considered the basis of life. They contain vital genomic information and, hence, could be the key to a broad interpretation of life and evolution (Juhas et al., 2011). They determine significant attributes involving cellular structure, chemistry, and reproduction, among others. Genomes encode data for the functions regularly observed in all life forms, and the instructions can be species-specific. Some genes appear essential for survival, whereas others seem to be optional. Essential gene studies help segregate genes and determine the fundamental components that sustain cellular life. Deletion of an essential gene results in cell death. As a result, essential gene prediction aids in identifying the bare minimum of genes necessary for the vital survival of specific cell types. The discovery and analysis of essential genes aid our understanding of the origin of life (Koonin, 2000). Furthermore, essential genes play a crucial role in synthetic molecular biology and are vital to genome development. An extensive comprehension of essential genes can empower researchers to clarify the biological essence of microorganisms (Juhas et al., 2014), generate the smallest genome subset (Itaya, 1995), develop promising medication targets, and create probable drugs to fight infectious diseases (Dickerson et al., 2011). Due to their significance, the identification of essential genes has been viewed as essential in bioinformatics and genomics.

Essential genes are a set of genes necessary for an organism to thrive in a certain environment. Many are only necessary under particular circumstances. For instance, if a cell is supplied with the amino acid lysine, the gene responsible for lysine production is non-essential. However, if the amino acid supply is unavailable, the gene encoding the enzyme responsible for lysine biosynthesis becomes essential, as protein synthesis is not possible without it. Essential genes regulate the activity of fundamental cells in almost every species (Qin, 2019; Guo et al., 2021). Genes are considered essential if they cannot be knocked out individually under conditions in which most of the needed nutrients are present in the growth medium and the organism grows at its optimal temperature. One of the major issues is determining which identified genes are necessary. There are various experimental techniques to identify essential genes in microorganisms, such as gene knockouts (Roemer et al., 2003), RNA interference (Cullen and Arndt, 2005), transposon mutagenesis (Veeranagouda et al., 2014), and single-gene knockout procedures (Giaever et al., 2002). Although these experimental techniques are effective, they remain expensive and laborious. Hence, there is a need for computational methods to identify essential genes.

Because essential genes have biological significance, several computational methods, particularly machine learning methods, have been employed to identify them. For this objective, many feature extraction and model-building approaches have been developed (Gil et al., 2004; McCutcheon and Moran, 2010; Juhas et al., 2012; Mobegi et al., 2017). Chen and Xu (2005) effectively used high-throughput data and machine learning techniques in Saccharomyces cerevisiae to evaluate protein dispensability. Seringhaus et al. (2006) constructed a machine learning model to predict essential genes in S. cerevisiae using several intrinsic genomic factors. Additionally, Yuan et al. (2012) designed three machine learning techniques based on informative genomic characteristics to detect knockout lethality in mice. Deng (2015) proposed an essential gene classification algorithm using hybrid characteristics such as intrinsic and context-dependent genome aspects. This model achieved area under the receiver operating characteristic curve (AUC) scores of 0.86–0.93 when testing on the same organism and scores of 0.69–0.89 when predicting across organisms using ten-fold cross-validation.

Zhang et al. (2020) contributed significantly by combining sequence- and network-based features to identify essential genes, obtaining valid results by using a deep learning-based model to learn the characteristics generated from sequencing data and protein–protein interaction networks. Liu et al. (2017) published the findings of comprehensive research on 31 bacterial species, including cross-validation, paired, self-test, and leave-one-species-out experiments. Rout et al. (2020) proposed a method to identify essential genes of four species based on various quantitative measures, including purine and pyrimidine distribution. Le et al. (2020) proposed a model for identifying essential genes using an ensemble deep neural network. Xu et al. (2020) developed a method to predict essential genes in prokaryotes from sequence-based features using an artificial neural network. A web server, the Human Essential Genes Interactive Analysis Platform (HEGIAP), was developed by Chen et al. (2020) for detailed analysis of human essential genes.

An expression-based predictor was developed by Kuang et al. (2021) to recognize essential genes in humans; the predictor utilized gene expression profiles to predict essential genes and candidate lncRNAs in cancer cells. Senthamizhan et al. (2021) created NetGenes, a database of essential genes containing predictions for 2,711 bacterial species based on network-based features; the features were extracted from protein–protein interaction networks in the STRING database. Marques de Castro et al. (2022) predicted the essential genes in Tribolium castaneum and Drosophila melanogaster based on physicochemical and statistical data along with subcellular locations, extracting extrinsic and intrinsic attributes from the essential and non-essential data. This paper analyzes the DNA sequences of five species, i.e., Homo sapiens, Danio rerio, D. melanogaster, Mus musculus, and Arabidopsis thaliana, to identify essential genes. The proposed model extracts co-occurrence matrices from the essential gene sequences to find informative patterns that distinguish the species. This paper also examines the impact of different co-occurrence matrices and of existing features such as the Hurst exponent (HE), fractal dimension (FD), Shannon entropy (SE), and modified Shannon entropy (MSE).

The rest of the paper is structured as follows. The definitions of various fundamental parameters are given in Section 2, with relevant descriptions. The proposed methodology, with a detailed dataset description, is discussed in Section 3. Experimental findings and discussion demonstrating the effectiveness of our strategy are presented in Section 4. Finally, the paper is concluded in Section 5, highlighting the most important aspects of the whole investigation.

2 Basic terminology

Essential genes are a family of genes linked to the critical cellular activities required for the survival of a species. Identifying essential genes is a multidisciplinary process that requires both computational and wet-lab validation experiments; the latter are time-consuming and resource-intensive. Hence, to lower validation costs, several machine learning methods have been developed to improve classification accuracy. Most of these are supervised methods, which require massive labeled training data sets, typically impractical for less-sequenced species. On the other hand, the rise of high-throughput wet-lab experimental approaches such as next-generation sequencing has produced an abundance of unlabeled essential gene sequence data. In the initial study, every DNA sequence is represented by a fixed-dimensional feature vector using various quantitative measures, such as SE, MSE, FD, and HE. To estimate these quantitative measures, we convert gene sequences into binary sequences based on the purine and pyrimidine distribution. The two main forms of nucleotide bases in DNA are nitrogenous bases: adenine (A) and guanine (G) are purines, whereas cytosine (C) and thymine (T) are pyrimidines. Here, purine and pyrimidine bases are encoded as 1 and 0, respectively:

A/G → 1 and C/T → 0.
(1)
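As a minimal illustration of this encoding (a hypothetical helper, not part of the original paper), the following Python sketch maps a DNA string to its purine/pyrimidine binary sequence:

```python
def to_purine_pyrimidine(seq):
    """Map a DNA string to a binary sequence: purines (A/G) -> 1, pyrimidines (C/T) -> 0."""
    mapping = {"A": 1, "G": 1, "C": 0, "T": 0}
    return [mapping[base] for base in seq.upper() if base in mapping]

print(to_purine_pyrimidine("ATGCCGTA"))  # [1, 0, 1, 0, 0, 1, 0, 1]
```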

2.1 Shannon entropy and modified Shannon entropy

SE may be used to determine how much uncertainty or information a sequence contains (Zurek, 1989; Khandelwal et al., 2022b). The uncertainty depends on the distribution of each word. A sequence's uncertainty concerning a base pair ranges from 0 to 2n, where n is the length of a word. The SE uses the probability p of the two possibilities (0/1) to calculate the information entropy. The SE of a binary sequence is given by the following equation:

SE = -\sum_{i=0}^{1} p_i \log_2 p_i,
(2)

where p_i indicates the probability of the two values in the binary sequence, and SE is used to compute the uncertainty in a binary string (Khandelwal et al., 2022a). When the probability p = 0, the event is assured never to happen, resulting in no uncertainty and an entropy of 0. Similarly, if p = 1, the result is certain; hence, the entropy must be 0. When p = 1/2, the uncertainty is highest, and the SE is 1. The MSE for different word sizes is given by

MSE = -\sum_{j=1}^{k} w_j \log_2 w_j,
(3)

where w_j indicates the relative frequency of the j-th word in the gene sequence. For instance, for a word of length 1, w_j is determined using the frequencies of the purine/pyrimidine symbols 0 and 1, and for a word of length 2, w_j is determined using the frequencies of the two-symbol words 00, 10, 01, and 11. The total number of words, determined by taking the maximum length of both purines and pyrimidines, is represented by k (Rout et al., 2020).
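A small sketch of both measures is given below; it assumes that p_i and w_j are taken as relative frequencies and that words are counted with overlap, which is one plausible reading of Eqs. (2)–(3):

```python
import math
from collections import Counter

def shannon_entropy(bits):
    """SE of a binary purine/pyrimidine sequence, Eq. (2)."""
    n = len(bits)
    probs = [count / n for count in Counter(bits).values()]
    return -sum(p * math.log2(p) for p in probs)

def modified_shannon_entropy(bits, word_len):
    """MSE over words of a given length, Eq. (3); w_j is taken here as the
    relative frequency of the j-th overlapping word (an assumption)."""
    words = [tuple(bits[i:i + word_len]) for i in range(len(bits) - word_len + 1)]
    n = len(words)
    freqs = [count / n for count in Counter(words).values()]
    return -sum(w * math.log2(w) for w in freqs)

bits = [1, 0, 1, 0, 0, 1, 0, 1]
print(shannon_entropy(bits), modified_shannon_entropy(bits, 2))
```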

2.2 Hurst exponent

The HE evaluates a data set's smoothness and degree of self-similarity. The HE is often used to analyze auto-correlation in time-series analysis. It is calculated using rescaled range analysis (R/S analysis) and takes a value between 0 and 1 (Hurst, 1951; Khandelwal et al., 2022c). A HE value between 0 and 0.5 indicates a negative auto-correlation of the series, while a HE value between 0.5 and 1 indicates a positive auto-correlation. If the HE value is 0.5, the series is random, meaning that there is no relation between a value and its previous values (Hassan et al., 2021; Rout et al., 2022). The HE of a binary sequence D_n is computed using the following equations:

\frac{R_n}{S_n} = \left(\frac{n}{2}\right)^{HE},
(4)

where

S_n = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (D_i - m)^2},
(5)

and

R_n = \max(X_1, X_2, \ldots, X_n) - \min(X_1, X_2, \ldots, X_n),
(6)

X_t = \sum_{i=1}^{t} (D_i - m), \quad \text{for } t = 1, 2, 3, \ldots, n,
(7)

m = \frac{1}{n} \sum_{i=1}^{n} D_i.
(8)
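The following sketch estimates HE from Eqs. (4)–(8) using the rescaled range of the full sequence; a fuller R/S analysis would regress log(R/S) on log(n) over several window sizes, so this is only an illustrative simplification:

```python
import numpy as np

def hurst_exponent(bits):
    """Estimate HE of a binary sequence D via Eqs. (4)-(8), using the full length n only."""
    d = np.asarray(bits, dtype=float)
    n = len(d)
    m = d.mean()                              # Eq. (8): mean of the sequence
    x = np.cumsum(d - m)                      # Eq. (7): cumulative deviations
    r = x.max() - x.min()                     # Eq. (6): range of cumulative deviations
    s = np.sqrt(np.mean((d - m) ** 2))        # Eq. (5): standard deviation
    return np.log(r / s) / np.log(n / 2.0)    # Eq. (4) solved for HE

print(hurst_exponent([1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0]))
```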

2.3 Fractal dimension

Every DNA sequence is converted into an indicator matrix (Rout et al., 2018; Umer et al., 2021). Let X = {A, T, C, G} denote the finite alphabet of nucleotides, and let D(N) denote a DNA sequence of length N over the four symbols of X. The indicator function for every DNA sequence is described by the following equation:

F : D(N) \times D(N) \rightarrow \{0, 1\},
(9)

such that the indicator matrix will be

I(N, N) = \begin{cases} 1, & \text{if } s_i = s_j \\ 0, & \text{if } s_i \neq s_j \end{cases}, \quad \text{where } s_i, s_j \in D(N).
(10)

Here, I(N, N) is a matrix with values 0 and 1, and it produces a binary image of the DNA sequence as a 2D dot-plot. Within the same sequence, the binary image represents the distribution of 0s and 1s; it is possible to assign a white dot to 0 and a black dot to 1. The FD of an indicator matrix can be computed from σ(n), the average number of 1s in randomly selected n × n submatrices of the N × N indicator matrix (Cattani, 2010; Rout et al., 2014; Upadhayay et al., 2019). Using σ(n), the FD is computed by the following equation:

FD = \frac{1}{N} \sum_{n=2}^{N} \frac{\log \sigma(n)}{\log n}.
(11)
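A hedged sketch of this measure is shown below; it builds the indicator matrix of Eq. (10) and estimates σ(n) as the mean count of 1s in a fixed number of randomly placed n × n submatrices, a sampling scheme that is an assumption rather than the paper's exact procedure:

```python
import numpy as np

def indicator_matrix(seq):
    """I(N, N) from Eq. (10): entry (i, j) is 1 if the i-th and j-th nucleotides match."""
    s = np.array(list(seq))
    return (s[:, None] == s[None, :]).astype(int)

def fractal_dimension(seq, samples=20, seed=0):
    """FD from Eq. (11). sigma(n) is estimated as the mean number of 1s in `samples`
    randomly placed n x n submatrices of I(N, N); this sampling scheme is an assumption."""
    rng = np.random.default_rng(seed)
    I = indicator_matrix(seq)
    N = I.shape[0]
    total = 0.0
    for n in range(2, N + 1):
        counts = []
        for _ in range(samples):
            i, j = rng.integers(0, N - n + 1, size=2)
            counts.append(I[i:i + n, j:j + n].sum())
        sigma = max(np.mean(counts), 1.0)      # guard against all-zero samples
        total += np.log(sigma) / np.log(n)
    return total / N

print(fractal_dimension("ATGCCGTAAT"))
```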

3 Proposed scheme

In this paper, we used the Database of Essential Genes (http://www.essentialgene.org/) for the experimental findings and discussion. This dataset consists of the essential genes of five species: 2,051 H. sapiens (HS), 315 D. rerio (DR), 339 D. melanogaster (DOM), 356 A. thaliana (AT), and 125 M. musculus (MM) essential genes. Table 1 lists some of the terminology employed in the proposed technique for reference.

TABLE 1

List of species considered in the proposed technique.

Name                       Symbol used
Arabidopsis thaliana       AT
Drosophila melanogaster    DOM
Danio rerio                DR
Homo sapiens               HS
Mus musculus               MM

Naming convention for Arabidopsis thaliana:    [AT 1 … AT 356]
Naming convention for Drosophila melanogaster: [DOM 1 … DOM 339]
Naming convention for Danio rerio:             [DR 1 … DR 315]
Naming convention for Homo sapiens:            [HS 1 … HS 2051]
Naming convention for Mus musculus:            [MM 1 … MM 125]

3.1 Proposed feature representation technique

The DNA (deoxyribonucleic acid) sequence of an essential gene S is composed of four bases: adenine (A), guanine (G), cytosine (C), and thymine (T). Therefore, several occurrences of combinations of A, C, T, G may exist within the sequence S. The co-occurrences of A, C, T, G in the DNA sequence establish the relationships between the nucleotides. To the best of our knowledge, this is the first time a method has been proposed for finding the co-occurrences of the nucleotides A, C, T, G within S. The objective of finding these co-occurrences is to analyze the patterns of A, C, T, G within the DNA sequence S in order to derive useful features that uniquely discriminate the species by the feature representation of their essential genes. Assuming x = (A, C, T, G) is a vector of the nucleotides, the possible arrangements of these characters in the DNA gene sequences are represented through co-occurrence matrices formed by vector combinations, as shown in Table 2.

TABLE 2

Possible sets of occurrences of nucleobases A, C, T, G in a DNA sequence or essential gene formed by the combination of vectors, where I, J, K, L, M, N, O, P are the co-occurrence matrices.

X                                  Y               X^T × Y
X_1 = (A, C, T, G)                 (A, C, T, G)    I_{4×4} = X_1^T (4×1) × Y (1×4)
X_2 = (AA, CC, TT, GG)             (A, C, T, G)    J_{4×4} = X_2^T (4×1) × Y (1×4)
X_3 = (AC, AT, AG, CT, CG, TG)     (A, C, T, G)    K_{6×4} = X_3^T (6×1) × Y (1×4)
X_4 = (CA, TA, GA, TC, GC, GT)     (A, C, T, G)    L_{6×4} = X_4^T (6×1) × Y (1×4)
X_5 = (ACT, ACG, ATG, CTG)         (A, C, T, G)    M_{4×4} = X_5^T (4×1) × Y (1×4)
X_6 = (CAT, CAG, TAG, TCG)         (A, C, T, G)    N_{4×4} = X_6^T (4×1) × Y (1×4)
X_7 = (ATC, AGC, AGT, CGT)         (A, C, T, G)    O_{4×4} = X_7^T (4×1) × Y (1×4)
X_8 = (TCA, GCA, GTA, GTC)         (A, C, T, G)    P_{4×4} = X_8^T (4×1) × Y (1×4)

Here, the computed co-occurrence matrices of different combinations of nucleobases represent the distribution of nucleobases throughout the essential gene S. This distribution characterizes the texture pattern and captures the spatial relationships of nucleobases in the essential gene S. Experimentally, it has been observed that the occurrences of the spatial relationships of nucleobases cannot provide separate information on the stationary and non-stationary patterns of A, C, T, and G; rather, the obtained spatial relationships contain the information of both patterns at once. Hence, statistically it is easier to compute information considering both stationary and non-stationary patterns together than to differentiate stationary and non-stationary patterns in S. Essential genes are critical for the survival of any organism and beneficial for cell growth. Each gene sequence is variable in length, and the arrangement of the A, C, T, G nucleobases is irregular. Hence, finding the stationary and non-stationary patterns of A, C, T, G and the co-occurrences of different combinations of these nucleobases helps reveal their natural pattern in the gene. Deriving valuable patterns of the variety of A, C, T, G through co-occurrence matrix descriptors considerably improves the retrieval performance and makes it possible to analyze the statistical and structural information in those patterns effectively. Hence, inspired by co-occurrence-matrix-based texture analysis (Umer et al., 2016) in image processing and pattern recognition, we have employed the idea of the gray-level co-occurrence matrix and computed several co-occurrence matrices from each essential gene. Specifically, the co-occurrence matrices I_{4×4}, J_{4×4}, K_{6×4}, L_{6×4}, M_{4×4}, N_{4×4}, O_{4×4}, and P_{4×4} are computed, which contain several patterns of the A, C, T, G nucleobases in each DNA sequence S. These co-occurrence matrices are defined in Table 3 and Supplementary Tables S1–S7, respectively.

TABLE 3

Co-occurrence matrix I that contains several patterns of A, C, T, G nucleobases in DNA gene sequence S

     A        C        T        G
A    #(AA)    #(AC)    #(AT)    #(AG)
C    #(CA)    #(CC)    #(CT)    #(CG)
T    #(TA)    #(TC)    #(TT)    #(TG)
G    #(GA)    #(GC)    #(GT)    #(GG)
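As an illustration of Table 3, the sketch below counts adjacent nucleotide pairs to build matrix I; the offset of 1 is an assumption made for illustration, since the paper computes several matrices over different combinations and offsets:

```python
import numpy as np

BASES = "ACTG"

def cooccurrence_matrix_I(seq, offset=1):
    """Matrix I of Table 3: entry (X, Y) counts how often base X is followed,
    `offset` positions later, by base Y in the gene sequence."""
    index = {base: k for k, base in enumerate(BASES)}
    I = np.zeros((4, 4), dtype=int)
    for pos in range(len(seq) - offset):
        a, b = seq[pos], seq[pos + offset]
        if a in index and b in index:
            I[index[a], index[b]] += 1
    return I

print(cooccurrence_matrix_I("ATGCCGTAAT"))
```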

Here, from the given DNA sequence S, the aforementioned co-occurrence matrices are obtained. Each co-occurrence matrix G contains the number of occurrences of the A, C, T, G nucleobases for a specific combination and offset in S. Since a sequence S with q different combinations of A, C, T, G nucleobases produces a co-occurrence matrix of size q × 4 for a given offset, the (r, s)-th value of a co-occurrence matrix (Table 3; Supplementary Tables S1–S7) gives the number of times the r-th and s-th nucleobase patterns occur together in S. Hence, mathematically, each co-occurrence matrix is given by

G(r, s) = \sum_{i=1}^{n} \sum_{j=1}^{n} \begin{cases} 1, & \text{if } S(i, j) = r \text{ and } S(i + \Delta i, j + \Delta j) = s \\ 0, & \text{otherwise,} \end{cases}
(12)

where the offset (Δi, Δj) defines the spatial relation for which the matrix G is calculated. The number of co-occurrences of the combinations of A, C, T, G present in S is thus obtained from the co-occurrence matrices. To extract distinguishing and discriminant features, each matrix G is normalized as G' = G / (\sum_{r=0}^{q} \sum_{s=0}^{q} G(r, s)). The normalized co-occurrence matrix G' is then used to compute features such as entropy, dissimilarity, energy, homogeneity, and contrast. The mathematical definitions of these features are shown in Table 4.

TABLE 4

Features extracted from a co-occurrence matrix G of DNA sequence S.

Feature          Formula
Energy           \sum_{r=0}^{q} \sum_{s=0}^{q} G(r, s)^2
Entropy          -\sum_{r=0}^{q} \sum_{s=0}^{q} G(r, s) \ln G(r, s)
Homogeneity      \sum_{r=0}^{q} \sum_{s=0}^{q} G(r, s) / (1 + (r - s)^2)
Contrast         \sum_{r=0}^{q} \sum_{s=0}^{q} G(r, s) (r - s)^2
Dissimilarity    \sum_{r=0}^{q} \sum_{s=0}^{q} G(r, s) |r - s|
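A sketch of the Table 4 descriptors is shown below; it normalizes a co-occurrence matrix so that its entries sum to 1 and skips zero entries in the entropy term, both of which are assumptions about the handling of edge cases:

```python
import numpy as np

def cooccurrence_features(G):
    """Energy, entropy, homogeneity, contrast, and dissimilarity (Table 4)
    of a co-occurrence matrix G after normalization."""
    G = np.asarray(G, dtype=float)
    G = G / G.sum()
    r, s = np.indices(G.shape)
    nz = G > 0                                   # avoid log(0) in the entropy term
    energy        = np.sum(G ** 2)
    entropy       = -np.sum(G[nz] * np.log(G[nz]))
    homogeneity   = np.sum(G / (1.0 + (r - s) ** 2))
    contrast      = np.sum(G * (r - s) ** 2)
    dissimilarity = np.sum(G * np.abs(r - s))
    return [energy, entropy, homogeneity, contrast, dissimilarity]

print(cooccurrence_features([[2, 1, 0, 1], [1, 3, 1, 0], [0, 1, 2, 1], [1, 0, 1, 2]]))
```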

Now, the features defined in Table 4 are extracted from each co-occurrence matrix (Table 3; Supplementary Tables S1–S7), and the list of feature vectors extracted from these matrices is obtained as follows:

f I = (f 1, f 2, f 3, f 4, f 5) from I (Table 3)

f J = (f 6, f 7, f 8, f 9, f 10) from J (Supplementary Table S1)

f K = (f 11, f 12, f 13, f 14, f 15) from K (Supplementary Table S2)

f L = (f 16, f 17, f 18, f 19, f 20) from L (Supplementary Table S3)

f M = (f 21, f 22, f 23, f 24, f 25) from M (Supplementary Table S4)

f N = (f 26, f 27, f 28, f 29, f 30) from N (Supplementary Table S5)

f O = (f 31, f 32, f 33, f 34, f 35) from O (Supplementary Table S6)

f P = (f 36, f 37, f 38, f 39, f 40) from P (Supplementary Table S7)

Hence, the final feature representation of a DNA sequence or essential gene S is given by the feature vector f = (f I , f J , f K , f L , f M , f N , f O , f P ).

3.2 Classification

In this study, the decision tree (DT), k-nearest neighbor (KNN), and support vector machine (SVM) classifiers are used for the classification of the essential genes of the employed species. During experimentation, the dataset of each species, Arabidopsis thaliana (AT), Drosophila melanogaster (DOM), Danio rerio (DR), Homo sapiens (HS), and Mus musculus (MM), is divided into two halves, with 50% of the data used as the training set and the remaining 50% as the testing set. A five-fold cross-validation technique is then employed. Finally, the average performance on the testing data is reported for the proposed system.
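A sketch of this evaluation protocol with scikit-learn is shown below; the synthetic feature matrix merely stands in for the 40-dimensional co-occurrence features and the five species labels, and default classifier settings are assumed:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Placeholder data standing in for the 40-dimensional feature vectors and 5 species labels.
X, y = make_classification(n_samples=600, n_features=40, n_informative=10,
                           n_classes=5, random_state=0)

# 50% training / 50% testing split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

for name, clf in [("KNN", KNeighborsClassifier()),
                  ("DT", DecisionTreeClassifier(random_state=0)),
                  ("SVM", SVC())]:
    cv_f1 = cross_val_score(clf, X_train, y_train, cv=5, scoring="f1_weighted")
    clf.fit(X_train, y_train)
    test_f1 = f1_score(y_test, clf.predict(X_test), average="weighted")
    print(f"{name}: mean 5-fold CV F1 = {cv_f1.mean():.3f}, test F1 = {test_f1:.3f}")
```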

DT is a supervised algorithm generated using the Iterative Dichotomiser 3 (ID3) algorithm or the Classification and Regression Tree (CART) algorithm (Quinlan, 1986). The DT uses decision nodes to split the dataset into smaller subsets based on information gain (IG) or the Gini index. ID3 uses IG to evaluate how well an attribute splits the training dataset with respect to its classification objective. IG is the difference between the dataset's entropy before and after splitting on the specified attribute values. Let X = {x_1, x_2, x_3, …, x_n} represent the set of instances, A an attribute, and X_v the subset of X having A = v. Then, IG is given by

IG(X, A) = Ent(X) - \sum_{v \in V(A)} \frac{|X_v|}{|X|} Ent(X_v),
(13)

where Ent(X) is the entropy of X and V(A) is the collection of all possible values of A. The entropy of X is given by

Ent(X) = -\sum_{i=1}^{c} p_i \log_2 p_i,
(14)

where p_i denotes the probability of the i-th class in the current set X.
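A small worked example with hypothetical numbers illustrates Eqs. (13)–(14): ten instances with a 6/4 class split are divided by an attribute A into two subsets, and the information gain is the resulting drop in entropy:

```python
import math

def entropy(labels):
    """Ent(X) from Eq. (14)."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

X  = ["pos"] * 6 + ["neg"] * 4          # Ent(X) ~ 0.971
X1 = ["pos"] * 5 + ["neg"] * 1          # subset with A = v1
X2 = ["pos"] * 1 + ["neg"] * 3          # subset with A = v2
ig = entropy(X) - (len(X1) / len(X)) * entropy(X1) - (len(X2) / len(X)) * entropy(X2)
print(round(ig, 3))                     # IG(X, A) ~ 0.256 from Eq. (13)
```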

KNN is a supervised, non-parametric machine learning technique, meaning it makes no assumptions about the underlying data. The KNN method assumes that the unseen data and the existing dataset are comparable and places the unseen data in the class most similar to it. KNN works by simply storing the data during training. When it sees new data at testing time, it finds the k nearest neighbors to the new data point using a distance measure, e.g., the Euclidean distance, and classifies it based on this similarity (Peterson, 2009). The steps of the KNN algorithm are as follows.

  • 1. First, select the value of K, i.e., the number of closest data points. Any integer may be used as K.
  • 2. For each data point in the test dataset: (i) find the distance between the data point and all samples in the training dataset using one of the following measures: Manhattan, Euclidean, or Hamming distance (in this paper, the Euclidean distance is used); (ii) sort the samples in ascending order of distance; (iii) select the top K samples as the nearest neighbors of the test data point; (iv) assign the test data point the most common class among these K samples (a minimal implementation of these steps is sketched below).
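The following sketch follows the listed steps directly; the training points and labels are made up purely for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Classify one test point by the steps above: Euclidean distances (i),
    sort ascending (ii), keep the K nearest samples (iii), majority vote (iv)."""
    dists = np.linalg.norm(np.asarray(X_train, dtype=float) - np.asarray(x_test, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(np.asarray(y_train)[nearest])
    return votes.most_common(1)[0][0]

X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_train = ["AT", "AT", "HS", "HS"]
print(knn_predict(X_train, y_train, [0.85, 0.85], k=3))  # "HS"
```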

The SVM is a supervised machine learning approach for classifying data. It is a well-known technique used in various bioinformatics and computational biology problems, and it needs fewer model parameters to describe the non-linear transition from primary sequence to protein structure region. To minimize the error, the SVM constructs the hyperplane iteratively. The SVM is noted for its fast training, which is necessary for high-throughput database testing (Suthaharan, 2016). Let the dataset be represented by (X_1, y_1), (X_2, y_2), (X_3, y_3), …, (X_n, y_n). The SVM solves the following optimization problem:

\min_{w, b} \|w\|^2 \quad \text{such that} \quad \forall i, \; y_i (\langle w, X_i \rangle + b) \geq 1,
(15)

where w and b are the weight vector and bias of the hyperplane equation ⟨w, X⟩ + b = 0, respectively.

3.3 Evaluation metrics

In this paper, the essential gene classification problem is a multi-class classification problem, as we classify the essential genes of five species, i.e., AT, DOM, DR, HS, and MM. For every class in the target, the evaluation metrics (accuracy, precision, recall, and F1-score) were computed. A weighted averaging technique was then used to obtain the final value of each evaluation metric:

Accuracy = \frac{\sum_{i=1}^{C} n_i \times \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}}{\sum_{i=1}^{C} n_i},
(16)

Precision = \frac{\sum_{i=1}^{C} n_i \times \frac{TP_i}{TP_i + FP_i}}{\sum_{i=1}^{C} n_i},
(17)

Recall = \frac{\sum_{i=1}^{C} n_i \times \frac{TP_i}{TP_i + FN_i}}{\sum_{i=1}^{C} n_i},
(18)

F1\text{-}score = \frac{\sum_{i=1}^{C} n_i \times \frac{2 \times Precision_i \times Recall_i}{Precision_i + Recall_i}}{\sum_{i=1}^{C} n_i},
(19)

where

Precision_i = \frac{TP_i}{TP_i + FP_i},
(20)

and

Recall_i = \frac{TP_i}{TP_i + FN_i},
(21)

where TP i , TN i , FP i , and FN i are the counts of true positives, true negatives, false positives, and false negatives, respectively, for the i th class. Here, C represents the number of classes in the problem, and n i indicates the number of samples in the i th class.
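For precision, recall, and F1-score, this weighted averaging corresponds to scikit-learn's average="weighted" option, as sketched below with hypothetical labels; plain accuracy is shown in place of the per-class weighted accuracy of Eq. (16), which differs slightly:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical true and predicted species labels for six genes.
y_true = ["AT", "AT", "HS", "HS", "MM", "DR"]
y_pred = ["AT", "HS", "HS", "HS", "MM", "AT"]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="weighted", zero_division=0))
print("Recall   :", recall_score(y_true, y_pred, average="weighted", zero_division=0))
print("F1-score :", f1_score(y_true, y_pred, average="weighted", zero_division=0))
```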

3.4 Model framework

The proposed model classifies the essential genes of five species based on co-occurrence matrices. The model derives eight different co-occurrence matrices from each DNA sequence. From each co-occurrence matrix, five features, i.e., energy, entropy, homogeneity, contrast, and dissimilarity, are extracted. The existing features, such as HE, FD, SE, and MSE, are also computed and then combined with the proposed features for the classification of essential genes. A supervised machine learning algorithm, the SVM, was used to evaluate the model. Figure 1 shows the framework of the proposed model.

FIGURE 1

Framework of the proposed model for the classification of essential genes. Here, CoM indicates the co-occurrence matrices.

4 Results and discussion

The proposed essential gene classification model can identify novel essential genes with high recall and precision while requiring only a small number of previously identified essential genes in some species. Such a method could be highly beneficial when investigating essential genes in newly sequenced genomes of other species with few known examples of essential genes. The proposed work has been implemented in the Python environment, and Python libraries of machine learning algorithms have been employed for the data classification tasks. Python is a widely used, open-source scripting and programming language with high-level object-oriented programming features and extensive mathematical and statistical functionality. The proposed methodology was implemented on the Kaggle platform, which provides a research environment for data scientists and machine learning practitioners. For Python tooling, we employed NumPy, pandas, Matplotlib, scikit-learn (preprocessing, classifier, and metrics modules), and some other packages for data analysis and prediction models. The feature vectors extracted from each DNA gene sequence S are fed to the KNN, DT, and SVM classifiers. The dataset statistics for the AT, DOM, DR, HS, and MM species are given in Table 5. The experimentation of the proposed methodology is divided into the following sub-sections.

TABLE 5

Demonstration of actual files containing gene sequences corresponding to AT, DOM, DR, HS, and MM species.

          Actual files    Actual files containing DNA sequences
AT        356             356
DOM       339             339
DR        315             315
HS        2054            2051
MM        411             125

4.1 Experiment for the proposed features

In this section, experiments with individual features have been performed. From each DNA sequence S, each individual feature of f_I, f_J, f_K, f_L, f_M, f_N, f_O, f_P has been considered, and classification has then been performed. Figure 2 shows the distribution of F1-score performance obtained by the DT, KNN, and SVM classifiers for each of the 40 features computed from the co-occurrence matrices of the DNA sequence S. From this figure, it is observed that the KNN and SVM classifiers predict the classification problem better than the DT classifier for most of the features. Moreover, the classifiers obtain more or less similar performance for most features, with better performance for the 19th, 26th, 27th, 30th, 32nd, and 35th features of the forty-dimensional feature vector f. To measure the impact of the individual feature types (entropy, homogeneity, energy, contrast, and dissimilarity) on the classification of essential genes, the performance of the KNN, DT, and SVM classifiers is reported in Table 6. Here, experiments are carried out under the same training–testing protocol, and from each DNA sequence S, the corresponding feature is extracted from all co-occurrence matrices, so that an eight-dimensional feature vector is obtained for each of the entropy, homogeneity, energy, contrast, and dissimilarity features.

FIGURE 2

Distribution of F1-score performance obtained by the decision tree, KNN, and SVM classifiers with respect to the 40 features computed from the co-occurrence matrices of DNA gene sequence S.

TABLE 6

Impact of different co-occurrence features on the classification of essential gene sequences of AT, DOM, DR, HS, and MM species.

Classifier                        Accuracy    Precision    Recall    F1-score
Effect of entropy features
 K-nearest neighbors              63.56       56.68        63.56     59.39
 Decision tree                    52.95       53.56        52.95     53.25
 Support vector machine           64.37       41.44        64.37     50.42
Effect of dissimilarity features
 K-nearest neighbors              62.96       57.38        62.96     59.55
 Decision tree                    52.70       53.84        52.70     53.25
 Support vector machine           67.07       58.80        67.07     56.75
Effect of energy features
 K-nearest neighbors              59.48       52.71        59.48     55.46
 Decision tree                    48.65       49.82        48.65     49.22
 Support vector machine           64.94       50.32        64.94     51.83
Effect of homogeneity features
 K-nearest neighbors              63.06       57.59        63.06     59.99
 Decision tree                    53.61       54.81        53.61     54.19
 Support vector machine           67.67       60.76        67.67     58.29
Effect of contrast features
 K-nearest neighbors              64.25       58.92        64.25     61.02
 Decision tree                    54.80       56.27        54.80     55.51
 Support vector machine           68.36       59.82        68.36     58.85

As shown in Table 6, the performance is more or less the same for every feature type, but the KNN classifier performs better than DT and SVM. The F1-score is considered as the classification performance measure because the employed species AT, DOM, DR, HS, and MM exhibit a class imbalance problem. Furthermore, the effect of the features computed from each co-occurrence matrix is considered in the subsequent experiments. Here, a 5-dimensional feature vector is extracted from each co-occurrence matrix. The performance of these feature vectors under the same training–testing protocol is reported in Table 7, which shows that the features of the different co-occurrence matrices have a more or less similar effect on essential gene classification. Hence, the features computed from the co-occurrence matrices are helpful and effective, with the KNN classifier again performing better.

TABLE 7

Impact of features extracted from different co-occurrence matrices for the classification of essential gene sequences of AT, DOM, DR, HS, and MM species.

Classifier                        Accuracy    Precision    Recall    F1-score
Effect of first matrix
 K-nearest neighbors              63.37       56.39        63.37     59.20
 Decision tree                    53.70       54.02        53.70     53.85
 Support vector machine           64.38       41.44        64.38     50.42
Effect of second matrix
 K-nearest neighbors              62.05       54.43        62.05     57.54
 Decision tree                    53.20       53.88        53.20     53.53
 Support vector machine           64.38       41.44        64.38     50.42
Effect of third matrix
 K-nearest neighbors              60.58       52.69        60.58     55.66
 Decision tree                    49.72       51.01        49.72     50.34
 Support vector machine           64.38       41.44        64.38     50.42
Effect of fourth matrix
 K-nearest neighbors              62.96       58.32        62.96     59.41
 Decision tree                    54.33       55.14        54.33     54.72
 Support vector machine           64.38       41.44        64.38     50.42
Effect of fifth matrix
 K-nearest neighbors              57.91       49.72        57.91     53.02
 Decision tree                    47.24       48.14        47.24     47.69
 Support vector machine           64.38       41.44        64.38     50.42
Effect of sixth matrix
 K-nearest neighbors              61.49       54.13        61.49     57.14
 Decision tree                    52.69       54.34        52.69     53.49
 Support vector machine           65.35       47.61        65.35     53.36
Effect of seventh matrix
 K-nearest neighbors              58.82       52.94        58.82     55.37
 Decision tree                    50.44       51.56        50.44     50.99
 Support vector machine           64.81       46.81        64.81     53.45
Effect of eighth matrix
 K-nearest neighbors              56.12       50.86        56.12     52.78
 Decision tree                    49.28       49.86        49.28     49.56
 Support vector machine           64.38       41.44        64.38     50.42

4.2 Experiment for the existing features

In a further experiment, the performance has been compared with some existing state-of-the-art feature extraction techniques, namely SE, MSE, HE, and FD (discussed in Section 2), where these features are extracted accordingly. The performance is obtained with the KNN, DT, and SVM classifiers and is reported in Table 8, which implies that the SE, HE, MSE, and FD features have more or less similar performance; among the classifiers, the SVM obtains better performance. A comparison of these performances with the proposed system is shown in Figure 3, which shows that the proposed approach classifies the essential genes of the AT, DOM, DR, HS, and MM species better under the same training–testing protocol. The difference is that the proposed system considers a forty-dimensional feature vector, whereas each existing feature extraction technique yields a one-dimensional feature. Hence, this work demonstrates the discriminatory power of co-occurrence matrix features, with better performance than the existing state-of-the-art features.

TABLE 8

Impact of existing and proposed features on the classification of essential genes for the AT, DOM, DR, HS, and MM species.

Classifier                        Accuracy    Precision    Recall    F1-score
Effect of Shannon entropy features
 K-nearest neighbors              53.10       46.24        53.10     49.14
 Decision tree                    48.28       46.96        48.28     47.53
 Support vector machine           64.33       41.38        64.33     50.36
Effect of Hurst exponent features
 K-nearest neighbors              53.98       45.63        53.98     49.14
 Decision tree                    43.57       45.41        43.57     44.45
 Support vector machine           64.33       41.38        64.33     50.36
Effect of modified Shannon entropy features
 K-nearest neighbors              54.67       46.20        54.67     49.71
 Decision tree                    41.76       43.98        41.76     42.80
 Support vector machine           64.26       45.64        64.26     50.66
Effect of fractal dimension features
 K-nearest neighbors              58.11       52.19        58.11     52.15
 Decision tree                    68.35       46.72        68.35     55.51
 Support vector machine           68.35       46.72        68.35     55.51
Effect of proposed features
 K-nearest neighbors              64.95       59.49        64.95     61.50
 Decision tree                    58.31       59.24        58.31     58.70
 Support vector machine           66.14       56.57        66.14     54.35
FIGURE 3

Performance (F1-score) comparison of existing features and the proposed features for the classification of essential genes of AT, DOM, DR, HS, and MM species.

4.3 Experiment for the combined features

The co-occurrences of the nucleotides A, C, T, G in an essential gene capture the distribution of these nucleotides and also their relative position information within the gene S. The existing state-of-the-art feature extraction techniques discussed in this work are key measures in information theory. For example, SE and its modified version compute the amount of uncertainty and randomness of the nucleotides in the gene S, HE measures the relative tendency and characteristic parameters of their distribution in the essential gene, and FD computes the fractal-like distribution of nucleotides from the indicator matrix calculated from the essential gene S. So, the similarity of nucleotide patterns computed by the co-occurrence matrices and the information on uncertainty, randomness, relative tendency, and fractal-like distribution in S are combined here to obtain more discriminant features for the classification of the essential genes of the AT, DOM, DR, HS, and MM species. Principal component analysis with an explained-variance ratio has been adopted for dimensionality reduction to find the most suitable combination of these features. The performance of the combination of these features is demonstrated in Table 9.
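A sketch of this dimensionality-reduction step is given below; scikit-learn's PCA accepts a fractional n_components, interpreted as the explained-variance ratio (the 0.85–0.99 values used in Table 10), while the standardization step and the placeholder data are assumptions made for illustration:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder data standing in for the combined 44 features (40 proposed + SE, HE, MSE, FD).
rng = np.random.default_rng(0)
X_combined = rng.normal(size=(200, 44))
y = rng.choice(["AT", "DOM", "DR", "HS", "MM"], size=200)

for ratio in (0.85, 0.90, 0.95, 0.99):
    model = make_pipeline(StandardScaler(), PCA(n_components=ratio), SVC())
    model.fit(X_combined, y)
    kept = model.named_steps["pca"].n_components_
    print(f"variation {ratio}: {kept} principal components retained")
```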

TABLE 9

Demonstration of discriminant features among the proposed features and the Shannon entropy, Hurst exponent, modified Shannon entropy, and fractal dimension features.

Feature    Eigenvalue    Rank      Feature    Eigenvalue    Rank
f 1        13.908        1         f 23       0.283         23
f 2        4.434         2         f 24       0.257         24
f 3        3.628         3         f 25       0.224         25
f 4        2.895         4         f 26       0.192         26
f 5        2.505         5         f 27       0.152         27
f 6        2.233         6         f 28       0.109         28
f 7        1.904         7         f 29       0.041         29
f 8        1.602         8         f 30       0.032         30
f 9        1.388         9         f 31       0.027         32
f 10       1.133         10        f 32       0.027         31
f 11       0.986         11        f 33       0.023         33
f 12       0.855         12        f 34       0.019         34
f 13       0.820         13        f 35       0.015         35
f 14       0.750         14        f 36       0.008         36
f 15       0.714         15        f 37       0.006         37
f 16       0.525         16        f 38       0.001         43
f 17       0.471         17        f 39       0.001         44
f 18       0.440         18        f 40       0.002         42
f 19       0.432         19        f 41       0.003         41
f 20       0.333         20        f 42       0.003         40
f 21       0.329         21        f 43       0.004         39
f 22       0.299         22        f 44       0.004         38

Table 10 reports the discriminatory power of the combined features for various reduced feature dimensions with the KNN, DT, and SVM classifiers and shows that the highest F1-score is 71.42, obtained by the SVM classifier. As this is a class imbalance problem, the F1-score performance has been reported.

TABLE 10

Demonstration of performance due to combination of features for the classification of essential genes of AT, DOM, DR, HS, and MM species.

Variation    Classifier                 Accuracy    Precision    Recall    F1-score    Feature dimension
0.85         K-nearest neighbors        72.01       66.37        72.01     68.67       4
             Decision tree              63.09       63.63        63.09     63.34
             Support vector machine     74.30       68.77        74.30     67.69
0.9          K-nearest neighbors        71.52       66.77        71.52     68.94       5
             Decision tree              62.67       63.81        62.67     63.18
             Support vector machine     75.91       69.57        75.91     70.31
0.95         K-nearest neighbors        73.82       68.83        73.82     70.80       7
             Decision tree              63.93       64.67        63.93     64.29
             Support vector machine     76.46       72.63        76.46     71.06
0.99         K-nearest neighbors        73.96       68.29        73.96     70.66       9
             Decision tree              64.48       65.35        64.48     64.88
             Support vector machine     76.32       70.56        76.32     71.42

The bold value indicates the highest F1-score.

For better understanding and visibility, the final performance for the combination of features for the classification of essential genes of AT, DOM, DR, HS, and MM species has been shown in Figure 4.

FIGURE 4

Final performance for the combination of features for the classification of essential genes of AT, DOM, DR, HS, and MM species.

5 Conclusion

A novel method of feature extraction and analysis for the classification of the essential genes of the Arabidopsis thaliana (AT), Drosophila melanogaster (DOM), Danio rerio (DR), Homo sapiens (HS), and Mus musculus (MM) species has been presented in this work. The implementation of the proposed scheme is divided into three segments. In the first segment, novel co-occurrence matrix-based features are extracted from the genes; these capture the distribution of nucleotides and their relative positions in the respective gene and belong to the statistical analysis of the distribution of stationary nucleotide patterns in the essential genes. In the second segment, some existing state-of-the-art feature computation techniques such as SE, HE, and FD are used as information theory measures that compute uncertainty, randomness, relative tendency, and fractal-like structures in the gene. In the third segment, the features from the proposed methodology and the existing techniques are individually used for the classification tasks, and their F1-score performance is compared. These comparisons show the robustness and effectiveness of the proposed methodology. Finally, the features from the proposed scheme and the existing techniques are combined to compute more discriminatory features for classifying the essential genes of the AT, DOM, DR, HS, and MM species.

Funding Statement

HQ thanks the United States NSF awards 1761839 and 2200138 and a catalyst award from the United States National Academy of Medicine.

Data availability statement

The data used for this study are publicly available at http://www.essentialgene.org/.

Author contributions

RR and SU conceived the method and design. RR, SU, and MK conducted the experiment, and RR, SU, MK, SP, and SM analyzed the results. RR, SU, MK, and SP wrote the manuscript. SM, BB, and HQ reviewed and edited the manuscript. All authors read and approved the final manuscript.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Supplementary material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fgene.2023.1154120/full#supplementary-material

References

  • Cattani C. (2010). Fractals and hidden symmetries in dna. Math. problems Eng. 2010. 10.1155/2010/507056. [CrossRef] [Google Scholar]
  • Chen H., Zhang Z., Jiang S., Li R., Li W., Zhao C., et al. (2020). New insights on human essential genes based on integrated analysis and the construction of the hegiap web-based platform. Briefings Bioinforma. 21, 1397–1410. 10.1093/bib/bbz072 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Chen Y., Xu D. (2005). Understanding protein dispensability through machine-learning analysis of high-throughput data. Bioinformatics 21, 575–581. 10.1093/bioinformatics/bti058 [PubMed] [CrossRef] [Google Scholar]
  • Cullen L. M., Arndt G. M. (2005). Genome-wide screening for gene function using rnai in mammalian cells. Immunol. cell Biol. 83, 217–223. 10.1111/j.1440-1711.2005.01332.x [PubMed] [CrossRef] [Google Scholar]
  • Deng J. (2015). “An integrated machine-learning model to predict prokaryotic essential genes,” in Gene essentiality (Springer; ), 137–151. [PubMed] [Google Scholar]
  • Dickerson J. E., Zhu A., Robertson D. L., Hentges K. E. (2011). Defining the role of essential genes in human disease. PloS one 6, e27368. 10.1371/journal.pone.0027368 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Giaever G., Chu A. M., Ni L., Connelly C., Riles L., Véronneau S., et al. (2002). Functional profiling of the saccharomyces cerevisiae genome. nature 418, 387–391. 10.1038/nature00935 [PubMed] [CrossRef] [Google Scholar]
  • Gil R., Silva F. J., Peretó J., Moya A. (2004). Determination of the core of a minimal bacterial gene set. Microbiol. Mol. Biol. Rev. 68, 518–537. 10.1128/MMBR.68.3.518-537.2004 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Guo H.-B., Ghafari M., Dang W., Qin H. (2021). Protein interaction potential landscapes for yeast replicative aging. Sci. Rep. 11, 7143–7154. 10.1038/s41598-021-86415-8 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Hassan S. S., Rout R. K., Sahoo K. S., Jhanjhi N., Umer S., Tabbakh T. A., et al. (2021). A vicenary analysis of sars-cov-2 genomes. Cmc-Computers Mater. Continua 69, 3477–3493. 10.32604/cmc.2021.017206 [CrossRef] [Google Scholar]
  • Hurst H. E. (1951). Long-term storage capacity of reservoirs. Trans. Am. Soc. Civ. Eng. 116, 770–799. 10.1061/taceat.0006518 [CrossRef] [Google Scholar]
  • Itaya M. (1995). An estimation of minimal genome size required for life. FEBS Lett. 362, 257–260. 10.1016/0014-5793(95)00233-y [PubMed] [CrossRef] [Google Scholar]
  • Juhas M., Eberl L., Glass J. I. (2011). Essence of life: Essential genes of minimal genomes. Trends cell Biol. 21, 562–568. 10.1016/j.tcb.2011.07.005 [PubMed] [CrossRef] [Google Scholar]
  • Juhas M., Reuß D. R., Zhu B., Commichau F. M. (2014). Bacillus subtilis and escherichia coli essential genes and minimal cell factories after one decade of genome engineering. Microbiology 160, 2341–2351. 10.1099/mic.0.079376-0 [PubMed] [CrossRef] [Google Scholar]
  • Juhas M., Stark M., von Mering C., Lumjiaktase P., Crook D. W., Valvano M. A., et al. (2012). High confidence prediction of essential genes in burkholderia cenocepacia. PloS one 7, e40064. 10.1371/journal.pone.0040064 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Khandelwal M., Kumar Rout R., Umer S., Mallik S., Li A. (2022a). Multifactorial feature extraction and site prognosis model for protein methylation data. Briefings Funct. Genomics 22, 20–30. 10.1093/bfgp/elac034 [PubMed] [CrossRef] [Google Scholar]
  • Khandelwal M., Rout R. K., Umer S. (2022b). Protein-protein interaction prediction from primary sequences using supervised machine learning algorithm. In 2022 12th International Conference on Cloud Computing, Data Science and Engineering (Confluence) (IEEE), 268–272. [Google Scholar]
  • Khandelwal M., Sheikh S., Rout R. K., Umer S., Mallik S., Zhao Z. (2022c). Unsupervised learning for feature representation using spatial distribution of amino acids in aldehyde dehydrogenase (aldh2) protein sequences. Mathematics 10, 2228. 10.3390/math10132228 [CrossRef] [Google Scholar]
  • Koonin E. V. (2000). How many genes can make a cell: The minimal-gene-set concept. Annu. Rev. genomics Hum. Genet. 1, 99–116. 10.1146/annurev.genom.1.1.99 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Kuang S., Wei Y., Wang L. (2021). Expression-based prediction of human essential genes and candidate lncrnas in cancer cells. Bioinformatics 37, 396–403. 10.1093/bioinformatics/btaa717 [PubMed] [CrossRef] [Google Scholar]
  • Le N. Q. K., Do D. T., Hung T. N. K., Lam L. H. T., Huynh T.-T., Nguyen N. T. K. (2020). A computational framework based on ensemble deep neural networks for essential genes identification. Int. J. Mol. Sci. 21, 9070. 10.3390/ijms21239070 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Liu X., Wang B.-J., Xu L., Tang H.-L., Xu G.-Q. (2017). Selection of key sequence-based features for prediction of essential genes in 31 diverse bacterial species. PLoS One 12, e0174638. 10.1371/journal.pone.0174638 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Marques de Castro G., Hastenreiter Z., Silva Monteiro T. A., Martins da Silva T. T., Pereira Lobo F. (2022). Cross-species prediction of essential genes in insects. Bioinformatics 38, 1504–1513. 10.1093/bioinformatics/btac009 [PubMed] [CrossRef] [Google Scholar]
  • McCutcheon J. P., Moran N. A. (2010). Functional convergence in reduced genomes of bacterial symbionts spanning 200 my of evolution. Genome Biol. Evol. 2, 708–718. 10.1093/gbe/evq055 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Mobegi F. M., Zomer A., De Jonge M. I., Van Hijum S. A. (2017). Advances and perspectives in computational prediction of microbial gene essentiality. Briefings Funct. genomics 16, 70–79. 10.1093/bfgp/elv063 [PubMed] [CrossRef] [Google Scholar]
  • Peterson L. E. (2009). K-nearest neighbor. Scholarpedia 4, 1883. 10.4249/scholarpedia.1883 [CrossRef] [Google Scholar]
  • Qin H. (2019). Estimating network changes from lifespan measurements using a parsimonious gene network model of cellular aging. Bmc Bioinforma. 20, 599–608. 10.1186/s12859-019-3177-7 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Quinlan J. R. (1986). Induction of decision trees. Mach. Learn. 1, 81–106. 10.1007/bf00116251 [CrossRef] [Google Scholar]
  • Rout R. K., Pal Choudhury P., Maity S. P., Daya Sagar B., Hassan S. S. (2018). Fractal and mathematical morphology in intricate comparison between tertiary protein structures. Comput. Methods Biomechanics Biomed. Eng. Imaging and Vis. 6, 192–203. 10.1080/21681163.2016.1214850 [CrossRef] [Google Scholar]
  • Roemer T., Jiang B., Davison J., Ketela T., Veillette K., Breton A., et al. (2003). Large-scale essential gene identification in candida albicans and applications to antifungal drug discovery. Mol. Microbiol. 50, 167–181. 10.1046/j.1365-2958.2003.03697.x [PubMed] [CrossRef] [Google Scholar]
  • Rout R. K., Ghosh S., Choudhury P. P. (2014). Classification of mer proteins in a quantitative manner. Int. Comput. Appl. Eng. Sci. 4, 31–34. [Google Scholar]
  • Rout R. K., Hassan S. S., Sheikh S., Umer S., Sahoo K. S., Gandomi A. H. (2022). Feature-extraction and analysis based on spatial distribution of amino acids for sars-cov-2 protein sequences. Comput. Biol. Med. 141, 105024. 10.1016/j.compbiomed.2021.105024 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Rout R. K., Hassan S. S., Sindhwani S., Pandey H. M., Umer S. (2020). Intelligent classification and analysis of essential genes using quantitative methods. ACM Trans. Multimedia Comput. Commun. Appl. (TOMM) 16, 1–21. 10.1145/3343856 [CrossRef] [Google Scholar]
  • Senthamizhan V., Ravindran B., Raman K. (2021). Netgenes: A database of essential genes predicted using features from interaction networks. Front. Genet. 12, 722198. 10.3389/fgene.2021.722198 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Seringhaus M., Paccanaro A., Borneman A., Snyder M., Gerstein M. (2006). Predicting essential genes in fungal genomes. Genome Res. 16, 1126–1135. 10.1101/gr.5144106 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Suthaharan S. (2016). “Support vector machine,” in Machine learning models and algorithms for big data classification (Springer; ), 207–235. [Google Scholar]
  • Umer S., Dhara B. C., Chanda B. (2016). Texture code matrix-based multi-instance iris recognition. Pattern Analysis Appl. 19, 283–295. 10.1007/s10044-015-0482-2 [CrossRef] [Google Scholar]
  • Umer S., Mohanta P. P., Rout R. K., Pandey H. M. (2021). Machine learning method for cosmetic product recognition: A visual searching approach. Multimedia Tools Appl. 80, 34997–35023. 10.1007/s11042-020-09079-y [CrossRef] [Google Scholar]
  • Upadhayay P. D., Agarwal R. C., Rout R. K., Agrawal A. P. (2019). Mathematical characterization of membrane protein sequences of homo-sapiens. 2019 9th International Conference on Cloud Computing, Data Science and Engineering (Confluence). IEEE, 382–386. [Google Scholar]
  • Veeranagouda Y., Husain F., Tenorio E. L., Wexler H. M. (2014). Identification of genes required for the survival of b. fragilis using massive parallel sequencing of a saturated transposon mutant library. BMC genomics 15, 429–439. 10.1186/1471-2164-15-429 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Xu L., Guo Z., Liu X. (2020). Prediction of essential genes in prokaryote based on artificial neural network. Genes and genomics 42, 97–106. 10.1007/s13258-019-00884-w [PubMed] [CrossRef] [Google Scholar]
  • Yuan Y., Xu Y., Xu J., Ball R. L., Liang H. (2012). Predicting the lethal phenotype of the knockout mouse by integrating comprehensive genomic data. Bioinformatics 28, 1246–1252. 10.1093/bioinformatics/bts120 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Zhang X., Xiao W., Xiao W. (2020). Deephe: Accurately predicting human essential genes based on deep learning. PLoS Comput. Biol. 16, e1008229. 10.1371/journal.pcbi.1008229 [PMC free article] [PubMed] [CrossRef] [Google Scholar]
  • Zurek W. H. (1989). Algorithmic randomness and physical entropy. Phys. Rev. A 40, 4731–4751. 10.1103/physreva.40.4731 [PubMed] [CrossRef] [Google Scholar]
