Statistical classification: Difference between revisions

{{Short description|Categorization of data using statistics}}
In [[statistics]], '''classification''' is the problem of identifying which of a set of [[categorical data|categories]] (sub-populations) an [[observation]] (or observations) belongs to. Examples are assigning a given email to the [[Spam filtering|"spam" or "non-spam"]] class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.).
When [[classification]] is performed by a computer, statistical methods are normally used to develop the algorithm.
 
Often, the individual observations are analyzed into a set of quantifiable properties, known variously as [[explanatory variables]] or ''features''. These properties may variously be [[categorical data|categorical]] (e.g. "A", "B", "AB" or "O", for [[blood type]]), [[ordinal data|ordinal]] (e.g. "large", "medium" or "small"), [[integer|integer-valued]] (e.g. the number of occurrences of a particular word in an [[email]]) or [[real number|real-valued]] (e.g. a measurement of [[blood pressure]]). Other classifiers work by comparing observations to previous observations by means of a [[similarity function|similarity]] or [[metric (mathematics)|distance]] function.
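Representing an observation as a vector of quantifiable features and comparing observations with a distance function can be sketched as follows (a minimal illustration with hypothetical data; the feature values and the choice of Euclidean distance are assumptions, not part of any specific classifier):

```python
# Illustrative sketch: observations as real-valued feature vectors,
# compared with a simple Euclidean distance function (hypothetical data).

def euclidean_distance(x, y):
    """Distance between two real-valued feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

# Each observation: (systolic blood pressure, count of a word in an email)
obs_a = [120.0, 0.0]
obs_b = [135.0, 4.0]

print(euclidean_distance(obs_a, obs_b))
```

Categorical or ordinal features would first need to be encoded numerically (or handled by a similarity function defined directly on the categories) before such a distance can be computed.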
==Frequentist procedures==
 
Early work on statistical classification was undertaken by [[Ronald Fisher|Fisher]],<ref>{{Cite journal |doi = 10.1111/j.1469-1809.1936.tb02137.x|title = The Use of Multiple Measurements in Taxonomic Problems|year = 1936|last1 = Fisher|first1 = R. A.|journal = [[Annals of Eugenics]]|volume = 7|issue = 2|pages = 179–188|hdl = 2440/15227|hdl-access = free}}</ref><ref>{{Cite journal |doi = 10.1111/j.1469-1809.1938.tb02189.x|title = The Statistical Utilization of Multiple Measurements|year = 1938|last1 = Fisher|first1 = R. A.|journal = [[Annals of Eugenics]]|volume = 8|issue = 4|pages = 376–386|hdl = 2440/15232|hdl-access = free}}</ref> in the context of two-group problems, leading to [[Fisher's linear discriminant]] function as the rule for assigning a group to a new observation.<ref name=G1977>Gnanadesikan, R. (1977) ''Methods for Statistical Data Analysis of Multivariate Observations'', Wiley. {{ISBN|0-471-30845-5}} (p. 83&ndash;86)</ref> This early work assumed that data-values within each of the two groups had a [[multivariate normal distribution]]. The extension of this same context to more than two groups has also been considered, with a restriction imposed that the classification rule should be [[linear]].<ref name=G1977/><ref>[[C. R. Rao|Rao, C.R.]] (1952) ''Advanced Statistical Methods in Multivariate Analysis'', Wiley. (Section 9c)</ref> Later work for the multivariate normal distribution allowed the classifier to be [[nonlinear]]:<ref>[[T. W. Anderson|Anderson, T.W.]] (1958) ''An Introduction to Multivariate Statistical Analysis'', Wiley.</ref> several classification rules can be derived based on different adjustments of the [[Mahalanobis distance]], with a new observation being assigned to the group whose centre has the lowest adjusted distance from the observation.
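The assignment rule described above can be sketched in a few lines of Python: an observation is assigned to the group whose centre has the smallest Mahalanobis distance. The group means, the shared covariance matrix, and the test point are hypothetical values chosen for illustration:

```python
import numpy as np

def mahalanobis(x, mean, cov_inv):
    """Mahalanobis distance from x to a group centre, given the
    inverse of the (shared) covariance matrix."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Two hypothetical groups with a common covariance matrix, the classical
# multivariate-normal setting discussed in the text.
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
cov_inv = np.linalg.inv(cov)
means = {"group 1": np.array([0.0, 0.0]),
         "group 2": np.array([3.0, 3.0])}

x = np.array([2.5, 2.0])
# Assign x to the group whose centre has the lowest distance from it.
label = min(means, key=lambda g: mahalanobis(x, means[g], cov_inv))
print(label)  # "group 2"
```

With a covariance matrix shared across groups this rule reduces to a linear discriminant; allowing each group its own covariance matrix (and adjusting the distance accordingly) yields the nonlinear rules mentioned above.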
 
==Bayesian procedures==
** {{annotated link|Least squares support vector machine}}
 
Choices between different possible algorithms are frequently made on the basis of quantitative [[Classification#Evaluation of accuracy|evaluation of accuracy]].
== Evaluation ==
Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems (a phenomenon that may be explained by the [[No free lunch in search and optimization|no-free-lunch theorem]]). Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. Determining a suitable classifier for a given problem is, however, still more an art than a science.
 
[[Precision and recall]] are popular metrics used to evaluate the quality of a classification system. More recently, [[receiver operating characteristic]] (ROC) curves have been used to evaluate the tradeoff between true- and false-positive rates of classification algorithms.
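For a binary classifier, precision and recall follow directly from the counts of true positives, false positives, and false negatives. A minimal sketch with hypothetical counts for a spam filter:

```python
# Hypothetical confusion counts for a binary (spam / non-spam) classifier.
tp, fp, fn = 80, 10, 20   # true positives, false positives, false negatives

precision = tp / (tp + fp)   # fraction of predicted positives that are correct
recall = tp / (tp + fn)      # fraction of actual positives that are found

print(precision, recall)
```

Here precision is 80/90 ≈ 0.89 and recall is 80/100 = 0.80; varying the classifier's decision threshold trades one off against the other, which is exactly what an ROC curve visualizes across all thresholds.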
 
As a performance metric, the [[uncertainty coefficient]] has the advantage over simple [[accuracy]] in that it is not affected by the relative sizes of the different classes.
<ref name="Mills2010">
{{Cite journal
| author = Peter Mills
| title = Efficient statistical classification of satellite measurements
| journal = [[International Journal of Remote Sensing]]
| volume = 32
| issue = 21
| pages = 6109–6132
| doi= 10.1080/01431161.2010.507795
| year = 2011
| arxiv = 1202.2194
| bibcode = 2011IJRS...32.6109M
| s2cid = 88518570
}}</ref>
Further, it will not penalize an algorithm for simply ''rearranging'' the classes.
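The invariance to class rearrangement noted above can be demonstrated directly: the uncertainty coefficient is mutual information between true and predicted labels, normalized by the entropy of the true labels, so permuting the predicted labels leaves it unchanged. A minimal sketch with a hypothetical 2×2 confusion matrix:

```python
import math

def uncertainty_coefficient(confusion):
    """U(true | predicted) = I(true; predicted) / H(true), computed from a
    confusion matrix of counts (rows: true class, columns: predicted class).
    Illustrative sketch only."""
    total = sum(sum(row) for row in confusion)
    p = [[c / total for c in row] for row in confusion]
    p_true = [sum(row) for row in p]
    p_pred = [sum(col) for col in zip(*p)]
    h_true = -sum(q * math.log(q) for q in p_true if q > 0)
    mi = sum(pij * math.log(pij / (p_true[i] * p_pred[j]))
             for i, row in enumerate(p)
             for j, pij in enumerate(row) if pij > 0)
    return mi / h_true

m = [[45, 5], [10, 40]]
# Swapping the predicted labels (columns) leaves the score unchanged,
# unlike plain accuracy, which would collapse for the swapped matrix.
swapped = [[5, 45], [40, 10]]
print(uncertainty_coefficient(m), uncertainty_coefficient(swapped))
```

Both calls print the same value, illustrating that an algorithm is not penalized for merely relabeling the classes.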
 
==Application domains==
* {{annotated link|Statistical natural language processing}}
 
{{More footnotes needed|date=January 2010}}
 
==See also==
{{Commons category}}
{{Portal|Mathematics}}
{{colbegin}}
 
==References==
{{Reflist}}
 
{{Statistics|analysis||state=expanded}}
{{Authority control}}
 
{{DEFAULTSORT:Statistical Classification}}