Data modeling, restructuring, analysis, fuzzy cases, learning
This is more for overview of my own than for teaching or exercise.
Other data analysis, data summarization, learning

This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Contents
 1 Overall concepts, Glossary
 2 More concepts, some shared tools
 2.1 On the theory / math side
 2.2 Feature selection
 2.3 Graph based statistical modeling
 2.4 Dimensionality reduction
 2.4.1 As a wide concept
 2.4.2 Ordination, Factor Analysis, Multivariate analysis
 2.4.3 Factor Analysis, Principal Component Analysis (PCA), and variants
 2.4.4 Correspondence Analysis (CA)
 2.4.5 Multidimensionsional scaling (MDS)
 2.4.6 Singular value decomposition (SVD)
 2.4.7 Nonlinear dimensionality reduction / Manifold learning
 2.4.8 Expectation Maximisation (EM)
 3 Decisions, fuzzy coding
 4 Classification
 5 Clustering
 5.1 Intro
 5.2 Implementation notes
 5.2.1 kmeans
 5.2.2 Canopy clustering
 5.2.3 Bisecting kmeans
 5.2.4 hard cmeans
 5.2.5 UPGMA, WPGMA
 5.2.6 UPGMC, WPGMC
 5.2.7 Fuzzy cmeans (FCM)
 5.2.8 Fuzzy kmeans
 5.2.9 Shell clustering
 5.2.10 PAM
 5.2.11 CLARA
 5.2.12 AGNES
 5.2.13 DIANA
 5.2.14 Buckshot
 5.2.15 Birch
 5.2.16 Cure
 5.2.17 Rock
 5.2.18 Chameleon
 5.2.19 DBSCAN
 5.2.20 OPTICS
 5.2.21 FLAME
 5.2.22 Clustering by Committee (CBC)
 5.2.23 Principal Direction Divisive Partitioning (PDDP)
 5.2.24 Information Bottleneck
 5.2.25 Agglomerative Information Bottleneck
 5.2.26 Expectation Maximization Clustering
 5.3 Related fields
 5.4 See also
 6 State observers / state estimation (and filters in this sense)
 7 Optimization theory, control theory
 8 Semisorted
 9 Unsorted
Overall concepts, Glossary
Types of problems
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
The types of tasks/problems are often broadly sorted into some distinct types. ...which says little about how they are solved.
 Regression
 predicts a continuous variable
 Clustering
 yields groups of (mutual) similarity, and dissimilarity from other groups
 clustering may not deal well with future data of the same sort, unless some care has been taken, so isn't necessarily a learner/predicter
 Vector quantization
 Discretely dividing continuous space into various areas/shapes
 which itself can be used for decision problems, labeling, clustering, etc.
 Dimensionality reduction
 what: projecting attributes into lowerdimensional data
 with the idea that the resulting data is (hopefully) comparably predictive/correlative (compared to the original)
 The reason is often to eliminate attributes/data that may be irrelevant or too informationally sparse
 see also #Ordination.2C_Dimensionality_reduction.2C_Factor_Analysis.2C_Multivariate_analysis
 Feature extraction: discovering (a few nice) attributes from (many) old ones, or just from data in general.
 Often has a good amount of overlap with dimensionality reduction
 Control optimization problem  finding the best action of control in every (known) state of a system
Slightly more specifically:
 Reinforcement learning has a few different takes.
 You can see it as any task where actor learns how to do things in an environment
 You can also see it as implementations that do so in more specific reward systems, which is an idea that comes back in a handful of different learning algorithms
 Structured prediction refer to prediction problems that predict things more complex than single values
 specifically those that want multiple labels (amount potentially varying with the input), and their prediction relies on intermediate workings of the others
 which actually addresses implementation a little more than most others, in that running independent classifiers will give an answer but not be a structured prediction
 collaborative filtering is a relatively new one, from the internet era
 consider e.g. people categorizing music, movies, products
 the contribution from each user is very sparse
 others...
Related fields
Descriptors of learning systems
Modelbased versus modelfree systems
When you create a model of what to learn and where from, this can save a lot of computation, but will often not be adaptive to things that don't fit that model well.
Modelfree solutions avoid assumptions and their bias and can learn more, but may be infeasible because of not enough information to stably learn from, or requiring too much time and computation to optimize parameters well enough.
Supervised , unsupervised, and more
Classically we split things into supervised and unsupervised, and perhaps call thoes paradigms.
Supervised usually refers to the training processes that work (only) using alreadyannotated training data.
 For example many document classification methods, leastsquares fitting, basic neural network backpropagation
 This often means manually generating training data, which is often a slow process, and not always
Unsupervised refers to processes that work without intervention, often meaning they work without annotation.
 For example, clustering documents based on similarity, selforganizing maps
We've thought up other terms since then to refer to variations.
Semi/lightly supervised usually means there is an iterative process, which needs only minimal human intervention, e.g. to require manual intervention only for unclear cases.
Selfsupervised learning often refers to cases where you can automatically refine a solution, e.g. by automatically creating labels for further training.
This tends to work better for relatively narrowly defined tasks, say, learning to segment humans, within a piece of video at a time.
Note that selfsupervised can variably be seen as
 applying automatic supervision to an unsupervised question
 a flavour of lightly supervised, as it can be used to label a bunch of data more quickly
Reinforcementt learning could be another entry in that list, in that it's a fairly differerent category of expressing problems and solutions.
Problemwise, RL is those where an actor learns how to do things in an environment.
There generally is no labeled data, but there is a way of saying what is preferable and what fails to work. Not in the form of "this was the correct immediate response (input/output pair)", and more that the eventual outcome was good or bad.
Classically, RL it was a bit narrower, often expressed in things like like a markov decision process, in part because that was a basic and clear way to express what to reward and how (often solved via dynamic programming), and it might give a a more complete description than some other methods, which is something you'ld want around control optimization problems.
It is also used in a wider sense of any sort of rewardbased learning, i.e. "have an idea of outcome, punish the bad, reward the good"
Reward itself can be
 looser way (e.g. genetic algorithm just seeing how far things get)
 more direct (e.g. control optimization calculating reward per unit time within a horizon)
This is related to more finer distinctions with in RL, like
 episodic RL  episodic tasks have recognizable chunks of events to learn from, say each game of chess
 continuous RL  continuous tasks never, meaning you reward on the shorter term
Being so broad, it often lies somewhere between supervised and unsupervised,
roughly because you're telling it what you want, but not how to get there.
Implementationwise, that comes back in various methods, in different forms  see e.g.
 e.g. Markov Decision Process
 Qlearning
 evolutionary/genetic algorithms
 arguably a bunch of NN
...but note that there are quantifiable, sometimes huge differences between many of those, e.g. NNs (and problems for which we prefer NN) have a huge state space compared to MDPs.
More on RL
See also:
 LP Kaelbing, ML Littman, AW Moore. (1996) "Reinforcement learning: A survey."
 RS Sutton, AG Barto (1998) Reinforcement Learning: An Introduction
Inductive versus deductive systems
Transfer learning
Transfer learning is the idea that you could use knowledge from other systems.
State observers
State observers / state estimation will estimate the internal state of a real system, measuring what they easily can, and estimating what they need to. Optimization_theory,_control_theory#State_observers_.2F_state_estimation_.28and_filters_in_this_sense.29
This typically a singular state in control theory without much state, but can also be useful to present data to a learning system.
Related are the (introduced by Kalman)
 observability criterion
 the idea that the states of a system can be reasonably inferred from knowledge of its outputs.
 (which seems to include that all states the system can be in are known)
 in practical terms, that a good set of sensors lets you build a model that covers all its important variables.(verify)
 controllability criterion
 the idea that you can actually guide a system through all states.(verify)
Both are related to building something that is likely to be stable in a knowable way.
Again, concepts more related to control theory, but more widely applicable.
Properties and behaviours of learners
Underfitting and overfitting (learners)
Underfitting is when a model is too simple to be good at describing all the patterns in the data.
Underfitted models and learners may still generalize very well, and that can be intentional, e.g. to describe just the most major patterns.
It may be hard to quantify how crude is too crude, though.
Overfitting often means the model is allowed to be so complex that a part of it describes all the patterns there are, meaning the rest ends up describing just noise, or insignificant variance or random errors in the training set.
A little overfitting is not disruptive, but a lot of it often is, distorting or drowning out the parts that are actually modeling the major relationships.
Put another way, overfitting it is the (mistaken) assumption that convergence in the training data means convergence in all data.
There are a few useful tests to evaluate overfitting and underfitting.
Interpretability
On the theory / math side
Stochastic processes, deterministic processes, random fields
A deterministic process deals with possible determined cases, with no unknowns or random variables.
A stochastic process (a.k.a. random process) allows indeterminacy, typically by working with probability distributions.
A lot of data is stochastically modeled, often because you only need partial data (and often only have partial data to start with).
Hybrid models in this context means mixing deterministic and stochastic processes
A random field is random function over an arbitrary domain.
When that domain was time, we would probably call it a random process, but since this is basically a generalization that says it's not necessarily necessarily time, or onedimensional, or realvalued, we think of it as space instead, and field becomes the more apt description. (verify)
See also:
 http://en.wikipedia.org/wiki/Stochastic_process
 http://en.wikipedia.org/wiki/Deterministic_system
 http://en.wikipedia.org/wiki/Random_field
 http://en.wikipedia.org/wiki/Conditional_random_field
Markov property
the Markov property is essentially that there is no memory, only direct response: that response of a process is determined entirely by its current state, and current input (if you don't already define that as part of the state).
More formally, "The environment's response (s,r) at time t+1 depends only on the Markov state s and action a at time t" [1]
There are many general concepts that you can make stateless, and thereby Markovian:
 A Markov model is a stochastic model with the Markov property
 A Markov process is a stochastic process with the Markov property [2]
 A Markov chain refers to a Markov process with finite, countable states [3]
 A Markov random field [4]
 A Markov logic network [5]
 A Markov Decision Process (MDP) is a decision process that satisfies the Markov property
 ..etc.
Markov Decision Process
Belief network
LogLinear and Maximum Entropy
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Bayesian learning
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Bayesian learning is a general probablistic approach, often specifically used as a probablistic classifier.
Mathematically it is based on any observable attribute you can think of, and the math requires Bayesian inversion (see below).
Many basic implementations also use the Naive Bayes assumption (see below), because it saves a lot of computation time, and seems to work almost as well in most cases.
Bayesian classifier
Bayes Optimal Classifier
Naive Bayes Classifier
Bayesian (Belief) Network
Curse of dimensionality
The curse of dimensionality is, roughly, the idea that each dimension you add to your model makes life a little harder.
Intuitively, this is largely because a lot of these dimensions may be noninformative, but will still contribute to everything you do.
A common example is distance calculations. As you add dimensions, the contributing value of any one decays, meaning the bulk of lessexpressive are going to overpower the few good ones.
It's very likely to drown any good signal in a lot more noise (unless your method explicitly learns this instead of using a standard metric).
And things that build on those distances, like clustering, are going to have a harder time.
Kernel method
https://en.wikipedia.org/wiki/Kernel_method
Statistical modeling
Regression analysis
Linear regression
Logistic regression
Feature selection
https://en.wikipedia.org/wiki/Feature_selection
Graph based statistical modeling
TEMPORARILY ELSEWHERE
Dimensionality reduction
As a wide concept
Dimensionality reduction can be seen in a very wide sense, of creating a simpler variant of the data that focuses on the more interesting and removes the less interesting.
Note that in such a wide sense a lot of learning is useful as dimensionality reduction, just because the output is a useful and smaller thing, be it clustering, neural networks, whatever.
But in most cases it's also specific output, minimal output for that purpose.
Dimensionality reduction in practice is much more mellow version of that, reducing data to a more manageable form. It historically often referred to things like factor analysis and multivariate analysis, i.e. separating out what seem to be structural patterns, but not doing much with them yet, often acknowledging that we usually still have entangled surface effects in the observations given, and our direct output is probably still a poor view of the actual dependent variables that created the observations we look at.
(Whether that's an issue or not depends largely on what it's being used for)
Reducing the amount of dimensions in highly dimensional data is often done to alleviate the curse of dimensionality.[6]
Said curse comes mainly from the fact that for every dimension you add, the implied volume increases quicker and quicker.
So anything that wants to be exhausitve is now in trouble. Statistics has an exponentially harder job of proving significance (at least without exponential amounts more data), Machine learning needs as much more data to train well, optimization needs to consider all combinations of dimensions as variables, etc.
Distance metrics are funnier, in that the influence of any one dimension becomes smaller, so differences in metrics smaller  and more errorprone in things like clustering.
(verify)
It's not all bad, but certainly a thing to consider.
Ordination, Factor Analysis, Multivariate analysis
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
(See also Statistics#Multivariate_statistics)
Ordination widely means (re)ordering objects so that similar objects are near each other, and dissimilar objects are further away.
Ordination methods are often a step in something else, e.g. be good data massage before clustering.
It can become more relevant to higherdimensional data (lots of variables), in that ordination is something you watch (so sort of an implicit side effect) in the process of dimensionality reduction methods.
Of course, different methods have their own goals and focus, different requirements, and their own conventions along the way.
Typical methods include PCA, SVD, and
Factor Analysis, Principal Component Analysis (PCA), and variants
Correspondence Analysis (CA)
Conceptually similar to PCA, but uses a Chisquare distance, to be more appicable to nominal data (where PCA applies to continuous data).
See also:
Multidimensionsional scaling (MDS)
Refers to a group of similar analyses, with varying properties and methods, but which all focus on ordination
Commonly mentioned types of MDS include:
 Classical multidimensional scaling, a.k.a. Torgerson Scaling, Torgerson–Gower scaling.
 Metric multidimensional scaling
 Nonmetric multidimensional scaling
It is regularly used to provide a proximity visualization, so the target dimensions may be two or three simply because this is easier to plot.
Depending on how you count, there are somewhere between three and a dozen different MDS algorithms.
Some MDS methods closely resemble things like PCA, SVD, and others in how they change the data. Some more generic MDS variants are more on the descriptive side, so can be solved with PCA, SVD, and such.
A ordination, most try to not change the relative distances, but do change the coordinate system in the process.
Input
Input to many method is a similarity matrix  a square symmetric matrix often based on a similarity metric. Note some similar methods may be based on dissimilarity instead.
At a very pragmatic level, you may get
 a readymade similarity matrix
 items plus a method to compare them
 a table of items versus features (such as coordinates, preferences), with a method to compare them
 perceived similarities between each item (e.g in market research)
There is little difference, except in some assumption like whether the feature values are Euclidean, independent, orthogonal, and whatnot.
Result evaluation
MDS solutions tend to be fairly optimal, in that for the amount of target dimensions you will get the solution's distances correlating to the original data's distance's as well as they can.
There are still a number of things that help or hinder accuracy, primarily:
 the choice of input values
 the choice of target dimensions (since too few lead to ill placement choices)
 (to some degree) the type of MDS
 methods that have clusterlike implementations may be sensitive to initial state
You probably want to see how good a solution is.
The simplest method is probably calculating the correlation coefficient between input (dis)similarity and resulting data (dis)similarity, to show how much the MDS result fits the variation in the original. By rule of thumb, values below 0.6 mean a bad solution, and values above 0.8 or 0.9 are pretty good solutions (depending on the accuracy you want, but also varying with the of MDS).
Other methods include Kruskal's Stress, split data tests data stability tests (i.e., eliminating one item, see if result is similar) testretest reliability [7]
Algorithms
Note that in general, correlation can be complete (if k point in k1 dimensions, or distances between any two items are equal whichever way they are combined, e.g. if distances are those in euclidean space), but usually is not. The output's choice of axes is generally not particularly meaningful.
Probably classically the most common example is is principal coordinates analysis (PCO, PCoA), also known as Classical multidimensional scaling, Torgerson Scaling and TorgersonGower scaling, which is a single calculation step so does not require iteration or convergence testing.
See also (Torgerson, 1952), (Torgerson, 1958), (Gower, 1966), CMDSCALE in R
(Note: PCO (Principle Coordinate analysis) should not be confused with PCA (Principle Component Analysis) as it is not the same method, although apparently equivalent when the PCA kernel function is isotropic, e.g. is working on Euclidean coordinates/distance)(verify)
The first dimension in the result should capture the most variability (first principal coordinate), the second the second most (second principal coordinate), etc.
The eigenvectors of the input distance matrix yield the principal coordinates, and the eigenvalues give proportion of variance accounted for. As such, eigenvalue decomposition (or the more general singular value decomposition) can be used for this MDS method.
(The degree to which distances are violated can be estimated by how many small or negative eigenvalues. If there are none (...up to a given amount of dimensions...) then the analysis is probably reasonable), and you can use the eigenvalues to calculate how much of the total variance is accounted for(verify)  and you have the option of choosing afterwards how many dimensions you want to use (which you can't do in nonmetric).
Metric (multidimensional) scaling: a class of MDS that assumes dissimilarities are distances (and thereby also that they are symmetric).
May be used to indicate PCO, but is often meant to indicate a class based on something of a generalization, in that the stress function is more adjustable.
The optimization method used is often Stress Majorization (see also SMACOF [8] (Scaling by Majorizing A COmplicated Function)).
On iterative MDS methods:
minimize stress
no unique solution (so starting position may matter)
 stressbased MDS methods
 may be little more than nonparametric version of PCO(verify)
Nonmetric multidimensional scaling can be a broad name, but generally find a (nonparametric) monotonic relationship between [the dissimilarities in the itemitem matrix and the Euclidean distance between items] and [the location of each item in the lowdimensional space].
The optimization method is usually something like isotonic regression (which is due to monotonicity constraints). Methods regularly have both metric and nonmetric parts, and nonmetric scaling in the broad sense can describe quite varying methods (see e.g. Sammon's NLM).
Note that the the monotonic detail means that ranking of items ends up as more important than the (dis)similarities. This may be a more appropriate way of dealing with certain data, such as psychometric data, e.g. ratings of different items on an arbitrary scale.
Nonmetric MDS may give somewhat lowerstress solutions than metric MDS in the same amount of dimensions.(verify)
Also, certain implementations may deal better with nonnormality or varying error distributions (often by not making those assumptions).
Examples:
 Kruskal's nonmetric MDS(verify)
 ShepardKruskal Scaling(verify) (and (verify) whether that isn't the same thing as the last)
 Sammon nonlinear mapping [9]
Variations of algorithms can be described as:
 Replicated MDS: evaluates multiple matrices simultaneously
 Threeway scaling
 Multidimensional Unfolding
 Restricted MDS
Multidimensional scaling (MDS) [10]
 classical MDS (quite similar to PCA under some conditions, apparently when you use euclidean distance?)
 (Difference matrix > n dimensional coordintes (vectors))
 Kruskal's nonmetric MDS (R: isoMDS)
 Sammon's nonlinear mapping (R: sammon)
See also
 WS Torgerson (1958) Theory and Methods of Scaling
 JB Kruskal, and M Wish (1978) Multidimensional Scaling
 I Borg and P Groenen (2005) Modern Multidimensional Scaling: theory and applications
 TF Cox and MAA Cox (1994) Multidimensional Scaling
Generalizized MDS (GMDS)
A generalization of metric MDS where the target domain is nonEuclidean.
See also:
Sammon’s (nonlinear) mapping
Singular value decomposition (SVD)
See also:
 http://www.kwon3d.com/theory/jkinem/svd.html
 http://en.wikipedia.org/wiki/Singular_value_decomposition
 http://www.netlib.org/lapack/lug/node53.html
 http://public.lanl.gov/mewall/kluwer2002.html
Nonlinear dimensionality reduction / Manifold learning
Expectation Maximisation (EM)
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
A broadly applicable idea/method that iteratively uses the Maximum Likelihood (ML) idea and its estimation (MLE).
Decisions, fuzzy coding
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
On bias
Methods / algorithms / searchers
Decision trees
ID3
Pruning (ID3, others)
Rule Postpruning; C4.5
Instancebased learning
Semisorted
Markov Models, Hidden Markov Models
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Something like (the simplest possible) Bayesian Belief Networks, but geared to streams of data. Can be seen as a state machine noting the likeliness of each next step based on a number of preceding steps.
The hidden variant only shows its output (and hides the model that produces it), the nonhidden one shows all of its state.
Simple ones are firstorder
Classification
Naive Bayes
Some background
A certain Reverend Thomas Bayes said a number of things about probability, used mostly in Bayesian inference in statistics, and in Naive Bayes classification.
Classification, in general, is usually based on a trained model (or sometimes hardcoded assumptions):
 deciding on a set of distinct target classes (say, exclusives as 'spam' vs. 'not spam', or rough article subjects (preferably exclusive, or classification itself does not make that much sense, although the valueforclass assignments still may)
 deciding some features to look at, that classification will be based on
 learning the importance of features to each class (often using good examples for each class)
Given something unseen to classify, you extract its features the same way you did for the training set, and see which class compares best. Classification is regularly many fuzzy measures of fit followed by a hard choice of which seems best.
Naive Bayes refers to one of the simpler algorithms that does classification this way.
Its definition has a few notational conventions:
 Each training class is referred to as c in the set of classes C (c_{1}...c_{i})
 We eventually want to judge an unseen document D
 and what we want for D is is the class for which the probability the document fits that class is highest
You can express such Bayesian classification as:
the c for which P(cD) is highest
or, a bit more formally yet,
c = maxarg( P(c_{i}D) )
Bayes's rule
The P(cD) above is not obvious to calculate, because it immediately asks us for a best choice for a document, and a thorough answer of that will probably depend on multiple pieces of information: document's features, all classes, and all documents.
So we need to break the problem down before we can give an estimation of the probability P(cD),
and the first step is to apply Bayes's rule (which pops up in various other types of probability calculations and estimations).
P(Dc) * P(c) P(cD) =  P(D)
...for each class c in C.
Looking at the parts:
P(c) is the base probability of each class, that is, the probability that a document is in class c given no other information.
 This is sometimes assumed to be equal for all classes (in which case it falls away),
 sometimes it is estimated, for example using the assumption that the amount of training examples in each class is a direct indication of how commonly an item falls into that class.
P(D) is the chance of finding the document, in general.
 This is usually assumed to be independent of the classification problem.
 because of that, and because it is hard to realistically estimate for an unseen document, it is often assumed to be equal for all documents (so falls away).
...so P(Dc) is the interesting one. It asks for the probability that a particular document in a particular class.
 This is still abstract, and not yet a directly calculable answer, but it is a smaller problem than we had a few paragraphs earlier: Unlike the earlier form (P(cD)) which deals with an immediate choice between alternatives, we now deal mainly/only with an calculation/estimation of fit, of 'how well does a specific Document fit in every specific class?' .
 This degree of fit for a document in a class can be done with any method that judges similarity, as long as it is (fairly) stable/consistent between documents, and it can be calculated (roughly) equally well for seen and unseen documents.
Long story somewhat shorter, applying Bayes' rule reduces the classification problem from
 "Choose best class immediately"
to
 "find the class for which our estimation of fit is highest" (by making an estimator that can judge individual documents, and can do so stably / comparably well for all documents)
Put another way, from:
maxarg( P(c_{i}D) )
to:
maxarg( P(c_{i}) * P(Dc_{i}) )
or, assuming equal baseclass probabilities, to:
maxarg( P(Dc_{i}) )
Features, Words, and Naivety
So let's talk about an implementation.
If we want to calculate a class's fit based on features, we need to choose those features, and a method to calculate an arbitrary document's fit to those features.
It's nice if each feature is fairly robust in that it does not react in an unnecessarily noisy way, and productive for the classification task at hand.
In Naive Bayes introductions, the choice of features is often words (some introductions talk mainly of words, some talk more broadly about features), and the probabilities are based on the relative occurrence of each chosen word.
Words are a specific implementation choice, but are a nice introduction, in part because it speaks to the imagination in cases such as spam (even though it's limited for this, given that spammers know and try to defeat this model).
If you're programming such a toy example, then the amount of words is often reduced to maybe a thousand.
The actual selection is interesting, in that they ought to be distinguishing within the training set ("the" may be common but not useful), but not be so unusual that they are rare within unseen documents.
There are cleverer ways of choosing words, and you probably want to consider the Zipf distribution of words, but to keep the example simple, let's say we take the most common 1000 in the training set, minus some obvious stopwords.
Naive Bayes is often explained by saying we want a probability based on n features in feature set F: P(cf_{1}..f_{n}), which Bayesinverses and denominatoreliminates to:
P(c) * P(f_{1}..f_{n}c)
...which roughly means "how do features in this new document compare to the one for the class?"
Technically, P(f_{1}..f_{n}c) should expand to a very long formula, as each feature depends on others. That's where the naivety of Naive Bayes comes in: the Naive Bayes assumption is simply that all features are independent of all others. This is rarely true, but is much simpler to work with, is evaluated much faster, and works better than you might expect. (this also effectively makes it a bagofwords model)
It means calculation of the probability of the class based of the features simplifies to a simple multiplication of individual feature values / probabilities. Given there are n features:
P(c) * Π_{i=1 to n} P(f_{i}c)
Note that the wordsasfeatures setup is a #Bag_of_words.2Ffeatures_.28assumption.29 assumption: it plus the naivety means that word order is ignored. This means that potentially indicative phrases, collocations, and such are represented only weakly at best, by their words' independent probabilities being slightly higher. Word ngrams are one possible fix, but will likely run you into a sparseness problem since ngrams are implicitly rarer than single words.
Even just bigrams may be problematic  if you take all possible ones from a large set of documents, you'll find that most documents have a tiny fraction of the overall real use.
Bayesian classification as a procedure...
Naive Bayes training comes down to:
 Deciding which set of features (F) you will use.
 Train, which calculates characteristics for each class, namely:
 P(c) for each class (see above), and more importantly,
 all P(fc): the probability of each feature f in each class c
Later, when classifying an unseen document, you calculate P(c) * P(f_{1}..f_{n}c) for it:
 for each c in C:
 take P(c), times
 the calculated probability of each feature in class c, i.e. each P(fc) based on the document.
 Choose the one with the highest probability, and you've got a classifier.
Choosing your features, and further discussion
Naive Bayes with simple word features works fairly well on various types of text, since it essentially notices differences in vocabulary use, the theory being that that is indicative of the document class. Common words can be useful when they aren't uniformly common between all classes, uncommon ones since they tend to up the probability of just a few classes.
You could simply use all words in all documents and end up with tens of thousands of features, although it would help speed to prune that a little. Which ones to leave is a choice you can be halfway smart about  although note that too much tweaking just means it will work better on the training data, with no guarantees for unseen data.
Also note that when you preprocess the training data, you need to process the unseen documents the same way. When choosing the features, you should consider what this may do to unseen documents.
A feature probability of zero is evil to calculations as it randomly destroys comparability in the both training and classification steps. It can happen fairly easily, such as when a word in the vocabulary doesn't occur in a document. A common solution is to cheat and pretend you did actually see it a little. There are a few different ways; some disturb the relative probabilities less than others.
You can add other features. The main property they should have is that their probabilities compare well, are in the same sort of scale. You could for example add important words a few times, a hackish way of weighing them more. You could for example add select word bigrams or add character ngrams.
You can alter the model further, observing that what you actually do is perclass strengthening or weakening of a probability that it will win. It doesn't even have to be a simple probability, even; you could for example include sentence length, which is not a direct probability, so you would have to e.g. remember the average sentence length in each class, calculate the average for the unseen document and define some metric of similarity. This isn't Naive Bayes anymore, but as a probabilitybased classifier it may work a little better if you choose your alterations well.
Pseudocode
This example pseudocode uses only word features, and some basic addone(verify) smoothing to avoid the zero problem.
Training (word features)
U = feature universe D = document set (text,class pairs) C = set of classes in D foreach c in C: BaseProb(c) = D with class c / D Text = concatenation of documents in current class N = tokens in Text foreach token t in U: P(tc) = ( numoccurence(t in Text) + 1 ) / (numtokens(Text)+U )
Classifying (word features)
foreach c in C: p=P(c) foreach token in Document p = p*P(tokenc)
...then collect each of these probabilities and choose the class with the highest probability.
In reality, the probabilities quickly become very small and you run into the problem that standard 32bit or 64bit floating point numbers cannot accurately contain them.
A common workaround for this is to add the logs of the probabilities (yielding negative numbers) instead of multiplying the probabilities. These values are much less likely to scale out of control, and they are as just as monotonous as the probabilities.
Semisorted
 Parzen classifier
 Backpropagation classifier
Evaluating classifiers
Combining classifiers
Support Vector Machines
Supportvector regression
Using the kernel method
Clustering
Intro
Clustering groups a few related subproblems, including
 cluster formation  organizing into clusters
 cluster segmentation  dealing with boundaries (often using cluster centers)
 labeling  assigning meaningful names
 for the (relatively few) cases where this makes sense to estimate
 deciding how many groups to have in the result
 evaluation of a solution (possibly feeding back into the previous point)
Formally, the simplest clustering ca be described as:
 you have a set of n data objects, call it D = { d_{1}, ..., d_{n} }
 in its simplest shape, a clustering result is a disjoint partitioning of D
 which makes clustering itself a function, mapping each datapoint to a cluster number/label that indicate membership of said cluster
The input data is often either
 a set of a points in a manydimensional space, usually a vector space, plus a metric to calculate distance between them, OR
 a set of alreadychosen distances
 preferably complete, but depending on how it's made it might be sparse, and it may be easier to do some fuzzy statistical estimation than to ask people to complete it
The latter may well be in the form of a distance matrix / similarity matrix.
A bunch of methods given datapointsplusmetric convert to that internally, but starting dataplusmetric is often a little more flexible up front  both for data massaging, and sometimes for implementation reasons.
That said, the choice of metric takes care,
because there are many ways to accidentally put some bias into the metric.
Many methods look at elementtoelement similarities/dissimilarities, while a few choose to be more involved with the data that comes from (e.g. some Maximumlikelihoodbased methods common in bioinformatics)
Variations
Hard, soft, and fuzzy clustering
Hard clustering means each item should be assigned to a single group. This is essentially a partitioning.
Soft clustering means something can belong to more than one group.
Regularly used for data known to be too complex to be reduced cleanly with hard clustering, such as when there are closeby, overlapping, or ambiguous groups.
Soft clustering is generally understood as boolean soft clustering: something can belong to one or more clusters, but there are no degrees.
Fuzzy clustering is soft clustering plus degrees of membership.
This means intermediate results are effectively still moderately highdimensional data, you often still have to make a decision about exclusion, thresholds or such (preferably within the algorithm, to have all information available).
If you don't make such a decision, the result more resembles dimensionality reduction.
Agglomerative versus divisive
Agglomerative clustering usually starts with each item in its own cluster and merges them where it seems a good idea.
Divisive clustering usually starts with every item in a single cluster and iteratively splits them as it sees fit, stopping either when some goal is satisfied, or sometimes until everything is split.
Differences include:
 what side they err on in unclear cases.(verify)
 divisive by nature makes its decision on more data, so can be more accurate
 e.g. early combination decisions in agglomerative may be less sensible but cannot be undone without making the method more complex
 divisive is often more efficient to calculate
 divisive is a little more supervised, in that you may need to explecitly give it a way to split that makes sense of the data
In general, divisive clusters seems better at finding large clusters,
agglomerative clustering is good at finding small clusters.(verify)
Hierarchical clustering
Hierarchical clustering creates a tree of relations, often by an process where we keep tracks of how things join, rather than just assimilate things into a larger blob.
Hierarchical clusterers can be flexible, in that their results can partition into an arbitrary number of groups (by choosing the depth at which there are that amount of groups).
Depending on the data these results contain may also be useful as an approximation for fuzzy clustering. They may also be a little more helpful in cluster stability tests.
Some algorithms record and retain he distances of (/stress at) each such joint.
These can be interesting to visualize (think dendrograms and such),
and to effectively allow the amountofcluster choice to be made later (think threshold in a dendrogram).
http://gepas.bioinfo.cipf.es/cgibin/docs/clusterhelp
Notes on....
Group number choice
Some algorithms try to decide on a suitable number of target groups, but many require you to choose an exact number before they get started.
This number is difficult to decide since there is usually is no welldefined, implicit, calculable best choice.
This is a problem particularly in hard clustering, because any decision of group membership is very final.
The membership of bordercases may not be stable under even the slightest amount of (sample) noise.
Things you can do include:
 use evaluation to measure the fitness of a solution (or subsolutions while still clustering), based
Note that such a metric in itself is only a relative value in a distribution you don't know  you'll often have to calculate the fitness for many solutions to get a stillvague idea of fitness.
 In the case of hard clustering, you can intentionally add some noise and see how much the membership of each item varies  and, say, report that as the confidence we have in a choice.
 use some type of crossvalidation
Intercluster and intracluster comparisons; susceptibilities
Depending on the algorithm, you often want to be able to compare
 items to clusters (agglomerative and divisive decisions)
 clusters to clusters (e.g. in hierarchical decisions)
 items to items (e.g. for centroid/medoid decisions)
Clustertocluster distances are most interesting to hierarchical clustering, and can be calculated in a number of ways (usually hardwired into the algorithm), including:
 Singlelink, a.k.a. minimum linkage or nearest linkage
 the most similar combination (lowest distance) of possible comparisons
 more susceptible to overchaining than most other methods
 Completelink , a.k.a. max/farthest linkage
 uses the least similar (largest distance) combination of possible comparisons
 often gives a nonchained, more equally divided clusters than singlelink
 outliers may have disproportional influence
 Averagelink (a.k.a. groupaverage, a.k.a. Groupaverage (agglomerative) clustering (GAAC))  average of distances between all intercluster pairs
 less sensitive to outliers than completelink, less sensitive to inversion than centroid approaches.
 Centroid approaches use a calculated average for comparison to a cluster
 quite susceptible to inversions (verify)
 Medoid approaches try to use a representative item for comparison to a cluster
 somewhat susceptible to inversions (verify)
 Densitybased methods care more about local density of items and less directly about the exact distances involved
 Ward's method
 based on Ward criterion, a.k.a. the Ward minimum variance criterion
Potential problems:
 outliers
 including a single outlier may drastically change comparisons to that group.
 The timing of their inclustion can have significant effects on the result.
 the chaining effect refers to algorithms doing a chain/string of assignments of to a group. The concept is clearest in
 Inversions (sometimes 'reversal' or 'nonmonotonicity')  describes when similarity values do not decrease monotonously in a series of iterations
 easily happens when a process makes decisions based on centers that move in the process of clustering (such as in many centroidstyle processes) particularly when combined with cases where there is no clear clustering solution.
See also:
TODO: read
 http://people.revoledu.com/kardi/tutorial/Similarity/index.html
 http://www.google.com/search?hl=en&q=Mahalanobis+distance
on convergence
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Convergence is a nontrivial check in many algorithms.
You could check whether the group assignments have not changed, but this is sensitive to oscillations, resulting in a premature report of convergence and/or a failure to converge (depending somewhat on logic and data size).
A simple threshold is arbitrary since the error values often depend on the scale of the data values (which is not very trivial to correct for(verify)).
This is occasionally solved by error minimization criteria, for example minimizationofthesumofsquares.
There are some details to this. For example, reallocating a point between clusters, various methods consider only the error decrease in the target cluster  while the solution's total error may increase. It usually still converges, but the total error decreases with a little more oscillation, which "no significant improvement in the last step" terminating criterion may be sensitive to (though arguably it's always more robust to check whether the error decrease is roughly asymptotic with the minimum you presume it'll get to).
The idea resembles Expectation Maximization (EM) methods in that it tries to maximize the probability of the clusters being the correct by minimizing the energy/error.
Purely random initial positioning may cause the local minimum problem. Smartly seeding the initial centroids helps and need not be too computationally expensive  and in fact helps convergence; see e.g. kmeans++.
Alternatively, you could run many versions of the analysis, each with random initial placement, and see see whether (and/or to which degree) the results are stable, but this can be computationally expensive.
Robustness in hard clustering
Particularly hard clusterers are often not robust against even minor variations in the data. That is, separate data sets that are highly correlative may lead to significantly different results; areas in which membership is borderline flipflip under the tiniest (sample) noise.
You can evaluating a solution for stress (or correlate distances it implies to the original data e.g. in hierarhical data), though in itself this is only a general thing. It is meaningful the same measre from other clusterings of the same data, meaning that you *can* roughly compare different solutions for expression of the original data, but only in a roughly converging way.
One trick is to cause the problem and test how varied results will be over mild variations over the data. You can for example repeat the clustering some amount of times with some noise, and record how often things change membership.
You can repeat the clustering omitting random pieces of data to lessen the effect of outliers  pieces of data that do not agree with the rest.
You can even aggregate the results from these runs and combine them into a sort of fuzzy cluster result that can show you instabilities, and/or converge on a clusters amount choice.
Evaluation
Silhouette coefficients
DaviesBouldin index
Gap statistic
Related
Clustering as neighbour search
Hamming embedding
Unsorted
Implementation notes
kmeans
The kmeans problem is finding a cluster labelling for a given amount of clusters (k) with minimal error, where the error function is based on the the withingroup sum of squares.
(For completeness, that means for all elements in a group, calculate the square of the euclidean distance to the centroid, and sum up all these squares, which gives percluster error values. Various convergence checks will want to know the sum of these errors)
Most implementations are iterative and look something like:
 Position k cluster centroids (at random, or sometimes slightly more cleverly)
 For each element, assign to the nearest centroid
 Recalculate (affected) centroid means (and often the error/energy at the same time)
 Check whether the moved centroids change the element assignments.
 If so, iterate
 If no change, we have converged and can stop
Of iterative clustering methods, kmeans is the simplest and many others can be said to be based on it.
Kmeans gives better better results if the value for k is a good choice, representative for the data.
(it's not unusual to try various k and test them)
The common 'nearest cluster' criterion will avoid attraction of multiple clusters, and the whole will converge to a decent solution for k groups.
Limitations:
 results are sensitive to initial placement, and it is easy to get stuck in a local minimum.
 It is not unusual to run the clustering various times with different starting clusters and see how stable the clustering is.
 If k is not representative of the structure in the data the solution may not be satisfying at all. This is partly caused by, and partly independent of, the fact that there may be various possible stable clusterings.
 the simple distance metric means we say the shape for inclusion is always a circle
Averagecase runtime is decent because it's a fairly simple algorithm.
Worstcase runtime, for fairly pathological datasets, is fairly quite high. There are faster approximations of kmeans that you may wish to consider.
You can tweak kmeans in various ways. For example, you can assign weights (based of frequency, importance, etc) to elements to affect the centroid mean recalculation.
See also:
 Wikipedia: Kmeans algorithm
 Arthur & Vassilvitskii, kmeans++ The Advantages of Careful Seeding
 JA Lloyd (1957), Least Squares Quantization in PCM
 EW Forgy (1965), Cluster analysis of multivariate data: efficiency vs interpretability of classifications
 MacQueen (1967), Some methods for classification and analysis of multivariate observations.
 Hartigan and Wong (1979), A Kmeans clustering algorithm
 Kanungo, Mount, Netanyahu, Piatko, Silverman, and Wu, An efficient kmeans clustering algorithm: Analysis and implementation
 Bottou, Bengio, Convergence Properties of the KMeans Algorithms
Variations:
 ISODATA (Iterative SelfOrganising Data Analysis Technique Algorithm) builds on kmeans and tries to be smart in initial positions and in variations of k(verify).
 Hmeans is a variation on kmeans that recalculates the centroid only after a complete iteration over all the items, not after each reassignment.
 Seen one way, it checks for error decrease less often, which makes it a smidge more sensitive to local minima and perhaps doesn't converge as nicely.
 Although the difference in practice tends to be slight, kmeans tends to be the slightly safer (and much more common) choice, even if the order it handles elements in is a different kind of bias that hmeans avoids.
Canopy clustering
https://en.wikipedia.org/wiki/Canopy_clustering_algorithm
Bisecting kmeans
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
hard cmeans
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
UPGMA, WPGMA
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
UPGMA = Unweighed Pair Group Method using Arithmetic averaging
WPGMA = Weighed Pair Group Method using Arithmetic averaging
(These specific names/abbreviations come from Sneath and Sokal 1973)
UPGMA assigns equal weight to all distances:
D((u,v),w) = (n_{u}*D(u,w) + n_{v}*D(v,w)) / (n_{u}+n_{u})
WPGMA uses:
D((u,v),w) = (n_{u}*D(u,w) + n_{v}*D(v,w)) / 2
In the unweighed variant, the two things being combined weigh equally,
in the weighed variant, all leaves that are part of a cluster weigh in as much as the other part.
Bottomup combiners working from a difference matrix, combining whatever leaf/cluster distance is minmal, then recalculating the difference matrix.
It's not too hard too argue this terminology is a little arbitrary
UPGMC, WPGMC
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Unweighed Pair Group Method using Centroids, and Weighed Pair Group Method using Centroids
(the specific names/abbreviations come from Sneath and Sokal 1973)
Fuzzy cmeans (FCM)
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
 Type: fuzzy clustering (not really partitioning anymore)
Method: Like kmeans, but weighs centroid recentering calculations by fuzzy distance to all data points
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/cmeans.html
http://www.scholarpedia.org/article/Fuzzy_CMeans_Cluster_Analysis
Fuzzy kmeans
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Shell clustering
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Basic fuzzy cmeans would include by a radius  a sphere.
There are various other options, including:
 fuzzy cquadric shells algorithm (FCQS) detects ellipsoids
 fuzzy cvarieties algorithm (FCV) detects infinite lines (linear manifolds) in 2D
 adaptive fuzzy cvarieties algorithm (AFC): detects line segments in 2D data
 fuzzy cshells algorithm (FCS) detects circles
 fuzzy cspherical shells algorithm (FCSS) detects circles
 fuzzy crings algorithm (FCR) detects circles
 fuzzy crectangular shells algorithm (FCRS) detects rectangles
 GathGeva algorithm (GG) detects ellipsoids
 GustafsonKessel algorithm (GK) detects ellipsoids of roughly the same size
PAM
Partitional, medoidbased
CLARA
Partitional, medoidbased
AGNES
Hierarchical
AGglomerative NESting
DIANA
Hierarchical, divisive
DIvisive ANAlysis  the idea is that at every step, the most homogenous cluster is split in two
Buckshot
Hybrid
Birch
Hybrid
BIRCH (balanced iterative reducing and clustering using hierarchies)
https://en.wikipedia.org/wiki/BIRCH
Cure
Hybrid
Clustering Using REpresentatives
https://en.wikipedia.org/wiki/CURE_algorithm
Rock
Hybrid
"Rock: A robust clustering algorithm for categorical attributes"
Chameleon
Hybrid, Hierarchical
"CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling"
DBSCAN
Densitybased.
The assumption that real objects will always be a dense cloud of points more easily rejects random points as noise/outliers (even if relatively close).
It can also deal decently with closeby nonlinear clusters, if separated cleanly.
http://en.wikipedia.org/wiki/DBSCAN
OPTICS
Similar to DBSCAN
https://en.wikipedia.org/wiki/OPTICS_algorithm
FLAME
Fuzzy, densitybased
http://en.wikipedia.org/wiki/FLAME_Clustering
Clustering by Committee (CBC)
Hybrid
Based on responsive elements (a comittee) voting on specific outcomes.
See also:
 http://www.aclweb.org/aclwiki/index.php?title=CBC
 PA Pantel (2003) Clustering by Committee [11]
Principal Direction Divisive Partitioning (PDDP)
Information Bottleneck
See also:
 https://en.wikipedia.org/wiki/Information_bottleneck_method
 N Slonim, "The Information Bottleneck: Theory and Applications"
Agglomerative Information Bottleneck
See also:
 N Slonim, N Tishby (1999) "Agglomerative Information Bottleneck"
Expectation Maximization Clustering
Related fields
See also
 http://www.statsoft.com/textbook/stcluan.html
 http://www.cs.joensuu.fi/pages/franti/vq/algorithms.htm
State observers / state estimation (and filters in this sense)
State observers / state estimation will estimate the internal state of a real system, measuring what they easily can, and estimating what they need to,
Good knowledge of a system is useful to various control problems.
Bayes estimator, Bayes filter
alphabeta filter
An alpha beta filter (a.k.a. fg filter, gh filter)
Kalman filter
Multi hypothesis tracking
Particle filter
Data fusion
Sensor fusion
Optimization theory, control theory
Glossary
System analysis
Problem theory
Types of problems
Some controllers
In terms of the (near) future:
 greedy control doesn't really look ahead.
 PID can be tuned for some basic tendencies
 MPC tries to minimize mistakes in predicted future
 For example, take a HVAC system that actively heats but passively cools. This effectively means you should be very careful of overshooting. You would make the system sluggish  which also reduces performance because it lengthens the time of effects and settling
Nonlinear:
 HVAC
Greedy controllers
Doesn't look ahead, just minimizes for the current step.
Tends not to be stable. Can be stable enough for certain cases, in particular very slow systems where slow control is fine, and accuracy not so important.
For example, water boilers have such large water volume, which has high heat capacity, that even a bangbang controller (turn a fewhundredwatt heating element fully on or off according to temperature threshold) will keep the water within a few degrees of that threshold, simply because the water's heat capacity is large in relation to the heating element you'ld probably use (see also thermal inertia).
If on the other hand the boiler's volume is small or the heater too powerful, it will spike much too warm, and average will be noticeably above the setpoint.
A proportional controller (e.g. for the boiler case, even slow PWM of that heating element) would probably be good enough.
....but more generally, greedy control tends to not be stable when you want control to be fast.
You easily run into the issue that actuation may be too fast for the measurement,
easily causing feedback and oscillations as well.
Hysteresis (behaviour)
Hysteresis control (type)
Mapbased controller (type)
PID controller
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
PID is a fairly simple and generic controlloop system, still widely used in industrial control systems.
It is useful in systems that have to deal with some delay between and/or in actuation and sensing, where they can typically be tuned to work better than greedy proportionalonly controllers (and also be tuned to work worse  which you want to avoid).
Compared to some other cleverer methods, PID is computationally very cheap (a few adds and multiplies per step).
Yet:
 there is no easily quantified way to describe or guarantee optimality stability
 you have to tune them,
 and learn how to tune them
 tuning is complex in that it depends on
 how fast the actuation works
 how fast you sample
 how fast the system change in reaction / settles
 doesn't deal well with long time delays
 derivative component is sensitive to noise, so filtering may be a good idea
 has trouble controlling complex systems
 more complex systems should probably look to MPC or similar.
 linear at heart (assumes measurement and actuation are relatively linear)
 so doesn't perform so well in nonlinear systems
 symmetric at heart, so not necessarily wellsuited to nonsymmetric actuation
 consider e.g. a HVAC system  which would oscillate around its target by alternately heating and cooling.
 It is much more power efficient to do one passively, e.g. active heating and passive cooling (if it's cold outside), or active cooling and passive heating (if it's warmer outside)
 means it's easier to overshoot, and more likely to stick offsetpoint on the passive side, so on average be on one side
 You could make the system sluggish  in this case it reduces the speed at which it reaches the setpoint, but that is probably acceptable to you.
 in other words: sluggish system and/or a bias to one side
Some definition
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
The idea is to adjust the control based on some function of the error, and a Proportional–Integral–Derivative (PID) controller combines the three components it names, each tweaked with their own weight (gain).
The very short version is that
 P adjusts according to the proportional error
 I adjusts according to the integrated error
 D adjusts according to the derivative error
It can be summarized as:
where
 e(t) is the error
 P, I, and D are scalar weights controlling how much effect each component has
Some intuition
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
So how do you tune it?
MPC (Model Predictive Control)
FLC (Fuzzy Logic Control)
LQR
Qlearning
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
Qlearning is a specific type of #reinforcement learning.
Qlearning models discretized (state,action) mapping, and learns the values for each mapping and implicitly the optimal action in each state, by trial and error.
You can see it as a statebyaction table where each cell stores a value that is increased on success, decreased on failure (often with an EWMA(verify) to control learning rate?).
Like rewardbased learning in general, Qlearning can be supported by something as simple or complex as you wish.
The discretized approach mainly makes when there is a relatively small set of possible states and actions,
but the larger that space is, the longer it takes takes very long to actually learn enough for each pair.
Deep Qlearning is a variant that outputs the value for all actions from a network instead, which is often a more compact way to store what can works out as a very large state space, and may be m
and can be more robust(verify)
https://en.wikipedia.org/wiki/Qlearning
See also
Semisorted
Structured prediction
Mistake Bound learning
PAC theory
Evolutionary/genetic computation
This article/section is a stub — probably a pile of halfsorted notes, is not wellchecked so may have incorrect bits. (Feel free to ignore, fix, or tell me) 
For context:
Part of a wider category of evolutionary computation, which is is inspired by the observation that biological evolution tends to adapt well for a specific task.
Evolutionary computation is a wider field including concepts like optimizations imitating ant colonies or particle swarms, to gene expression, to selforganization, to agentbased modeling, to genetic programming.
Applications include include classification, curve fitting and other statistical/data modelling, feature selection.
Perhaps the largest applications lie around gaming, research, as people don't seem to like the opaqueness of the method for 'real' applications. But now that we're okay with equally opaque neural networks, perhaps that has ended? (and yes, things like NEAT apply for both and works decently).
Technically, genetic algorithms focus more on the evolutionary concepts and is more a method that barely gets into possible implementations, while genetic programming on its application specifically to generating machine programs, but border between the two is sometimes vague and used interchangably.
Genetic programming is specifically concerned with generating a program fit for a specific task by trial and error, with the largest components being a fitness function and introducing some randomness in a group (population) of candidate programs.
The representation that is altered can be anything productive, from trees of operators to machine code to neural nets.
Practically, you have to deal with aspects like
 generating an initial population
 evaluate fitness of individuals
 deciding how much and how to
 reproduce (copy asis)
 cross over (combine into offspring)
 and mutate (applying a genetic variation)
 deciding when to stop