The use of machine learning techniques, in particular unsupervised clustering and dimensionality reduction algorithms, is quickly becoming a standard workflow for identifying and visualizing biological populations in high-dimensional data. These methods allow researchers to approach data analysis without the bias and subjectivity that have traditionally characterized analysis in the field.
Each algorithm has context-dependent strengths and weaknesses, but one limitation is common to most: an inability to scale computation to large datasets. Most algorithms are designed and distributed to run on individual computers, where memory and CPU are quickly exhausted by large datasets. Even when high-performance compute resources are available, many algorithms cannot scale to large datasets because of fundamental properties of their design. Those that do scale often do so at the cost of an untenable increase in runtime or diminished quality of results.