Three criteria for evaluating whether a package or library is suitable for processing very large datasets:
Implement core functions in a language that is efficient for computation
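- Compiled languages such as C, C++, Fortran, and Rust are well suited to this.
- Interpreted high-level languages such as Python and R are much slower for tight computational loops, so packages written in them typically delegate the heavy computation to compiled code.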
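As a rough illustration of the gap between interpreted and compiled code, the sketch below sums the same list twice in Python: once with an explicit loop run by the interpreter, and once with the built-in `sum`, which CPython implements in C. The function name `python_loop_sum` and the list size are assumptions chosen for this demo. Timings depend on the machine, so no expected numbers are shown.

```python
import timeit


def python_loop_sum(values):
    """Sum the values with an explicit loop executed by the interpreter."""
    total = 0
    for v in values:
        total += v
    return total


# Demo data (size chosen arbitrarily for illustration).
values = list(range(200_000))

# Time both approaches over the same input.
loop_time = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)

print(f"interpreted loop: {loop_time:.4f} s")
print(f"built-in sum (C): {builtin_time:.4f} s")
```

Both calls return the same result; only the language doing the inner loop differs.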
Support multithreading and even distributed computing
- Relying on a single core and a single thread leaves most of a modern machine's capacity unused.
- The work should be parallelizable, either across cores on one computer or across the nodes of a cluster.
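The idea of spreading work across cores can be sketched with Python's standard `concurrent.futures` module: split the data into chunks, process each chunk in a separate worker, and combine the partial results. The helper names (`count_above`, `split_into_chunks`, `parallel_count`) and the height threshold are hypothetical and only serve the demo. A process pool is used by default because threads in CPython do not speed up CPU-bound pure-Python code.

```python
from concurrent.futures import ProcessPoolExecutor


def count_above(chunk, threshold=2.0):
    """Count the values in one chunk that exceed the threshold."""
    return sum(1 for z in chunk if z > threshold)


def split_into_chunks(values, n_chunks):
    """Split a list into at most n_chunks pieces of nearly equal size."""
    size = max(1, -(-len(values) // n_chunks))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_count(values, n_workers=4, executor_cls=ProcessPoolExecutor):
    """Process each chunk in its own worker and add up the partial counts."""
    chunks = split_into_chunks(values, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        return sum(pool.map(count_above, chunks))


if __name__ == "__main__":
    # Synthetic "heights" between 0.0 and 4.9 (made-up demo data).
    heights = [(i % 50) / 10.0 for i in range(200_000)]
    print(parallel_count(heights))
```

The same split, process, and combine pattern carries over to a cluster, where the workers are separate machines rather than local processes.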
Enable efficient indexing
- This is particularly critical for geospatial applications.
- Data should be lazy-loaded, i.e., read only when it is actually needed.
- An index makes it possible to locate and load just the needed data quickly.
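A minimal sketch of how indexing and lazy loading work together: a catalog records only the bounding box of each tile, and a tile's points are read only when a query overlaps that box. The class `TileCatalog`, the in-memory `tiles` dictionary standing in for files on disk, and the linear scan over extents are all assumptions made for illustration. This is not the API of any particular library, and real spatial indexes typically use structures such as quadtrees or R-trees instead of a linear scan.

```python
class TileCatalog:
    """Index of tile extents; tile contents are loaded only on demand."""

    def __init__(self, loader):
        self._loader = loader  # callable: tile name -> list of (x, y) points
        self._extents = {}     # tile name -> (xmin, ymin, xmax, ymax)
        self._cache = {}       # tiles that have actually been loaded

    def add_tile(self, name, extent):
        """Register a tile by its bounding box without reading its data."""
        self._extents[name] = extent

    def query(self, xmin, ymin, xmax, ymax):
        """Return the points inside the query box, loading only overlapping tiles."""
        points = []
        for name, (txmin, tymin, txmax, tymax) in self._extents.items():
            # Skip tiles whose extent does not overlap the query box.
            if txmax < xmin or txmin > xmax or tymax < ymin or tymin > ymax:
                continue
            if name not in self._cache:
                self._cache[name] = self._loader(name)  # lazy load
            points.extend(
                p for p in self._cache[name]
                if xmin <= p[0] <= xmax and ymin <= p[1] <= ymax
            )
        return points

    @property
    def loaded(self):
        """Names of the tiles that have been read so far."""
        return sorted(self._cache)


# Made-up demo data standing in for three files on disk.
tiles = {
    "a": [(1, 1), (5, 5)],
    "b": [(12, 3), (18, 9)],
    "c": [(3, 14)],
}

catalog = TileCatalog(loader=lambda name: tiles[name])
catalog.add_tile("a", (0, 0, 10, 10))
catalog.add_tile("b", (10, 0, 20, 10))
catalog.add_tile("c", (0, 10, 10, 20))

result = catalog.query(11, 0, 19, 10)
print(result)
print(catalog.loaded)
```

Only the tile whose extent overlaps the query box is read; the other two are never touched.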
Some examples
lasR and lidR: see their documentation for the design philosophy behind them
- Multithreading
- LAS Catalog (implemented in lidR, supported by lasR)
- Spatial Indexing