Packages Good for Large Dataset Processing

Three criteria help evaluate whether a package or library is suitable for processing very large datasets:

  • Implement core functions in a language that is efficient for computation

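    • C, C++, Fortran, and Rust are good at this because they compile to native machine code.
    • Python, R, and other interpreted high-level languages are much slower when the heavy computation runs in the language itself.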
  • Support multithreading and even distributed computing

    • Relying on a single core and a single thread will not deliver good performance on very large datasets.
    • Parallel execution is needed, either as multiple threads on one computer or as distributed processes across a cluster.
  • Enable efficient indexing

    • This is particularly critical for geospatial applications.
    • Data should be lazy-loaded, i.e., loaded only as needed.
    • Indexing allows quickly loading the needed data.
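
As a rough illustration of the indexing and lazy-loading criterion, the sketch below stores points in a binary file, keeps a small grid index of byte offsets in memory, and reads only the records whose grid cell overlaps a query box. The file layout, the cell size, and the names write_points and query are assumptions made for this example, not the API of any particular package. Real geospatial libraries typically use more sophisticated structures such as R-trees.

```python
import os
import struct
import tempfile
from collections import defaultdict

# Hypothetical record layout: x, y, value as little-endian float64.
RECORD = struct.Struct("<ddd")
# Hypothetical grid cell size, in coordinate units.
CELL = 10.0


def write_points(path, points):
    """Write points to a binary file and return a grid index.

    The index maps a grid cell (col, row) to the byte offsets of the
    records that fall inside that cell.
    """
    index = defaultdict(list)
    with open(path, "wb") as f:
        for x, y, value in points:
            index[(int(x // CELL), int(y // CELL))].append(f.tell())
            f.write(RECORD.pack(x, y, value))
    return index


def query(path, index, xmin, ymin, xmax, ymax):
    """Return the points inside the box and the number of records read.

    Only records in grid cells that overlap the box are read from disk;
    the rest of the file is never touched.
    """
    hits = []
    reads = 0
    with open(path, "rb") as f:
        for col in range(int(xmin // CELL), int(xmax // CELL) + 1):
            for row in range(int(ymin // CELL), int(ymax // CELL) + 1):
                for offset in index.get((col, row), ()):
                    f.seek(offset)
                    x, y, value = RECORD.unpack(f.read(RECORD.size))
                    reads += 1
                    # A cell may extend beyond the box, so filter exactly.
                    if xmin <= x <= xmax and ymin <= y <= ymax:
                        hits.append((x, y, value))
    return hits, reads


# Usage: 10,000 points on a regular 100 x 100 layout.
points = [(i + 0.5, j + 0.5, float(i * 100 + j))
          for i in range(100) for j in range(100)]

with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "points.bin")
    grid_index = write_points(data_path, points)
    found, records_read = query(data_path, grid_index, 12.0, 12.0, 18.0, 18.0)

print(len(found), "points found;", records_read, "of", len(points), "records read")
# prints: 36 points found; 100 of 10000 records read
```

Here the index limits the query to one grid cell, so only 1% of the file is read. The trade-off in this sketch is the cell size: smaller cells mean fewer wasted reads but a larger index to hold in memory.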

Some examples


See also