A high-performance data engine providing simple and reliable data processing for any modality and scale.

Any Modality. Radical potential.

Unified multimodal data processing

Break down data silos with a single framework that handles structured tables, unstructured text, and rich media like images—all with the same intuitive API. Why juggle multiple tools when one can do it all?

Python-native, no JVM required

Built for modern AI/ML workflows with Python at its core and Rust under the hood. Skip the JVM complexity, version conflicts, and memory tuning to achieve 20x faster start times—get the performance without the Java tax.

Seamless scaling, from laptop to cluster

Start local, scale global—without changing a line of code. Daft's Rust-powered engine delivers blazing performance on a single machine and effortlessly extends to distributed clusters when you need more horsepower.
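
As a rough illustration of the "start local, scale out" idea, the sketch below picks a runner from configuration so the query code itself stays the same. It assumes Ray's standard `RAY_ADDRESS` environment variable holds the cluster address and that `daft.context.set_runner_ray` is available in the installed Daft version; the helper names `choose_runner` and `configure_daft` are hypothetical.

```python
import os


def choose_runner(env=os.environ) -> tuple:
    """Return ("ray", address) when a cluster address is configured, else ("native", None)."""
    address = env.get("RAY_ADDRESS")
    return ("ray", address) if address else ("native", None)


def configure_daft(env=os.environ) -> str:
    """Point Daft at a Ray cluster if one is configured; otherwise keep the local default."""
    import daft  # imported lazily so choose_runner works without Daft installed

    runner, address = choose_runner(env)
    if runner == "ray":
        # Assumption: this must run before any DataFrame is executed in the process.
        daft.context.set_runner_ray(address=address)
    return runner
```

Under these assumptions, calling `configure_daft()` once at startup would be the only place that differs between a laptop run and a cluster run.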

Use Cases

Large Scale Document Processing

  • Daft integrates with AI models to create text embeddings at scale.
  • With a pre-trained transformer model wrapped in a user-defined function, Daft processes large document collections in cloud storage and converts text into embeddings for downstream AI applications such as semantic search or clustering.
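
The embedding workflow described above might look roughly like the following sketch. To stay self-contained it swaps the pre-trained transformer for a toy hashing embedder (`toy_embed`, hypothetical), and it assumes the documents are Parquet files with a `text` column. The batch `@daft.udf` decorator is used here; the UDF API has changed across Daft releases, so check the documentation for the installed version.

```python
import hashlib
import math

EMBEDDING_DIM = 8  # toy size; real sentence encoders emit far larger vectors


def toy_embed(text: str, dim: int = EMBEDDING_DIM) -> list:
    """Stand-in for a transformer encoder: hash tokens into buckets, then L2-normalize."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def embed_documents(path: str):
    """Build a lazy DataFrame with an added `embedding` column (assumes a `text` column)."""
    import daft  # imported lazily so toy_embed stays usable without Daft installed

    @daft.udf(return_dtype=daft.DataType.python())
    def embed_batch(texts: daft.Series):
        # Batch UDF: receives a whole column chunk and returns one value per row.
        return [toy_embed(t or "") for t in texts.to_pylist()]

    df = daft.read_parquet(path)
    return df.with_column("embedding", embed_batch(daft.col("text")))
```

A call such as `embed_documents("s3://my-bucket/docs/*.parquet").collect()` would trigger execution; the bucket path is a placeholder, and a real pipeline would replace `toy_embed` with an actual model.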
Native Multimodal Processing

Process any data type—from structured tables to unstructured text and rich media—with native support for images, embeddings, and tensors in a single, unified framework.

Rust-Powered Performance

Experience breakthrough speed with our Rust foundation delivering vectorized execution and non-blocking I/O that processes the same queries with 5x less memory while consistently outperforming industry standards by an order of magnitude.

Seamless ML Ecosystem Integration

Slot directly into your existing ML workflows with zero friction—whether you're using PyTorch, NumPy, Pandas, or HuggingFace models, Daft works where you work.

Universal Data Connectivity

Access data anywhere it lives—cloud storage (S3, Azure, GCS), modern table formats (Iceberg, Delta Lake, Hudi), or enterprise catalogs (Unity, AWS Glue)—all with zero configuration.

Push Your Code to Your Data

Bring your Python functions directly to your data with zero-copy UDFs powered by Apache Arrow, eliminating data movement overhead and accelerating processing speeds.

Out of the Box Reliability

Deploy with confidence—intelligent memory management prevents OOM errors while sensible defaults eliminate configuration headaches, letting you focus on results, not infrastructure.

Trusted by
“Daft was incredible at large volumes of abnormally shaped workloads - I pointed it at 16,000 small Parquet files in a self-hosted S3 service and it just worked! It's the data engine built for the cloud and AI workloads.”

Tony Wang

Data @ Anthropic, PhD @ Stanford

“Amazon uses Daft to manage exabytes of Apache Parquet in our Amazon S3-based data catalog. Daft improved the efficiency of one of our most critical data processing jobs by over 24%, saving over 40,000 years of Amazon EC2 vCPU computing time annually.”

Patrick Ames

Principal Engineer @ Amazon

“Daft has dramatically improved our 100TB+ text data pipelines, speeding up workloads such as fuzzy deduplication by 10x. Jobs previously built using custom code on Ray/Polars have been replaced by simple Daft queries, running on internet-scale unstructured datasets.”

Maurice Weber

PhD AI Researcher @ Together AI

“Daft as an alternative to Spark has changed the way we think about data on our ML Platform. Its tight integrations with Ray let us maintain a unified set of infrastructure while improving both query performance and developer productivity. Less is more.”

Alexander Filipchik

Head Of Infrastructure at City Storage Systems (CloudKitchens)


Ecosystem

Interested in a managed version of Daft? Sign up for early access.

Get updates, contribute code, or say hi.
Daft Engineering Blog
Join us as we explore innovative ways to handle vast datasets, optimize performance, and revolutionize your data workflows.
GitHub Discussions Forums
The Distributed Data Community Slack