David Ansa

Introducing dataDisk: Simplify Your Data Processing Pipelines

Are you looking for an easy, efficient way to create and manage data processing pipelines? Look no further! I am excited to introduce dataDisk, a Python package designed to streamline your data processing tasks. Whether you are a data scientist, a data engineer, or a developer who works with data regularly, dataDisk offers a flexible, robust way to handle your data transformation and validation needs.

Key Features

  • Flexible Data Pipelines: Define a sequence of data processing tasks, including transformations and validations, with ease.
  • Built-in Transformations: Use a variety of pre-built transformations such as normalization, standardization, and encoding.
  • Custom Transformations: Define and integrate your own transformation functions (see the sketch after this list).
  • Parallel Processing: Enhance performance with parallel execution of pipeline tasks.
  • Easy Integration: Simple and intuitive API to integrate dataDisk into your existing projects.
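
As a quick illustration of the custom-transformation hook, here is a minimal sketch. The drop_outliers function below is my own example, not part of dataDisk, and it assumes that add_task accepts any callable that takes a DataFrame and returns a DataFrame, just like the built-in transformations used later in this post:

import pandas as pd

# Hypothetical custom transformation (not part of dataDisk): keep only rows
# whose numeric values fall within three standard deviations of the mean.
def drop_outliers(df: pd.DataFrame) -> pd.DataFrame:
    numeric = df.select_dtypes('number')
    mask = ((numeric - numeric.mean()).abs() <= 3 * numeric.std()).all(axis=1)
    return df[mask]

# Once a pipeline exists (see "How It Works" below), it should plug in
# the same way as the built-ins:
# pipeline.add_task(drop_outliers)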

How It Works

  • Define Your Data Source and Sink

Specify the source of your data and where you want the processed data to be saved.

from dataDisk.data_sources import CSVDataSource
from dataDisk.data_sinks import CSVSink

source = CSVDataSource('input_data.csv')
sink = CSVSink('output_data.csv')

  • Create Your Data Pipeline

Initialize the data pipeline and add the desired tasks.

from dataDisk.pipeline import DataPipeline
from dataDisk.transformation import Transformation

pipeline = DataPipeline(source=source, sink=sink)
pipeline.add_task(Transformation.data_cleaning)
pipeline.add_task(Transformation.normalize)
pipeline.add_task(Transformation.label_encode)

  • Execute the Pipeline

Run the pipeline to process your data.

pipeline.process()
print("Data processing complete.")

Get Started

To start using dataDisk, simply install it via pip:

pip install dataDisk
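
To confirm the install, try importing the package:

python -c "import dataDisk; print('dataDisk imported successfully')"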

Contribute to dataDisk

I believe in the power of community and open source. dataDisk is still growing, and I need your help to make it even better! Here’s how you can contribute:

  • Star the Repository: If you find dataDisk useful, please star the GitHub repository. It helps the project gain visibility and attract contributors.
  • Submit Issues: Found a bug or have a feature request? Submit an issue on GitHub.
  • Contribute Code: I welcome pull requests! If you have improvements or new features to add, please fork the repository and submit a PR.
  • Spread the Word: Share dataDisk with colleagues and friends who might benefit from it.

Example: Testing Transformations

Here's an example that shows how to test the transformations available in dataDisk:

import logging
import pandas as pd
from dataDisk.transformation import Transformation

logging.basicConfig(level=logging.INFO)

# Sample DataFrame
data = pd.DataFrame({
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [6, 7, 8, 9, 10],
    'category': ['A', 'B', 'A', 'B', 'A'],
    'feature3': [None, 2.0, None, 4.0, 5.0]
})

logging.info("Original Data:")
logging.info(data)

# Test standardize
logging.info("Testing standardize transformation")
try:
    standardized_data = Transformation.standardize(data.copy())
    logging.info(standardized_data)
except Exception as e:
    logging.error(f"Standardize transformation failed: {str(e)}")

# Test other transformations...
# Add similar blocks for normalize, label_encode, etc.
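
Rather than repeating that try/except block for each transformation, you can loop over the transformation names. This sketch assumes that normalize, label_encode, and data_cleaning are, like standardize, static methods on Transformation that take and return a DataFrame, as the pipeline example above suggests:

# Assumes each named transformation is a static method on Transformation
# that takes a DataFrame and returns a DataFrame.
for name in ['standardize', 'normalize', 'label_encode', 'data_cleaning']:
    logging.info("Testing %s transformation", name)
    try:
        transform = getattr(Transformation, name)
        logging.info(transform(data.copy()))
    except Exception as e:
        logging.error("%s transformation failed: %s", name, e)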

Join us in making dataDisk the go-to solution for data processing pipelines!

GitHub: dataDisk repository

Please star the project!
