
Securing User Secrets

· 5 min read
Eléonore Charles
Product Manager
Frédéric Collonval
Chief Technical Officer (ad-interim)

Handling secrets is one of the most critical aspects of maintaining a secure system. Secrets, such as API keys, passwords, and encryption keys, must be protected from unauthorized access and potential leaks.

At Datalayer, we take this responsibility seriously and have implemented robust measures to ensure that secrets are handled securely and efficiently.

Why Secrets Management Matters

Secrets are often at the heart of modern cloud applications, providing access to databases, APIs, and services. However, storing these sensitive credentials in less-secure areas, such as system environments or configuration files, leaves them vulnerable to attacks. Even a single exposed secret can result in significant security breaches, data loss, and compromised systems.

To minimize these risks, it's essential to store secrets using specialized solutions designed to handle this specific challenge. These solutions ensure that secrets are properly encrypted, managed, and retrieved only when needed.

Using a Strong Vault

At Datalayer, we have integrated HashiCorp Vault to store user secrets securely. HashiCorp Vault is one of the most trusted solutions for secret management, widely used by companies like Deutsche Bank and Airbnb. Vault provides an enterprise-grade approach to secrets management, offering encryption, access control, and auditing features that ensure secrets are only accessible by authorized entities.

How It Works at Datalayer

Whenever a Remote Kernel is requested, we fetch user secrets securely from the Vault and inject them into the Remote Kernel as environment variables. This approach ensures that secrets are only available to the processes that require them, reducing the risk of leaks in more exposed parts of the system, such as logs or error messages.

Users can define personal secrets on the platform. If they do so, the secrets are injected into all Remote Kernels as environment variables, with the environment variable name matching the secret name.
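For example, code running in a Remote Kernel reads a secret the same way it reads any other environment variable (the secret name MY_API_KEY below is hypothetical):

import os

api_key = os.environ["MY_API_KEY"]  # the variable name matches the secret name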

The secrets are stored in HashiCorp Vault, ensuring the highest current security standards. This requires querying the Vault each time a Remote Kernel is assigned to a user and injecting the secrets into the running kernel process as environment variables. The injection is achieved by leveraging the kernel protocol: a companion sidecar container opens a connection to the kernel and sends a code snippet that sets the secrets.
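The sketch below illustrates the general idea using jupyter_client. It is a simplified approximation rather than Datalayer's actual sidecar code, and the connection file path and secret values are placeholders:

from jupyter_client import BlockingKernelClient

# Connect to the already-running kernel (connection file path is a placeholder)
client = BlockingKernelClient(connection_file="kernel-connection.json")
client.load_connection_file()
client.start_channels()

# In practice, the secrets are fetched from the Vault
secrets = {"MY_API_KEY": "s3cr3t"}

# Build and send a snippet that sets the secrets as environment variables
snippet = "import os\n" + "\n".join(
    f"os.environ[{name!r}] = {value!r}" for name, value in secrets.items()
)
client.execute(snippet, silent=True)  # silent=True keeps the snippet out of user output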

On the platform, you can now find a new Secrets section in the user settings to manage your secrets.

[Screenshots: Secrets view and secrets creation]

To learn more about how we have implemented the secrets injection in our platform, check out our technical documentation: Secrets and Env Vars Injection.

What's Next? Integrating SQL Cells and Data Sources

Moving forward, we are working on the next phase of improving our platform by integrating SQL cells and popular storage and database solutions such as Google BigQuery, Snowflake, Amazon Athena, and more.

This will allow for greater flexibility when working with data, as users will be able to securely connect to a variety of databases and query them directly from their remote environments.

With the Vault ensuring the security of database credentials, users can focus on deriving insights from their data without worrying about security breaches or unauthorized access.
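As a hedged illustration (SQL cells are not yet released), credentials injected from the Vault as environment variables could feed a standard client library directly, so they never appear in the notebook itself; the variable names below are hypothetical:

import os

import snowflake.connector

# Credentials come from Vault-injected environment variables, never from code
conn = snowflake.connector.connect(
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
    account=os.environ["SNOWFLAKE_ACCOUNT"],
)

cursor = conn.cursor()
cursor.execute("SELECT CURRENT_VERSION()")
print(cursor.fetchone())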

Conclusion

The protection of sensitive data is a top priority at Datalayer. With HashiCorp Vault, we ensure that user secrets are securely stored and managed, providing a safe and scalable solution for our platform.

As we continue to enhance our platform with new features like SQL cells and data source integrations, the strong foundation of security we've built with Vault will support us in delivering more powerful and secure tools for our users.


Deep Dive into our Examples Collection

· 5 min read
Eléonore Charles
Product Manager
Eric Charles
Datalayer CEO/Founder

In the fast-evolving world of data science and AI, having the right tools and resources is critical for success. As datasets grow larger and computations more complex, data scientists need scalable, flexible, and reliable solutions to perform high-performance analyses. Datalayer allows you to scale your data science workflows with ease, thanks to its Remote Kernels solution. This feature enables you to run computations in powerful cloud environments directly from JupyterLab, VS Code, or the CLI.

We have created a public GitHub repository with a collection of Jupyter Notebooks that showcases scenarios where Datalayer proves highly beneficial. These examples cover a wide range of topics, including machine learning, computer vision, natural language processing, and generative AI.

Explore the Datalayer Examples Collection

Here is an overview of the examples available in the Datalayer public GitHub repository. To access the notebook code, simply click on the links provided.

1. OpenCV Face Detection

This example uses OpenCV to detect faces in YouTube videos. It relies on a traditional Haar Cascade model, which may be less accurate than modern deep learning-based models, and it parallelizes face detection and video processing across multiple CPUs to optimize performance and efficiency. Datalayer further enhances this capability by enabling seamless scaling across multiple CPUs.
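For reference, the core detection step looks roughly like this minimal sketch (the frame path is a placeholder; the notebook adds video downloading and multi-CPU parallelism on top):

import cv2

# Load the pre-trained frontal-face Haar Cascade shipped with OpenCV
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
)

frame = cv2.imread("frame.jpg")  # a single video frame (placeholder path)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Detect faces and draw bounding boxes around them
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
for (x, y, w, h) in faces:
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)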

2. Image Classifier with Fast.ai

This example demonstrates how to build a model that distinguishes cats from dogs in pictures using the fast.ai library. Due to the computational demands of training a model, a GPU is required.
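The notebook follows the classic fast.ai pets approach; a condensed sketch (dataset and settings as in the fast.ai documentation) looks like this:

from fastai.vision.all import *

# Download the Oxford-IIIT Pets dataset
path = untar_data(URLs.PETS) / "images"

# In this dataset, cat images have filenames starting with an uppercase letter
def is_cat(filename):
    return filename[0].isupper()

dls = ImageDataLoaders.from_name_func(
    path,
    get_image_files(path),
    valid_pct=0.2,
    seed=42,
    label_func=is_cat,
    item_tfms=Resize(224),
)

# Fine-tune a pre-trained ResNet; this step is where the GPU matters
learn = vision_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)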

3. Dreambooth

This example uses the Dreambooth method, which takes as input a few images (typically 3-5 suffice) of a subject (e.g., a specific dog) and the corresponding class name (e.g., "dog"), and returns a fine-tuned, 'personalized' text-to-image model (source: Dreambooth). A GPU is required for this fine-tuning process.

4. Text Generation with Transformers

These notebook examples demonstrate how to leverage Datalayer's GPU kernels to accelerate text generation using the Gemma model and the Hugging Face Transformers library.

Transformers Text Generation

This notebook uses Gemma-7b and Gemma-7b-it, the instruction-tuned version of Gemma-7b.

Sentiment Analysis with Gemma

This example demonstrates how you can leverage Datalayer's Cell Kernels feature in JupyterLab to offload specific tasks, such as sentiment analysis, to a remote GPU while keeping the rest of your code running locally. By selectively using remote resources, you can optimize both performance and cost. This hybrid approach is perfect for tasks like sentiment analysis with an LLM, where some parts of the code require more computational resources than others. For a detailed explanation and step-by-step guide on using Cell Kernels, check out our blog post on this specific example.

5. Mistral Instruction Tuning

Mistral 7B is a large language model (LLM) with 7.3 billion parameters and is one of the most powerful models for its size. However, this base model is not instruction-tuned, meaning it may struggle to follow instructions and perform specific tasks. Fine-tuning Mistral 7B on the Alpaca dataset using torchtune significantly improves its ability to perform tasks such as conversation and answering questions accurately. Due to the computational demands of fine-tuning a model, a GPU is required.

Getting Started with Datalayer

Whether you're a seasoned data scientist, an AI enthusiast, or a beginner looking to explore new technologies, our Examples GitHub repository is a great starting point. Paired with our Remote Kernels solution, you'll be able to perform cutting-edge data science analysis at scale, without worrying about hardware limitations.

Here's how you can get started:

Explore the Public Repository: Visit our Examples GitHub repository to access a variety of Jupyter Notebook examples.

Leverage Remote Kernels: Join the Datalayer Beta and start using Remote Kernels to scale your Jupyter Notebooks. Say goodbye to resource constraints and unlock the power of cloud computing for your data science needs.


Black Snake Release

· 4 min read
Eric Charles
Datalayer CEO/Founder

We're excited to announce the release of Datalayer 1.2.0 — Black Snake — a major release packed with new features and enhancements that significantly improve performance and user experience.

The Datalayer Black Snake release is named as a tribute to the Python open-source community, as well as to companies like Anaconda and Quansight, for their support in driving innovation in the open.

New Features

Cell-Specific Kernels: Execute specific cells with different kernels, optimizing costs by leveraging local resources for data preparation and remote resources for intensive computations.

[Screenshot: Cell Kernel execution]

The remote GPU Kernel is utilized only for the duration of the cell computation, minimizing costs.

CLI Execution: Execute code remotely from your local terminal.

[Screenshots: CLI remote execution, sharing state between a notebook and the CLI, remote notebook execution]

When using the same Kernel, variables defined in a notebook can be used in the CLI and vice versa. The same holds when multiple notebooks are connected to the same kernel.

User Storage: Users can now persist data across kernel sessions. Read more about it in this blog post.

Bug Fixes and Stability Improvements

Improved Kernel Stability: Addressed several kernel stability issues that users encountered when running long-running processes.

Resolved Environment Variable Conflicts: Fixed issues related to environment variable management in GPU-accelerated environments, ensuring smoother integration with external services and data sources.

Security Enhancements: Continued improvements in secret management and encryption to ensure safe data handling when accessing external data sources and services.

info

Local storage mount was deprecated in this release. We plan to reintroduce it in the next release with improved security and performance.

How to Get Started

Existing Users: Existing users can update their environments to Datalayer 1.2.0.

pip install datalayer --upgrade

New Users: New users can ask for an invitation to the beta and get started with Datalayer by following the documentation.

Upcoming Features

Storage Management: Enhanced storage management capabilities will be introduced in the next release, allowing users to manage their data more efficiently.

Expanded Data Source Support: More integrations with popular data sources will soon be available, further simplifying cloud data access.

User Environment: Users will be able to create their custom environments, allowing them to install specific packages and libraries.

Collaboration: Stay tuned for collaborative features coming in the next release, allowing multiple users to work together with the same kernel.


Persistent Storage and Datasets

· 5 min read
Eléonore Charles
Product Manager
Frédéric Collonval
Chief Technical Officer (ad-interim)

When working with Remote Kernels, one of the key pain points for users has been the lack of persistent data storage. Previously, every time you initiated a new kernel session, you would lose access to your previous data, forcing you to download datasets repeatedly for each new session. This not only wasted valuable time but also made the workflow cumbersome.

Persistent User Storage

The good news? We've introduced a solution that completely eliminates this problem! Now, you can persist data on the kernel side, meaning your data is saved even when your kernel is terminated. No more re-downloading files for every new kernel – your data is always available, just like it would be in your home folder on your own machine.

This is a massive time-saver and enhances productivity, allowing you to focus on what really matters: building models, analyzing data, and running experiments without constantly managing your data files.

A Smoother User Experience

So, what does this look like in practice? When launching a new kernel, you have the option to enable persistent storage. Once enabled, the system automatically mounts persistent storage at the persistent directory. This directory is accessible across kernel sessions, ensuring that your data remains intact and available anytime you need it.

While enabling persistent storage slightly increases the kernel start time, the convenience of having your data ready across sessions far outweighs this.
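Concretely, anything written under the persistent directory survives kernel restarts; the file name below is just an example:

import pandas as pd

# Write results to the mounted persistent directory...
df = pd.DataFrame({"experiment": ["a", "b"], "score": [0.91, 0.87]})
df.to_csv("persistent/results.csv", index=False)

# ...and read them back in any later kernel session
df = pd.read_csv("persistent/results.csv")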

note

If you are using Datalayer with JupyterLab or the CLI, you can upgrade the extension to get this feature using the following command: pip install datalayer --upgrade

How Does It Work Under the Hood?

Building reliable, persistent storage for cloud environments requires robust infrastructure. To achieve this, we've implemented the Ceph storage solution. Ceph is a highly scalable and reliable storage system commonly used by top cloud providers like OVHcloud. It is designed to handle large volumes of data while ensuring high availability, redundancy, and data protection.

To learn more about how we have implemented a Ceph storage in our platform, check out our technical documentation: Ceph Service and User Persistent Storage.

Pre-Loaded Datasets

In addition to persistent storage, we've introduced a dedicated directory, data, where you can access a collection of pre-loaded datasets. This feature allows you to jump straight into your analysis without needing to upload your own data, making it easier and faster to get started.

The directory is set to read-only, so while you won't be able to write directly to it, you can effortlessly copy datasets over to your persistent storage for further modification. You'll find a range of popular datasets in the datalayer-curated subdirectory, including the classic Iris dataset, the Titanic dataset, and many more.

Several Amazon Open Data datasets are also available in the aws-opendata subdirectory, providing a wealth of data for your analysis.
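Since the data directory is read-only, a typical first step is to copy a dataset into your persistent storage before modifying it; the exact file name below is illustrative:

import shutil

import pandas as pd

# Copy a curated dataset from the read-only directory to persistent storage
shutil.copy("data/datalayer-curated/iris.csv", "persistent/iris.csv")

# Work on the writable copy
df = pd.read_csv("persistent/iris.csv")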

What's Next?

We're not stopping here. There are several exciting enhancements on our roadmap, designed to further improve your experience:

  • Expanded Storage Capabilities: We plan to increase storage limits, allowing you to store even more data.
  • Storage Browsing: A new feature that will allow you to browse your kernel's content directly within JupyterLab and the Datalayer platform.
  • Storage Management: You'll soon be able to view and manage your storage directly from JupyterLab and the Datalayer platform (delete, move, and rename files with a user interface instead of terminal commands).
  • Sharing Content Between Users: We are working on a feature that will enable you to share persistent data with other users, facilitating collaboration on projects.

Stay tuned for these upcoming features, as they will further enhance your ability to analyse data efficiently with Remote Kernels!

Refer to our documentation for more information on how to get started with persistent storage and pre-loaded datasets.


Datalayer Joins NVIDIA Inception

· 3 min read
Eléonore Charles
Product Manager

Datalayer has joined NVIDIA Inception, a program designed to nurture startups that are revolutionizing industries with technological advancements.

At Datalayer, we are focused on providing seamless access to powerful Remote Kernels for data scientists, AI engineers, and machine learning practitioners. Our mission is to simplify workflows and boost productivity by allowing users to leverage GPUs and CPUs without altering their existing code or preferred tools.

Joining NVIDIA Inception will accelerate our development by providing access to industry-leading resources such as go-to-market support, technical assistance and training. This will help us enhance our solutions and collaborate with a network of AI-driven organizations and experts, driving growth during critical stages of product development and enabling us to better serve our users.

Before joining, we were already big fans and users of NVIDIA GPU technology, running the NVIDIA GPU Operator on Kubernetes as documented on the Datalayer Tech GPU CUDA page. We have been supporting Time-Slicing and MIG to help optimize costs for our users. We are eager to collaborate with NVIDIA experts to further reduce expenses while enhancing security through sandboxed solutions such as KubeVirt and Kata Containers.

Stay tuned as we continue to develop innovative solutions, now with the support of the NVIDIA Inception Program. We are excited to share our progress with you in the coming months! In the meantime, you can already experiment with NVIDIA GPUs on Datalayer.


Datalayer Private Beta

· 4 min read
Eléonore Charles
Product Manager

We are super excited to announce that Datalayer is entering Private Beta! After months of development, today we are inviting those who signed up on our waiting list to experience our solution first-hand.

How to Join the Beta?

If you registered on our waiting list, keep an eye on your inbox: invitations are being sent out now! We're thrilled to have you onboard as part of this exclusive group, helping us shape the future of Datalayer.

But don't worry if you haven't signed up yet—there are still limited spots available. Simply register on the waiting list to secure your spot in the private beta.

Why Join the Beta?

This is your opportunity to get early access to the cutting-edge features of Datalayer, and we need your help to make it even better. Your experience and feedback will be invaluable in helping us fine-tune the product, optimize performance, and add features that truly meet your needs. It would be great to have you on board and we can't wait to hear your thoughts!

As a beta user, you'll enjoy:

  • Free credits to try out Remote Kernels.
  • Direct support from our team to ensure a smooth experience.
  • The opportunity to directly influence the future development of Datalayer through your feedback.

What Can Datalayer Bring You?

Datalayer simplifies access to powerful computing resources (GPU or CPU) for data scientists and AI engineers. Whether you're training models or running large-scale simulations, you can seamlessly scale your workflows without changing your code or setup.

Key Benefits

  • Effortless Remote Kernel Access: Seamlessly connect to high-performance Remote Kernels from JupyterLab, VS Code, or via the CLI. Switch kernels with just a few clicks to run your code on powerful machines, without altering your workflow or setup.
  • Flexible and Simple Setup: Avoid the complexity of configuration changes or workflow disruption. Launch Remote Kernels effortlessly and scale your data science or AI workflows with ease, whether you're working on notebooks or scripts.
  • Optimized Resource Usage: Gain control over resource allocation by running specific notebook cells on Remote Kernels only when needed. This precision helps minimize resource consumption and maximize efficiency.
  • Flexible Credits-Based Model: Enjoy a pay-as-you-go credits system that adapts to your needs. With transparent usage tracking and detailed reports, you'll only pay for the resources you use, making it a cost-effective solution for scaling your projects.

Learn more about Datalayer's features in our user documentation and online SaaS.


Datalayer Achieves ISO 27001 Certification!

· 5 min read
Eric Charles
Datalayer CEO/Founder

We are thrilled to announce that Datalayer, a SaaS platform for data analysis, has officially been awarded ISO 27001 certification, a significant milestone in our commitment to ensuring the highest levels of information security and data protection for our customers.

What is ISO 27001 and Why Does It Matter?

ISO 27001 is an internationally recognized standard for Information Security Management Systems (ISMS). It outlines a rigorous framework of policies, procedures, and controls designed to protect sensitive information from threats, including cyber-attacks, data breaches, and unauthorized access.

Achieving this certification demonstrates that Datalayer has implemented best-in-class security practices, ensuring that your data is handled with the utmost care, integrity, and confidentiality.

What This Means for Our Customers

Increased Trust and Assurance: ISO 27001 certification is a strong indicator that Datalayer adheres to stringent security standards. You can have peace of mind knowing that we are proactively managing and safeguarding your data at every step.

Compliance with Global Standards: For businesses handling sensitive data, compliance is critical. ISO 27001 is widely accepted across industries and geographies, meaning that using Datalayer helps support your own regulatory requirements in terms of data security.

Ongoing Risk Management: Security is not a one-time achievement but a continuous process. Our certification guarantees that we have a robust ISMS in place, which includes regular risk assessments, continuous monitoring, and periodic audits. This helps us identify and mitigate potential threats before they impact your operations.

Commitment to Continuous Improvement: Achieving ISO 27001 certification is just the beginning. We are dedicated to maintaining and enhancing our security practices to meet evolving challenges. Our team will continue to invest in security training, updates, and technologies to stay ahead in an increasingly complex threat landscape. You can follow our progress on our Trust Portal.

The Road to Certification

Obtaining ISO 27001 certification is no small feat. It required a deep review of our internal processes and systems, comprehensive staff training, and a full assessment of how we protect and manage customer data. This certification, granted by an independent and accredited body, confirms that Datalayer has established, implemented, and will maintain an effective Information Security Management System.

To support us on this journey, we have partnered with Vanta, a tool to automate compliance, manage risk, and prove trust continuously, as well as with Sensiba LLP, an external and independent auditor.

Looking Ahead

As the data landscape continues to grow and evolve, so do the risks. Achieving ISO 27001 certification is a testament to our proactive approach to information security.

We are proud of this achievement, but our commitment doesn't stop here. We will continue to work tirelessly to ensure that Datalayer remains a trusted and secure partner for your data analysis needs, while also working towards SOC 2 and ISO 42001, a certification specifically tailored for Artificial Intelligence (AI) use cases. Stay tuned to learn more.

Thank You to Our Team and Customers

We want to extend a huge thank you to our incredible team for their dedication and hard work throughout this process. Additionally, we want to thank our customers for their trust and continued support. Your data security is, and will always be, our top priority.

To learn more about what this certification means for your business, or if you have any questions about our data security practices, feel free to reach out to us.

Stay secure, stay innovative - with Datalayer.

About Datalayer

Datalayer is a leading SaaS platform that empowers businesses to perform robust data analysis, transform raw data into actionable insights, and make informed decisions. With our newly acquired ISO 27001 certification, we are further committed to delivering top-tier data security along with our world-class services.


GPU Acceleration for Jupyter Cells

· 7 min read
Eléonore Charles
Product Manager

In the realm of AI, data science, and machine learning, Jupyter Notebooks are highly valued for their interactive capabilities, enabling users to develop with immediate feedback and iterative experimentation.

However, as models grow in complexity and datasets expand, the need for powerful computational resources becomes critical. Traditional setups often require significant adjustments or sacrifices, such as migrating code to different platforms or dealing with cumbersome configurations to access GPUs. Additionally, often only a small portion of the code requires GPU acceleration, while the rest can run efficiently on local resources.

What if you could selectively run resource-intensive cells on powerful remote GPUs while keeping the rest of your workflow local? That's exactly what the Datalayer Cell Kernels feature enables. Datalayer works as an extension of the Jupyter ecosystem, so this innovative approach lets you optimize costs without disrupting your established processes.

We're excited to show you how it works.

The Power of Selective Remote Execution

Datalayer Cell Kernels introduce a game-changing capability: the ability to run specific cells on remote GPUs while keeping the rest of your notebook local. This selective approach offers several advantages:

  1. Cost Optimization: Only use expensive GPU resources when absolutely necessary.
  2. Performance Boost: Accelerate computationally intensive tasks without slowing down your entire workflow.
  3. Flexibility: Seamlessly switch between local and remote execution as needed.

Let's dive into a practical example to see how this works. We'll demonstrate this hybrid approach using a sentiment analysis task with Google's Gemma-2 model.

Create the LLM Prompt

We start by creating our prompt locally. This part of the notebook runs on your local machine:

prompt = """
Analyze the following customer reviews and provide a structured JSON response for each review. Each response should contain:

- "review_id": A unique identifier for each review.
- "themes": A dictionary where each key is a theme or topic mentioned in the review, and each value is the sentiment associated with that theme (positive, negative, or neutral).

Format your response as a JSON array where each element is a JSON object corresponding to one review. Ensure that the JSON structure is clear and easily parseable.

Customer Reviews:

1. "I love the smartphone's performance and speed, but the battery drains quickly."
2. "The smartphone's camera quality is top-notch, but the battery life could be better."
3. "The display on this smartphone is vibrant and clear, but the battery doesn't last as long as I'd like."
4. "The customer support was helpful when my smartphone had issues with the battery draining quickly. The camera is ok, not good nor bad."

Respond in this format:
[
  {
    "review_id": "1",
    "themes": {
      "...": "...",
      ...
    }
  },
  ...
]
"""

Analyse Topics and Sentiment on Remote GPU

Now, here's where we leverage the remote GPU. This cell contains the code to perform sentiment analysis using the Gemma-2 model and the Hugging Face Transformers library. We'll switch to the Remote Kernel for just this cell:

import os

import torch
from huggingface_hub import login
from transformers import AutoTokenizer, AutoModelForCausalLM

# Login to Hugging Face with a token stored as an environment variable
login(token=os.environ["HF_TOKEN"])

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Prepare the prompt
chat = [{"role": "user", "content": prompt}]

# Generate the prompt and perform inference
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=2000)

# Decode the response, excluding the input prompt from the output
prompt_length = inputs.shape[1]
response = tokenizer.decode(outputs[0][prompt_length:])

By executing only this cell remotely, we're optimizing our use of GPU resources. This targeted approach allows us to tap into powerful computing capabilities precisely when we need them, without the overhead of running our entire notebook on a remote machine.

To execute this cell on a remote GPU, you just have to select the remote environment for this cell.

This is done with just a few clicks, as shown below:

With a simple selection from the cell dropdown, you can seamlessly transition from local to remote execution.

info

Using a Tesla V100S-PCIE-32GB GPU, the sentiment analysis task completes in about 10 seconds on average, at roughly 19 tokens per second.

The model was pre-downloaded in the remote environment. This was done to eliminate download time. Datalayer lets you customize your computing environment to match your exact needs. Choose your hardware specifications and install the libraries and models you require.

Datalayer Cell Kernels allow you to manage variable transfers between your local and remote environments. You can easily configure which variables should be passed from your local setup to the Remote Kernel and vice versa, as illustrated below:

This ensures that your remote computations have access to the data they need and that your local environment can utilize the results of remote processing.

info

Variable transfers are currently limited in practice to 7 MB of data. This limit is expected to increase in the future, and the option to add data to the remote environment will also be introduced.

To help you monitor and optimize your resource usage, Datalayer provides a clear and intuitive interface for viewing Remote Kernel usage.

Process and Visualize Results Locally

We switch back to local execution for processing and visualizing the results. This is the processed list of themes and sentiments extracted from the reviews by the Gemma-2 model:

[
  {
    'review_id': '1',
    'themes': {'performance': 'positive', 'speed': 'positive', 'battery': 'negative'}
  },
  {
    'review_id': '2',
    'themes': {'camera': 'positive', 'battery': 'negative'}
  },
  {
    'review_id': '3',
    'themes': {'display': 'positive', 'battery': 'negative'}
  },
  {
    'review_id': '4',
    'themes': {'customer support': 'positive', 'camera': 'neutral', 'battery': 'negative'}
  }
]

And below is a visualization of the theme and sentiment distribution across the reviews:
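As a rough sketch of how such a chart can be produced locally (assuming results holds the parsed list shown above; the plotting choices are ours, not necessarily the notebook's):

import pandas as pd
import matplotlib.pyplot as plt

# Flatten the parsed results into one (theme, sentiment) row per mention
rows = [
    {"theme": theme, "sentiment": sentiment}
    for review in results
    for theme, sentiment in review["themes"].items()
]
df = pd.DataFrame(rows)

# Count sentiment occurrences per theme and plot them as a stacked bar chart
counts = df.groupby(["theme", "sentiment"]).size().unstack(fill_value=0)
counts.plot(kind="bar", stacked=True)
plt.ylabel("Number of reviews")
plt.title("Theme and sentiment distribution")
plt.tight_layout()
plt.show()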

Key Takeaways

Datalayer Cell Kernels allow you to selectively run specific cells on remote GPUs. This hybrid approach optimizes both performance and cost by using remote resources only when necessary. Complex tasks like sentiment analysis with large language models become more accessible and efficient.

Check out the full notebook example, sign up on the Datalayer waiting list today, and be among the first to experience the future of hybrid Jupyter workflows!
