Governing ML Lifecycle metadata using Collibra

The video in this post shows how we used Collibra's metadata management and governance capabilities to manage a complete ML lifecycle. As metadata piles up in a data science organization around modeling, experimentation, and data engineering, providing governance and intelligence capabilities is essential for long-term management and oversight.

Adding role-based permissions and using Collibra's workflow capabilities for approvals and promotions makes it a complete solution.

CCPA (California Consumer Privacy Act) Explained

What is CCPA?

The California Consumer Privacy Act is consumer privacy legislation signed into California law on June 28, 2018. The bill, also known as “AB 375,” has been described by some as the “GDPR of the US.” It is one of the strongest privacy laws enacted by any state to date, giving consumers more power over their private data.

It’s just a matter of time before other states follow suit in the coming years; companies across the U.S. that take proactive steps today to better protect consumer data will be best equipped to ride the waves of change.

Is your business impacted by CCPA?

These are the three key criteria in the law that determine whether a business is impacted by CCPA:

  • It is a for-profit entity that does business in California and collects personal information of consumers.
  • It has annual gross revenues in excess of twenty-five million dollars ($25,000,000).
  • It derives 50 percent or more of its annual revenues from selling consumers’ personal information.

What is the scope of ‘Personal Information’?

An important term loosely defined in the bill is “personal information.” According to AB 375, “The bill…would define ‘personal information’ with reference to a broad list of characteristics and behaviors, personal and commercial, as well as inferences drawn from this information.”

Dozens and perhaps hundreds of specific data items are mentioned in the legislation, including:

  • Biometric data
  • Household purchase data
  • Family information (e.g., how many children)
  • Geolocation
  • Financial information
  • Sleep habits

What are the rights of a California consumer?

  • General Disclosure: If a business (as defined by the bill) collects any type of personal information, this should be disclosed in a clear privacy policy available on the website of the business.
  • Information Requests: Should a consumer desire to know what data is being collected, the company is required to provide such information — specifically about the individual. Some of the requests that can be made include:
    • The categories of personal information collected
    • Specific data collected about the individual
    • Methods used to collect the data
    • A business’ purpose for collecting the information
    • Third parties to which personal information may be shared
  • Deletion: Upon request, the business must delete the consumer’s personal information (with certain exceptions).
  • Opt Out: The consumer has the right to have the business stop disclosing, sharing, or selling their personal information to any third party.
  • Same Service: Regardless of a consumer’s request and preferences about how their personal information is handled, businesses are required to provide “equal service and pricing…even if they [consumers] exercise their privacy rights under the Act.”

How to comply with CCPA?

There are many steps a business must perform to comply with all facets of the law.

  • Organized Data Collection: Businesses first need to know where all their customer information resides, and should be able to categorize and classify this information based on personal and sensitive data attributes.
  • Clear, Transparent Policies: Consumers can request a report on the types of data collected, data sources, collection methods, and uses for their data. While the data itself needs to be stored in a well-constructed database, many consumer questions can be quickly answered in comprehensive privacy and data collection policies.
  • Knowledge of Specific Provisions: There are clearly outlined requirements within the California Consumer Privacy Act, including:
    • “Provide a clear and conspicuous link on the business’ Internet homepage, titled ‘Do Not Sell My Personal Information,’ to an Internet Web page…”
    • Ensure any individuals who handle consumers’ private data know and understand all pertinent regulations.
  • Ability to honor customer requests: There are many approaches to handling this. The most rudimentary is providing an email address for the customer, but this is a very manual process with the greatest chance of oversight and failure. A web or application-based form that gathers and stores these requests in a database is the most effective approach (see the sketch after this list).
  • Orchestrated workflows: The best-prepared companies have a process to automatically find customer information and deliver it to customers, and, more importantly, to delete customer information on request in a timely and effective manner. This is usually the hardest requirement to meet but the most effective way to honor the mandate.
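
As a minimal sketch of the form-plus-database approach, the snippet below records consumer requests in a SQLite table. The table name, fields, and request types are illustrative assumptions, not anything prescribed by the law.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema for logging CCPA consumer requests.
# Field names and request types are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ccpa_requests (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    email TEXT NOT NULL,
    request_type TEXT NOT NULL CHECK (request_type IN ('disclose', 'delete', 'opt_out')),
    received_at TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'open'
);
"""

def record_request(conn: sqlite3.Connection, email: str, request_type: str) -> int:
    """Store an incoming consumer request and return its row id."""
    cur = conn.execute(
        "INSERT INTO ccpa_requests (email, request_type, received_at) VALUES (?, ?, ?)",
        (email, request_type, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return cur.lastrowid

if __name__ == "__main__":
    conn = sqlite3.connect("ccpa_requests.db")
    conn.executescript(SCHEMA)
    req_id = record_request(conn, "jane@example.com", "delete")
    print(f"Logged request #{req_id}")
```

Tracking a status column alongside each request makes it much easier to demonstrate that every request was honored in a timely manner.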

Conclusion

CCPA is probably the first of many steps that companies will have to be prepared for with respect to consumer data privacy. It is well worth investing in building processes and automation around finding and categorizing customer data.

Having half-baked or manual solutions will create a lot of churn and manual labor at best, and at worst will cause omissions and errors that could cost the company millions of dollars in fines.

It is advisable to work with solution providers who have solved this problem before and have frameworks and solutions that can be easily reproduced.

Costs of Unprotected Data Assets are High – Why all organizations must be ready with a taxonomy of sensitive data

Take a look at some of the recent headlines:

“Data Protection Concerns Upend M&A Plans”

“California Passes Sweeping Data-Privacy Bill”

“Marketers Push Agencies to Shoulder More Liability for Data Breaches …”

“Apple CEO Condemns ‘Data-Industrial Complex’”

Truth be told, these headlines are just a fraction of what is happening out in the real world. Data are everywhere (and growing by the minute). All manner of devices have become smarter, which is to say they have started producing data.

Networks have multiplied, devices have multiplied, and systems have increased in number. With all this happening, it is important for any organization to build a cohesive, well-understood catalog of all its information assets.

Data Ninjas recommend a new mantra to live by: “you can only protect what you know you have!” This is often overlooked by most organizations. The issue is more pervasive in bigger organizations but is frequently ignored by smaller ones. While smaller organizations may face less compliance risk, they are still exposed to general data privacy and protection risk.

Now let’s look at some of the costs of inadequate data protection:

  • McAfee and the Center for Strategic and International Studies (CSIS) estimated the likely annual cost of cybercrime to the global economy at $445 billion, with a range of $375 billion to $575 billion.
  • In 2018 the ITRC (Identity Theft Resource Center) tracked 1,027 breaches through early November, exposing a total of 57.7 million records. The business category continues to be the most affected sector, with 475 breaches, or 46 percent of all breaches detected.
  • The average cost of a data breach globally was $3.86 million in 2018, up 6.4 percent from $3.62 million in 2017, according to a study from IBM and the Ponemon Institute.

Even if you discount some of these statistics, the disruption caused by even a casual breach would be crippling to your business. Most smaller organizations have threadbare staff on hand for day-to-day operations, so a breach event would be truly catastrophic.

So how can you stop these events from happening in your domain? By being proactive. At Data Ninjas, we believe in being prepared from the ground up (and you don’t have to build all your protections at once).

Here is a quick plan for executing a data protection/privacy project iteratively:

  1. Conduct interviews with data custodians and stakeholders to document the data collected across the enterprise
  2. Classify the data elements in the inventory (see the sketch after this list)
  3. Map the flow of PII and other sensitive data across systems
  4. Deploy data loss prevention tools to perform automated discovery and monitoring of sensitive data
  5. Deploy data governance tools to improve processes and build a mature, data-driven organization
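
As a minimal sketch of step 2, the snippet below tags inventory columns against a tiny sensitive-data taxonomy using regular expressions. The patterns, category names, and inventory layout are illustrative assumptions; a real deployment would use a much richer ruleset and sampling strategy.

```python
import re

# Hypothetical mini-taxonomy: category -> pattern that matches sample values.
# Real taxonomies cover far more categories (biometric, geolocation, etc.).
TAXONOMY = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?1?[-. ]?\(?\d{3}\)?[-. ]?\d{3}[-. ]?\d{4}$"),
}

def classify(sample_values):
    """Return the set of sensitive categories detected in a column's sample values."""
    return {
        category
        for value in sample_values
        for category, pattern in TAXONOMY.items()
        if pattern.match(str(value).strip())
    }

# Illustrative inventory: column name -> values sampled from that column.
inventory = {
    "customer.contact": ["jane@example.com", "joe@example.org"],
    "customer.tax_id": ["123-45-6789"],
    "order.total": ["19.99", "250.00"],
}

for column, samples in inventory.items():
    tags = classify(samples) or {"unclassified"}
    print(f"{column}: {', '.join(sorted(tags))}")
```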

Top 8 Tips for picking the right AWS Storage Option

For organizations making a move to the AWS cloud, there are always a lot of questions about which storage solution to use.

As with most services offered by AWS, there are plenty of choices to select from. The selection boils down to a few key criteria:

  1. What do you need the storage for?
  2. How often do you need to access it?
  3. What is the immediacy of access?
  4. And of course – cost!

The main options to choose from as of 2018 are:

  1. S3 – the primary file/object-based storage platform. This is the primary offering from AWS, which we will cover in more detail.
  2. EFS – Elastic File System, a network-attached storage service.
  3. Glacier – used for data archival.
  4. Snowball – a way to transfer data without the need for a network, via physical disk replication.
  5. Storage Gateway – virtual appliances for data replication.

Let’s get into a little more detail on each of them.

S3

If regular flat-file or object-based storage is what you are looking for, then S3 is the right option. This is bucket-based storage with unlimited capacity, where you can store files from 0 bytes to 5 TB in size. Data stored here is secure, durable, and highly scalable. S3 uses simple web service interfaces to store and retrieve unlimited amounts of data from anywhere on the web. It is built to achieve 99.99% availability and 99.999999999% durability.
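
As a minimal sketch of that web service interface, the snippet below stores and retrieves an object with boto3, the AWS SDK for Python. The bucket and key names are hypothetical, and the code assumes AWS credentials are already configured.

```python
import boto3

# Hypothetical bucket and key; the bucket must already exist in your account.
BUCKET = "my-example-bucket"
KEY = "reports/2018/summary.txt"

s3 = boto3.client("s3")

# Store an object in the bucket.
s3.put_object(Bucket=BUCKET, Key=KEY, Body=b"quarterly summary goes here")

# Retrieve it back over the same web service interface.
response = s3.get_object(Bucket=BUCKET, Key=KEY)
print(response["Body"].read().decode("utf-8"))
```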

Within S3, AWS offers different storage tiers (a lifecycle sketch follows the list):

  1. S3 Standard – 99.99% availability and 99.999999999% durability; data is stored redundantly across multiple devices and can sustain the loss of two facilities.
  2. S3 Infrequent Access (IA) – as the name suggests, suitable for files that are not accessed on a regular basis. It has a lower fee for data storage, but users are charged for data retrieval. Data can still be accessed rapidly.
  3. S3 Reduced Redundancy Storage – recently renamed One Zone. Intended for content that can be regenerated if necessary, such as image thumbnails. It is designed for 99.99% durability and 99.99% availability of objects over a given year.
  4. Glacier – mainly meant for archival of data rather than data that is used regularly. This is the cheapest storage option, but restores take around 3-5 hours.
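
As a minimal sketch of putting these tiers to work together, the boto3 snippet below configures a lifecycle rule that transitions objects to Infrequent Access and then to Glacier as they age. The bucket name, prefix, and day thresholds are illustrative assumptions.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix; thresholds are illustrative, not prescribed.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-old-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    # Move to Infrequent Access after 30 days...
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    # ...and archive to Glacier after a year.
                    {"Days": 365, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```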

Elastic Block Store

Not to be confused with the Elastic File System above: Elastic Block Store (EBS) volumes can be thought of as disks in the cloud that you attach to your EC2 instance. These are storage volumes that can be used as a file system or for a database, and in some cases as a root (boot) volume. There are five types of EBS volumes (a provisioning sketch follows the list):

  1. General Purpose SSD (GP2) – the most generic variety, for low-intensity I/O.
  2. Provisioned IOPS SSD (IO1) – designed for I/O-intensive applications such as large RDBMS or NoSQL databases; use it whenever you need more than 10,000 IOPS.
  3. Throughput Optimized HDD (ST1) – the magnetic option for high throughput. Cannot be used as a boot volume. Suitable for big data, data warehousing, log processing, etc.
  4. Cold HDD (SC1) – basically a magnetic file server. The lowest cost for infrequently accessed workloads; also cannot be a boot volume.
  5. Magnetic Standard – the lowest cost per GB that is bootable. For infrequently accessed data where low cost is the priority; ideal for test and development environments.
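
As a minimal sketch of provisioning one of these volume types, the boto3 snippet below creates a gp2 volume and attaches it to an instance. The availability zone, size, instance ID, and device name are all hypothetical placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Hypothetical zone and size; VolumeType selects the tier described above
# ("gp2", "io1", "st1", "sc1", or "standard").
volume = ec2.create_volume(
    AvailabilityZone="us-east-1a",
    Size=100,  # GiB
    VolumeType="gp2",
)

# Wait until the volume is available, then attach it to a
# (hypothetical) running instance as /dev/sdf.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])
ec2.attach_volume(
    VolumeId=volume["VolumeId"],
    InstanceId="i-0123456789abcdef0",
    Device="/dev/sdf",
)
```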

Snowball

Snowballs are appliances that are sent to customers, who use them to transfer large amounts of data by connecting the appliance locally. The devices are then shipped back to AWS, where the data is transferred into the AWS infrastructure. Snowball uses multiple layers of security, including 256-bit encryption, and the appliances are tamper-proof.

There are three flavors of Snowball (a job-creation sketch follows the list):

  1. Regular Snowball – used to transfer up to 80 TB of information per Snowball, at about one-fifth the cost of network-based transfer into AWS. A good solution for moving large amounts of data without using the network, which also makes it more secure.
  2. Snowball Edge – like a datacenter in a box. These appliances come with up to 100 TB of storage and also include compute power. They can act as mobile data centers in places like airplanes, capturing and processing large amounts of information without the need for a network.
  3. Snowmobile – the largest option in the group, storing up to 100 PB per Snowmobile. These are mobile data centers housed in 45-foot shipping containers! They are mainly used for complete data center migrations.
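
As a minimal sketch of how a transfer is initiated programmatically, the boto3 snippet below creates an import job for a regular Snowball. The bucket ARN, address ID, and role ARN are hypothetical placeholders you would obtain from your own account.

```python
import boto3

snowball = boto3.client("snowball")

# All identifiers below are hypothetical placeholders.
job = snowball.create_job(
    JobType="IMPORT",  # ship data into AWS
    Resources={"S3Resources": [{"BucketArn": "arn:aws:s3:::my-example-bucket"}]},
    AddressId="ADID00000000-0000-0000-0000-000000000000",  # from create_address()
    RoleARN="arn:aws:iam::123456789012:role/snowball-import-role",
    SnowballCapacityPreference="T80",  # the 80 TB appliance
    ShippingOption="SECOND_DAY",
)
print(job["JobId"])
```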

Storage Gateway

Storage Gateway connects an on-premises software appliance with cloud-based storage, so your local IT infrastructure is directly connected to AWS storage.

The software appliance can be downloaded as a VM image and installed in the host’s datacenter. It supports VMware ESXi and Microsoft Hyper-V.

There are three types of Storage Gateway (a discovery sketch follows the list):

  1. File Gateways (NFS) – for flat files stored on S3. Once transferred, they can be managed as regular S3 objects with bucket policies, including versioning, lifecycle management, and cross-region replication.
  2. Volume Gateways – virtual hard disks for block storage, best used for database storage such as SQL Server. Volumes can be asynchronously backed up as “point-in-time” snapshots; snapshots are incremental and compressed. They are further subcategorized as stored volumes and cached volumes.
    1. Stored volumes – data is stored on-site, and all your primary data is backed up to Amazon S3 in the form of Amazon Elastic Block Store (EBS) snapshots. Used for low-latency requirements.
    2. Cached volumes – data is stored in S3, and only a limited cache of frequently accessed data stays on-premises, reducing the need for local storage. Can be used to create storage volumes of up to 32 TB.
  3. Tape Gateway – a durable and affordable solution for archiving data to virtual tape in the AWS cloud through a VTL interface. It is supported by NetBackup, Backup Exec, Veeam, and most other backup software on the market today.
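
As a minimal sketch of inspecting deployed gateways programmatically, the boto3 snippet below lists the gateways activated in an account and the volumes behind each volume gateway. It assumes at least one gateway has already been set up.

```python
import boto3

sgw = boto3.client("storagegateway")

# Enumerate every activated gateway in this account/region.
for gateway in sgw.list_gateways()["Gateways"]:
    arn = gateway["GatewayARN"]
    print(gateway.get("GatewayName", arn), "-", gateway.get("GatewayType", "?"))

    # Only volume gateways (stored or cached) expose iSCSI volumes.
    if gateway.get("GatewayType") in ("STORED", "CACHED"):
        for volume in sgw.list_volumes(GatewayARN=arn).get("VolumeInfos", []):
            print("  volume:", volume["VolumeARN"], volume.get("VolumeType", ""))
```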

So, as mentioned earlier, there are lots of options to choose from. Based on your needs for size, throughput, accessibility, scale, and business requirements, you should be able to narrow it down to one of the options presented above.

We hope this helps you choose the right AWS cloud storage solution. We can work with you to further help with your decision-making process.

Please reach out to us at [email protected] for further enquiries.