The Costs of Unprotected Data Assets Are High – Why Every Organization Must Be Ready with a Taxonomy of Sensitive Data

Take a look at some of the recent headlines:

“Data Protection Concerns Upend M&A Plans”

“California Passes Sweeping Data-Privacy Bill”

“Marketers Push Agencies to Shoulder More Liability for Data Breaches …”

“Apple CEO Condemns ‘Data-Industrial Complex’”

Truth be told, these headlines are just a fraction of what is happening out in the real world. Data are everywhere (and growing by the minute). All manner of devices have become smarter, which is another way of saying they have started producing data.

Networks, devices and systems have all multiplied. With all this happening, it is important for every organization to build a cohesive, well-understood catalog of all of its information assets.

Data Ninjas recommend a new mantra to live by: “you can only protect what you know you have!” This is often overlooked. The issue is more pervasive in bigger organizations, but smaller organizations tend to ignore it altogether. And while smaller organizations may face less regulatory compliance risk, they are just as exposed to general data privacy and protection risk.

Now let’s look at some of the costs of lack of adequate data protections:

  • McAfee and the Center for Strategic and International Studies (CSIS) estimated the likely annual cost of cybercrime to the global economy at $445 billion, within a range of $375 billion to $575 billion.
  • In 2018 the ITRC tracked 1,027 breaches through early November, with a total of 57.7 million records exposed. The business category continues to be the most affected sector, with 475 breaches, or 46 percent of all breaches detected.
  • The average cost of a data breach globally was $3.86 million in 2018, up 6.4 percent from $3.62 million in 2017, according to a study from IBM and the Ponemon Institute.

Even if you discount some of these statistics, the disruption caused by even a single breach could cripple your business. Most smaller organizations run with skeleton staff for day-to-day operations, so a breach event would be truly catastrophic.

So how can you stop these events from happening in your domain? By being proactive. At Data Ninjas, we believe in being prepared from the ground up (and you don’t have to build all your protections at once).

Here is a quick plan for executing a data protection/privacy project iteratively:

  1. Conduct interviews with data custodians and stakeholders to document data collected across the enterprise
  2. Classify the data elements in the inventory (see the sketch after this list)
  3. Map flow of PII or sensitive data across systems
  4. Deploy data loss prevention (DLP) tools to perform automated discovery and monitoring of sensitive data
  5. Deploy data governance tools to improve processes and build a mature, data-driven organization.
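
To make steps 2 and 4 concrete, here is a minimal sketch of what a classified data-element inventory might look like, together with a simple pass that flags elements for DLP monitoring. All element names, systems and categories are hypothetical examples rather than a prescribed schema.

```python
# Minimal sketch of a classified data-element inventory.
# All element names, systems, and categories are hypothetical examples.

from dataclasses import dataclass

@dataclass
class DataElement:
    name: str      # e.g. "customer_email"
    system: str    # system of record where the element lives
    category: str  # "public", "internal", "confidential", or "pii"

inventory = [
    DataElement("customer_email", "CRM", "pii"),
    DataElement("order_total", "ERP", "internal"),
    DataElement("press_release_text", "CMS", "public"),
    DataElement("ssn", "HR portal", "pii"),
]

def elements_needing_dlp(elements):
    """Return the elements that should be prioritized for DLP monitoring."""
    return [e for e in elements if e.category in ("pii", "confidential")]

for element in elements_needing_dlp(inventory):
    print(f"Monitor {element.name} in {element.system}")
```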

Tips for Picking the Right AWS Storage Option

For organizations moving to the AWS cloud, there are always a lot of questions about which storage solution to use.

As with most services offered by AWS, there are plenty of choices. The selection boils down to a few criteria:

  1. What do you need the storage for?
  2. How often do you need to access it?
  3. How quickly do you need to be able to access it?
  4. And, of course – cost!

The main options to choose from as of 2018 are:

  1. S3 – the primary file/object-based storage platform and AWS’s flagship storage offering, which we will cover in more detail.
  2. EFS and EBS – the Elastic File System (network-attached storage) and the Elastic Block Store (block storage volumes for EC2 instances).
  3. Glacier – used for data archival  
  4. Snowball – a way to move data without the need for a network; physical disk shipment/replication
  5. Storage Gateway – virtual appliances for data replication

Let’s get into a little more detail on each of them…

S3

If regular flat-file or object-based storage is what you are looking for, then S3 is the right option. It is bucket-based storage with unlimited total capacity, where you can store objects from 0 bytes to 5 TB in size. Data stored here is secure, durable and highly scalable. S3 uses simple web service interfaces to store and retrieve any amount of data from anywhere on the web. It is built to achieve 99.99% availability and 99.999999999% (eleven nines) durability.
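
To make the “simple web service interface” concrete, here is a minimal boto3 sketch that stores and retrieves an object. The bucket and key names are placeholders, and credentials are assumed to come from your normal AWS configuration.

```python
# Minimal S3 store/retrieve sketch using boto3.
# Bucket and key names are placeholders; credentials come from your AWS config.

import boto3

s3 = boto3.client("s3")

# Store an object in a bucket
s3.put_object(Bucket="my-example-bucket", Key="reports/2018/summary.txt",
              Body=b"hello from S3")

# Retrieve it again
response = s3.get_object(Bucket="my-example-bucket", Key="reports/2018/summary.txt")
print(response["Body"].read())
```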

Within S3, AWS offers several storage tiers (the sketch after this list shows how a tier is selected when you upload an object):

  1. S3 Standard – 99.99% availability and 99.999999999% durability; data is stored redundantly across multiple devices in multiple facilities and can sustain the loss of two facilities
  2. S3 Infrequent Access (Standard-IA) – as the name suggests, this tier is suitable for files that are not accessed on a regular basis. It has a lower storage fee, but users are charged for data retrieval. Data can still be accessed rapidly when it is needed
  3. S3 Reduced Redundancy Storage (recently superseded by the One Zone-IA tier) – for content that can be regenerated if necessary, such as image thumbnails. This tier offers lower guarantees (Reduced Redundancy provides 99.99% durability, and One Zone-IA keeps data in a single Availability Zone), so it should only hold reproducible data
  4. Glacier – meant for data archival rather than data that is used regularly. This is the cheapest storage option, but restoring data typically takes around 3–5 hours
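
In practice the tier is just a parameter on the upload, and lifecycle rules can move objects into Glacier later. Here is a minimal boto3 sketch, with placeholder bucket and key names:

```python
# Choosing an S3 storage tier at upload time, plus a lifecycle rule that
# archives older objects to Glacier. Bucket and key names are placeholders.

import boto3

s3 = boto3.client("s3")

# Infrequently accessed but important data: Standard-IA
s3.put_object(Bucket="my-example-bucket", Key="backups/db-dump.gz",
              Body=b"...", StorageClass="STANDARD_IA")

# Reproducible data (e.g. thumbnails): One Zone-IA
s3.put_object(Bucket="my-example-bucket", Key="thumbnails/img001.jpg",
              Body=b"...", StorageClass="ONEZONE_IA")

# Archive objects under logs/ to Glacier after 90 days
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-old-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)
```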

Elastic Block Store (EBS)

Despite the similar name, this is different from EFS (the network-attached file system mentioned above): EBS volumes are block devices, essentially disks in the cloud that you attach to your EC2 instances. These storage volumes can be used as a file system or for a database, and in some cases as a root (boot) volume. There are five EBS volume types (a small provisioning sketch follows the list):

  1. General Purpose SSD (GP2) – the most general-purpose variety, suited for low-intensity I/O.
  2. Provisioned IOPS SSD (IO1) – designed for I/O-intensive applications such as large RDBMS or NoSQL databases, whenever you need more than 10,000 IOPS.
  3. Throughput Optimized HDD (ST1) – the magnetic option for high-throughput workloads such as big data, data warehousing and log processing. Cannot be used as a boot volume.
  4. Cold HDD (SC1) – essentially a magnetic file-server volume. It is the lowest-cost option for infrequently accessed workloads and also cannot be used as a boot volume.
  5. Magnetic (Standard) – the lowest cost per GB that is bootable, for infrequently accessed data where low cost is the priority. Ideal for test and development environments.
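
The volume type is simply a parameter when the volume is provisioned. Here is a minimal boto3 sketch; the Availability Zone, sizes, instance ID and device name are all placeholders:

```python
# Provisioning EBS volumes of different types with boto3.
# Availability Zone, sizes, instance ID, and device name are placeholders.

import boto3

ec2 = boto3.client("ec2")

# A 100 GiB general purpose SSD volume
gp2_volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=100,
                               VolumeType="gp2")

# A throughput-optimized HDD volume for log processing
st1_volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=500,
                               VolumeType="st1")

# A provisioned IOPS volume for an I/O-intensive database
io1_volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=400,
                               VolumeType="io1", Iops=12000)

# Wait for the general purpose volume, then attach it to an EC2 instance
ec2.get_waiter("volume_available").wait(VolumeIds=[gp2_volume["VolumeId"]])
ec2.attach_volume(VolumeId=gp2_volume["VolumeId"],
                  InstanceId="i-0123456789abcdef0", Device="/dev/sdf")
```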

Snowball

Snowballs are appliances that are shipped to customers, who connect them locally and transfer large amounts of data onto the appliance. The devices are then shipped back to AWS, where the data is loaded into the AWS infrastructure. Snowball uses multiple layers of security, including 256-bit encryption, and the appliances are tamper-resistant.

There are three flavors of Snowball:

  1. Regular Snowball – used to transfer up to 80 TB of information per Snowball, at roughly one-fifth the cost of moving the same data over a high-speed network. This is a good way to move large amounts of data without using the network at all, which also makes it more secure.
  2. Snowball Edge – a data center in a box. These appliances come with up to 100 TB of storage and also include compute power. They can act as mobile mini data centers in places like airplanes, capturing and processing large amounts of information without the need for a network.
  3. Snowmobile – the largest option in this group, holding up to 100 PB per Snowmobile. These are essentially mobile data centers that arrive in 45-foot shipping containers! They are mainly used for complete data center migrations.

Storage Gateway

Storage Gateway connects an on-premises software appliance with cloud-based storage, so your local IT infrastructure is directly connected to AWS storage.

The software appliance can be downloaded as a VM image and installed in the host’s datacenter. It supports VMware ESXi and Microsoft Hyper-V.

There are three types of Storage Gateways:

  1. File Gateways (NFS) – for flat files stored in S3. Once transferred, the files can be managed as regular S3 objects with bucket features such as versioning, lifecycle management and cross-region replication (see the sketch after this list)
  2. Volume Gateways – virtual hard disks for block storage, best used for database storage such as SQL Server. Volumes can be asynchronously backed up as point-in-time snapshots, and the snapshots are incremental and compressed. Volume Gateways are further sub-categorized into Stored Volumes and Cached Volumes:
    1. Stored Volumes – data is stored on site, and all of your primary data is asynchronously backed up to Amazon S3 in the form of Amazon Elastic Block Store (EBS) snapshots. This is the option for low-latency access to the entire dataset.
    2. Cached Volumes – data is stored in S3, and only a limited set of frequently accessed (cached) data stays on premises. This reduces the need for local storage and supports storage volumes of up to 32 TB each.
  3. Tape Gateway – a durable and affordable solution for archiving data to virtual tapes in the AWS cloud using a VTL interface. It is supported by NetBackup, Backup Exec, Veeam and most other backup software in use in the market today.
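
As an example of the File Gateway point above: because the gateway stores your files as ordinary S3 objects, the usual bucket features apply directly. Here is a minimal boto3 sketch that enables versioning on the bucket backing a file share (the bucket name is a placeholder; the gateway itself is configured separately):

```python
# Enabling versioning on the S3 bucket that backs a File Gateway share.
# The bucket name is a placeholder; the gateway itself is configured separately.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_versioning(
    Bucket="my-file-gateway-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# The gateway's files now show up as versioned S3 objects
versions = s3.list_object_versions(Bucket="my-file-gateway-bucket", Prefix="shares/")
for v in versions.get("Versions", []):
    print(v["Key"], v["VersionId"])
```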

So, as mentioned earlier, there are lots of options to choose from. Based on your needs around size, throughput, accessibility, scale and business requirements, you should be able to narrow the field down to one of the options presented above.

Hopefully this helps you choose the right AWS cloud storage solution. We can work with you to further assist your decision-making process.

Please reach out to us at [email protected] for further enquiries.  

Data Science for Everyone (Part 3)

Now that we have reviewed the basic categories of data-driven decision making and discussed how data science requires its own kind of data processing, we are ready to jump into a small subset of data mining techniques that are foundational to the data science process.

Following are brief descriptions of the core data mining techniques (a small illustrative sketch follows the list):

  • Regression or Estimation: generally used to predict the numeric value of a variable (such as the readmission probability for a patient). This technique is most useful when you need a single, reliable estimate of a value.
  • Similarity matching: often used to match an individual or group with another individual or group, given a finite set of measurable attributes. Organizations frequently use this to identify customer groups or peer groups.
  • Classification: useful when you are attempting to segment or categorize a population of candidates/things into known classes. Often used by marketers for positioning and targeting of segments.
  • Clustering: used to identify “natural” groups in the data. The fundamental difference from similarity matching is that similarity matching is done for a specific purpose, while clustering looks for groupings without a predefined target.
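
To make a couple of these concrete, here is a small illustrative sketch using scikit-learn that contrasts regression (predicting a numeric value) with clustering (finding “natural” groups). The data is entirely made up for demonstration.

```python
# Toy illustration of regression vs. clustering using scikit-learn.
# All of the data below is made up purely for demonstration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Regression: predict a numeric value (e.g. a readmission risk score)
# from two made-up patient features: [age, prior visits].
X = np.array([[65, 3], [50, 1], [80, 5], [45, 0], [70, 4]])
y = np.array([0.40, 0.10, 0.75, 0.05, 0.55])  # observed risk scores
reg = LinearRegression().fit(X, y)
print("Predicted risk for a 60-year-old with 2 prior visits:",
      reg.predict([[60, 2]])[0])

# Clustering: find "natural" customer groups from [annual spend, visits/year].
customers = np.array([[500, 2], [520, 3], [80, 20], [90, 25], [95, 22]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print("Cluster assignments:", kmeans.labels_)
```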

There are additional, less commonly used techniques such as co-occurrence grouping, profiling and link prediction, but more on those in the next post.

Stay tuned to find out how you can implement these techniques to quickly make data-based decisions…