Tuesday, February 28, 2023

Key Azure Services List



CPU and Memory

  • Virtual Machines
    • VMs: drives (HDD/SSD), speed, cores
    • Linux VMs
    • Windows VMs
    • VM Images: templated images
    • Scale Sets: grow the number of VM instances based on load
    • Availability Sets: spread VMs across fault and update domains for resilience
    • Managed Disks: virtual disks managed by Azure
  • App Services
    • Run a web app of your choice
    • Flexible deployment: Git, Azure DevOps, FTP
    • Scaling
    • Application healing: if an App Service instance goes down, Azure redeploys the code to another instance
    • Deployment Slots: move from staging to production
  • Containers
    • AKS
    • Azure Container Instances: run individual containers without managing VMs
    • Azure Container Registry
  • Serverless: no VM, no web app to manage
    • Azure Functions: C#, JavaScript, PowerShell; triggered by events such as HTTP requests or messages (see the sketch after this list)
    • Logic Apps: workflows
    • Event Grid: pub/sub; application events and infrastructure events
  • Compute at Scale: massive scale
    • Batch: spin up several VMs to run compute-intensive processes
    • HDInsight: big-data processing capabilities
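
As a concrete example of the serverless model above, here is a minimal sketch of an HTTP-triggered Azure Function using the Python programming model. The function body, the 'name' parameter, and the function.json binding it relies on are illustrative assumptions, not taken from these notes.

  # __init__.py -- a hypothetical HTTP-triggered function (Python v1 model);
  # assumes an accompanying function.json that declares the HTTP trigger binding.
  import logging
  import azure.functions as func

  def main(req: func.HttpRequest) -> func.HttpResponse:
      """Echo a 'name' query parameter back to the caller."""
      logging.info("HTTP trigger received a request.")
      name = req.params.get("name")
      if name:
          return func.HttpResponse(f"Hello, {name}.")
      return func.HttpResponse("Pass a 'name' query parameter.", status_code=400)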

Data Storage

  • Characteristics: Scalable, Available, Global (Region, Country)
  • Self-managed: VM or container; possibly predefined images; you manage compute and disks; patching is your responsibility
  • Service-based: provision an instance; choose scale characteristics; managed by Azure; no patching
  • Relational:
    • Azure SQL: Managed Instance; Elastic pool
    • MySQL
    • MariaDB
    • PostgreSQL
  • Non-Relational
    • Table Storage: key-value storage
    • Blob Storage: files, e.g. PDFs (see the sketch after this list)
    • Queues: short-term data storage
    • Redis Cache: in-memory cache for performance
  • Cosmos DB
    • Self-hosted: MongoDB, Cassandra, Neo4j
    • Azure Cosmos DB: multi-model database; MongoDB-, Cassandra-, and Gremlin-compatible APIs are encapsulated in Azure Cosmos DB.
    • Graph/Gremlin
    • Table
    • Cassandra
    • Globally Distributed
    • Multi-master (multi-region writes)
    • Low Latency
    • Five Consistency Levels
  • Azure Data Lake Storage (Gen2)
    • Large-scale data storage built for analytics
    • Multi-model access: file/blob; analytics can use either file-based or blob-based access to the data
    • Built on Azure Blob Storage
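
To make the Blob Storage bullet above more concrete, here is a minimal sketch of uploading a file with the Python SDK (azure-storage-blob plus azure-identity). The account URL, container name, and blob name are placeholder assumptions.

  from azure.identity import DefaultAzureCredential
  from azure.storage.blob import BlobServiceClient

  # Placeholder account URL, container, and blob names -- replace with your own.
  service = BlobServiceClient(
      "https://<storage-account>.blob.core.windows.net",
      credential=DefaultAzureCredential(),
  )
  blob = service.get_blob_client(container="reports", blob="invoice.pdf")
  with open("invoice.pdf", "rb") as data:
      blob.upload_blob(data, overwrite=True)  # upload the local file as a block blob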

 

Data Processing

  • Event Hubs: ingestion of vast quantities of data (see the sketch after this list)
  • Data Factory: data movement and ETL; like SSIS; across different systems and different clouds
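
A minimal sketch of sending telemetry to Event Hubs with the Python SDK (azure-eventhub), as referenced above; the connection string, hub name, and payload are placeholder assumptions.

  from azure.eventhub import EventHubProducerClient, EventData

  # Placeholder connection string and hub name -- replace with your own.
  producer = EventHubProducerClient.from_connection_string(
      conn_str="<event-hubs-connection-string>",
      eventhub_name="telemetry",
  )
  with producer:
      batch = producer.create_batch()                 # events are sent in batches
      batch.add(EventData('{"deviceId": "sensor-1", "temperature": 21.5}'))
      producer.send_batch(batch)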

 

 

Data analytics

  • SQL Data Warehouse
  • Analysis Services: Visual Studio tooling; for end-user reporting
  • Stream Analytics
    • Real Time data Analysis
    • High volume message Processing
    • Ingest, analyze and output
  • Azure HDInsight
    • Open-source analytics tools: Spark, Hadoop, Hive, Storm, Kafka, HBase
    • For big data
  • Azure Databricks
    • Cloud-optimized Spark service
    • Deep Azure integration: Azure AD, security
  • Cognitive Services
    • Prepackaged machine learning
    • Decision, Speech, Language, Vision, Search (see the sketch after this list)
  • Power BI
    • Render Reports
    • Charts etc
  • Azure Synapse
    • A workspace that brings together the items below
    • SQL Data Warehouse + Databricks
    • All your storage
    • Data Movement (ETL)
    • Machine Learning
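
As an example of the prepackaged machine learning in Cognitive Services, here is a minimal sentiment-analysis sketch using the azure-ai-textanalytics package; the endpoint, key, and sample document are placeholder assumptions.

  from azure.core.credentials import AzureKeyCredential
  from azure.ai.textanalytics import TextAnalyticsClient

  # Placeholder endpoint and key -- replace with your Cognitive Services resource values.
  client = TextAnalyticsClient(
      endpoint="https://<resource-name>.cognitiveservices.azure.com/",
      credential=AzureKeyCredential("<api-key>"),
  )
  docs = ["The dashboard loads quickly and the charts are clear."]
  for result in client.analyze_sentiment(docs):       # one result per input document
      print(result.sentiment, result.confidence_scores)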

 

Integration

  • Connecting systems and applications
    • Within a cloud
    • Cloud to data center
    • Between clouds
    • Cloud to SaaS
  • Messaging and Events
  • Service Bus: brokered or relayed messages (see the sketch after this list)
  • Event Grid: pub/sub
  • API Management: publish, secure, and manage APIs
  • Logic Apps Workflow
    • Orchestrating messaging interactions
    • Control Flow
  • Integration Accounts
    • Enterprise file formats
    • XML/JSON transformation
    • Partner management: certificates, endpoints, etc.
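
A minimal sketch of brokered messaging with Service Bus using the azure-servicebus Python package, as referenced above; the connection string, queue name, and message body are placeholder assumptions.

  from azure.servicebus import ServiceBusClient, ServiceBusMessage

  # Placeholder connection string and queue name -- replace with your own.
  with ServiceBusClient.from_connection_string("<service-bus-connection-string>") as client:
      with client.get_queue_sender(queue_name="orders") as sender:
          # Brokered messaging: the queue holds the message until a receiver picks it up.
          sender.send_messages(ServiceBusMessage('{"orderId": 42, "status": "created"}'))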


 

Network

  • Virtual Network: Define network
    • Public IP Addresses
    • Network Security Groups
    • Service Endpoint Policies
    • Connecting environments
  • ExpressRoute: private connectivity from on-premises to Azure
    • On-premises data gateway
  • CDN
  • Traffic Manager: rules for routing traffic
    • Load Balancer
    • DNS Zones
  • Edge Services:
  • Application gateway
  • Front Door

 

Management and Monitoring

  • Deploy, Restrict access
  • Manage
    • Portal: web-based UI to manage resources
    • CLI
    • Cloud Shell: command-line interface inside the browser
    • Mobile App
  • Backup and site recovery
  • Automation and Scheduling: on demand or on a schedule
  • RBAC: Role based access control.
  • Deploying Azure resources
  • Azure Resource Manager (see the sketch after this list)
    • Define resources in a template
    • Resource groups, locations and services
    • Create relationships between resources
    • Deploy the template with parameters
  • Azure Deployment Manager:
    • Coordinate deployment of ARM templates
    • Define Service Topology
    • Define Rollout steps
  • Monitoring and Alerts
    • Monitor
    • Network watcher
    • Alerts: can be configured to notify you.
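
As a small illustration of working with the Azure Resource Manager management plane from code (see the ARM bullets above), here is a sketch that lists resource groups with the azure-mgmt-resource Python package; the subscription ID is a placeholder assumption.

  from azure.identity import DefaultAzureCredential
  from azure.mgmt.resource import ResourceManagementClient

  # Placeholder subscription ID -- replace with your own.
  client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

  # Every ARM resource lives in a resource group; list them with their locations.
  for group in client.resource_groups.list():
      print(group.name, group.location)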

 

Development Tools

  • REST / Web API
  • Cross-platform access
  • Client SDKs wrap API
  • Developing for Azure
    • SDKs:
  • Developer Tools:
  • Container Development
  • Build and Deploy: ARM templates
  • Azure SDKs:
    • PHP, Python, Node.js, Java, .NET, Ruby
    • Many are cross-platform
    • Azure service coverage varies by SDK.
  • Developer Tools
    • Visual Studio: Logic Apps, Service Fabric; extensions; integrated with Azure
    • Eclipse: plugin
    • Visual Studio Code: JS, Node.js
    • IntelliJ: Azure extensions
  • Container Development
    • Docker (Local development)
    • Cloud Deployment
    • Azure Dev Spaces:
    • On top of AKS
    • Rapid interactive development on Kubernetes
  • Team-focused
    • Build and Deploy
    • Git
    • ARM templates: with parameter files
  • Azure DevOps
    • Azure Boards: Work Items, Bugs
    • Azure Repos: Git, Team Foundation Version Control
    • Azure Pipelines: Build and Release; Tasks
    • Azure Test Plans

   

Identity

  • Management level, application level, secrets
  • Azure Active Directory:
    • Core directory services
    • Multi-factor authentication
    • Directory Synchronization
  • Identity and Directory Services
    • Azure Active Directory
    • Azure AD Domain Services
    • Azure AD B2C: expose sign-in endpoints to customers
  • Application and Identity   
    • Managed Identities: comparable to an application pool identity or service account
    • Application Registrations:
    • Enterprise Applications: third-party software
  • Data Protection Tools
    • Information Protection
    • Key Vault: store secrets (see the sketch after this list)
    • Hardware Security Module:
    • Azure Security Center
      • Monitor VMs and apps:
      • Include VMs from your data center
    • Visualization through Azure Monitor
  • Advice and alerts
  • Additional Security Services
    • Azure Sentinel
    • Azure Defender
    • Role Based Access Control:
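
A minimal sketch of reading a secret from Key Vault with the azure-keyvault-secrets Python package, as referenced above; the vault URL and secret name are placeholder assumptions.

  from azure.identity import DefaultAzureCredential
  from azure.keyvault.secrets import SecretClient

  # Placeholder vault URL and secret name -- replace with your own.
  client = SecretClient(
      vault_url="https://<vault-name>.vault.azure.net",
      credential=DefaultAzureCredential(),
  )
  secret = client.get_secret("db-connection-string")
  print(secret.name)  # the secret's value is in secret.value; avoid printing it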

 

Other Azure Services

  • Media Services: Stream;
  • Mobile Apps: Notification Hubs;
  • IoT: Messaging; Telemetry;
  • Mixed Reality
  • Blockchain
  • Bot Service:
  • Search: Azure Cognitive Search; Bing Search

 







Monday, January 2, 2023

Identity and Access Management (IAM)

Identity and access management (IAM) is a framework of business processes, policies and technologies that facilitates the management of electronic or digital identities. With an IAM framework in place, information technology (IT) managers can control user access to critical information within their organizations.

IAM technologies include single sign-on systems, two-factor authentication, multifactor authentication and privileged access management. These technologies also provide the ability to securely store identity and profile data as well as data governance functions to ensure that only data that is necessary and relevant is shared.
IAM systems can be deployed on premises, provided by a third-party vendor through a cloud-based subscription model or deployed in a hybrid model.

IAM encompasses the following components:

  • How individuals are identified in a system (understand the difference between identity management and authentication);
  • How roles are identified in a system and how they are assigned to individuals;
  • Adding, removing and updating individuals and their roles in a system;
  • Assigning levels of access to individuals or groups of individuals; and
  • Protecting the sensitive data within the system and securing the system itself.

IAM systems should capture and record user login information, manage the enterprise database of user identities, and orchestrate the assignment and removal of access privileges.

Benefits of IAM

  • Access privileges are granted according to policy, and all individuals and services are properly authenticated, authorized and audited.
  • Companies that properly manage identities have greater control of user access, which reduces the risk of internal and external data breaches.
  • Automating IAM systems allows businesses to operate more efficiently by decreasing the effort, time and money that would be required to manually manage access to their networks.
  • In terms of security, the use of an IAM framework can make it easier to enforce policies around user authentication, validation and privileges, and address issues regarding privilege creep.
  • IAM systems help companies better comply with government regulations by allowing them to show corporate information is not being misused. Companies can also demonstrate that any data needed for auditing can be made available on demand.
  • Companies can gain competitive advantages by implementing IAM tools and following related best practices. For example, IAM technologies allow the business to give users outside the organization -- like customers, partners, contractors and suppliers -- access to its network across mobile applications, on-premises applications and SaaS without compromising security. This enables better collaboration, enhanced productivity, increased efficiency and reduced operating costs.

Types of digital authentication

With IAM, enterprises can implement a range of digital authentication methods to prove digital identity and authorize access to corporate resources.
  • Unique passwords. The most common type of digital authentication is the unique password. To make passwords more secure, some organizations require longer or complex passwords that require a combination of letters, symbols and numbers. Unless users can automatically gather their collection of passwords behind a single sign-on entry point, they typically find remembering unique passwords onerous.
  • Pre-shared key (PSK). PSK is another type of digital authentication where the password is shared among users authorized to access the same resources -- think of a branch office Wi-Fi password. This type of authentication is less secure than individual passwords. A concern with shared passwords like PSK is that frequently changing them can be cumbersome.
  • Behavioral authentication. When dealing with highly sensitive information and systems, organizations can use behavioral authentication to get far more granular and analyze keystroke dynamics or mouse-use characteristics. By applying artificial intelligence, a trend in IAM systems, organizations can quickly recognize if user or machine behavior falls outside of the norm and can automatically lock down systems.
  • Biometrics. Modern IAM systems use biometrics for more precise authentication. For instance, they collect a range of biometric characteristics, including fingerprints, irises, faces, palms, gaits, voices and, in some cases, DNA. Biometrics and behavior-based analytics have been found to be more effective than passwords. One danger in relying heavily on biometrics is if a company's biometric data is hacked, then recovery is difficult, as users can't swap out facial recognition or fingerprints like they can passwords or other non-biometric information. Another critical technical challenge of biometrics is that it can be expensive to implement at scale, with software, hardware and training costs to consider.

Sunday, January 1, 2023

Kafka Connect

Kafka Connect is a free, open-source component of Apache Kafka® that works as a centralized data hub for simple data integration between databases, key-value stores, search indexes, and file systems. The information in this page is specific to Kafka Connect for Confluent Platform. 

Confluent Cloud offers pre-built, fully managed, Kafka connectors that make it easy to instantly connect to popular data sources and sinks. With a simple GUI-based configuration and elastic scaling with no infrastructure to manage, Confluent Cloud connectors make moving data in and out of Kafka an effortless task, giving you more time to focus on application development.

You can deploy Kafka Connect as a standalone process that runs jobs on a single machine (for example, log collection), or as a distributed, scalable, fault-tolerant service supporting an entire organization. Kafka Connect provides a low barrier to entry and low operational overhead. You can start small with a standalone environment for development and testing, and then scale up to a full production environment to support the data pipeline of a large organization.


Benefits of Kafka Connect:

  • Data-centric pipeline: Connect uses meaningful data abstractions to pull or push data to Kafka.
  • Flexibility and scalability: Connect runs with streaming and batch-oriented systems on a single node (standalone) or scaled to an organization-wide service (distributed).
  • Reusability and extensibility: Connect leverages existing connectors or extends them to fit your needs and provides lower time to production.


Types of connectors:

  1. Source connector: Source connectors ingest entire databases and stream table updates to Kafka topics. Source connectors can also collect metrics from all your application servers and store the data in Kafka topics, making the data available for stream processing with low latency (a sketch of registering a source connector follows this list).
  2. Sink connector: Sink connectors deliver data from Kafka topics to secondary indexes, such as Elasticsearch, or batch systems such as Hadoop for offline analysis.
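
As a sketch of what registering a source connector looks like, the snippet below posts a FileStreamSource configuration to a Connect worker's REST API using Python's requests library. The worker URL, connector name, file path, and topic are placeholder assumptions.

  import requests

  # Placeholder Connect worker URL, file path, and topic -- replace with your own.
  connector = {
      "name": "file-source-demo",
      "config": {
          "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
          "tasks.max": "1",
          "file": "/var/log/app.log",   # source file streamed line by line
          "topic": "app-logs",          # destination Kafka topic
      },
  }
  resp = requests.post("http://localhost:8083/connectors", json=connector)
  resp.raise_for_status()
  print(resp.json())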

Scale-out versus Scale-up

Introduction

Scaling up and scaling out are two IT strategies that both increase the processing power and storage capacity of systems. The difference is in how engineers achieve this type of growth and system improvement. The terms "scale up" and "scale out" refer to the way in which the infrastructure is grown.

While scaling out involves adding more discrete units to a system in order to add capacity, scaling up involves building existing units by integrating resources into them.

One of the easiest ways to describe both of these methods is that scaling out generally means building horizontally, while scaling up means building vertically. 


Scale Out

Scaling out takes the existing infrastructure, and replicates it to work in parallel. This has the effect of increasing infrastructure capacity roughly linearly. Data centers often scale out using pods. Build a compute pod, spin up applications to use it, then scale out by building another pod to add capacity. Actual application performance may not be linear, as application architectures must be written to work effectively in a scale-out environment.


Scale Up

Scaling up is taking what you’ve got, and replacing it with something more powerful. From a networking perspective, this could be taking a 1GbE switch, and replacing it with a 10GbE switch. Same number of switchports, but the bandwidth has been scaled up via bigger pipes. The 1GbE bottleneck has been relieved by the 10GbE replacement.

Scaling up is a viable scaling solution until it is impossible to scale up individual components any larger. 


Kafka

What is event streaming?

Event streaming is the practice of capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and software applications in the form of streams of events; storing these event streams durably for later retrieval; manipulating, processing, and reacting to the event streams in real-time as well as retrospectively; and routing the event streams to different destination technologies as needed. Event streaming thus ensures a continuous flow and interpretation of data so that the right information is at the right place, at the right time.


What can I use event streaming for?

Event streaming is applied to a wide variety of use cases across a plethora of industries and organizations. Its many examples include:


  • To process payments and financial transactions in real-time, such as in stock exchanges, banks, and insurance companies.
  • To track and monitor cars, trucks, fleets, and shipments in real-time, such as in logistics and the automotive industry.
  • To continuously capture and analyze sensor data from IoT devices or other equipment, such as in factories and wind parks.
  • To collect and immediately react to customer interactions and orders, such as in retail, the hotel and travel industry, and mobile applications.
  • To monitor patients in hospital care and predict changes in condition to ensure timely treatment in emergencies.
  • To connect, store, and make available data produced by different divisions of a company.
  • To serve as the foundation for data platforms, event-driven architectures, and microservices.



Apache Kafka® is an event streaming platform. What does that mean?

Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end with a single battle-tested solution:


  1. To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
  2. To store streams of events durably and reliably for as long as you want.
  3. To process streams of events as they occur or retrospectively.


And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner. Kafka can be deployed on bare-metal hardware, virtual machines, and containers, and on-premises as well as in the cloud. You can choose between self-managing your Kafka environments and using fully managed services offered by a variety of vendors.


How does Kafka work in a nutshell?

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol. It can be deployed on bare-metal hardware, virtual machines, and containers in on-premise as well as cloud environments.


Servers: Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams to integrate Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the other servers will take over their work to ensure continuous operations without any data loss.


Clients: They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala including the higher-level Kafka Streams library, for Go, Python, C/C++, and many other programming languages as well as REST APIs.
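
To make the client side concrete, here is a minimal produce-and-consume sketch using the confluent-kafka Python client; the broker address, topic, group id, and payload are placeholder assumptions.

  from confluent_kafka import Producer, Consumer

  conf = {"bootstrap.servers": "localhost:9092"}  # placeholder broker address

  # Publish (write) an event to a topic.
  producer = Producer(conf)
  producer.produce("payments", key="user-42", value='{"amount": 19.99}')
  producer.flush()

  # Subscribe to (read) events from the same topic.
  consumer = Consumer({**conf, "group.id": "demo", "auto.offset.reset": "earliest"})
  consumer.subscribe(["payments"])
  msg = consumer.poll(timeout=5.0)
  if msg is not None and msg.error() is None:
      print(msg.key(), msg.value())
  consumer.close()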

 




Wednesday, November 9, 2022

Sharding

 Introduction

Sharding is a type of database partitioning that separates large databases into smaller, faster, more easily managed parts. These smaller parts are called data shards. The word shard means "a small part of a whole."

Sharding involves splitting and distributing one logical data set across multiple databases that share nothing and can be deployed across multiple servers. To achieve sharding, the rows or columns of a larger database table are split into multiple smaller tables.

Once a logical shard is stored on another node, it is known as a physical shard. One physical shard can hold multiple logical shards. The shards are autonomous and don't share the same data or computing resources.


Horizontal sharding. 

When each new table has the same schema but unique rows, it is known as horizontal sharding. In this type of sharding, more machines are added to an existing stack to spread out the load, increase processing speed and support more traffic. This method is most effective when queries return a subset of rows that are often grouped together.
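
A tiny sketch of how horizontal sharding routes rows: hash the shard key and pick one of several databases. The shard names and the choice of user_id as the shard key are illustrative assumptions.

  import hashlib

  # Hypothetical shard names; each would be a separate database with the same schema.
  SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]

  def shard_for(user_id: str) -> str:
      """Route a row to a shard by hashing its shard key (user_id)."""
      digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
      return SHARDS[int(digest, 16) % len(SHARDS)]

  # Rows with the same user_id always land on the same shard.
  print(shard_for("alice"))
  print(shard_for("bob"))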

Vertical sharding. 

When each new table has a schema that is a faithful subset of the original table's schema, it is known as vertical sharding. It is effective when queries usually return only a subset of columns of the data.


Benefits of sharding

Sharding is common in scalable database architectures. Since shards are smaller, faster and easier to manage, they help boost database scalability, performance and administration. Sharding also reduces the transaction cost of the database.

Horizontal scaling, which is also known as scaling out, helps create a more flexible database design, which is especially useful for parallel processing. It provides near-limitless scalability for intense workloads and big data requirements. With horizontal sharding, users can optimally use all the compute resources across a cluster for every query. This sharding method also speeds up query resolution, since each machine has to scan fewer rows when responding to a query.

Vertical sharding increases RAM or storage capacity and improves central processing unit (CPU) capacity. It thus increases the power of a single machine or server.


Sharded databases also offer higher availability and mitigate the impact of outages because, during an outage, only those portions of an application that rely on the missing chunks of data become unusable. A sharded database also replicates backup shards to additional nodes to further minimize damage due to an outage. In contrast, an application running without sharded databases may be completely unavailable following an outage.


Difference between sharding and partitioning

Although sharding and partitioning both break up a large database into smaller databases, there is a difference between the two methods.

After a database is sharded, the data in the new tables is spread across multiple systems, but with partitioning, that is not the case. Partitioning groups data subsets within a single database instance.





Saturday, December 3, 2016

CAP Theorem

This theorem describes the behavior of a distributed system. A distributed system is a collection of interconnected nodes that all share data.

The CAP Theorem was first postulated by Dr. Eric Brewer back in the year 2000. According to the theorem, a distributed system can guarantee at most two of the three attributes below at any given time; for a particular request, you have to give up the third.


  1. Consistency: Consistency means that the system guarantees to read data that is at least as fresh as what you just wrote.
  2. Availability: Availability means that a non-failing node will give the client a reasonable response within a reasonable amount of time. That's all relative, but what it really means is that the node won't hang indefinitely, and it won't return an error.
  3. Partition tolerance: Partition tolerance guarantees that a distributed system will continue to function in the face of network partitions. A network partition is a break in connectivity. It means that nodes within the system cannot communicate with one another. A partition could be isolated to just the connection between two specific nodes or it could run through the entire network.


Explanation with an example: we write a new version of data to node X, and then we read that data from node Y. Initially node Y has an older version of that data. So, there could be three scenarios.
  
  • Scenario #1:  Node Y could get the new version from node X. That could be node X sending it to node Y and waiting until it has confirmation from node Y before it sends confirmation back to the client, or it could be that node Y goes and fetches it from node X, and it can only return to the client after it has fetched the data. In either case, the network must be functional for the new version to make it to node Y. If the network is partitioned between nodes X and Y, then node Y won't get the latest version. So, in this first scenario, this system is not being partition tolerant. So, let's suppose we want to guarantee partition tolerance.
  • Scenario #2: In this scenario, node Y tolerates the partition and simply returns the best version that it has. So, in this case the client would get the older version of the data. This violates the consistency guarantee.
  •  Scenario #3:  In this scenario, we have to wait for messages to get from node X to node Y. So, either node Y will wait indefinitely while the network is partitioned or it'll time out and return an error. In either case, it's not being available.