Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)
===============================================================================================

### Introduction 

#### About the Course and Learning Path

Make better software, faster

#### Milestone: Getting Started

### Understanding Operations in Context

#### Section Introduction

#### What Is Ops?

- GCP Defined

  - _"Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"_

  - Logging Management  
    - Gather Logs, metrics and traces everywhere
      - Audit, Platform, User logs
        - Export, Discard, Ingest

  - Error Reporting
    - So much data, How do you pick out the important indicators?
      - A centralized error management interface that shows current & past errors
      - Identify your app's top or new errors at a glance, in a dedicated dashboard
      - Shows error details such as time chart, occurrences, affected user count, first- and last-seen dates, as well as a cleaned exception stack trace


  - Across-the-board Monitoring
    - Dashboards for built-in and customizable visualisations
      - Monitoring Features
        - Visual Dashboards
    - Health Monitoring
      - Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
    - Service Monitoring
      - Set and monitor Service Level Objectives (SLOs), and alert your teams as needed

  - SRE Tracking
    - Monitoring is critical for SRE
      - Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
        - Pinpoint SLIs and develop an SLO on top of them

  - Operational Management
    - Debugging
      - Inspects the state of your application at any code location in production without stopping or slowing down requests
    - Latency Management
      - Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
    - Performance Management
      - Offers continuous profiling of resource consumption in your production applications along with cost management
    - Security Management
      - With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud

**What is Ops: Key Takeaways**

- Ops Defined: Watch, learn and fix
- Primary services: Monitoring and Logging
- Monitoring dashboards for all metrics, including health and services (SLOs)
- Logs can be exported, discarded, or ingested
- SRE depends on ops
- Error alerting pinpoints problems, quickly

Scratch:
- Metric query and tracing analysis
- Establish performance and reliability indicators
- Trigger alerts and error reporting
- Logging Features
- Error Reporting
- SRE Tracking (SLI/SLO)
- Performance Management


#### Clarifying the Stackdriver/Operations Connection

- 2012 - Stackdriver Created
- 2014 - Stackdriver Acquired by Google
- 2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
- 2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into the GCP platform and drops name

**Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs** (formerly all called Stackdriver `<service>`)

"Stackdriver" lives on - in the exam only

Integration + Upgrades

- Complete UI Integrations
  - All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
- Dashboard API
  - New API added to allow creation and sharing of dashboards across projects
- Log Retention Increased
  - Logs can now be retained for up to **10 years** and you have control over the time specified
- Metrics Enhancement
  - In Cloud Monitoring, metrics are kept up to 24 months, and writing metrics has been increased to a 10-second granularity (write out metrics every 10 seconds)
- Advanced Alert Routing
  - Alerts can now be routed to independent systems that support Cloud Pub/Sub

#### Operations and SRE: How Do They Relate?

- Lots of questions in Exam on SRE

What is SRE? - _"SRE is what happens when a software engineer is tasked with what used to be called operations"_ (Founder Google SRE Team)

**Pillars of DevOps**

- Accept failure as normal:
  - Try to anticipate, but...
  - Incidents bound to occur
  - Failures help team learn

- No-fault postmortems & SLOs:
  - No two failures the same
  - Track incidents (SLIs)
  - Map to Objectives (SLOs)

- Implement gradual change:
  - Small updates are better
  - Easier to review
  - Easier to rollback

- Reduce costs of failures:
  - Limited "canary" rollouts
  - Impact fewest users
  - Automate where possible

- Measure everything:
  - Critical gauge of success
  - CI/CD needs full monitoring
  - Synthetic, proactive monitoring

- Measure toil and reliability:
  - Key to SLOs and SLAs
  - Reduce toil, up engineering
  - Monitor all over time

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

SLI: _"A carefully defined <u>quantitative</u> measure of some aspect of the level of service that is provided"_

SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing

Example SLIs:
- Request Latency: How long it takes to return a response to a request
- Failure Rate: The fraction of all requests received that fail: (unsuccessful requests / all requests)
- Batch Throughput: The proportion of time the data processing rate is greater than a threshold
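
The failure rate and batch throughput SLIs above reduce to simple ratios. The following is a minimal sketch, not a Cloud Monitoring API call: the function names and sample numbers are assumptions made up for illustration.

```python
def failure_rate(unsuccessful: int, total: int) -> float:
    """Fraction of received requests that were unsuccessful."""
    if total == 0:
        return 0.0  # no traffic: treat as no failures (an assumption)
    return unsuccessful / total


def batch_throughput_sli(rates: list[float], threshold: float) -> float:
    """Proportion of sampled intervals whose processing rate exceeds the threshold."""
    if not rates:
        return 0.0
    return sum(1 for r in rates if r > threshold) / len(rates)


# Hypothetical sample data: 5 failed requests out of 1000,
# and four sampled processing rates (records/sec) against a threshold of 100.
print(failure_rate(5, 1000))                                     # 0.005
print(batch_throughput_sli([120.0, 80.0, 150.0, 95.0], 100.0))   # 0.5
```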

**Commit to Memory - Google's 4x Golden Signals!**

- Latency
  - The time it takes for your service to fulfill a request
- Errors    
  - The rate at which your service fails
- Traffic
  - How much demand is directed at your service
- Saturation
  - A measure of how close to fully utilized the service's resources are

> **LETS**
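
As a rough illustration of how the four signals map to numbers, here is a sketch that derives each one from a window of request records. The `Request` type, the use of mean latency, and CPU as the saturated resource are all assumptions chosen for the example; real services often track latency percentiles and several resources.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests, window_seconds, cpu_used, cpu_capacity):
    """Summarize Latency, Errors, Traffic, Saturation for one time window."""
    total = len(requests)
    return {
        # Latency: mean time to fulfill a request (assumption: mean, not percentile)
        "latency_ms": sum(r.latency_ms for r in requests) / total if total else 0.0,
        # Errors: fraction of requests that failed
        "errors": sum(1 for r in requests if not r.ok) / total if total else 0.0,
        # Traffic: demand, as requests per second
        "traffic_rps": total / window_seconds,
        # Saturation: how close the resource is to fully utilized
        "saturation": cpu_used / cpu_capacity,
    }


# Hypothetical 2-second window with four requests on a 4-core machine.
sample = [Request(100, True), Request(200, True), Request(300, False), Request(400, True)]
signals = golden_signals(sample, window_seconds=2, cpu_used=3.0, cpu_capacity=4.0)
print(signals)
```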

<hr style="height:2px;border-width:0;color:gray;background-color:gray">

SLO: _"Service level objectives (SLOs) specify a target level for the reliability of your service"_ - The Site Reliability Workbook

SLOs are tied to your SLIs
- Measured by SLI
- Can be a single target value or range of values
- SLIs <= SLO
- or
- (lower bound <= SLI <= upper bound) = SLO
- Common SLOs: 99.5%, 99.9%, 99.99% (4x 9's)
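
To make the common targets concrete, the arithmetic below converts an availability SLO into the downtime it permits over a period. It is a sketch that assumes a time-based availability SLO measured over a 365-day year; the function name is made up for illustration.

```python
def allowed_downtime_minutes(slo_percent: float, period_days: int = 365) -> float:
    """Minutes of downtime permitted by an availability SLO over the period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


for slo in (99.5, 99.9, 99.99):
    print(f"{slo}% -> {allowed_downtime_minutes(slo):.1f} minutes per year")
```

Each extra nine shrinks the permitted downtime by a factor of ten, which is why the choice between 99.9% and 99.99% matters so much in practice.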

SLI - Metrics over time which detail the health of a service
  - example: `Site homepage request latency < 300ms over the last 5 minutes @ 95th percentile`

SLO - Agreed-upon bounds on how often SLIs must be met
  - example: `95th percentile homepage SLI will succeed 99.9% of the time over the next year`
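
The two examples can be sketched in code: the SLI is evaluated per 5-minute window, and the SLO checks how often those windows pass. This is an illustrative sketch only; the nearest-rank percentile method, the function names, and the sample latencies are assumptions, not how Cloud Monitoring computes them.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of values."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def sli_met(latencies_ms, threshold_ms=300.0, pct=95.0):
    """SLI: the 95th percentile latency in this window is under 300 ms."""
    return percentile(latencies_ms, pct) < threshold_ms


def slo_met(window_results, target=0.999):
    """SLO: the SLI was met in at least `target` of all windows."""
    return sum(window_results) / len(window_results) >= target


# Hypothetical latencies (ms) for two 5-minute windows.
windows = [[120, 180, 250, 290], [150, 200, 310, 900]]
results = [sli_met(w) for w in windows]
print(results)           # [True, False]
print(slo_met(results))  # False
```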

Phases of Service Lifetime

SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases, and gives SREs a great opportunity to:
- Measure and track SLIs (Measuring increasing performance)
- Evaluate reliability
- Define SLOs
- Build capacity models
- Establish incident response, shared with dev team

General Availability Phase
- After Production Readiness Review passed
- SREs handle majority of op work
- Incident responses
- Track operational load and SLOs

**Ops & SRE: Key Takeaways**
- SRE: Operations from a software engineer
- Many shared pillars between DevOps/SRE
- SLIs are quantitative metrics over time
- Remember the 4x Google Golden Signals (LETS)
- SLOs are a target objective for reliability
- SLIs are at or below the SLO - or - in between the upper and lower bounds
- SREs are most active in limited availability and general availability phases


#### Operation Services at a Glance

#### Section Review

#### Milestone: The Weight of the World (Teamwork, Not Superheroes)


### Monitoring Your Operations

#### Section Introduction

#### Cloud Monitoring Concepts

#### Monitoring Workspaces Concepts

#### Monitoring Workspaces

#### Perspective: Workspaces in Context

#### What Are Metrics?

#### Exploring Workspace and Metrics

#### Monitoring Agent Concepts

#### Installing the Monitoring Agent

#### Collecting Monitoring Agent Metrics

#### Integration with Monitoring API

#### Create Dashboards with Command Line

#### GKE Metrics

#### Perspective: What's Up, Doc?

#### Uptime Checks

#### Establishing Human-Actionable and Automated Alerts

#### Section Review

#### Milestone: Spies Everywhere! (Check Those Vitals!)

#### Hands-On Lab: Install and Configure Monitoring Agent with Google Cloud Monitoring

### Logging Activities

#### Section Introduction

#### Cloud Logging Fundamentals

#### Log Types and Mechanics

#### Cloud Logging Tour

#### Logging Agent Concepts

#### Install Logging Agent and Collect Agent Logs

#### Logging Filters

#### Hands-On with Advanced Filters

#### VPC Flow Logs

#### Firewall Logs

#### VPC Flow Logs and Firewall Logs Demo

#### Routing and Exporting Logs

#### Export Logs to BigQuery

#### Logs-Based Metrics

#### Section Review

#### Milestone: Let the Record Show

#### Hands-On Lab: Install and Configure Logging Agent on Google Cloud

### SRE and Alerting Policies

#### SLOs and Alerting Strategy

#### Service Monitoring

#### Milestone: Come Together, Right Now, SRE

### Optimize Performance with Trace/Profiler

#### Section Introduction

#### What the Services Do and Why They Matter

#### Tracking Latency with Cloud Trace

#### Accessing the Cloud Trace APIs

#### Setting Up Your App with Cloud Profiler

#### Analyzing Cloud Profiler Data

#### Section Review

#### Milestone: It All Adds Up!

#### Hands-On Lab: Discovering Latency with Google Cloud Trace

### Identifying Application Errors with Debug/Error Reporting

#### Section Introduction

#### Troubleshooting with Cloud Debugger

#### Establishing Error Reporting for Your App

#### Managing Errors and Handling Notifications

#### Section Review

#### Milestone: Come Together - Reprise (Debug Is De Solution)

#### Hands-On Lab: Correcting Code with Cloud Debugger

### Course Conclusion

#### Milestone: Are We There, Yet?

#### Practice Exam / Quiz: Google Certified Professional Cloud DevOps Engineer Exam Prep

Started first section of Course 4 2021-02-03 16:54:24 +00:00