Training/gpcdesre

Alex Soul c51d10402b Started first section of Course 4

2021-02-03 16:54:24 +00:00

Monitoring, Managing, and Maximizing Google Cloud Operations (GCP DevOps Engineer Track Part 4)

Introduction

About the Course and Learning Path

Make better software, faster

Milestone: Getting Started

Understanding Operations in Context

Section Introduction

What Is Ops?

GCP Defined
- "Monitor, troubleshoot, and improve application performance on your Google Cloud Environment"
- Logging Management
  - Gather Logs, metrics and traces everywhere
    - Audit, Platform, User logs
      - Export, Discard, Ingest
- Error Reporting
  - So much data, how do you pick out the important indicators?
    - A centralized error management interface that shows current & past errors
    - Identify your app's top or new errors at a glance, in a dedicated dashboard
    - Shows error details such as a time chart, occurrences, affected user count, first- and last-seen dates, as well as a cleaned exception stack trace
- Across-the-board Monitoring
  - Dashboards for built-in and customizable visualisations
    - Monitoring Features
      - Visual Dashboards
  - Health Monitoring
    - Associate uptime checks with URLs, groups, or resources (e.g. instances and load balancers)
  - Service Monitoring
    - Set and monitor Service Level Objectives (SLOs), and alert your teams as needed
- SRE Tracking
  - Monitoring is critical for SRE
    - Google Cloud Monitoring enables the quick and easy development of SLIs and SLOs
      - Pinpoint SLIs and develop an SLO on top of them
- Operational Management
  - Debugging
    - Inspects the state of your application at any code location in production without stopping or slowing down requests
  - Latency Management
    - Provides latency sampling and reporting for App Engine, including latency distributions and per-URL statistics
  - Performance Management
    - Offers continuous profiling of resource consumption in your production applications along with cost management
  - Security Management
    - With audit logs, you have near real-time user activity visibility across all your applications on Google Cloud

What is Ops: Key Takeaways

Ops Defined: Watch, learn and fix
Primary services: Monitoring and Logging
Monitoring dashboards for all metrics, including health and services (SLOs)
Logs can be exported, discarded, or ingested
SRE depends on ops
Error alerting pinpoints problems quickly

Scratch:

Metric query and tracing analysis
Establish performance and reliability indicators
Trigger alerts and error reporting
Logging Features
Error Reporting
SRE Tracking (SLI/SLO)
Performance Management

Clarifying the Stackdriver/Operations Connection

2012 - Stackdriver Created
2014 - Stackdriver Acquired by Google
2016 - Stackdriver Released: Expanded version of Stackdriver with log analysis and hybrid cloud support is made generally available
2020 - Stackdriver Integrated: Google fully integrates all Stackdriver functionality into Google Cloud Platform and drops the Stackdriver name

Cloud Monitoring, Cloud Logging, Cloud Trace, Cloud Profiler, Cloud Debugger, Cloud Audit Logs (formerly all called Stackdriver)

"Stackdriver" lives on - in the exam only

Integration + Upgrades

Complete UI Integrations
- All of Stackdriver's functionality - and more - is now integrated into the Google Cloud console
Dashboard API
- New API added to allow creation and sharing of dashboards across projects
Log Retention Increased
- Logs can now be retained for up to 10 years and you have control over the time specified
Metrics Enhancement
- In Cloud Monitoring, metrics are kept up to 24 months, and writing metrics has been increased to a 10-second granularity (write out metrics every 10 seconds)
Advanced Alert Routing
- Alerts can now be routed to independent systems that support Cloud Pub/Sub

Operations and SRE: How Do They Relate?

Lots of questions in Exam on SRE

What is SRE? - "SRE is what happens when a software engineer is tasked with what used to be called operations" (Founder Google SRE Team)

Pillars of DevOps

Accept failure as normal:
- Try to anticipate, but...
- Incidents bound to occur
- Failures help team learn
No-fault postmortems & SLOs:
- No two failures the same
- Track incidents (SLIs)
- Map to Objectives (SLOs)
Implement gradual change:
- Small updates are better
- Easier to review
- Easier to rollback
Reduce costs of failures:
- Limited "canary" rollouts
- Impact fewest users
- Automate where possible
Measure everything:
- Critical gauge of success
- CI/CD needs full monitoring
- Synthetic, proactive monitoring
Measure toil and reliability:
- Key to SLOs and SLAs
- Reduce toil, up engineering
- Monitor all over time

SLI: "A carefully defined quantitative measure of some aspect of the level of service that is provided"

SLIs are metrics over time - specific to a user journey such as request/response, data processing, or storage - that show how well a service is doing

Example SLIs:

Request Latency: How long it takes to return a response to a request
Failure Rate: A fraction of all requests received: (unsuccessful requests / all requests)
Batch Throughput: Proportion of time the data processing rate is greater than a threshold
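
The three example SLIs above can be sketched in a few lines of Python. This is a minimal illustration, not how Cloud Monitoring computes them: the `Request` record, the function names, and the sample numbers are all hypothetical, and latency is summarized here as a simple mean.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one served request."""
    latency_ms: float  # how long it took to return a response
    ok: bool           # True if the request succeeded


def mean_latency_ms(requests: Sequence[Request]) -> float:
    """Request latency SLI, summarized as the mean over the sample."""
    if not requests:
        raise ValueError("need at least one request")
    return sum(r.latency_ms for r in requests) / len(requests)


def failure_rate(requests: Sequence[Request]) -> float:
    """Failure rate SLI: unsuccessful requests / all requests."""
    if not requests:
        raise ValueError("need at least one request")
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def batch_throughput_sli(rates: Sequence[float], threshold: float) -> float:
    """Proportion of samples whose processing rate is above the threshold."""
    if not rates:
        raise ValueError("need at least one rate sample")
    above = sum(1 for rate in rates if rate > threshold)
    return above / len(rates)


# Made-up sample data for illustration only.
sample_requests = [
    Request(120, True),
    Request(250, True),
    Request(900, False),
    Request(130, True),
]
sample_rates = [50, 80, 120, 95, 110]  # e.g. records processed per second

print(mean_latency_ms(sample_requests))
print(failure_rate(sample_requests))
print(batch_throughput_sli(sample_rates, threshold=90))
```

In practice each of these would be evaluated over a rolling time window rather than a fixed list.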

Commit to Memory - Google's 4x Golden Signals!

Latency
- The time it takes for your service to fulfill a request
Errors
- The rate at which your service fails
Traffic
- How much demand is directed at your service
Saturation
- A measure of how close to fully utilized the service's resources are

Mnemonic: LETS (Latency, Errors, Traffic, Saturation)
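
As a rough sketch of how the four signals could be derived for one observation window, assuming you already have per-request latencies, an error count, and a capacity figure. The `GoldenSignals` type, the `golden_signals` function, and its parameters are hypothetical names chosen for this example, not part of any Google Cloud API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class GoldenSignals:
    latency_ms: float   # Latency: mean time to fulfill a request
    error_rate: float   # Errors: fraction of requests that failed
    traffic_rps: float  # Traffic: requests per second in the window
    saturation: float   # Saturation: used capacity / total capacity


def golden_signals(
    latencies_ms: Sequence[float],
    error_count: int,
    window_seconds: float,
    used_capacity: float,
    total_capacity: float,
) -> GoldenSignals:
    """Summarize one observation window into the four golden signals."""
    if not latencies_ms:
        raise ValueError("need at least one request in the window")
    if window_seconds <= 0 or total_capacity <= 0:
        raise ValueError("window_seconds and total_capacity must be positive")
    total = len(latencies_ms)
    return GoldenSignals(
        latency_ms=sum(latencies_ms) / total,
        error_rate=error_count / total,
        traffic_rps=total / window_seconds,
        saturation=used_capacity / total_capacity,
    )


# Made-up numbers: 4 requests in a 2-second window, 1 failure, 6 of 8 vCPUs busy.
signals = golden_signals(
    latencies_ms=[100, 200, 300, 400],
    error_count=1,
    window_seconds=2,
    used_capacity=6,
    total_capacity=8,
)
print(signals)
```

Saturation is the signal most dependent on the service: here it is CPU in use over CPU available, but memory, disk, or queue depth may be the better measure for a given workload.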

SLO: "Service level objectives (SLOs) specify a target level for the reliability of your service" - The Site Reliability Workbook

SLOs are tied to your SLIs

Measured by SLI
Can be a single target value or range of values
SLI <= SLO
or
(lower bound <= SLI <= upper bound) = SLO
Common SLOs: 99.5%, 99.9%, 99.99% (4x 9's)
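
The two SLO forms above, and what the common percentages mean in practice, can be sketched as follows. This assumes the percentages are availability targets measured over a 365-day year, which is an assumption made for the example; the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # assumption: a 365-day year


def meets_target(sli: float, slo: float) -> bool:
    """Single-target form from the notes: SLI <= SLO."""
    return sli <= slo


def within_bounds(sli: float, lower: float, upper: float) -> bool:
    """Range form from the notes: lower bound <= SLI <= upper bound."""
    return lower <= sli <= upper


def allowed_downtime_hours(slo_percent: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of unavailability an availability SLO tolerates over the period."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return (100.0 - slo_percent) / 100.0 * period_hours


for slo in (99.5, 99.9, 99.99):
    hours = allowed_downtime_hours(slo)
    print(f"{slo}% allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99.5% tolerates roughly 43.8 hours a year, 99.9% roughly 8.76 hours, and 99.99% a little under 53 minutes, which is why each extra 9 is so much harder to deliver.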

SLI - Metrics over time which detail the health of a service

example: Site homepage request latency < 300ms over the last 5 minutes at the 95th percentile

SLO - Agreed-upon bounds on how often SLIs must be met

example: 95th percentile homepage SLI will succeed 99.9% of the time over the next year
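
A minimal sketch of how the example SLI and SLO fit together, assuming latencies are grouped into 5-minute windows and the percentile is computed with the nearest-rank method. Both are choices made for this illustration, and the function names and sample data are hypothetical.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def window_meets_sli(latencies_ms: Sequence[float],
                     threshold_ms: float = 300.0,
                     pct: float = 95.0) -> bool:
    """SLI for one window: the pct-th percentile latency is under the threshold."""
    return percentile(latencies_ms, pct) < threshold_ms


def slo_compliance(windows: Sequence[Sequence[float]]) -> float:
    """Fraction of windows in which the SLI was met."""
    if not windows:
        raise ValueError("need at least one window")
    good = sum(1 for w in windows if window_meets_sli(w))
    return good / len(windows)


# Made-up 5-minute windows of homepage latencies in milliseconds.
good_window = [100.0] * 19 + [1000.0]  # one slow outlier, p95 still fast
bad_window = [100.0] * 18 + [1000.0] * 2  # p95 lands on a slow request
windows = [good_window, good_window, good_window, bad_window]

compliance = slo_compliance(windows)
print(f"SLI met in {compliance:.1%} of windows")
print("SLO of 99.9% met:", compliance >= 0.999)
```

With only four windows a single bad one drops compliance far below 99.9%; over a year of 5-minute windows the same SLO tolerates roughly one bad window in a thousand.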

Phases of Service Lifetime

SREs are involved in the architecture and design phase, but really hit their stride in the "Limited Availability" (operations) phase. This phase typically includes the alpha/beta phases, and gives SREs a great opportunity to:

Measure and track SLIs (Measuring increasing performance)
Evaluate reliability
Define SLOs
Build capacity models
Establish incident response, shared with dev team

General Availability Phase

After Production Readiness Review passed
SREs handle majority of op work
Incident responses
Track operational load and SLOs

Ops & SRE: Key Takeaways

SRE: Operations from a software engineer
Many shared pillars between DevOps/SRE
SLIs are quantitative metrics over time
Remember the 4x Google Golden Signals (LETS)
SLOs are a target objective for reliability
SLIs are lower than the SLO - or - in between an upper and lower bound
SREs are most active in limited availability and general availability phases

Operation Services at a Glance

Section Review

Milestone: The Weight of the World (Teamwork, Not Superheroes)

Monitoring Your Operations

- Section Introduction
- Cloud Monitoring Concepts
- Monitoring Workspaces Concepts
- Monitoring Workspaces
- Perspective: Workspaces in Context
- What Are Metrics?
- Exploring Workspace and Metrics
- Monitoring Agent Concepts
- Installing the Monitoring Agent
- Collecting Monitoring Agent Metrics
- Integration with Monitoring API
- Create Dashboards with Command Line
- GKE Metrics
- Perspective: What's Up, Doc?
- Uptime Checks
- Establishing Human-Actionable and Automated Alerts
- Section Review
- Milestone: Spies Everywhere! (Check Those Vitals!)
- Hands-On Lab: Install and Configure Monitoring Agent with Google Cloud Monitoring

Logging Activities

- Section Introduction
- Cloud Logging Fundamentals
- Log Types and Mechanics
- Cloud Logging Tour
- Logging Agent Concepts
- Install Logging Agent and Collect Agent Logs
- Logging Filters
- Hands-On with Advanced Filters
- VPC Flow Logs
- Firewall Logs
- VPC Flow Logs and Firewall Logs Demo
- Routing and Exporting Logs
- Export Logs to BigQuery
- Logs-Based Metrics
- Section Review
- Milestone: Let the Record Show
- Hands-On Lab: Install and Configure Logging Agent on Google Cloud

SRE and Alerting Policies

- SLOs and Alerting Strategy
- Service Monitoring
- Milestone: Come Together, Right Now, SRE

Optimize Performance with Trace/Profiler

- Section Introduction
- What the Services Do and Why They Matter
- Tracking Latency with Cloud Trace
- Accessing the Cloud Trace APIs
- Setting Up Your App with Cloud Profiler
- Analyzing Cloud Profiler Data
- Section Review
- Milestone: It All Adds Up!
- Hands-On Lab: Discovering Latency with Google Cloud Trace

Identifying Application Errors with Debug/Error Reporting

- Section Introduction
- Troubleshooting with Cloud Debugger
- Establishing Error Reporting for Your App
- Managing Errors and Handling Notifications
- Section Review
- Milestone: Come Together - Reprise (Debug Is De Solution)
- Hands-On Lab: Correcting Code with Cloud Debugger

Course Conclusion

- Milestone: Are We There, Yet?
- Practice Exam / Quiz: Google Certified Professional Cloud DevOps Engineer Exam Prep