Design questions in interviews at Amazon, Facebook, Google

blak_box · Post by **blak_box** » 25 May 2016 03:14

I'm interested not only in the system design questions themselves but also in study materials on design. There are plenty of books on algorithms, but what should I read on design? The focus is on scalability and distributed systems, including how to sketch all of this as a diagram, and so on.

blanko27 · Post by **blanko27** » 25 May 2016 04:17

This topic is endless; take, for example, the publications at research.google.com

As for drawing it, you can do it like this: "What does it look like?"... or with UML... or however you see fit (which is usually how it goes in publications)

Сабина · Post by **Сабина** » 25 May 2016 04:45

blak_box wrote: I'm interested not only in the system design questions themselves but also in study materials on design. There are plenty of books on algorithms, but what should I read on design? The focus is on scalability and distributed systems, including how to sketch all of this as a diagram, and so on.

It depends on what exactly the position requires. If it's backend, then ...

The Art of Scalability
Building Microservices
Or there's more in the picture - came across it recently

Сабина · Post by **Сабина** » 25 May 2016 05:03

And something homegrown - https://esciencegroup.com/2016/05/23/a- ... echnology/

blak_box · Post by **blak_box** » 25 May 2016 12:18

The position is abstract for now.

The questions will be along the lines of: design Google search, design Facebook Messenger, design a URL shortener, etc. And that's for 45 minutes, where I'll have to do the talking 95% of the time.

voyager3 · Post by **voyager3** » 25 May 2016 16:12

When you get design questions, the best approach is to spend most of the time asking clarifying questions like "how do you want the system to behave if an elephant takes on a whale". And then show how you build the architecture in line with the answers.

Сабина · Post by **Сабина** » 26 May 2016 06:27

blak_box wrote: The position is abstract for now.

The questions will be along the lines of: design Google search, design Facebook Messenger, design a URL shortener, etc. And that's for 45 minutes, where I'll have to do the talking 95% of the time.

You can dig up reasonably good answers to all of these questions on the internet. Someone here walked me through a URL shortener in simple terms about a year ago. One early-edition book from O'Reilly describes how Twitter builds timelines efficiently from a performance standpoint. I can log into my account and find the book if needed. I don't know how FB Messenger differs from the others; Google search, as I understand it, is all sorts of nearest-neighbor stuff

blak_box · Post by **blak_box** » 26 May 2016 12:06

Сабина, please post a link to the book about Twitter.

Wolverene · Post by **Wolverene** » 26 May 2016 19:17

I don't think most of these questions call for precise answers along the lines of how it is implemented in the real system. It's more like this:

- Who the clients are
- What tasks the system has to solve
- What the key nodes of the system are
- How they interact
- Implementation details
- How the system can be optimized
- How we can monetize the service
- What problems may come up, in particular:
  - Scalability
  - Fault tolerance
  - What problems may come up from the way the above are solved
- How to start a concrete implementation

And then you need to talk with the interviewer, along the lines of: there is a lot to cover here, we clearly won't fit into 45 minutes, so let's focus on, say, the search aspect and unpack it in more detail.

That is, for design a URL shortener:
- The clients are all potential users who need to shorten a URL
- Tasks - shorten a URL and expand it. Do we need to be able to update an already registered URL? What are the availability requirements after a URL is registered?
- Key nodes
  - A web service with two functions, registration and expansion
  - A data store
- The interaction here is clear - the service handles requests and reads from and writes to the store.
- For the web service you could use nodejs, since it doesn't carry a heavy load, and for the store, in the simplest variant, use a database.
- Reads can be optimized by adding a cache server (local to the web server, or distributed, like memcached). Keep a pool of DB connections open so you don't open one per request. Make writes non-blocking, so that different users can execute as many requests in parallel as possible.
- Monetize by showing ads at the moment a URL is registered. Other ways are possible too - but that's not so much an architecture task as a business question.
- On scalability:
  - If we assume the service's popularity will grow, it has to be scaled out - increase the number of servers and put a load balancer in front of them.
  - Make the DB distributed, or use an already distributed DB. The DB is the main problem, because the data will have to be spread evenly. The problem is not writing but reading the data - so it can be spread based on a hash of the short URL, with salt added to it for a more even distribution.
  - You need to test what load a server can withstand, and how many servers to bring up depending on the expected number of users.
  - You need to deploy in the cloud so that additional servers can be attached.
- On fault tolerance - you have to assume that every node can go down.
  - The load balancer goes down. Solution - have several LBs, and DNS to distribute traffic among them. Or, in Amazon's case, the LB is already backed by several machines.
  - A web server goes down - no problem, the LB will notice and redirect requests to the other servers.
  - The DB goes down - how do we preserve the data? Do we synchronize, keep an additional replica of each shard, or does the DB already have a built-in redundancy mechanism... Each of these solutions can be discussed.
  - A region goes down - so you need clusters in several regions. But then a question comes up - how do you synchronize DB writes between regions? Say, use an intermediate server with a cache, and don't close the request until full synchronization has completed. But what if the target server goes down? And what if the source server goes down during synchronization?
  - If there are several regions or data centers, the consumer must be given access to the fastest one in terms of latency.
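A rough Python sketch of the walkthrough above may help make it concrete. Everything in it is an assumption for illustration rather than something the thread prescribes: the base-62 counter keys, the `ShortenerSketch` class, the `SALT` value, the use of plain dicts as stand-ins for database shards, and the small in-process LRU cache standing in for memcached.

```python
import hashlib
from collections import OrderedDict

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
SALT = "demo-salt"  # hypothetical salt mixed into the shard hash


def encode_base62(n: int) -> str:
    """Turn a non-negative counter value into a short base-62 key."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


class ShortenerSketch:
    """Two operations (register, resolve) over sharded storage plus a cache."""

    def __init__(self, shard_count: int = 4, cache_size: int = 1000):
        # Each dict stands in for one database shard (assumption).
        self.shards = [{} for _ in range(shard_count)]
        # OrderedDict used as a tiny LRU cache in front of the shards.
        self.cache = OrderedDict()
        self.cache_size = cache_size
        self.next_id = 1

    def _shard_for(self, key: str) -> dict:
        # Route by a salted hash of the short key, as suggested in the post.
        digest = hashlib.sha256((SALT + key).encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.shards)
        return self.shards[index]

    def register(self, long_url: str) -> str:
        key = encode_base62(self.next_id)
        self.next_id += 1
        self._shard_for(key)[key] = long_url
        return key

    def resolve(self, key: str):
        # Read path: cache first, then the owning shard.
        if key in self.cache:
            self.cache.move_to_end(key)
            return self.cache[key]
        long_url = self._shard_for(key).get(key)
        if long_url is not None:
            self.cache[key] = long_url
            if len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)  # evict least recently used
        return long_url


if __name__ == "__main__":
    service = ShortenerSketch()
    short = service.register("https://example.com/some/long/path")
    print(short, "->", service.resolve(short))
```

A single in-process counter is the weakest part of this sketch: with several web servers behind a load balancer, key generation itself would have to be coordinated or partitioned, which is one more direction in which the interview discussion can go.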

In short, as you can see, each of these questions can be unpacked further.
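As one example of unpacking a single point, the "a web server goes down, the LB redirects requests" case can be sketched as a round-robin balancer that skips servers marked unhealthy. This is only an illustration: the class `RoundRobinBalancer`, its method names, and the server names are hypothetical, and a real load balancer would detect failures through health checks rather than explicit `mark_down` calls.

```python
from itertools import count


class RoundRobinBalancer:
    """Round-robin over a fixed server list, skipping unhealthy servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._counter = count()

    def mark_down(self, server):
        # In practice this would be driven by failed health checks.
        self.healthy.discard(server)

    def mark_up(self, server):
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        if not self.healthy:
            raise RuntimeError("no healthy servers available")
        # At most one full rotation is needed to find a healthy server.
        for _ in range(len(self.servers)):
            candidate = self.servers[next(self._counter) % len(self.servers)]
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
    lb.mark_down("web-2")
    print([lb.pick() for _ in range(4)])
```

The balancer itself remains a single point of failure here, which is exactly why the post suggests several LBs behind DNS.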

I've also heard the question design YouTube - here you need to understand that there are many kinds of clients: ordinary users, film studios, lawyers, content creators, and so on. For each of them you have to write out their own tasks. You also need to take into account edge locations, split the video into chunks, not download all of it (why, if the user isn't watching it?), cache video depending on access statistics...
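The chunking and statistics-based caching ideas can be sketched as follows. This is a toy model under stated assumptions: `Origin`, `EdgeCache`, and the tiny `CHUNK_SIZE` are hypothetical names and values, the "video" is just a byte string, and the eviction rule (replace the least-requested cached chunk only when the new chunk has been requested more often) is one simple choice among many.

```python
from collections import Counter

CHUNK_SIZE = 4  # bytes; deliberately tiny so the demo is easy to follow


class Origin:
    """Stands in for the central video storage."""

    def __init__(self, videos):
        self.videos = videos
        self.fetches = 0  # how many chunk reads reached the origin

    def read_chunk(self, video_id, index):
        self.fetches += 1
        data = self.videos[video_id]
        return data[index * CHUNK_SIZE:(index + 1) * CHUNK_SIZE]


class EdgeCache:
    """Edge location that fetches chunks lazily and keeps the popular ones."""

    def __init__(self, origin, capacity):
        self.origin = origin
        self.capacity = capacity
        self.hits = Counter()  # access statistics per (video, chunk)
        self.chunks = {}

    def get_chunk(self, video_id, index):
        key = (video_id, index)
        self.hits[key] += 1
        if key in self.chunks:
            return self.chunks[key]
        # Only the requested chunk is fetched, never the whole video.
        data = self.origin.read_chunk(video_id, index)
        if len(self.chunks) < self.capacity:
            self.chunks[key] = data
        else:
            coldest = min(self.chunks, key=lambda k: self.hits[k])
            if self.hits[key] > self.hits[coldest]:
                del self.chunks[coldest]
                self.chunks[key] = data
        return data


if __name__ == "__main__":
    origin = Origin({"clip": b"abcdefghij"})
    edge = EdgeCache(origin, capacity=1)
    edge.get_chunk("clip", 0)
    edge.get_chunk("clip", 0)
    print("origin reads:", origin.fetches)
```

Chunks the viewer never requests are never transferred, which is the point the post makes about not downloading the whole video.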

Wolverene · Post by **Wolverene** » 26 May 2016 19:18

On design, you can look at http://aws.amazon.com/architecture/ - there are lots of examples of large systems there.

valchkou · Post by **valchkou** » 27 May 2016 08:38

There are general concepts for building distributed systems, and there are specific technologies, frameworks, and tools.
As a rule, it's the details that interviewers are interested in.
If the interview is at Amazon, then Amazon has already solved all the problems.
You just need to arrange Amazon's products in the right order, depending on the task.
Google may not like the same answers very much.

Kero · Post by **Kero** » 27 May 2016 10:49

This link gives general principles for preparing for such questions, with examples: https://github.com/shashank88/system_design
In addition, it's worth studying the examples and general concepts of real projects at http://highscalability.com/. The information there is a bit dated, but it will do for building an overall understanding.

Сабина · Post by **Сабина** » 27 May 2016 17:30

Found the book, which has many practical examples such as how the tweets timeline is prepared, etc.
"Designing Data Intensive Applications"
By Martin Kleppmann
Publisher: O'Reilly Media
Early Release Ebook: September 2014
Pages: 443

============== though I did notice that he doesn't update the book much ==============

Here's the about and the table of contents just in case (the Twitter example is on page 29)

================

About this book

If you have worked in software engineering in recent years, especially in server-side
and backend systems, you have probably been bombarded with a plethora of buzzwords
relating to storage and processing of data. NoSQL! Big Data! Web-scale!
Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce!
Real-time!
In the last decade we have seen many interesting developments in databases, distributed
systems and in the ways we build applications on top of them. There are various
driving forces for these developments, including:
• Internet companies such as Google, Yahoo!, Amazon, Facebook, LinkedIn and
Twitter are handling huge volumes of data and traffic, forcing them to create new
tools that enable them to efficiently handle such scale.
• Businesses need to be agile, test hypotheses cheaply, and respond quickly to new
market insights, by keeping development cycles short and data models flexible.
• Free and open source software has become very successful, and is now preferred
to commercial or bespoke in-house software in many environments.
• CPU clock speeds are barely increasing, but multi-core processors are standard,
and networks are getting faster. This means parallelism is only going to increase.
• Even if you work on a small team, you can now build systems that are distributed
across many machines and even multiple geographic regions, thanks to infrastructure
as a service (IaaS) such as Amazon Web Services.
• Many services are now expected to be highly available; extended downtime due
to outages or maintenance is becoming increasingly unacceptable.
Data-intensive applications are pushing the boundaries of what is possible by making
use of these technological developments. We call an application data-intensive if data
is its primary challenge: the quantity of data, the complexity of data, or the speed at
which it is changing (as opposed to compute-intensive, where CPU cycles are the
bottleneck).
The tools and technologies that help data-intensive applications store and process
data have been rapidly adapting to these changes. New types of database systems
(“NoSQL”) have been getting lots of attention, but message queues, caches, search
indexes, frameworks for batch and stream processing, and related technologies are
very important too. Many applications use some combination of these.
The buzzwords that fill this space are a sign of enthusiasm for the new possibilities,
which is a great thing. However, as software engineers and architects, we also need to
have a technically accurate and precise understanding of the various technologies and
their trade-offs if we want to build good applications. For that understanding, we
have to dig deeper than buzzwords.
Fortunately, behind the rapid changes in technology, there are enduring principles
that remain true, no matter which version of a particular tool you are using. If you
understand those principles, you’re in a position to see where each tool fits in, how to
make good use of it, and how to avoid its pitfalls. That’s where this book comes in.
The goal of this book is to help you navigate the diverse and fast-changing landscape
of technologies for processing and storing data. This book is not a tutorial for one
particular tool, nor is it a textbook full of dry theory. Instead, we will look at examples
of successful data systems: technologies that form the foundation of many popular
applications, and that have to meet scalability, performance and reliability requirements
in production every day.
We will dig into the internals of those systems, tease apart their key algorithms, discuss
their principles and the trade-offs they have to make. On this journey, we will try
to find useful ways of thinking about data systems — not just how they work, but also
why they work that way, and what questions we need to ask.
After reading this book, you will be in a great position to decide which kind of technology
is appropriate for which purpose, and understand how tools can be combined
to form the foundation of a good application architecture. You won’t be ready to
build your own database storage engine from scratch, but fortunately that is rarely
necessary. You will, however, develop a good intuition for what your systems are
doing under the hood, so that you can reason about their behavior, make good design
decisions, and track down any problems that may arise.

About this Book. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
Part I. Foundations of Data Systems
1. Reliable, Scalable and Maintainable Applications. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Thinking About Data Systems 2
Reliability 4
Hardware faults 5
Software errors 6
Human errors 7
How important is reliability? 8
Scalability 8
Describing load 9
Describing performance 11
Approaches for coping with load 15
Maintainability 16
Operability: making life easy for operations 17
Simplicity: managing complexity 18
Evolvability: making change easy 19
Summary 20
2. Data Models and Query Languages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
Relational Model vs. Document Model 26
The birth of NoSQL 27
The object-relational mismatch 28
Many-to-one and many-to-many relationships 31
Are document databases repeating history? 35

Relational vs. document databases today 38
Query Languages for Data 42
Declarative queries on the web 43
MapReduce querying 45
Graph-like Data Models 48
Property graphs 49
The Cypher query language 51
Graph queries in SQL 52
Triple-stores and SPARQL 55
The foundation: Datalog 59
Summary 62
3. Storage and Retrieval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Data Structures that Power Your Database 68
Hash indexes 70
SSTables and LSM-trees 74
B-trees 77
Other indexing structures 82
Keeping everything in memory 85
Transaction Processing or Analytics? 87
Data warehousing 88
Stars and snowflakes: schemas for analytics 90
Column-oriented storage 93
Column compression 94
Sort order in column storage 96
Writing to column-oriented storage 98
Aggregation: Data cubes and materialized views 98
Summary 100
4. Encoding and Evolution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
Formats for Encoding Data 108
Language-specific formats 109
JSON, XML and binary variants 110
Thrift and Protocol Buffers 113
Avro 118
The merits of schemas 123
Modes of Data Flow 124
Data flow through databases 125
Data flow through services: REST and RPC 127
Message passing data flow 132
Summary 135

Part II. Distributed Data
5. Replication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Leaders and Followers 146
Synchronous vs. asynchronous replication 147
Setting up new followers 149
Handling node outages 150
Implementation of replication logs 152
Problems With Replication Lag 155
Reading your own writes 156
Monotonic reads 158
Consistent prefix reads 159
Solutions for replication lag 160
Multi-leader replication 161
Use cases for multi-leader replication 161
Handling write conflicts 164
Multi-leader replication topologies 168
Leaderless replication 171
Writing to the database when a node is down 171
Limitations of quorum consistency 175
Sloppy quorums and hinted handoff 177
Detecting concurrent writes 178
Summary 186
6. Partitioning. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
Partitioning and replication 192
Partitioning of key-value data 193
Partitioning by key range 194
Partitioning by hash of key 195
Skewed workloads and relieving hot spots 196
Partitioning and secondary indexes 197
Partitioning secondary indexes by document 198
Partitioning secondary indexes by term 200
Rebalancing partitions 201
Strategies for rebalancing 201
Operations: automatic or manual rebalancing 204
Request routing 205
Parallel query execution 207
Summary 208

7. Transactions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
The slippery concept of a transaction 214
The meaning of ACID 215
Single-object and multi-object operations 219
Weak isolation levels 224
Read committed 225
Snapshot isolation and repeatable read 228
Preventing lost updates 233
Preventing write skew and phantoms 237
Serializability 242
Actual serial execution 243
Two-phase locking (2PL) 248
Serializable snapshot isolation (SSI) 252
Summary 257
8. The Trouble with Distributed Systems. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
Faults and Partial Failures 266
Cloud computing and supercomputing 267
Unreliable Networks 269
Network faults in practice 271
Detecting faults 272
Timeouts and unbounded delays 273
Synchronous vs. asynchronous networks 276
Unreliable Clocks 278
Monotonic vs. time-of-day clocks 279
Clock synchronization and accuracy 281
Relying on synchronized clocks 282
Process pauses 287
Knowledge, Truth and Lies 291
The truth is defined by the majority 292
Byzantine faults 295
System model and reality 298
Summary 302
9. Consistency and Consensus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 311
Consistency Guarantees 312
Linearizability 314
What makes a system linearizable? 315
Relying on linearizability 320
Implementing linearizable systems 323
The cost of linearizability 326
Ordering Guarantees

Ordering and causality 330
Sequence number ordering 334
Total order broadcast 338
Distributed Transactions and Consensus 343
Atomic commit and two-phase commit (2PC) 344
Distributed transactions in practice 350
Fault-tolerant consensus 355
Membership and coordination services 360
Summary 363
Part III. Heterogeneous Systems
10. Batch Processing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 377
Batch Processing with Unix Tools 379
Simple log analysis 379
The Unix philosophy 382
MapReduce and Distributed Filesystems 385
MapReduce job execution 387
Reduce-side joins and grouping 391
Map-side joins 396
The output of batch workflows 398
Comparing MapReduce to distributed databases 402
Beyond MapReduce 406
Materialization of intermediate state 407
Graphs and iterative processing 411
High-level APIs and languages 414
Summary 416

stenking · Post by **stenking** » 28 May 2016 02:28

http://highscalability.com/blog/category/example

A gift for you, kids!

Сабина · Post by **Сабина** » 31 May 2016 16:17

Сабина wrote:
blak_box wrote: I'm interested not only in the system design questions themselves but also in study materials on design. There are plenty of books on algorithms, but what should I read on design? The focus is on scalability and distributed systems, including how to sketch all of this as a diagram, and so on.
It depends on what exactly the position requires. If it's backend, then ...

The Art of Scalability
Building Microservices
Or there's more in the picture - came across it recently

This book has a chapter on designing a streaming video service - practically YouTube. The book is small, but it's a very good overview of what is going on with streaming these days and how microservices fit into that architecture
